Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter.sf order of magnitude slower compared to filter #1889

Open
bart1 opened this issue Jan 20, 2022 · 5 comments
Open

filter.sf order of magnitude slower compared to filter #1889

bart1 opened this issue Jan 20, 2022 · 5 comments

Comments

@bart1
Copy link
Contributor

bart1 commented Jan 20, 2022

While doing some investigation in to the performance of my code I found that the order of filter and st_as_sf makes an order of magnitude difference in the performance of code. Its not a bug in the sense that something does not work but it seems that this is maybe unnecessarily slow therefore I thought I would report any way. Most of the time seems to be spend in the function st_sfc on a vapply call. In this example case the solution to change the order is easy but that might not always be the case I'm sure not all users are aware of the dramatic difference.

require(sf)
#> Loading required package: sf
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1; sf_use_s2() is TRUE
suppressPackageStartupMessages(require(dplyr))
n <- 100000
d <- data.frame(rr = factor(sample(size = n, c(NA, "a", "b"), replace = T, prob = c(.05, .45, .5))), xx = runif(n), yy = runif(n))
data <- d
b<-bench::mark(min_iterations = 5,
  data |> filter(!is.na(rr)) |> st_as_sf(
    coords = c("xx", "yy"),
    crs = st_crs(4326L), na.fail = FALSE
  ),
  data |> st_as_sf(
    coords = c("xx", "yy"),
    crs = st_crs(4326L), na.fail = FALSE
  ) |> filter(!is.na(rr))
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
b%>% select(median,mem_alloc,expression)
#> # A tibble: 2 × 3
#>     median mem_alloc
#>   <bch:tm> <bch:byt>
#> 1  64.84ms    11.9MB
#> 2    1.68s    25.4MB
#> # … with 1 more variable: expression <bch:expr>
plot(b)
#> Loading required namespace: tidyr

profvis::profvis({
     data |> st_as_sf(
         coords = c("xx", "yy"),
         crs = st_crs(4326L), na.fail = FALSE
       ) |> filter(!is.na(rr))
   })

sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.0.7 sf_1.0-5   
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.8         tidyr_1.1.4        ps_1.6.0           class_7.3-19      
#>  [5] assertthat_0.2.1   digest_0.6.29      utf8_1.2.2         R6_2.5.1          
#>  [9] backports_1.4.0    reprex_2.0.1       evaluate_0.14      e1071_1.7-9       
#> [13] ggplot2_3.3.5      highr_0.9          pillar_1.6.4       rlang_0.4.12      
#> [17] rstudioapi_0.13    callr_3.7.0        R.utils_2.11.0     R.oo_1.24.0       
#> [21] rmarkdown_2.11     styler_1.6.2       webshot_0.5.2      stringr_1.4.0     
#> [25] htmlwidgets_1.5.4  munsell_0.5.0      proxy_0.4-26       compiler_4.1.2    
#> [29] vipor_0.4.5        xfun_0.28          pkgconfig_2.0.3    ggbeeswarm_0.6.0  
#> [33] htmltools_0.5.2    tidyselect_1.1.1   tibble_3.1.6       fansi_1.0.2       
#> [37] crayon_1.4.2       withr_2.4.3        R.methodsS3_1.8.1  grid_4.1.2        
#> [41] jsonlite_1.7.2     gtable_0.3.0       lifecycle_1.0.1    DBI_1.1.2         
#> [45] magrittr_2.0.1     units_0.8-0        scales_1.1.1       KernSmooth_2.23-20
#> [49] bench_1.1.2        cli_3.1.0.9000     stringi_1.7.6      profmem_0.6.0     
#> [53] farver_2.1.0       fs_1.5.1           ellipsis_0.3.2     generics_0.1.1    
#> [57] vctrs_0.3.8        tools_4.1.2        R.cache_0.15.0     glue_1.6.0        
#> [61] beeswarm_0.4.0     purrr_0.3.4        processx_3.5.2     fastmap_1.1.0     
#> [65] yaml_2.2.1         colorspace_2.0-2   classInt_0.4-3     knitr_1.36        
#> [69] profvis_0.3.7

Created on 2022-01-20 by the reprex package (v2.0.1)

@bart1
Copy link
Contributor Author

bart1 commented Jan 20, 2022

PS I looked for issues documenting this but it does not seem to be there. I'm also not aware of any documentation on this.

@edzer
Copy link
Member

edzer commented Mar 27, 2023

How and where would you suggest this to be documented?

@bart1
Copy link
Contributor Author

bart1 commented Mar 27, 2023

I have not tested but maybe #2059 resolves this?

@kadyb
Copy link
Contributor

kadyb commented Mar 27, 2023

Isn't this currently fixed by #1938? It seems to be much better now:

            median mem_alloc
filter      31.1ms    9.61MB
filter.sf   76.6ms   17.16MB

@edzer
Copy link
Member

edzer commented Mar 27, 2023

Indeed, whereas with #2059

require(sf)
# Loading required package: sf
# Linking to GEOS 3.11.1, GDAL 3.6.2, PROJ 9.1.1; sf_use_s2() is TRUE
#> Loading required package: sf
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1; sf_use_s2() is TRUE
suppressPackageStartupMessages(require(dplyr))
#n <- 100000
n <- 30000
d <- data.frame(rr = factor(sample(size = n, c(NA, "a", "b"), replace = T, prob = c(.05, .45, .5))), xx = runif(n), yy = runif(n))
data <- d
b<-bench::mark(min_iterations = 5, check = FALSE,
  data |> filter(!is.na(rr)) |> st_as_sf(
    coords = c("xx", "yy"),
    crs = st_crs(4326L), na.fail = FALSE
  ),
  data |> st_as_sf(
    coords = c("xx", "yy"),
    crs = st_crs(4326L), na.fail = FALSE
  ) |> filter(!is.na(rr))
)
# Warning message:
# Some expressions had a GC in every iteration; so filtering is disabled. 
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
b%>% select(median,mem_alloc,expression)
# # A tibble: 2 × 3
#     median mem_alloc
#   <bch:tm> <bch:byt>
# 1   4.16ms    5.27MB
# 2 368.49ms     7.7MB
# # … with 1 more variable: expression <bch:expr>

something to chew on...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants