Improve speed of rows_distinct() on large databases #454
Comments
Hi @marianschmidt, I know this is almost a year old, but I just ran into this myself. The problem stems from the internal function at lines 2942 to 3045 in dc1b917.
The initial query to determine distinct rows (lines 2353 to 2357 in dc1b917) looks fine, though it differs from the one in your initial tests. However, collecting the counts of how many records exist, passed, and failed takes 3 separate database queries, which is what causes the slowdown.
Changing these to run as a single query would solve this issue. Also, I think there is a typo in your benchmark code; see the line marked "changed this!" below.
library(pointblank)
library(DBI)
library(RSQLite)
library(tidyverse)
library(bench)

# create large synth data
n_pat <- 1E6
n_diag <- 1E7
diag <- tibble(
  repid = sample(1:n_pat, size = n_diag, replace = TRUE),
  abrq = sample(20101:20224, size = n_diag, replace = TRUE),
  icd = sample(c("E01, E02, E11, E12"), size = n_diag, replace = TRUE),
  icd_sub = paste0(icd, ".9")
) %>%
  arrange(repid)

# connect sql db
sql_loc <- dbConnect(RSQLite::SQLite(), dbname = tempfile())
dbWriteTable(sql_loc, "diag", diag)

# benchmark pointblank::rows_distinct vs. dplyr::distinct
# added another count to make sure that SQL DISTINCT is actually materialized
results <-
  bench::press(
    rows = c(1E4),
    bench::mark(
      pointblank_distinct_sql = create_agent(
        tbl = {tbl(sql_loc, "diag") %>% head(rows)}) %>%
        rows_distinct() %>%
        interrogate(extract_failed = FALSE),
      pointblank_distinct_tib = create_agent(
        tbl = {diag %>% head(rows)}) %>% # <------- changed this!
        rows_distinct() %>%
        interrogate(extract_failed = FALSE),
      dplyr_distinct_sql = tbl(sql_loc, "diag") %>%
        head(n = rows) %>%
        distinct() %>%
        tally() %>%
        collect(),
      dplyr_distinct_tib = diag %>%
        head(n = rows) %>%
        distinct() %>%
        tally(),
      pointblank_source = tbl(sql_loc, "diag") %>%
        head(rows) %>%
        dplyr::select(everything()) %>%
        dplyr::group_by() %>%
        dplyr::mutate(`pb_is_good_` = ifelse(dplyr::n() == 1, TRUE, FALSE)) %>%
        dplyr::ungroup() %>%
        collect(),
      check = FALSE,
      iterations = 2
    )
  )
#> Running with:
#> rows
#> 1 10000
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
results
#> # A tibble: 5 × 7
#>   expression               rows      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>              <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 pointblank_distinct_sql 10000   10.03s   10.04s    0.0996   12.84MB    0.299
#> 2 pointblank_distinct_tib 10000  50.19ms   51.9ms   19.3       3.19MB   28.9
#> 3 dplyr_distinct_sql      10000  23.55ms  24.85ms   40.2     212.03KB   20.1
#> 4 dplyr_distinct_tib      10000   1.04ms   1.13ms  887.        1.01MB    0
#> 5 pointblank_source       10000    2.57s    2.58s    0.388   878.67KB    0

Created on 2023-09-22 with reprex v2.0.2
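To illustrate the single-query idea, here is a minimal sketch of how the total, passed, and failed counts could be collected in one round trip with dplyr/dbplyr. This is not pointblank's internal code: the small in-memory table, the grouping columns, and the names flagged, counts, n_total, n_passed, and n_failed are assumptions made up for illustration. Only the pb_is_good_ column name is taken from the benchmark above. It assumes the dbplyr and RSQLite packages are installed.

```r
library(DBI)
library(dplyr)

# Tiny in-memory stand-in for the real table (illustrative data only)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(
  con, "diag",
  data.frame(
    repid = c(1, 1, 2, 3),
    icd = c("E01", "E01", "E02", "E11")
  )
)

# Flag each row: 1 if its (repid, icd) combination is unique, 0 otherwise.
# dbplyr translates n() inside a grouped mutate() to a window function.
flagged <- tbl(con, "diag") %>%
  group_by(repid, icd) %>%
  mutate(pb_is_good_ = ifelse(n() == 1, 1L, 0L)) %>%
  ungroup()

# One summarise() call, so all three counts come back from a single query
counts <- flagged %>%
  summarise(
    n_total = n(),
    n_passed = sum(pb_is_good_, na.rm = TRUE),
    n_failed = sum(1L - pb_is_good_, na.rm = TRUE)
  ) %>%
  collect()

dbDisconnect(con)
```

Whether this maps cleanly onto the interrogation code depends on how the flagged table is built there, so treat it as a starting point rather than a drop-in fix.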
I have been testing pointblank on a very large RSQLite database with more than 2 billion rows.
While most built-in validation functions ran quite fast, the rows_distinct() validation took about 3.5 hours on a table with 480M rows, and it crashed on larger tables. So I checked whether this was generally due to SQL DISTINCT being slow, but found that SQL DISTINCT took only 22 minutes on the same data.
I could not find in the code what makes rows_distinct() so extremely slow (possibly pre-/post-processing?), and honestly I couldn't even tell whether the code uses dplyr::distinct() or dbplyr translations for databases.
To test, I have added a benchmark example below. Be careful: it uses about 500 MB of temporary local storage and takes 5-10 minutes to run.
Results:
dplyr::distinct() on a tibble takes at most 1.5 seconds to run.
pointblank::rows_distinct(), however, needs almost a minute to run (a factor of 30-40 slower than dplyr and at least a factor of 10 slower than SQL DISTINCT).
Created on 2022-12-20 with reprex v2.0.2
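On the question of whether a pipeline is run by dplyr in R or translated to SQL by dbplyr, one way to check is to print the generated query with show_query(). The sketch below uses a small made-up in-memory table, and the names distinct_query and n_distinct_rows are illustrative. It shows the SQL for a plain distinct() on a database table, not what rows_distinct() does internally.

```r
library(DBI)
library(dplyr)

# Small in-memory table used only for illustration
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(
  con, "diag",
  data.frame(repid = c(1, 1, 2), icd = c("E01", "E01", "E02"))
)

# A lazy query: nothing is executed on the database yet
distinct_query <- tbl(con, "diag") %>%
  distinct()

# Print the SQL that dbplyr would send to the database
show_query(distinct_query)

# Count the distinct rows on the database side and pull back the result
n_distinct_rows <- distinct_query %>%
  tally() %>%
  collect()

dbDisconnect(con)
```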