Memory consumption with collect()
#434
Comments
Thanks, we need to work on the docs here. Can you try duck_exec("set memory_limit='1GB'") or a similar value?
Mmmhhh... I added that line right after loading the duckplyr library, but RAM consumption still goes through the roof on my system.
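If the memory limit alone does not help, DuckDB can also spill intermediate results to disk. A minimal sketch, assuming a duckplyr version where db_exec() is available (it is used further down in this thread); the temp directory path and thread count are only examples:

library(duckplyr)

# Cap DuckDB's working memory; the settings below are standard DuckDB options.
db_exec("SET memory_limit = '1GB'")

# Allow spilling to disk instead of swapping (the path is only an example).
db_exec("SET temp_directory = '/tmp/duckdb_spill'")

# Fewer threads generally means fewer intermediates buffered at the same time.
db_exec("SET threads = 2")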
Thanks, confirming that on a VM with 4 GB of RAM running Debian Bookworm under OrbStack, the following example is killed:

options(conflicts.policy = list(warn = FALSE))
library(dplyr)
library(duckplyr)
library(readr)
if (!file.exists("test.csv")) {
dd <- tibble(x=1:100000000, y=rep(LETTERS[1:20], 5000000))
write_csv(dd, "test.csv")
}
duck_exec("set memory_limit='1GB'")
df <- duck_csv("test.csv")
df_stat <- df |>
summarise(total=sum(x), .by = y)
df_out <- df |>
left_join(y=df_stat, by=c("y")) |>
collect()
df_out
I see. I have 8 GB of RAM on my machine, and I opened the issue because I can run the code with arrow (with some difficulty), but not with duckplyr. No intention to start a competition between the two tools, but I assumed there might be a memory leak in duckplyr.
To finish this off on my side: if I kill almost every other process, duckplyr also gets the job done on my machine, but the memory consumption is significantly higher than with arrow.
I'm no longer sure about that. The following example works, even with 4 GB:

options(conflicts.policy = list(warn = FALSE))
library(dplyr)
library(duckplyr)
library(readr)
if (!file.exists("test.csv")) {
dd <- tibble(x=1:100000000, y=rep(LETTERS[1:20], 5000000))
write_csv(dd, "test.csv")
}
duck_exec("set memory_limit='1GB'")
df <- duck_csv("test.csv")
df_stat <- df |>
summarise(total=sum(x), .by = y)
df_out <-
df |>
left_join(y=df_stat, by=c("y")) |>
compute_parquet("test.parquet")
df_out

Perhaps you can also use this approach. The large memory consumption is still interesting, though.
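As a possible follow-up (sketched here under the assumption that read_parquet_duckdb() is available alongside the read_csv_duckdb() used later in this thread), the Parquet file written above can be reopened lazily, so later pipelines only materialize small results:

library(dplyr)
library(duckplyr)

# Assumed helper: reopen the Parquet output lazily, analogous to read_csv_duckdb().
df_out <- read_parquet_duckdb("test.parquet")

# Only the small aggregated result is materialized in R.
df_check <- df_out |>
  summarise(n = n(), .by = y) |>
  collect()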
Thanks. I now also have a procedure to convert a CSV into a Parquet file without ingesting everything into memory.
I looked at this a little closer. Here is a modified, scaled-down version of the example that works with current duckplyr, to illustrate the overhead:

options(conflicts.policy = list(warn = FALSE))
library(dplyr)
library(duckplyr)
#> ✔ Overwriting dplyr methods with duckplyr methods.
#> ℹ Turn off with `duckplyr::methods_restore()`.
library(readr)
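# Resident set size (RSS) of the current R process, in kB, as reported by ps.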
res_size_kb <- function() {
pid_switch <- if (Sys.info()[["sysname"]] == "Darwin") "-p" else "-q"
cmd_line <- paste0("ps x -o rss ", pid_switch, " ", Sys.getpid())
as.numeric(system(cmd_line, intern = TRUE)[[2]])
}
N <- 500000
path <- paste0("test-", N, ".csv")
if (!file.exists(path)) {
dd <- tibble(x = seq.int(N * 20), y = rep(1:20, N))
write_csv(dd, path)
}
pillar::num(file.size(path), notation = "si")
#> <pillar_num(si)[1]>
#> [1] 98.9M
db_exec("set memory_limit='1GB'")
df <- read_csv_duckdb(path)
df_stat <- df |>
summarise(total = sum(x), .by = y)
df_join <- df |>
left_join(y = df_stat, by = c("y"))
res_size_before <- res_size_kb()
df_out <-
df_join |>
collect()
# At the very latest, materialization happens here:
rows <- nrow(df_out)
res_size_after <- res_size_kb()
# Approximation, for speed
obj_size_bytes <- object.size(df_out[rep(NA_integer_, 1000), ]) * rows / 1000
obj_size <- as.numeric(obj_size_bytes) / 1024
pillar::num(rows, notation = "si")
#> <pillar_num(si)[1]>
#> [1] 10M
pillar::num(c(res_size_after, res_size_before, obj_size), notation = "si")
#> <pillar_num(si)[3]>
#> [1] 732.k 267.k 245.k
(res_size_after - res_size_before) / obj_size
#> [1] 1.894857

That is, collecting grows the resident set by roughly 1.9 times the size of the collected data frame itself.

Created on 2025-01-17 with reprex v2.1.1
Hello,
Please have a look at the reprex at the end of the file.
I have a seasoned laptop with 8 GB of RAM running Debian stable.
When I carry out an aggregation and then a left join with arrow, I need to use some swap when I collect the result (a long data frame), but I can run the computation.
Instead, if I just run the duckplyr part (commented out in the second half of the reprex), my memory is so insufficient that the laptop freezes. Can someone look into this? Is memory consumption much higher in duckplyr than in arrow? Thanks a lot.
Created on 2025-01-01 with reprex v2.1.0
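For context, here is a minimal sketch of the kind of arrow pipeline described above. It is not the user's actual reprex; the file name and columns (test.csv, x, y) are assumed to match the examples elsewhere in this thread:

library(arrow)
library(dplyr)

# Lazy CSV dataset: nothing is read into R memory yet.
ds <- open_csv_dataset("test.csv")

# Aggregate, then left-join the aggregate back onto the full data.
stats <- ds |>
  group_by(y) |>
  summarise(total = sum(x))

out <- ds |>
  left_join(stats, by = "y") |>
  collect()  # the joined result is materialized in R only here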