speed comparison of dplyr and tidierdata #24
Benchmarking is a tricky issue in Julia because the first time you run code in Julia, it is compiled (leading to a compilation delay). The code usually runs much, much faster after it is compiled. This is true even if you change the underlying dataset or certain parameters, so it's not that Julia is cheating by having cached the answer; it is legitimately much faster on the second run. This issue has been mitigated in Julia 1.9 by precompilation, which caches compiled code. DataFrames.jl (which TidierData.jl wraps) takes advantage of precompilation workflows, but TidierData.jl for now doesn't do any additional precompilation (which we probably should). I'll move this issue to the TidierData.jl repo, because I agree we should periodically run some basic benchmarks to understand how TidierData.jl stacks up against DataFrames.jl (to understand how much overhead is added) and against the R tidyverse. tl;dr: I agree speed is important. Many published benchmarks show DataFrames.jl to be faster than the R tidyverse, but on the first run the compilation can introduce a delay. Two questions:
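The first-run compilation delay described above is easy to observe directly. A minimal sketch (the function and data here are hypothetical, just to illustrate the effect):

```julia
# Timing the same call twice: the first run includes compilation,
# the second reuses the already-compiled code and is much faster.
f(v) = sum(x -> x^2, v)   # a hypothetical small function

v = rand(1_000_000)

@time f(v)   # first call: timing includes compile time
@time f(v)   # second call: compiled code only
```

The same pattern holds after swapping in a different vector `v`: the compiled method is reused, so only the first call of a given method/argument-type combination pays the compilation cost.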
From some initial testing, there is likely room for optimization within TidierData.jl. We will do some profiling on our end to understand the bottlenecks, as well as the relationship between data size and overhead (which could arise if the additional allocations in TidierData are related to inadvertent copies being made).
Hi @kdpsingh,
Thanks for sharing. Right now, this package does a lot of extra stuff on top of DataFrames.jl for the sake of user convenience, and I imagine some of that is responsible for the slowdown. However, I do think some of it is fixable, because we can avoid certain steps in a way that should speed things up. So in summary, the package's main selling point at the moment is the consistent syntax. I'm hoping that in the near future the speed penalty won't be as large.
Sounds great. While I cannot contribute to the development of this package, I will test it when a new release comes out.
OK, I did some initial exploration and think I know what's responsible for the slowdown: some of the functions make an extra call. We will try the following things in future releases.
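One plausible source of this kind of overhead is inadvertent column copying. A hedged sketch using plain DataFrames.jl (the `copycols` keyword is part of the DataFrames.jl API; whether this is the specific extra call inside TidierData is an assumption):

```julia
using DataFrames

df = DataFrame(a = 1:3, b = 4:6)

# By default, select copies the columns it returns...
df_copy = select(df, :a)

# ...but copycols = false reuses the existing column vectors,
# avoiding an extra allocation on large data.
df_shared = select(df, :a; copycols = false)

df_shared.a === df.a   # shares the same vector
df_copy.a === df.a     # a fresh copy, not the same vector
```

On a multi-million-row data frame, skipping copies like this can remove a meaningful fraction of the allocation overhead.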
@Zhaoju-Deng, thanks for bringing this up. This issue is mostly resolved in v0.10.0, which is on the registry now. I haven't yet added support for PrecompileTools (which will minimize differences between the first run of the code and subsequent runs), but otherwise you should see major speed-ups in performance in v0.10.0. I'll leave this issue open mostly as a placeholder so that we can return to it and add support for PrecompileTools. Feel free to check it out and see if you notice any difference on your end.
@kdpsingh I just upgraded Julia to v1.10beta1 and tested it again; however, the first run took 13.9 seconds and subsequent runs 6-7 seconds (see the attached screenshot). It is not a big issue for now, but I hope it can be solved soon.
Thanks for sharing the screenshot! I'll try to recreate this on my end. If the dataset happens to be publicly available, please let me know -- otherwise I'll create some synthetic data with similar properties. The precompilation issue will be fixed in a future update. However, we shouldn't be several-fold slower than dplyr, so let me look at this carefully.
I think I know what is going on. Tidier.jl currently points to the old version of TidierData.jl, so you're not seeing the changes from the new version yet. I bet if you go to the package REPL by pressing `]` and check the status, you'll see the old TidierData.jl version installed. I'm fixing the Tidier dependencies right now so that Tidier.jl pulls in the new TidierData.jl. Feel free to confirm.
A simple way to fix this is to remove Tidier.jl and to directly update TidierData.jl. I just pushed the updated version of Tidier.jl to the Julia registry, so that should be fixed soon.
The new version of Tidier.jl is now on the registry. If you update it (press `]`, then `up`), you should get the speed improvements.
Hi @kdpsingh, I used Tidier v0.7.6 and Julia v1.10beta1; the first run took ~8 seconds and subsequent runs 4.8-5.1 seconds. With TidierData v0.10.0, the first run took 4.63 seconds and subsequent runs 2.6-2.9 seconds. It seems much improved! But it is still slower than dplyr; I hope you can refine it to be much faster than dplyr!
We'll keep working on it! Step 1 is for us to try to reproduce this result. I'm surprised it is slower than dplyr here, but I have some ideas.
Note to self: My suspicion is that there is still some recompilation happening here because this code isn't wrapped in a function. Will test it out.
Great, I am very interested to see its lightning-fast performance!
@Zhaoju-Deng Thanks for your efforts here! I was wondering, to keep it consistent with the style of benchmarking I have been trying, would it be too much trouble for you to try the following:
This will keep the methods consistent with what I was trying. I'll have a benchmark.jl in the repo later this week. Thank you!
I'm fairly convinced that recompilation is the culprit. Wrapping it in a function, as @drizk1 suggests, is the easiest way to check this. @drizk1, if we work on a benchmarking script, we can document this properly. I don't want to assume that this is the issue (until we check it), so we'll defer further optimization until after we generate a set of benchmarks and explain the implications of benchmarking within functions vs. global scope.
I just ran a quick test with `@time` vs `@benchmark` on the file I've been working with. `@time` took over twice as long as `@benchmark`, with 80% of the `@time` measurement spent in recompilation. Very curious to see what @Zhaoju-Deng might find.
I just ran `@benchmark`, and the estimated time was indeed only about half of the time reported by `@time`. However, the `@benchmark` time was not the total run time of the code; the actual running time was much longer than the `@benchmark` estimate. I am not familiar with the underlying algorithms `@benchmark` and `@time` use for measuring time, but to my eye the `@time` estimate is closer to the "actual" running time of the code.
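The discrepancy between the two is expected: `@time` measures one run (including any compilation triggered by that run), while BenchmarkTools' `@benchmark` evaluates the expression many times after a warm-up and reports statistics over those samples. A minimal sketch of the difference (the data here is hypothetical):

```julia
using BenchmarkTools

v = rand(10^6)

# @time measures a single run; on the first call it includes compilation.
@time sum(v)

# @benchmark runs the expression repeatedly after warm-up, so
# compilation is excluded from the reported samples.
# Interpolating globals with $ avoids global-variable overhead in the timing.
b = @benchmark sum($v)

minimum(b.times)   # fastest sample, in nanoseconds
```

So `@time` approximates the interactive experience of the first run, while `@benchmark` approximates steady-state performance; both numbers are "real", they just answer different questions.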
@Zhaoju-Deng, thanks for doing that. The short answer is this. If you write code like this...

```julia
@chain dt begin
    @group_by(tmvFrmId, tmvLifeNumber)
    @mutate(mean_my = mean(skipmissing(tmvMviMilkYield)),
            mean_scc = mean(skipmissing(tmvMviCelICountUdder)))
    @ungroup
end
```

...in the global scope, then Julia first compiles the code (which takes a second) and then runs it. It sometimes has to do less compilation the second time around, but it still has to compile. However, if you wrap that same code in a function like this...

```julia
function analysis()
    @chain dt begin
        @group_by(tmvFrmId, tmvLifeNumber)
        @mutate(mean_my = mean(skipmissing(tmvMviMilkYield)),
                mean_scc = mean(skipmissing(tmvMviCelICountUdder)))
        @ungroup
    end
end
```

...and then you run `analysis()`, it compiles the first time you run it, and then it doesn't have to compile again. Now you might wonder: how does that help in interactive usage? Well, if you redefine the function like this, with the data frame as an argument...

```julia
function analysis(dataset)
    @chain dataset begin
        @group_by(tmvFrmId, tmvLifeNumber)
        @mutate(mean_my = mean(skipmissing(tmvMviMilkYield)),
                mean_scc = mean(skipmissing(tmvMviCelICountUdder)))
        @ungroup
    end
end
```

...then you can update the data frame and re-run the function with the updated data frame, and it'll be lightning fast (often 10x+ faster than tidyverse). So to summarize:

I had the same questions as you about `@time` vs `@benchmark`. Try the following set-up:

```julia
function analysis(dataset)
    @chain dataset begin
        @group_by(tmvFrmId, tmvLifeNumber)
        @mutate(mean_my = mean(skipmissing(tmvMviMilkYield)),
                mean_scc = mean(skipmissing(tmvMviCelICountUdder)))
        @ungroup
    end
end

# the first time, the function compiles
@time analysis(dt) # assuming your dataset is named `dt`

# the second time, there should be no recompilation
@time analysis(dt)
```
With the recent updates to TidierData.jl, I was curious to revisit some benchmarks. I benchmarked DataFrames.jl vs. TidierData.jl on a data frame of about 7.4 million rows x 11 columns. Overall they performed nearly identically, coming within 15-20 ms of each other (different cases would lead one to be faster than the other, but only minimally, e.g. 812 ms vs. 828 ms). The only significant time difference was when the summarize macro was used, where TidierData was notably slower. Overall, the progress and performance of TidierData.jl is incredible! Just thought I'd share the update here.
Thanks for that update. This is a great reminder that I need to review the benchmarking page you had prepared for our documentation site, clean it up a bit, and make it public. I'll try to run the benchmarks on my end as well.
Also, at some point we should add precompilation to TidierData to remove any lag from first usage. Even though we are primarily wrapping DataFrames.jl (which already caches precompiled code), the parsing functions should be precompiled. |
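Adding precompilation along these lines would typically use PrecompileTools.jl. A hedged sketch of what a workload block inside a package module might look like (the module name and the specific workload calls below are illustrative assumptions, not TidierData's actual precompile code):

```julia
module PrecompileSketch

using PrecompileTools
using DataFrames

@setup_workload begin
    # Build a tiny representative DataFrame; setup code here is
    # executed but not itself precompiled.
    df = DataFrame(a = 1:3, b = ["x", "y", "z"])

    @compile_workload begin
        # Exercise the code paths whose compiled code we want cached
        # into the package image at precompile time.
        combine(groupby(df, :b), :a => sum)
        transform(df, :a => (v -> v .+ 1) => :a2)
    end
end

end # module
```

For TidierData specifically, the workload would exercise the macro-parsing functions mentioned above, so that first use of `@mutate`, `@group_by`, etc. doesn't pay the full compilation cost.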
Hi Karandeep,
It's nice to have a tidyverse-style package in Julia!
I tried TidierData to create two new columns for a ~1 GB dataset. I was actually expecting TidierData to be much faster than dplyr, but the results showed dplyr was much faster than TidierData (0.9 s in dplyr vs. 5.5-8 s in TidierData). Would it be possible to fine-tune the speed so that TidierData is more efficient at manipulating large datasets? Personally, I think that matters for data analysis.
kind regards,
Zhaoju