cached dataframe makes join slower? #19480
Labels
bug
Something isn't working
needs triage
Awaiting prioritization by a maintainer
python
Related to Python Polars
Checks
Reproducible example
Log output
This is the partial output of the explain() method.
Issue description
I am trying to stitch multiple Parquet files into one large DataFrame, both horizontally (2 files) and vertically (hundreds of files). I first join them horizontally and then stack the results vertically.
I tried two approaches. In the first, every DataFrame is unique. In the second, one of the two DataFrames in the horizontal join is the same for every layer. As expected, Polars cached that shared DataFrame in the second approach, as the explain() output shows.
However, the cached version is 40% slower than the uncached version in my experiments. CPU utilization is also limited in the cached version, which may indicate some kind of locking.
The code can be found here: https://github.com/jackxxu/polars_merge/tree/main.
Expected behavior
I expected the cached version to be faster, but the opposite turned out to be true.
Also, the CPU usage of the cached version is much lower, which explains why it is slower.
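One way to put a number on the CPU usage difference is to compare wall-clock time with process CPU time around each run. The sketch below uses only the standard library. `measure` and `busy_work` are hypothetical names for this illustration, and `busy_work` would be replaced by a call such as `lambda: plan.collect()` for each variant. `time.process_time()` sums CPU time over all threads of the process, so a ratio well below the core count would be consistent with threads sitting idle.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def measure(fn: Callable[[], T]) -> Tuple[T, float, float]:
    """Run fn once and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time summed over all threads
    result = fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


def busy_work() -> int:
    # Placeholder workload; swap in the query under test.
    return sum(i * i for i in range(200_000))


result, wall, cpu = measure(busy_work)
# cpu / wall approximates the average number of cores kept busy.
cores_busy = cpu / wall if wall > 0 else 0.0
print(f"wall={wall:.3f}s cpu={cpu:.3f}s cores_busy={cores_busy:.2f}")
```

Reporting this ratio for both the cached and the uncached plan would make the CPU utilization gap directly comparable.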
Installed versions