fix: Incorrect .join(..., how="left").head(N) if N <= left_df.height() and there are duplicate matches #19422
@@ -104,6 +104,7 @@ all = [
"parquet",
"ipc",
"polars-python/all",
"performant",
The code path for left joins on sorted keys is also affected, but it is gated behind the performant feature flag, so it isn't compiled in debug builds.
I think maybe instead of adding it here, we can add a test step to the Python release workflow that runs through the test suite with the optimized release build?
I assumed it was always included in pypolars. It should always be activated.
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff            @@
##             main    #19422   +/-   ##
=======================================
  Coverage   80.03%   80.04%
=======================================
  Files        1532     1532
  Lines      210749   210766    +17
  Branches     2442     2442
=======================================
+ Hits       168676   168701    +25
+ Misses      41518    41510     -8
  Partials      555      555
Thanks!
Fixes #19405
Fixes #19403
A long time ago, a fast path was added to the left join that detects whether the join has no duplicate matches by checking that the height of join_tuples equals the height of the input DF:
2f5f3ec#diff-33db4b397429a6dc2297400e2fe812fb12f17ed977dfeee5f96e34fdb9b05757R728-R730
At some point we added slicing to both the input left DF and the join tuples. After slicing, the two heights can be equal even when duplicate matches exist, so we would incorrectly identify the left join as having no duplicate matches.
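
To make the failure mode concrete, here is a minimal pure-Python sketch of the height check. It is a simplified model, not Polars' actual implementation: the helpers `left_join_tuples` and `materialize_left` and the flag `check_before_slice` are hypothetical names introduced for illustration, and checking the heights before slicing is just one possible way to avoid the problem, which may differ from what this PR does.

```python
def left_join_tuples(left_keys, right_keys):
    """Return (left_idx, right_idx) pairs for a left join.

    right_idx is None when a left row has no match.
    """
    positions = {}
    for j, key in enumerate(right_keys):
        positions.setdefault(key, []).append(j)

    tuples = []
    for i, key in enumerate(left_keys):
        matches = positions.get(key)
        if matches:
            tuples.extend((i, j) for j in matches)
        else:
            tuples.append((i, None))
    return tuples


def materialize_left(left_keys, tuples, n, check_before_slice):
    """Build the left-side output column for `.head(n)` (hypothetical model)."""
    # Heights compared on the unsliced inputs: equal only without duplicates.
    no_dups_full = len(tuples) == len(left_keys)

    sliced_tuples = tuples[:n]
    sliced_left = left_keys[:n]

    if check_before_slice:
        no_dups = no_dups_full
    else:
        # Modelled bug: both sides were sliced to n rows, so the heights
        # match even though duplicate matches exist.
        no_dups = len(sliced_tuples) == len(sliced_left)

    if no_dups:
        # Fast path: reuse the left rows without gathering by index.
        return sliced_left
    return [left_keys[i] for i, _ in sliced_tuples]


left = [1, 2, 3]
right = [1, 1, 2]  # key 1 matches twice
tuples = left_join_tuples(left, right)

buggy = materialize_left(left, tuples, n=2, check_before_slice=False)
fixed = materialize_left(left, tuples, n=2, check_before_slice=True)
```

With `n = 2` (which is `<= len(left)`), the first two output rows should both come from left row 0 because key 1 matches twice. The modelled fast path instead returns the first two left rows unchanged.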