Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Make output dtype known for list.to_struct when fields are passed #19439

Merged
merged 12 commits into from
Oct 27, 2024

Conversation

nameexhaustion
Copy link
Collaborator

@nameexhaustion nameexhaustion commented Oct 25, 2024

If we receive a list of names in the fields parameter, we will be able to return a dtype in to_field (or if we have a non-zero upper_bound)

Also refactors to define list.to_struct to be a ListFunction in the IR.

#                  Output for Polars 1.11.0 | Output for this branch
#                                           |
print(lf.select(pl.first().list.to_struct(fields=["a", "b"])).collect_schema())
#                 Schema({'a': Struct({})}) | Schema({'a': Struct({'a': Int64, 'b': Int64})})
#                                           |
print(lf.select(pl.first().list.to_struct(fields=str, upper_bound=1)).collect_schema())
# Schema({'a': Struct({'field_0': Int64})}) | Schema({'a': Struct({'0': Int64})})
#                                           |
print(lf.select(pl.first().list.to_struct(upper_bound=1)).collect_schema())
# Schema({'a': Struct({'field_0': Int64})}) | Schema({'a': Struct({'field_0': Int64})})
#                                           |
# Schema changed Unknown instead of Struct({}) when ambiguous:
print(lf.select(pl.first().list.to_struct(fields=str)).collect_schema())
#                 Schema({'a': Struct({})}) | Schema({'a': Unknown})
#                                           |
print(lf.select(pl.first().list.to_struct()).collect_schema())
#                 Schema({'a': Struct({})}) | Schema({'a': Unknown})
# Bugfix - upper_bound was ignored in execution
print(lf.select(pl.first().list.to_struct(upper_bound=1)).collect())
#                            shape: (2, 1)  | shape: (2, 1)
#                            ┌────────────┐ | ┌───────────┐
#                            │ a          │ | │ a         │
#                            │ ---        │ | │ ---       │
#                            │ struct[3]  │ | │ struct[1] │
#                            ╞════════════╡ | ╞═══════════╡
#                            │ {1,2,3}    │ | │ {1}       │
#                            │ {4,5,null} │ | │ {4}       │
#                            └────────────┘ | └───────────┘

@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Oct 25, 2024
};

if *max_fields > 0 {
inferred.min(*max_fields)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

upper_bound was previously ignored when width was being inferred during execution - it was only used during IR resolving for getting the output names

@@ -343,9 +343,6 @@ impl serde::Serialize for PlCredentialProvider {
{
use serde::ser::Error;

// TODO:
// * Add magic bytes here to indicate a python function
// * Check the Python version on deserialize
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drive-by - outdated todo

@@ -1092,7 +1092,7 @@ def to_array(self, width: int) -> Expr:

def to_struct(
self,
n_field_strategy: ToStructStrategy = "first_non_null",
n_field_strategy: ListToStructWidthStrategy = "first_non_null",
fields: Sequence[str] | Callable[[int], str] | None = None,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have this functionality where you can pass a Python function that gets called with a field index to generate names, but I'm not sure it's needed since you can just pass a list of field names. Maybe we should remove it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It allows you to create fields at runtime. Which is a bit dangerous as it will be dependent of the data. But I think we should keep for now.

I want to add a return_dtype: DataType argument as well that allows you to specify the datatype immediately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a Notes warning that it is a bug in a users query if they don't set fields or return_dtype? Then we can make it required in 2.0.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to add a return_dtype: DataType argument as well that allows you to specify the datatype immediately.

I think we can already do this by passing a list of names to the fields parameter. For the function case, we also get the output schema if the upper_bound is set - I can add a warning if we get a function without upper_bound being set?

This comment was marked as outdated.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think I want to add the warnings/notes in a separate PR

@ritchie46
Copy link
Member

Nice! If a user doesn't provide any fields/dtype. We must set the output type to Unknown, not a Struct({}).

Copy link

codecov bot commented Oct 25, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 30 lines in your changes missing coverage. Please review.

Project coverage is 79.95%. Comparing base (7a2f547) to head (eed6f03).
Report is 11 commits behind head on main.

Files with missing lines Patch % Lines
...tes/polars-ops/src/chunked_array/list/to_struct.rs 75.83% 29 Missing ⚠️
crates/polars-plan/src/dsl/function_expr/list.rs 83.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #19439      +/-   ##
==========================================
+ Coverage   79.94%   79.95%   +0.01%     
==========================================
  Files        1534     1534              
  Lines      211059   211121      +62     
  Branches     2444     2444              
==========================================
+ Hits       168727   168799      +72     
+ Misses      41777    41767      -10     
  Partials      555      555              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@ritchie46 ritchie46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one Simon. Thanks, this will fix a lot of hairy bugs.

@ritchie46 ritchie46 merged commit 98fcc3f into pola-rs:main Oct 27, 2024
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants