
Option to disable "Physical input schema should be the same as the one converted from logical schema" error #13065

Open
alamb opened this issue Oct 22, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

alamb (Contributor) commented Oct 22, 2024

Is your feature request related to a problem or challenge?

A change released in DataFusion 42.0.0 added a new check in the DefaultPhysicalPlanner that the physical input schema is the same as the one converted from the logical input schema:

    if physical_input_schema != physical_input_schema_from_logical {
        return internal_err!("Physical input schema should be the same as the one converted from logical input schema.");
    }

While @jayzhan211's heroic efforts have this check passing in all the DataFusion tests, it turned out to fail in many downstream implementations:

Downstream in InfluxDB 3.0, we turned the check into a warning in our fork to unblock our upgrade.

We even made a patch release to try and get the delta-rs upgrade working:

But it is still failing as I write this (see delta-io/delta-rs#2886 (comment)):

Internal error: Failed due to a difference in schemas, original schema: DFSchema { inner: Schema { fields: [Field { name: "id", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "price", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "sold", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "price_float", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "items_in_bucket", data_type: List(Field { name: "element", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "deleted", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "__delta_rs_update_predicate", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None, Some(Bare { table: "target" }), None, None, None, None, None], functional_dependencies: FunctionalDependencies { deps: [] } }, new schema: DFSchema { inner: Schema { fields: [Field { name: "id", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "price", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "sold", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "price_float", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "items_in_bucket", data_type: List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "deleted", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, 
metadata: {} }, Field { name: "__delta_rs_update_predicate", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None, Some(Bare { table: "target" }), None, None, None, None, None], functional_dependencies: FunctionalDependencies { deps: [] } }.
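The mismatch in the error above comes down to a nested field name: the List element field is named "element" in the original schema and "item" in the new one, and strict structural equality rejects that. The following is a minimal self-contained sketch of the failure mode; the `Field`/`DataType` types here are simplified stand-ins, not Arrow's actual definitions.

```rust
// Simplified stand-ins for Arrow's Field/DataType, to illustrate how
// derived structural equality compares nested field names too.
#[derive(Debug, PartialEq)]
enum DataType {
    Utf8,
    List(Box<Field>),
}

#[derive(Debug, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

// Build a List column whose element field carries the given name.
fn list_of(elem_name: &str) -> Field {
    Field {
        name: "items_in_bucket".to_string(),
        data_type: DataType::List(Box::new(Field {
            name: elem_name.to_string(),
            data_type: DataType::Utf8,
        })),
    }
}

fn main() {
    // Identical shapes and types, but the List element is named
    // "element" vs "item": derived PartialEq treats them as unequal.
    assert_ne!(list_of("element"), list_of("item"));
    assert_eq!(list_of("item"), list_of("item"));
}
```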

Describe the solution you'd like

Note there is at least one open outstanding bug: #13010

I would like some way to disable this check to unblock upgrades in downstream crates.

Describe alternatives you've considered

I propose we add a new config value that lets downstream crates opt in or out of this check, similar to datafusion.optimizer.skip_failed_rules (see the Config Docs).

Something like:

  • datafusion.execution.validate_schema: If true, the DefaultPhysicalPlanner will error if the physical input schema does not exactly match the one converted from the logical plan.
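If such a flag existed, the planner check could be gated on it, downgrading the mismatch to a warning instead of a hard error. The sketch below is hypothetical: the function and parameter names are illustrative, not DataFusion's actual planner code, and schemas are shown as strings for brevity.

```rust
// Hypothetical sketch: gate the schema check on the proposed config flag,
// turning a hard error into a warning when validation is disabled.
// Names are illustrative; this is not DataFusion's actual API.
fn check_input_schema(
    validate_schema: bool,        // the proposed datafusion.execution.validate_schema
    physical_input_schema: &str,  // stand-in for the real Schema type
    schema_from_logical: &str,
) -> Result<(), String> {
    if physical_input_schema != schema_from_logical {
        if validate_schema {
            // Current behavior: fail the plan.
            return Err(
                "Physical input schema should be the same as the one \
                 converted from logical input schema."
                    .to_string(),
            );
        }
        // Flag disabled: log and continue, as the InfluxDB fork does today.
        eprintln!(
            "warning: physical input schema differs from logical-derived \
             schema: {physical_input_schema} vs {schema_from_logical}"
        );
    }
    Ok(())
}

fn main() {
    assert!(check_input_schema(true, "a", "b").is_err()); // error when enabled
    assert!(check_input_schema(false, "a", "b").is_ok()); // warning when disabled
    assert!(check_input_schema(true, "a", "a").is_ok());  // matching schemas pass
}
```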

Additional context

No response

alamb added the enhancement (New feature or request) label on Oct 22, 2024
alamb changed the title to "Option to disable Physical input schema should be the same as the one converted from logical schema error" on Oct 22, 2024
alamb (Contributor, Author) commented Oct 22, 2024

I should clarify -- ideally this check would be enabled by default, and that is a goal we should shoot for.

However, as there are clearly pre-existing bugs in the code that currently prevent the check from passing cleanly in all cases, I think it is better to relax the check and sort out the errors rather than hard-failing plans.

wiedld (Contributor) commented Oct 22, 2024

> However, as there are clearly bugs in the code that currently prevent it from passing cleanly in all cases (which were previously in the code), I think it is better to relax the check and sort out the errors rather than hard failing plans.

Some additional context -- we have definitely not isolated all the bugs this check uncovered. Even with this known bug (#13010) patched for us, we are still encountering this failed check every few minutes. Changing this check into a warning, rather than an error, has been necessary for us.

Assuming we are not the only ones, having the feature @alamb proposed here (to convert the error into a warning based on configuration) would help unblock others from upgrading DataFusion, IMO.

alamb (Contributor, Author) commented Oct 22, 2024

We (InfluxData) will likely contribute fixes back upstream into DataFusion as we find issues as well

comphead (Contributor) commented

This probably happens when users use only the physical planner, even though there are schema alignments that happen in the logical planner?

comphead (Contributor) commented

I remember a bunch of issues with schema comparison where a different nullability flag for a column caused a problem. Since DataFusion reworked the Schema methods, this issue might come up again.
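The nullability issue mentioned above is the same class of problem as the nested-field-name mismatch: strict structural equality flags two otherwise-identical columns as different schemas when only the nullable flag differs. A tiny sketch with simplified stand-in types (not Arrow's actual definitions):

```rust
// Simplified stand-in for an Arrow Field, keeping only the parts that
// matter for this comparison.
#[derive(Debug, PartialEq)]
struct Field {
    name: &'static str,
    nullable: bool,
}

fn main() {
    let a = Field { name: "price", nullable: true };
    let b = Field { name: "price", nullable: false };
    // Strict equality treats differing nullability as a different schema,
    // even though data valid under the non-nullable declaration is also
    // valid under the nullable one.
    assert_ne!(a, b);
}
```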
