Eager planning errors #198

eredzik · 2024-11-04T22:49:43Z

When you run PySpark code, each DataFrame method call (select, join, filter, etc.) eagerly adds to the query plan, and PySpark throws an error right away if the expression is invalid. In this library, the plan is built at each step but only verified once the query is sent for execution; in my case the error is thrown when DuckDB tries to execute the query.

This is a problem during development: when I have a few hundred lines of PySpark code, I don't get a precise stack trace showing what went wrong. The error points to the final line, where materialization occurs.

Expected behavior:
Verify that the query plan is valid with respect to the input data each time a step is added to the plan, and fail at that step when something cannot be resolved or is invalid.

eakmanrq · 2024-12-22T19:07:32Z

Thanks for opening this. It is something I have thought about before.

I think the way to do this would be to run an explain plan after each operation to ensure the SQL is valid. For local engines, like DuckDB, this would have no negative user impact, but for remote engines it would slow things down. So I could see this being an option that is configurable by the user but enabled by default for DuckDB.
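A minimal sketch of that idea, not this library's actual implementation: the `EagerFrame` class and its `validate` flag are hypothetical names, and the standard-library `sqlite3` module stands in for DuckDB so the example runs without extra dependencies. It relies on `EXPLAIN` compiling a statement without executing it, so an unresolved column fails at the call that introduced it.

```python
import sqlite3


class EagerFrame:
    """Hypothetical lazy frame that can check its SQL after every operation."""

    def __init__(self, con, sql, validate=True):
        self.con = con
        self.sql = sql
        self.validate = validate  # assumed user-configurable switch
        self._check()

    def _check(self):
        if self.validate:
            # EXPLAIN compiles the statement without running the query itself,
            # so unknown columns or tables raise at this step.
            self.con.execute(f"EXPLAIN {self.sql}").fetchall()

    def select(self, *cols):
        sql = f"SELECT {', '.join(cols)} FROM ({self.sql})"
        return EagerFrame(self.con, sql, self.validate)

    def filter(self, condition):
        sql = f"SELECT * FROM ({self.sql}) WHERE {condition}"
        return EagerFrame(self.con, sql, self.validate)

    def collect(self):
        return self.con.execute(self.sql).fetchall()


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, age INTEGER)")
con.execute("INSERT INTO people VALUES ('Ann', 31), ('Bob', 17)")

df = EagerFrame(con, "SELECT * FROM people")
rows = df.filter("age >= 18").select("name").collect()

try:
    df.select("salary")  # invalid column: raises here, not at collect()
    eager_error = None
except sqlite3.OperationalError as exc:
    eager_error = str(exc)

# With validation switched off, the same mistake only surfaces on execution.
lazy = EagerFrame(con, "SELECT * FROM people", validate=False).select("salary")
```

With `validate=True` the traceback points at the `select("salary")` call itself; with `validate=False` the frame is built without complaint and the error appears only when `collect()` runs, which is the behavior described in the original report. A remote engine would pay one round trip per operation for the check, which is why a per-engine default seems reasonable.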

eakmanrq added the enhancement New feature or request label Nov 19, 2024
