Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a hint about normalization in error message #14089

Closed
alamb opened this issue Jan 11, 2025 · 5 comments · Fixed by #14113
Closed

Add a hint about normalization in error message #14089

alamb opened this issue Jan 11, 2025 · 5 comments · Fixed by #14113
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@alamb
Copy link
Contributor

alamb commented Jan 11, 2025

Is your feature request related to a problem or challenge?

There has been significant and long standing confusion about how to refer to columns with capitalization, most recently:

I think the root cause is that SQL is largely case insensitive but many DataFrame like systems are case sensitive.

There is a larger question if we could perhaps have less confusing defaults, but I think we could make the error messages even better

Today if you have a schema with a field named userId (note the capital I), if you run a query like

SELECT userId FROM foo

you will get a seemingly nonsensical error:

Schema error: No field named userid. Valid fields are telemetry_user_events.event, telemetry_user_events.prompt, telemetry_user_events.timestamp, telemetry_user_events.timestamp_utc, telemetry_user_events."teamId", telemetry_user_events."userId", telemetry_user_events.query.\n 1: No field named userid. Valid fields are telemetry_user_events.event, telemetry_user_events.prompt, telemetry_user_events.timestamp, telemetry_user_events.timestamp_utc, telemetry_user_events."teamId", telemetry_user_events."userId", telemetry_user_events.query.

Describe the solution you'd like

I would like to improve the error to add a hint if there is a column name that matches the field exept for case about what to do to fix it.

In the example above something like:

note use double quotes to refer to the "userId" column or set the datafusion.sql_parser.enable_ident_normalization configuration option

So the whole error would be something like

Schema error: No field named userid. You can use double quotes to refer to the "userId" column or set the datafusion.sql_parser.enable_ident_normalization configuration option . Valid fields are telemetry_user_events.event, telemetry_user_events.prompt, telemetry_user_events.timestamp, telemetry_user_events.timestamp_utc, telemetry_user_events."teamId", telemetry_user_events."userId", telemetry_user_events.query.\n 1: No field named userid. Valid fields are telemetry_user_events.event, telemetry_user_events.prompt, telemetry_user_events.timestamp, telemetry_user_events.timestamp_utc, telemetry_user_events."teamId", telemetry_user_events."userId", telemetry_user_events.query.

Describe alternatives you've considered

No response

Additional context

I think this would be super helpful and a nice first issue

@TheBuilderJR
Copy link
Contributor

+1 this would be great! That being said I do believe in the long run BC breaking changes are a good thing. Perhaps it makes sense to aggregate changes like these into a central issue and next time there's a big release that requires BC changes, we can include these small ones in as well.

@alamb
Copy link
Contributor Author

alamb commented Jan 11, 2025

I am not opposed to breaking changes per se :) (see the past history of DataFusion)

However, in this case I think the challege is that either default will be good for some people and bad for others. It is unclear to me how we would decide on the balance if more people are helped or hurt by changing the default (I would guess it is about 50/50 -- for example, it would certainly cause trouble I think for us at InfluxData...)

@cj-zhukov
Copy link
Contributor

take

@findepi
Copy link
Member

findepi commented Jan 12, 2025

I think the root cause is that SQL is largely case insensitive

SQL spec seems to say that unquoted (not delimited) identifiers are equivalent to their upper-case delimited variants (ie abc is equivalent to "abc"). The equivalence relation is what's used for column resolution.

Quoting SQL'16 here

24) For every <identifier body> IB there is exactly one corresponding case-normal form CNF. CNF is an
<identifier body> derived from IB as follows:
Let n be the number of characters in IB. For i ranging from 1 (one) to n, the i-th character Mi of IB is
transliterated into the corresponding character or characters of CNF as follows:
Case:
a) If Mi is a lower case character or a title case character for which an equivalent upper case sequence
U is defined by Unicode, then let j be the number of characters in U; the next j characters of CNF are
U.
b) Otherwise, the next character of CNF is Mi.
28) A <regular identifier> and a <delimited identifier> are equivalent if the case-normal form of the <identifier
body> of the <regular identifier> and the <delimited identifier body> of the <delimited identifier> (with
all occurrences of <quote> replaced by <quote symbol> and all occurrences of <doublequote symbol>
replaced by <double quote>), considered as the repetition of a <character string literal> that specifies a
<character set specification> of SQL_IDENTIFIER and IDC, compare equally according to the comparison
rules in Subclause 8.2, “<comparison predicate>”

However, DataFusion follows PostgreSQL variant of SQL, so my interpretation is that our intended behavior is "like in SQL spec but lower-case".

@alamb
Copy link
Contributor Author

alamb commented Jan 12, 2025

I agree with @findepi -- I think of the automatic lowercasing in DataFusion as the implementation detail of how the case-normalization is implemented.

cj-zhukov pushed a commit to cj-zhukov/datafusion that referenced this issue Jan 13, 2025
comphead pushed a commit that referenced this issue Jan 14, 2025
* Add a hint about normalization in error message (#14089)

* normalization suggestion is only shown when a column name matches schema

---------

Co-authored-by: Sergey Zhukov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants