RecordReaderImpl.getValueRange() may cause incorrect results #1061

Open
PengleiShi opened this issue Mar 9, 2022 · 7 comments

@PengleiShi
Contributor

PengleiShi commented Mar 9, 2022

ORC version: 1.6.11, SQL: select xxx from xxx where str is not null

Recently I found that some ORC files written by Trino do not have complete statistics in the file metadata (possibly a Presto bug). As a result, OrcProto.ColumnStatistics cannot be deserialized into any specific ColumnStatisticsImpl such as StringStatisticsImpl, so RecordReaderImpl.getValueRange() returns a ValueRange with a null lower bound, and RecordReaderImpl.pickRowGroups() skips a row group that should not be skipped. Apart from this case, everything works as expected. I also found that orc-1.5.x handles this case through RecordReaderImpl.UNKNOWN_VALUE, which was removed in 1.6.x. Maybe we could add it back for better compatibility. @dongjoon-hyun @omalley
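
To illustrate the failure mode described above, here is a minimal, self-contained Java sketch. It does not use the real ORC classes: `StringStats`, `Decision`, `naiveIsNotNull`, and `defensiveIsNotNull` are hypothetical names, and the assumption that the file still records a non-null value count while omitting min/max is an assumption, not something confirmed in this thread. It only models why treating a missing lower bound as "no values" wrongly skips a row group for an IS NOT NULL predicate, and how falling back to "unknown, so read it" avoids that.

```java
// Simplified model of row-group pruning; not the actual ORC reader code.
final class RowGroupPruningSketch {

    /** Hypothetical stand-in for the string statistics of one row group. */
    static final class StringStats {
        final long numberOfValues; // count of non-null values (assumed to be present)
        final String min;          // null when the writer omitted it
        final String max;          // null when the writer omitted it

        StringStats(long numberOfValues, String min, String max) {
            this.numberOfValues = numberOfValues;
            this.min = min;
            this.max = max;
        }
    }

    enum Decision { SKIP, READ }

    /** Naive rule: a missing minimum is taken to mean "every value is null". */
    static Decision naiveIsNotNull(StringStats stats) {
        boolean looksAllNull = (stats.min == null);
        return looksAllNull ? Decision.SKIP : Decision.READ;
    }

    /**
     * Defensive rule: skip only when the statistics prove there are no
     * non-null values; a missing min/max just means "unknown", so read.
     */
    static Decision defensiveIsNotNull(StringStats stats) {
        return stats.numberOfValues == 0 ? Decision.SKIP : Decision.READ;
    }

    public static void main(String[] args) {
        // One non-null row, but the writer did not record min/max.
        StringStats incomplete = new StringStats(1, null, null);
        System.out.println("naive:     " + naiveIsNotNull(incomplete));
        System.out.println("defensive: " + defensiveIsNotNull(incomplete));
    }
}
```

With the incomplete statistics, the naive rule drops the only row that matches the predicate, while the defensive rule reads the row group and lets the row-level filter decide.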

@dongjoon-hyun
Member

Thank you for reporting, @PengleiShi .

  1. Yes, I've heard that there exist ORC writers that don't generate statistics properly.
  2. Could you provide some sample ORC files?

@dongjoon-hyun
Member

AFAIK, this doesn't happen between Apache ORC writer and reader, right? @PengleiShi

@PengleiShi
Contributor Author

PengleiShi commented Mar 10, 2022

AFAIK, this doesn't happen between Apache ORC writer and reader, right? @PengleiShi

Correct, it doesn't. In the case I mentioned, the files were written by Trino (which has its own ORC writer) and read by Spark (which depends on the Apache ORC reader).

@PengleiShi
Contributor Author

Thank you for reporting, @PengleiShi .

  1. Could you provide some sample ORC files?

Most files written by Trino have proper statistics. I will try to regenerate some problematic ORC files that can be made public.
The metadata of the problematic files is shown below:
[image: metadata of a problematic file]

@PengleiShi
Contributor Author

@dongjoon-hyun, Trino won't write string column statistics if a string value is longer than 64 bytes.
[image: wecom-temp-6a63fc2f2a72e176c2d1fc77699f880b]
Here is an ORC file written by Trino that contains only one row:
[image]
Tested with Spark 3.2:

select * from xxx; 
select * from xxx where name is not null;

20220310_100444_03858_nbvwj_53625cc9-7183-4beb-be48-9d059d8fa560.zip

@pgaref
Contributor

pgaref commented Mar 10, 2022

@PengleiShi do you mind sharing the stats of the problematic case above?
We currently trim StringStatistics to 1024 chars in the default writer https://issues.apache.org/jira/browse/ORC-203
I believe Presto should follow a similar logic.
In addition, on the Reader path we probably want to avoid skipping RowGroups when facing problematic/null ValueRange stats.
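
As a rough illustration of the trimming idea, the sketch below keeps a usable lower and upper bound for a long string instead of dropping its statistics. This is only a sketch of the general technique, not the Apache ORC or Presto writer code: the class `TruncatedStringBounds`, its methods, and the `limit` parameter are hypothetical names, and the limit is counted in code points here purely for simplicity.

```java
// Sketch: bounded-length min/max that still bracket the original value.
final class TruncatedStringBounds {

    /** Returns a string no longer than limit code points that sorts <= value. */
    static String lowerBound(String value, int limit) {
        if (value.codePointCount(0, value.length()) <= limit) {
            return value; // short enough, keep the exact value
        }
        // A prefix always sorts before (or equal to) the full string.
        return value.substring(0, value.offsetByCodePoints(0, limit));
    }

    /**
     * Returns a string no longer than limit code points that sorts > value,
     * or null when no such bound can be built (caller treats it as unknown).
     * Assumes limit >= 1 and code point ordering.
     */
    static String upperBound(String value, int limit) {
        if (value.codePointCount(0, value.length()) <= limit) {
            return value; // short enough, keep the exact value
        }
        int lastStart = value.offsetByCodePoints(0, limit - 1);
        int last = value.codePointAt(lastStart);
        if (last == Character.MAX_CODE_POINT) {
            return null; // cannot increment, so report "no upper bound"
        }
        int next = last + 1;
        if (next == Character.MIN_SURROGATE) {
            next = Character.MAX_SURROGATE + 1; // jump over the surrogate range
        }
        // Bumping the last kept code point makes the result sort after
        // every string that starts with the truncated prefix.
        return new StringBuilder(value.substring(0, lastStart))
                .appendCodePoint(next)
                .toString();
    }

    public static void main(String[] args) {
        String value = "abcdefgh";
        System.out.println(lowerBound(value, 4)); // abcd
        System.out.println(upperBound(value, 4)); // abce
    }
}
```

A reader that receives such bounds can still prune safely, because the true minimum and maximum always lie between them; when a bound cannot be produced, returning null signals "unknown" rather than "no values".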

@PengleiShi
Contributor Author

@pgaref I have uploaded a problematic file here for testing. Its metadata is shown below:
[image: metadata of the uploaded file]
