RecordReaderImpl.getValueRange() may cause incorrect results #1061

PengleiShi · 2022-03-09T13:13:54Z

orc version: 1.6.11, sql: select xxx from xxx where str is not null

Recently I found that some ORC files written by Trino do not have complete statistics in the file metadata (maybe a Presto bug). As a result, OrcProto.ColumnStatistics can't be deserialized to any specific ColumnStatisticsImpl such as StringStatisticsImpl, so RecordReaderImpl.getValueRange() returns a ValueRange with a null lower bound and RecordReaderImpl.pickRowGroups() skips this row group, which should not be skipped. Apart from this case, everything works as expected. I also found that orc-1.5.x handles this case by means of RecordReaderImpl.UNKNOWN_VALUE, which was removed in 1.6.x. Maybe we could add it back for better compatibility. @dongjoon-hyun @omalley
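To make the failure mode concrete, here is a minimal sketch in Java. It does not reproduce ORC's RecordReaderImpl; `ColumnSummary`, `Decision`, `naiveIsNotNull`, and `conservativeIsNotNull` are hypothetical names introduced only for illustration. The assumption is that a reader which treats a missing lower bound as "this row group holds only nulls" will skip the row group for an `is not null` predicate, while a reader that first checks whether typed statistics exist at all will fall back to reading it.

```java
/** Illustrative only: hypothetical types, not ORC's RecordReaderImpl. */
final class RowGroupPruningSketch {

    /** Hypothetical per-row-group summary for one string column. */
    static final class ColumnSummary {
        final String lower;       // minimum value, or null if none was recorded
        final String upper;       // maximum value, or null if none was recorded
        final boolean typedStats; // false when the writer stored no type-specific statistics

        ColumnSummary(String lower, String upper, boolean typedStats) {
            this.lower = lower;
            this.upper = upper;
            this.typedStats = typedStats;
        }
    }

    enum Decision { SKIP, READ }

    /** Failure mode: a missing lower bound is read as "every value is null". */
    static Decision naiveIsNotNull(ColumnSummary s) {
        return s.lower == null ? Decision.SKIP : Decision.READ;
    }

    /** Conservative variant: without typed statistics nothing can be proven, so read. */
    static Decision conservativeIsNotNull(ColumnSummary s) {
        if (!s.typedStats) {
            return Decision.READ;
        }
        return s.lower == null ? Decision.SKIP : Decision.READ;
    }

    public static void main(String[] args) {
        // A row group whose writer recorded no string statistics at all.
        ColumnSummary noStats = new ColumnSummary(null, null, false);
        System.out.println("naive: " + naiveIsNotNull(noStats));               // naive: SKIP
        System.out.println("conservative: " + conservativeIsNotNull(noStats)); // conservative: READ
    }
}
```

The conservative variant only gives up pruning when the statistics are unusable; a row group with typed statistics and no recorded minimum is still skipped.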

dongjoon-hyun · 2022-03-10T00:27:28Z

Thank you for reporting, @PengleiShi.

Ya, I've heard that there exist ORC writers that don't generate statistics properly.
Could you provide some sample ORC files?

dongjoon-hyun · 2022-03-10T01:02:21Z

AFAIK, this doesn't happen between the Apache ORC writer and reader, right? @PengleiShi

PengleiShi · 2022-03-10T03:26:38Z

> AFAIK, this doesn't happen between the Apache ORC writer and reader, right? @PengleiShi

Yes, it doesn't. In the case I mentioned, the files were written by Trino (which has its own ORC writer) and read by Spark (which depends on the Apache ORC reader).

PengleiShi · 2022-03-10T03:50:17Z

> Thank you for reporting, @PengleiShi.
>
> Could you provide some sample ORC files?

Most of the files written by Trino have proper statistics. I will try to regenerate some problematic ORC files that can be made public.
The metadata of the problematic files is shown below:

PengleiShi · 2022-03-10T12:13:17Z

@dongjoon-hyun Trino won't write string column statistics if a string value is larger than 64 bytes.

Here is an ORC file written by Trino that contains only one row.

Test with Spark 3.2:

select * from xxx; 
select * from xxx where name is not null;

20220310_100444_03858_nbvwj_53625cc9-7183-4beb-be48-9d059d8fa560.zip

pgaref · 2022-03-10T18:32:23Z

@PengleiShi do you mind sharing the stats of the problematic case above?
We currently trim StringStatistics to 1024 chars in the default writer (see https://issues.apache.org/jira/browse/ORC-203).
I believe Presto should follow similar logic.
In addition, on the Reader path we probably want to avoid skipping RowGroups when facing problematic/null ValueRange stats.
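As a rough illustration of trimming instead of dropping string statistics, the sketch below shortens a minimum and a maximum so that they still bound the original values. It is not the ORC-203 implementation: `StringBoundsSketch`, `truncateLower`, and `truncateUpper` are hypothetical names, ordering follows Java's `String.compareTo` (UTF-16 code units) rather than the byte ordering a real writer would use, and the cut may split a surrogate pair. The limit is a parameter; the comment above mentions 1024 chars, and the usage below passes 3 only to keep the output short.

```java
import java.util.Optional;

/** Illustrative only: hypothetical helper, not the ORC-203 implementation. */
final class StringBoundsSketch {

    /** A prefix never sorts after the full string, so it remains a valid lower bound. */
    static String truncateLower(String min, int maxChars) {
        return min.length() <= maxChars ? min : min.substring(0, maxChars);
    }

    /**
     * A plain prefix would sort before the original maximum, so the last kept
     * char is incremented to keep the result strictly above it.
     * Returns empty when no such bound can be formed.
     */
    static Optional<String> truncateUpper(String max, int maxChars) {
        if (max.length() <= maxChars) {
            return Optional.of(max);
        }
        char[] prefix = max.substring(0, maxChars).toCharArray();
        int i = prefix.length - 1;
        // Chars already at the maximum code unit cannot be incremented; drop them.
        while (i >= 0 && prefix[i] == Character.MAX_VALUE) {
            i--;
        }
        if (i < 0) {
            return Optional.empty();
        }
        prefix[i]++;
        return Optional.of(new String(prefix, 0, i + 1));
    }

    public static void main(String[] args) {
        String value = "abcdef";
        System.out.println(truncateLower(value, 3));       // abc
        System.out.println(truncateUpper(value, 3).get()); // abd
    }
}
```

With bounds like these a reader still gets a usable range for long strings, at the cost of slightly weaker pruning.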

PengleiShi · 2022-03-11T03:25:32Z

> @dongjoon-hyun Trino won't write string column statistics if a string value is larger than 64 bytes.
> Here is an ORC file written by Trino that contains only one row.
> Test with Spark 3.2:
> select * from xxx;
> select * from xxx where name is not null;
> 20220310_100444_03858_nbvwj_53625cc9-7183-4beb-be48-9d059d8fa560.zip

@pgaref I have uploaded a problematic file here for testing. Its metadata is shown below:

dongjoon-hyun added the enhancement label Mar 10, 2022
