Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support read large statistics exceed 2GB #1404

Open
deshanxiao opened this issue Feb 9, 2023 · 7 comments
Open

Support read large statistics exceed 2GB #1404

deshanxiao opened this issue Feb 9, 2023 · 7 comments

Comments

@deshanxiao
Copy link
Contributor

deshanxiao commented Feb 9, 2023

In Java and C++ reader, we cannot read the orc file with statistics exceed 2GB. We should find a new way or design to support read these files.

com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large.  May be malicious.  Use CodedInputStream.setSizeLimit() to increase the size limit.
	at com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(InvalidProtocolBufferException.java:154)
	at com.google.protobuf.CodedInputStream$StreamDecoder.readRawBytesSlowPathOneChunk(CodedInputStream.java:2954)
	at com.google.protobuf.CodedInputStream$StreamDecoder.readBytesSlowPath(CodedInputStream.java:3035)
	at com.google.protobuf.CodedInputStream$StreamDecoder.readBytes(CodedInputStream.java:2446)
	at org.apache.orc.OrcProto$StringStatistics.<init>(OrcProto.java:2118)
	at org.apache.orc.OrcProto$StringStatistics.<init>(OrcProto.java:2070)
	at org.apache.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:3285)
	at org.apache.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:3279)
	at com.google.protobuf.CodedInputStream$StreamDecoder.readMessage(CodedInputStream.java:2423)
	at org.apache.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:8172)
	at org.apache.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:8093)
	at org.apache.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:10494)
	at org.apache.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:10488)
@deshanxiao
Copy link
Contributor Author

In fact, it is not reasonable to have such large statistics for a single ORC file, it requires too much memory. My suggestion is to limit writing such huge statistics in writer side. What do you think? @zabetak

@zabetak
Copy link
Contributor

zabetak commented Feb 22, 2023

At first glance 2GB of metadata seems big especially when considering the toy example that I made in ORC-1361. However, if you have a 500GB ORC file then 2GB of metadata does not appear too big anymore so things are relative.

Are there limitations on the maximum size of an ORC file? Do we support such kind of use-cases?

If we add a limit while writing (which by the way I also suggested in protocolbuffers/protobuf#11729) then we should define what happens when the limit is exceeded:

  • drop all metadata
  • fail the entire write
  • ignore metadata over the limit (keep partial metadata)

@deshanxiao
Copy link
Contributor Author

@zabetak Could you provide a scenario for a 500GB ORC file? In my experience, columnar storage generally serves big data engines, and each of these big data files is generally about 512M/256M.

@omalley
Copy link
Contributor

omalley commented Feb 22, 2023

Yeah, I've seen ORC files upto 2gb or so, but @deshanxiao 's point is accurate that you'll probably get better performance out of Spark & Trino if you keep them smaller than 1-2gb.

That said, can you share how many rows & columns are in the the 500gb file? How big did you configure the stripe buffers and how large are your actual stripes? You should bump up your stripe buffer to get your stripe size to be 256mb or so. That will still give you 2000 stripes, which unless you have a crazy number of columns should be under 2gb.

My guess is that you have a huge number of very small stripes. That will hurt performance, especially if you are using the C++ reader.

@zabetak
Copy link
Contributor

zabetak commented Feb 23, 2023

I am rather positive on the idea of enforcing limits on the writer but I would expect this to be on the protobuf layer (protocolbuffers/protobuf#11729). The limitation comes from protobuf so it seems natural to add checks there and not in ORC code.

The reason that I brought up the question about maximum size is because as the file increases so does the metadata and clearly here we have a hard limitation on till where it can go. If there is a compelling use-case to support arbitrary big files (with arbitrary big metadata) then investing on a new design would make sense.

To be clear, I am not pushing for a new design since I fully agree with both @deshanxiao and @omalley in that with proper schema design + configuration the chances of hitting the problem are rather small. From the ORC perspective, it seems acceptable to settle with the 1/2GB limit on the metadata section.

I didn't mean to imply that having 500GB is a good/normal thing but if nothing prevents user to do it, eventually they will get there. :)

Speaking about actual use-cases, in Hive, I have recently seen OrcSplit reporting a fileLength of 216488139040 for files under certain table partitions. I am not sure how well this number translates to the actual file size nor about the actual configuration that led to this situation since I am not the end-user myself; I was just investigating the problem by checking the Hive application logs.

Summing up, I don't think a new metadata design is worth it at the moment and limiting the writer seems more appropriate to be done in the protobuf layer. From my perspective, the only actionable item regarding this issue at this point would be to add a brief mention about the metadata size limitation on the website (e.g., https://orc.apache.org/specification/ORCv1/).

@deshanxiao
Copy link
Contributor Author

@zabetak Could you explain in detail why we can write but not read in protobuf layer?

@zabetak
Copy link
Contributor

zabetak commented Feb 24, 2023

@deshanxiao In protocolbuffers/protobuf#11729 I shared a very simple project reproducing/explaining the write/read problem. Please have a look and if you have questions I will be happy to elaborate more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants