How does hive.exec.orc.default.buffer.size affect the file size? #1985

Open
loukey-lj opened this issue Jul 17, 2024 · 1 comment

Comments

@loukey-lj

I write ORC files with Spark SQL 3.3.
In our production environment, I noticed that many ORC files had small stripes, so I changed hive.exec.orc.default.buffer.size from 256K to 1K. After the change, stripes became much larger and the number of stripes per file dropped significantly. Unexpectedly, the same dataset produced files of different sizes under the two settings: the file written with hive.exec.orc.default.buffer.size set to 1K was twice the size of the file written with 256K.

Generally, a larger stripe size should give a higher compression ratio, so it is surprising that reducing the buffer size made the final file larger.
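One possible explanation, not confirmed in this thread: the ORC buffer size is also the compression chunk size. Each compressed stream is split into chunks of at most that many bytes, every chunk is compressed on its own, and every chunk carries a 3-byte header. With 1K chunks the codec sees very little context per chunk, so the ratio can drop no matter how large the stripe is. The sketch below only imitates that effect with Python's `zlib` on synthetic rows; `make_sample_rows` and `chunked_compressed_size` are hypothetical helpers, not ORC or Spark APIs, and the data is made up for illustration.

```python
import random
import zlib

HEADER_BYTES = 3  # ORC prefixes each compression chunk with a 3-byte header


def make_sample_rows(n_rows, seed=0):
    """Build synthetic CSV-like rows (hypothetical data, not from the issue)."""
    rng = random.Random(seed)
    lines = [
        f"user_{rng.randrange(5000)},city_{rng.randrange(200)},{rng.randrange(10**6)}\n"
        for _ in range(n_rows)
    ]
    return "".join(lines).encode("utf-8")


def chunked_compressed_size(data, chunk_size):
    """Total bytes when each chunk_size slice is deflated independently."""
    total = 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        # Raw deflate (negative wbits), one fresh compressor per chunk,
        # so no history is shared across chunk boundaries.
        comp = zlib.compressobj(6, zlib.DEFLATED, -15)
        body = comp.compress(chunk) + comp.flush()
        # Assumption modelled on ORC: a chunk is stored uncompressed
        # when compression does not make it smaller.
        total += HEADER_BYTES + min(len(body), len(chunk))
    return total


data = make_sample_rows(50_000)
small = chunked_compressed_size(data, 1024)        # 1K buffer
large = chunked_compressed_size(data, 256 * 1024)  # 256K buffer
print(f"raw bytes:   {len(data)}")
print(f"1K chunks:   {small}")
print(f"256K chunks: {large}")
```

On this synthetic input the 1K total comes out larger than the 256K total. How large the gap is on real data depends on the column types, encodings, and codec, so this is a way to reason about the symptom rather than a measurement of it.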

@dongjoon-hyun
Member

Could you share some sample data that reproduces this, @loukey-lj?
