data catalog: set the min_rows_per_group when writing a parquet dataset to disk for improved IO performance #1281
Conversation
I left this property out because it was unclear how the writer would behave if there are, say, 100 rows to write and both the min and max row group sizes are 5000. Would the writer then stall waiting for more rows?
It will write out the last group without waiting. I could add a unit test later.
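For anyone reading along, here is a minimal standalone sketch of that behavior, assuming only pyarrow's documented `write_dataset` parameters (this is not the catalog's actual code): with 100 rows and both limits at 5000, the writer still flushes a single final row group rather than stalling.

```python
# Assumed example, not the catalog's code: with fewer rows than
# min_rows_per_group, write_dataset still flushes one final row group.
import os
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

table = pa.table({"x": list(range(100))})  # only 100 rows

with tempfile.TemporaryDirectory() as tmp:
    ds.write_dataset(
        table,
        tmp,
        format="parquet",
        min_rows_per_group=5_000,
        max_rows_per_group=5_000,
    )
    part = os.path.join(tmp, os.listdir(tmp)[0])
    meta = pq.ParquetFile(part).metadata
    print(meta.num_row_groups, meta.num_rows)  # expected: 1 100
```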
Thanks for confirming this. Yes, a unit test will make sure the behavior stays well covered.
Just added a unit test. I'm guessing there is something unexpected happening, though. Locally it can write the data successfully, but it crashes on line 358. I'm guessing it is because of my local installation.
Thanks @davidblom, I think we were after a test where the data has fewer rows than the `min_rows_per_group`.
Looks like the pre-commit just needs to be run too.
Great! Thanks. I've run black / pre-commit and modified the unit test to use fewer rows than `min_rows_per_group`.
Did you want to investigate this? Alternatively I can replicate on a local branch and figure it out too - up to you.
Hi Chris, this weekend I won't be able to dive deeper due to other obligations. If you have time to look into it, that would be great. My gut feeling, based on some other tests, is that the unit test will also fail when the changes to the ParquetDataCatalog are reverted. I think the catalog does not read parquet files that are partitioned.
@@ -208,12 +208,14 @@ def write_chunk(
self._fast_write(table=table, path=path, fs=self.fs)
else:
My thinking is that it works without the partitioning flag here; perhaps another check can be used to enter the write_dataset code block.
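To make the suggestion concrete, here is a hypothetical sketch of such a check (the function name, signature, and fast-path call are assumptions for illustration, not the ParquetDataCatalog's actual code):

```python
# Hypothetical illustration only: route to write_dataset whenever
# partitioning is requested, otherwise keep the single-file fast path.
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq


def write_table(
    table: pa.Table,
    path: str,
    partitioning=None,
    min_rows_per_group: int = 5_000,
    max_rows_per_group: int = 5_000,
) -> None:
    if partitioning is None:
        # Fast path: one plain parquet file, no dataset machinery.
        pq.write_table(table, path)
    else:
        # Partitioned path: go through write_dataset so the row-group
        # limits are applied while splitting across partitions.
        ds.write_dataset(
            table,
            path,
            format="parquet",
            partitioning=partitioning,
            min_rows_per_group=min_rows_per_group,
            max_rows_per_group=max_rows_per_group,
        )
```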
Thanks @davidblom603, I went ahead and incorporated your changes and added the unit test, now on … The differences being I added … In a later commit I also added a check for sort order on write, which raises a …
Fantastic! Many thanks
data catalog: set the min_rows_per_group when writing a parquet dataset to disk for improved IO performance.
Pull Request
Set the `min_rows_per_group` attribute when writing a parquet dataset, otherwise it could end up with very small row groups, hurting IO performance. This is mentioned in the documentation for the `max_rows_per_group` attribute, and also confirmed by my local tests: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html
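For reference, a minimal sketch of the call in question, based on the documented pyarrow API (the directory name and values are illustrative assumptions, not the catalog's defaults):

```python
# Illustrative only: set min_rows_per_group alongside max_rows_per_group
# so the writer does not emit many tiny row groups.
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"instrument_id": ["A"] * 10_000, "price": [1.0] * 10_000})

ds.write_dataset(
    table,
    "catalog/quote_ticks",      # hypothetical output directory
    format="parquet",
    min_rows_per_group=5_000,   # avoid many small row groups
    max_rows_per_group=5_000,
)
```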
There is still room for improvement: processing an iterator instead of a list would allow writing datasets that do not fit in memory. Will leave that for later.
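A sketch of what that iterator-based approach could look like, assuming pyarrow's documented support for passing an iterable of record batches together with an explicit schema (this is not part of the PR):

```python
# Illustrative only: stream batches to write_dataset so the full
# dataset never has to be materialised as a single in-memory list.
import pyarrow as pa
import pyarrow.dataset as ds

schema = pa.schema([("x", pa.int64())])


def batches():
    for start in range(0, 1_000_000, 10_000):
        yield pa.record_batch([pa.array(range(start, start + 10_000))], schema=schema)


ds.write_dataset(
    batches(),
    "out_dataset",          # hypothetical output directory
    schema=schema,          # required when passing an iterable of batches
    format="parquet",
    min_rows_per_group=5_000,
    max_rows_per_group=100_000,
)
```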
Type of change
Delete options that are not relevant.
How has this change been tested?
Describe how this code was/is tested.