
data catalog: set the min_rows_per_group when writing a parquet dataset to disk for improved IO performance #1281

Closed

Conversation

@davidblom603 commented Oct 18, 2023


Pull Request

Set the min_rows_per_group attribute when writing a parquet dataset; otherwise the dataset can end up with very small row groups, which hurts IO performance. This is mentioned in the documentation of the max_rows_per_group argument, and also confirmed by my local tests:
https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html
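
For illustration, a minimal pyarrow call of this shape (a sketch only; the path and row counts below are placeholders, not the values used in the catalog):

import pyarrow as pa
import pyarrow.dataset as ds

# 10,000 dummy rows; without min_rows_per_group, incremental writes can
# produce many tiny row groups, which hurts read performance.
table = pa.table({"ts_init": list(range(10_000)), "price": [1.0] * 10_000})

ds.write_dataset(
    table,
    "catalog/quote_tick",        # placeholder output directory
    format="parquet",
    min_rows_per_group=5_000,
    max_rows_per_group=5_000,
)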

There is still room for improvement: processing an iterator instead of a list would allow writing datasets that do not fit in memory. I will leave that for later.

Type of change


  • Bug fix (non-breaking change which fixes an issue)


@CLAassistant commented Oct 18, 2023

CLA assistant check
All committers have signed the CLA.

@twitu (Collaborator) commented Oct 19, 2023

I left this property out because it was unclear how the writer would behave if there are, say, 100 rows to write and both the min and max row-group sizes are 5000. Would the writer then stall waiting for more rows?

From the pyarrow documentation, the writer will "only write the row groups to the disk when sufficient rows have accumulated."

@davidsblom (Member)

It will write out the last group without waiting. I could add a unit test later.
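
A standalone way to check this behavior with pyarrow alone (a sketch for illustration, not the unit test added in this PR):

import pyarrow as pa
import pyarrow.dataset as ds

# Only 100 rows, far fewer than min_rows_per_group; the writer should
# still flush them as a final, smaller row group rather than stall.
table = pa.table({"value": list(range(100))})

ds.write_dataset(
    table,
    "tmp_dataset",               # placeholder output directory
    format="parquet",
    min_rows_per_group=5_000,
    max_rows_per_group=5_000,
)

assert ds.dataset("tmp_dataset", format="parquet").to_table().num_rows == 100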

@twitu (Collaborator) commented Oct 19, 2023

Thanks for confirming this. Yes, a unit test will make sure the behavior stays well checked.

@davidblom603 (Author) commented Oct 19, 2023

Just added a unit test. Something unexpected is happening, though: locally the data is written successfully, but the test crashes on line 358. I suspect it is because of my local installation.

tests/unit_tests/persistence/test_catalog.py Fatal Python error: Aborted

Thread 0x00000001f4276080 (most recent call first):
  File "/Users/david.blom/repositories/nautilus_trader/nautilus_trader/persistence/catalog/parquet.py", line 346 in backend_session
  File "/Users/david.blom/repositories/nautilus_trader/nautilus_trader/persistence/catalog/parquet.py", line 360 in query_rust
  File "/Users/david.blom/repositories/nautilus_trader/nautilus_trader/persistence/catalog/parquet.py", line 265 in query
  File "/Users/david.blom/repositories/nautilus_trader/nautilus_trader/persistence/catalog/base.py", line 127 in quote_ticks
  File "/Users/david.blom/repositories/nautilus_trader/tests/unit_tests/persistence/test_catalog.py", line 358 in test_partioning_min_rows_per_group
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/_pytest/python.py", line 194 in pytest_pyfunc_call
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/pluggy/_callers.py", line 77 in _multicall
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/pluggy/_manager.py", line 115 in _hookexec
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/pluggy/_hooks.py", line 493 in __call__
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/_pytest/python.py", line 1792 in runtest
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/_pytest/runner.py", line 169 in pytest_runtest_call
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/pluggy/_callers.py", line 77 in _multicall
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/pluggy/_manager.py", line 115 in _hookexec
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/pluggy/_hooks.py", line 493 in __call__
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/_pytest/runner.py", line 262 in <lambda>
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/_pytest/runner.py", line 341 in from_call
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/_pytest/runner.py", line 261 in call_runtest_hook
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/_pytest/runner.py", line 222 in call_and_report
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/_pytest/runner.py", line 133 in runtestprotocol
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/_pytest/runner.py", line 114 in pytest_runtest_protocol
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/pluggy/_callers.py", line 77 in _multicall
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/pluggy/_manager.py", line 115 in _hookexec
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/pluggy/_hooks.py", line 493 in __call__
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/_pytest/main.py", line 350 in pytest_runtestloop
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/pluggy/_callers.py", line 77 in _multicall
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/pluggy/_manager.py", line 115 in _hookexec
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/pluggy/_hooks.py", line 493 in __call__
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/_pytest/main.py", line 325 in _main
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/_pytest/main.py", line 271 in wrap_session
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/_pytest/main.py", line 318 in pytest_cmdline_main
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/pluggy/_callers.py", line 77 in _multicall
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/pluggy/_manager.py", line 115 in _hookexec
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/pluggy/_hooks.py", line 493 in __call__
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/_pytest/config/__init__.py", line 169 in main
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/lib/python3.10/site-packages/_pytest/config/__init__.py", line 192 in console_main
  File "/Users/david.blom/Library/Caches/pypoetry/virtualenvs/nautilus-trader-rxZ3bm8I-py3.10/bin/pytest", line 8 in <module>

Extension modules: multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, charset_normalizer.md, nautilus_trader.core.data
zsh: abort      poetry run pytest

@cjdsellers (Member)

Thanks @davidblom603

I think we were after a test for the case where the data has fewer rows than min_rows_per_group, though. I can also try this out locally when I get to it.

@cjdsellers (Member)

Looks like the pre-commit just needs to be run too:

make pre-commit

or

pre-commit run --all-files

@davidblom603 (Author) commented Oct 20, 2023

Great! Thanks. I've run black / pre-commit and modified the unit test to use fewer rows than min_rows_per_group.

@cjdsellers (Member)

         # Assert
>       assert result == expected_num_quotes
E       assert 0 == 100

@cjdsellers (Member)

@davidblom603

Did you want to investigate this? Alternatively, I can replicate on a local branch and figure it out too - up to you.

@davidsblom (Member)

Hi Chris, I won't be able to dive deeper this weekend due to other obligations. If you have time to look into it, that would be great.

My gut feeling, based on some other tests, is that the unit test will also fail when the changes to the ParquetDataCatalog are reverted. I think the catalog does not read parquet files that are partitioned.

@@ -208,12 +208,14 @@ def write_chunk(
self._fast_write(table=table, path=path, fs=self.fs)
else:

My thinking is that it works without the partitioning flag here; perhaps another check can be used to decide whether to enter the write_dataset code block.
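
For illustration, one possible shape of such a check, written as a standalone function rather than the actual write_chunk method (a hypothetical sketch; the function name, the defaults, and the pq.write_table stand-in for _fast_write are assumptions, not the merged code):

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

def write_chunk_sketch(
    table: pa.Table,
    path: str,
    partitioning: list[str] | None = None,
    min_rows_per_group: int = 0,
    max_rows_per_group: int = 5_000,
) -> None:
    # Only route through write_dataset when partitioning or row-group
    # tuning is actually requested; otherwise write a single parquet file
    # directly (the "fast" path in the diff above).
    if partitioning is None and min_rows_per_group <= 0:
        pq.write_table(table, path)  # stand-in for the catalog's _fast_write
    else:
        ds.write_dataset(
            table,
            path,
            format="parquet",
            partitioning=partitioning,
            partitioning_flavor="hive",
            min_rows_per_group=min_rows_per_group,
            max_rows_per_group=max_rows_per_group,
        )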

@cjdsellers (Member)

Thanks @davidblom603

I went ahead and incorporated your changes and added the unit test, now on the develop branch as of 3895263.

The differences are that I added min_rows_per_group and max_rows_per_group to the catalog's constructor, and also skipped the unit test for now, as I was getting the same outcome locally as in CI (regardless of the min_rows_per_group setting) -- so I agree something is up with partitioning, but I haven't dug deeper yet.

In a later commit I also added a check for sort order on write, which raises a ValueError if the data is not monotonically increasing (i.e., non-decreasing).
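
For reference, a sort-order guard of that kind could look roughly like the sketch below (an illustration only, not the merged code; using ts_init as the timestamp field is an assumption based on the convention for nautilus data objects):

def assert_sorted(data) -> None:
    # Raise if ts_init values ever decrease, i.e. the data is not
    # monotonically increasing (non-decreasing).
    ts = [item.ts_init for item in data]
    if any(later < earlier for earlier, later in zip(ts, ts[1:])):
        raise ValueError("Data must be non-decreasing by ts_init for writing")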

@cjdsellers closed this Oct 22, 2023
@davidsblom (Member)

Fantastic! Many thanks
