data catalog: set the min_rows_per_group when writing a parquet dataset to disk for improved IO performance #1281
Conversation
I left this property out because it was unclear how the writer would behave if there are, say, 100 rows to write and both the min and max row group sizes are 5000. Would the writer then stall waiting for more rows?
It will write out the last group without waiting. I could add a unit test later.
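For anyone reading along, here is a minimal standalone sketch of that behavior, assuming only pyarrow's documented `write_dataset` parameters (this is not the catalog's actual code): with 100 rows and both limits at 5000, the writer still flushes a single final row group rather than stalling.

```python
# Assumed example, not the catalog's code: with fewer rows than
# min_rows_per_group, write_dataset still flushes one final row group.
import os
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

table = pa.table({"x": list(range(100))})  # only 100 rows

with tempfile.TemporaryDirectory() as tmp:
    ds.write_dataset(
        table,
        tmp,
        format="parquet",
        min_rows_per_group=5_000,
        max_rows_per_group=5_000,
    )
    part = os.path.join(tmp, os.listdir(tmp)[0])
    meta = pq.ParquetFile(part).metadata
    print(meta.num_row_groups, meta.num_rows)  # expected: 1 100
```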
Thanks for confirming this. Yes, a unit test will make sure the behavior stays well covered.
Just added a unit test. I'm guessing there is something unexpected happening, though. Locally it can write the data successfully, but it crashes on line 358. I'm guessing it is because of my local installation.
Thanks @davidblom, I think we were after a test where the data has fewer rows than the `min_rows_per_group`.
Looks like the pre-commit just needs to be run too.
Great! Thanks. I've run black / pre-commit and modified the unit test to use fewer rows than `min_rows_per_group`.
Did you want to investigate this? Alternatively I can replicate on a local branch and figure it out too - up to you.
Hi Chris, this weekend I won't be able to dive deeper due to other obligations. If you have time to look into it, that would be great. My gut feeling, based on some other tests, is that the unit test will also fail when the changes to the ParquetDataCatalog are reverted. I think the catalog does not read parquet files that are partitioned.
@@ -208,12 +208,14 @@ def write_chunk(
self._fast_write(table=table, path=path, fs=self.fs)
else:
My thinking is that it works without the partitioning flag here; perhaps another check can be used to enter the write_dataset code block.
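To make the suggestion concrete, here is a hypothetical sketch of such a check (the function name, signature, and fast-path call are assumptions for illustration, not the ParquetDataCatalog's actual code):

```python
# Hypothetical illustration only: route to write_dataset whenever
# partitioning is requested, otherwise keep the single-file fast path.
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq


def write_table(
    table: pa.Table,
    path: str,
    partitioning=None,
    min_rows_per_group: int = 5_000,
    max_rows_per_group: int = 5_000,
) -> None:
    if partitioning is None:
        # Fast path: one plain parquet file, no dataset machinery.
        pq.write_table(table, path)
    else:
        # Partitioned path: go through write_dataset so the row-group
        # limits are applied while splitting across partitions.
        ds.write_dataset(
            table,
            path,
            format="parquet",
            partitioning=partitioning,
            min_rows_per_group=min_rows_per_group,
            max_rows_per_group=max_rows_per_group,
        )
```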
Thanks @davidblom603, I went ahead and incorporated your changes and added the unit test, now on … The differences being I added … In a later commit I also added a check for sort order on write, which raises a …
Fantastic! Many thanks
data catalog: set the min_rows_per_group when writing a parquet dataset to disk for improved IO performance.
Pull Request
Set the `min_rows_per_group` attribute when writing a parquet dataset, otherwise it could end up with very small row groups, hurting IO performance. This is mentioned in the documentation for the `max_rows_per_group` attribute, and also confirmed by my local tests: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html
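For reference, a minimal sketch of the call in question, based on the documented pyarrow API (the directory name and values are illustrative assumptions, not the catalog's defaults):

```python
# Illustrative only: set min_rows_per_group alongside max_rows_per_group
# so the writer does not emit many tiny row groups.
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"instrument_id": ["A"] * 10_000, "price": [1.0] * 10_000})

ds.write_dataset(
    table,
    "catalog/quote_ticks",      # hypothetical output directory
    format="parquet",
    min_rows_per_group=5_000,   # avoid many small row groups
    max_rows_per_group=5_000,
)
```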
There is still room for improvement: processing an iterator instead of a list would allow writing datasets that do not fit in memory. Will leave that for later.
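A sketch of what that iterator-based approach could look like, assuming pyarrow's documented support for passing an iterable of record batches together with an explicit schema (this is not part of the PR):

```python
# Illustrative only: stream batches to write_dataset so the full
# dataset never has to be materialised as a single in-memory list.
import pyarrow as pa
import pyarrow.dataset as ds

schema = pa.schema([("x", pa.int64())])


def batches():
    for start in range(0, 1_000_000, 10_000):
        yield pa.record_batch([pa.array(range(start, start + 10_000))], schema=schema)


ds.write_dataset(
    batches(),
    "out_dataset",          # hypothetical output directory
    schema=schema,          # required when passing an iterable of batches
    format="parquet",
    min_rows_per_group=5_000,
    max_rows_per_group=100_000,
)
```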
Type of change
Delete options that are not relevant.
How has this change been tested?
Describe how this code was/is tested.