Let merge_datasets_as_delayed merge >2 datasets and filter by predicates #235
Firstly, thanks for the interest, and sorry for the delayed response. Secondly, the entire currently existing alignment logic is about as old as the library and was never really refactored. I'm open to breaking some eggs in this situation (we have a few other things which should be addressed in terms of UX, so a friendly breaking release would be appropriate in the near future anyhow).
As I said, this implementation dates back a long time. In the very first iterations we used hard-coded file names (e.g. partition_0, partition_1) which were provided by the user application. In that context this makes a lot of sense and obviously allows for the easiest alignment.
Similar to the labels, I would encourage people not to actually use multi-table datasets and to stick to a single table per dataset. If we remove this feature, this question also becomes obsolete.
Definitely.
As a first iteration I would propose to stick to a simple deep join. Full disclosure: we do have an in-house implementation of a more advanced multi-dataset alignment. Its alignment logic is essentially based on matching `partition_keys` and does not allow more advanced alignment. We intend to publish this as OSS as well, but the timeline is, best case, a few weeks out. I'd be curious whether this would suit your needs. Can you elaborate a bit more on how you want to use this?
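As a rough illustration of what "match `partition_keys`" alignment means (a toy sketch, not the in-house implementation; all dataset names and files are made up): partitions from several datasets are grouped by their partition-key values, and only co-located groups get merged.

```python
from collections import defaultdict

# Toy data: (dataset, partition keys, file) triples; all names are made up.
partitions = [
    ("ds1", {"date": "2020-01-01"}, "ds1/part_0.parquet"),
    ("ds2", {"date": "2020-01-01"}, "ds2/part_7.parquet"),
    ("ds1", {"date": "2020-01-02"}, "ds1/part_1.parquet"),
]

# Group partitions by their partition-key values; each group is one merge task.
groups = defaultdict(list)
for dataset, keys, filename in partitions:
    groups[tuple(sorted(keys.items()))].append((dataset, filename))

for key, members in groups.items():
    print(dict(key), "->", members)
```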
Depending on what your datasets look like (partitioning and indexing), #226 could also be interesting for you. If a simple join axis suffices, you could let dask figure out the join once you have an index.
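For illustration, a minimal dask sketch of that idea (the two frames stand in for datasets read as dask DataFrames; the column names are made up):

```python
import dask.dataframe as dd
import pandas as pd

# Stand-ins for two datasets loaded as dask DataFrames.
left = dd.from_pandas(
    pd.DataFrame({"key": [1, 2, 3], "a": [10, 20, 30]}), npartitions=2
)
right = dd.from_pandas(
    pd.DataFrame({"key": [2, 3, 4], "b": ["x", "y", "z"]}), npartitions=2
)

# With both frames indexed on the join column, dask plans the alignment
# itself instead of kartothek having to match partition labels.
joined = dd.merge(
    left.set_index("key"),
    right.set_index("key"),
    left_index=True,
    right_index=True,
    how="inner",
)
print(joined.compute())
```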
I am currently shocked: we forgot to put the tests for the merge pipeline in the upstream package 😱 (that's embarrassing)
No, there are tests in https://github.com/JDASoftwareGroup/kartothek/blob/9314d66b2b35a64282945c6b6ae24d8bb5a51ed0/kartothek/io/testing/merge.py and https://github.com/JDASoftwareGroup/kartothek/blob/9314d66b2b35a64282945c6b6ae24d8bb5a51ed0/tests/io/dask/delayed/test_merge.py. They are very basic, but they cover at least the majority of the code, enough to show that #239 breaks these tests.
Supplying all DataFrames to the merge function is an enormous performance benefit if you don't do a plain merge but can do other optimizations, like only using …
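To illustrate the point with a toy pandas comparison (the frames are made up): handing all frames to a single call lets pandas align them once, instead of materializing an intermediate result after every pairwise step.

```python
import functools
import pandas as pd

# Four small frames sharing the same index.
dfs = [
    pd.DataFrame({"key": [1, 2], f"col_{i}": [i, i + 1]}).set_index("key")
    for i in range(4)
]

# Pairwise merging materializes an intermediate frame after every step ...
pairwise = functools.reduce(lambda a, b: a.join(b, how="inner"), dfs)

# ... while a single n-ary call aligns all frames at once.
n_ary = pd.concat(dfs, axis=1, join="inner")

assert pairwise.equals(n_ary)
```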
Agreed, but I would argue this doesn't need to be part of the public interface, does it? I guess it should be possible to find a suitable fast join method which can be used for everything.
No worries about taking your time to respond. I guess @xhochy (who is now implementing the merge functionality much better than I could) can answer most of your questions. Simply put, the issue we were facing is that the combined dataset is too big to fit into memory before subselecting columns and applying predicates. We have around 1000 partitions per dataset with only one Parquet file per partition, such that sequentially loading chunks (or a subset of columns), aligning (basically …
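A sketch of that per-partition workflow (the file lists, columns, and predicate are placeholders, not our actual setup): each aligned pair of partitions is loaded, merged, and filtered before the next one, so the full combined dataset never has to fit into memory.

```python
import pandas as pd

# Placeholder file lists: one Parquet file per partition, pairwise aligned.
left_files = [f"left/part_{i}.parquet" for i in range(1000)]
right_files = [f"right/part_{i}.parquet" for i in range(1000)]

chunks = []
for lf, rf in zip(left_files, right_files):
    # Load only the needed columns of one partition at a time.
    left = pd.read_parquet(lf, columns=["key", "a"])
    right = pd.read_parquet(rf, columns=["key", "b"])
    merged = left.merge(right, on="key", how="inner")
    # Apply the predicate immediately so only the filtered chunk stays in memory.
    chunks.append(merged[merged["a"] > 0])

result = pd.concat(chunks, ignore_index=True)
```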
It would be nice to be able to supply `kartothek.io.dask.delayed.merge_datasets_as_delayed` with a list of `dataset_uuids` to merge an arbitrary number of datasets.

This could be implemented by a `kartothek.dask.delayed.concat_datasets_as_delayed` or similar which takes a list of `dataset_uuids` as an argument, or by allowing `kartothek.dask.delayed.merge_datasets_as_delayed` to accept a list for `left`, so as not to break existing usages of the function. A possible signature is sketched below.
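A hypothetical signature for the list-based variant (nothing below exists in kartothek; the parameter names only spell out the suggestion):

```python
from typing import Callable, List, Optional


def concat_datasets_as_delayed(
    dataset_uuids: List[str],
    store: Callable,
    match_how: str = "first",
    merge_tasks: Optional[List[dict]] = None,
):
    """Merge an arbitrary number of datasets partition by partition.

    Purely illustrative: mirrors merge_datasets_as_delayed but takes a
    list of dataset UUIDs instead of a left/right pair.
    """
    raise NotImplementedError
```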
The `match_how="left"` would need to be replaced by `match_how="first"`. I am not sure how to translate `match_how="prefix"`; what is a typical use case here?

Additionally, it would be nice to supply a `merge_func` via `merge_tasks` that takes more than two dataframes as input. This would require a similar change as above: either defining a `MetaPartition.concat_dataframes` or similar method, or allowing `MetaPartition.merge_dataframes` to accept a list for `left`. A sketch of such a `merge_func` follows.
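For example, an n-ary `merge_func` could fold a pairwise merge over the list of frames (a sketch; the signature and keyword handling are made up, not kartothek API):

```python
import functools

import pandas as pd


def merge_func(dfs, merge_kwargs=None):
    """Illustrative n-ary merge: folds pandas.merge over a list of frames."""
    merge_kwargs = merge_kwargs or {}
    return functools.reduce(lambda a, b: a.merge(b, **merge_kwargs), dfs)


dfs = [
    pd.DataFrame({"key": [1, 2], "a": [1.0, 2.0]}),
    pd.DataFrame({"key": [1, 2], "b": [3.0, 4.0]}),
    pd.DataFrame({"key": [1, 2], "c": [5.0, 6.0]}),
]
merged = merge_func(dfs, {"on": "key", "how": "inner"})
```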
Questions concerning the current implementation of `merge_datasets_as_delayed`:

- With `match_how="exact"`, the keys of the partitions are compared to match partitions. However, since the keys include the names of the stored Parquet files, they will never match. What is the idea here?
- Could `merge_func` be called with all available dataframes if no labels (currently `left` and `right`) are supplied?

Additionally, it would be nice to be able to supply `merge_datasets_as_delayed` with a `predicates` argument that filters by partitions before merging and then uses `filter_df_from_predicates` after merging, as sketched below.
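For reference, `filter_df_from_predicates` can already be applied to a merged frame by hand; a minimal usage sketch (assuming the `kartothek.serialization` import path and the library's disjunctive-normal-form predicate format):

```python
import pandas as pd

from kartothek.serialization import filter_df_from_predicates

df = pd.DataFrame({"key": [1, 2, 3], "a": [5, 10, 15]})

# Predicates are a list of AND-clauses that are OR-ed together.
predicates = [[("a", ">", 5)]]

filtered = filter_df_from_predicates(df, predicates)
print(filtered)  # rows where a > 5
```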