Add merge_many_datasets_as_delayed #243

xhochy · 2020-03-11T14:21:55Z

This implements a merge that works on multiple datasets. For the moment, I have kept the code separate from the existing merge. How partitions are aligned depends on `match_how`, because in some cases we can speed up the alignment considerably, e.g. by using the indices dataframes.

Fixes #235

Missing:

- predicate support
- columns support
- exact matching
- merge with existing code

codecov · 2020-03-11T14:28:16Z

Codecov Report

Merging #243 into master will decrease coverage by 0.08%.
The diff coverage is 86.02%.

@@            Coverage Diff             @@
##           master     #243      +/-   ##
==========================================
- Coverage   89.73%   89.64%   -0.09%     
==========================================
  Files          39       39              
  Lines        3720     3795      +75     
  Branches      901      927      +26     
==========================================
+ Hits         3338     3402      +64     
- Misses        224      230       +6     
- Partials      158      163       +5

| Impacted Files | Coverage Δ | |
|---|---|---|
| kartothek/core/factory.py | `85.29% <100%> (+4.73%)` | ⬆️ |
| kartothek/io/dask/delayed.py | `100% <100%> (ø)` | ⬆️ |
| kartothek/io_components/read.py | `89.28% <71.42%> (-3.45%)` | ⬇️ |
| kartothek/io_components/metapartition.py | `90.86% <73.68%> (-0.52%)` | ⬇️ |
| kartothek/io_components/merge.py | `92.94% <87.23%> (-7.06%)` | ⬇️ |
| kartothek/core/common_metadata.py | `94.73% <0%> (-0.66%)` | ⬇️ |
| kartothek/io/eager.py | `76.96% <0%> (-0.13%)` | ⬇️ |
| kartothek/core/dataset.py | `87.78% <0%> (-0.04%)` | ⬇️ |
| kartothek/io_components/utils.py | `84.45% <0%> (ø)` | ⬆️ |

... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 65948dc...3bfa595.

lr4d · 2020-03-19T11:33:56Z

kartothek/io_components/merge.py

+from kartothek.io_components.utils import _instantiate_store, _make_callable
+
+if TYPE_CHECKING:
+    from simplekv import KeyValueStore


Suggested change

-    from simplekv import KeyValueStore
+    from simplekv import KeyValueStore  # noqa: F401

Linting fails: `kartothek/io_components/merge.py:14:5: F401 'simplekv.KeyValueStore' imported but unused`
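For readers unfamiliar with the pattern in this hunk, here is a minimal sketch of an import guarded by `TYPE_CHECKING` and silenced with `# noqa: F401`. `Decimal` is only a stand-in for `KeyValueStore` so the snippet runs without simplekv, and `describe_store` is a hypothetical function. Whether flake8 reports F401 in the real module depends on how the name is used elsewhere in the file, which this excerpt does not show.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by static type checkers only; this branch never runs at runtime.
    # Decimal is a stand-in for simplekv.KeyValueStore (assumption for illustration).
    from decimal import Decimal  # noqa: F401


def describe_store(store: "Decimal") -> str:
    """Hypothetical function: the string annotation needs no runtime import."""
    return type(store).__name__


print(describe_store(3))
```

The guard avoids importing the dependency at runtime while still letting a type checker resolve the annotation.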

marco-neumann-by · 2020-03-30T15:05:33Z

kartothek/io/dask/delayed.py

+    ----------
+    dataset_uuids : List[str]
+    match_how : Union[str, Callable]
+        Define the partition label matching scheme.


Why is the whole thing label-based and not index-based?

marco-neumann-by · 2020-03-30T15:06:30Z

kartothek/io/dask/delayed.py

+        Define the partition label matching scheme.
+        Available implementations are:
+
+        * first : The partitions of the first dataset are considered to be the base


Related to the question above: do we really need different string-based join modes?

marco-neumann-by · 2020-03-30T15:07:20Z

kartothek/io/dask/delayed.py

+        * first : The partitions of the first dataset are considered to be the base
+                  partitions and **all** partitions of the remaining datasets are
+                  joined to the partitions of the first dataset. This should only be
+                  used if all but the first dataset contain very few partitions.


What does "few" mean? What happens if this is not the case? Please give the user more guidance and try to provide a more failure-proof API.
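To make the cost question concrete, here is a rough sketch of what the `first` description implies. It assumes each base partition is grouped with every partition of every other dataset and that nothing is cached or shared between groups; the excerpt does not confirm either point. `match_first` and `partition_loads` are hypothetical helpers, not part of the PR.

```python
from typing import Dict, List


def match_first(labels_per_dataset: List[List[str]]) -> List[Dict[int, List[str]]]:
    """Hypothetical helper: build one group per partition of the first dataset.

    Each group holds that base partition plus *all* partitions of every
    other dataset, following the ``first`` description in the docstring.
    """
    base, *others = labels_per_dataset
    groups = []
    for label in base:
        group = {0: [label]}
        for idx, labels in enumerate(others, start=1):
            group[idx] = list(labels)
        groups.append(group)
    return groups


def partition_loads(labels_per_dataset: List[List[str]]) -> int:
    """Count partition reads, assuming every group loads its inputs independently."""
    groups = match_first(labels_per_dataset)
    return sum(len(labels) for group in groups for labels in group.values())


labels = [["p0", "p1", "p2"], ["a", "b"], ["x"]]
print(len(match_first(labels)))
print(partition_loads(labels))
```

Under these assumptions the number of reads grows with the product of the base partition count and the combined partition count of the other datasets, which is one way to quantify "few".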

marco-neumann-by · 2020-03-30T15:08:17Z

kartothek/io/dask/delayed.py

+        explicit instructions for a specific merge.
+        Each dict should contain key/values:
+
+        * 'output_label' : The table for the merged dataframe


What about the `tables` key from the example below?

marco-neumann-by · 2020-03-30T15:08:43Z

kartothek/io/dask/delayed.py

+        * `merge_func`: A callable with signature
+                        `merge_func(dfs, merge_kwargs)` to
+                        handle the data preprocessing and merging.
+        * 'merge_kwargs' : The kwargs to be passed to the `merge_func`


Not required; use a partial instead.
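A minimal sketch of the suggestion, assuming "partial" refers to `functools.partial`: the keyword arguments are bound ahead of time, so the task only needs to carry a callable that takes `dfs`. Tables are modelled as lists of dicts to keep the snippet stdlib-only; `inner_join` and its `on` parameter are hypothetical.

```python
from functools import partial
from typing import Dict, List

Row = Dict[str, object]
Table = List[Row]  # stand-in for a dataframe so the sketch needs only the stdlib


def inner_join(dfs: List[Table], on: str) -> Table:
    """Hypothetical merge function: inner-join all tables on one key column."""
    result = dfs[0]
    for right in dfs[1:]:
        result = [
            {**left_row, **right_row}
            for left_row in result
            for right_row in right
            if left_row[on] == right_row[on]
        ]
    return result


# Bind the configuration up front; the task then carries a one-argument
# callable and needs no separate 'merge_kwargs' entry.
merge_func = partial(inner_join, on="id")

left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
right = [{"id": 2, "b": "z"}]
print(merge_func([left, right]))
```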

marco-neumann-by · 2020-03-30T15:09:56Z

kartothek/io/dask/delayed.py

+        If False (Default), the partition labels of the dataset with fewer
+        partitions are interpreted as prefixes.
+    merge_tasks : List[Dict]
+        A list of merge tasks. Each item in this list is a dictionary giving


Does a merge task drop/consume its input tables? If not, I think this might be a memory issue.
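One possible reading of "consume" is sketched below: the task removes its inputs from the table mapping before storing the merged result, so the mapping no longer keeps the originals alive. `apply_merge_task` and `input_labels` are hypothetical names, and plain lists stand in for dataframes. This does not describe what the PR currently does.

```python
from typing import Callable, Dict, List


def apply_merge_task(
    tables: Dict[str, list],
    input_labels: List[str],
    output_label: str,
    merge_func: Callable[[List[list]], list],
) -> None:
    """Hypothetical: merge the input tables into ``output_label`` and drop the inputs."""
    # pop() removes each input from the mapping, so after this function
    # returns the mapping holds no reference to the original tables.
    inputs = [tables.pop(label) for label in input_labels]
    tables[output_label] = merge_func(inputs)


tables = {"core": [1, 2], "helper": [3]}
apply_merge_task(tables, ["core", "helper"], "merged", lambda dfs: sum(dfs, []))
print(sorted(tables))
```

A consuming task means later tasks cannot reuse the same inputs, so the trade-off between memory and reuse would need to be decided explicitly.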

xhochy force-pushed the merge-many-dispatch-by branch from a023291 to 9c3765f Compare March 11, 2020 14:22

xhochy force-pushed the merge-many-dispatch-by branch from 273271d to 3bc661b Compare March 12, 2020 09:26

xhochy mentioned this pull request Mar 12, 2020

Type dispatch_metapartitions #246

Merged

xhochy added 4 commits March 12, 2020 15:59

Add merge_many_datasets_as_delayed

b4e76c2

Add support for predicates

2bb0b78

Add columns attribute

df28900

Fix rebase conflicts

3bfa595

xhochy force-pushed the merge-many-dispatch-by branch from 3bc661b to 3bfa595 Compare March 12, 2020 15:00

This was referenced Mar 13, 2020

Let merge_datasets_as_delayed merge >2 datasets, filter by predicates and subset column #239

Closed

Let merge_datasets_as_delayed merge >2 datasets and filter by predicates #235

Open

lr4d reviewed Mar 19, 2020

marco-neumann-by suggested changes Mar 30, 2020

mlondschien mentioned this pull request Mar 4, 2021

Removal of complex input types #427

Closed
