get_indices_as_dataframe returns an empty dataframe if using secondary_indices #282

brl0 · 2020-05-15T22:48:11Z

get_indices_as_dataframe returns an empty dataframe if the dataset was created using secondary_indices, even though .partitions and .index_columns reflect correct values.

Creating an identical dataset without secondary_indices works normally.

The dataset was created with update_dataset_from_ddf.

Using kartothek 3.8.2.

This seems to be an issue, but let me know if I am missing something.
Thanks.

brl0 · 2020-05-16T00:30:24Z

Example:

from functools import partial
from tempfile import TemporaryDirectory

import pandas as pd
from storefact import get_store_from_url
from kartothek.io.eager import store_dataframes_as_dataset

# Back the store with a temporary directory on the local filesystem.
dataset_dir = TemporaryDirectory()
store_factory = partial(get_store_from_url, f"hfs://{dataset_dir.name}")

df = pd.DataFrame({"part": ["part1", "part2", "part2"], "id": [1, 2, 3], "val": [4, 5, 6]})

# Without secondary_indices: the indices dataframe is populated as expected.
dm1 = store_dataframes_as_dataset(
    store_factory,
    "a_unique_dataset_identifier",
    [df],
    partition_on=["part"],
)
print(len(dm1.load_partition_indices().get_indices_as_dataframe()))

# With secondary_indices: the same call returns an empty dataframe (the reported bug).
dm2 = store_dataframes_as_dataset(
    store_factory,
    "another_unique_dataset_identifier",
    [df],
    partition_on=["part"],
    secondary_indices=["id"],
)
print(len(dm2.load_partition_indices().get_indices_as_dataframe()))

lr4d · 2020-05-19T10:02:56Z

I have confirmed the behavior of the above code block, which is indeed inconsistent. However, the following seems to work fine:

dm22 = dm2.load_all_indices(store_factory())
dm22.get_indices_as_dataframe()
Out[7]: 
                                              part  id
partition                                             
part=part1/4c3069b3c09b405cad5699caf9afaad1  part1   1
part=part2/4c3069b3c09b405cad5699caf9afaad1  part2   2
part=part2/4c3069b3c09b405cad5699caf9afaad1  part2   3

brl0 · 2020-05-19T18:45:22Z

Thanks, this helps!

On a separate but related note, it does seem redundant to have to pass a store_factory to load_all_indices since the DatasetMetadata object from which it is called has a store property.

fjetter · 2020-05-20T07:02:21Z

This issue is caused by get_indices_as_dataframe not loading the indices automatically. From a UX perspective, I would argue it should either raise or load them automatically, but not silently do the wrong thing.
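To make the two options concrete, here is a minimal sketch using a toy class. It is not kartothek code: ToyDataset, IndicesNotLoadedError and the three indices_* methods are hypothetical names invented for illustration, and the in-memory dict only stands in for index files in a store.

```python
class IndicesNotLoadedError(RuntimeError):
    """Hypothetical error type for the 'raise' option."""


class ToyDataset:
    """Toy stand-in for a dataset metadata object (not kartothek code)."""

    def __init__(self, stored_indices):
        # What the store holds: {column: {value: [partition labels]}}.
        self._stored = stored_indices
        # What is in memory; empty until indices are explicitly loaded.
        self._loaded = {}

    def load_all_indices(self):
        self._loaded = dict(self._stored)
        return self

    def _missing(self):
        # Index columns that exist in the store but are not in memory yet.
        return sorted(set(self._stored) - set(self._loaded))

    def indices_silent(self):
        # Mirrors the reported behaviour: unloaded indices are simply absent.
        return dict(self._loaded)

    def indices_strict(self):
        # Option 1: refuse to answer from incomplete data.
        missing = self._missing()
        if missing:
            raise IndicesNotLoadedError(f"indices not loaded: {missing}")
        return dict(self._loaded)

    def indices_autoload(self):
        # Option 2: load whatever is missing before answering.
        if self._missing():
            self.load_all_indices()
        return dict(self._loaded)


ds = ToyDataset({"id": {1: ["part=part1"], 2: ["part=part2"], 3: ["part=part2"]}})
print(ds.indices_silent())    # {} because nothing has been loaded yet
print(ds.indices_autoload())  # loads first, then returns the "id" index
```

Either of the last two variants would have surfaced the problem in the example above instead of returning an empty result.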

fjetter · 2020-05-20T07:02:24Z

> On a separate but related note, it does seem redundant to have to pass a store_factory to load_all_indices since the DatasetMetadata object from which it is called has a store property.

I agree this is currently a bit confusing since we have two competing interfaces.

kartothek/kartothek/core/dataset.py, line 53 in ba6215e:

class DatasetMetadataBase(CopyMixin):

and

kartothek/kartothek/core/factory.py, line 39 in ba6215e:

class DatasetFactory(DatasetMetadataBase):

The latter is a newer generation and holds on to a reference to the store, eliminating this redundancy. I would like to simplify and straighten the interface in this area, even if we need to break a few eggs.
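As a rough illustration of the difference between the two styles, here is a toy sketch. ToyMetadata and ToyFactory are hypothetical stand-ins, not the real DatasetMetadataBase or DatasetFactory, and a plain dict plays the role of the store.

```python
class ToyMetadata:
    """Older style: holds no store, so the caller passes one on every call."""

    def __init__(self, uuid):
        self.uuid = uuid

    def load_all_indices(self, store):
        # The store has to come from the caller each time.
        return store[f"{self.uuid}/indices"]


class ToyFactory(ToyMetadata):
    """Newer style: keeps a reference to the store factory."""

    def __init__(self, uuid, store_factory):
        super().__init__(uuid)
        self.store_factory = store_factory

    def load_all_indices(self):
        # No argument needed: the store is created from the held factory.
        return super().load_all_indices(self.store_factory())


toy_store = {"my_dataset/indices": ["part", "id"]}

old_style = ToyMetadata("my_dataset").load_all_indices(toy_store)
new_style = ToyFactory("my_dataset", lambda: toy_store).load_all_indices()
print(old_style == new_style)  # True
```

In the second style the caller cannot accidentally pass a store that does not match the dataset, which is one argument for converging on it.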

fjetter added the "refactoring" and "usability" (Interface is unclear or inconvenient) labels May 20, 2020

lr4d mentioned this issue Jul 22, 2020: Clean up DatasetMetadata / DatasetFactory meta-issue #321 (Open, 6 tasks)
