Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_indices_as_dataframe returns an empty dataframe if using secondary_indices #282

Open
brl0 opened this issue May 15, 2020 · 5 comments
Open
Labels
refactoring usability Interface is unclear or inconvenient

Comments

@brl0
Copy link

brl0 commented May 15, 2020

get_indices_as_dataframe returns an empty dataframe if the dataset was created using secondary_indices, even though .partitions and .index_columns reflect correct values.

Creating an identical dataset without secondary_indices works normally.

The dataset was created with update_dataset_from_ddf.

Using kartothek 3.8.2.

This seems to be an issue, but let me know if I am missing something.
Thanks.

@brl0
Copy link
Author

brl0 commented May 16, 2020

Example:

from functools import partial
from tempfile import TemporaryDirectory
from storefact import get_store_from_url
from kartothek.io.eager import store_dataframes_as_dataset

dataset_dir = TemporaryDirectory()
store_factory = partial(get_store_from_url, f"hfs://{dataset_dir.name}")

df = pd.DataFrame({"part": ["part1", "part2", "part2"], "id": [1, 2, 3], "val": [4, 5, 6]})

dm1 = store_dataframes_as_dataset(
    store_factory,
    "a_unique_dataset_identifier",
    [df],
    partition_on=["part"],
)
print(len(dm1.load_partition_indices().get_indices_as_dataframe()))

dm2 = store_dataframes_as_dataset(
    store_factory,
    "another_unique_dataset_identifier",
    [df],
    partition_on=["part"],
    secondary_indices=["id"]
)
print(len(dm2.load_partition_indices().get_indices_as_dataframe()))

@lr4d
Copy link
Collaborator

lr4d commented May 19, 2020

I have confirmed the behavior of the above code block, which is indeed inconsistent behavior. However, the following seems to work fine

dm22 = dm2.load_all_indices(store_factory())
dm22.get_indices_as_dataframe()
Out[7]: 
                                              part  id
partition                                             
part=part1/4c3069b3c09b405cad5699caf9afaad1  part1   1
part=part2/4c3069b3c09b405cad5699caf9afaad1  part2   2
part=part2/4c3069b3c09b405cad5699caf9afaad1  part2   3

@brl0
Copy link
Author

brl0 commented May 19, 2020

Thanks, this helps!

On a separate but related note, it does seem redundant to have to pass a store_factory to load_all_indices since the DatasetMetadata object from which it is called has a store property.

@fjetter
Copy link
Collaborator

fjetter commented May 20, 2020

This issue is caused by get_indices_as_dataframe not loading the indices automatically. From a UX perspective, I would argue it should either raise or load them automatically but not silently do the wrong thing

@fjetter
Copy link
Collaborator

fjetter commented May 20, 2020

On a separate but related note, it does seem redundant to have to pass a store_factory to load_all_indices since the DatasetMetadata object from which it is called has a store property.

I agree this is currently a bit confusing since we have two competing interfaces.

class DatasetMetadataBase(CopyMixin):

and
class DatasetFactory(DatasetMetadataBase):

The latter is a newer generation and holds on a reference to the store, eliminating this redundancy. I would like to simplify and straighten the interface in this area, even if we need to break a few eggs.

@fjetter fjetter added refactoring usability Interface is unclear or inconvenient labels May 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactoring usability Interface is unclear or inconvenient
Development

No branches or pull requests

3 participants