get_indices_as_dataframe returns an empty dataframe if using secondary_indices #282
Comments
Example:
from functools import partial
from tempfile import TemporaryDirectory

import pandas as pd
from storefact import get_store_from_url
from kartothek.io.eager import store_dataframes_as_dataset

# Back the dataset with a temporary local directory.
dataset_dir = TemporaryDirectory()
store_factory = partial(get_store_from_url, f"hfs://{dataset_dir.name}")
df = pd.DataFrame({"part": ["part1", "part2", "part2"], "id": [1, 2, 3], "val": [4, 5, 6]})

# Without secondary_indices: works as expected.
dm1 = store_dataframes_as_dataset(
    store_factory,
    "a_unique_dataset_identifier",
    [df],
    partition_on=["part"],
)
print(len(dm1.load_partition_indices().get_indices_as_dataframe()))

# With secondary_indices: reported to return an empty dataframe.
dm2 = store_dataframes_as_dataset(
    store_factory,
    "another_unique_dataset_identifier",
    [df],
    partition_on=["part"],
    secondary_indices=["id"],
)
print(len(dm2.load_partition_indices().get_indices_as_dataframe()))
I have confirmed the behavior of the above code block, which is indeed inconsistent behavior. However, the following seems to work fine
Thanks, this helps! On a separate but related note, it does seem redundant to have to pass a
This issue is caused by
I agree this is currently a bit confusing since we have two competing interfaces: kartothek/kartothek/core/dataset.py (line 53 in ba6215e) and kartothek/kartothek/core/factory.py (line 39 in ba6215e).
The latter is a newer generation and holds on to a reference to the store, eliminating this redundancy. I would like to simplify and straighten the interface in this area, even if we need to break a few eggs.
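To make the difference between the two interface styles concrete, here is a toy model in plain Python. It is not kartothek code: ToyMetadata, ToyFactory and the dict-based "store" are hypothetical names invented for this sketch. It only illustrates the general idea that an object holding a reference to its store does not need the store passed on every call.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a storage backend:
# index column -> {index value -> [partition labels]}
Store = Dict[str, Dict[object, List[str]]]


class ToyMetadata:
    """Older style (toy): the object does not know its store."""

    def __init__(self, index_columns: List[str]) -> None:
        self.index_columns = index_columns
        self.loaded: Dict[str, Dict[object, List[str]]] = {}

    def load_all_indices(self, store: Store) -> "ToyMetadata":
        # The caller has to supply the store on every call that reads data.
        self.loaded = {col: store[col] for col in self.index_columns}
        return self


class ToyFactory:
    """Newer style (toy): keeps a reference to a store factory."""

    def __init__(
        self, index_columns: List[str], store_factory: Callable[[], Store]
    ) -> None:
        self._metadata = ToyMetadata(index_columns)
        self._store_factory = store_factory

    def load_all_indices(self) -> "ToyFactory":
        # No store argument needed: the factory already holds the reference.
        self._metadata.load_all_indices(self._store_factory())
        return self

    @property
    def loaded(self) -> Dict[str, Dict[object, List[str]]]:
        return self._metadata.loaded


store: Store = {
    "part": {"part1": ["p0"], "part2": ["p1"]},
    "id": {1: ["p0"], 2: ["p1"], 3: ["p1"]},
}

old_style = ToyMetadata(["part", "id"]).load_all_indices(store)
new_style = ToyFactory(["part", "id"], lambda: store).load_all_indices()
```

Both objects end up with the same loaded indices; the only difference is where the store reference lives.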
get_indices_as_dataframe returns an empty dataframe if the dataset was created using secondary_indices, even though .partitions and .index_columns reflect correct values. Creating an identical dataset without secondary_indices works normally.
The dataset was created with update_dataset_from_ddf. Using kartothek 3.8.2.
This seems to be an issue, but let me know if I am missing something.
Thanks.