ValueError when reading dataset #335

Open
NeroCorleone opened this issue Aug 11, 2020 · 3 comments
Labels: good first issue (Good for newcomers), usability (Interface is unclear or inconvenient)

Comments

@NeroCorleone
Contributor

NeroCorleone commented Aug 11, 2020

Problem description

Reading a dataset with eager's read functionality raises a ValueError when columns are provided as a plain list.

Example code (ideally copy-pastable)

import pandas as pd

from tempfile import TemporaryDirectory
from functools import partial
from storefact import get_store_from_url

from kartothek.io.eager import store_dataframes_as_dataset, read_dataset_as_dataframes


dataset_dir = TemporaryDirectory()
store_factory = partial(get_store_from_url, f"hfs://{dataset_dir.name}")
dataset_uuid = "test"

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": "42",
    "PARTITION": "PARTITION_1",
})

# Store dataset
store_dataframes_as_dataset(store=store_factory(), dataset_uuid=dataset_uuid, dfs=df)

# Read the whole dataset -- this works:
result = read_dataset_as_dataframes(store=store_factory(), dataset_uuid=dataset_uuid)

# Read only some columns -- this fails:
result = read_dataset_as_dataframes(store=store_factory(), dataset_uuid=dataset_uuid, columns=["A"])
ValueError                                Traceback (most recent call last)
<ipython-input-116-7ee227d12007> in <module>
----> 1 result = read_dataset_as_dataframes(store=store_factory(), dataset_uuid=dataset_uuid, columns=["A"])

<decorator-gen-132> in read_dataset_as_dataframes(dataset_uuid, store, tables, columns, concat_partitions_on_primary_index, predicate_pushdown_to_io, categoricals, label_filter, dates_as_object, predicates, factory, dispatch_by)

~/venv/lib/python3.6/site-packages/kartothek/io_components/utils.py in normalize_args(function, *args, **kwargs)
    246         return function(*args, **kwargs)
    247 
--> 248     return _wrapper(*args, **kwargs)
    249 
    250 

~/venv/lib/python3.6/site-packages/kartothek/io_components/utils.py in _wrapper(*args, **kwargs)
    244                 else:
    245                     kwargs[arg_name] = normalize_arg(arg_name, None)
--> 246         return function(*args, **kwargs)
    247 
    248     return _wrapper(*args, **kwargs)

~/venv/lib/python3.6/site-packages/kartothek/io/eager.py in read_dataset_as_dataframes(dataset_uuid, store, tables, columns, concat_partitions_on_primary_index, predicate_pushdown_to_io, categoricals, label_filter, dates_as_object, predicates, factory, dispatch_by)
    143         factory=ds_factory,
    144         dispatch_by=dispatch_by,
--> 145         dispatch_metadata=False,
    146     )
    147     return [mp.data for mp in mps]

<decorator-gen-133> in read_dataset_as_metapartitions(dataset_uuid, store, tables, columns, concat_partitions_on_primary_index, predicate_pushdown_to_io, categoricals, label_filter, dates_as_object, predicates, factory, dispatch_by, dispatch_metadata)

~/venv/lib/python3.6/site-packages/kartothek/io_components/utils.py in normalize_args(function, *args, **kwargs)
    246         return function(*args, **kwargs)
    247 
--> 248     return _wrapper(*args, **kwargs)
    249 
    250 

~/venv/lib/python3.6/site-packages/kartothek/io_components/utils.py in _wrapper(*args, **kwargs)
    244                 else:
    245                     kwargs[arg_name] = normalize_arg(arg_name, None)
--> 246         return function(*args, **kwargs)
    247 
    248     return _wrapper(*args, **kwargs)

~/venv/lib/python3.6/site-packages/kartothek/io/eager.py in read_dataset_as_metapartitions(dataset_uuid, store, tables, columns, concat_partitions_on_primary_index, predicate_pushdown_to_io, categoricals, label_filter, dates_as_object, predicates, factory, dispatch_by, dispatch_metadata)
    214         dispatch_metadata=dispatch_metadata,
    215     )
--> 216     return list(ds_iter)
    217 
    218 

~/venv/lib/python3.6/site-packages/kartothek/io/iter.py in read_dataset_as_metapartitions__iterator(dataset_uuid, store, tables, columns, concat_partitions_on_primary_index, predicate_pushdown_to_io, categoricals, label_filter, dates_as_object, load_dataset_metadata, predicates, factory, dispatch_by, dispatch_metadata)
    102                 predicate_pushdown_to_io=predicate_pushdown_to_io,
    103                 dates_as_object=dates_as_object,
--> 104                 predicates=predicates,
    105             )
    106         yield mp

~/venv/lib/python3.6/site-packages/kartothek/io_components/metapartition.py in _impl(self, *method_args, **method_kwargs)
    137         else:
    138             for mp in self:
--> 139                 method_return = method(mp, *method_args, **method_kwargs)
    140                 if not isinstance(method_return, MetaPartition):
    141                     raise ValueError(

~/venv/lib/python3.6/site-packages/kartothek/io_components/metapartition.py in load_dataframes(self, store, tables, columns, predicate_pushdown_to_io, categoricals, dates_as_object, predicates)
    668                 ValueError(
    669                     "You are trying to read columns from invalid table(s): {}".format(
--> 670                         set(columns).difference(self.tables)
    671                     )
    672                 )

ValueError: You are trying to read columns from invalid table(s): {'A'}
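The error comes from a membership check in `MetaPartition.load_dataframes`: the keys of `columns` are compared against the dataset's table names, so a flat list of column names is interpreted as a set of table names. A minimal sketch of that check (table name `"table"` is assumed here as kartothek's default single-table name):

```python
# Paraphrased version of the check in MetaPartition.load_dataframes:
# keys of `columns` must be known table names.
tables = {"table"}   # assumed default table name of the dataset
columns = ["A"]      # the flat list passed in the example above

# Iterating a list yields column names, which are then treated as tables.
invalid = set(columns).difference(tables)
print(invalid)  # {'A'} -- the set reported in the ValueError
```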

Used versions

pip list
Package            Version      Location
------------------ ------------ --------
attrs              19.3.0
backcall           0.1.0
bleach             3.1.1
cffi               1.14.0
click              7.1.2
dask               2.21.0
decorator          4.4.1
defusedxml         0.6.0
entrypoints        0.3
fsspec             0.7.4
importlib-metadata 1.3.0
ipdb               0.11
ipykernel          5.1.3
ipython            7.9.0
ipython-genutils   0.2.0
ipywidgets         7.5.1
jedi               0.15.2
Jinja2             2.11.1
jsonschema         3.2.0
jupyter            1.0.0
jupyter-client     5.3.4
jupyter-console    6.0.0
jupyter-core       4.6.1
kartothek          3.13.0       /src
locket             0.2.0
MarkupSafe         1.1.1
milksnake          0.1.5
mistune            0.8.4
msgpack            1.0.0
nbconvert          5.6.1
nbformat           4.4.0
notebook           6.0.2
numpy              1.18.1
pandas             1.0.1
pandocfilters      1.4.2
parso              0.6.1
partd              1.1.0
pexpect            4.8.0
pickleshare        0.7.5
pip                20.0.2
prometheus-client  0.7.1
prompt-toolkit     2.0.10
ptyprocess         0.6.0
pyarrow            0.17.1.post2
pycparser          2.20
Pygments           2.5.2
pyrsistent         0.15.5
python-dateutil    2.8.1
pytz               2019.3
PyYAML             5.3.1
pyzmq              18.1.1
qtconsole          4.5.5
Send2Trash         1.5.0
setuptools         45.2.0
simplejson         3.17.0
simplekv           0.14.1
six                1.14.0
storefact          0.10.0
terminado          0.8.3
testpath           0.4.4
toolz              0.10.0
tornado            6.0.3
traitlets          4.3.3
uritools           3.0.0
urlquote           1.1.4
wcwidth            0.1.8
webencodings       0.5.1
wheel              0.34.2
widgetsnbextension 3.5.1
zipp               3.0.0
zstandard          0.14.0

@marco-neumann-by
Contributor

columns is a mapping from table to column set (aka Dict[str, Iterable[str]]). I agree that the error message is not nice and that we could probably automatically convert ["a"] to {"table": ["a"]}, as we do this in other places as well.
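The conversion suggested above could look roughly like the following sketch. The function name, the default table name `"table"`, and where such a helper would live inside kartothek are all assumptions for illustration, not kartothek's actual API:

```python
from typing import Dict, Iterable, Optional, Union

def normalize_columns(
    columns: Optional[Union[Iterable[str], Dict[str, Iterable[str]]]],
    default_table: str = "table",  # assumed default single-table name
) -> Optional[Dict[str, Iterable[str]]]:
    """Wrap a flat column list into the table-to-columns mapping form.

    Hypothetical helper sketching the proposed normalization; dicts and
    None pass through unchanged, lists are keyed by the default table.
    """
    if columns is None or isinstance(columns, dict):
        return columns
    return {default_table: list(columns)}

print(normalize_columns(["A"]))  # {'table': ['A']}
```

With a normalization like this applied early (e.g. alongside the existing `normalize_args` decorator), the failing call in the report, `columns=["A"]`, would behave like `columns={"table": ["A"]}`.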

@lr4d
Collaborator

lr4d commented Aug 12, 2020

Given all these multi-table usability issues, wouldn't it be less work to just make a breaking release and drop the multi-table support?

@marco-neumann-by
Contributor

I'm very much in favor of removing that feature, but we have to flesh out what this means. Likely a metadata version bump, which would clearly require migration pipelines.

@lr4d lr4d added usability Interface is unclear or inconvenient good first issue Good for newcomers labels Sep 3, 2020