Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

keys/items yields two diferent results between v2 and v3 #2757

Open
ilan-gold opened this issue Jan 24, 2025 · 6 comments · May be fixed by #2758
Open

keys/items yields two diferent results between v2 and v3 #2757

ilan-gold opened this issue Jan 24, 2025 · 6 comments · May be fixed by #2758
Labels
bug Potential issues with the zarr-python library

Comments

@ilan-gold
Copy link
Contributor

Zarr version

0c154c3

Numcodecs version

0.15.0

Python Version

3.12

Operating System

Mac ARM

Installation

uv

Description

I have an admittedly old zarr zip store where the keys from one of the subgroups appears to be yielding itself in zarr v3 (erroneously as it errors out on access) but correctly in zarr v2. I can't reproduce this just by creating groups in v3 or v2, with either the current main branch or the final v2 commit.

It's possible the file is corrupted or something, tough to say since it is a zip store. The same problem does not happen when I unzip the store and either visually examine its contents or when I actually open the store using zarr.open and then run .keys()

Steps to reproduce

adata.zarr.zip

For the above file one should run in v2:

import zarr
g = zarr.open('adata.zarr.zip', mode="r")
list(g["obsm"].keys()) # ['array', 'df', 'sparse']
g["obsm"]["obsm"] # not there!

and then

import zarr
from zarr.storage import ZipStore
g = zarr.open(ZipStore('/Users/ilangold/Projects/Theis/anndata/tests/data/archives/v0.7.0/adata.zarr.zip'), mode="r")
list(g["obsm"].keys()) # ['array', 'df', 'sparse', 'obsm']
g["obsm"]["obsm"] # also not there!

Additional output

No response

@ilan-gold ilan-gold added the bug Potential issues with the zarr-python library label Jan 24, 2025
@d-v-b
Copy link
Contributor

d-v-b commented Jan 24, 2025

can you list the contents of the zip store, as zarr sees it? there's an async method on the ZipStore class that should do this (I think if you call list_prefix('') it should list everything), and you can call sync around it to invoke that method from non-async code. It wouldn't surprise me if this is due to a directory listing bug

@ilan-gold
Copy link
Contributor Author

ilan-gold commented Jan 24, 2025

[item async for item in store.list_prefix("")]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[16], line 1
----> 1 [item async for item in store.list_prefix("")]

File ~/Projects/Theis/zarr-python/src/zarr/storage/_zip.py:256, in ZipStore.list_prefix(self, prefix)
    254 async def list_prefix(self, prefix: str) -> AsyncIterator[str]:
    255     # docstring inherited
--> 256     async for key in self.list():
    257         if key.startswith(prefix):
    258             yield key

File ~/Projects/Theis/zarr-python/src/zarr/storage/_zip.py:250, in ZipStore.list(self)
    248 async def list(self) -> AsyncIterator[str]:
    249     # docstring inherited
--> 250     with self._lock:
    251         for key in self._zf.namelist():
    252             yield key

AttributeError: 'ZipStore' object has no attribute '_lock'

This is from an ipython shell. Maybe I should try a lil test script?

@d-v-b
Copy link
Contributor

d-v-b commented Jan 24, 2025

zipstore is a little special, you might need to invoke a special method to completely create one (yes, this is cursed)

@ilan-gold
Copy link
Contributor Author

Just from obsm (there are many subgroups), it does not list itself obsm as a subgroup, as keys does:

'obsm/',
 'obsm/df/',
 'obsm/df/cat/',
 'obsm/df/cat/.zarray',
 'obsm/df/cat/.zattrs',
 'obsm/df/cat/0',
 'obsm/df/cat_ordered/',
 'obsm/df/cat_ordered/.zarray',
 'obsm/df/cat_ordered/.zattrs',
 'obsm/df/cat_ordered/0',
 'obsm/df/uint8/',
 'obsm/df/uint8/.zarray',
 'obsm/df/uint8/0',
 'obsm/df/.zattrs',
 'obsm/df/.zgroup',
 'obsm/df/float64/',
 'obsm/df/float64/.zarray',
 'obsm/df/float64/0',
 'obsm/df/__categories/',
 'obsm/df/__categories/cat/',
 'obsm/df/__categories/cat/.zarray',
 'obsm/df/__categories/cat/.zattrs',
 'obsm/df/__categories/cat/0',
 'obsm/df/__categories/cat_ordered/',
 'obsm/df/__categories/cat_ordered/.zarray',
 'obsm/df/__categories/cat_ordered/.zattrs',
 'obsm/df/__categories/cat_ordered/0',
 'obsm/df/__categories/.zgroup',
 'obsm/df/int64/',
 'obsm/df/int64/.zarray',
 'obsm/df/int64/0',
 'obsm/df/_index/',
 'obsm/df/_index/.zarray',
 'obsm/df/_index/0',
 'obsm/array/',
 'obsm/array/.zarray',
 'obsm/array/0.0',
 'obsm/.zgroup',
 'obsm/sparse/',
 'obsm/sparse/indices/',
 'obsm/sparse/indices/.zarray',
 'obsm/sparse/indices/0',
 'obsm/sparse/.zattrs',
 'obsm/sparse/.zgroup',
 'obsm/sparse/indptr/',
 'obsm/sparse/indptr/.zarray',
 'obsm/sparse/indptr/0',
 'obsm/sparse/data/',
 'obsm/sparse/data/.zarray',
 'obsm/sparse/data/0',

@d-v-b
Copy link
Contributor

d-v-b commented Jan 24, 2025

so the stuff coming out of keys comes from members, which wraps some recursive functions that might not be the easiest to follow. I think the relevant line is here:

keys = [key async for key in node.store.list_dir(node.path)]

that's where the to-be-recursed function lists the keys it can see at a given level of the hierarchy. this method, and the line I linked, are both async generators, which are kind of annoying to invoke in non-async code. you will need another helper function (_collect_aiterator, sorry for the awful name), which you can also bring in from sync.

e.g., to replicate that keys line yourself in non-async code, you would run sync(_collect_aiterator(group.store.list_dir(group.store.path))). See if obsm appears there when it shouldnt

@ilan-gold
Copy link
Contributor Author

@d-v-b For me, this old zip store is producing "groups" as keys from g.store._zf.namelist() i.e., obsm is returned as a key from namelist even though it is only a "directory" within the zip. It's possible this file was created in some way that yielded this irregularity. I'm not entirely sure. It seems clear from the docs that namelist should return only the file names.

That being said, there is a check for this in the ZipStore that is wrong because these directory names end up containing a trailing slash that is not accounted for. I will open a PR and if you want, I can add this test file in (although it might be worth trying to create one from scratch even if I haven't yet figured out how).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Potential issues with the zarr-python library
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants