Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(fix): ensure zip directory store compares key to prefix correctly #2758

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

ilan-gold
Copy link
Contributor

@ilan-gold ilan-gold commented Jan 24, 2025

Fixes #2757

Since there was code there previously, I presume it was somehow possible to hit the condition although I haven't figured out how (outside of the reproducer file that I have).

Here is how I would in theory go about it:

import zarr
from zarr.storage import ZipStore

g = zarr.open(ZipStore('foo.zip', mode="w"), zarr_format=2)
g.require_group("bar")
g["bar"].require_group("faz")
list(g["bar"].keys())

But it only yields the subkey [faz] instead of ['bar', 'faz'] as in the linked issue/file.

UPDATE: I think the file was created by simply zipping an old zarr store, but I can't be certain. In any case I checked using unzip -v /Users/ilangold/Projects/Theis/anndata/tests/data/archives/v0.7.0/adata.zarr.zip and saw that folders are indeed listed there so I do think this is a real possibility.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@@ -271,7 +271,7 @@ async def list_dir(self, prefix: str) -> AsyncIterator[str]:
yield key
else:
for key in keys:
if key.startswith(prefix + "/") and key != prefix:
if key.startswith(prefix + "/") and key.strip("/") != prefix:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a TON of this kind of thing already in this codebase. We need these things defined in the IO layer generally, not in individual implementations, as they all face similar problems.

Similarly, do we actually need a ZIP implementation when fsspec can do this for you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there have been a LOT of these string parsing bugs. it's a weak point in the codebase, and we definitely need something more solid!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fsspec knows how you feel.

We can come up with a small number of string normalising functions that live on the Store baseclass, something like os.path, and normalise all user-passed strings at the first opportunity. This is what fsspec's _strip_protocol attempts, but also has problems.

def norm_path(s):
    # this probably adds not insignificant runtime cost
    return re.sub("/+", "/", s.lstrip("/").rstrip("/"))

Copy link
Contributor Author

@ilan-gold ilan-gold Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a weak point in the codebase

I did wonder how widespread this could be. I noticed this sort of thing in other stores but wasn't sure if there was knowledge that they were kosher (for whatever reason) within the maintainers (and therefore didn't need this sort of fix).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The integration with backend IO libraries (fsspec and probably objstore) is particularly thorny, since they do their own path munging!

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Jan 27, 2025
@ilan-gold ilan-gold marked this pull request as ready for review January 27, 2025 09:56
@d-v-b
Copy link
Contributor

d-v-b commented Jan 27, 2025

can you add a test that would have failed on main, but passes after this fix? Ideally this would be a modification of one of the shared store test cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs release notes Automatically applied to PRs which haven't added release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

keys/items yields two diferent results between v2 and v3
3 participants