Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-46776: Add zip creation and ingest #1105

Merged
merged 39 commits into from
Oct 30, 2024
Merged

DM-46776: Add zip creation and ingest #1105

merged 39 commits into from
Oct 30, 2024

Conversation

timj
Copy link
Member

@timj timj commented Oct 22, 2024

Depends on lsst/resources#98

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
  • (if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

Copy link

codecov bot commented Oct 22, 2024

Codecov Report

Attention: Patch coverage is 82.19178% with 104 lines in your changes missing coverage. Please review.

Project coverage is 89.37%. Comparing base (ef8b14b) to head (c510c7f).
Report is 40 commits behind head on main.

Files with missing lines Patch % Lines
...er/datastores/file_datastore/retrieve_artifacts.py 72.50% 27 Missing and 6 partials ⚠️
python/lsst/daf/butler/datastores/fileDatastore.py 83.33% 14 Missing and 8 partials ⚠️
python/lsst/daf/butler/_dataset_ref.py 86.15% 5 Missing and 4 partials ⚠️
...on/lsst/daf/butler/remote_butler/_remote_butler.py 70.96% 6 Missing and 3 partials ⚠️
python/lsst/daf/butler/cli/cmd/commands.py 52.94% 6 Missing and 2 partials ⚠️
...hon/lsst/daf/butler/datastores/chainedDatastore.py 84.61% 3 Missing and 3 partials ⚠️
...on/lsst/daf/butler/direct_butler/_direct_butler.py 90.69% 2 Missing and 2 partials ⚠️
python/lsst/daf/butler/formatters/json.py 33.33% 4 Missing ⚠️
python/lsst/daf/butler/formatters/yaml.py 42.85% 4 Missing ⚠️
python/lsst/daf/butler/script/ingest_zip.py 66.66% 2 Missing ⚠️
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1105      +/-   ##
==========================================
- Coverage   89.46%   89.37%   -0.10%     
==========================================
  Files         361      362       +1     
  Lines       47787    48288     +501     
  Branches     5799     5872      +73     
==========================================
+ Hits        42752    43155     +403     
- Misses       3649     3718      +69     
- Partials     1386     1415      +29     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@andy-slac
Copy link
Contributor

@timj, ticket says that butler zip-from-graph should also be implemented, but I can't see it? I see that QBB adds retrieve_artifacts_zip, but I can't find anything that uses that method.

@timj
Copy link
Member Author

timj commented Oct 24, 2024

@andy-slac it's in the pipe_base PR. lsst/pipe_base#452

Copy link
Contributor

@andy-slac andy-slac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, small bunch of comments/questions. I tried to understand as much as I could, but some pieces are very new to me (or I forgot much of it already).

python/lsst/daf/butler/_butler.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/_butler.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/_formatter.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/_quantum_backed.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/_quantum_backed.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/datastores/chainedDatastore.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/direct_butler/_direct_butler.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/remote_butler/_remote_butler.py Outdated Show resolved Hide resolved
Fpr chained datastore retrieveArtifacts was checking each dataset for file
existence in a loop which can be really slow on GCS. Instead use
knows_these and assume the datastore does have the file if it has
the ref.
Refactor some of the zip index code as part of this.
No longer use list[DatasetRef] but use SerializedDatasetRefContainer
This should be fine since we attempt to make the file names
unique. It also works around an issue in ResourcePath where
a "#" character is treated as part of a directory path if
found in the non-file part of the path.
For now rejects if any refs are rejected.
timj and others added 25 commits October 30, 2024 13:47
…rtifacts

This allows Zip retrieval to work in QBB mode.
This makes it easier to work out which file comes from which
dataset ref and guarantees each file will be unique without
directory path.
This does not fix the problem of retrieve-artifacts retrieving
Zip files and ending up with a bad index.
Co-authored-by: Andy Salnikov <[email protected]>
And remove some duplicated code that was writing the index
a second time.
This allows pydantic to drop them from the JSON.
Which is rare with the new query system but is possible.
There is nothing to be gained by including all the dimensions
other than taking extra space.
This makes it much more explicit what the parameters mean.
The mapping included all the paths so no reason to return
the paths separately.
This is important to ensure that the same prefix is used each
time if the artifact has multiple refs associated with it.
@timj timj merged commit 2b0d39d into main Oct 30, 2024
15 of 16 checks passed
@timj timj deleted the tickets/DM-46776 branch October 30, 2024 21:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants