Add WACZ support #770

Draft · wants to merge 14 commits into master

Conversation

@machawk1 machawk1 (Member) commented May 24, 2022

  • Clean up the directories created for WARC files extracted from a WACZ upon completion of indexing. The files are deleted but the directories remain, and the docs say that the creator is responsible for removing them.

See the additional TODOs below before this PR can be merged.

@machawk1 machawk1 marked this pull request as draft May 24, 2022 20:03
@machawk1 machawk1 requested a review from ibnesayeed May 24, 2022 20:03
@machawk1 (Member Author)

Pending a test passing and probably some better documentation, but I would appreciate your feedback on the diff here, @ibnesayeed.

@machawk1 machawk1 marked this pull request as ready for review May 25, 2022 20:15
@ibnesayeed ibnesayeed (Member) left a comment


I would prefer we handle it by externally extracting WACZ files and then indexing those WARCs the usual way until we have a built-in WARC record iterator in py-wacz (an idea that I discussed with @ikreymer already), at which point this change will become obsolete.

That said, I do not have any strong objections against this PR. I have added some inline comments for potential improvements though.

Resolved inline comment threads: ipwb/util.py (outdated), ipwb/util.py, ipwb/indexer.py
@machawk1 machawk1 requested a review from ibnesayeed June 2, 2022 21:04
@@ -171,6 +186,8 @@ def index_file_at(warc_paths, encryption_key=None,
    cdxj_metadata_lines = generate_cdxj_metadata(cdxj_lines)
    cdxj_lines = cdxj_metadata_lines + cdxj_lines

    cleanup_warc_files_extracted_from_wacz(warc_paths_to_append)
Member


Not a big issue, but I think the temporary folders created by the mkdtemp() call will continue to exist (until cleaned up by the OS) because only the files inside them are deleted, not the folders themselves.

Member Author


The docs say that the creator is responsible for the deletion, so I think we should handle this. Given that each WARC gets a new temp directory, it might be better to retain a copy of that directory path and delete it along with its contents, rather than deleting the WARC and then the directory, which would require tracking the directory path anyway. Which approach would you rather have implemented, @ibnesayeed?
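A minimal sketch of the directory-level cleanup under discussion, assuming one temporary directory per WACZ; the file name and the indexing step are placeholders, not this PR's actual code:

import shutil
import tempfile
import zipfile

# Extract one WACZ into its own temporary directory, index the extracted
# WARCs, then remove the directory and everything inside it in one call
# instead of deleting the WARC files first and the directory afterwards.
temp_dir = tempfile.mkdtemp()
try:
    with zipfile.ZipFile('example.wacz') as wacz:
        wacz.extractall(temp_dir)
    # ... index the extracted WARCs here ...
finally:
    shutil.rmtree(temp_dir, ignore_errors=True)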

Member


which would require tracking the directory path, too

Not really! It is possible to get the directory's path from the path of a file it contains (sketched below).

That said, I would perhaps prefer not holding onto a list of WARC files and instead operating on each WARC as we discover it, whether it is a regular WARC file or one extracted from a WACZ file. I would deal with one file at a time and then loop on to the next one.
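A small sketch of the point above, using a hypothetical path; not this PR's actual cleanup code:

import os
import shutil

# Illustrative only: given the path of a WARC extracted from a WACZ, its
# containing temporary directory is recoverable, so no separate list of
# directories has to be tracked for cleanup.
extracted_warc = '/tmp/tmp_xyz/example.warc.gz'  # hypothetical path
temp_dir = os.path.dirname(extracted_warc)
shutil.rmtree(temp_dir, ignore_errors=True)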

Member Author


@ibnesayeed This seems like it would require a revamp outside the scope of this GitHub issue/PR. I agree that dealing with one WARC at a time would likely be more efficient.

Member


I understand that it would require a change in the workflow. When done, it would be more space efficient, as not all the WARCs would need to be extracted from WACZ files upfront, duplicating them on disk, before processing them.

It is okay to leave things as they are right now and get back to this when we have a WARC record iterator for WACZ files, at which point most of these changes will be rendered unnecessary.

@ikreymer ikreymer commented Jun 6, 2022

While it's nice to see more support for WACZ, I am wondering how you see this implementation supporting the WACZ format.

It seems like this PR is simply treating WACZ as just a ZIP containing some WARC files.
If that is the case, why not just add documentation for users to import a WACZ file by using:

unzip somefile.wacz
ipwb index archive/*.warc.gz

But a WACZ file is more than just a ZIP of WARC files! More importantly, it has a built-in CDXJ in a specific format, which allows for fast access and does not require reindexing of the WARCs. It also has metadata, fixity information (payload and full record digests), and an optional signature. There is a specific structure, extending the Frictionless Data package format, which is validated by py-wacz. (Of course, this is all still being developed and improved.)
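For reference, a rough sketch of that structure as seen through Python's zipfile; the entry names in the comments are typical of WACZ packages rather than an exhaustive or normative listing:

import zipfile

# A WACZ is a ZIP with a defined layout; a typical package contains roughly:
#   archive/*.warc.gz        - the WARC files themselves
#   indexes/*                - the built-in CDXJ index (possibly compressed)
#   pages/pages.jsonl        - page-level metadata
#   datapackage.json         - Frictionless Data package descriptor
#   datapackage-digest.json  - digests (and optional signature information)
with zipfile.ZipFile('example.wacz') as wacz:
    for name in wacz.namelist():
        print(name)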

But if everything but the WARCs are being discarded, I wonder if that should be considered 'supporting WACZ' or just 'importing WARCs from WACZ file'.

Can you see a way for ipwb to support more of the WACZ format, such as reusing the existing CDXJ indexes perhaps?

In some ways, the current approach is a lossy transformation: WACZ files can already be hosted and used directly from IPFS. By importing into ipwb, you're taking the WARCs from the WACZ, reindexing them, and putting just the WARCs onto IPFS in a slightly different way, while creating a new index that is not stored on IPFS, and discarding other data (metadata, signatures, etc...). I wonder if ipwb could do something else to better leverage the properties of the WACZ format?

@machawk1 machawk1 (Member Author) commented Jun 6, 2022

@ikreymer This initial implementation is base support for WACZ. We are hoping to do more with the format in the future beyond treating it as a container for WARCs.

Reusing the CDXJ in the WACZ for ipwb's own generated CDXJ index could be interesting. We have not thought about it much yet.

@ibnesayeed likely has some ideas / 2¢. Thanks for your input!

@ibnesayeed ibnesayeed (Member) commented Jun 7, 2022

It seems like this PR is simply treating WACZ as just a ZIP containing some WARC files.
If that is the case, why not just add documentation for users to import a WACZ file by using:

@ikreymer that would have been my preference as well (as noted in an earlier comment, #770 (review)), but @machawk1 wanted a more built-in approach, hence this PR.

But if everything but the WARCs are being discarded, I wonder if that should be considered 'supporting WACZ' or just 'importing WARCs from WACZ file'.

In that sense, IPWB does not even support WARC files, let alone WACZ files. During the indexing and ingestion process, it iterates over each WARC record and splits the headers and payload to store them as small IPFS objects to be used at the time of replay. This is why we are interested in an iterator over the WARC records and not the container (or the container of the container) format.
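A rough sketch of that record-level split using warcio; this is illustrative of the granularity described, not ipwb's actual ingestion code, and the IPFS step is left as a placeholder:

from warcio.archiveiterator import ArchiveIterator

# Iterate over individual WARC records and split each response into its
# HTTP header block and payload, roughly the unit ipwb stores in IPFS.
with open('example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != 'response':
            continue
        headers = str(record.http_headers)         # header block as text
        payload = record.content_stream().read()   # body bytes
        # ... add headers and payload to IPFS as separate objects ...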

Can you see a way for ipwb to support more of the WACZ format, such as reusing the existing CDXJ indexes perhaps?

We can leverage the CDXJ file packaged inside the WACZ file, but not for replay, only as a means to iterate over WARC records during indexing/ingestion. However, for large WACZ files this approach might be slow, because the reader pointer will be jumping around to seek to different locations if the CDXJ file is not sorted in the order in which the records are stored. Moreover, if the built-in CDXJ excludes certain WARC record types (which might not be the case, but is worth mentioning here), then those records will not be reachable.
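A sketch of what iterating via the packaged CDXJ might look like once the referenced WARC has been extracted; the line content, field names (offset, length, filename), and values are made up for illustration, and the real details would depend on how the index is exposed:

import json

from warcio.archiveiterator import ArchiveIterator

# One CDXJ line: "<SURT URI> <timestamp> {JSON block}"
cdxj_line = ('com,example)/ 20220524000000 '
             '{"offset": 1024, "length": 2048, "filename": "data.warc.gz"}')
surt, timestamp, block = cdxj_line.split(' ', 2)
meta = json.loads(block)

# Seek directly to the referenced record instead of scanning the whole WARC.
with open(meta['filename'], 'rb') as warc:
    warc.seek(int(meta['offset']))
    record = next(iter(ArchiveIterator(warc)))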

In some ways, the current approach is a lossy transformation: WACZ files can already be hosted and used directly from IPFS. By importing into ipwb, you're taking the WARCs from the WACZ, reindexing them, and putting just the WARCs onto IPFS in a slightly different way, while creating a new index that is not stored on IPFS, and discarding other data (metadata, signatures, etc...). I wonder if ipwb could do something else to better leverage the properties of the WACZ format?

I am not sure I would call it lossy, because the goal of IPWB is different from that of some other web archiving systems. It operates at the atomic level of records while gluing related pieces together for replay. IPWB was not modeled to interact with WARC/WACZ or any other container/bundle format at the time of replay. For long-term preservation, one is encouraged to retain WARC/WACZ files with as much provenance and metadata as possible. IPWB can be improved to incorporate more metadata, followed by re-ingesting WARC files from cold storage.

@machawk1 machawk1 requested a review from ibnesayeed June 27, 2022 21:47
@machawk1 (Member Author)

@ibnesayeed I would like your re-review here, as I added some logic to retain the temp paths and remove them as required so as not to have side effects. Inferring these paths might be unreliable, as the WARCs extracted from a WACZ are stored at {unique_temp_path}/archive/warcname.warc.gz rather than simply {unique_temp_path}/warcname.warc.gz.

If you would rather we change this to the latter scenario, which should not cause a clash issue due to the unique paths, let me know and I can clean up the logic.

Otherwise, the changes here meet the requirements of #710. The latter scenario can always come in with a later refactoring.

@machawk1 (Member Author)

Another caveat is whether subdirectories beyond /archive are legal for WARC storage in a WACZ. If they are, any removal of that structure when the WARCs are written to disk could introduce a filename clash.

I would have to check the WACZ spec to see if additional organization like this is legal. If not, the cleaner way would be to remove the /archive prefix when the WARC is extracted to disk.
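If flattening does turn out to be legal and acceptable, a sketch of extracting only the WARC members while stripping the archive/ prefix; the file name is an assumption and this is not the PR's actual helper:

import os
import tempfile
import zipfile

temp_dir = tempfile.mkdtemp()
with zipfile.ZipFile('example.wacz') as wacz:
    for member in wacz.namelist():
        if member.startswith('archive/') and member.endswith(('.warc', '.warc.gz')):
            # basename() flattens any subdirectory structure, which assumes
            # no two WARCs in the WACZ share the same file name.
            target = os.path.join(temp_dir, os.path.basename(member))
            with wacz.open(member) as src, open(target, 'wb') as dst:
                dst.write(src.read())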

@ibnesayeed ibnesayeed (Member) left a comment


I feel like we have some unnecessary code because we are doing things in a complicated way. For example, we maintain a list of WARC files extracted from WACZs to be removed later, and we also maintain a list of temporary directories to be removed, even though removing the latter will automatically remove their contents.

I am not sure merging this PR would be a good idea, as it brings a lot of complexity that will be difficult to maintain and reason about, and at some point it might become useless (if and when we have a WARC record iterator built into the WACZ library).

wacz_paths = []
for warc_path in warc_paths:
    if is_wacz(warc_path):
        (w_paths, dirs_to_cleanup) = extract_warcs_from_wacz(warc_path)
Member


The dirs_to_cleanup here is overwritten in each loop iteration, so at the end it will only hold a reference to the temporary dirs of the last WACZ file for cleanup (unless I am missing something).

Member Author


Good catch on dirs_to_cleanup not being retained.

Member Author


I think the logic around extracted WARCs and temp directories can be simplified; there is a bit too much disk maintenance. The general gist is that paths to WACZ files can be intermingled with WARC paths here, and thus the WARCs extracted from the WACZ files need to be removed, but not the WARCs that were passed in.
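One possible simplification along these lines, sketched with a context-managed temporary directory so extracted copies disappear automatically; is_wacz() is the helper from this PR, while index_warc() is a hypothetical stand-in for the per-WARC indexing step:

import tempfile
import zipfile
from pathlib import Path

def index_paths(paths):
    # Plain WARC paths are indexed in place; each WACZ gets its own
    # temporary directory whose lifetime is tied to the with-block, so no
    # separate lists of extracted files or directories need to be kept.
    for path in paths:
        if is_wacz(path):  # helper introduced in this PR
            with tempfile.TemporaryDirectory() as temp_dir:
                with zipfile.ZipFile(path) as wacz:
                    wacz.extractall(temp_dir)
                for warc in sorted(Path(temp_dir).glob('archive/*.warc*')):
                    index_warc(str(warc))  # hypothetical per-WARC step
        else:
            index_warc(path)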

@machawk1 machawk1 (Member Author) commented Feb 28, 2023

  • Change the process to handle one WARC at a time (extraction, ingestion, ipwb magic, deletion) before moving on to the next WARC in the WACZ.

@machawk1 machawk1 marked this pull request as draft February 28, 2023 20:28
@machawk1 machawk1 linked an issue Apr 10, 2023 that may be closed by this pull request
Successfully merging this pull request may close these issues: Support indexing WACZ files.

3 participants