Automatically remove non-working videos resources #397

Open
benoit74 opened this issue Sep 19, 2024 · 0 comments
@benoit74
Collaborator

I'm not sure it is feasible, but we know that many videos currently do not work. At least the first few seconds of each video are nevertheless crawled and pushed into the ZIM, even though the video is not usable in the end.

We hence end up with big ZIMs for nothing.

Manual inspection of the ZIM can help identify URLs which must be ignored, and openzim/zimit#353 can help to exclude these resources from the crawl / WARC.

It would be super cool if we could do it in a more automated way inside warc2zim.
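To make the idea more concrete, here is a minimal sketch of one possible heuristic. It is not something warc2zim does today. It assumes a non-working video shows up in the WARC as a `206 Partial Content` response whose `Content-Range` covers only part of the file. The function name `is_incomplete_video` and its signature are hypothetical, and it works on a plain status code and header mapping rather than on any real warc2zim or warcio object.

```python
import re
from typing import Mapping

# Matches e.g. "bytes 0-999/50000"; the total may be "*" when unknown.
_CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def is_incomplete_video(status: int, headers: Mapping[str, str]) -> bool:
    """Hypothetical check: does this response look like a partial video?

    Returns True only when we are reasonably sure the payload is a video
    and the captured byte range does not cover the whole resource.
    """
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    if not lowered.get("content-type", "").lower().startswith("video/"):
        return False
    if status != 206:
        # A plain 200 response is assumed to carry the full payload.
        return False

    match = _CONTENT_RANGE.match(lowered.get("content-range", "").strip())
    if match is None:
        # Missing or unparseable range: stay conservative, do not flag.
        return False

    start, end, total = match.groups()
    if total == "*":
        # Total size unknown, so completeness cannot be decided here.
        return False
    return not (int(start) == 0 and int(end) + 1 == int(total))
```

One known limitation of this sketch: a player may fetch a video as several range requests that together cover the whole file, so a real implementation would probably need to aggregate ranges per URL before deciding to drop anything.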

This is in some ways a companion issue to #396.

@benoit74 benoit74 added the enhancement New feature or request label Sep 19, 2024
@benoit74 benoit74 added this to the backlog milestone Sep 19, 2024