I'm not sure this is feasible, but we know that many videos currently do not work. At least the first few seconds of each video are nevertheless crawled and pushed into the ZIM, even though they are not usable in the end.
We hence end up with big ZIMs for nothing.
Manual inspection of the ZIM can help identify URLs that must be ignored, and openzim/zimit#353 can help exclude these resources from the crawl / WARC.
It would be super cool if we could do this in a more automated way inside warc2zim.
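To make the idea a bit more concrete, here is a rough sketch of one possible heuristic, not something warc2zim does today. It assumes that unusable videos tend to show up in the WARC as `206 Partial Content` responses whose `Content-Range` does not cover the whole file. `ResponseInfo`, `is_truncated_video` and `urls_to_exclude` are hypothetical names made up for illustration; they are not part of warc2zim or warcio.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class ResponseInfo:
    """Hypothetical, simplified view of one WARC response record."""
    url: str
    status: int
    content_type: str
    content_range: Optional[str] = None  # raw Content-Range header, if present


# Matches e.g. "bytes 0-999/5000"; an unknown total ("*") is deliberately not matched.
_RANGE_RE = re.compile(r"^bytes (\d+)-(\d+)/(\d+)$")


def is_truncated_video(rec: ResponseInfo) -> bool:
    """Assumed heuristic: a video served as 206 whose range misses part of the file."""
    if not rec.content_type.lower().startswith("video/"):
        return False
    if rec.status != 206 or not rec.content_range:
        return False
    match = _RANGE_RE.match(rec.content_range.strip())
    if match is None:
        # Unparseable or unknown total size: keep the record rather than guess.
        return False
    start, end, total = (int(g) for g in match.groups())
    return not (start == 0 and end == total - 1)


def urls_to_exclude(records: Iterable[ResponseInfo]) -> List[str]:
    """Return the URLs of records flagged by the heuristic, in input order."""
    return [rec.url for rec in records if is_truncated_video(rec)]
```

Note that this looks at each record in isolation: several partial responses for the same URL could together cover the full file, and a real implementation would have to account for that before dropping anything.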
This is something of a companion issue to #396.