
Intelligent crawling by resource types to respect maximum file size #349

Open

fluidicice opened this issue Jul 20, 2024 · 5 comments

@fluidicice

Due to the limitations on creating ZIM files with Zimit and the seemingly random order in which web pages are downloaded, could an advanced option be added to exclude certain types of files? E.g. .mp4, .avi, .jpg

I have found that on some larger websites, videos are downloaded before Zimit has grabbed all of the HTML pages and before the limit(s) are reached, leaving a lot of broken links while still including some unnecessary videos.

Alternatively, a tickbox to start grabbing the HTML files first would be helpful, followed by pictures and then videos if there's still space remaining.

@Abel-Trans

It would be beneficial to know how the tool works in detail. I hope the README.md can be made more detailed.

@kelson42
Contributor

@Abel-Trans We really want to improve the documentation. If you have questions, please open issues, one issue per question. Based on these issues, we will update the documentation.

@rgaudin
Member

rgaudin commented Jul 20, 2024

There are really two requests in one here. Excluding by file type is already easy with the existing --exclude option using path extensions. Excluding by detected mimetype could be an addition.
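For illustration only, a minimal sketch of the kind of invocation meant here (the URL, ZIM name, and regex are placeholders, not from this thread; --exclude takes a regular expression matched against page URLs):

    zimit --url https://example.com/ --name example \
          --exclude '.*\.(mp4|avi|jpe?g)(\?.*)?$'

As discussed further down, this only prevents matching URLs from being crawled as pages; it does not block resources embedded inside a page.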

The second request, about how resources are fetched, is interesting.
I think we need the URL and limit details, because in my understanding resources are discovered in a page at parsing time, just before the page is added to the WARC. So having a page's resources inside the WARC while the page that references them is not in the WARC seems unlikely.
The opposite is more likely though: the HTML is included, but not all of its resources are added because the limit has been reached. This could be an option. Is that what you were describing?

@fluidicice
Author

fluidicice commented Jul 20, 2024

@rgaudin Thanks for your reply, that's almost correct: videos linked from already-downloaded HTML pages were included before all the other HTML pages had been downloaded, leaving some fully functional pages with videos and a lot of dead HTML links to other pages on the website.

The aim is to start with the most compact form of information: text (in this case HTML files). From there, it would fill the rest of the remaining data limit with less dense information: pictures next, and then videos last if there is still space remaining, if that makes sense.

May I have an example of excluding a file type with --exclude, please? I've read the manual and couldn't work it out.

@benoit74
Collaborator

The --exclude parameter will only exclude pages, not resources inside a given page. So, for instance, if a page embeds an mp4 player, the mp4 will still be fetched. This is probably why you haven't been able to make it work.

It is possible in Browsertrix Crawler to exclude page resources with the --blockRules parameter (see https://crawler.docs.browsertrix.com/user-guide/crawl-scope/#scope-rule-examples), but this is not yet available in zimit. Note that these resources would then be completely ignored rather than deprioritized.
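As a rough sketch of what such block rules look like when running Browsertrix Crawler directly (the file name is hypothetical, and the exact rule fields should be checked against the linked documentation):

    # crawl-config.yaml (hypothetical name), passed to the crawler via --config
    blockRules:
      # block any video resource whose URL matches this regex
      - url: '\.(mp4|avi)(\?.*)?$'

Each rule's url field is a regular expression matched against resource URLs; matching requests are blocked during the crawl, so they never end up in the archive.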

If I understand you correctly, what you would like is a way to best allocate a given archive size, so that we ensure we have at least all HTML/JS/CSS, then all images if possible, and then all videos if possible. This is an entirely new and complex feature; I'm pretty sure it is not something which will be possible to work out in the coming months without significant funding, as it is far more than scraper maintenance and not an easy one (but still meaningful).

I propose to keep this issue focused on this main concern; I've moved the request that zimit should support --blockRules to #353.

@benoit74 benoit74 changed the title Exclude Certain File-types - Enhancement Intelligent crawling by resource types to respect maximum file size Jul 22, 2024
@benoit74 benoit74 added this to the later milestone Jul 22, 2024
@benoit74 benoit74 removed their assignment Jul 22, 2024