
Intelligent crawling by resource types to respect maximum file size #349

Open

fluidicice opened this issue Jul 20, 2024 · 5 comments

@fluidicice

Due to the limitations on creating ZIM files with Zimit and the seemingly random order in which web pages are downloaded, could an advanced option be added to exclude certain types of files? E.g. .mp4, .avi, .jpg

I have found that on some larger websites, videos are downloaded before Zimit has grabbed all of the HTML pages and before the limit(s) are reached, leaving a lot of broken links while still including some unnecessary videos.

Alternatively, a tickbox to start grabbing the HTML files first would be helpful, followed by pictures and then videos if there's still space remaining.

@Abel-Trans

It would be beneficial to know how the tool works in detail. I hope the README.md can be made more detailed.

@kelson42
Contributor

@Abel-Trans We really want to improve the documentation. If you have questions, please open issues, one issue per question. Based on these issues, we will update the documentation.

@rgaudin
Member

rgaudin commented Jul 20, 2024

There are really two requests in one here. Excluding by file type is already easy with the existing --exclude option using path extensions. Excluding by detected mimetype could be an addition.
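For illustration only, a minimal sketch of the kind of invocation meant here (the URL, ZIM name, and regex are placeholders, not from this thread; --exclude takes a regular expression matched against page URLs):

    zimit --url https://example.com/ --name example \
          --exclude '.*\.(mp4|avi|jpe?g)(\?.*)?$'

As discussed further down, this only prevents matching URLs from being crawled as pages; it does not block resources embedded inside a page.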

The second request, about how resources are fetched, is interesting.
I think we need the URL and limit details, because in my understanding resources are discovered in a page at parsing time, just before the page is added to the WARC. So having a page's resources inside the WARC while the page that references them is not in the WARC seems unlikely.
The opposite is more likely though: the HTML is included, but not all of its resources are added because the limit has been reached. This could be an option. Is that what you were describing?

@fluidicice
Author

fluidicice commented Jul 20, 2024

@rgaudin Thanks for your reply, that's almost correct: videos linked from already-downloaded HTML pages were included before all the other HTML pages had been downloaded, leaving some fully functional pages with videos and a lot of dead HTML links to other pages on the website.

The aim is to start with the most compact form of information: text (in this case HTML files). From there, it would fill the rest of the remaining data limit with less dense information: pictures next, and then videos last if there is still space remaining, if that makes sense.

May I have an example of excluding a file type with --exclude, please? I've read the manual and couldn't work it out.

@benoit74
Collaborator

The --exclude parameter will only exclude pages, not resources inside a given page. So, for instance, if a page embeds an mp4 player, the mp4 will still be fetched. This is probably why you haven't been able to make it work.

It is possible in Browsertrix Crawler to exclude page resources with the --blockRules parameter (see https://crawler.docs.browsertrix.com/user-guide/crawl-scope/#scope-rule-examples), but this is not yet available in zimit. Note that these resources would then be completely ignored rather than deprioritized.
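As a rough sketch of what such block rules look like when running Browsertrix Crawler directly (the file name is hypothetical, and the exact rule fields should be checked against the linked documentation):

    # crawl-config.yaml (hypothetical name), passed to the crawler via --config
    blockRules:
      # block any video resource whose URL matches this regex
      - url: '\.(mp4|avi)(\?.*)?$'

Each rule's url field is a regular expression matched against resource URLs; matching requests are blocked during the crawl, so they never end up in the archive.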

If I understand you correctly, what you would like is a way to best allocate a given archive size, so that we ensure we have at least all HTML/JS/CSS, then all images if possible, and then all videos if possible. This is an entirely new and complex feature; I'm pretty sure it is not something which will be possible to work out in the coming months without significant funding, as it is far more than scraper maintenance and not an easy one (but still meaningful).

I propose to keep this issue focused on this main concern; I've moved the request that zimit should support --blockRules to #353.

@benoit74 benoit74 changed the title Exclude Certain File-types - Enhancement Intelligent crawling by resource types to respect maximum file size Jul 22, 2024
@benoit74 benoit74 added this to the later milestone Jul 22, 2024
@benoit74 benoit74 removed their assignment Jul 22, 2024