Intelligent crawling by resource types to respect maximum file size #349
Due to the limitations on creating ZIM files with Zimit and the seemingly random order in which web pages are downloaded, could an advanced option be added to exclude certain types of files, e.g. .mp4, .avi, .jpg?

I have found that on some larger websites, videos are downloaded before Zimit has grabbed all of the HTML pages and the size limit is reached, leaving a lot of broken links while including some unnecessary videos.

Alternatively, an option to grab the HTML files first would be helpful, followed by pictures and then videos if there is still space remaining.

Comments
It's beneficial to know how the tool works in detail. I hope the readme.md can be more detailed.
@Abel-Trans We really want to improve the documentation. If you have questions, please open issues, one issue per question. Based on these issues, we will update the documentation.
There are really two requests in one here. Excluding filetypes is easy with existing options. The second request, about fetching resources in priority order, is interesting.
@rgaudin Thanks for your reply; that's almost correct. Videos linked to already-downloaded HTML pages were included, but before all of the other HTML pages had been downloaded, leaving some fully functional pages with videos and a lot of dead HTML links to other pages on the website. The aim is to start with the most compact form of information: text (in this case, HTML files). From there, the crawl would fill the rest of the remaining data limit with less dense information: pictures next, and then videos last if there is still space remaining, if that makes sense. May I have an example of an exclusion?
It is possible in browsertrix crawler to exclude page resources with the `--blockRules` option.

If I understand you correctly, what you would like is to consider how to best allocate a given archive size, so that we ensure we have at least all HTML/JS/CSS, then if possible all images, and then if possible all videos.

This is an extremely new and complex feature; I'm pretty sure it is not something which will be possible to work out in the coming months without significant funding, as it is far more than scraper maintenance alone. It is not an easy one (but still meaningful).

I propose to keep this issue focused on this main concern; I've moved the request that zimit support this option to a separate issue.
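For illustration only: a resource-blocking invocation of browsertrix-crawler itself (zimit did not expose this option at the time of this thread) might look like `crawl --url https://example.com --blockRules '[{"url": "\\.(mp4|avi)$"}]'`. The rule pattern and its JSON shape here are assumptions; the exact syntax should be checked against the browsertrix-crawler README.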
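To make the allocation idea concrete, here is a minimal sketch of a size-budgeted, tier-prioritized selection pass. It is not how zimit or browsertrix-crawler work today; all names, and the two-pass design (discover resource sizes first, then select), are assumptions for illustration.

```python
# Minimal sketch of tiered, size-budgeted selection. All names are
# hypothetical; zimit/browsertrix-crawler do not work this way today.

# Priority tiers: lower number = archived first.
TIERS = {
    "html": 0, "css": 0, "js": 0,  # text: densest information per byte
    "image": 1,                    # images next
    "video": 2,                    # videos only if space remains
}

def select_within_budget(resources, budget_bytes):
    """resources: iterable of (url, kind, size_bytes) tuples, e.g. from a
    first metadata pass using HEAD requests. Returns the URLs to archive."""
    selected, used = [], 0
    # Walk whole tiers in order, smallest items first within a tier, and
    # stop at the first resource that no longer fits, so nothing from a
    # lower tier is archived while a higher-tier item is still missing.
    for url, kind, size in sorted(resources, key=lambda r: (TIERS[r[1]], r[2])):
        if used + size > budget_bytes:
            break
        selected.append(url)
        used += size
    return selected

resources = [
    ("https://example.com/clip.mp4", "video", 50_000_000),
    ("https://example.com/index.html", "html", 20_000),
    ("https://example.com/photo.jpg", "image", 800_000),
]
print(select_within_budget(resources, budget_bytes=1_000_000))
# ['https://example.com/index.html', 'https://example.com/photo.jpg']
```

In a real crawl, resources are discovered incrementally rather than known up front, so an actual implementation would more likely use a priority queue keyed on (tier, size) and defer lower-tier fetches until the higher tiers are exhausted.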