How to perform incremental scraping of websites? #344

Open
nish2482 opened this issue Jul 19, 2024 · 2 comments

Comments

@nish2482

I find that scraping a MediaWiki site with zimit takes 5-6 hours. Is there a recommended setting for scraping a MediaWiki site with zimit?
To reduce scraping time, how can we do delta scraping with zimit, so that we scrape only the changed web pages and add them to the original ZIM?

nish2482 changed the title from "Question : How to Perform incremental scan sicne scrapping take quite a lot of time and some pages on website do not change" to "How to Perform incremental scrapping of Mediawiki sites ?" on Jul 19, 2024
nish2482 changed the title from "How to Perform incremental scrapping of Mediawiki sites ?" to "How to perform incremental scrapping of Mediawiki sites ?" on Jul 19, 2024
@benoit74 benoit74 added this to the later milestone Jul 19, 2024
@benoit74
Collaborator

The problem with MediaWiki sites is that every revision page gets crawled one by one. You are probably not interested in those, so you can probably set an exclude parameter to skip revision-history URLs (never tried, but it should work). Also note that to make a ZIM of a MediaWiki site it is preferable to use the mwoffliner scraper, which is specifically tailored to MediaWiki.
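
To make the exclude idea concrete, here is a rough sketch of a pattern that matches revision-history style URLs, checked in Python before handing it to a crawler. It assumes the wiki uses MediaWiki's default `index.php` query parameters (`action=history`, `oldid=`, `diff=`) and that the exclude parameter accepts a regular expression; `EXCLUDE_RX`, `is_excluded`, and the `wiki.example.org` URLs are made up for illustration, and this is untested against a real crawl.

```python
import re

# Assumption: default MediaWiki URL scheme, where history views, old
# revisions and diffs are selected through these query parameters.
EXCLUDE_RX = re.compile(r"[?&](?:action=history|oldid=|diff=)")


def is_excluded(url: str) -> bool:
    """Return True if the URL looks like a revision-history page."""
    return EXCLUDE_RX.search(url) is not None


if __name__ == "__main__":
    samples = [
        "https://wiki.example.org/wiki/Main_Page",
        "https://wiki.example.org/index.php?title=Main_Page&action=history",
        "https://wiki.example.org/index.php?title=Main_Page&oldid=1234",
    ]
    for url in samples:
        print(url, "excluded" if is_excluded(url) else "kept")
```

Checking the pattern against a handful of real URLs from the target wiki first is cheaper than finding out after a multi-hour crawl that it matched too much or too little.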

That being said, the problem of incrementally scraping a site is still relevant for many other cases, and for now there is no real solution in place. It is probably not going to be straightforward to implement.
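
For readers wondering what "incremental" would involve at a minimum, here is a purely illustrative sketch of the change-detection half: comparing per-URL content hashes between two crawls. This is not a zimit feature; `digest`, `changed_urls`, and the manifest layout (a dict mapping URL to hash) are hypothetical.

```python
import hashlib


def digest(body: bytes) -> str:
    """Hash a page body so two crawls can be compared cheaply."""
    return hashlib.sha256(body).hexdigest()


def changed_urls(previous: dict[str, str], current: dict[str, str]) -> set[str]:
    """Return URLs that are new or whose content hash differs.

    Both arguments map URL -> digest (a hypothetical manifest format).
    """
    return {url for url, d in current.items() if previous.get(url) != d}


if __name__ == "__main__":
    old = {"/a": digest(b"hello"), "/b": digest(b"world")}
    new = {"/a": digest(b"hello"), "/b": digest(b"world!"), "/c": digest(b"new")}
    print(sorted(changed_urls(old, new)))
```

Note that this only identifies which pages changed, and only after every page has been fetched again. It does not address how changed pages would be added to the original ZIM, which is the part the question asks about.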

benoit74 changed the title from "How to perform incremental scrapping of Mediawiki sites ?" to "How to perform incremental scrapping of websites ?" on Jul 19, 2024
@kelson42
Contributor

kelson42 commented Jul 19, 2024

It is recommended to scrape MediaWiki sites with MWoffliner.
