Improve the general usage docs and move the advanced usage to a new section
Wesley van Lee committed Oct 28, 2024
1 parent 9ea5f71 commit 24873a3
Showing 3 changed files with 36 additions and 29 deletions.
26 changes: 26 additions & 0 deletions docs/advanced_usage.md
@@ -0,0 +1,26 @@
# Advanced usage

## Crawling

### Iterating a WACZ archive index

When enabled, the `WaczCrawlMiddleware` spider middleware bypasses the spider's default behaviour and replaces the crawl with an iteration over all entries in the WACZ archive index.

To use this strategy, enable both middlewares in the spider settings like so:

```python
DOWNLOADER_MIDDLEWARES = {
"scrapy_webarchive.downloadermiddlewares.WaczMiddleware": 543,
}

SPIDER_MIDDLEWARES = {
"scrapy_webarchive.spidermiddlewares.WaczCrawlMiddleware": 543,
}
```

Then define the location of the WACZ archive with the `SW_WACZ_SOURCE_URI` setting:

```python
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
SW_WACZ_CRAWL = True
```
38 changes: 9 additions & 29 deletions docs/usage.md
@@ -1,5 +1,10 @@
# Usage

General use of this plugin is split into two parts: exporting and crawling.

1. **Exporting**: run your spider with the extension enabled to generate and export a WACZ file. This archive can be used in future crawls to retrieve historical data, or simply to reduce the load on the website when your spider has changed but needs to run on the same data.
2. **Crawling**: re-run your spider against a previously generated WACZ archive. This time no new WACZ archive is created; instead, each response is retrieved from the WACZ rather than requested from the live resource (website). The WACZ contains complete response data, which is reconstructed into actual `Response` objects.
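
The two phases can be summarized as two settings profiles. This is a sketch using only the setting names and URIs shown on this page; the profile names are illustrative, not part of the plugin:

```python
# Phase 1 - exporting: crawl the live site with the exporter extension
# enabled; the crawl writes a WACZ archive to this URI.
EXPORT_PHASE = {
    "SW_EXPORT_URI": "s3://scrapy-webarchive/",
}

# Phase 2 - crawling: replay the spider against the archive produced in
# phase 1; no live requests are made and no new archive is written.
CRAWL_PHASE = {
    "SW_WACZ_SOURCE_URI": "s3://scrapy-webarchive/archive.wacz",
    "SW_WACZ_CRAWL": True,
}
```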

## Exporting

### Exporting a WACZ archive
```python
EXTENSIONS = {
}
```
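
The body of the `EXTENSIONS` block is collapsed in this diff view. Assuming the exporter extension is exposed as `scrapy_webarchive.extensions.WaczExporter` (an assumed class path — verify it against the plugin's source), the setting would read:

```python
EXTENSIONS = {
    # Assumed class path for the WACZ exporter extension; 543 mirrors the
    # priority used for the middleware registrations elsewhere in these docs.
    "scrapy_webarchive.extensions.WaczExporter": 543,
}
```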

This extension also requires you to set the export location using the `SW_EXPORT_URI` setting (see the settings page for the available export options).

```python
SW_EXPORT_URI = "s3://scrapy-webarchive/"
```

Running a crawl job with these settings will produce a new WACZ file at the specified output location.

## Crawling

### Using the downloader middleware

To crawl against a WACZ archive, use the `WaczMiddleware` downloader middleware. Instead of fetching the live resource, the middleware retrieves it from the archive and recreates a `Response` from the archived data.

To use the downloader middleware, enable it in the settings like so:
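
The snippet is collapsed in this diff view; it matches the registration shown in the advanced usage section:

```python
# Register the WACZ downloader middleware so responses are served
# from the archive instead of the live site.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_webarchive.downloadermiddlewares.WaczMiddleware": 543,
}
```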

Then define the location of the WACZ archive with the `SW_WACZ_SOURCE_URI` setting:
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
SW_WACZ_CRAWL = True
```

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -9,4 +9,5 @@ nav:
- Introduction: index.md
- installation.md
- usage.md
- Advanced Usage: advanced_usage.md
- settings.md
