Correctly implement code blocks in docs
leewesleyv committed Jan 28, 2025
1 parent 37524bc commit f80dcc4
Showing 5 changed files with 52 additions and 26 deletions.
39 changes: 23 additions & 16 deletions docs/advanced_usage.md
@@ -6,7 +6,7 @@

The `wacz_crawl_skip` flag is applied to requests that should be ignored by the crawler. When this flag is present, the middleware intercepts the request and prevents it from being processed further, skipping both download and parsing. This is useful in scenarios where the request should not be collected during a scraping session. Usage:

```python
``` py
yield Request(url, callback=cb_func, flags=["wacz_crawl_skip"])
```
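
For illustration, a minimal spider sketch might apply the flag only to URLs it does not want collected; the `SKIP_EXTENSIONS` tuple and the link-filtering logic below are assumptions for the example, not part of the plugin:

``` py
from scrapy import Request, Spider

# Hypothetical set of file extensions we do not want to collect in this crawl
SKIP_EXTENSIONS = (".css", ".js", ".png", ".woff2")


class ExampleSpider(Spider):
    name = "example"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            # Flag static assets so the middleware skips download and parsing
            flags = ["wacz_crawl_skip"] if url.endswith(SKIP_EXTENSIONS) else []
            yield Request(url, callback=self.parse, flags=flags)
```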

@@ -28,7 +28,7 @@ Going around the default behaviour of the spider, the `WaczCrawlMiddleware` spid

To use this strategy, enable both the spider middleware and the downloader middleware in the spider settings like so:

```python
``` py title="settings.py"
DOWNLOADER_MIDDLEWARES = {
"scrapy_webarchive.downloadermiddlewares.WaczMiddleware": 543,
}
@@ -40,7 +40,7 @@ SPIDER_MIDDLEWARES = {

Then define the location of the WACZ archive with the `SW_WACZ_SOURCE_URI` setting:

```python
``` py title="settings.py"
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
SW_WACZ_CRAWL = True
```
@@ -49,7 +49,10 @@ SW_WACZ_CRAWL = True

Not all URLs will be interesting for the crawl, since your WACZ will most likely contain static files such as fonts, JavaScript (both the website's own and external), stylesheets, etc. To improve the spider's performance by not reading all the irrelevant request/response entries, you can configure the `archive_regex` attribute in your spider:

```python
``` py title="my_wacz_spider.py"
from scrapy.spiders import Spider


class MyWaczSpider(Spider):
name = "myspider"
archive_regex = r"^/tag/[\w-]+/$"
@@ -77,27 +80,31 @@ com,toscrape,quotes)/static/main.css 20241007081525074 {...}

## Requests and Responses

## Special Keys in Request.meta
### Special Keys in Request.meta

The `Request.meta` attribute in Scrapy allows you to store arbitrary data for use during the crawling process. While you can store any custom data in this attribute, Scrapy and its built-in extensions recognize certain special keys. Additionally, the `scrapy-webarchive` extension introduces its own special key for managing metadata. Below is a description of the key used by `scrapy-webarchive`:

* `webarchive_warc`

### `webarchive_warc`
#### `webarchive_warc`
This key stores the result of a WACZ crawl or export. The data associated with this key is read-only and is not used to control Scrapy's behavior. The value of this key can be accessed using the constant `WEBARCHIVE_META_KEY`, but direct usage of this constant is discouraged. Instead, you should use the provided class method to instantiate a metadata object, as shown in the example below:

```python
``` py title="my_wacz_spider.py"
from scrapy.spiders import Spider
from scrapy_webarchive.models import WarcMetadata


def parse_function(self, response):
# Instantiate a WarcMetadata object from the response
warc_meta = WarcMetadata.from_response(response)
class MyWaczSpider(Spider):
name = "myspider"

def parse_function(self, response):
# Instantiate a WarcMetadata object from the response
warc_meta = WarcMetadata.from_response(response)

# Extract the attributes to attach while parsing a page/item
if warc_meta:
yield {
'warc_record_id': warc_meta.record_id,
'wacz_uri': warc_meta.wacz_uri,
}
# Extract the attributes to attach while parsing a page/item
if warc_meta:
yield {
'warc_record_id': warc_meta.record_id,
'wacz_uri': warc_meta.wacz_uri,
}
```
10 changes: 8 additions & 2 deletions docs/installation.md
@@ -2,14 +2,20 @@

To install `scrapy-webarchive`, run:

```bash
``` bash
pip install scrapy-webarchive
```

If you want to use a cloud provider for storing/scraping, you can opt in to installing the extra dependencies:

```bash
``` bash
pip install scrapy-webarchive[aws]
```

``` bash
pip install scrapy-webarchive[gcs]
```

``` bash
pip install scrapy-webarchive[all]
```
6 changes: 3 additions & 3 deletions docs/settings.md
@@ -6,7 +6,7 @@

### `SW_EXPORT_URI`

```python
``` py title="settings.py"
# Either configure the directory where the output should be uploaded to
SW_EXPORT_URI = "s3://scrapy-webarchive/"
SW_EXPORT_URI = "s3://scrapy-webarchive/{spider}/"
@@ -45,7 +45,7 @@ This setting defines the description of the WACZ used in the `datapackage.json`,

⚠️ Scraping against a remote source currently only supports AWS S3.

```python
``` py title="settings.py"
# "file://" must be explicitly added, unlike SW_EXPORT_URI where it makes an assumption if no scheme is added.
SW_WACZ_SOURCE_URI = "file:///Users/username/Documents/archive.wacz"
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
@@ -58,7 +58,7 @@ This setting defines the location of the WACZ file that should be used as a sour

### `SW_WACZ_CRAWL`

```python
``` py title="settings.py"
SW_WACZ_CRAWL = True
```

8 changes: 4 additions & 4 deletions docs/usage.md
@@ -11,15 +11,15 @@ The general use for this plugin is separated in two parts, exporting and crawlin

To archive the requests/responses during a crawl job, you need to enable the `WaczExporter` extension.

```python
``` py title="settings.py"
EXTENSIONS = {
"scrapy_webarchive.extensions.WaczExporter": 543,
}
```

This extension also requires you to set the export location using the `SW_EXPORT_URI` setting (check the settings page for the different export options).

```python
``` py title="settings.py"
SW_EXPORT_URI = "s3://scrapy-webarchive/"
```

@@ -33,15 +33,15 @@ To crawl against a WACZ archive you need to use the `WaczMiddleware` downloader

To use the downloader middleware, enable it in the settings like so:

```python
``` py title="settings.py"
DOWNLOADER_MIDDLEWARES = {
"scrapy_webarchive.downloadermiddlewares.WaczMiddleware": 543,
}
```

Then define the location of the WACZ archive with the `SW_WACZ_SOURCE_URI` setting:

```python
``` py title="settings.py"
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
SW_WACZ_CRAWL = True
```
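
With the middleware enabled and the source URI configured, the crawl is started as usual; assuming a spider named `myspider`, its responses are then served from the WACZ archive rather than the live site:

``` bash
scrapy crawl myspider
```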
15 changes: 14 additions & 1 deletion mkdocs.yml
@@ -4,11 +4,24 @@ site_url: https://github.com/q-m/scrapy-webarchive

theme:
name: material
features:
- content.code.copy
- content.code.select
- content.code.annotate

nav:
- Introduction: index.md
- Installation: installation.md
- Usage:
- Usage: usage.md
- Advanced Usage: advanced_usage.md
- Settings: settings.md
- Settings: settings.md

markdown_extensions:
- pymdownx.highlight:
anchor_linenums: true
line_spans: __span
pygments_lang_class: true
- pymdownx.inlinehilite
- pymdownx.snippets
- pymdownx.superfences
