Correctly implement code blocks in docs
leewesleyv committed Jan 28, 2025
1 parent 37524bc commit f80dcc4
Showing 5 changed files with 52 additions and 26 deletions.
39 changes: 23 additions & 16 deletions docs/advanced_usage.md
@@ -6,7 +6,7 @@

The `wacz_crawl_skip` flag is applied to requests that should be ignored by the crawler. When this flag is present, the middleware intercepts the request and prevents it from being processed further, skipping both download and parsing. This is useful in scenarios where the request should not be collected during a scraping session. Usage:

```python
``` py
yield Request(url, callback=cb_func, flags=["wacz_crawl_skip"])
```
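
For illustration, a minimal spider sketch might apply the flag only to URLs it does not want collected; the `SKIP_EXTENSIONS` tuple and the link-filtering logic below are assumptions for the example, not part of the plugin:

``` py
from scrapy import Request, Spider

# Hypothetical set of file extensions we do not want to collect in this crawl
SKIP_EXTENSIONS = (".css", ".js", ".png", ".woff2")


class ExampleSpider(Spider):
    name = "example"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            # Flag static assets so the middleware skips download and parsing
            flags = ["wacz_crawl_skip"] if url.endswith(SKIP_EXTENSIONS) else []
            yield Request(url, callback=self.parse, flags=flags)
```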

@@ -28,7 +28,7 @@ Going around the default behaviour of the spider, the `WaczCrawlMiddleware` spid

To use this strategy, enable both the spider middleware and the downloader middleware in the spider settings like so:

```python
``` py title="settings.py"
DOWNLOADER_MIDDLEWARES = {
"scrapy_webarchive.downloadermiddlewares.WaczMiddleware": 543,
}
@@ -40,7 +40,7 @@ SPIDER_MIDDLEWARES = {

Then define the location of the WACZ archive with the `SW_WACZ_SOURCE_URI` setting:

```python
``` py title="settings.py"
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
SW_WACZ_CRAWL = True
```
@@ -49,7 +49,10 @@ SW_WACZ_CRAWL = True

Not all URLs will be interesting for the crawl, since your WACZ will most likely contain static files such as fonts, JavaScript (both the website's own and external), stylesheets, etc. To improve the spider's performance by not reading all the irrelevant request/response entries, you can configure the `archive_regex` attribute in your spider:

```python
``` py title="my_wacz_spider.py"
from scrapy.spiders import Spider


class MyWaczSpider(Spider):
name = "myspider"
archive_regex = r"^/tag/[\w-]+/$"
@@ -77,27 +80,31 @@ com,toscrape,quotes)/static/main.css 20241007081525074 {...}

## Requests and Responses

## Special Keys in Request.meta
### Special Keys in Request.meta

The `Request.meta` attribute in Scrapy allows you to store arbitrary data for use during the crawling process. While you can store any custom data in this attribute, Scrapy and its built-in extensions recognize certain special keys. Additionally, the `scrapy-webarchive` extension introduces its own special key for managing metadata. Below is a description of the key used by `scrapy-webarchive`:

* `webarchive_warc`

### `webarchive_warc`
#### `webarchive_warc`
This key stores the result of a WACZ crawl or export. The data associated with this key is read-only and is not used to control Scrapy's behavior. The value of this key can be accessed using the constant `WEBARCHIVE_META_KEY`, but direct usage of this constant is discouraged. Instead, you should use the provided class method to instantiate a metadata object, as shown in the example below:

```python
``` py title="my_wacz_spider.py"
from scrapy.spiders import Spider
from scrapy_webarchive.models import WarcMetadata


def parse_function(self, response):
# Instantiate a WarcMetadata object from the response
warc_meta = WarcMetadata.from_response(response)
class MyWaczSpider(Spider):
name = "myspider"

def parse_function(self, response):
# Instantiate a WarcMetadata object from the response
warc_meta = WarcMetadata.from_response(response)

# Extract the attributes to attach while parsing a page/item
if warc_meta:
yield {
'warc_record_id': warc_meta.record_id,
'wacz_uri': warc_meta.wacz_uri,
}
# Extract the attributes to attach while parsing a page/item
if warc_meta:
yield {
'warc_record_id': warc_meta.record_id,
'wacz_uri': warc_meta.wacz_uri,
}
```
10 changes: 8 additions & 2 deletions docs/installation.md
@@ -2,14 +2,20 @@

To install `scrapy-webarchive`, run:

```bash
``` bash
pip install scrapy-webarchive
```

If you want to use a cloud provider for storing/scraping, you can opt in to installing the extra dependencies:

```bash
``` bash
pip install scrapy-webarchive[aws]
```

``` bash
pip install scrapy-webarchive[gcs]
```

``` bash
pip install scrapy-webarchive[all]
```
6 changes: 3 additions & 3 deletions docs/settings.md
@@ -6,7 +6,7 @@

### `SW_EXPORT_URI`

```python
``` py title="settings.py"
# Either configure the directory where the output should be uploaded to
SW_EXPORT_URI = "s3://scrapy-webarchive/"
SW_EXPORT_URI = "s3://scrapy-webarchive/{spider}/"
@@ -45,7 +45,7 @@ This setting defines the description of the WACZ used in the `datapackage.json`,

⚠️ Scraping against a remote source currently only supports AWS S3.

```python
``` py title="settings.py"
# "file://" must be explicitly added, unlike SW_EXPORT_URI where it makes an assumption if no scheme is added.
SW_WACZ_SOURCE_URI = "file:///Users/username/Documents/archive.wacz"
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
@@ -58,7 +58,7 @@ This setting defines the location of the WACZ file that should be used as a sour

### `SW_WACZ_CRAWL`

```python
``` py title="settings.py"
SW_WACZ_CRAWL = True
```

8 changes: 4 additions & 4 deletions docs/usage.md
@@ -11,15 +11,15 @@ The general use for this plugin is separated in two parts, exporting and crawlin

To archive the requests/responses during a crawl job, you need to enable the `WaczExporter` extension.

```python
``` py title="settings.py"
EXTENSIONS = {
"scrapy_webarchive.extensions.WaczExporter": 543,
}
```

This extension also requires you to set the export location using the `SW_EXPORT_URI` setting (check the settings page for the different export options).

```python
``` py title="settings.py"
SW_EXPORT_URI = "s3://scrapy-webarchive/"
```

@@ -33,15 +33,15 @@ To crawl against a WACZ archive you need to use the `WaczMiddleware` downloader

To use the downloader middleware, enable it in the settings like so:

```python
``` py title="settings.py"
DOWNLOADER_MIDDLEWARES = {
"scrapy_webarchive.downloadermiddlewares.WaczMiddleware": 543,
}
```

Then define the location of the WACZ archive with the `SW_WACZ_SOURCE_URI` setting:

```python
``` py title="settings.py"
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
SW_WACZ_CRAWL = True
```
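
With the middleware enabled and the source URI configured, the crawl is started as usual; assuming a spider named `myspider`, its responses are then served from the WACZ archive rather than the live site:

``` bash
scrapy crawl myspider
```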
15 changes: 14 additions & 1 deletion mkdocs.yml
@@ -4,11 +4,24 @@ site_url: https://github.com/q-m/scrapy-webarchive

theme:
name: material
features:
- content.code.copy
- content.code.select
- content.code.annotate

nav:
- Introduction: index.md
- Installation: installation.md
- Usage:
- Usage: usage.md
- Advanced Usage: advanced_usage.md
- Settings: settings.md
- Settings: settings.md

markdown_extensions:
- pymdownx.highlight:
anchor_linenums: true
line_spans: __span
pygments_lang_class: true
- pymdownx.inlinehilite
- pymdownx.snippets
- pymdownx.superfences
