From 8124426eebb46dde6ac2239e4837dd3d94a078f7 Mon Sep 17 00:00:00 2001
From: Thomas Meschede
Date: Mon, 7 Dec 2020 00:04:40 +0100
Subject: [PATCH] Update README.md

Added a chapter about caching results using the requests_cache library

---
 README.md | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/README.md b/README.md
index 9686286..d1dd7bb 100644
--- a/README.md
+++ b/README.md
@@ -87,6 +87,38 @@ You can also fetch the content of the web capture as bytes:
 
 There's a full example of iterating and selecting a subset of captures to write into an extracted WARC file in [examples/iter-and-warc.py](examples/iter-and-warc.py)
 
+### Caching
+
+It is possible to cache requests to the index by using the [requests_cache](https://github.com/reclosedev/requests-cache) library.
+This is a nice way to reduce the load on the Common Crawl servers and speed up your
+code at the same time.
+
+To make this work, initialize the cache with additional allowable status codes
+so that "empty" search results from the index server (404 and 400) are also cached.
+The 206 HTTP code is needed for downloading the contents of the WARC archives.
+
+Put the following code at the start of your script, before making any calls to the
+index using cdx_toolkit:
+
+```
+import requests_cache
+
+requests_cache.install_cache(
+    "/my/path/to/cache",
+    include_get_headers=True,
+    allowable_codes=(200, 404, 400, 206),
+)
+```
+
+Additionally, for caching to work the request URL must be static. cdx_toolkit's
+default parameters include a dynamic timestamp, so override it with a fixed
+date when fetching the index:
+
+```
+for obj in cdx.iter(url, limit=1, from_ts="20191207000000"):
+    print(obj)
+```
+
 ## Filter syntax
 
 Filters can be used to limit captures to a subset of the results.
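Why the fixed `from_ts` matters can be shown without any network access: a URL-keyed cache like requests_cache only hits when the exact request URL repeats, and a timestamp that defaults to "now" changes the URL on every run. A minimal stdlib sketch, using a hypothetical `build_index_url` helper (not part of cdx_toolkit) to stand in for the index query URL construction:

```python
from datetime import datetime, timezone

def build_index_url(url, from_ts=None):
    """Hypothetical sketch of how an index request URL might be formed.

    With no from_ts, a dynamic "now"-based timestamp is substituted, so
    the URL differs between runs and a URL-keyed cache never hits.
    """
    if from_ts is None:
        # Dynamic default: changes every second, defeating the cache.
        from_ts = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"https://index.example/CC-MAIN-index?url={url}&from={from_ts}"

# A fixed from_ts yields an identical URL on every call, so a cached
# response can be reused:
a = build_index_url("example.com/*", from_ts="20191207000000")
b = build_index_url("example.com/*", from_ts="20191207000000")
assert a == b
```

This is the same reason the patch passes `from_ts="20191207000000"` to `cdx.iter()`: it pins the only dynamic part of the request so the cache key stays stable.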