S3 Cache Policy

What & Why

The Zimfarm automates the creation of ZIM files from our collection of thousands of recipes. Most of them are offline versions of live websites that continue to be updated, so new versions of the ZIM files must be produced periodically.

While it's important to create new versions, most of the content stays the same from one version to the next. Because our scraper tools don't handle delta updates, they have to redo everything each time (monthly for most recipes).

Text is easy to fetch and process, so doing it over is not a big deal. Media files such as images and videos, on the other hand, are large and bandwidth-consuming to download, and they take a long time to process because all scrapers optimize them for a smaller footprint or for compatibility reasons.

Additionally, most platforms (Wikimedia, YouTube) impose crawling restrictions, so re-downloading the very same file is not only a waste of our time and resources but also limits our ability to run other recipes, as it brings us closer to the quotas imposed by those platforms.

For all those reasons, we have set up an Optimization Cache. This Cache is an online file store used only by our Zimfarm scrapers. It allows them to:

check whether a remote file is already in the store;
download the file from the store;
store files: after downloading and optimizing a file that is not yet in the store, they upload it for future use.
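
The three operations above can be sketched as a small check/download/store workflow. This is a minimal illustration only, not the scrapers' actual code: the `CacheStore` interface, the `InMemoryStore` stand-in, and the `get_media` helper are hypothetical names, and the `fetch` and `optimize` callables represent whatever a given scraper does to download and process a file.

```python
from typing import Callable, Dict, Protocol


class CacheStore(Protocol):
    """The three operations the cache offers to scrapers (hypothetical interface)."""

    def has(self, key: str) -> bool: ...
    def download(self, key: str) -> bytes: ...
    def upload(self, key: str, data: bytes) -> None: ...


class InMemoryStore:
    """Stand-in for the real store so the sketch runs without network access."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def has(self, key: str) -> bool:
        return key in self._objects

    def download(self, key: str) -> bytes:
        return self._objects[key]

    def upload(self, key: str, data: bytes) -> None:
        self._objects[key] = data


def get_media(
    key: str,
    store: CacheStore,
    fetch: Callable[[str], bytes],
    optimize: Callable[[bytes], bytes],
) -> bytes:
    """Return the optimized file for `key`, reusing the cached copy when present."""
    if store.has(key):
        # Cache hit: no request to the platform, no processing on the worker.
        return store.download(key)
    # Cache miss: download from the platform, optimize, then keep the result.
    optimized = optimize(fetch(key))
    store.upload(key, optimized)
    return optimized
```

On a second call with the same key, `fetch` and `optimize` are not invoked at all, which is where the savings on platform quotas and worker time come from.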

With this setup, we save on downloads from the platforms and on processing time on the workers.

How

The Cache is an S3-compatible cloud object store. We are using Wasabi at the moment, but scrapers should rely only on the S3 API, which many providers support.

Zimfarm implementation

One bucket per scraper: org-kiwix-mwoffliner, org-kiwix-youtube, etc.
One additional bucket per scraper for development: org-kiwix-dev-mwoffliner, org-kiwix-dev-youtube, etc.
One account per developer, under the developers group.
One or more sets of credentials per developer (attached to their user).
One workers group.
One single account for all Zimfarm workers (zimfarm-worker) inside the workers group, with a single set of credentials.
One developer-worker user account inside the workers group (for testing against prod).
Dev buckets are open to all actions by all developers.
Prod buckets are restricted to RW access, and only from whitelisted worker IPs.
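
The naming convention in the list above can be expressed as a tiny helper. This is only a sketch: the `bucket_name` function is a hypothetical name, not part of any scraper.

```python
def bucket_name(scraper: str, dev: bool = False) -> str:
    """Build the bucket name for a scraper, following the convention above."""
    prefix = "org-kiwix-dev-" if dev else "org-kiwix-"
    return prefix + scraper
```

For example, `bucket_name("youtube", dev=True)` gives `org-kiwix-dev-youtube`, the development bucket listed above.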

Note: AWS S3 offers 100 buckets per account. Wasabi offers 1,000.

Bucket rules

Each scraper defines how to use its bucket.

This could be documented on the scraper's repo and linked here.

User rules

Don't use the Root Account Key.
Don't use worker credentials.
If you don't have an account, contact @kelson42.

Policies

Entity	Value
Group `admin`	explicit allow `*` on `*`
Group `developers`	explicit allow RW for `org-kiwix-dev-*`
Group `workers`	explicit allow RW for `org-kiwix-*`
User `zimfarm-worker`	explicit deny from IPs not in list for `s3:*` on `arn:aws:s3:::*`

There is no default policy on buckets. Bucket creators may allow public downloads (non-Zimfarm use, e.g. cardshop).

Group `admin`

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Admin",
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*"
    }
  ]
}

Group `developers`

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DevelopersRWOnDevBuckets",
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::org-kiwix-dev-*"
    },
    {
      "Sid": "DevelopersIAMManagement",
      "Effect": "Allow",
      "Action": [
        "iam:*",
        "sts:*"
      ],
      "Resource": "arn:aws:iam::${aws:accountid}:user/${aws:username}"
    }
  ]
}

Group `workers`

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "WorkersRWOnProdBuckets",
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::org-kiwix-*"
    }
  ]
}

User `zimfarm-worker`

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ZimfarmWorkersIPsWhiteList",
      "Effect": "Deny",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::*",
      "Condition": {
        "NotIpAddress": {
          "aws:SourceIp": [
            "IPs list goes here"
          ]
        }
      }
    }
  ]
}
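
To illustrate how the group allows and the user-level deny combine, here is a simplified model of the evaluation. It is a sketch under stated assumptions, not how the provider evaluates policies internally: the `is_allowed` function is hypothetical, the allow-list uses a documentation-only network (`203.0.113.0/24`) because the real IPs are not part of this page, and wildcard matching is approximated with `fnmatchcase`.

```python
from fnmatch import fnmatchcase
from ipaddress import ip_address, ip_network

# Hypothetical allow-list: a reserved documentation range, not real worker IPs.
ALLOWED_NETWORKS = [ip_network("203.0.113.0/24")]

# Bucket patterns allowed per group, taken from the policies above.
GROUP_ALLOW = {
    "developers": "org-kiwix-dev-*",
    "workers": "org-kiwix-*",
}


def is_allowed(group: str, user: str, bucket: str, source_ip: str) -> bool:
    """Simplified model: an explicit deny always wins over any allow."""
    if user == "zimfarm-worker":
        ip = ip_address(source_ip)
        if not any(ip in net for net in ALLOWED_NETWORKS):
            return False  # denied: request comes from an IP outside the list
    if group == "admin":
        return True  # the admin group allows every action on every resource
    pattern = GROUP_ALLOW.get(group)
    return pattern is not None and fnmatchcase(bucket, pattern)
```

In this model, `is_allowed("workers", "zimfarm-worker", "org-kiwix-youtube", "198.51.100.7")` is refused because of the IP condition, while a developer is refused on `org-kiwix-youtube` simply because no allow matches. Note that the pattern `org-kiwix-*` also matches the `org-kiwix-dev-*` names, so the workers group allow covers dev buckets as well.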

Non-Zimfarm use

Apart from developer usage (on the dev buckets), scraper users outside the Zimfarm are not allowed to use our cache. They are free, however, to use any S3 cache of their own by providing a different URL.
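
As an illustration of pointing a scraper at a different store through a URL, the sketch below splits a single URL into an endpoint, a bucket, and credentials. The URL layout and the query parameter names (`keyId`, `secretAccessKey`, `bucketName`) are assumptions made for this example, as are the `CacheConfig` and `parse_cache_url` names; check the documentation of the scraper you use for the format it actually expects.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs, urlsplit


@dataclass(frozen=True)
class CacheConfig:
    endpoint: str
    bucket: str
    key_id: str
    secret_key: str


def parse_cache_url(url: str) -> CacheConfig:
    """Split a cache URL (assumed layout) into endpoint, bucket and credentials."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)

    def one(name: str) -> str:
        # Each parameter is required; fail early with a clear message.
        values = query.get(name)
        if not values:
            raise ValueError(f"missing '{name}' in cache URL")
        return values[0]

    return CacheConfig(
        endpoint=f"{parts.scheme}://{parts.netloc}",
        bucket=one("bucketName"),
        key_id=one("keyId"),
        secret_key=one("secretAccessKey"),
    )
```

Keeping the provider's endpoint in configuration rather than in code is what lets the same scraper work against Wasabi, AWS S3, or any other S3-compatible store.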