S3 Cache Policy

rgaudin edited this page Apr 10, 2020 · 4 revisions

What & Why

The Zimfarm automates the creation of ZIM files from our collection of thousands of recipes. Most of them are offline versions of live websites that continue to be updated, so new versions of the ZIM files are needed periodically.

While it's important to create new versions, most of the content remains the same as in the previous version. Because our scraper tools don't handle delta updates, they have to redo everything each time (monthly for most recipes).

Text is easy to fetch and process, so doing it over is not a big deal. Media files on the other hand, like images and videos, are large and bandwidth-consuming to download, and they take a long time to process because all scrapers optimize them for a smaller footprint or for compatibility reasons.

Additionally, most platforms (Wikimedia, YouTube) impose crawling restrictions, so re-downloading the very same file is not only a waste of our time and resources but also impacts our ability to run other recipes, as it brings us closer to the quotas imposed by those platforms.

For all those reasons, we have set up an Optimization Cache. This Cache is an online file store that is used only by our Zimfarm scrapers and allows them to:

  • check whether a remote file is already in the store.
  • download the file from the store.
  • store files. They do that after downloading and optimizing a file that's not yet in the store, so it can be reused in future runs.

With this setup, we save on downloads from the platform and on processing time on the workers.
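
The check / download / store cycle described above can be sketched as follows. This is only an illustration: `InMemoryStore`, `fetch_optimized`, and the `download` and `optimize` callables are hypothetical names, and the in-memory dictionary merely stands in for the real object store.

```python
from typing import Callable, Dict


class InMemoryStore:
    """Hypothetical stand-in for the cache; a real scraper would talk to S3."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def has(self, key: str) -> bool:
        return key in self._objects

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data


def fetch_optimized(
    key: str,
    store: InMemoryStore,
    download: Callable[[str], bytes],
    optimize: Callable[[bytes], bytes],
) -> bytes:
    """Return the optimized file for `key`, reusing the store when possible."""
    if store.has(key):                   # 1. is the file already in the store?
        return store.get(key)            # 2. yes: download it from the store
    optimized = optimize(download(key))  # no: fetch from the platform, optimize
    store.put(key, optimized)            # 3. store the result for future runs
    return optimized
```

On a second call with the same key, neither `download` nor `optimize` runs, which is where the savings in platform requests and worker time come from.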

How

The Cache is an S3-compatible Cloud Object Store. We are using Wasabi at the moment, but scrapers should rely only on the S3 API, which many providers support.

Zimfarm implementation

  • One bucket per scraper: org-kiwix-mwoffliner, org-kiwix-youtube, etc.
  • One additional bucket per scraper for development: org-kiwix-dev-mwoffliner, org-kiwix-dev-youtube, etc.
  • One account per developer, under the developers group.
  • One set of credentials per individual developer (attached to their user).
  • One workers group.
  • One single account for all Zimfarm workers (zimfarm-worker) inside the workers group, with a single set of credentials.
  • One developer-worker user account inside the workers group (for testing on prod).
  • dev buckets are open to all actions by all developers.
  • prod buckets are restricted to RW access, and only from white-listed worker IPs.
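
As a small illustration of the naming scheme above, a helper could derive the bucket name from the scraper name. The function name `bucket_name` and its `dev` flag are assumptions made for this sketch, not part of any scraper.

```python
def bucket_name(scraper: str, dev: bool = False) -> str:
    """Build a bucket name following the org-kiwix-[dev-]<scraper> pattern."""
    prefix = "org-kiwix-dev-" if dev else "org-kiwix-"
    return prefix + scraper
```

For instance, `bucket_name("youtube", dev=True)` gives the development bucket for the youtube scraper.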

Note: AWS S3 offers 100 buckets per account. Wasabi offers 1,000.

Bucket rules

  • Each scraper defines how to use its bucket.

This could be documented on the scraper's repo and linked here.

User rules

  • Don't use Root Account Key.
  • Don't use worker credentials.
  • If you need to test on prod, ask for credentials for the developer-worker user. We'll remove them once you're done.
  • If you don't have an account, contact @kelson42.

Policies

Entity                  Value
Group admins            explicit allow * on *
Group developers        explicit allow RW for org-kiwix-dev-*
Group workers           explicit allow RW for org-kiwix-*
User zimfarm-worker     explicit deny of s3:* on all S3 resources from IPs not in the list

No default policy on buckets. Bucket creators may allow public downloads (non-Zimfarm use, e.g. cardshop).
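
To see which buckets each group's allow rule reaches, the patterns can be tried out with shell-style matching. This is a simplified sketch: `GROUP_PATTERNS` and `group_allows` are illustrative names, and `fnmatchcase` only approximates how the provider evaluates wildcards in resource names.

```python
from fnmatch import fnmatchcase

# Bucket-name patterns per group, taken from the group rules listed above.
GROUP_PATTERNS = {
    "admins": ["*"],
    "developers": ["org-kiwix-dev-*"],
    "workers": ["org-kiwix-*"],
}


def group_allows(group: str, bucket: str) -> bool:
    """Return True if one of the group's patterns matches the bucket name."""
    return any(fnmatchcase(bucket, p) for p in GROUP_PATTERNS.get(group, []))
```

Under this approximation, the workers pattern `org-kiwix-*` also matches the dev buckets, since their names start with the same prefix.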

Group admins

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Admin",
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*"
    }
  ]
}

Group developers

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DevelopersRWOnDevBuckets",
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::org-kiwix-dev-*"
    },
    {
      "Sid": "DevelopersListOwnBuckets",
      "Effect": "Allow",
      "Action": "s3:ListAllMyBuckets",
      "Resource": "arn:aws:s3:::*"
    },
    {
      "Sid": "DevelopersIAMManagement",
      "Effect": "Allow",
      "Action": [
        "iam:*",
        "sts:*"
      ],
      "Resource": "arn:aws:iam::${aws:accountid}:user/${aws:username}"
    }
  ]
}

Group workers

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "WorkersListOwnBuckets",
      "Effect": "Allow",
      "Action": "s3:ListAllMyBuckets",
      "Resource": "arn:aws:s3:::*"
    },
    {
      "Sid": "WorkersRWOnProdBuckets",
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::org-kiwix-*"
    }
  ]
}

User zimfarm-worker

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ZimfarmWorkersIPsWhiteList",
      "Effect": "Deny",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::*",
      "Condition": {
        "NotIpAddress": {
          "aws:SourceIp": [
            "IPs list goes here"
          ]
        }
      }
    }
  ]
}
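
The effect of the Deny statement with its NotIpAddress condition can be mimicked locally: any request whose source IP is outside the list is denied, and an explicit deny takes precedence over the allow granted by the workers group. The sketch below is only an illustration; `is_denied` is a hypothetical helper and the addresses used with it are from documentation ranges, not real worker IPs.

```python
from ipaddress import ip_address, ip_network
from typing import Iterable


def is_denied(source_ip: str, allowed: Iterable[str]) -> bool:
    """Deny unless the source IP falls inside one of the allowed entries.

    Entries may be single addresses or CIDR ranges.
    """
    addr = ip_address(source_ip)
    return not any(addr in ip_network(entry) for entry in allowed)
```

With an allowed list of `["203.0.113.0/24", "198.51.100.7"]`, a request from `203.0.113.42` passes while one from `192.0.2.1` is denied.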

Non-Zimfarm use

Apart from developer usage (on the dev buckets), people running scrapers outside the Zimfarm are not allowed to use our cache. They are free, however, to use an S3 cache of their own by providing a different URL.
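
As an illustration of pointing a scraper at a different cache, the sketch below splits a cache URL into an endpoint, a bucket, and credentials. The URL layout and the query parameter names (`keyId`, `secretAccessKey`, `bucketName`) are assumptions made for this example, as is the function name `parse_cache_url`. Check the scraper's own documentation for the format it actually expects.

```python
from typing import Dict
from urllib.parse import parse_qs, urlsplit


def parse_cache_url(url: str) -> Dict[str, str]:
    """Split a cache URL (assumed layout) into endpoint, bucket, credentials."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    return {
        "endpoint": f"{parts.scheme}://{parts.netloc}",
        "bucket": query["bucketName"][0],    # assumed parameter name
        "key_id": query["keyId"][0],         # assumed parameter name
        "secret": query["secretAccessKey"][0],  # assumed parameter name
    }
```

Because only the endpoint and credentials change, the same scraper code can work against any S3-compatible provider.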