S3 Cache Policy
The Zimfarm automates the creation of ZIM files from our collection of thousands of recipes. Most of them are offline versions of live websites that continue to be updated, and for which new versions of the ZIM files are periodically necessary.
While it's important to create new versions, most of the content remains the same as in the previous version. Because our scraper tools don't handle delta updates, they have to redo everything each time (monthly for most recipes).
Text is easy to fetch and process, so doing it over is not a big deal. Media files such as images and videos, on the other hand, are large and transfer-heavy, and they take a long time to process because all scrapers optimize them for a smaller footprint or for compatibility reasons.
Additionally, most platforms (Wikimedia, YouTube) impose crawling restrictions, so re-downloading the very same file not only wastes our time and resources but also impacts our ability to run other recipes, as it brings us closer to the quotas imposed by those platforms.
For all those reasons, we have set up an Optimization Cache. This Cache is an online file store that is used only by our zimfarm scrapers and allows them to:
- check whether a remote file is already in the store;
- download the file from the store;
- store files, which they do after downloading and optimizing a file that's not yet in the store, for future use.
With this setup, we save on downloads from the platform and on processing time on the workers.
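The three operations above can be sketched as follows. This is a minimal illustration, not the scrapers' actual code: `InMemoryStore` stands in for the S3 bucket, and `get_optimized`, `fetch` and `optimize` are hypothetical names. Using the source URL as the object key is also an assumption, since each scraper defines its own key scheme.

```python
from typing import Callable, Dict


class InMemoryStore:
    """Stand-in for the S3 bucket (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def has(self, key: str) -> bool:
        return key in self._objects

    def download(self, key: str) -> bytes:
        return self._objects[key]

    def upload(self, key: str, data: bytes) -> None:
        self._objects[key] = data


def get_optimized(
    url: str,
    store: InMemoryStore,
    fetch: Callable[[str], bytes],
    optimize: Callable[[bytes], bytes],
) -> bytes:
    """Return the optimized file for `url`, using the cache when possible."""
    key = url  # assumption: real scrapers choose their own key scheme
    if store.has(key):              # check if the file is already in store
        return store.download(key)  # cache hit: download it from the store
    data = optimize(fetch(url))     # cache miss: fetch from the platform, then optimize
    store.upload(key, data)         # store the result for future runs
    return data
```

On a second call with the same URL, neither `fetch` nor `optimize` runs, which is where the savings on platform downloads and worker processing time come from.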
The Cache is an S3-compatible Cloud Object Store. We are using Wasabi at the moment but scrapers should only rely on the S3 API which many providers support.
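Because scrapers depend only on the S3 API, switching providers should come down to configuration. Here is a minimal sketch, assuming the cache location is handed to the scraper as a single URL. The function name, the `CacheConfig` type and the query parameter names (`bucketName`, `keyId`, `secretAccessKey`) are hypothetical and chosen for this example only.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs, urlsplit


@dataclass(frozen=True)
class CacheConfig:
    endpoint_url: str  # provider endpoint (Wasabi, AWS, or any S3-compatible store)
    bucket: str
    key_id: str
    secret_key: str


def parse_cache_url(url: str) -> CacheConfig:
    """Split a cache URL into endpoint, bucket and credentials.

    The query parameter names are assumptions made for this sketch.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    try:
        return CacheConfig(
            endpoint_url=f"{parts.scheme}://{parts.netloc}",
            bucket=query["bucketName"][0],
            key_id=query["keyId"][0],
            secret_key=query["secretAccessKey"][0],
        )
    except KeyError as exc:
        # Report which parameter is missing instead of a bare KeyError.
        raise ValueError(f"missing cache URL parameter: {exc.args[0]}") from None
```

The resulting `endpoint_url` is the only provider-specific value; the rest of the scraper would talk plain S3 to whatever endpoint it is given.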
- One bucket per scraper: `org-kiwix-mwoffliner`, `org-kiwix-youtube`, etc.
- One additional bucket per scraper for development: `org-kiwix-dev-mwoffliner`, `org-kiwix-dev-youtube`, etc.
- One account per developer under `developers` group.
- 1+ set of credentials per individual developer (attached to its user).
- One `workers` group.
- One single account for all zimfarm workers (`zimfarm-worker`) inside `workers` group; with single set of credentials.
- One `developer-worker` user account inside `workers` group (for testing to prod).
- dev buckets are opened to all actions by all developers.
- prod buckets are restricted RW and only from white-listed workers IPs.
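The bucket naming convention above can be captured in a small helper. This is a minimal sketch; `bucket_name` and `is_dev_bucket` are hypothetical names, not part of any existing tool.

```python
PROD_PREFIX = "org-kiwix-"
DEV_PREFIX = "org-kiwix-dev-"


def bucket_name(scraper: str, dev: bool = False) -> str:
    """Build the bucket name for a scraper, e.g. 'mwoffliner' or 'youtube'."""
    return (DEV_PREFIX if dev else PROD_PREFIX) + scraper


def is_dev_bucket(name: str) -> bool:
    """Tell whether a bucket name follows the development naming scheme."""
    return name.startswith(DEV_PREFIX)
```

Developers would use `bucket_name("youtube", dev=True)` with their own credentials, while Zimfarm workers use the production name.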
Note: AWS S3 offers 100 buckets per account. Wasabi offers 1,000.
- Each scraper defines how to use its bucket.
This could be documented on the scraper's repo and linked here.
- Don't use Root Account Key.
- Don't use worker credentials.
- If you don't have an account, contact @kelson42.
Entity | Value |
---|---|
Group `admin` | explicit allow `*` on `*` |
Group `developers` | explicit allow RW for `org-kiwix-dev-*` |
Group `workers` | explicit allow RW for `org-kiwix-*` |
User `zimfarm-worker` | explicit deny from IPs not in list for `s3:*` on `arn:aws:s3:::*` |
No default policy on buckets. Bucket creators may allow public downloads (for non-zimfarm use, e.g. cardshop).
Policy for group `admin`:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Admin",
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*"
    }
  ]
}
Policy for group `developers`:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DevelopersRWOnDevBuckets",
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::org-kiwix-dev-*"
    },
    {
      "Sid": "DevelopersIAMManagement",
      "Effect": "Allow",
      "Action": [
        "iam:*",
        "sts:*"
      ],
      "Resource": "arn:aws:iam::${aws:accountid}:user/${aws:username}"
    }
  ]
}
Policy for group `workers`:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "WorkersRWOnProdBuckets",
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::org-kiwix-*"
    }
  ]
}
Policy for user `zimfarm-worker`:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ZimfarmWorkersIPsWhiteList",
      "Effect": "Deny",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::*",
      "Condition": {
        "NotIpAddress": {
          "aws:SourceIp": [
            "IPs list goes here"
          ]
        }
      }
    }
  ]
}
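To see how the `workers` group allow and the `zimfarm-worker` IP deny combine, here is a simplified evaluator. It is a sketch only: it covers just the statement shapes used on this page (`Action`, `Resource`, and a `NotIpAddress` condition on `aws:SourceIp`), uses glob matching as an approximation of policy wildcards, and applies the rule that an explicit deny overrides any allow. The function names and the `203.0.113.0/24` range are placeholders, and resources use the standard three-colon S3 ARN form.

```python
from fnmatch import fnmatchcase
from ipaddress import ip_address, ip_network
from typing import Any, Dict, List


def _as_list(value: Any) -> List[Any]:
    """Policy fields may hold a single value or a list of values."""
    return value if isinstance(value, list) else [value]


def _applies(statement: Dict[str, Any], action: str, resource: str, source_ip: str) -> bool:
    """Check whether a statement matches the request (simplified)."""
    if not any(fnmatchcase(action, p) for p in _as_list(statement["Action"])):
        return False
    if not any(fnmatchcase(resource, p) for p in _as_list(statement["Resource"])):
        return False
    condition = statement.get("Condition", {}).get("NotIpAddress")
    if condition is None:
        return True
    networks = [ip_network(cidr) for cidr in _as_list(condition["aws:SourceIp"])]
    # NotIpAddress holds when the caller is outside every listed range.
    return not any(ip_address(source_ip) in net for net in networks)


def is_allowed(statements: List[Dict[str, Any]], action: str, resource: str, source_ip: str) -> bool:
    """Explicit deny wins; otherwise at least one allow is required."""
    applicable = [s for s in statements if _applies(s, action, resource, source_ip)]
    if any(s["Effect"] == "Deny" for s in applicable):
        return False
    return any(s["Effect"] == "Allow" for s in applicable)


# Statements seen by the zimfarm-worker user: group allow + user-level deny.
WORKER_STATEMENTS = [
    {"Effect": "Allow", "Action": "s3:*", "Resource": "arn:aws:s3:::org-kiwix-*"},
    {
        "Effect": "Deny",
        "Action": "s3:*",
        "Resource": "arn:aws:s3:::*",
        "Condition": {"NotIpAddress": {"aws:SourceIp": ["203.0.113.0/24"]}},  # placeholder range
    },
]
```

With these statements, a request for `arn:aws:s3:::org-kiwix-youtube/video.webm` passes from `203.0.113.7` and is refused from any address outside the listed range. Note that the pattern `org-kiwix-*` also matches `org-kiwix-dev-*` names, so as written the workers' allow covers the dev buckets too.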
Except for developer usage (on the dev buckets), users running our scrapers outside the Zimfarm are not allowed to use our cache. They are free to use an S3 cache of their own by providing a different URL, though.