Automated publishing of all data + metadata #131

Open · wants to merge 17 commits into main

Conversation

@penelopeysm (Member) commented Jul 2, 2024

Closes #123

This PR:

  • adds an asset to generate the top-level countries.txt file, along with the corresponding IO managers (CountriesTextIOManager); a rough sketch of what such an IO manager looks like is included after this list
  • adds a popgetter.run module which traverses the dependency list in a named job and runs it asset-by-asset (also sketched after this list). Running python -m popgetter.run all will:
    • look at the $POPGETTER_COUNTRIES env var to see which countries are to be run
    • run the individual country jobs to generate the data
    • run each of the cloud sensor assets individually to publish the data
  • adds a bash script which sets up the necessary environment variables and then publishes all the data
  • adds docs explaining all of this: https://popgetter--131.org.readthedocs.build/en/131/deployment/
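
To make the bullets above a bit more concrete, here are two illustrative sketches. Neither is the actual code in this PR: the assets, IO manager, and paths are placeholders, and both assume a reasonably recent dagster version.

First, a minimal text-file IO manager in the same spirit as CountriesTextIOManager (the real one will differ in details such as where and how it writes):

    import os

    from dagster import ConfigurableIOManager, InputContext, OutputContext

    class NewlineSeparatedTextIOManager(ConfigurableIOManager):
        """Hypothetical: store a list of strings as a newline-separated text file."""

        base_path: str

        def _path(self, context) -> str:
            return os.path.join(self.base_path, *context.asset_key.path) + ".txt"

        def handle_output(self, context: OutputContext, obj: list[str]) -> None:
            path = self._path(context)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w") as f:
                f.write("\n".join(obj))

        def load_input(self, context: InputContext) -> list[str]:
            with open(self._path(context)) as f:
                return f.read().splitlines()

Second, the asset-by-asset idea behind python -m popgetter.run all, reduced to two toy assets and a hard-coded asset list (the real module walks the dependency list of the named job and also runs the cloud sensor assets):

    import os

    from dagster import FilesystemIOManager, asset, materialize

    @asset
    def raw_table() -> list[int]:
        return [1, 2, 3]

    @asset
    def summary(raw_table: list[int]) -> int:
        return sum(raw_table)

    if __name__ == "__main__":
        # Which countries to run, e.g. POPGETTER_COUNTRIES=bel,gb_nir
        countries = [c for c in os.environ.get("POPGETTER_COUNTRIES", "").split(",") if c]
        print(f"countries requested: {countries}")  # noqa: T201

        all_assets = [raw_table, summary]
        # Pin the IO manager to a fixed directory so that each single-asset run
        # can load its upstream dependencies from earlier runs' outputs.
        resources = {"io_manager": FilesystemIOManager(base_dir="/tmp/popgetter-sketch")}

        # Materialise one asset at a time, in dependency order, continuing past
        # failures instead of aborting the whole run.
        for a in all_assets:
            result = materialize(all_assets, selection=[a], resources=resources, raise_on_error=False)
            print(f"{a.key.to_user_string()}: {'ok' if result.success else 'FAILED'}")  # noqa: T201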

How to test

Check out this PR, then run:

POPGETTER_COUNTRIES=bel,gb_nir ENV=dev ./deploy.sh

This will generate all the requisite data somewhere inside /tmp (note, it doesn't use your environment variables from .env). If you want to deploy to Azure:

POPGETTER_COUNTRIES=bel,gb_nir ENV=dev SAS_TOKEN="(whatever)" ./deploy.sh

(the value of SAS_TOKEN needs to be quoted because it contains ampersands, which will otherwise be interpreted by the shell)

The bash script publishes the data to a different directory from the one we've been using so far. To be exact, it publishes to popgetter/prod/{version_number} (link to Azure). The idea is that if the metadata schema changes, we can / should bump the version number, and the deployment script will then regenerate the data in a separate directory from the previous one.

(Ultimately, we should also set the default base path in the CLI to this directory.)
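
For illustration only (the version value here is made up; the real metadata schema version is whatever the deploy script is configured with at the time), the published prefix is just that version interpolated into the path:

    VERSION = "0.1.0"  # hypothetical metadata schema version
    publish_path = f"popgetter/prod/{VERSION}"
    print(publish_path)  # popgetter/prod/0.1.0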

Potential future improvements

  • Don't stop the entire run if one partition fails to materialise (e.g. due to rate limits). For example, Belgium partition 4134 is currently failing with a 404 (the data seems to have been moved away from the link in the catalogue). This is done now; it also means that if you only try to materialise a subset of the countries, the run will still finish successfully. (A sketch of this skip-on-failure behaviour is included after this list.)

  • Run multiple partitions of the same asset in parallel. I've looked into this a bit: we can't use async, because dagster's API doesn't expose an async version of the materialisation call, so we'd have to use something like threading or concurrent.futures inside the popgetter.run module. Something like this:

    import time
    from concurrent.futures import ThreadPoolExecutor

    import matplotlib

    matplotlib.use("agg")  # non-interactive backend

    def run_partition(partition):
        print(f"  - with partition key: {partition}")  # noqa: T201
        try_materialise(
            asset, upstream_deps, instance, fail_fast, partition_key=partition
        )
        time.sleep(delay)

    # Submit every partition to a thread pool and wait for them all;
    # future.result() re-raises any exception from the worker thread.
    with ThreadPoolExecutor(max_workers=max_threads) as executor:
        futures = [
            executor.submit(run_partition, partition)
            for partition in partition_names
        ]
        for future in futures:
            future.result()

    (matplotlib raises errors in worker threads unless you switch it to a non-interactive backend, hence the matplotlib.use("agg") call.) Anyway, I tested this on three partitions of bel/census_tables and it didn't noticeably speed anything up, so I haven't included it in this PR.
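
For reference, the try_materialise helper used in the snippet above isn't shown in this description. A rough sketch of the shape it could take (an assumption about its behaviour, not the actual popgetter.run code) is:

    from dagster import materialize

    def try_materialise(asset, upstream_deps, instance, fail_fast, partition_key=None):
        """Materialise one asset (optionally a single partition), swallowing
        failures unless fail_fast is set."""
        try:
            materialize(
                [asset, *upstream_deps],  # definitions needed to resolve inputs...
                selection=[asset],        # ...but only materialise this one asset
                instance=instance,
                partition_key=partition_key,
                raise_on_error=True,
            )
        except Exception as e:  # noqa: BLE001
            if fail_fast:
                raise
            print(f"Skipping {asset.key.to_user_string()} ({partition_key}): {e}")  # noqa: T201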

A note on Docker builds

I originally also had a Dockerfile (plus a .dockerignore) which installed all the necessary dependencies and ran the bash script above.

Being able to run via Docker is nice for CD purposes. However, I couldn't / am too lazy to figure out a good way of ensuring that the container does not leak secrets such as the Azure SAS token. It would be a risk to deploy on a cloud service that we don't control, or to push the image to Docker Hub or similar, because someone can just pull the image, launch a shell and echo $SAS_TOKEN to get the value.

It's OK to build and run the image locally, of course; but in that case a Dockerfile seems unnecessary, since the bash script alone would suffice. Hence I've removed it for now; if we ever need one, we can figure it out then.

I think in general it would depend on exactly where we deploy the Dockerfile. If it's on GHA or Azure, for example, we can set secrets as environment variables (which is technically not super safe either, but better than baking them into the image itself).

@penelopeysm (Member, Author)

Tested on Belgium and NI data (with ENV=dev; my internet isn't good enough to run the whole thing) and it works fine!

Development

Successfully merging this pull request may close: Create centralised way to publish production versions of data to Azure