Automated publishing of all data + metadata #131

Open · wants to merge 17 commits into main

Conversation

@penelopeysm (Member) commented Jul 2, 2024

Closes #123

This PR:

  • adds an asset to generate the top-level countries.txt file, along with the corresponding IO managers (CountriesTextIOManager); a rough sketch of what such an IO manager looks like is included after this list
  • adds a popgetter.run module which traverses the dependency list in a named job and runs it asset-by-asset (also sketched after this list). Running python -m popgetter.run all will:
    • look at the $POPGETTER_COUNTRIES env var to see which countries are to be run
    • run the individual country jobs to generate the data
    • run each of the cloud sensor assets individually to publish the data
  • adds a bash script which sets up the necessary environment variables and then publishes all the data
  • adds docs explaining all of this: https://popgetter--131.org.readthedocs.build/en/131/deployment/
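
To make the bullets above a bit more concrete, here are two illustrative sketches. Neither is the actual code in this PR: the assets, IO manager, and paths are placeholders, and both assume a reasonably recent dagster version.

First, a minimal text-file IO manager in the same spirit as CountriesTextIOManager (the real one will differ in details such as where and how it writes):

    import os

    from dagster import ConfigurableIOManager, InputContext, OutputContext

    class NewlineSeparatedTextIOManager(ConfigurableIOManager):
        """Hypothetical: store a list of strings as a newline-separated text file."""

        base_path: str

        def _path(self, context) -> str:
            return os.path.join(self.base_path, *context.asset_key.path) + ".txt"

        def handle_output(self, context: OutputContext, obj: list[str]) -> None:
            path = self._path(context)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w") as f:
                f.write("\n".join(obj))

        def load_input(self, context: InputContext) -> list[str]:
            with open(self._path(context)) as f:
                return f.read().splitlines()

Second, the asset-by-asset idea behind python -m popgetter.run all, reduced to two toy assets and a hard-coded asset list (the real module walks the dependency list of the named job and also runs the cloud sensor assets):

    import os

    from dagster import FilesystemIOManager, asset, materialize

    @asset
    def raw_table() -> list[int]:
        return [1, 2, 3]

    @asset
    def summary(raw_table: list[int]) -> int:
        return sum(raw_table)

    if __name__ == "__main__":
        # Which countries to run, e.g. POPGETTER_COUNTRIES=bel,gb_nir
        countries = [c for c in os.environ.get("POPGETTER_COUNTRIES", "").split(",") if c]
        print(f"countries requested: {countries}")  # noqa: T201

        all_assets = [raw_table, summary]
        # Pin the IO manager to a fixed directory so that each single-asset run
        # can load its upstream dependencies from earlier runs' outputs.
        resources = {"io_manager": FilesystemIOManager(base_dir="/tmp/popgetter-sketch")}

        # Materialise one asset at a time, in dependency order, continuing past
        # failures instead of aborting the whole run.
        for a in all_assets:
            result = materialize(all_assets, selection=[a], resources=resources, raise_on_error=False)
            print(f"{a.key.to_user_string()}: {'ok' if result.success else 'FAILED'}")  # noqa: T201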

How to test

Check out this PR, then run:

POPGETTER_COUNTRIES=bel,gb_nir ENV=dev ./deploy.sh

This will generate all the requisite data somewhere inside /tmp (note, it doesn't use your environment variables from .env). If you want to deploy to Azure:

POPGETTER_COUNTRIES=bel,gb_nir ENV=dev SAS_TOKEN="(whatever)" ./deploy.sh

(the value of SAS_TOKEN needs to be quoted because it contains ampersands, which will otherwise be interpreted by the shell)

The bash script publishes the data to a different directory from the one we've been using so far. To be exact, it publishes to popgetter/prod/{version_number} (link to Azure). The idea is that if the metadata schema changes, we can / should bump the version number, and the deployment script will then regenerate the data in a separate directory from the previous one.

(Ultimately, we should also set the default base path in the CLI to this directory.)
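
For illustration only (the version value here is made up; the real metadata schema version is whatever the deploy script is configured with at the time), the published prefix is just that version interpolated into the path:

    VERSION = "0.1.0"  # hypothetical metadata schema version
    publish_path = f"popgetter/prod/{VERSION}"
    print(publish_path)  # popgetter/prod/0.1.0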

Potential future improvements

  • Don't stop the entire run if one partition fails to materialise (e.g. due to rate limits). For example, Belgium partition 4134 is currently failing with a 404 (the data seems to have been moved away from the link in the catalogue). This is done now; it also means that if you only try to materialise a subset of the countries, the run will still finish successfully. (A sketch of this skip-on-failure behaviour is included after this list.)

  • Run multiple partitions of the same asset in parallel. I've looked into this a bit: we can't use async, because dagster's API doesn't expose an async version of the materialisation call, so we'd have to use something like threading or concurrent.futures inside the popgetter.run module. Something like this:

    import time
    from concurrent.futures import ThreadPoolExecutor

    import matplotlib

    matplotlib.use("agg")  # non-interactive backend

    def run_partition(partition):
        print(f"  - with partition key: {partition}")  # noqa: T201
        try_materialise(
            asset, upstream_deps, instance, fail_fast, partition_key=partition
        )
        time.sleep(delay)

    # Submit every partition to a thread pool and wait for them all;
    # future.result() re-raises any exception from the worker thread.
    with ThreadPoolExecutor(max_workers=max_threads) as executor:
        futures = [
            executor.submit(run_partition, partition)
            for partition in partition_names
        ]
        for future in futures:
            future.result()

    (matplotlib raises errors in worker threads unless you switch it to a non-interactive backend, hence the matplotlib.use("agg") call.) Anyway, I tested this on three partitions of bel/census_tables and it didn't noticeably speed anything up, so I haven't included it in this PR.
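
For reference, the try_materialise helper used in the snippet above isn't shown in this description. A rough sketch of the shape it could take (an assumption about its behaviour, not the actual popgetter.run code) is:

    from dagster import materialize

    def try_materialise(asset, upstream_deps, instance, fail_fast, partition_key=None):
        """Materialise one asset (optionally a single partition), swallowing
        failures unless fail_fast is set."""
        try:
            materialize(
                [asset, *upstream_deps],  # definitions needed to resolve inputs...
                selection=[asset],        # ...but only materialise this one asset
                instance=instance,
                partition_key=partition_key,
                raise_on_error=True,
            )
        except Exception as e:  # noqa: BLE001
            if fail_fast:
                raise
            print(f"Skipping {asset.key.to_user_string()} ({partition_key}): {e}")  # noqa: T201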

A note on Docker builds

I originally also had a Dockerfile (plus a .dockerignore) which installed all the necessary dependencies and ran the bash script above.

Being able to run via Docker is nice for CD purposes. However, I couldn't / am too lazy to figure out a good way of ensuring that the container does not leak secrets such as the Azure SAS token. It would be a risk to deploy on a cloud service that we don't control, or to push the image to Docker Hub or similar, because someone can just pull the image, launch a shell and echo $SAS_TOKEN to get the value.

It's OK to build and run the image locally, of course; but in that case a Dockerfile seems unnecessary, since the bash script alone would suffice. Hence I've removed it for now; if we ever need one, we can figure it out then.

I think in general it would depend on exactly where we deploy the Dockerfile. If it's on GHA or Azure, for example, we can set secrets as environment variables (which is technically not super safe either, but better than baking them into the image itself).

@penelopeysm (Member, Author)

Tested on Belgium and NI data (with ENV=dev; my internet isn't good enough to run the whole thing) and it works fine!

Development

Successfully merging this pull request may close: Create centralised way to publish production versions of data to Azure