Automated publishing of all data + metadata #131
Open

penelopeysm wants to merge 17 commits into main from automated-publishing
Conversation
penelopeysm force-pushed the automated-publishing branch from f59a337 to 2235752 on July 2, 2024 at 07:58
Tested on Belgium and NI data (with `ENV=dev`; my internet isn't good enough to run the whole thing) and it works fine!
Towards #123
This ensures that the partition keys are loaded immediately when the library is imported. Previously we had to wait until the sensor ran before they were added.
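Roughly, the idea is this (a minimal sketch, not the actual popgetter code; the asset, variable, and file names are illustrative):

```python
# Minimal sketch: defining the partitions at module level means the keys
# exist as soon as the library is imported, rather than being registered
# only when a sensor first runs. Names here are illustrative.
from dagster import StaticPartitionsDefinition, asset

# Runs at import time, so the partition keys are available immediately.
with open("countries.txt") as f:
    COUNTRY_IDS = [line.strip() for line in f if line.strip()]

country_partitions = StaticPartitionsDefinition(COUNTRY_IDS)

@asset(partitions_def=country_partitions)
def census_tables(context) -> None:
    ...
```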
penelopeysm force-pushed the automated-publishing branch from 620f235 to 92affe2 on July 12, 2024 at 16:48
Closes #123
This PR:

- Adds a `countries.txt` file, along with the corresponding IO managers (`CountriesTextIOManager`)
- Adds a `popgetter.run` module which traverses the dependency list in a named job and runs it asset-by-asset. Running `python -m popgetter.run all` will:
  - read the `$POPGETTER_COUNTRIES` env var to see which countries are to be run

How to test
Check out this PR, then run:
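Something like the following (a sketch; the exact command may differ, and `bel` as a country ID is an assumption based on the Belgium testing above):

```bash
# Hypothetical test invocation using the variables described in this PR;
# restricting POPGETTER_COUNTRIES keeps the run small.
ENV=dev POPGETTER_COUNTRIES=bel python -m popgetter.run all
```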
This will generate all the requisite data somewhere inside `/tmp` (note, it doesn't use your environment variables from `.env`). If you want to deploy to Azure, run it with `SAS_TOKEN` set (the value of `SAS_TOKEN` needs to be quoted because it contains ampersands, which will otherwise be interpreted by the shell):
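Something along these lines (a sketch with a placeholder token; `ENV=prod` is an assumption):

```bash
# Hypothetical deployment invocation; the single quotes stop the shell from
# interpreting the ampersands inside the SAS token.
ENV=prod SAS_TOKEN='sv=2022-11-02&sig=...' python -m popgetter.run all
```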
The Bash script publishes the data to a different directory from the one we've been using so far. To be exact, it publishes to `popgetter/prod/{version_number}` (link to Azure). The idea is that if the metadata schema changes, we can / should bump the version number, and the deployment script will then regenerate the data in a separate directory from the previous one. (Ultimately, we should also set the default base path in the CLI to this directory.)
Potential future improvements
- Don't stop the entire thing if one partition fails to materialise (e.g. due to rate limits, etc.). For example, Belgium partition 4134 is failing right now due to a 404 (the data seems to have been moved away from the link in the catalogue). Done now; this also means that if you only try to materialise a subset of the countries it will successfully finish.
- Run multiple partitions of the same asset in parallel. (I've looked into this a bit, and we can't use async because Dagster's API doesn't expose an async version of this; we'd have to use something like `threading` or `concurrent.futures` inside the `popgetter.run` module, along the lines of the sketch after this list. (matplotlib errors in side threads if you don't use a non-interactive backend.) Anyway, I tested this for three partitions of `bel/census_tables` and it really didn't speed anything up, so I didn't include it in this PR.)
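A hypothetical sketch of what that could look like (the helper name and partition keys are made up, not the actual `popgetter.run` API):

```python
# Materialise a few partitions of one asset across worker threads using
# concurrent.futures. Agg is matplotlib's non-interactive backend, which
# avoids the errors matplotlib raises when plotting off the main thread.
import concurrent.futures

import matplotlib
matplotlib.use("Agg")  # must be selected before pyplot is first imported

def materialise_partition(partition_key: str) -> None:
    """Stand-in for the real call that materialises one partition."""
    ...

partition_keys = ["4134", "4135", "4136"]
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(materialise_partition, k) for k in partition_keys]
    for future in concurrent.futures.as_completed(futures):
        future.result()  # surfaces any exception raised in the worker thread
```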
A note on Docker builds
I originally also had a Dockerfile (plus a `.dockerignore`) which installed all the necessary dependencies and ran the bash script above.
Being able to run via Docker is nice for CD purposes. However, I couldn't / am too lazy to figure out a good way of ensuring that the container does not leak secrets such as the Azure SAS token. It would be a risk to deploy on a cloud service that we don't control, or to push the image to Docker Hub or similar, because someone can just pull the image, launch a shell, and `echo $SAS_TOKEN` to get the value.

It's OK to build and run the image locally, of course; however, in that case a Dockerfile seems unnecessary, and just the bash script would suffice. Hence I've removed it for now. I think that if we ever need one we can figure it out then.
I think in general it would depend on exactly where we deploy the Dockerfile. If it's on GHA or Azure, for example, we can set secrets as env vars (which is technically not super safe either, but better than baking them into the image itself).
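For instance (a sketch; the image name is made up):

```bash
# Hypothetical: inject the secret at runtime rather than baking it into the
# image, so it never appears in the image layers or registry.
docker run --rm -e SAS_TOKEN="$SAS_TOKEN" popgetter-publish
```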