Enable sec10k distribution #4026
@@ -42,6 +42,7 @@ ENV CONTAINER_PUDL_WORKSPACE=${CONTAINER_HOME}/pudl_work
ENV PUDL_INPUT=${CONTAINER_PUDL_WORKSPACE}/input
ENV PUDL_OUTPUT=${CONTAINER_PUDL_WORKSPACE}/output
ENV DAGSTER_HOME=${CONTAINER_PUDL_WORKSPACE}/dagster_home
ENV USE_PUDL_MODELS=True
Set environment variable to load PUDL models data during nightly builds.
)
+ get_pudl_models_assets()
Append the assets that load PUDL models tables to the default set of assets. If USE_PUDL_MODELS
is not set, no assets are added.
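As a rough illustration of this gating pattern, here is a minimal sketch. It is not the PR's implementation: the helper names, the placeholder asset names, and the choice of which values count as "set" are all assumptions made for the example, and plain strings stand in for real Dagster assets.

```python
import os


def use_pudl_models(env=None) -> bool:
    """Return True when USE_PUDL_MODELS holds a truthy value.

    Assumption: "1", "true" and "yes" (any case) count as set. How the
    PR itself parses the variable is not shown on this page.
    """
    env = os.environ if env is None else env
    return env.get("USE_PUDL_MODELS", "").strip().lower() in {"1", "true", "yes"}


def get_optional_models_assets(env=None) -> list[str]:
    """Hypothetical stand-in for get_pudl_models_assets().

    Returns placeholder asset names instead of Dagster asset objects,
    and an empty list when the environment variable is not set.
    """
    if not use_pudl_models(env):
        return []
    return ["hypothetical_models_table"]


# Placeholder names for the default asset set.
default_assets = ["core_table_a", "core_table_b"]

# With the variable set, the optional assets are appended.
with_models = default_assets + get_optional_models_assets({"USE_PUDL_MODELS": "True"})

# Without it, the default set is left unchanged.
without_models = default_assets + get_optional_models_assets({})
```

Returning an empty list rather than raising keeps the call site a plain concatenation, so contributors without cloud credentials never reach the code that touches remote storage.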
@@ -322,13 +322,21 @@ def load_input(self, context: InputContext) -> pd.DataFrame:
class PudlParquetIOManager(IOManager):
    """IOManager that writes pudl tables to pyarrow parquet files."""

    def _get_table_resource(self, table_name: str) -> Resource:
This handles metadata for PUDL models tables slightly differently from all other tables.
@@ -572,6 +573,24 @@ class Field(PudlMeta):
    harvest: FieldHarvest = FieldHarvest()
    encoder: Encoder | None = None

    @classmethod
PUDL models schemas are stored as pyarrow
schemas, which are converted to our standard metadata structures here.
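To make the idea of that conversion concrete, here is a small sketch that does not depend on pyarrow. It is not the classmethod added in this PR. It assumes each schema field exposes `name`, `type` and `nullable` (as `pyarrow.Field` does), maps the string form of the Arrow type to a field type name, and treats non-nullable fields as required. The mapping table, the output dictionary layout, the `ArrowFieldStandIn` class and the column names are all hypothetical.

```python
from dataclasses import dataclass

# Assumed mapping from the base name of an Arrow type string to a
# metadata field type. The real conversion may cover more types.
ARROW_TO_FIELD_TYPE = {
    "int": "integer",
    "uint": "integer",
    "float": "number",
    "double": "number",
    "bool": "boolean",
    "string": "string",
    "large_string": "string",
    "date": "date",
    "timestamp": "datetime",
}


def field_type_from_arrow(type_str: str) -> str:
    """Map an Arrow type string such as "int64" or "timestamp[ms]"."""
    # Drop parameters ("timestamp[ms]" -> "timestamp"), then the
    # trailing bit width ("int64" -> "int", "date32" -> "date").
    base = type_str.split("[", 1)[0].rstrip("0123456789")
    return ARROW_TO_FIELD_TYPE[base]  # KeyError for unsupported types


@dataclass
class ArrowFieldStandIn:
    """Hypothetical stand-in exposing the attributes used below."""

    name: str
    type: str
    nullable: bool = True


def fields_from_arrow_schema(schema) -> list[dict]:
    """Build one metadata dictionary per field in an iterable schema."""
    return [
        {
            "name": f.name,
            "type": field_type_from_arrow(str(f.type)),
            # Assumption: a non-nullable Arrow field is a required field.
            "constraints": {"required": not f.nullable},
        }
        for f in schema
    ]


# Hypothetical columns, for illustration only.
example_schema = [
    ArrowFieldStandIn("filer_id", "int64", nullable=False),
    ArrowFieldStandIn("company_name", "string"),
    ArrowFieldStandIn("filing_date", "date32[day]"),
    ArrowFieldStandIn("fraction_owned", "double"),
]
converted = fields_from_arrow_schema(example_schema)
```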
This looks good!
We probably don't need to generate the assets in every nightly build if it's resource intensive, but this is good for now!
Overview
This PR enables the distribution of SEC10k and other future datasets produced by what will be the pudl-modelling repo. This repo is currently called mozilla_sec_eia and has been used to develop models for extracting company ownership data from SEC10k filings and linking this data to EIA data. This work lives separately from the main PUDL repo because training and running the models is highly resource intensive and would not interact well with our current nightly build setup. It also allows us to isolate the complex dependencies from PUDL.

Interacting with this data is fairly straightforward on the PUDL side: all that is required is to load a table and its schema information from cloud storage, then write this data alongside the rest of the PUDL data. Given that outside contributors won't have credentials to access cloud resources, I've added an environment variable, USE_PUDL_MODELS, that is checked before ever trying to load any PUDL models data. This environment variable needs to be set for testing and nightly builds, but most developers can ignore it.