Enable sec10k distribution #4026

Merged
merged 13 commits into from
Jan 25, 2025

@zschira zschira commented Jan 24, 2025

Overview

This PR enables the distribution of SEC10k and other future datasets produced by what will be the pudl-modelling repo. This repo is currently called mozilla_sec_eia and has been used to develop models for extracting company ownership data from SEC10k filings and linking this data to EIA data. This work lives separately from the main PUDL repo because training and running the models is highly resource intensive and would not interact well with our current nightly build setup. Keeping it separate also lets us isolate the models' complex dependencies from PUDL.

Interacting with this data is fairly straightforward on the PUDL side: all that is required is to load a table and its schema information from cloud storage, then write this data alongside the rest of the PUDL data. Given that outside contributors won't have credentials to access cloud resources, I've added an environment variable, USE_PUDL_MODELS, that is checked before ever trying to load any PUDL models data. This environment variable needs to be set for testing and nightly builds, but most developers can ignore it.
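For illustration, a minimal sketch of how such an environment-variable gate could look. The helper names `use_pudl_models` and `load_models_table`, and the `loader` callable, are hypothetical and not taken from this PR. The sketch also assumes that any non-empty value enables loading, which may differ from how PUDL actually interprets the variable.

```python
import os


def use_pudl_models(environ=None) -> bool:
    """Return True when USE_PUDL_MODELS is set to a non-empty value.

    Assumption: any non-empty string (even "False") counts as enabled.
    """
    env = os.environ if environ is None else environ
    return bool(env.get("USE_PUDL_MODELS"))


def load_models_table(table_name, loader, environ=None):
    """Call the (hypothetical) loader only when the flag is set.

    Returns None without touching cloud storage when the flag is unset,
    so contributors without credentials never trigger a remote read.
    """
    if not use_pudl_models(environ):
        return None
    return loader(table_name)
```

With the flag unset, `load_models_table` returns `None` and the loader is never called, which is the behavior a contributor without cloud credentials would rely on.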

@zschira zschira requested a review from katie-lamb January 24, 2025 16:54
@zschira zschira self-assigned this Jan 24, 2025
@@ -42,6 +42,7 @@ ENV CONTAINER_PUDL_WORKSPACE=${CONTAINER_HOME}/pudl_work
ENV PUDL_INPUT=${CONTAINER_PUDL_WORKSPACE}/input
ENV PUDL_OUTPUT=${CONTAINER_PUDL_WORKSPACE}/output
ENV DAGSTER_HOME=${CONTAINER_PUDL_WORKSPACE}/dagster_home
+ENV USE_PUDL_MODELS=True

Set environment variable to load PUDL models data during nightly builds.

)
+ get_pudl_models_assets()

Append the assets that load PUDL models tables to the default set of assets. If USE_PUDL_MODELS is not set, no assets are added.
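The append-or-nothing behavior described in this comment can be sketched as follows. This is not the body of `get_pudl_models_assets()`: the function `get_models_assets`, the asset names, and the table name are all placeholders, and plain strings stand in for real Dagster asset definitions.

```python
import os


def get_models_assets(table_names, environ=None):
    """Return placeholder asset names, or an empty list when the flag is unset."""
    env = os.environ if environ is None else environ
    if not env.get("USE_PUDL_MODELS"):
        return []
    return [f"load_{name}" for name in table_names]


# Placeholder default asset list; the real list holds Dagster assets.
core_assets = ["raw_assets", "core_assets"]

# Flag set: one loader asset per models table is appended.
with_flag = core_assets + get_models_assets(
    ["example_table"], {"USE_PUDL_MODELS": "True"}
)

# Flag unset: concatenating an empty list leaves the defaults unchanged.
without_flag = core_assets + get_models_assets(["example_table"], {})
```

Returning an empty list rather than raising keeps the default asset definitions importable on machines where the flag is unset.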

@@ -322,13 +322,21 @@ def load_input(self, context: InputContext) -> pd.DataFrame:
class PudlParquetIOManager(IOManager):
    """IOManager that writes pudl tables to pyarrow parquet files."""

    def _get_table_resource(self, table_name: str) -> Resource:

This handles metadata for PUDL models tables slightly differently from all other tables.

@@ -572,6 +573,24 @@ class Field(PudlMeta):
    harvest: FieldHarvest = FieldHarvest()
    encoder: Encoder | None = None

    @classmethod

PUDL models schemas are stored as pyarrow schemas, which are converted to our standard metadata structures here.
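As a rough sketch of what such a conversion involves (not the classmethod added in this PR), the function below maps Arrow-style field descriptions to plain dictionaries. The type mapping, the output dictionary layout, and the names `ARROW_TO_FIELD_TYPE`, `field_to_dict`, `schema_to_fields`, and `FakeField` are all assumptions made for illustration. It relies only on each field exposing `name`, `type`, and `nullable` attributes, as `pyarrow.Field` objects do, so a small stand-in dataclass is used here instead of importing pyarrow.

```python
from dataclasses import dataclass

# Assumed mapping from the string form of an Arrow type to a field type name.
ARROW_TO_FIELD_TYPE = {
    "string": "string",
    "large_string": "string",
    "int32": "integer",
    "int64": "integer",
    "float": "number",
    "double": "number",
    "bool": "boolean",
    "date32[day]": "date",
}


def field_to_dict(arrow_field) -> dict:
    """Convert one Arrow-style field into a metadata dictionary."""
    type_name = str(arrow_field.type)
    if type_name.startswith("timestamp"):
        # Timestamps carry a unit suffix such as "timestamp[ms]".
        field_type = "datetime"
    else:
        field_type = ARROW_TO_FIELD_TYPE[type_name]
    return {
        "name": arrow_field.name,
        "type": field_type,
        "constraints": {"required": not arrow_field.nullable},
    }


def schema_to_fields(schema) -> list[dict]:
    """Convert every field in an iterable schema."""
    return [field_to_dict(f) for f in schema]


@dataclass
class FakeField:
    """Stand-in exposing the attributes this sketch reads from a field."""

    name: str
    type: str
    nullable: bool = True
```

For example, `schema_to_fields([FakeField("company_name", "string", nullable=False)])` yields a single dictionary of type `"string"` marked as required. An unmapped Arrow type raises `KeyError`, which surfaces schema drift early instead of silently guessing a type.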

@zschira zschira enabled auto-merge January 24, 2025 20:26
@katie-lamb katie-lamb left a comment

This looks good!

We probably don't need to generate the assets in every nightly build if it's resource intensive, but good for now!

@zschira zschira added this pull request to the merge queue Jan 25, 2025
Merged via the queue into main with commit f200fa0 Jan 25, 2025
17 checks passed
@zschira zschira deleted the sec-distribution branch January 25, 2025 20:00