SEC 10K Tables Missing from PUDL Data Dictionary #4033

Open
zaneselvans opened this issue Jan 28, 2025 · 3 comments
Labels
docs: Documentation for users and contributors.
metadata: Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages.
mozilla_sec_to_eia: Mozilla AI for EJ grant to link SEC utility ownership data to EIA operational data.

Comments

@zaneselvans
Member

zaneselvans commented Jan 28, 2025

The metadata defining the SEC 10K tables is only pulled into our asset definitions when USE_PUDL_MODELS is set, and it is not currently set in our ReadTheDocs environment, so the SEC 10K tables do not appear in our published documentation.

Additionally, because that table metadata is being pulled from a remote location that requires authentication, it seems like we would somehow need to provide permissions to the docs build environment on RTD, and I'm not sure if or how we can do that.

See e.g. the lack of search results in the docs corresponding to the latest version of our repo.

With this environment variable set and my cloud permissions available locally, I'm able to build the documentation, and it contains the new SEC 10K tables in the data dictionary.
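For anyone following along, here is a minimal sketch of the kind of environment-variable gating described above. It is not the actual PUDL implementation: the function names, the set of values treated as "set", and the table names are all hypothetical, and the remote loader is passed in as a plain callable so the sketch runs without any cloud credentials.

```python
import os
from typing import Callable, Iterable, List, Mapping, Optional

# Assumption: which values count as "set" is a guess, not PUDL's documented behavior.
TRUTHY = {"1", "true", "yes", "on"}


def models_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if USE_PUDL_MODELS holds a truthy-looking value."""
    env = os.environ if env is None else env
    return env.get("USE_PUDL_MODELS", "").strip().lower() in TRUTHY


def collect_table_names(
    core_tables: Iterable[str],
    load_model_tables: Callable[[], Iterable[str]],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Combine always-available tables with the optional remote ones.

    The remote loader (hypothetical) is only called when the flag is set,
    so an environment without credentials never touches the remote store.
    """
    names = list(core_tables)
    if models_enabled(env):
        names.extend(load_model_tables())
    return names


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    print(collect_table_names(
        ["example_core_table"],
        lambda: ["example_sec10k_table"],
    ))
```

With the flag unset, only the core names come back, which matches the symptom here: a docs build that never sets the variable never sees the SEC 10K tables.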

@zaneselvans zaneselvans converted this from a draft issue Jan 28, 2025
@zaneselvans zaneselvans added the docs, metadata, and mozilla_sec_to_eia labels Jan 28, 2025
@zaneselvans
Member Author

Should we be setting USE_PUDL_MODELS on the GitHub runner where the integration tests are getting run in the merge queue as well? Do we need to give the runners any additional permissions to access this data?

@zaneselvans
Member Author

I'm also getting an error related to DeltaLake in the pudl-archiver repository. Even just trying to get it to import...

pudl_archiver --help

Results in:

Traceback (most recent call last):
  File "/Users/zane/miniforge3/envs/pudl-cataloger/bin/pudl_archiver", line 5, in <module>
    from pudl_archiver.cli import main
  File "/Users/zane/code/catalyst/pudl-archiver/src/pudl_archiver/__init__.py", line 10, in <module>
    import pudl_archiver.orchestrator  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/code/catalyst/pudl-archiver/src/pudl_archiver/orchestrator.py", line 7, in <module>
    from pudl_archiver.archivers.classes import AbstractDatasetArchiver
  File "/Users/zane/code/catalyst/pudl-archiver/src/pudl_archiver/archivers/classes.py", line 19, in <module>
    from pudl_archiver.archivers import validate
  File "/Users/zane/code/catalyst/pudl-archiver/src/pudl_archiver/archivers/validate.py", line 14, in <module>
    from pudl_archiver.frictionless import DataPackage, Resource, ZipLayout
  File "/Users/zane/code/catalyst/pudl-archiver/src/pudl_archiver/frictionless.py", line 10, in <module>
    from pudl.metadata.classes import Contributor, DataSource, License
  File "/Users/zane/miniforge3/envs/pudl-cataloger/lib/python3.12/site-packages/pudl/__init__.py", line 5, in <module>
    from . import (
  File "/Users/zane/miniforge3/envs/pudl-cataloger/lib/python3.12/site-packages/pudl/analysis/__init__.py", line 9, in <module>
    from . import (
  File "/Users/zane/miniforge3/envs/pudl-cataloger/lib/python3.12/site-packages/pudl/analysis/allocate_gen_fuel.py", line 145, in <module>
    from pudl.metadata.fields import apply_pudl_dtypes
  File "/Users/zane/miniforge3/envs/pudl-cataloger/lib/python3.12/site-packages/pudl/metadata/__init__.py", line 3, in <module>
    from . import (
  File "/Users/zane/miniforge3/envs/pudl-cataloger/lib/python3.12/site-packages/pudl/metadata/classes.py", line 2158, in <module>
    PUDL_PACKAGE = Package.from_resource_ids()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-cataloger/lib/python3.12/site-packages/pudl/metadata/classes.py", line 2021, in from_resource_ids
    for name, description, schema in get_model_table_schemas()
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-cataloger/lib/python3.12/site-packages/pudl/analysis/pudl_models.py", line 49, in get_model_table_schemas
    dts = [DeltaTable(_get_table_uri(table_name)) for table_name in get_model_tables()]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-cataloger/lib/python3.12/site-packages/deltalake/table.py", line 418, in __init__
    self._table = RawDeltaTable(
                  ^^^^^^^^^^^^^^
_internal.TableNotFoundError: no log files
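The traceback shows the exception being raised while `pudl` itself is imported, before any `pudl_archiver` code runs. A small helper like the one below (a sketch, with a hypothetical function name) can confirm whether importing a given module fails on its own, without the whole CLI crashing:

```python
import importlib
from typing import Tuple


def try_import(module_name: str) -> Tuple[bool, str]:
    """Attempt an import and report the failure instead of raising."""
    try:
        importlib.import_module(module_name)
    except Exception as err:
        # Import-time code can raise anything, not only ImportError,
        # as the TableNotFoundError above demonstrates.
        return False, f"{type(err).__name__}: {err}"
    return True, ""


if __name__ == "__main__":
    for name in ("pudl", "pudl_archiver"):
        ok, message = try_import(name)
        print(f"{name}: {'ok' if ok else message}")
```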

@zaneselvans
Member Author

Updating the pudl-cataloger environment with mamba env update did not fix this.

However, removing the pudl-cataloger environment completely and rebuilding it from scratch seems to have resolved it:

mamba env remove -n pudl-cataloger
mamba env create -f environment.yml
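If someone hits the same thing, one way to check whether an in-place update left stale packages behind is to print the installed versions before and after the rebuild. This is only a sketch: the distribution names passed in at the bottom are assumptions about how the packages are published, and the function name is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(distributions: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in distributions:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Assumed distribution names; adjust to match environment.yml.
    for name, ver in installed_versions(["catalystcoop.pudl", "deltalake"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running it inside the pudl-cataloger environment and comparing the output against the pins in environment.yml should show whether the update actually changed anything.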

Projects
Status: Backlog