Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix FIPS coding in PUDL #3884

Open
e-belfer opened this issue Sep 30, 2024 · 1 comment
Open

Fix FIPS coding in PUDL #3884

e-belfer opened this issue Sep 30, 2024 · 1 comment
Labels
bug Things that are just plain broken.

Comments

@e-belfer
Copy link
Member

e-belfer commented Sep 30, 2024

Is your feature request related to a problem? Please describe.
In #3531 we've learned that the addfips package we rely on to map state/county data into a FIPS code is irregularly maintained. This is causing problems for downstream users, and we should consider deprecating this package and moving to a different solution.

Describe the solution you'd like
There are a few different use cases for FIPS encoding that we could hypothetically imagine for PUDL.
Right now, add_fips_ids in pudl.src.helpers.py is the only place we import the addfips package. This method is expects a county and state name, and a FIPS vintage:

def add_fips_ids(
    df: pd.DataFrame,
    state_col: str = "state",
    county_col: str = "county",
    vintage: int = 2015,
) -> pd.DataFrame:
    """Add State and County FIPS IDs to a dataframe.

    To just add State FIPS IDs, make county_col = None.
    """
    # force the columns to be the nullable string types so we have a consistent
    # null value to filter out before feeding to addfips
    df = df.astype({state_col: pd.StringDtype()})
    if county_col:
        df = df.astype({county_col: pd.StringDtype()})
    af = addfips.AddFIPS(vintage=vintage)
    # Lookup the state and county FIPS IDs and add them to the dataframe:
    df["state_id_fips"] = df.apply(
        lambda x: (
            af.get_state_fips(state=x[state_col]) if pd.notnull(x[state_col]) else pd.NA
        ),
        axis=1,
    )

    # force the code columns to be nullable strings - the leading zeros are
    # important
    df = df.astype({"state_id_fips": pd.StringDtype()})

    logger.info(
        f"Assigned state FIPS codes for "
        f"{len(df[df.state_id_fips.notnull()])/len(df):.2%} of records."
    )
    if county_col:
        df["county_id_fips"] = df.apply(
            lambda x: (
                af.get_county_fips(state=x[state_col], county=x[county_col])
                if pd.notnull(x[county_col]) and pd.notnull(x[state_col])
                else pd.NA
            ),
            axis=1,
        )
        # force the code columns to be nullable strings - the leading zeros are
        # important
        df = df.astype({"county_id_fips": pd.StringDtype()})
        logger.info(
            f"Assigned county FIPS codes for "
            f"{len(df[df.county_id_fips.notnull()])/len(df):.2%} of records."
        )
    return df

Approaches to resolving the problem

County and state names -> FIPS Code

This is our current FIPS encoding use case. We could:

  • create a fork of addfips, and implement the fixes we want to see that solve the problem identified in Incorrect county FIPS code for Bedford, VA #3531. We'd relatively easily benefit from any upstream changes, and if the project returns to being more actively maintained it wouldn't be particularly complex to switch back.
  • write our own method to convert state/county codes to FIPS codes drawing on the addfips methodology. As is noted in Add FIPS codes to power plant database #338 and below, the slight string cleaning implemented in addfips is definitely of relevance to us, so we'd want to replicate it at minimum but it shouldn't require heavy lifting to recreate.
  • explore a maximalist version, which involves encoding FIPS from addresses or lat/lon coordinates rather than county/state. However, this function is currently used for EIA 860, 861 and 923 data, and at least 861 and 923 data don't have complete addresses or lat/lon coordinates, so this likely won't meet our needs. The pygris package seems like a great API-free option for implementing this functionality down the line, though!

Address -> Lat lon -> FIPS Code or lat/lon -> FIPS code

As mentioned above, this wouldn't address the use case for 2/3 EIA datasets we're currently using the addfips package for. See pygris package, which seems interesting if we want to explore this further, but this is out of scope for the actual problem we're trying to resolve. See also #3531 for some discussion on other ways to implement this.

Proposed solutions

  1. Fork addfips and implement our desired changes, then do our own release so we can import this as a package.
    Pros: We don't have to write much code, and we can keep up to date with any changes made in addfips.
    Cons: Extremely annoying and we're likely to forget about this. We're still relying on a largely unmaintained package.

  2. Recreate the logic of addfips in pudl.helpers.py using Census data directly (e.g., this 2023 data or by calling the Census's geometry API, see [here](https://www.census.gov/data/developers/data-sets/geo-info.html) for more info).
    Pros: We are our own maintainers.
    Cons: We have to recreate the fuzzy-matching logic in the package we're currently using to prevent performance loss.

  3. Find some other new Python package to meet our needs.

  • us is maintained but doesn't yet support counties despite an open PR on this topic from 2017: https://github.com/unitedstates/python-us/. The census package depends on it, so there's no county support there either.
  • I have yet to find another comparable alternative
@e-belfer e-belfer added the bug Things that are just plain broken. label Sep 30, 2024
@e-belfer e-belfer changed the title Improve FIPS coding in PUDL Fix FIPS coding in PUDL Sep 30, 2024
@TrentonBush
Copy link
Contributor

The only fuzzy matching addfips does is to replace 12 diacritics (like "é") and standardize 3 abbreviations like "Saint"/"St.". The code is like 10 lines. Your input data has to have perfect quality modulo those two replacements or the matches will fail.

The cardinal sin of addfips is to not maintain the source census data or handle the slight changes over time, which is the only thing that needs maintenance! But I'll stop there because I think I've written more lines complaining about addfips than there are lines of code in the entire package 😂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Things that are just plain broken.
Projects
Status: Backlog
Development

No branches or pull requests

2 participants