Data V2 #59

Closed · wants to merge 11 commits from the data_v2 branch

Conversation

emackev (Contributor) commented Oct 31, 2023

This PR is for developing the next version of our data pipeline. See docs/data_sources.ipynb for the variables that are currently in the dataset. Desired features for the new pipeline:

  • Data cleaning process documented in code. Each variable has:
  1. A high-level description in data_sources.ipynb.
  2. An initial cleaning pipeline (pre-SQL): convert info from websites into wide format with GeoFIPS, GeoName, etc. columns, and perform any dataset-specific cleaning. gdp_wide.csv is the canonical example. The results of the initial cleaning should be stored in data/raw.
  3. Final steps to make each variable compatible with the rest of our data (handle exclusions, standardize data, etc.). This final step can use the same code for all datasets (i.e., the clean_variable code). The results of the final steps should be stored in data/processed (as 4 CSVs: wide vs. long, raw vs. std). A rough sketch of this flow is below.
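
A minimal sketch of the shared final step I have in mind, assuming the data/raw and data/processed layout above. The function name, the location of included_counties.csv, the min-max standardization, and the assumption that the remaining columns are years are all placeholders, not the actual clean_variable implementation:

```python
from pathlib import Path

import pandas as pd


def clean_variable_sketch(name: str, raw_dir: Path, processed_dir: Path) -> None:
    """Placeholder for the shared final step (the real clean_variable may differ)."""
    # Initial-cleaning output, e.g. data/raw/gdp_wide.csv.
    wide = pd.read_csv(raw_dir / f"{name}_wide.csv")

    # Handle exclusions (assumes included_counties.csv lives next to the raw CSVs).
    included = pd.read_csv(raw_dir / "included_counties.csv")
    wide = wide[wide.GeoFIPS.isin(included.loc[included.Included, "GeoFIPS"])]

    # Standardize the value columns (illustrative min-max scaling).
    value_cols = wide.columns.difference(["GeoFIPS", "GeoName"])
    std = wide.copy()
    std[value_cols] = (wide[value_cols] - wide[value_cols].min()) / (
        wide[value_cols].max() - wide[value_cols].min()
    )

    # Long versions, assuming the non-ID columns are years.
    id_vars = ["GeoFIPS", "GeoName"]
    wide_long = wide.melt(id_vars=id_vars, var_name="Year", value_name="Value")
    std_long = std.melt(id_vars=id_vars, var_name="Year", value_name="Value")

    # The 4 CSVs: wide vs long, raw vs std.
    wide.to_csv(processed_dir / f"{name}_wide.csv", index=False)
    std.to_csv(processed_dir / f"{name}_std_wide.csv", index=False)
    wide_long.to_csv(processed_dir / f"{name}_long.csv", index=False)
    std_long.to_csv(processed_dir / f"{name}_std_long.csv", index=False)
```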

Steps to do:

  • Replace exclusions.pkl with a file included_counties.csv, with columns GeoFIPS, GeoName, Included, Explanation. Initially, leave Explanation blank -- fill it in while constructing datasets (see the sketch after this list).
  • Make sure variable units are in data_sources.ipynb
  • Some paths are hard-coded -- use find_repo_root instead.
  • In included_counties.csv, for counties where Included is False, Explanation should say something like "Missing 2012 unemployment data from bls.gov"
  • inspect_variable function for sanity-check plotting of variables. See inspect_data.ipynb in docs/experimental_notebooks (rough sketch after this list)
  • list_all_counties function (sketched alongside inspect_variable below)
  • Refactor the unemployment data pipeline in the way described above (currently, the initial cleaning step is in docs/experimental_notebooks/clean_unemployment.ipynb)
  • Refactor the spending variable pipelines in the way described above (see docs/experimental_notebooks/grants_from_fips.ipynb for the relevant notebook calls)
  • Refactor the other variable pipelines in the way described above
  • Functions for getting additional info about a particular county, e.g. grant_df_from_fips
  • Dataset stored as a SQL database (connect with Ria about how to structure it)
  • DataGrabber grabs from the SQL database in SQL-standard ways (illustrative sketch after this list)
  • Consistency tests (example below)
  • Add additional variables
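
For the included_counties.csv item, roughly how the file could be built from the current exclusions. The stand-in find_repo_root, the pickle contents, and the file locations are assumptions:

```python
import pickle
from pathlib import Path

import pandas as pd


def find_repo_root() -> Path:
    """Stand-in for the existing helper, so the sketch runs on its own."""
    path = Path.cwd().resolve()
    while path != path.parent and not (path / ".git").exists():
        path = path.parent
    return path


root = find_repo_root()

# Assumption: exclusions.pkl holds an iterable of excluded GeoFIPS codes.
with open(root / "data" / "raw" / "exclusions.pkl", "rb") as f:
    excluded_fips = set(pickle.load(f))

# Seed the county list from any initially-cleaned wide CSV, e.g. gdp_wide.csv.
counties = pd.read_csv(root / "data" / "raw" / "gdp_wide.csv", usecols=["GeoFIPS", "GeoName"])
counties["Included"] = ~counties.GeoFIPS.isin(excluded_fips)
# Leave Explanation blank for now; fill in reasons like
# "Missing 2012 unemployment data from bls.gov" while constructing datasets.
counties["Explanation"] = ""

counties.to_csv(root / "data" / "raw" / "included_counties.csv", index=False)
```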
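For inspect_variable and list_all_counties, the kind of shape I'd expect. The plotting choice (boxplot by year) and the long-CSV naming are placeholders; the real inspect_variable should follow inspect_data.ipynb:

```python
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd


def list_all_counties(counties_csv: Path) -> pd.DataFrame:
    """All counties known to the dataset, read from included_counties.csv."""
    counties = pd.read_csv(counties_csv)
    return counties[["GeoFIPS", "GeoName"]].drop_duplicates().sort_values("GeoFIPS")


def inspect_variable(name: str, processed_dir: Path) -> None:
    """Sanity-check plot: distribution of a variable's values per year."""
    long_df = pd.read_csv(processed_dir / f"{name}_long.csv")
    long_df.boxplot(column="Value", by="Year", rot=90)
    plt.suptitle("")
    plt.title(f"{name}: value distribution by year")
    plt.show()
```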
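For the two SQL items, purely illustrative since the schema is still to be worked out with Ria -- the point is only that DataGrabber would issue ordinary SQL rather than read CSVs. The table layout and method name below are made up, not DataGrabber's current interface:

```python
import sqlite3

import pandas as pd


class DataGrabberSketch:
    """Illustrative only; assumes one long-format table per variable."""

    def __init__(self, db_path: str) -> None:
        self.conn = sqlite3.connect(db_path)

    def get_variable(self, name: str, fmt: str = "std_long") -> pd.DataFrame:
        # Hypothetical table naming, e.g. gdp_std_long with
        # GeoFIPS, GeoName, Year, Value columns.
        return pd.read_sql(f"SELECT * FROM {name}_{fmt}", self.conn)
```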
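And for the consistency tests, the flavor of pytest checks I have in mind, assuming the file layout and names used in the sketches above; the variable list is a placeholder:

```python
from pathlib import Path

import pandas as pd
import pytest

PROCESSED = Path("data/processed")  # assumed layout
VARIABLES = ["gdp"]  # placeholder; in practice, discover from data/processed


@pytest.mark.parametrize("name", VARIABLES)
def test_wide_and_long_agree(name):
    wide = pd.read_csv(PROCESSED / f"{name}_wide.csv")
    long_df = pd.read_csv(PROCESSED / f"{name}_long.csv")
    # Same counties in both shapes, and every (county, year) cell accounted for.
    assert set(wide.GeoFIPS) == set(long_df.GeoFIPS)
    n_years = len(wide.columns) - 2  # everything except GeoFIPS, GeoName
    assert len(long_df) == len(wide) * n_years


@pytest.mark.parametrize("name", VARIABLES)
def test_only_included_counties(name):
    wide = pd.read_csv(PROCESSED / f"{name}_wide.csv")
    included = pd.read_csv(Path("data/raw") / "included_counties.csv")
    allowed = included.loc[included.Included, "GeoFIPS"]
    assert wide.GeoFIPS.isin(allowed).all()
```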

emackev (Contributor, Author) commented Nov 13, 2023

riadas self-assigned this Nov 13, 2023
rfl-urbaniak (Contributor) commented:

Superseded by the current data pipeline.

rfl-urbaniak deleted the data_v2 branch November 15, 2024 09:24