migrate from `csv` files to `sqlite` databases for downstream use in queries #120

rfl-urbaniak · 2024-03-12T08:18:04Z

All csv data are now in two .db files for the two levels of analysis (counties and msa). Prior to deployment
of dbs to the polis server, these live locally. As the db files are now too large to store on GitHub, the user needs to
run csv_to_db_pipeline.py before the first use to generate the db locally.
The original DataGrabber classes have been refactored, renamed to ...CSV and decoupled from use dowstream.
The new DataGrabberDB class has been introduced and passed on to function as generic DataGrabber and MSADataGrabber.
Additional tests for DataGrabberDB have been introduced in test_data_grabber_sql. Additionally, DataGrabberDB under the generic alias passes all the tests that the original DataGrabber did.
generate_sql.ipynb (docs/experimental) contains performance tests for both approaches. At least in the current setting the orignal method is faster. The main culprit seems to be:

13    0.013    0.001    0.013    0.001 {method 'fetchall' of 'sqlite3.Cursor' objects}

This is not too surprising, after some reflection, as illustrated by this comment from ChatGPT:

As the ultimate tests of the switch to DB would involve data updates and model retraining, I leave the original .csv files and classes until those events. Keep in mind they are now not needed for queries to work correctly (they are needed to generate the .db files and for some of the tests).
The new pytest release leads to incompatibilities that might be worth investigating later. For now, fixed the pytest version to be 7.4.3. in setup.py.
Some cleaning scripts have been moved to a subfolder, which required a small refactoring of import statements in generic data cleaning pipeline scripts.
Incorrect indentation in a DataGrabber test has been fixed.
It turns out that isort with --profile black on the runner still works as if without this profile. Checked versions between local install and the one on the runner, the version numbers are the same. More people have similar issues, I suspended isort and decided to trust black, at least till stable isort 6.0 gets out.
Inference tests succeed with pyro-ppl==1.8.5 fail with pyro-ppl==1.9. For now fixed the version number in setup.py, but will think about investigating this deeper.

rfl-urbaniak · 2024-03-12T10:25:32Z

Have been running into this issue apparently.

PyCQA/isort#1518

rfl-urbaniak · 2024-03-12T18:09:08Z

still issues with what isort does between CI and locally, despite the version numbers being the same. Other people have faced this, I'm looking for a solution, but slowly considering using black w/o isort at least till switching to isort 6.0

PyCQA/isort#1889

Niklewa

The SQL database's poor performance might be because we're treating it like regular DataFrames. I think using it differently, focusing on fetching only the necessary columns in each code section, would suit SQL databases better. Comparing performance by listing all features (one of our tests) isn't really fair.

Here are some actions that might improve performance:

Using SQL queries directly on the database instead of using a pandas DataFrame. This way, we only fetch the data needed for each task (also optimize the queries).
Implementing indexing in the SQL queries to restructure the database and add indexed variables.

Other options to consider:

In-memory databases like Redis or Memcached
Adding a caching layer to the SQL setup
Columnar databases such as Amazon Redshift or Apache Cassandra

I'm not sure if these actions will make SQL perform better, keep it the same, or not improve performance significantly. The same applies if we try new methods. However, I think these points are worth considering, especially as our dataset grows.

Niklewa · 2024-06-18T09:48:39Z

cities/utils/data_grabber.py

-            self.wide[feature] = pd.read_csv(file_path)
+            file_path = os.path.join(self.data_path, f"{feature}_{table_suffix}.csv")
+            df = pd.read_csv(file_path)
+            if table_suffix == "wide":


Something like that could be faster (so not relying on multiple elifs), not sure if the change in the performance would be meaningful:

suffix_map = {
"wide": self.wide,
"std_wide": self.std_wide,
"long": self.long,
"std_long": self.std_long
}

if table_suffix in suffix_map:
suffix_map[table_suffix][feature] = df
else:
raise ValueError("Invalid table suffix. Please choose 'wide', 'std_wide', 'long', or 'std_long'.")

rfl-urbaniak changed the base branch from main to staging-county-data March 12, 2024 08:21

rfl-urbaniak changed the title ~~migrate from csv files to sqllite databases for downstream use in queries~~ migrate from csv files to sqlite databases for downstream use in queries Mar 12, 2024

rfl-urbaniak added the status : WIP label Mar 12, 2024

rfl-urbaniak added status: awaiting review and removed status : WIP labels Mar 13, 2024

rfl-urbaniak requested a review from riadas March 13, 2024 10:12

rfl-urbaniak requested a review from Niklewa June 17, 2024 10:57

Niklewa approved these changes Jun 18, 2024

View reviewed changes

rfl-urbaniak removed the request for review from riadas August 20, 2024 13:54

rfl-urbaniak added status: blocked and removed status: awaiting review labels Aug 20, 2024

rfl-urbaniak added 2 commits September 11, 2024 10:23

scripts, tests

19ed746

scripts, tests

00fa94b

rfl-urbaniak changed the base branch from staging-county-data to main November 15, 2024 15:51

rfl-urbaniak added 2 commits November 15, 2024 10:59

remove large files

5fc7161

Merge branch 'main' into ru-sql

6c16d14

rfl-urbaniak removed the status: blocked label Nov 15, 2024

rfl-urbaniak added 7 commits November 15, 2024 11:29

moving db test to skip in CI environment

f305285

format, lint

91703ab

remove sorted intervention testing from ci

23944aa

format, lint

1dd1145

decoupled DataGrabberDB

08fc266

Merge branch 'main' into ru-sql

d580337

format lint

ed02d96

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

migrate from `csv` files to `sqlite` databases for downstream use in queries #120

migrate from `csv` files to `sqlite` databases for downstream use in queries #120

rfl-urbaniak commented Mar 12, 2024 •

edited

Loading

rfl-urbaniak commented Mar 12, 2024

rfl-urbaniak commented Mar 12, 2024

Niklewa left a comment

Niklewa Jun 18, 2024

migrate from csv files to sqlite databases for downstream use in queries #120

Are you sure you want to change the base?

migrate from csv files to sqlite databases for downstream use in queries #120

Conversation

rfl-urbaniak commented Mar 12, 2024 • edited Loading

rfl-urbaniak commented Mar 12, 2024

rfl-urbaniak commented Mar 12, 2024

Niklewa left a comment

Choose a reason for hiding this comment

Niklewa Jun 18, 2024

Choose a reason for hiding this comment

migrate from `csv` files to `sqlite` databases for downstream use in queries #120

migrate from `csv` files to `sqlite` databases for downstream use in queries #120

rfl-urbaniak commented Mar 12, 2024 •

edited

Loading