Load parquet files to a duckdb file #3739

Open · 4 of 9 tasks · Tracked by #3703
bendnorman opened this issue Jul 25, 2024 · 10 comments · May be fixed by #3741
Labels: duckdb (Issues referring to duckdb, the embedded OLAP database), superset

@bendnorman (Member) commented Jul 25, 2024

Superset does not support loading data from SQLite, so we want to use DuckDB instead! DuckDB is well suited to our data because it's designed for local data warehouses. It's also a cheaper option for Superset than something like BigQuery, where we'd have to pay for query compute costs.

Success Criteria

  • all parquet files except CEMS and long-named tables are in a .duckdb file
  • .duckdb file is generated & distributed to S3/GCS in nightly builds

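For context, the basic load pattern these criteria imply is a small amount of DuckDB glue. Here's a minimal sketch, assuming illustrative file paths and a placeholder table name rather than the actual PUDL layout:

```python
import duckdb

# Illustrative output path; the nightly build would write this file and
# distribute it to S3/GCS.
con = duckdb.connect("pudl.duckdb")

# DuckDB can create a table directly from a parquet file:
con.execute(
    "CREATE TABLE core_demo__yearly_data AS "
    "SELECT * FROM read_parquet('parquet/core_demo__yearly_data.parquet')"
)
con.close()
```
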
bendnorman added the duckdb and superset labels Jul 25, 2024
bendnorman self-assigned this Jul 25, 2024
@bendnorman (Member, Author)

DuckDB table names have a 63-character limit. We have four tables whose names exceed 63 characters:

core_eiaaeo__yearly_projected_fuel_cost_in_electric_sector_by_type: 66
core_eiaaeo__yearly_projected_generation_in_electric_sector_by_technology: 73
core_eiaaeo__yearly_projected_generation_in_end_use_sectors_by_fuel_type: 72
out_eia923__yearly_generation_fuel_by_generator_energy_source_owner: 67

We should rename these resources, enforce the name-length constraint earlier in the code, and update our documentation.
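
A minimal sketch of what enforcing that earlier could look like; the constant and helper are hypothetical, not the actual PUDL API:

```python
# Hypothetical constant/helper illustrating an early name-length check,
# using the 63-character limit discussed above.
MAX_IDENTIFIER_LENGTH = 63

def check_resource_name(name: str) -> None:
    """Fail at metadata-definition time rather than at DuckDB load time."""
    if len(name) > MAX_IDENTIFIER_LENGTH:
        raise ValueError(
            f"Resource name {name!r} is {len(name)} characters, "
            f"exceeding the {MAX_IDENTIFIER_LENGTH}-character limit."
        )

# This name is 67 characters, so it raises:
try:
    check_resource_name(
        "out_eia923__yearly_generation_fuel_by_generator_energy_source_owner"
    )
except ValueError as err:
    print(err)
```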

@cmgosnell (Member)

Hhhhmmm, long AEO names. @jdangerx made an AEO schema PR before migrating a lot of the AEO tables. I think the trouble here is that there are sooooo many AEO tables, and so many of them contain the same pieces of information, just broken down by different attributes.

@zaneselvans (Member)

On the topic of making a DuckDB schema with our metadata classes, I'd been thinking we either want to_sqlite() and to_duckdb() methods in place of the generic to_sql() we have now (which only creates SQLite schemas), or maybe to add dialect="duckdb" and dialect="sqlite" arguments to to_sql() that do the right thing, defaulting to dialect="sqlite" since that's the legacy behavior.

@bendnorman (Member, Author)

Agreed! That's what I'm working on right now. I've added a dialect="duckdb" option to to_sql(). For now I'm just going to add some if statements to handle the different dialects, but there is probably a cleaner way to organize the metadata-to-SQL logic for multiple dialects. I might make a MetadataSQLConverter class or type.
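
A rough sketch of what that converter class could look like; everything here (class names, the toy type mapping) is illustrative, not the actual PUDL design:

```python
from abc import ABC, abstractmethod


class MetadataSQLConverter(ABC):
    """Shared schema-building logic, with dialect-specific pieces overridden."""

    @abstractmethod
    def column_ddl(self, name: str, dtype: str) -> str:
        """Render one column definition in this dialect."""

    def create_table_ddl(self, table: str, columns: dict[str, str]) -> str:
        cols = ", ".join(self.column_ddl(n, t) for n, t in columns.items())
        return f"CREATE TABLE {table} ({cols});"


class SQLiteConverter(MetadataSQLConverter):
    def column_ddl(self, name: str, dtype: str) -> str:
        # SQLite has no native boolean type; store it as an integer.
        return f"{name} {'INTEGER' if dtype == 'boolean' else dtype.upper()}"


class DuckDBConverter(MetadataSQLConverter):
    def column_ddl(self, name: str, dtype: str) -> str:
        return f"{name} {'BOOLEAN' if dtype == 'boolean' else dtype.upper()}"


print(DuckDBConverter().create_table_ddl("demo", {"id": "integer", "flag": "boolean"}))
# CREATE TABLE demo (id INTEGER, flag BOOLEAN);
```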

@zaneselvans (Member)

It might also be possible to use SQLAlchemy for this, if the checks, constraints, etc. can be stated using its generic API and then output to the appropriate dialect. IIRC there was at least one SQLite-specific thing that we had to code manually, though.
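
Something like this could work; a sketch assuming the third-party duckdb_engine package for the DuckDB dialect, with an illustrative table:

```python
import sqlalchemy as sa
from sqlalchemy.dialects import sqlite
from sqlalchemy.schema import CreateTable

metadata = sa.MetaData()
table = sa.Table(
    "core_demo__yearly_data",
    metadata,
    sa.Column("report_year", sa.Integer, primary_key=True),
    sa.Column("fuel_type", sa.Text, nullable=False),
    sa.CheckConstraint("report_year >= 1900", name="plausible_year"),
)

# Compile the same generic Table into dialect-specific DDL.
print(CreateTable(table).compile(dialect=sqlite.dialect()))

# duckdb_engine registers a "duckdb" dialect with SQLAlchemy:
duckdb_dialect = sa.create_engine("duckdb:///:memory:").dialect
print(CreateTable(table).compile(dialect=duckdb_dialect))
```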

zaneselvans linked a pull request Jul 28, 2024 that will close this issue
jdangerx linked a pull request Aug 6, 2024 that will close this issue
@jdangerx (Member) commented Aug 7, 2024

In our inframundo meeting we decided that we can skip the "hard" ones for now and get back to them before we actually release to the public:

  • tables that have long names
  • CEMS, which is too big to load all at once

jdangerx assigned zaneselvans and unassigned bendnorman Aug 21, 2024
@zaneselvans (Member)

Something weird is going on with how big the DuckDB file is. Snappy-compressed Parquet is expected to be about the same size as a compressed DuckDB file, but in Parquet, PUDL only takes up 1-2 GB (minus CEMS), while the DuckDB file is like 13 GB, which just seems totally whacked.

@bendnorman (Member, Author)

I think DuckDB uses different compression algorithms, so DuckDB files aren't expected to be as small as Parquet files: duckdb/duckdb#8162 (comment)

@zaneselvans (Member)

A factor of 10 feels suspicious, though. I searched around for comparisons of DuckDB and Parquet compression ratios, and even a couple of years ago it looked like DuckDB should be less than 2x the size of Parquet.

@bendnorman (Member, Author) commented Sep 11, 2024

Hmm, I thought it could be that we're not specifying VARCHAR lengths, but the docs say that shouldn't matter.

It looks like many blocks in our out_eia__monthly_generators table are uncompressed:

D select compression, count(*) as count from pragma_storage_info('out_eia__monthly_generators') group by compression order by count desc;
┌──────────────┬───────┐
│ compression  │ count │
│   varchar    │ int64 │
├──────────────┼───────┤
│ Uncompressed │  4205 │
│ RLE          │  2714 │
│ Constant     │  1719 │
│ Dictionary   │  1281 │
│ FSST         │   722 │
│ ALPRD        │   182 │
│ BitPacking   │    51 │
│ ALP          │    35 │
└──────────────┴───────┘

Not sure why this is or if it's expected.

Another idea: Maybe our indexes are taking up a lot of space?
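
One way to poke at that hypothesis with DuckDB's built-in metadata functions (the file path here is illustrative):

```python
import duckdb

con = duckdb.connect("pudl.duckdb")  # illustrative path

# Overall file statistics: block size, used vs. free blocks, WAL size, etc.
print(con.sql("PRAGMA database_size;"))

# ART indexes (e.g. from primary keys and unique constraints) are persisted
# in the database file and can account for a meaningful share of it:
print(con.sql("SELECT table_name, index_name, is_primary FROM duckdb_indexes();"))
```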
