Exhaustive schema inference #50

Merged: 8 commits merged into main from kyle/exhaustive-schema-inference on May 15, 2024

Conversation

@kylebarron (Collaborator) commented May 9, 2024

In some cases it's desirable to handle bulk STAC to GeoParquet conversion in a foolproof way with minimal user input. In #49, I presented an option to handle partial schemas, so that the user only needs to declare which STAC extensions they use, but that is still not totally foolproof. In particular, it can fail due to versioning issues if a single collection mixes multiple extension versions and/or multiple core STAC versions. Additionally, that approach requires ongoing maintenance to keep the partial schemas up to date as extensions release new versions.

It may still be worth finishing and merging #49, but I think it makes sense to at least include this "exhaustive" inference as an option, because it's the most foolproof approach (though time-consuming). In this PR, we implement a full scan over the input data, which infers a single unified Arrow schema before converting any data to Parquet.
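As a minimal sketch of the exhaustive-scan idea, assuming pyarrow's built-in schema inference over chunks of Items (the `InferredSchema` class in this PR presumably does something more STAC-aware, so treat this as illustrative rather than the actual implementation):

```python
import json

import pyarrow as pa


def infer_unified_schema(ndjson_path: str, chunk_size: int = 1000) -> pa.Schema:
    """Scan an entire newline-delimited JSON file of STAC Items, unioning
    per-chunk inferred schemas into a single unified Arrow schema."""
    schemas = []
    chunk = []
    with open(ndjson_path) as f:
        for line in f:
            chunk.append(json.loads(line))
            if len(chunk) >= chunk_size:
                # Let pyarrow infer an Arrow schema for this chunk of Items
                schemas.append(pa.Table.from_pylist(chunk).schema)
                chunk = []
    if chunk:
        schemas.append(pa.Table.from_pylist(chunk).schema)
    # Fields seen in only some chunks become nullable in the merged schema.
    # Note: unify_schemas raises if a field has conflicting types across
    # chunks, so a real implementation also needs type promotion.
    return pa.unify_schemas(schemas)
```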

I tested this with the horrible data I got stuck with last fall at the STAC Sprint: 10,000 STAC Items fetched from AWS's Sentinel 2 STAC collection, with a variety of STAC versions and different assets in each item. Even though this horrid input produces a pretty insane schema (52 separate asset keys 😱), it works without any user input! (The pyarrow text repr of this schema is 90,000 characters, which means it's too big to paste into a comment 😂)

Change list

  • Add an InferredSchema class, intended to iteratively build up a schema while scanning input JSON data.
  • Update existing functions to accept InferredSchema input for the schema argument.
  • Change the return type of parse_stac_ndjson_to_arrow to a Generator of Arrow RecordBatches.
  • Allow parse_stac_ndjson_to_arrow to take one or more paths to newline-delimited JSON files.
  • Add a top-level parse_stac_ndjson_to_parquet, which wraps parse_stac_ndjson_to_arrow to construct a single GeoParquet file from input ndjson data (see the usage sketch after this list). This streams data, so it does not hold all input data in memory at once. When the user does not pass in a schema, it first scans the entire input to infer a unified schema, then creates the record batch generator.
  • Fix Arrow -> STAC conversion for nested GeoJSON properties. We were already converting GeoJSON to WKB at the top level, which is necessary in case there are Polygon and MultiPolygon geometries in the same collection, but we hadn't been doing that for the nested proj:geometry key, which is also GeoJSON and should be WKB for the same reasons. This fixes that for both conversion to Arrow and from Arrow back to JSON.
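For illustration, here is a rough usage sketch of the functions described above. The import path, exact signatures, and file names are assumptions for the example, not confirmed by this PR:

```python
import pyarrow.parquet as pq

from stac_geoparquet.arrow import (  # assumed import path
    parse_stac_ndjson_to_arrow,
    parse_stac_ndjson_to_parquet,
)

paths = ["sentinel2-1.ndjson", "sentinel2-2.ndjson"]  # hypothetical inputs

# One-shot: with no schema passed, this first scans the entire input to
# infer a unified schema, then streams record batches into one GeoParquet
# file without holding all input data in memory at once.
parse_stac_ndjson_to_parquet(paths, "items.parquet")

# Lower level: consume the Generator of Arrow RecordBatches directly and
# write them out incrementally with a ParquetWriter.
batches = parse_stac_ndjson_to_arrow(paths)
first = next(batches)
with pq.ParquetWriter("items.parquet", first.schema) as writer:
    writer.write_batch(first)
    for batch in batches:
        writer.write_batch(batch)
```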

@kylebarron changed the title from "exhaustive schema inference" to "Exhaustive schema inference" on May 9, 2024
@kylebarron (Collaborator, Author) commented:
merging this with @bitner's approval to unblock #53

@kylebarron merged commit 2eaf22d into main on May 15, 2024 (1 check passed)
@kylebarron deleted the kyle/exhaustive-schema-inference branch on May 15, 2024 at 16:32
@TomAugspurger (Collaborator) commented May 15, 2024 via email
