Exhaustive schema inference #50

Merged: 8 commits merged into main from kyle/exhaustive-schema-inference on May 15, 2024

Conversation

@kylebarron (Collaborator) commented May 9, 2024

In some cases it's desirable to handle bulk STAC to GeoParquet conversion in a foolproof way with minimal user input. In #49, I presented an option to handle partial schemas, so that the user only needs to declare which STAC extensions they use, but that is still not totally foolproof. In particular, it can fail due to versioning issues if a single collection mixes multiple extension versions and/or multiple core STAC versions. Additionally, that approach requires ongoing maintenance to keep the partial schemas up to date as extensions release new versions.

It may still be worth finishing and merging #49, but I think it makes sense to at least include this "exhaustive" inference as an option, because it's the most foolproof approach (though time-consuming). In this PR, we implement a full scan over the input data, which infers a single unified Arrow schema before converting any data to Parquet.
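As a minimal sketch of the exhaustive-scan idea, assuming pyarrow's built-in schema inference over chunks of Items (the `InferredSchema` class in this PR presumably does something more STAC-aware, so treat this as illustrative rather than the actual implementation):

```python
import json

import pyarrow as pa


def infer_unified_schema(ndjson_path: str, chunk_size: int = 1000) -> pa.Schema:
    """Scan an entire newline-delimited JSON file of STAC Items, unioning
    per-chunk inferred schemas into a single unified Arrow schema."""
    schemas = []
    chunk = []
    with open(ndjson_path) as f:
        for line in f:
            chunk.append(json.loads(line))
            if len(chunk) >= chunk_size:
                # Let pyarrow infer an Arrow schema for this chunk of Items
                schemas.append(pa.Table.from_pylist(chunk).schema)
                chunk = []
    if chunk:
        schemas.append(pa.Table.from_pylist(chunk).schema)
    # Fields seen in only some chunks become nullable in the merged schema.
    # Note: unify_schemas raises if a field has conflicting types across
    # chunks, so a real implementation also needs type promotion.
    return pa.unify_schemas(schemas)
```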

I tested this with the horrible data I got stuck with last fall at the STAC Sprint: 10,000 STAC Items fetched from AWS's Sentinel 2 STAC collection, with a variety of STAC versions and different assets in each item. Even though this horrid input produces a pretty insane schema (52 separate asset keys 😱), it works without any user input! (The pyarrow text repr of this schema is 90,000 characters, which means it's too big to paste into a comment 😂)

Change list

  • Add an InferredSchema class, intended to iteratively build up a schema while scanning input JSON data.
  • Update existing functions to accept InferredSchema input for the schema argument.
  • Change the return type of parse_stac_ndjson_to_arrow to a Generator of Arrow RecordBatches.
  • Allow parse_stac_ndjson_to_arrow to take one or more paths to newline-delimited JSON files.
  • Add a top-level parse_stac_ndjson_to_parquet, which wraps parse_stac_ndjson_to_arrow to construct a single GeoParquet file from input ndjson data (see the usage sketch after this list). This streams data, so it does not hold all input data in memory at once. When the user does not pass in a schema, it first scans the entire input to infer a unified schema, then creates the record batch generator.
  • Fix Arrow -> STAC conversion for nested GeoJSON properties. We were already converting GeoJSON to WKB at the top level, which is necessary in case there are Polygon and MultiPolygon geometries in the same collection, but we hadn't been doing that for the nested proj:geometry key, which is also GeoJSON and should be WKB for the same reasons. This fixes that for both conversion to Arrow and from Arrow back to JSON.
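For illustration, here is a rough usage sketch of the functions described above. The import path, exact signatures, and file names are assumptions for the example, not confirmed by this PR:

```python
import pyarrow.parquet as pq

from stac_geoparquet.arrow import (  # assumed import path
    parse_stac_ndjson_to_arrow,
    parse_stac_ndjson_to_parquet,
)

paths = ["sentinel2-1.ndjson", "sentinel2-2.ndjson"]  # hypothetical inputs

# One-shot: with no schema passed, this first scans the entire input to
# infer a unified schema, then streams record batches into one GeoParquet
# file without holding all input data in memory at once.
parse_stac_ndjson_to_parquet(paths, "items.parquet")

# Lower level: consume the Generator of Arrow RecordBatches directly and
# write them out incrementally with a ParquetWriter.
batches = parse_stac_ndjson_to_arrow(paths)
first = next(batches)
with pq.ParquetWriter("items.parquet", first.schema) as writer:
    writer.write_batch(first)
    for batch in batches:
        writer.write_batch(batch)
```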

@kylebarron changed the title from "exhaustive schema inference" to "Exhaustive schema inference" on May 9, 2024
@kylebarron (Collaborator, Author) commented:
merging this with @bitner's approval to unblock #53

@kylebarron merged commit 2eaf22d into main on May 15, 2024 (1 check passed)
@kylebarron deleted the kyle/exhaustive-schema-inference branch on May 15, 2024 at 16:32
@TomAugspurger (Collaborator) commented May 15, 2024 via email
