Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor interally to RawBatch and CleanBatch wrapper types #57

Merged
merged 10 commits into from
Jun 4, 2024

Conversation

kylebarron
Copy link
Collaborator

@kylebarron kylebarron commented Jun 3, 2024

I was adding tests for parse_stac_ndjson_to_arrow when I discovered that that was inadvertently broken in #53 due to a refactor in the schema inference. It has been confusing to work with untyped pyarrow.RecordBatch classes because they're a black box, and we have two distinct data schemas we're working with: one that is as close to the raw JSON as possible, and another that conforms to our STAC GeoParquet schema.

This PR refactors this internally by adding RawBatch and CleanBatch wrapper types (open to better naming suggestions, but these are not public, so we can easily rename in the future). They both hold an internal pyarrow.RecordBatch but RawBatch aligns as much as possible to the raw STAC JSON representation, while CleanBatch aligns to the STAC-GeoParquet schema.

These wrapper types make it much easier to reason about the shape of the data at different points of the pipeline.

Change list

  • Add more tests for STAC conversion via arrow and ndjson
  • Create RawBatch and CleanBatch for more reliable internal typing
  • Add parquet tests

@kylebarron kylebarron mentioned this pull request Jun 3, 2024
@bitner
Copy link
Contributor

bitner commented Jun 3, 2024

Rather than Raw and Clean, I'd rather something that was more explicit. StacArrowBatch and StacJsonBatch perhaps?

@kylebarron
Copy link
Collaborator Author

I had considered JsonBatch first, but I was trying to think of a different name because it's not a collection of JSON objects. And similarly StacArrowBatch isn't that descriptive because both batches are just Arrow underneath

@TomAugspurger
Copy link
Collaborator

TomAugspurger commented Jun 3, 2024 via email

@kylebarron
Copy link
Collaborator Author

I renamed to StacArrowBatch and StacJsonBatch

@kylebarron kylebarron merged commit cf0699b into main Jun 4, 2024
1 check passed
@kylebarron kylebarron deleted the kyle/batch-typing branch June 25, 2024 16:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants