Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: data versioning in the data catalog #438

Merged
merged 32 commits into from
Jul 31, 2023
Merged

ENH: data versioning in the data catalog #438

merged 32 commits into from
Jul 31, 2023

Conversation

savente93
Copy link
Contributor

Issue addressed

Fixes #329

Explanation

With this PR users will be able to add versions of the same data source but from different origins. E.g. this would allow users to have a data catalog for esa_worldcover both local and on aws, and query both as is necessary. We also plan to add similar functionalities for versions/editions of datasets (see additional notes below), but that will be in a separate PR. The catalogs are disambiguate during parse time and compressed during write time, so the internal representation of the adapters has remained unchanged. The only change to the adapters is that I added a version_name field so we can carry the appropriate data with us throughout the code easier.

An important note, is that to be able to make this happen we have to deprecate the dictionary overloading, so after this get's merged, iterating over the catalog itself, instead of using one of the provided methods, as well as directly accessing the sources dictionary, and using dictionary syntax on the catalog to retrieve or set sources is deprecated and should not be used. All current instances of that code in HydroMT have been updated (as far as I know)

Checklist

  • Updated tests or added new tests
  • Branch is up to date with main
  • Tests & pre-commit hooks pass
  • Updated documentation if needed
  • Updated changelog.rst if needed

Additional Notes (optional)

There are some places in the code where I use the name version and some places where I use the name provider. Currently this might be somewhat confusing but the reason for this is that very likely the next issue I'll be picking up is adding similar functionality but for versions of datasets (e.g. v2022 vs v2021 of esa_worldcover) that will use the same mechanism. I think the naming will make a lot more sense when that is added, but I didn't want to introduce a version in between that would then immediately break compatibility in the next update. For the purposes of this PR you can think of versions and providers as equivalent.

@DirkEilander DirkEilander changed the title Version support ENH: data versioning in the data catalog Jul 20, 2023
Copy link
Contributor

@DirkEilander DirkEilander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial feedback. I think the changes generally look good and we are nearly there. However it seems two different issues with very similar solutions might got mixed up a bit, i.e., #148 which focusses on versions of data from the same source & #329 which focusses on identical data but from different sources. Given the work that you have done I think we should try to solve both here.

What is still needed:

  • slightly different yml structure for versions which supports both provider and version as arguments, see example below. Note that internally we still want the tree structure you implemented.
  • consistent naming of version & provider in the code. I also suggest to use variants where you currently use versions to avoid confusion with version.
  • finally, expose the new keyword arguments in DataCatalog.get_source (i.e. version and provider) in the get_rasterdataset/get_geodataset/get_geodataframe/get_dataframe methods by accepting a dictionary for the data_like argument with {"name": <str>, "version": <str>, "provider": <str>}.
esa_worldcover:
  crs: 4326
  data_type: RasterDataset
  driver: raster
  filesystem: local
  driver_kwargs:
    chunks:
      x: 36000
      y: 36000
  meta:
    category: landuse
    source_license: CC BY 4.0
    source_url: https://doi.org/10.5281/zenodo.5571936
  variants:  # instead of "versions" to avoid confusion with "version"
    - provider: local
      version: 2020
      path: landuse/esa_worldcover/esa-worldcover.vrt
    - provider: local
      version: 2021
      path: landuse/esa_worldcover_2021/esa-worldcover.vrt
    - provider: aws
      version: 2020
      path: s3://esa-worldcover/v100/2020/ESA_WorldCover_10m_2020_v100_Map_AWS.vrt
      rename:
        ESA_WorldCover_10m_2020_v100_Map_AWS: landuse
      filesystem: s3
      driver_kwargs:
        storage_options:
          anon: true

Note that a catalog entry without variants should also still work en support the version and provider keywords

esa_worldcover:
  crs: 4326
  data_type: RasterDataset
  driver: raster
  meta:
    category: landuse
    source_license: CC BY 4.0
    source_url: https://doi.org/10.5281/zenodo.5571936
    source_version: v100
  provider: local
  version: 2020
  path: landuse/esa_worldcover/esa-worldcover.vrt
  filesystem: local
  driver_kwargs:
    chunks:
      x: 36000
      y: 36000

@savente93
Copy link
Contributor Author

I'll be honest, not the cleanest implementation I've ever made, but it get's the job done. Should be all good now

@savente93
Copy link
Contributor Author

One question that remains, is what to actually do with alias but we can potentially address that separately. Not sure if we need to make a decision on that right now, and this PR is already quite big.

@savente93 savente93 linked an issue Jul 24, 2023 that may be closed by this pull request
@savente93
Copy link
Contributor Author

The previous impl did a lot of unnecessary bookkeeping because I was too stubborn to realize that since 3.7 python iteration maintains insertion order (as defined as part of the std lib spec by now). This at least cleans up the impl a little bit

Copy link
Contributor

@DirkEilander DirkEilander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really great work @savente93 ! There are just a few final bits and pieces left to do I think.

docs/user_guide/data_prepare_cat.rst Outdated Show resolved Hide resolved
hydromt/data_adapter/data_adapter.py Outdated Show resolved Hide resolved
hydromt/models/model_api.py Outdated Show resolved Hide resolved
hydromt/data_catalog.py Outdated Show resolved Hide resolved
tests/test_data_catalog.py Outdated Show resolved Hide resolved
hydromt/models/model_grid.py Outdated Show resolved Hide resolved
hydromt/data_catalog.py Outdated Show resolved Hide resolved
hydromt/data_catalog.py Show resolved Hide resolved
hydromt/data_catalog.py Outdated Show resolved Hide resolved
hydromt/__init__.py Outdated Show resolved Hide resolved
@savente93 savente93 requested a review from DirkEilander July 27, 2023 13:52
@savente93 savente93 marked this pull request as ready for review July 27, 2023 13:52
@savente93
Copy link
Contributor Author

@DirkEilander I think I got to all the requested changes that are good to impl in this PR, please have another look.

P.S. I can relocate the placeholder code from _parse_dicts but while I was trying it became quite confusing so I'd like to do it in a separate PR. If you would like this to happen please open an issue, you can just assign it to me and I'll get on that after this gets merged.

@savente93
Copy link
Contributor Author

The failure of SonarCloud is due to a bit more code duplication, which I don't think is a problem. Honestly I think the implementation would get less clear if I tried to somehow reduce it because it seems to hit on lines with stuff like arguments which have to be the same. So I don't actually intend of fixing that, I think it's good as is

@savente93
Copy link
Contributor Author

@DirkEilander how come you changed the variable name from data_version to version? The reason I did that was that I wanted it to be clear that the version refers to that version of the data, and not the version of either the catalog or the version of hydromt, both of which we're going to have quite soon

Copy link
Contributor

@DirkEilander DirkEilander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While reviewing and testing I have made some changes to this branch directly. It's mainly some polishing. Not the ideal process, but hope you don't mind. In my opinion it's now ready to be merged, but it would be good if you had a final critical look at my changes.

What I've done:

  • fixed naming consistency for version (was data_version) and source (was key) in DataCatalog.get_source and DataCatalog.get_rasterdataset and similar methods.
  • fixed the checking of different input types for data_like in DataCatalog,get_rasterdataset and similar methods and also restored your key-word arguments for provider and version for a better experience for python users (see also docs example)
  • fixed DataCatalog.to_dict with more than 2 variants
  • by default return the newest version of a provider by sorting the sources in DataCatalog.add_source. We still return the last added provider. This seemed the wanted behaviour to me. I've clarified this in all docstrings.
  • moved the placeholder parsing to _denormalise_data_dict to simplify _parse_data_source_dict method
  • updated the docs (user guide and doc strings)

@DirkEilander
Copy link
Contributor

DirkEilander commented Jul 31, 2023

@DirkEilander how come you changed the variable name from data_version to version? The reason I did that was that I wanted it to be clear that the version refers to that version of the data, and not the version of either the catalog or the version of hydromt, both of which we're going to have quite soon

This was needed to correctly pass the keys from the data_like dictionary to all downstream methods (see updated tests in test_data_catalog.py::test_data_catalog). Now we have consistent naming in the data catalog files, DataAdapter attribute and DataCatalog.get_ methods arguments. I see the potential confusion with data catalog versions and hydromt versions. I've carefully checked on conflicts which do not occur I think. Would be good to discuss this further if you do see (future) conflicts.

@savente93
Copy link
Contributor Author

Quickly discussed in video call: version (i.e. as is) is good. Will quickly review commit, and merge after if it's good (which I expect)

@savente93
Copy link
Contributor Author

@DirkEilander Looks good, thanks for cleaning up the docs/docstrings!

@savente93 savente93 merged commit f55aaee into main Jul 31, 2023
@savente93 savente93 deleted the version-support branch July 31, 2023 08:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Support for multiple data catalogs with same source names ENH: better data source version control
2 participants