ENH: data versioning in the data catalog #438

savente93 · 2023-07-12T08:54:46Z

Issue addressed

Fixes #329

Explanation

With this PR users will be able to add versions of the same data source but from different origins. E.g. this would allow users to have a data catalog for esa_worldcover both local and on aws, and query both as is necessary. We also plan to add similar functionalities for versions/editions of datasets (see additional notes below), but that will be in a separate PR. The catalogs are disambiguate during parse time and compressed during write time, so the internal representation of the adapters has remained unchanged. The only change to the adapters is that I added a version_name field so we can carry the appropriate data with us throughout the code easier.

An important note, is that to be able to make this happen we have to deprecate the dictionary overloading, so after this get's merged, iterating over the catalog itself, instead of using one of the provided methods, as well as directly accessing the sources dictionary, and using dictionary syntax on the catalog to retrieve or set sources is deprecated and should not be used. All current instances of that code in HydroMT have been updated (as far as I know)

Checklist

Updated tests or added new tests
Branch is up to date with main
Tests & pre-commit hooks pass
Updated documentation if needed
Updated changelog.rst if needed

Additional Notes (optional)

There are some places in the code where I use the name version and some places where I use the name provider. Currently this might be somewhat confusing but the reason for this is that very likely the next issue I'll be picking up is adding similar functionality but for versions of datasets (e.g. v2022 vs v2021 of esa_worldcover) that will use the same mechanism. I think the naming will make a lot more sense when that is added, but I didn't want to introduce a version in between that would then immediately break compatibility in the next update. For the purposes of this PR you can think of versions and providers as equivalent.

DirkEilander

Some initial feedback. I think the changes generally look good and we are nearly there. However it seems two different issues with very similar solutions might got mixed up a bit, i.e., #148 which focusses on versions of data from the same source & #329 which focusses on identical data but from different sources. Given the work that you have done I think we should try to solve both here.

What is still needed:

slightly different yml structure for versions which supports both provider and version as arguments, see example below. Note that internally we still want the tree structure you implemented.
consistent naming of version & provider in the code. I also suggest to use variants where you currently use versions to avoid confusion with version.
finally, expose the new keyword arguments in DataCatalog.get_source (i.e. version and provider) in the get_rasterdataset/get_geodataset/get_geodataframe/get_dataframe methods by accepting a dictionary for the data_like argument with {"name": <str>, "version": <str>, "provider": <str>}.

esa_worldcover:
  crs: 4326
  data_type: RasterDataset
  driver: raster
  filesystem: local
  driver_kwargs:
    chunks:
      x: 36000
      y: 36000
  meta:
    category: landuse
    source_license: CC BY 4.0
    source_url: https://doi.org/10.5281/zenodo.5571936
  variants:  # instead of "versions" to avoid confusion with "version"
    - provider: local
      version: 2020
      path: landuse/esa_worldcover/esa-worldcover.vrt
    - provider: local
      version: 2021
      path: landuse/esa_worldcover_2021/esa-worldcover.vrt
    - provider: aws
      version: 2020
      path: s3://esa-worldcover/v100/2020/ESA_WorldCover_10m_2020_v100_Map_AWS.vrt
      rename:
        ESA_WorldCover_10m_2020_v100_Map_AWS: landuse
      filesystem: s3
      driver_kwargs:
        storage_options:
          anon: true

Note that a catalog entry without variants should also still work en support the version and provider keywords

esa_worldcover:
  crs: 4326
  data_type: RasterDataset
  driver: raster
  meta:
    category: landuse
    source_license: CC BY 4.0
    source_url: https://doi.org/10.5281/zenodo.5571936
    source_version: v100
  provider: local
  version: 2020
  path: landuse/esa_worldcover/esa-worldcover.vrt
  filesystem: local
  driver_kwargs:
    chunks:
      x: 36000
      y: 36000

savente93 · 2023-07-24T17:32:42Z

I'll be honest, not the cleanest implementation I've ever made, but it get's the job done. Should be all good now

savente93 · 2023-07-24T17:34:41Z

One question that remains, is what to actually do with alias but we can potentially address that separately. Not sure if we need to make a decision on that right now, and this PR is already quite big.

savente93 · 2023-07-24T20:37:15Z

The previous impl did a lot of unnecessary bookkeeping because I was too stubborn to realize that since 3.7 python iteration maintains insertion order (as defined as part of the std lib spec by now). This at least cleans up the impl a little bit

DirkEilander

Really great work @savente93 ! There are just a few final bits and pieces left to do I think.

docs/user_guide/data_prepare_cat.rst

hydromt/data_adapter/data_adapter.py

hydromt/models/model_api.py

hydromt/data_catalog.py

tests/test_data_catalog.py

hydromt/models/model_grid.py

hydromt/data_catalog.py

hydromt/__init__.py

Co-authored-by: DirkEilander <[email protected]>

savente93 · 2023-07-27T13:54:02Z

@DirkEilander I think I got to all the requested changes that are good to impl in this PR, please have another look.

P.S. I can relocate the placeholder code from _parse_dicts but while I was trying it became quite confusing so I'd like to do it in a separate PR. If you would like this to happen please open an issue, you can just assign it to me and I'll get on that after this gets merged.

savente93 · 2023-07-28T05:39:43Z

The failure of SonarCloud is due to a bit more code duplication, which I don't think is a problem. Honestly I think the implementation would get less clear if I tried to somehow reduce it because it seems to hit on lines with stuff like arguments which have to be the same. So I don't actually intend of fixing that, I think it's good as is

…s; return newest version by default

savente93 · 2023-07-31T07:52:24Z

@DirkEilander how come you changed the variable name from data_version to version? The reason I did that was that I wanted it to be clear that the version refers to that version of the data, and not the version of either the catalog or the version of hydromt, both of which we're going to have quite soon

DirkEilander

While reviewing and testing I have made some changes to this branch directly. It's mainly some polishing. Not the ideal process, but hope you don't mind. In my opinion it's now ready to be merged, but it would be good if you had a final critical look at my changes.

What I've done:

fixed naming consistency for version (was data_version) and source (was key) in DataCatalog.get_source and DataCatalog.get_rasterdataset and similar methods.
fixed the checking of different input types for data_like in DataCatalog,get_rasterdataset and similar methods and also restored your key-word arguments for provider and version for a better experience for python users (see also docs example)
fixed DataCatalog.to_dict with more than 2 variants
by default return the newest version of a provider by sorting the sources in DataCatalog.add_source. We still return the last added provider. This seemed the wanted behaviour to me. I've clarified this in all docstrings.
moved the placeholder parsing to _denormalise_data_dict to simplify _parse_data_source_dict method
updated the docs (user guide and doc strings)

DirkEilander · 2023-07-31T08:02:31Z

@DirkEilander how come you changed the variable name from data_version to version? The reason I did that was that I wanted it to be clear that the version refers to that version of the data, and not the version of either the catalog or the version of hydromt, both of which we're going to have quite soon

This was needed to correctly pass the keys from the data_like dictionary to all downstream methods (see updated tests in test_data_catalog.py::test_data_catalog). Now we have consistent naming in the data catalog files, DataAdapter attribute and DataCatalog.get_ methods arguments. I see the potential confusion with data catalog versions and hydromt versions. I've carefully checked on conflicts which do not occur I think. Would be good to discuss this further if you do see (future) conflicts.

savente93 · 2023-07-31T08:32:44Z

Quickly discussed in video call: version (i.e. as is) is good. Will quickly review commit, and merge after if it's good (which I expect)

savente93 · 2023-07-31T08:47:16Z

@DirkEilander Looks good, thanks for cleaning up the docs/docstrings!

savente93 and others added 21 commits June 29, 2023 16:33

[no ci] WIP

d8c52a4

[no ci] WIP

7245aad

fix #413

b5fb06f

add test and changelog

977c680

[no ci] WIP

124afaf

wip

5a2b6b4

[no ci] WIP

8deff0b

Merge branch 'main' into version-support

4fbb763

[no ci] WIP

cd4a3ff

[no ci] WIP

bbd5d5e

fix tests, catalog still need diffing

438f375

get started on data catalog partitioning

c97a7ba

fix reading merged catalogs, write impl missing

ee9b803

wip

228db74

[no ci] WIP

ada6621

fix alias tests

e034c85

remove temporary test skip

5b7f7a0

introduce deprecation warning for alias

e46b608

Merge branch 'main' into version-support

91a4e37

Merge branch 'main' into version-support

920aac8

Merge branch 'main' into version-support

2aa76ef

DirkEilander changed the title ~~Version support~~ ENH: data versioning in the data catalog Jul 20, 2023

DirkEilander reviewed Jul 20, 2023

View reviewed changes

savente93 added 2 commits July 24, 2023 15:39

fix data adapters part

1388d60

fix versoined catalog tests

5791818

savente93 linked an issue Jul 24, 2023 that may be closed by this pull request

ENH: better data source version control #148

Closed

clean up impl by just using that dicts are ordered

664d88a

DirkEilander requested changes Jul 27, 2023

View reviewed changes

savente93 marked this pull request as draft July 27, 2023 08:13

DirkEilander mentioned this pull request Jul 27, 2023

ENH: Solve which data catalog version to use based on the user hydromt version #461

Closed

savente93 and others added 4 commits July 27, 2023 11:27

Update docs/user_guide/data_prepare_cat.rst

516b6ec

Co-authored-by: DirkEilander <[email protected]>

incorproate PR feedback

cdd4d5e

Merge branch 'main' into version-support

caea5de

remove stray debug statement

9e59f6f

savente93 requested a review from DirkEilander July 27, 2023 13:52

savente93 marked this pull request as ready for review July 27, 2023 13:52

savente93 added 3 commits July 27, 2023 15:58

move type to explicit Union

3f814a4

fix to_dataframe

7bf4de3

fix docs warning

11296c4

fix inconsistent argument names; fix to_dict with more than 2 variant…

b907c90

…s; return newest version by default

DirkEilander approved these changes Jul 31, 2023

View reviewed changes

savente93 merged commit f55aaee into main Jul 31, 2023

savente93 deleted the version-support branch July 31, 2023 08:47

savente93 added a commit that referenced this pull request Jul 31, 2023

ENH: data versioning in the data catalog (#438)

140c59d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: data versioning in the data catalog #438

ENH: data versioning in the data catalog #438

savente93 commented Jul 12, 2023

DirkEilander left a comment •

edited by savente93

Loading

savente93 commented Jul 24, 2023

savente93 commented Jul 24, 2023

savente93 commented Jul 24, 2023

DirkEilander left a comment

savente93 commented Jul 27, 2023

savente93 commented Jul 28, 2023

savente93 commented Jul 31, 2023

DirkEilander left a comment •

edited

Loading

DirkEilander commented Jul 31, 2023 •

edited

Loading

savente93 commented Jul 31, 2023

savente93 commented Jul 31, 2023

ENH: data versioning in the data catalog #438

ENH: data versioning in the data catalog #438

Conversation

savente93 commented Jul 12, 2023

Issue addressed

Explanation

Checklist

Additional Notes (optional)

DirkEilander left a comment • edited by savente93 Loading

Choose a reason for hiding this comment

savente93 commented Jul 24, 2023

savente93 commented Jul 24, 2023

savente93 commented Jul 24, 2023

DirkEilander left a comment

Choose a reason for hiding this comment

savente93 commented Jul 27, 2023

savente93 commented Jul 28, 2023

savente93 commented Jul 31, 2023

DirkEilander left a comment • edited Loading

Choose a reason for hiding this comment

DirkEilander commented Jul 31, 2023 • edited Loading

savente93 commented Jul 31, 2023

savente93 commented Jul 31, 2023

DirkEilander left a comment •

edited by savente93

Loading

DirkEilander left a comment •

edited

Loading

DirkEilander commented Jul 31, 2023 •

edited

Loading