-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: data versioning in the data catalog #438
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some initial feedback. I think the changes generally look good and we are nearly there. However it seems two different issues with very similar solutions might got mixed up a bit, i.e., #148 which focusses on versions of data from the same source & #329 which focusses on identical data but from different sources. Given the work that you have done I think we should try to solve both here.
What is still needed:
- slightly different yml structure for versions which supports both
provider
andversion
as arguments, see example below. Note that internally we still want the tree structure you implemented. - consistent naming of
version
&provider
in the code. I also suggest to usevariants
where you currently use versions to avoid confusion with version. - finally, expose the new keyword arguments in DataCatalog.get_source (i.e. version and provider) in the get_rasterdataset/get_geodataset/get_geodataframe/get_dataframe methods by accepting a dictionary for the data_like argument with
{"name": <str>, "version": <str>, "provider": <str>}
.
esa_worldcover:
crs: 4326
data_type: RasterDataset
driver: raster
filesystem: local
driver_kwargs:
chunks:
x: 36000
y: 36000
meta:
category: landuse
source_license: CC BY 4.0
source_url: https://doi.org/10.5281/zenodo.5571936
variants: # instead of "versions" to avoid confusion with "version"
- provider: local
version: 2020
path: landuse/esa_worldcover/esa-worldcover.vrt
- provider: local
version: 2021
path: landuse/esa_worldcover_2021/esa-worldcover.vrt
- provider: aws
version: 2020
path: s3://esa-worldcover/v100/2020/ESA_WorldCover_10m_2020_v100_Map_AWS.vrt
rename:
ESA_WorldCover_10m_2020_v100_Map_AWS: landuse
filesystem: s3
driver_kwargs:
storage_options:
anon: true
Note that a catalog entry without variants should also still work en support the version
and provider
keywords
esa_worldcover:
crs: 4326
data_type: RasterDataset
driver: raster
meta:
category: landuse
source_license: CC BY 4.0
source_url: https://doi.org/10.5281/zenodo.5571936
source_version: v100
provider: local
version: 2020
path: landuse/esa_worldcover/esa-worldcover.vrt
filesystem: local
driver_kwargs:
chunks:
x: 36000
y: 36000
I'll be honest, not the cleanest implementation I've ever made, but it get's the job done. Should be all good now |
One question that remains, is what to actually do with |
The previous impl did a lot of unnecessary bookkeeping because I was too stubborn to realize that since 3.7 python iteration maintains insertion order (as defined as part of the std lib spec by now). This at least cleans up the impl a little bit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really great work @savente93 ! There are just a few final bits and pieces left to do I think.
@DirkEilander I think I got to all the requested changes that are good to impl in this PR, please have another look. P.S. I can relocate the placeholder code from |
The failure of SonarCloud is due to a bit more code duplication, which I don't think is a problem. Honestly I think the implementation would get less clear if I tried to somehow reduce it because it seems to hit on lines with stuff like arguments which have to be the same. So I don't actually intend of fixing that, I think it's good as is |
…s; return newest version by default
@DirkEilander how come you changed the variable name from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While reviewing and testing I have made some changes to this branch directly. It's mainly some polishing. Not the ideal process, but hope you don't mind. In my opinion it's now ready to be merged, but it would be good if you had a final critical look at my changes.
What I've done:
- fixed naming consistency for
version
(wasdata_version
) andsource
(waskey
) inDataCatalog.get_source
andDataCatalog.get_rasterdataset
and similar methods. - fixed the checking of different input types for
data_like
inDataCatalog,get_rasterdataset
and similar methods and also restored your key-word arguments for provider and version for a better experience for python users (see also docs example) - fixed
DataCatalog.to_dict
with more than 2 variants - by default return the newest version of a provider by sorting the sources in
DataCatalog.add_source
. We still return the last added provider. This seemed the wanted behaviour to me. I've clarified this in all docstrings. - moved the placeholder parsing to
_denormalise_data_dict
to simplify_parse_data_source_dict
method - updated the docs (user guide and doc strings)
This was needed to correctly pass the keys from the data_like dictionary to all downstream methods (see updated tests in test_data_catalog.py::test_data_catalog). Now we have consistent naming in the data catalog files, DataAdapter attribute and DataCatalog.get_ methods arguments. I see the potential confusion with data catalog versions and hydromt versions. I've carefully checked on conflicts which do not occur I think. Would be good to discuss this further if you do see (future) conflicts. |
Quickly discussed in video call: |
@DirkEilander Looks good, thanks for cleaning up the docs/docstrings! |
Issue addressed
Fixes #329
Explanation
With this PR users will be able to add
versions
of the same data source but from different origins. E.g. this would allow users to have a data catalog foresa_worldcover
both local and on aws, and query both as is necessary. We also plan to add similar functionalities for versions/editions of datasets (see additional notes below), but that will be in a separate PR. The catalogs are disambiguate during parse time and compressed during write time, so the internal representation of the adapters has remained unchanged. The only change to the adapters is that I added aversion_name
field so we can carry the appropriate data with us throughout the code easier.An important note, is that to be able to make this happen we have to deprecate the dictionary overloading, so after this get's merged, iterating over the catalog itself, instead of using one of the provided methods, as well as directly accessing the sources dictionary, and using dictionary syntax on the catalog to retrieve or set sources is deprecated and should not be used. All current instances of that code in HydroMT have been updated (as far as I know)
Checklist
main
Additional Notes (optional)
There are some places in the code where I use the name
version
and some places where I use the nameprovider
. Currently this might be somewhat confusing but the reason for this is that very likely the next issue I'll be picking up is adding similar functionality but for versions of datasets (e.g.v2022
vsv2021
ofesa_worldcover
) that will use the same mechanism. I think the naming will make a lot more sense when that is added, but I didn't want to introduce a version in between that would then immediately break compatibility in the next update. For the purposes of this PR you can think of versions and providers as equivalent.