-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: better data source version control #148
Comments
#> this comment was initially posted by @savente93 in the wrong issue #331 (comment) Having had a quick brainstorming session with @DirkEilander we discussed this in conjunction with #329 but given that this has potentially bigger implications for the project because it would be a breaking change we decided to leave the issues separate. Basically if we want to add versions of the different data sets we came up with three ways of doing this. Below we've used esa_worldcover:
- crs: 4326
data_type: RasterDataset
driver: raster
version: 2020
kwargs:
chunks:
x: 36000
y: 36000
meta:
category: landuse
source_license: CC BY 4.0
source_url: https://doi.org/10.5281/zenodo.5571936
path: landuse/esa_worldcover/esa-worldcover-2020.vrt
- crs: 4326
version: 2021
data_type: RasterDataset
driver: raster
kwargs:
chunks:
x: 36000
y: 36000
meta:
category: landuse
source_license: CC BY 4.0
source_url: https://doi.org/10.5281/zenodo.5571936
path: landuse/esa_worldcover/esa-worldcover-2021.vrt In the example above we simply duplicate the entire entry but add an attribute for version so the use can specify which version they want. This would probably constitute the fewest changes in terms of the yaml interface but also leads to a lot of duplication and is not very easily human readable. esa_worldcover:
versions:
- 2020:
crs: 4326
data_type: RasterDataset
driver: raster
kwargs:
chunks:
x: 36000
y: 36000
meta:
category: landuse
source_license: CC BY 4.0
source_url: https://doi.org/10.5281/zenodo.5571936
path: landuse/esa_worldcover/esa-worldcover-2020.vrt
- 2021:
crs: 4326
data_type: RasterDataset
driver: raster
kwargs:
chunks:
x: 36000
y: 36000
meta:
category: landuse
source_license: CC BY 4.0
source_url: https://doi.org/10.5281/zenodo.5571936
path: landuse/esa_worldcover/esa-worldcover-2021.vrt In this instance we have esa_worldcover:
base:
crs: 4326
data_type: RasterDataset
driver: raster
kwargs:
chunks:
x: 36000
y: 36000
meta:
category: landuse
source_license: CC BY 4.0
source_url: https://doi.org/10.5281/zenodo.5571936
path: landuse/esa_worldcover/esa-worldcover.vrt
versions:
- 2020:
kwargs:
chunks:
x: 36000
y: 36000
path: landuse/esa_worldcover/esa-worldcover-2020.vrt
- 2021:
kwargs:
chunks:
x: 36000
y: 36000
path: landuse/esa_worldcover/esa-worldcover-2021.vrt In the final iteration we have a base adapter for the catalog entry, and versions record changes to that. So if the user does not specify a version we could just return the base case to make sure we don't break backwards comparability, and if a version is specified, we simply make changes to the base (like the path) and return the adjusted version. The upside of this is that it makes the format leaner, since common keys like driver and crs only have to be recorded in the base version, and the versions only have to record things that are actually different. The downside of this is that it becomes harder to tell what exactly is being used and what all the properties used will be just by looking at the yaml. As I understand, this last one is the closest to how the alias is currently being used in some cases. The choices here are mainly which will be best for the users. The implementations of all the above options are quite managable, so we'll have to decide which of these would be best depending on the users workflows. (Dirk, again, if I've forgotten or misremembered anything, please point it out) |
I'm in favour of the last option from the examples above. In addition we need to make sure that sources without version (see below) are also still supported. esa_worldcover:
crs: 4326
data_type: RasterDataset
driver: raster
kwargs:
chunks:
x: 36000
y: 36000
meta:
category: landuse
source_license: CC BY 4.0
source_url: https://doi.org/10.5281/zenodo.5571936
path: landuse/esa_worldcover/esa-worldcover.vrt |
Feature Type
Enhancement Description
The
alias
keyword in the data catalog was introduced for backwards compatibility while adding the version to the source name. However in our deltares_data catalog it does not always point to the latest version anymore which is confusing and changing it to point to the latest version breaks model building reproducibility for users. Instead I propose to add versioning explicit to each data source and also keep the version as part of the bookkeeping in the DataCatalog.sources dictionaryTO discuss: should we also remove support for
alias
?Feature Description
see comment below for a possible implementation
Additional Context
No response
The text was updated successfully, but these errors were encountered: