Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: better data source version control #148

Closed
DirkEilander opened this issue Jul 28, 2022 · 2 comments · Fixed by #438
Closed

ENH: better data source version control #148

DirkEilander opened this issue Jul 28, 2022 · 2 comments · Fixed by #438
Assignees
Labels
DataCatalog & DataAdapters issues related to the DataCatalog and DataAdapters Enhancement New feature or request
Milestone

Comments

@DirkEilander
Copy link
Contributor

DirkEilander commented Jul 28, 2022

Feature Type

  • Changing existing functionality in HydroMT

Enhancement Description

The alias keyword in the data catalog was introduced for backwards compatibility while adding the version to the source name. However in our deltares_data catalog it does not always point to the latest version anymore which is confusing and changing it to point to the latest version breaks model building reproducibility for users. Instead I propose to add versioning explicit to each data source and also keep the version as part of the bookkeeping in the DataCatalog.sources dictionary

TO discuss: should we also remove support for alias?

Feature Description

see comment below for a possible implementation

Additional Context

No response

@DirkEilander DirkEilander added Enhancement New feature or request DataCatalog & DataAdapters issues related to the DataCatalog and DataAdapters labels Oct 6, 2022
@DirkEilander
Copy link
Contributor Author

DirkEilander commented Jun 28, 2023

#> this comment was initially posted by @savente93 in the wrong issue #331 (comment)

Having had a quick brainstorming session with @DirkEilander we discussed this in conjunction with #329 but given that this has potentially bigger implications for the project because it would be a breaking change we decided to leave the issues separate. Basically if we want to add versions of the different data sets we came up with three ways of doing this. Below we've used esa_worldcover as an example:

esa_worldcover:
    - crs: 4326
      data_type: RasterDataset
      driver: raster
      version: 2020
      kwargs:
        chunks:
          x: 36000
          y: 36000
      meta:
        category: landuse
        source_license: CC BY 4.0
        source_url: https://doi.org/10.5281/zenodo.5571936
      path: landuse/esa_worldcover/esa-worldcover-2020.vrt
    - crs: 4326
      version: 2021
      data_type: RasterDataset
      driver: raster
      kwargs:
        chunks:
          x: 36000
          y: 36000
      meta:
        category: landuse
        source_license: CC BY 4.0
        source_url: https://doi.org/10.5281/zenodo.5571936
      path: landuse/esa_worldcover/esa-worldcover-2021.vrt

In the example above we simply duplicate the entire entry but add an attribute for version so the use can specify which version they want. This would probably constitute the fewest changes in terms of the yaml interface but also leads to a lot of duplication and is not very easily human readable.

esa_worldcover:
  versions:
    - 2020:
      crs: 4326
      data_type: RasterDataset
      driver: raster
      kwargs:
        chunks:
          x: 36000
          y: 36000
      meta:
        category: landuse
        source_license: CC BY 4.0
        source_url: https://doi.org/10.5281/zenodo.5571936
      path: landuse/esa_worldcover/esa-worldcover-2020.vrt
    - 2021:
      crs: 4326
      data_type: RasterDataset
      driver: raster
      kwargs:
        chunks:
          x: 36000
          y: 36000
      meta:
        category: landuse
        source_license: CC BY 4.0
        source_url: https://doi.org/10.5281/zenodo.5571936
      path: landuse/esa_worldcover/esa-worldcover-2021.vrt

In this instance we have esa_worldcover and below it a mapping called versions, from which the user can choose. This means is conceptually a bit clearer in my opinion since the versions stay self-contained but is a bit easier to understand in terms of organisation. This would also mesh easily with the solution proposed for #329 . However it doesn't solve the duplication issue. To maintain backwards compatability, if version is not specificed we can just return the first/last instance, which would not matter in the cases where we have only one version, so this would not change current functionality (too much)

esa_worldcover:
  base: 
    crs: 4326
    data_type: RasterDataset
    driver: raster
    kwargs:
      chunks:
        x: 36000
        y: 36000
    meta:
      category: landuse
      source_license: CC BY 4.0
      source_url: https://doi.org/10.5281/zenodo.5571936
    path: landuse/esa_worldcover/esa-worldcover.vrt
  versions:
    - 2020:
      kwargs:
        chunks:
          x: 36000
          y: 36000
      path: landuse/esa_worldcover/esa-worldcover-2020.vrt
    - 2021:
      kwargs:
        chunks:
          x: 36000
          y: 36000
      path: landuse/esa_worldcover/esa-worldcover-2021.vrt

In the final iteration we have a base adapter for the catalog entry, and versions record changes to that. So if the user does not specify a version we could just return the base case to make sure we don't break backwards comparability, and if a version is specified, we simply make changes to the base (like the path) and return the adjusted version. The upside of this is that it makes the format leaner, since common keys like driver and crs only have to be recorded in the base version, and the versions only have to record things that are actually different. The downside of this is that it becomes harder to tell what exactly is being used and what all the properties used will be just by looking at the yaml. As I understand, this last one is the closest to how the alias is currently being used in some cases.

The choices here are mainly which will be best for the users. The implementations of all the above options are quite managable, so we'll have to decide which of these would be best depending on the users workflows. (Dirk, again, if I've forgotten or misremembered anything, please point it out)

@DirkEilander DirkEilander changed the title remove the alias from the catalog ENH: better data source version control Jun 28, 2023
@DirkEilander
Copy link
Contributor Author

I'm in favour of the last option from the examples above. In addition we need to make sure that sources without version (see below) are also still supported.

esa_worldcover:
  crs: 4326
  data_type: RasterDataset
  driver: raster
  kwargs:
    chunks:
      x: 36000
      y: 36000
  meta:
    category: landuse
    source_license: CC BY 4.0
    source_url: https://doi.org/10.5281/zenodo.5571936
  path: landuse/esa_worldcover/esa-worldcover.vrt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DataCatalog & DataAdapters issues related to the DataCatalog and DataAdapters Enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants