Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Support for multiple data catalogs with same source names #329

Closed
DirkEilander opened this issue Apr 20, 2023 · 4 comments · Fixed by #438
Closed

ENH: Support for multiple data catalogs with same source names #329

DirkEilander opened this issue Apr 20, 2023 · 4 comments · Fixed by #438
Assignees
Labels
DataCatalog & DataAdapters issues related to the DataCatalog and DataAdapters Enhancement New feature or request
Milestone

Comments

@DirkEilander
Copy link
Contributor

DirkEilander commented Apr 20, 2023

Feature Type

  • Adding new functionality to HydroMT
  • Changing existing functionality in HydroMT

Enhancement Description

When a DataCatalog is initialized based on multiple data catalog (yaml) files with the same source names, only the source of last data catalog will be kept and earlier sources with the same name overwritten.

By storing the sources internally as e.g., <data_catalog>/<source_name> names become unique, we avoid overwriting, and we can have the access to the same data with the same name from different sources.

Feature Description

  • DataCatalog.update requires a new data_catalog argument which should be used to set combined name.
  • DataCatalog.__getitem__ should be able to obtain data based on the just the source_name for backwards compatibility based on some logic which data_catalog to select.

Additional Context

We need to decide on the "couple sign" (/ in the example above) to combine the data_catalog and source_name

@DirkEilander DirkEilander added Enhancement New feature or request DataCatalog & DataAdapters issues related to the DataCatalog and DataAdapters labels Apr 20, 2023
@savente93 savente93 self-assigned this May 16, 2023
@savente93
Copy link
Contributor

also see #331

@savente93
Copy link
Contributor

Discussed possibilities around this issue with @DirkEilander and putting down a quick summary of the relevant bits here:

In essence there are three ways we could structure the sources dictionary to achieve this:

{
    'esa_worldcover_2020_v100': 
    {
        'aws': RasterDataAdapter,
        'deltares': RasterDataAdapter,
        'last': 'aws',
    },
    ...:...
}

{
    'aws': {
        'esa_worldcover_2020_v100':RasterDataAdapter,
        ...:...
    },
    'deltares': {
        'esa_worldcover_2020_v100': RasterDataAdapter,
        ...:...
    }
}

or

{
    'aws:esa_worldcover_2020_v100': RasterDataAdapter,
    'deltares:esa_worldcover_2020_v100' : RasterDataAdapter,
    'esa_worldcover_2020_v100': RasterDataAdapter,
    ...:...
}

After some brainstorming we decided to go for the first option. Fistly this is more efficient both computationally as well as conceptually (users are much more likely to ask for a dataset that might have different providers, than they are to ask for "all aws data". The last option would be the closes to the current implementation, but requires a lot of extra bookkeeping on the part of the user, whereas with a nested dictionary we can do pretty much all of that bookkeeping for them. It will also mesh better with an idea we have to introduce versions and how we want to change how the alias is used, but I'll detail that part more in the relevant question. To avoid feature creep I'll just focus on this level, and after that we should probably discuss implications of versions with the team down the line. (Dirk, feel free to comment if I've forgotten, or misunderstood anything)

@DirkEilander
Copy link
Contributor Author

DirkEilander commented Jun 28, 2023

In addition to the comment above. When users specify a source in the configuration yml file the user can specify the source. We should keep in mind that we might also want to specify the data source version in a similar manner going forward (see #148)

Option 1: in one string using e.g. <source name>:<data catalog name>. (check if : is a good separator that does not conflict with yaml formatting and common source names)

setup_lulcmaps:
  lulc:  esa_worldcover_2020_v100:aws

Option 2: explicit extra argument. This would mean that the DataCatalog.get_rasterdataset, DataCatalog.get_geodataset, etc. methods get a dictionary of arguments instead of a single string.

setup_lulcmaps:
  lulc:  
    source: esa_worldcover_2020_v100
    catalog: aws

@savente93
Copy link
Contributor

After another descussion we decided to start deprecating the __getitem__ and __setitem__ for adding and retrieving sources since that interface is too restrictive. In addition we discussed that we don't want to concatenate catalog names since we want to write things away in a similar format that users are asked to provide the catalogs, which means that we do need to move toward a nested version. Taking into account planned activities for #148 we decided on the following format:

esa_worldcover:
  base: 
    crs: 4326
    data_type: RasterDataset
    driver: raster
    kwargs:
      chunks:
        x: 36000
        y: 36000
    meta:
      category: landuse
      source_license: CC BY 4.0
      source_url: https://doi.org/10.5281/zenodo.5571936
    path: landuse/esa_worldcover/esa-worldcover.vrt
  versions:
    - version: 2020
      kwargs:
        chunks:
          x: 36000
          y: 36000
      path: landuse/esa_worldcover/esa-worldcover-2020.vrt
    - version: 2021
      kwargs:
        chunks:
          x: 36000
          y: 36000
      path: landuse/esa_worldcover/esa-worldcover-2021.vrt

(note that here the versions attr is a list, not a mapping)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DataCatalog & DataAdapters issues related to the DataCatalog and DataAdapters Enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants