DOI registry assumes md5 hashing algorithm #435

Open
ionathan opened this issue Sep 11, 2024 · 2 comments · May be fixed by #437
Labels
bug Report a problem that needs to be fixed

Comments

ionathan commented Sep 11, 2024

Description of the problem:

While trying to load a registry from a dataverse.nl DOI, I found that this repository reports SHA-1 checksums. Pooch hard-codes the hash algorithm to md5 in `DataverseRepository.populate_registry`, so the lookup of the `md5` key fails.

Full code that generated the error

import pooch
example = pooch.create(
    path=pooch.os_cache("example"),
    base_url="doi:10.34894/5SOKTV",
)
example.load_registry_from_doi()

Full error message

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[20], line 1
----> 1 example.load_registry_from_doi()

File /usr/local/lib/python3.11/site-packages/pooch/core.py:704, in Pooch.load_registry_from_doi(self)
    701 repository = doi_to_repository(doi)
    703 # Call registry population for this repository
--> 704 return repository.populate_registry(self)

File /usr/local/lib/python3.11/site-packages/pooch/downloaders.py:1162, in DataverseRepository.populate_registry(self, pooch)
   1151 """
   1152 Populate the registry using the data repository's API
   1153 
   (...)
   1157     The pooch instance that the registry will be added to.
   1158 """
   1160 for filedata in self.api_response.json()["data"]["latestVersion"]["files"]:
   1161     pooch.registry[filedata["dataFile"]["filename"]] = (
-> 1162         f"md5:{filedata['dataFile']['md5']}"
   1163     )

KeyError: 'md5'
ionathan added the bug label (Report a problem that needs to be fixed) on Sep 11, 2024
Contributor

dokempf commented Sep 11, 2024

When I wrote this, I was not aware that Dataverse supports different checksum algorithms. I agree this should be fixed, but to do it properly we should first get the full picture of how Dataverse handles checksums.

Contributor

dokempf commented Oct 1, 2024

Apparently, Dataverse can be configured to work with one of four hashing algorithms: MD5, SHA-1, SHA-256, and SHA-512 (Source). There is an API route to check which one is in use, but it is only intended for uploads and gives no guarantee about which checksums are present on existing data. I therefore think our best bet is to iterate through a hard-coded list of keys until we find one that is present in the API response.
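A minimal sketch of that idea, not the actual fix: the helper name `find_checksum` and the `CANDIDATE_KEYS` list are hypothetical, and only the `md5` key is confirmed by the traceback above. The other key names are placeholders and would need to be checked against real Dataverse API responses. The output prefixes (`md5`, `sha1`, `sha256`, `sha512`) are the `hashlib` algorithm names that pooch registry entries use.

```python
# Hypothetical (API key, hashlib algorithm name) pairs, tried in order.
# Only "md5" is confirmed by the traceback; the other keys are assumptions
# that must be verified against real Dataverse responses.
CANDIDATE_KEYS = [
    ("md5", "md5"),
    ("sha1", "sha1"),
    ("sha256", "sha256"),
    ("sha512", "sha512"),
]


def find_checksum(datafile, candidates=CANDIDATE_KEYS):
    """Return a pooch-style 'algorithm:hexdigest' string for one file entry.

    `datafile` stands in for filedata["dataFile"] from the API response.
    """
    for key, algorithm in candidates:
        value = datafile.get(key)
        if value:
            # First key that is present and non-empty wins
            return f"{algorithm}:{value}"
    raise ValueError(
        f"No known checksum key found for file {datafile.get('filename')!r}"
    )


# Usage with made-up file entries (not real API output)
print(find_checksum({"filename": "a.csv", "md5": "0123abcd"}))
print(find_checksum({"filename": "b.csv", "sha1": "89abcdef"}))
```

Raising a clear error when no key matches seems preferable to the bare `KeyError: 'md5'`, since it names the file and makes it obvious that the repository uses a checksum pooch does not know about.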
