Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check all available checksum algorithm in DataVerse registry population #437

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

dokempf
Copy link
Contributor

@dokempf dokempf commented Oct 1, 2024

Fixes #435.

@dokempf
Copy link
Contributor Author

dokempf commented Oct 1, 2024

Failures are unrelated. Should I look into fixing them?

@santisoler
Copy link
Member

santisoler commented Oct 3, 2024

Hi @dokempf! Thanks for taking care of this.

I don't think this fix solves the issue: with it the populate_registry doesn't throws an error, but it doesn't build the registry either. For example, if you run this code you'll see that the registry is still empty:

import pooch

example = pooch.create(
    path=pooch.os_cache("example"),
    base_url="doi:10.34894/5SOKTV",
)
example.load_registry_from_doi()
print(example.registry)
{}

After inspecting the content of self.api_response.json()["data"]["latestVersion"]["files"][0] (assigned to filedata in the populate_registry method), I see that the dictionary doesn't contain the "md5" key anymore. Instead the checksum data is stored inside another dictionary under the "checksum" key:

{
    "id": 425558,
    "persistentId": "",
    "filename": "README.md",
    "contentType": "text/markdown",
    "friendlyType": "Markdown Text",
    "filesize": 2324,
    "storageIdentifier": "file://190251cc731-ca82ca1d341b",
    "rootDataFileId": -1,
    "checksum": {"type": "SHA-1", "value": "0e4b27fd3d76c75c37303f47e25d74ef407d0752"},
    "tabularData": False,
    "creationDate": "2024-06-17",
    "publicationDate": "2024-06-28",
    "fileAccessRequest": True,
}

I suspect there might be a change in the DataVerse API and how they return information about the checksum.

Therefore I think the fix should check if the filedata has either a "checksum" key and get the info from there, or if it has keys like the one you included in the fix. Would be nice to find some docs on the DataVerse API to understand the changes they made (sorry, I didn't search for it yet).

On a side note, don't worry about the pylint complain, I'll fix it in another PR. We should replace it for some modern alternative (ruff or flake8) at some point.

@santisoler
Copy link
Member

My bad. I just saw that you added a link to the docs in the issue. Thanks for that!

In the sample response they added in https://guides.dataverse.org/en/latest/api/native-api.html#import-a-dataset-into-a-dataverse-collection there's the "checksum" key I mentioned.

So, my take would be:

  • Maintain support for the "md5" key (in case DataVerse instances are using the old API).
  • Add support for other algorithm keys: "sha256", "sha1", etc.
  • Add support for the new API with the "checksum" key.
  • Raise a verbose error in case we couldn't find a checksum in the API response. I think if we couldn't populate the registry we should raise an error instead of quietly exiting.

@santisoler
Copy link
Member

BTW, I merged #438. If you update this branch you won't find the pylint error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DOI registry assumes md5 hashing algorithm
2 participants