Analyse openaire similarity mechanism #9

Open
pvgenuchten opened this issue Aug 7, 2024 · 7 comments

@pvgenuchten
Contributor

OpenAIRE has a mechanism to identify which repositories include a given resource. The origins of a record are stored with the record, including a string identifier of the platform. That identifier should be used to retrieve the URL of the record, so users can click through from the record in SoilWise to one of its origins.

  • Examine how OpenAIRE advertises the origins
  • Identify whether OpenAIRE has a mechanism to find a URL by ID
  • Define a mapping between ID and URL for popular platforms (can also be based on DOI)
@BerkvensNick
Contributor

Documentation from OpenAIRE on their similarity mechanism:
docs

@BerkvensNick
Contributor

BerkvensNick commented Aug 14, 2024

OpenAIRE assigns internal identifiers for each object it collects. By default, the internal identifier is generated as sourcePrefix::md5(localId) where:

sourcePrefix is a namespace prefix of 12 chars assigned to the data source at registration time
localId is the identifier assigned to the object by the data source

docs
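
As a minimal sketch of the identifier scheme described above: the helpers below compose and split an identifier of the form sourcePrefix::md5(localId). The function names are hypothetical, and it is an assumption that the hash is the hex MD5 digest of the UTF-8 encoded localId; the OpenAIRE documentation should be checked before relying on this.

```python
import hashlib


def make_openaire_id(source_prefix: str, local_id: str) -> str:
    """Compose an identifier of the form sourcePrefix::md5(localId)."""
    # Assumption: hex MD5 digest over the UTF-8 bytes of localId.
    digest = hashlib.md5(local_id.encode("utf-8")).hexdigest()
    return f"{source_prefix}::{digest}"


def split_openaire_id(obj_identifier: str) -> tuple[str, str]:
    """Split an identifier into (sourcePrefix, hash)."""
    source_prefix, sep, digest = obj_identifier.partition("::")
    if not sep:
        raise ValueError(f"no '::' separator in {obj_identifier!r}")
    return source_prefix, digest
```

Splitting on the first "::" is enough to recover the sourcePrefix without knowing how a data source pads its prefix to 12 characters.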

@BerkvensNick
Contributor

BerkvensNick commented Aug 14, 2024

@pvgenuchten, I had a look at this issue. I'm not sure I fully understand it, but this is what I did:
I used the API "https://api.openaire.eu/search/datasets?format=json" to extract 900 records, then extracted the following fields from each record:

  • obj_identifier = item['header']['dri:objIdentifier']['$']
  • collectedfrom_1 = item['metadata']['oaf:entity']['oaf:result']['collectedfrom']['@name']
  • original_id = item['metadata']['oaf:entity']['oaf:result']['originalId']
  • collectedfrom_2 = item['metadata']['oaf:entity']['oaf:result']['children']['instance']['collectedfrom']['@name']
  • hostedby = item['metadata']['oaf:entity']['oaf:result']['children']['instance']['hostedby']['@name']
  • webresource = item['metadata']['oaf:entity']['oaf:result']['children']['instance']['webresource']['url']['$']

the results are in the attached csv
soilwise_openaire_doi.xls
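
The field extraction listed above can be sketched as follows. This is not the notebook code; `dig` and `extract_fields` are hypothetical helpers. It is an assumption that some levels (for example 'instance') may be returned either as a single object or as a list, in which case this sketch simply takes the first element.

```python
from typing import Any


def dig(obj: Any, *keys: str) -> Any:
    """Follow nested keys; return None as soon as a level is missing."""
    cur = obj
    for key in keys:
        # Assumption: a level may be a list; take its first element.
        if isinstance(cur, list):
            cur = cur[0] if cur else None
        if not isinstance(cur, dict):
            return None
        cur = cur.get(key)
    return cur


def extract_fields(item: dict) -> dict:
    """Pull the fields named in the comment out of one API record."""
    result = dig(item, "metadata", "oaf:entity", "oaf:result") or {}
    return {
        "obj_identifier": dig(item, "header", "dri:objIdentifier", "$"),
        "collectedfrom_1": dig(result, "collectedfrom", "@name"),
        "original_id": dig(result, "originalId"),
        "collectedfrom_2": dig(result, "children", "instance", "collectedfrom", "@name"),
        "hostedby": dig(result, "children", "instance", "hostedby", "@name"),
        "webresource": dig(result, "children", "instance", "webresource", "url", "$"),
    }
```

Returning None for missing paths keeps one incomplete record from aborting a run over hundreds of records.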

The origin of the record is the 'sourcePrefix' (a namespace prefix of 12 chars) of the obj_identifier, e.g. for _____OmicsDI::47167d2e7a363dcb907e77d4a5c948d7 the 'sourcePrefix' is '_____OmicsDI'.

This 'sourcePrefix' does not always seem to be unique to one source, e.g. 'doi_________' is used for DataCite, Crossref and Zenodo (see more info and examples at the bottom of the following webpage).

In some cases the sourcePrefix can be used to generate the URL from the ID, e.g. for the prefix 'doi_________' the recurring pattern is URL = 'https://doi.org/' + original_id.
(When extracting originalId from the API response there are 2 values; to construct the URL we have to select the value without the '50|' string pattern.)
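
The DOI rule above could look roughly like this. The function name is hypothetical, and it is an assumption that originalId arrives as a list of plain strings; the sketch does not check that the selected value is actually a DOI.

```python
from typing import Iterable, Optional


def doi_url_from_original_ids(original_ids: Iterable[str]) -> Optional[str]:
    """Return 'https://doi.org/' + the first value without the '50|' pattern."""
    for value in original_ids:
        # Values containing '50|' are skipped, as described in the comment.
        if "50|" not in value:
            return "https://doi.org/" + value
    return None
```

For instance, given one value containing '50|' and the value '10.1029/2018WR024608', the function returns 'https://doi.org/10.1029/2018WR024608'.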

But this is not always the case. See the first 3 records in the attached csv: _____OmicsDI is linked to the following record URLs: https://www.omicsdi.org/dataset/gpmdb/GPM11210027561, https://www.omicsdi.org/dataset/omics_ena_project/PRJNA267992, https://www.omicsdi.org/dataset/geo/GSE63974.
The path after "https://www.omicsdi.org/dataset" seems to be determined by the hostedby information ['The Global Proteome Machine Database' (/gpmdb/), 'European Nucleotide Archive' (/omics_ena_project/), 'Gene Expression Omnibus' (/geo/)].

For the mapping from ID to URL, I think this will be a combination of
'sourcePrefix' (+ hostedby) + originalId.
I think we can deduce this by extracting multiple records and applying rule-based logic keyed on the 'sourcePrefix'?
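
A rule-based mapping of this kind might be sketched as below, using only the two patterns observed in this thread. `build_record_url` and `OMICSDI_PATHS` are hypothetical names, and it is an assumption that the OmicsDI originalId equals the accession at the end of the URL (e.g. GSE63974). Prefixes without a known rule return None rather than a guessed URL.

```python
from typing import Optional

# hostedby name -> OmicsDI path segment, as observed in the three records above.
OMICSDI_PATHS = {
    "The Global Proteome Machine Database": "gpmdb",
    "European Nucleotide Archive": "omics_ena_project",
    "Gene Expression Omnibus": "geo",
}


def build_record_url(
    source_prefix: str, hostedby: Optional[str], original_id: str
) -> Optional[str]:
    """Derive a record URL from sourcePrefix (+ hostedby) + originalId."""
    if source_prefix == "doi_________":
        return f"https://doi.org/{original_id}"
    if source_prefix == "_____OmicsDI":
        segment = OMICSDI_PATHS.get(hostedby or "")
        if segment is None:
            return None  # unknown host: no rule yet
        # Assumption: originalId is the accession used in the URL.
        return f"https://www.omicsdi.org/dataset/{segment}/{original_id}"
    return None  # no rule defined for this prefix
```

Supporting another platform then means adding one branch (or one table entry) per sourcePrefix once its pattern has been confirmed from sample records.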

I have added the code I used to the notebook 'soilwise_openaire' in the GitHub repository.

I hope I understood the question correctly, if not let me know.

@pvgenuchten
Contributor Author

pvgenuchten commented Aug 28, 2024

Thank you for the work. The main question is: for any harvested record, can we capture the platforms on which it has been found and, on those platforms, the URL to access it? See for example this record. I think you are on the right track in introducing a concatenation pattern based on sourcePrefix. My suggestion would be to try it out for some of the popular platforms (CORDIS, Zenodo, Dataverse, GBIF), then evaluate whether it makes sense.

image

https://api.openaire.eu/search/publications?format=json&page=16&size=200

The information is apparently available in the collectedfrom and originalId sections, but there is no direct URL. I wonder if we can derive the direct URL for popular platforms from those 2 properties.

@pvgenuchten
Contributor Author

On the other hand, I like what OpenAIRE states here:

image

It's probably best to link only to formal PIDs, because localId seems unstable over time.

@pvgenuchten
Contributor Author

pvgenuchten commented Aug 28, 2024

My suggestion would be to close this issue, but document its findings (as evidence in our reports).

@BerkvensNick
Contributor

Hi @pvgenuchten, I probably don't fully understand the issue, but there seems to be a direct URL in the JSON file for each record, in the fields "children.result.instance.url" or "children.result.instance.webresources"?

image
image

In some cases you can also derive the direct URL from the collectedfrom and originalId fields. For the example you provided in the earlier comments, sourcePrefix = "openaire____" -> doi.org/10.1029/2018WR024608 ('sourcePrefix' + correct originalId).
But as mentioned, in some cases the direct URL will be based on 'sourcePrefix' (+ hostedby) + originalId, and it will take some effort to determine this. As you mentioned, we could do this for the more popular platforms.

Fine for me to close issue
