-
Hi, I posted the following message yesterday to the pipelines mailing list, but it doesn't appear to be used much anymore, so I think I might have a better chance here. Apologies if this is not the right place; don't hesitate to redirect me.
Context: We are currently developing an Early Alert system for invasive species
It is crucial for our system that we can unambiguously distinguish new
Finally, the question: in the GBIF infrastructure, when exactly is a new occurrence GBIF ID issued
Or, more practically: what should I say to our data providers to guarantee
I'd also be interested to know if you think the concept of using GBIF ids
Replies: 4 comments
-
Hi @niconoe, apologies for the wait.
There are 2 main variations of gbifID assignment for DwC-A:
1) DwcTerm.occurrenceID and triplet (a combination of DwcTerm.institutionCode, DwcTerm.collectionCode and DwcTerm.catalogNumber). If the dataset contains unique occurrenceIDs and unique triplets, we create pairs in a key-value database:
If the data provider changes one of the 4 fields, we get a collision and drop those records from the index until the provider fixes the data, or until we break that pair and specify what to use as the main key (occurrenceID, triplet, or a newly generated ID).
2) DwcTerm.occurrenceID only. If the dataset contains only unique occurrenceIDs, there are no extra checks: a new occurrenceID means a new gbifID.
gbifID consistency is very important to us and we try to keep IDs stable, but of course we sometimes get collisions and new IDs when publishers change their data (a common issue is using a URL as the occurrenceID and then changing http://url to https://url).
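To make the two variations concrete, here is a minimal Python sketch of the lookup logic as described above. It is not the actual GBIF pipelines code: the `KeyStore` class, its in-memory dictionaries, the integer ID counter, and the `KeyCollision` exception are all hypothetical names chosen for illustration, and the sketch assumes a single dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# (institutionCode, collectionCode, catalogNumber)
Triplet = Tuple[str, str, str]


class KeyCollision(Exception):
    """Hypothetical error: occurrenceID and triplet disagree about the gbifID."""


@dataclass
class KeyStore:
    """Toy in-memory stand-in for the key-value database (illustration only)."""
    by_occurrence_id: Dict[str, int] = field(default_factory=dict)
    by_triplet: Dict[Triplet, int] = field(default_factory=dict)
    next_id: int = 1

    def _issue(self) -> int:
        # Hand out the next unused identifier.
        issued = self.next_id
        self.next_id += 1
        return issued

    def resolve(self, occurrence_id: str, triplet: Optional[Triplet] = None) -> int:
        id_from_occ = self.by_occurrence_id.get(occurrence_id)

        if triplet is None:
            # Variation 2: occurrenceID only, no extra checks.
            # An unseen occurrenceID simply gets a new ID.
            if id_from_occ is None:
                id_from_occ = self._issue()
                self.by_occurrence_id[occurrence_id] = id_from_occ
            return id_from_occ

        # Variation 1: occurrenceID and triplet are stored as a pair.
        id_from_triplet = self.by_triplet.get(triplet)
        if id_from_occ is None and id_from_triplet is None:
            new_id = self._issue()
            self.by_occurrence_id[occurrence_id] = new_id
            self.by_triplet[triplet] = new_id
            return new_id
        if id_from_occ == id_from_triplet:
            return id_from_occ
        # One of the 4 fields changed: the two keys no longer agree.
        raise KeyCollision(f"{occurrence_id!r} and {triplet!r} do not match")


store = KeyStore()
first = store.resolve("occ-1", ("INST", "COLL", "001"))
same = store.resolve("occ-1", ("INST", "COLL", "001"))   # unchanged record, same ID
http_id = store.resolve("http://example.org/occ/2")
https_id = store.resolve("https://example.org/occ/2")    # treated as a different key
```

In this sketch, switching the URL scheme from http to https yields a different identifier, which mirrors the issue mentioned above.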
-
Thanks a lot @muttcg, great explanation! Just to be sure I understood everything correctly, can you confirm that for my use case, to be sure I can track down "which occurrence is which" in my web app that consumes GBIF data, I should either:
Is my understanding correct? Thanks a lot for explaining the internals already, that's super helpful for building things on top of GBIF!
-
Hi!
I'd say using occurrenceID only is stable and predictable; we haven't had any issues with it other than those caused by changes in the data. The occurrenceID + triplet pairing was introduced mostly for non-DwC-A data standards such as ABCD/BioCASe. At some point, people started migrating data from non-DwC-A to DwC-A standards, and we wanted to keep the same gbifIDs.
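On the consumer side, one possible way to separate genuinely new occurrences from records whose gbifID was reissued is sketched below. This is only an illustration under assumptions: records are plain dictionaries, the field names `key`, `datasetKey` and `occurrenceID` are modeled on the GBIF occurrence API, and the fallback comparison on `(datasetKey, occurrenceID)` is a design choice of this sketch, not something GBIF prescribes. The function name `split_new_records` is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, object]


def split_new_records(
    previous: Sequence[Record], current: Sequence[Record]
) -> Tuple[List[Record], List[Record]]:
    """Return (new, reissued) records found in `current` but not in `previous`."""
    seen_gbif_ids = {rec["key"] for rec in previous}
    seen_pairs = {(rec["datasetKey"], rec["occurrenceID"]) for rec in previous}

    new: List[Record] = []
    reissued: List[Record] = []
    for rec in current:
        if rec["key"] in seen_gbif_ids:
            continue  # already known under the same gbifID
        if (rec["datasetKey"], rec["occurrenceID"]) in seen_pairs:
            # Same publisher identifier, different gbifID: probably a reissued ID.
            reissued.append(rec)
        else:
            new.append(rec)
    return new, reissued


yesterday = [
    {"key": 1, "datasetKey": "ds-a", "occurrenceID": "occ-1"},
    {"key": 2, "datasetKey": "ds-a", "occurrenceID": "occ-2"},
]
today = [
    {"key": 1, "datasetKey": "ds-a", "occurrenceID": "occ-1"},  # unchanged
    {"key": 9, "datasetKey": "ds-a", "occurrenceID": "occ-2"},  # gbifID changed
    {"key": 3, "datasetKey": "ds-a", "occurrenceID": "occ-3"},  # new occurrence
]
new_records, reissued_records = split_new_records(yesterday, today)
```

If a publisher rewrites its occurrenceIDs (for example http to https), both identifiers change at once and this sketch would report the record as new, so it does not remove the need for stable occurrenceIDs.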
-
Thanks a lot for the time you put into this, very helpful and appreciated!