Many dangling datasets #817
Comments
I guess we can label 1 - 3 as users providing wrong data. The crawler should not overwrite the current dataset description and should write a non-200 status entry to the https://demo.netwerkdigitaalerfgoed.nl/registry/registrations graph. The dataset providers should be notified via mail (currently a manual process). For 4 and 5 (7552 dangling datasets) the issue is really bad. About 4, I can't find a schema:Entrypoint for this dataset either. But looking at the schema:dateReads the crawler was able to read this dangling dataset up to a week ago?
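The behaviour asked for here (keep the last good description, still record the failed read) can be pictured with a minimal Python sketch. The `Registration` class, its fields and `record_crawl` are hypothetical names used only for illustration; they are not the register's actual code or data model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Registration:
    """Hypothetical in-memory stand-in for one entry in the registrations graph."""
    url: str
    description: Optional[str] = None      # last dataset description that was read successfully
    status_code: Optional[int] = None      # HTTP status of the most recent crawl
    date_read: Optional[datetime] = None   # time of the most recent crawl


def record_crawl(registration: Registration, status_code: int,
                 new_description: Optional[str],
                 now: Optional[datetime] = None) -> Registration:
    """Always record the crawl result, but only replace the description on a 200."""
    registration.status_code = status_code
    registration.date_read = now or datetime.now(timezone.utc)
    if status_code == 200 and new_description is not None:
        registration.description = new_description
    # On any non-200 status the previous description is left untouched.
    return registration


reg = Registration(url="https://example.org/catalog", description="old description")
record_crawl(reg, 404, None)
print(reg.status_code, reg.description)
```

With this shape, a failing registration URL still shows up with its non-200 status, while the dataset keeps the description from its last successful read.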
I see similar patterns with other dangling datasets (with different last
Another strange registration (let's call it 6):
No |
Some datasets have a lot of |
They are never removed. We keep all of them for debugging purposes.
The crawler already does this? As you can see, old descriptions are preserved. Or do you mean something else?
This one looks fine, with a dateRead of last night. And it has a rating.
Good point. So apparently the registration was removed 5 or 6 Nov, but why? And who did this? It’s around the time #814 went live, but that doesn’t remove registrations.
Hi, regarding point 5: we did not change anything here. I asked Sjoerd, but he doesn't know anything about it either.
Ed
From: David de Boer ***@***.***>
Sent: Monday 13 November 2023 12:53
To: netwerk-digitaal-erfgoed/dataset-register ***@***.***>
CC: Heer, Ed de ***@***.***>; Mention ***@***.***>
Subject: Re: [netwerk-digitaal-erfgoed/dataset-register] Many dangling datasets (Issue #817)
The crawler should not overwrite the current dataset description and should write a non-200 status entry to the https://demo.netwerkdigitaalerfgoed.nl/registry/registrations graph.
The crawler already does this? As you can see, old descriptions are preserved. Or do you mean something else?
About 4, I can't find a schema:Entrypoint for this dataset either. But looking at the schema:dateReads the crawler was able to read this dangling dataset up to a week ago?
Good point. So apparently the registration was removed 5 or 6 Nov, but why? And who did this? It’s around the time #814 went live, but that doesn’t remove registrations.
@eddeheerna and @faina007 Please inform us about point 5 above.
As seen in this query, not all datasets have a rating yet. We previously thought this may be due to GraphDB crashing (netwerk-digitaal-erfgoed/infrastructure#50) but as it turns out there’s a way bigger reason: dangling datasets whose registration URL:
There are 7552 of these dangling datasets!
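To make the notion concrete, here is a minimal Python sketch of such a check. It assumes, based on the examples in this issue, that a dataset counts as dangling when its registration URL is gone or when the catalog at that URL no longer lists it. `find_dangling` and the shape of both inputs are hypothetical and not taken from the register's code.

```python
from typing import Dict, Optional, Set


def find_dangling(stored: Dict[str, Optional[str]],
                  catalogs: Dict[str, Set[str]]) -> Dict[str, str]:
    """Return dataset IRI -> reason for every dataset that looks dangling.

    stored:   dataset IRI -> registration URL it was read from (None if unknown)
    catalogs: registration URL -> dataset IRIs currently listed at that URL
    """
    dangling = {}
    for dataset, registration_url in stored.items():
        if registration_url is None or registration_url not in catalogs:
            dangling[dataset] = "registration URL missing"
        elif dataset not in catalogs[registration_url]:
            dangling[dataset] = "no longer listed in catalog"
    return dangling


# Hypothetical example data, not real register contents.
stored = {
    "https://example.org/dataset/a": "https://example.org/catalog",
    "https://example.org/dataset/b": "https://example.org/catalog",
    "https://example.org/dataset/c": None,
}
catalogs = {"https://example.org/catalog": {"https://example.org/dataset/a"}}
print(find_dangling(stored, catalogs))
```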
Examples:
1. https://data.spinque.com/ld/data/vangoghworldwide/datacatalog.jsonld is no longer valid, because of the
   While we allow strings, that should be structured in JSON-LD as
   "@value": …
   for it to pass the SHACL validation. And do we really want to allow values like this anyway?
2. https://raw.githubusercontent.com/netwerk-digitaal-erfgoed/dataset-register-entries/main/Picturae/catalog-picturae-schema-1.jsonld has invalid datetimes. Fixed in netwerk-digitaal-erfgoed/dataset-register-entries@6cd91b1.
3. https://data.dc4eu.nl/catalog/natag no longer contains http://data.dc4eu.nl/dataset/03b88faf-0273-4a5f-b554-8e4edb6d562e.
4. For none of the Collectienederland datasets, including http://data.collectienederland.nl/id/dataset/nederlands-openluchtmuseum, I can find a registration URL. I guess that would be https://data.collectienederland.nl/id/datacatalog, but it has either never been registered or later removed. Re-added the catalog.
5. I’m quite sure https://archief.nl/id/datacatalog/toegang has been registered because I checked it myself (Check paginated results from NA catalog #795) but now that registration URL has been removed. Is this us or has user admin-na done this? Re-added the catalog.

Either these catalogs later added invalid data or our SHACL got stricter in subtle ways that now prevent these errors.
@coret @faina007 @eddeheerna Please share what you know about these cases.
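For the "@value" form mentioned in the first example, a small Python sketch may help. In JSON-LD a literal can be written as a value object, that is, an object with an "@value" key. Which property is affected in the Spinque catalog is not stated above, so `http://schema.org/name` and the example IRI below are stand-ins, and `as_value_object` is a hypothetical helper.

```python
import json


def as_value_object(literal):
    """Wrap a plain JSON literal in a JSON-LD value object; leave objects untouched."""
    if isinstance(literal, dict):
        return literal
    return {"@value": literal}


# Hypothetical dataset description; the IRI and property are placeholders.
description = {
    "@id": "https://example.org/dataset/1",
    "http://schema.org/name": "Example dataset",
}
description["http://schema.org/name"] = as_value_object(description["http://schema.org/name"])
print(json.dumps(description, indent=2))
```

Whether the register should accept or normalise such values is the open question raised above; the sketch only shows the shape.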