Many dangling datasets #817

Open
ddeboer opened this issue Nov 11, 2023 · 4 comments

@ddeboer
Member

ddeboer commented Nov 11, 2023

As seen in this query, not all datasets have a rating yet. We previously thought this may be due to GraphDB crashing (netwerk-digitaal-erfgoed/infrastructure#50) but as it turns out there’s a way bigger reason: dangling datasets whose registration URL:

  • no longer returns a valid dataset description/catalog of descriptions
  • or no longer contains the dataset.

There are 7552 of these dangling datasets!
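A minimal sketch of that definition, assuming the latest crawl result per registration URL is available as either the set of dataset IRIs found there or `None` when the URL did not return a valid description. The function name and data shapes are hypothetical, not the register's actual code:

```python
from __future__ import annotations

from typing import Optional


def find_dangling(
    known: dict[str, str],
    crawled: dict[str, Optional[set[str]]],
) -> set[str]:
    """Return dataset IRIs that are no longer backed by their registration URL.

    known:   dataset IRI -> registration URL it was last read from
    crawled: registration URL -> dataset IRIs found in the latest crawl,
             or None if the URL did not return a valid description
    """
    dangling = set()
    for dataset, registration_url in known.items():
        # A registration URL missing from `crawled` also yields None here.
        found = crawled.get(registration_url)
        # Dangling if the description is invalid or missing, or if the
        # catalog is valid but no longer lists the dataset.
        if found is None or dataset not in found:
            dangling.add(dataset)
    return dangling
```

With this shape, a dataset whose registration URL has disappeared altogether is reported as dangling too, because the lookup falls back to `None`.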

Examples:

  1. https://data.spinque.com/ld/data/vangoghworldwide/datacatalog.jsonld is no longer valid, because of the

    "license": "not specified",

    While we allow strings, the value should be structured in JSON-LD as "@value": … for it to pass SHACL validation. And do we really want to allow values like this anyway?

  2. https://raw.githubusercontent.com/netwerk-digitaal-erfgoed/dataset-register-entries/main/Picturae/catalog-picturae-schema-1.jsonld has invalid datetimes. Fixed in netwerk-digitaal-erfgoed/dataset-register-entries@6cd91b1.

  3. https://data.dc4eu.nl/catalog/natag no longer contains http://data.dc4eu.nl/dataset/03b88faf-0273-4a5f-b554-8e4edb6d562e.

  4. For none of the Collectienederland datasets, including http://data.collectienederland.nl/id/dataset/nederlands-openluchtmuseum, I can find a registration URL. I guess that would be https://data.collectienederland.nl/id/datacatalog, but it has either never been registered or later removed. Re-added the catalog.

  5. I’m quite sure https://archief.nl/id/datacatalog/toegang has been registered because I checked it myself (Check paginated results from NA catalog #795) but now that registration URL has been removed. Is this us or has user admin-na done this? Re-added the catalog.
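For example 1, the difference between a bare string and a structured value can be illustrated with a small sketch. It assumes that a non-IRI license has to be written as a JSON-LD value object to be read as a literal; the helper name is hypothetical and this is not how the register or the publisher processes descriptions:

```python
import json


def license_as_value_object(description: dict) -> dict:
    """Return a copy in which a bare-string license becomes {"@value": ...}.

    Strings that look like http(s) IRIs and already-structured values are
    left untouched.
    """
    result = dict(description)
    value = result.get("license")
    if isinstance(value, str) and not value.startswith(("http://", "https://")):
        result["license"] = {"@value": value}
    return result


print(json.dumps(license_as_value_object({"license": "not specified"})))
# prints {"license": {"@value": "not specified"}}
```

This only shows the shape of the structured value; whether a value such as "not specified" should be accepted at all is a separate question.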

Either these catalogs later added invalid data or our SHACL got stricter in subtle ways that now cause these errors.

@coret @faina007 @eddeheerna Please share what you know about these cases.

@ddeboer ddeboer moved this to 🏗 In progress in Dataset Register Nov 11, 2023
@ddeboer ddeboer changed the title Not all datasets have a rating yet because of crawler crashes Not all datasets have a rating yet Nov 11, 2023
@ddeboer ddeboer changed the title Not all datasets have a rating yet Many dangling datasets Nov 11, 2023
@coret
Contributor

coret commented Nov 13, 2023

I guess we can label 1 - 3 as users providing wrong data. The crawler should not overwrite the current dataset description and should write a non-200 status entry to the https://demo.netwerkdigitaalerfgoed.nl/registry/registrations graph. The dataset providers should be notified via mail (currently a manual process).

For 4 and 5 (7552 dangling datasets) the issue is really bad.

About 4, I can't find a schema:EntryPoint for this dataset either. But looking at the schema:dateReads, the crawler was able to read this dangling dataset up to a week ago?

<http://data.collectienederland.nl/id/dataset/nederlands-openluchtmuseum>
        <http://schema.org/dateRead>  "2022-01-18T12:03:05.129Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        <http://schema.org/dateRead>  "2022-01-19T13:02:40.944Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        ....
        <http://schema.org/dateRead>  "2023-11-03T14:01:48.449Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        <http://schema.org/dateRead>  "2023-11-04T15:01:48.017Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        <http://schema.org/dateRead>  "2023-11-05T16:02:04.652Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        a                             <http://schema.org/Dataset> .

I see similar patterns with other dangling datasets (with different last schema:dateRead dates), like:

<https://opendata.picturae.com/dataset/wba_a2a_na_a>
        <http://schema.org/dateRead>  "2021-11-09T11:24:19.878Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
		...
        <http://schema.org/dateRead>  "2022-01-25T19:00:46.584Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        a                             <http://schema.org/Dataset> .
 
<https://www.goudatijdmachine.nl/data/api/items/13000>
        <http://schema.org/dateRead>  "2021-10-25T14:32:06.527Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
		...
        <http://schema.org/dateRead>  "2023-11-09T22:01:12.372Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        a                             <http://schema.org/Dataset> .
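To compare these patterns across datasets, the last schema:dateRead per dataset can be extracted. A rough sketch, assuming the dateRead literals have already been fetched into a plain mapping; the function names and the seven-day threshold are hypothetical:

```python
from __future__ import annotations

from datetime import datetime, timedelta

# Matches literals such as "2023-11-05T16:02:04.652Z".
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def last_reads(reads: dict[str, list[str]]) -> dict[str, datetime]:
    """Map each dataset IRI to its most recent dateRead."""
    return {
        dataset: max(datetime.strptime(value, DATE_FORMAT) for value in values)
        for dataset, values in reads.items()
        if values
    }


def not_read_since(
    reads: dict[str, list[str]],
    now: datetime,
    max_age: timedelta = timedelta(days=7),
) -> dict[str, datetime]:
    """Return datasets whose most recent dateRead is older than max_age."""
    return {
        dataset: last
        for dataset, last in last_reads(reads).items()
        if now - last > max_age
    }
```

`now` must be timezone-aware, because the parsed dateRead values carry a UTC offset.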

Another strange registration (let's call it 6):

<https://data.netwerkdigitaalerfgoed.nl/Peace-Palace-Library/Peace-Movement-collection/>
        <http://schema.org/about>       <https://data.netwerkdigitaalerfgoed.nl/Peace-Palace-Library/Peace-Movement-collection> ;
        <http://schema.org/datePosted>  "2023-06-19T12:39:59.752Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .

No schema:Dataset or schema:EntryPoint, and (I assume because of this) no schema:dateReads.

@coret
Contributor

coret commented Nov 13, 2023

Some datasets have a lot of schema:dateReads. I guess "old" ones are only removed after a successful read?

@ddeboer
Member Author

ddeboer commented Nov 13, 2023

Some datasets have a lot of schema:dateReads. I guess "old" ones are only removed after a successful read?

They are never removed. We keep all of them for debugging purposes.

The crawler should not overwrite the current dataset description and should write a non-200 status entry to the https://demo.netwerkdigitaalerfgoed.nl/registry/registrations graph.

The crawler already does this? As you can see, old descriptions are preserved. Or do you mean something else?

Another strange registration (let's call it 6):

This one looks fine, with a dateRead of last night. And it has a rating.

About 4, I can't find a schema:EntryPoint for this dataset either. But looking at the schema:dateReads, the crawler was able to read this dangling dataset up to a week ago?

Good point. So apparently the registration was removed on 5 or 6 Nov, but why? And who did this? It’s around the time #814 went live, but that doesn’t remove registrations.

@eddeheerna

eddeheerna commented Nov 14, 2023 via email
