Content-encoding SPARQL query (België) #772

Open
coret opened this issue Oct 21, 2022 · 16 comments

coret commented Oct 21, 2022

When searching for België in the GTAA, no results are returned, whilst searching for Belgie returns België, among others.

Testing by @wmelder showed the following:

The query for België via the construct_gtaa.rq query run via
curl -H Accept:text/turtle --data-urlencode "query@queries/construct_gtaa.rq" 'https://{username}:{password}@gtaa.apis.beeldengeluid.nl/sparql'
yields no results, but
curl -H "Content-type: application/x-www-form-urlencoded; charset=utf-8" -H Accept:text/turtle --data-urlencode "query@queries/construct_gtaa.rq" 'https://{username}:{password}@gtaa.apis.beeldengeluid.nl/sparql'
does give results!

It seems the Comunica client (Network of Terms) sends UTF-8 but doesn't include a character encoding in the Content-Type header, so the server appears to fall back to a non-UTF-8 default (US-ASCII or ISO-8859-1).
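A minimal sketch of what that mismatch would do to the search term, assuming the server reads each incoming byte as one ISO-8859-1 character. That server behaviour is a hypothesis here, not something confirmed for the GTAA endpoint.

```javascript
// Encode the search term as UTF-8, which is what the client puts on the wire.
const utf8Bytes = new TextEncoder().encode("België");

// Correct interpretation: decode the bytes as UTF-8.
const asUtf8 = new TextDecoder("utf-8").decode(utf8Bytes);

// Assumed server behaviour: treat every byte as one ISO-8859-1 character.
// ISO-8859-1 maps byte values 0-255 directly to U+0000-U+00FF.
const asLatin1 = String.fromCharCode(...utf8Bytes);

console.log(asUtf8);   // België
console.log(asLatin1); // BelgiÃ«
```

A term mangled like this would not match the stored label "België", which would be consistent with the empty result.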

Should / can the charset be part of the dataset description of the GTAA within the Network of Terms (client-side solution)? Or should a default charset (utf-8) be hardcoded in the Comunica call, with the option to override it via the dataset description?

Some other searches that have problems with terms containing diacritics: Ampèrestraat (Adamlink) and Curaçaostraat (Gouda Tijdmachine). I haven't checked whether adding a charset helps with these sources.

Some other searches that do not have problems with terms containing diacritics: Eichstätt (WO2 thesaurus), Galileïsche (AAT), Henriëtte (RKDartists).

wmelder commented Oct 21, 2022

Should / can the charset be part of the dataset description of the GTAA within the Network of Terms (client-side solution)? Or should a default charset (utf-8) be hardcoded in the Comunica call, with the option to override it via the dataset description?

Adding a hardcoded charset in the HTTP header would suffice. Otherwise, the receiving server doesn't know which encoding is being sent.

wmelder commented Oct 21, 2022

Adding a hardcoded charset in the HTTP header would suffice. Otherwise, the receiving server doesn't know which encoding is being sent.

On second thoughts... what if the server doesn't handle the charset properly? Or doesn't have a UTF-8 default encoding? Then it would be nice if the Network of Terms could provide a charset that the server handles properly. In those cases a dataset parameter would be necessary.
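A rough sketch of that idea: a default charset with an optional per-dataset override. The `charset` field and the `contentTypeFor` helper are hypothetical; the Network of Terms dataset description has no such field as far as this thread shows.

```javascript
// Hypothetical default, used when a dataset does not specify anything else.
const DEFAULT_CHARSET = "utf-8";

// `dataset.charset` is an assumed, optional field in the dataset description.
function contentTypeFor(dataset) {
  const charset = dataset.charset ?? DEFAULT_CHARSET;
  return `application/x-www-form-urlencoded; charset=${charset}`;
}

console.log(contentTypeFor({ name: "GTAA" }));
console.log(contentTypeFor({ name: "Legacy source", charset: "iso-8859-1" }));
```

Note that an override only makes sense if the request body is then actually encoded in that charset; changing the header alone would just move the mismatch.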

ddeboer commented Oct 25, 2022

What is construct_gtaa.rq and where can I find it?

wmelder commented Oct 25, 2022

@ddeboer construct_gtaa.rq is basically the gtaa.rq query, but it may include VALUES for query and datasetUri, the variables that are normally filled in by the Network of Terms. We renamed it so that we could use it as a test query file. Nothing exciting in itself.

wmelder commented Oct 25, 2022

Currently these are the contents of the file:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX justskos: <http://justskos.org/ns/core#>
PREFIX text: <http://jena.apache.org/text#>

CONSTRUCT {
    ?uri a skos:Concept ;
        skos:prefLabel ?prefLabel ;
        skos:altLabel ?altLabel ;
        skos:hiddenLabel ?hiddenLabel ;
        skos:scopeNote ?scopeNote ;
        skos:broader ?broader_uri ;
        skos:narrower ?narrower_uri ;
        skos:related ?related_uri .
    ?broader_uri skos:prefLabel ?broader_prefLabel .
    ?narrower_uri skos:prefLabel ?narrower_prefLabel .
    ?related_uri skos:prefLabel ?related_prefLabel .
}
WHERE {
    VALUES ?query { "zelensky" }
    VALUES ?datasetUri {
        <http://data.beeldengeluid.nl/gtaa/Persoonsnamen>
        }
    ?uri text:query (skos:prefLabel skos:altLabel skos:hiddenLabel ?query) .
    ?uri skos:inScheme ?datasetUri ;
        justskos:status ?status .
    FILTER(?status IN ('approved', 'candidate'))

    OPTIONAL {
        ?uri skos:prefLabel ?prefLabel .
        FILTER(LANG(?prefLabel) = "nl" )
    }
    OPTIONAL {
        ?uri skos:altLabel ?altLabel .
        FILTER(LANG(?altLabel) = "nl")
    }
    OPTIONAL {
        ?uri skos:hiddenLabel ?hiddenLabel .
        FILTER(LANG(?hiddenLabel) = "nl")
    }
    OPTIONAL {
        ?uri skos:scopeNote ?scopeNote .
        FILTER(LANG(?scopeNote) = "nl")
    }
    OPTIONAL {
        ?uri skos:broader ?broader_uri .
        ?broader_uri skos:prefLabel ?broader_prefLabel .
        FILTER(LANG(?broader_prefLabel) = "nl")
    }
    OPTIONAL {
        ?uri skos:narrower ?narrower_uri .
        ?narrower_uri skos:prefLabel ?narrower_prefLabel .
        FILTER(LANG(?narrower_prefLabel) = "nl")
    }
    OPTIONAL {
        ?uri skos:related ?related_uri .
        ?related_uri skos:prefLabel ?related_prefLabel .
        FILTER(LANG(?related_prefLabel) = "nl")
    }
}
LIMIT 1000

wmelder commented Oct 25, 2022

For this issue it should be modified a bit:

    VALUES ?query { "België" }
    VALUES ?datasetUri {
        <http://data.beeldengeluid.nl/gtaa/GeografischeNamen>
        }

ddeboer self-assigned this Oct 25, 2022

ddeboer commented Oct 25, 2022

For previous work on diacritics, see #426, netwerk-digitaal-erfgoed/network-of-terms-catalog#46 and netwerk-digitaal-erfgoed/network-of-terms-catalog#93. At least for Virtuoso sources (Adamlink), how diacritics are interpreted is out of our control.

wmelder commented Oct 25, 2022

The SPARQL documentation states that a POST with application/sparql-query is always in UTF-8. For a POST with x-www-form-urlencoded, it says no such thing. It may be better to use the application/sparql-query variant (with unescaped UTF-8, that is).

A tip from our developers...
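To make the difference between the two POST variants concrete, here is a small sketch that only builds the two requests and does not send them. The query text is a made-up illustration, not the real gtaa.rq, and the explicit charset on the form variant is the workaround discussed in this thread rather than something the SPARQL protocol requires.

```javascript
const endpoint = "https://gtaa.apis.beeldengeluid.nl/sparql";

// Illustrative query only; the real query lives in gtaa.rq.
const query = 'SELECT * WHERE { ?s ?p "België"@nl } LIMIT 1';

// Variant 1: query via POST directly. The body is the unencoded query string.
const directRequest = {
  method: "POST",
  headers: {
    "Content-Type": "application/sparql-query",
    Accept: "application/sparql-results+json",
  },
  body: query,
};

// Variant 2: URL-encoded POST. The query is percent-encoded into a form field.
const formRequest = {
  method: "POST",
  headers: {
    "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
    Accept: "application/sparql-results+json",
  },
  body: new URLSearchParams({ query }).toString(),
};

console.log(directRequest.body);
console.log(formRequest.body);

// To actually send one of them (not done here):
// const response = await fetch(endpoint, directRequest);
```

In the second variant the ë travels as %C3%AB, so the server has to percent-decode it and then pick a charset for the resulting bytes; the first variant skips the percent-encoding step entirely.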

wmelder commented Jan 14, 2025

I ran into this issue again... This time I was using the reconciliation service and it didn't match names that are clearly listed in the GTAA. For example, "Isaac Albéniz" could not be found.

Sending a request directly to the GTAA SPARQL endpoint "https://gtaa.apis.beeldengeluid.nl/sparql" does return the result properly. I noticed the HTTP response headers contain: 'Content-Type': 'text/turtle;charset=utf-8'. I am not sure how the server 'knows' which encoding is sent, but the request header contains "Accept": "application/sparql-results+json", so maybe that is enough for the server to determine the encoding of the data sent. Maybe too much information, but this is the url for the GET request sent to the server.

As I understand it, at least for the GTAA case, the solution is not to replace all the diacritics in the query (the service can handle diacritics), but to make the NoT provide the proper encoding in the header sent to the terminology service. If the service is not notified of the encoding (utf-8), it does not know which code page to use, and diacritics fail because of the default ASCII code page.

wmelder commented Jan 14, 2025

I am considering having the proxy server insert a header field for the Network of Terms requests...
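The header rewrite such a proxy would need could look roughly like this. It is a sketch of the logic only, written in JavaScript for illustration; `withDefaultCharset` is a hypothetical helper, and how it would be wired into the actual proxy depends on the proxy software, which is not named here.

```javascript
// Append a charset to form-encoded Content-Type values that lack one.
// Other media types and values that already carry a charset are left untouched.
function withDefaultCharset(contentType, charset = "utf-8") {
  if (!contentType) return contentType;
  const mediaType = contentType.split(";")[0].trim().toLowerCase();
  const isForm = mediaType === "application/x-www-form-urlencoded";
  const hasCharset = /;\s*charset\s*=/i.test(contentType);
  return isForm && !hasCharset
    ? `${contentType.trim()}; charset=${charset}`
    : contentType;
}

console.log(withDefaultCharset("application/x-www-form-urlencoded"));
console.log(withDefaultCharset("application/x-www-form-urlencoded;charset=utf-8"));
console.log(withDefaultCharset("application/sparql-query"));
```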

ddeboer commented Jan 16, 2025

As discussed here, it seems the GTAA SPARQL server doesn’t properly decode URIs.

So, reducing the query to a minimal reproducible example, this works:

POST https://gtaa.apis.beeldengeluid.nl/sparql
Content-Type: application/sparql-query

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT * WHERE {
    <http://data.beeldengeluid.nl/gtaa/213504> skos:prefLabel ?prefLabel .
    FILTER(CONTAINS(STR(?prefLabel), "Albé"))
}

But sending this same query via URL-encoded POST doesn’t work:

POST https://gtaa.apis.beeldengeluid.nl/sparql
Content-Type: application/x-www-form-urlencoded

query = PREFIX+skos%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0A%0ASELECT+*+WHERE+%7B%0A++++%3Chttp%3A%2F%2Fdata.beeldengeluid.nl%2Fgtaa%2F213504%3E+skos%3AprefLabel+%3FprefLabel+.%0A++++FILTER%28CONTAINS%28STR%28%3FprefLabel%29%2C+%22Alb%C3%A9%22%29%29%0A%7D

The latter method happens to be how our query client Comunica sends out queries, but must be supported anyway by SPARQL servers according to the spec.

The é is properly URI-encoded as %C3%A9, but the SPARQL server isn’t decoding that. @wmelder Can you see what’s going on in your Fuseki server or any proxies in front of that?

wmelder commented Jan 16, 2025

Maybe a silly question, but does the use of this header:
Content-Type: application/x-www-form-urlencoded
imply that the data in the POST request is UTF-8? Is that in a specification?

ddeboer commented Jan 16, 2025

Not silly at all. 😄

application/x-www-form-urlencoded is specified in the URL Standard.

While not required with application/x-www-form-urlencoded, UTF-8 is usually assumed.

Most relevant, however, seems to be that the server is not doing proper percent-decoding. In JavaScript (e.g. your browser’s console):

> encodeURI('é')
< "%C3%A9"

%C3%A9 is what we see in the request above.

The server should then decode that back to é. In JavaScript:

> decodeURI('%C3%A9')
< "é"

wmelder commented Jan 16, 2025

When I replace the header with 'Content-type': 'application/x-www-form-urlencoded;charset=utf-8', I do get proper results. That makes me believe that the decoding is done by the server. But it doesn't do that properly when no charset is given.
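That observation fits a two-step picture of decoding: percent-decoding only yields bytes, and a charset is still needed to turn those bytes into characters. A small sketch, in which `percentDecodeToBytes` is a hypothetical helper and the ISO-8859-1 fallback is an assumption about the server, not a confirmed fact:

```javascript
// Step 1: turn a form-encoded value into raw bytes, without choosing a charset.
// Assumes well-formed input consisting of ASCII characters and %XX escapes.
function percentDecodeToBytes(input) {
  const bytes = [];
  for (let i = 0; i < input.length; i++) {
    if (input[i] === "%") {
      bytes.push(parseInt(input.slice(i + 1, i + 3), 16));
      i += 2; // skip the two hex digits
    } else if (input[i] === "+") {
      bytes.push(0x20); // "+" means space in form encoding
    } else {
      bytes.push(input.charCodeAt(i));
    }
  }
  return Uint8Array.from(bytes);
}

const bytes = percentDecodeToBytes("Alb%C3%A9niz");

// Step 2a: interpret the bytes as UTF-8 (what the client intended).
const utf8 = new TextDecoder("utf-8").decode(bytes);

// Step 2b: interpret each byte as one ISO-8859-1 character (assumed fallback).
const latin1 = String.fromCharCode(...bytes);

console.log(utf8);   // Albéniz
console.log(latin1); // AlbÃ©niz
```

If the server takes path 2b whenever no charset is given, the percent-decoding itself would be fine and only the charset default would be off.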

ddeboer commented Jan 17, 2025

Interesting!

Handling of HTTP requests, including application/x-www-form-urlencoded ones, should default to UTF-8, so there may be something wrong in your Fuseki config.

I wasn’t able to reproduce this issue with a Fuseki 5.2.0 server running on local data file albeniz.ttl:

PREFIX e:<http://example.com/>

e:person e:name "Albéniz" .

fuseki-server --file albeniz.ttl /albeniz

POST http://localhost:3030/albeniz
Content-Type: application/x-www-form-urlencoded
Content-Length: 158
User-Agent: IntelliJ HTTP Client/IntelliJ IDEA 2024.3.1.1
Accept-Encoding: br, deflate, gzip, x-gzip
Accept: */*

query=PREFIX+e%3A+%3Chttp%3A%2F%2Fexample.com%2F%3E%0ASELECT+*+WHERE+%7B%0A++%3Fsub+e%3Aname+%3Fname+.%0A++FILTER%28%3Fname+%3D+%22Alb%C3%A9niz%22%29%0A%7D%0A

HTTP/1.1 200 OK
Date: Fri, 17 Jan 2025 12:23:52 GMT
Vary: Accept-Encoding
Vary: Origin
Fuseki-Request-Id: 5
Cache-Control: must-revalidate,no-cache,no-store
Pragma: no-cache
Content-Type: application/sparql-results+json; charset=utf-8
Content-Encoding: gzip
Transfer-Encoding: chunked

{
  "head": {
    "vars": [
      "sub",
      "name"
    ]
  },
  "results": {
    "bindings": [
      {
        "sub": {
          "type": "uri",
          "value": "http://example.com/person"
        },
        "name": {
          "type": "literal",
          "value": "Albéniz"
        }
      }
    ]
  }
}

Which Fuseki server version are you running?
