Content-encoding SPARQL query (België) #772
Comments
Adding a hardcoded charset in the HTTP header would suffice. Otherwise, the receiving server doesn't know what type of encoding is sent. |
On second thought... what if the server doesn't handle the charset properly, or doesn't have a UTF-8 default encoding? Then it would be nice if the Network of Terms could provide a charset that the server will handle properly. In those cases a dataset parameter would be necessary. |
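The idea of adding a hardcoded charset to the header can be sketched as follows. This is an illustrative Python sketch, not the actual Network of Terms code; the endpoint URL is a placeholder:

```python
# Sketch (assumption, not the actual Network of Terms implementation):
# sending a SPARQL query with an explicit charset in the Content-Type
# header, so the server knows the body is UTF-8 instead of falling back
# to its own default encoding.
from urllib.parse import urlencode
from urllib.request import Request

query = 'SELECT * WHERE { ?s ?p "België" }'
body = urlencode({"query": query}).encode("utf-8")

req = Request(
    "https://example.org/sparql",  # hypothetical endpoint
    data=body,
    headers={
        # The charset parameter is the key addition; without it the
        # server may assume a non-UTF-8 default and mangle diacritics.
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
        "Accept": "text/turtle",
    },
    method="POST",
)
```

A per-dataset override, as suggested above, would then amount to substituting the charset value from the dataset description before building the header.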
What is |
@ddeboer construct_gtaa.rq is basically the gtaa.rq query, but it may include VALUES for ?query and ?datasetUri, variables that are filled in from within the Network of Terms. To be able to use a test query file we renamed it. In itself not that exciting. |
Currently these are the contents of the file:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX justskos: <http://justskos.org/ns/core#>
PREFIX text: <http://jena.apache.org/text#>
CONSTRUCT {
?uri a skos:Concept ;
skos:prefLabel ?prefLabel ;
skos:altLabel ?altLabel ;
skos:hiddenLabel ?hiddenLabel ;
skos:scopeNote ?scopeNote ;
skos:broader ?broader_uri ;
skos:narrower ?narrower_uri ;
skos:related ?related_uri .
?broader_uri skos:prefLabel ?broader_prefLabel .
?narrower_uri skos:prefLabel ?narrower_prefLabel .
?related_uri skos:prefLabel ?related_prefLabel .
}
WHERE {
VALUES ?query { "zelensky" }
VALUES ?datasetUri {
<http://data.beeldengeluid.nl/gtaa/Persoonsnamen>
}
?uri text:query (skos:prefLabel skos:altLabel skos:hiddenLabel ?query) .
?uri skos:inScheme ?datasetUri ;
justskos:status ?status .
FILTER(?status IN ('approved', 'candidate'))
OPTIONAL {
?uri skos:prefLabel ?prefLabel .
FILTER(LANG(?prefLabel) = "nl" )
}
OPTIONAL {
?uri skos:altLabel ?altLabel .
FILTER(LANG(?altLabel) = "nl")
}
OPTIONAL {
?uri skos:hiddenLabel ?hiddenLabel .
FILTER(LANG(?hiddenLabel) = "nl")
}
OPTIONAL {
?uri skos:scopeNote ?scopeNote .
FILTER(LANG(?scopeNote) = "nl")
}
OPTIONAL {
?uri skos:broader ?broader_uri .
?broader_uri skos:prefLabel ?broader_prefLabel .
FILTER(LANG(?broader_prefLabel) = "nl")
}
OPTIONAL {
?uri skos:narrower ?narrower_uri .
?narrower_uri skos:prefLabel ?narrower_prefLabel .
FILTER(LANG(?narrower_prefLabel) = "nl")
}
OPTIONAL {
?uri skos:related ?related_uri .
?related_uri skos:prefLabel ?related_prefLabel .
FILTER(LANG(?related_prefLabel) = "nl")
}
}
LIMIT 1000 |
For this issue it should be modified a bit:
VALUES ?query { "België" }
VALUES ?datasetUri {
<http://data.beeldengeluid.nl/gtaa/GeografischeNamen>
} |
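As a rough sketch of the templating described above (a hypothetical helper, not the actual Network of Terms implementation): the test VALUES bindings in the .rq file are replaced with the actual search term and dataset URI before the query is sent.

```python
# Hypothetical sketch of filling the VALUES placeholders in a .rq
# template; the real Network of Terms code may do this differently.

def fill_values(template: str, search_term: str, dataset_uri: str) -> str:
    """Replace the test VALUES bindings with the real term and dataset."""
    filled = template.replace(
        'VALUES ?query { "zelensky" }',
        f'VALUES ?query {{ "{search_term}" }}',
    )
    return filled.replace(
        "<http://data.beeldengeluid.nl/gtaa/Persoonsnamen>",
        f"<{dataset_uri}>",
    )

# The VALUES fragment from construct_gtaa.rq:
template = """VALUES ?query { "zelensky" }
VALUES ?datasetUri {
  <http://data.beeldengeluid.nl/gtaa/Persoonsnamen>
}"""

filled = fill_values(
    template, "België", "http://data.beeldengeluid.nl/gtaa/GeografischeNamen"
)
print(filled)
```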
For previous work on diacritics, see #426, netwerk-digitaal-erfgoed/network-of-terms-catalog#46 and netwerk-digitaal-erfgoed/network-of-terms-catalog#93. At least for Virtuoso sources (Adamlink), how diacritics are interpreted is out of our control. |
a tip from our developers... |
I ran into this issue again... This time I was using the reconciliation service and it didn't match names that were obviously listed in the GTAA. For example: sending a request to the GTAA SPARQL endpoint. As I understand it, at least for the GTAA case, the solution is not replacing all the single diacritics in the query (the service can handle diacritics), but making the NoT provide the proper encoding in the header to the terminology service. If the service is not somehow notified of the encoding (UTF-8), it is unaware of which code page to use, and diacritics fail because of the default ASCII code page. |
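The failure mode described here is easy to illustrate outside any SPARQL setup: if a server reads UTF-8 bytes under a single-byte codepage, each accented character turns into two wrong ones. A minimal Python illustration, unrelated to the actual GTAA server code:

```python
# "België" in UTF-8: the 'ë' (U+00EB) becomes the two bytes 0xC3 0xAB.
raw = "België".encode("utf-8")
assert raw == b"Belgi\xc3\xab"

# A server that is not told the charset may fall back to a single-byte
# encoding such as Latin-1, mangling the term before it hits the index:
wrong = raw.decode("latin-1")
assert wrong == "BelgiÃ«"  # two characters where 'ë' should be: no match
```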
I am considering exploring the option of having the proxy server insert a header field for the Network of Terms requests... |
As discussed here, it seems the GTAA SPARQL server doesn’t properly decode URIs. So, reducing the query to a minimal reproducible example, this works:
But sending this same query via URL-encoded POST doesn't work:
POST https://gtaa.apis.beeldengeluid.nl/sparql
Content-Type: application/x-www-form-urlencoded
query = PREFIX+skos%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0A%0ASELECT+*+WHERE+%7B%0A++++%3Chttp%3A%2F%2Fdata.beeldengeluid.nl%2Fgtaa%2F213504%3E+skos%3AprefLabel+%3FprefLabel+.%0A++++FILTER%28CONTAINS%28STR%28%3FprefLabel%29%2C+%22Alb%C3%A9%22%29%29%0A%7D

The latter method happens to be how our query client Comunica sends out queries, but it must be supported anyway by SPARQL servers according to the spec. The |
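For reference, the encoding of that POST body can be reproduced with the standard library. A sketch of the round trip (the query text is the decoded form of the body above):

```python
from urllib.parse import urlencode, unquote_plus

query = """PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT * WHERE {
    <http://data.beeldengeluid.nl/gtaa/213504> skos:prefLabel ?prefLabel .
    FILTER(CONTAINS(STR(?prefLabel), "Albé"))
}"""

# application/x-www-form-urlencoded: spaces become '+', newlines become
# %0A, and the two UTF-8 bytes of 'é' become %C3%A9.
body = urlencode({"query": query})
assert "%C3%A9" in body

# A compliant server must percent-decode the body (as UTF-8) before
# parsing the SPARQL; this round trip recovers the original query text.
assert unquote_plus(body[len("query="):]) == query
```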
Maybe a silly question, but does the use of this header: |
Not silly at all. 😄
While not required with
Most relevant, however, seems to be that the server is not doing proper percent-decoding. In JavaScript (e.g. your browser's console):
> encodeURI('é')
< "%C3%A9"
The server should then decode that back to
> decodeURI('%C3%A9')
< "é" |
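The same check in Python makes the charset dependence explicit: percent-decoding only yields the right character if the decoder assumes UTF-8. An illustration of the decoding step, not the server's actual code:

```python
from urllib.parse import quote, unquote

# Percent-encoding 'é' yields its two UTF-8 bytes, just like encodeURI:
assert quote("é") == "%C3%A9"

# Decoding with the right charset (UTF-8, the default) recovers 'é' ...
assert unquote("%C3%A9") == "é"

# ... but decoding with a single-byte charset produces mojibake:
assert unquote("%C3%A9", encoding="latin-1") == "Ã©"
```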
When I replace the header with |
Interesting! Handling HTTP requests, including
I wasn't able to reproduce this issue with a Fuseki 5.2.0 server running on a local data file:
POST http://localhost:3030/albeniz
Content-Type: application/x-www-form-urlencoded
Content-Length: 158
User-Agent: IntelliJ HTTP Client/IntelliJ IDEA 2024.3.1.1
Accept-Encoding: br, deflate, gzip, x-gzip
Accept: */*
query=PREFIX+e%3A+%3Chttp%3A%2F%2Fexample.com%2F%3E%0ASELECT+*+WHERE+%7B%0A++%3Fsub+e%3Aname+%3Fname+.%0A++FILTER%28%3Fname+%3D+%22Alb%C3%A9niz%22%29%0A%7D%0A
HTTP/1.1 200 OK
Date: Fri, 17 Jan 2025 12:23:52 GMT
Vary: Accept-Encoding
Vary: Origin
Fuseki-Request-Id: 5
Cache-Control: must-revalidate,no-cache,no-store
Pragma: no-cache
Content-Type: application/sparql-results+json; charset=utf-8
Content-Encoding: gzip
Transfer-Encoding: chunked
{
"head": {
"vars": [
"sub",
"name"
]
},
"results": {
"bindings": [
{
"sub": {
"type": "uri",
"value": "http://example.com/person"
},
"name": {
"type": "literal",
"value": "Albéniz"
}
}
]
}
}

Which Fuseki server version are you running? |
When searching for België in the GTAA no results are given, whilst searching for Belgie has, among others, België as a result.
Testing by @wmelder showed the following:
The query for België via the construct_gtaa.rq query run via
curl -H Accept:text/turtle --data-urlencode "query@queries/construct_gtaa.rq" 'https://{username}:{password}@gtaa.apis.beeldengeluid.nl/sparql'
yields no results, but
curl -H "Content-type: application/x-www-form-urlencoded; charset=utf-8" -H Accept:text/turtle --data-urlencode "query@queries/construct_gtaa.rq" 'https://{username}:{password}@gtaa.apis.beeldengeluid.nl/sparql'
does give results!
It seems the Comunica client (Network of Terms) sends UTF-8, but doesn't include a character encoding header, so server-side the body is interpreted with a default single-byte charset (US-ASCII or ISO-8859-1) instead of UTF-8.
Should/can the charset be part of the dataset description of the GTAA within the Network of Terms (a client-side solution)? Or should a default charset (UTF-8) be hardcoded in the Comunica call, with the option to override it via the dataset description?
Some other searches that have problems with terms with diacritics: Ampèrestraat (Adamlink) and Curaçaostraat (Gouda Tijdmachine). We haven't checked whether adding a charset helps with these sources.
Some other searches that do not have a problem with terms with diacritics: Eichstätt (WO2 thesaurus), Galileïsche (AAT), Henriëtte (RKDartists).