Squerall Basics
Semantic Data Lake is an effort to enable querying the wealth of heterogeneous data stored in a Data Lake using Semantic Web techniques. It makes use of time-proven Semantic Web principles: a mapping language (as part of Ontology-Based Data Access) and the SPARQL query language.
To query the data lake using the Semantic Data Lake approach, users need to provide three inputs: (1) Mappings file, (2) Config file, and (3) a SPARQL query, described in the next three sections.
A virtual schema is added to the Data Lake by mapping data elements (e.g., tables and attributes) to ontology concepts (e.g., classes and predicates). We use RML mappings to express those schema mapping links.
An example of such mappings is given below. It maps a collection named Offer (rml:source "//Offer") in a MongoDB database to an ontology class Offer (rr:class schema:Offer), meaning that every document in the Offer collection is of type schema:Offer. The mappings also link the MongoDB collection fields validTo, publisher and producer to the ontology predicates bsbm:validTo, dc:publisher and bsbm:producer, respectively. The _id field found in the rr:subjectMap rr:template "http://example.com/{_id}" triple points to the primary key of the MongoDB collection.
<#OfferMapping>
    rml:logicalSource [
        rml:source "//Offer";
        nosql:store nosql:Mongodb
    ];
    rr:subjectMap [
        rr:template "http://example.com/{_id}";
        rr:class schema:Offer
    ];
    rr:predicateObjectMap [
        rr:predicate bsbm:validTo;
        rr:objectMap [rml:reference "validTo"]
    ];
    rr:predicateObjectMap [
        rr:predicate dc:publisher;
        rr:objectMap [rml:reference "publisher"]
    ];
    rr:predicateObjectMap [
        rr:predicate bsbm:producer;
        rr:objectMap [rml:reference "producer"]
    ] .
Note the presence of the triple nosql:store nosql:Mongodb. It is an addition to RML mappings, taken from the NoSQL ontology, that states which type of source is being mapped.
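To make the mapping above more concrete, the following is a minimal Python sketch of how one MongoDB document could be viewed as triples under it. It is purely conceptual: the sample document and the helper `virtual_triples` are hypothetical, and the sketch does not reflect how Squerall evaluates mappings internally.

```python
import re

# Values copied from the example mapping; everything else is illustrative.
TEMPLATE = "http://example.com/{_id}"
RDF_CLASS = "schema:Offer"
PREDICATE_TO_FIELD = {
    "bsbm:validTo": "validTo",
    "dc:publisher": "publisher",
    "bsbm:producer": "producer",
}


def virtual_triples(doc):
    """Return the triples that the example mapping implies for one document."""
    # Fill every {field} placeholder of the subject template from the document.
    iri = re.sub(r"\{(\w+)\}", lambda m: str(doc[m.group(1)]), TEMPLATE)
    subject = "<" + iri + ">"
    triples = [(subject, "rdf:type", RDF_CLASS)]
    for predicate, field in PREDICATE_TO_FIELD.items():
        if field in doc:  # fields missing from the document yield no triple
            triples.append((subject, predicate, doc[field]))
    return triples


# Hypothetical document from the Offer collection.
doc = {"_id": 42, "validTo": "2008-06-20", "publisher": 7, "producer": 3}
for triple in virtual_triples(doc):
    print(triple)
```

The first triple printed types the document as schema:Offer; the remaining ones carry the mapped fields.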
The mappings file can either be created manually or using the following graphical utility: Squerall-GUI.
In order to connect to a data source, users need to provide a set of config parameters in JSON format. These differ from one data source to another. For example, for a MongoDB collection, the config parameters could be: database host URL, database name, collection name, and replica set name.
{
    "type": "mongodb",
    "options": {
        "url": "127.0.0.1",
        "database": "bsbm",
        "collection": "offer",
        "options": "replicaSet=mongo-rs"
    },
    "source": "//Offer",
    "entity": "Offer"
}
It is necessary to link the configured source ("source": "//Offer") to the mapped source (rml:logicalSource rml:source "//Offer", see Mapping section above).
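Since a mismatch between the two source strings is easy to overlook, a small check can help. The sketch below is a hypothetical helper, not part of Squerall: it scans a mappings file with a naive regular expression (an assumption; a real Turtle parser would be more robust) and reports configured sources that no mapping refers to.

```python
import json
import re


def mapped_sources(mappings_text):
    """Collect every string used as rml:source (naive regex scan, not a Turtle parser)."""
    return set(re.findall(r'rml:source\s+"([^"]+)"', mappings_text))


def unlinked_sources(config_entries, mappings_text):
    """Return the configured sources that no mapping refers to."""
    mapped = mapped_sources(mappings_text)
    return [entry["source"] for entry in config_entries if entry["source"] not in mapped]


# Excerpt of the example mapping and the example config entry.
mappings = '''
<#OfferMapping>
    rml:logicalSource [
        rml:source "//Offer";
        nosql:store nosql:Mongodb
    ];
'''
config = json.loads('''
{
    "type": "mongodb",
    "options": {"url": "127.0.0.1", "database": "bsbm",
                "collection": "offer", "options": "replicaSet=mongo-rs"},
    "source": "//Offer",
    "entity": "Offer"
}
''')

print(unlinked_sources([config], mappings))  # [] means every source is linked
```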
The config file can either be created manually or using the following graphical utility: Squerall-GUI.
SPARQL queries are expressed using the ontology terms the data was previously mapped to. A SPARQL query should conform to the currently supported SPARQL fragment:
Query       := Prefix* SELECT Distinguish WHERE { Clauses } Modifiers?
Prefix      := PREFIX "string:" IRI
Distinguish := DISTINCT? ("*" | (Var | Aggregate)+)
Aggregate   := (AggOpe(Var) AS Var)
AggOpe      := SUM | MIN | MAX | AVG | COUNT
Clauses     := TP* Filter?
Filter      := FILTER (Var FiltOpe Literal)
             | FILTER regex(Var, "%string%")
FiltOpe     := = | != | < | <= | > | >=
TP          := Var IRI Var . | Var rdf:type IRI .
Var         := "?string"
Modifiers   := (LIMIT k)? (ORDER BY (ASC | DESC)? Var)? (GROUP BY Var+)?
The following query operations are currently not supported:
- Sub-queries.
- Object-to-object join.
- Filter between object variables (e.g., FILTER (obj1 = obj2)).
- Aggregation on the subject variable.
- Join of RDF and non-RDF data on the RDF subject position (i.e., a join of a plain value with a URI).
- When using a SPARQL query with only one star, add the object variable to SELECT.
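As an illustration of the supported fragment, here is a hypothetical query over the Offer mapping, held in a Python string together with a very rough sanity check. The prefix IRIs are assumptions, and `rough_fragment_check` is an illustrative helper that only counts keywords; it is not a SPARQL parser and not part of Squerall.

```python
import re

# Hypothetical single-star query; its object variables appear in SELECT.
QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX schema: <http://schema.org/>
PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
SELECT DISTINCT ?offer ?publisher ?producer
WHERE {
    ?offer rdf:type schema:Offer .
    ?offer dc:publisher ?publisher .
    ?offer bsbm:producer ?producer .
    FILTER (?producer > 10)
}
LIMIT 10
"""


def rough_fragment_check(query):
    """Flag two easy-to-spot violations of the fragment by counting keywords."""
    problems = []
    if len(re.findall(r"\bSELECT\b", query, flags=re.IGNORECASE)) != 1:
        problems.append("exactly one SELECT expected (sub-queries are not supported)")
    if len(re.findall(r"\bFILTER\b", query, flags=re.IGNORECASE)) > 1:
        problems.append("the grammar allows at most one FILTER")
    return problems


print(rough_fragment_check(QUERY))  # []
```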
Data from different data sources may not be readily joinable. Squerall allows users to declare transformations that are executed on the fly, at query time, on join keys. Depending on whether the transformations change from query to query or are fixed once and for all, users have the option to declare transformations at the query level or at the mapping level.
Combining FNO (the Function Ontology) with RML mappings allows users to declare that certain data values are not mapped directly to an ontology term, but are first transformed using an FNO function. For example, take the last rr:predicateObjectMap and change it as follows:
rr:predicateObjectMap [
    rr:predicate <#FunctionMap>;
    rr:objectMap [rml:reference "producer"]
];
Next:
<#FunctionMap>
    fnml:functionValue [
        rml:logicalSource "/root/data/review.parquet" ;
        rr:predicateObjectMap [
            rr:predicate fno:executes ;
            rr:objectMap [rr:constant grel:scale] ] ;
        rr:predicateObjectMap [
            rr:predicate grel:valueParam1 ;
            rr:objectMap [rr:reference "producer"]
        ] ;
        rr:predicateObjectMap [
            rr:predicate grel:valueParam2 ;
            rr:objectMap [rr:reference "123"]
        ] ;
    ] .
Users add a clause at the very end of the basic graph pattern (BGP), in this way: TRANSFORM(?leftJoinVariable.[l/r].[transformation]+). For example, ?author?book.r.scl(123) instructs Squerall to scale all join values of the left star (e.g., author.hasBook) by 123. If it were ?author?book.l.scl(123), i.e., .l instead, the transformation(s) would be applied to the ID of the right star. The list of available transformations is as follows:
- scl(int): scale the join values up or down with a certain numerical value.
- skp(val): skip a value, so no join using it is possible.
- substit(val1,val2): substitute the value val1 with val2 whenever encountered in the join values.
- replc(val1,val2): replace a substring in the join value with another substring.
- prefix(val): add a prefix to every join value.
- postfix(val): add a postfix to every join value.
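The effect of these transformations on a column of join values can be sketched in plain Python. This is an illustrative model only, not Squerall's implementation: in particular, it assumes that scl shifts each value by the given number (the description does not say whether scaling is additive or multiplicative), and `apply_chain` is a hypothetical helper for running several transformations in sequence.

```python
def scl(values, n):
    # Assumption: "scale" is modelled as an additive shift of numeric join values.
    return [v + n for v in values]


def skp(values, val):
    # Drop the value entirely, so no join using it is possible.
    return [v for v in values if v != val]


def substit(values, val1, val2):
    # Replace a whole join value when it equals val1.
    return [val2 if v == val1 else v for v in values]


def replc(values, old, new):
    # Replace a substring inside each join value.
    return [str(v).replace(old, new) for v in values]


def prefix(values, p):
    return [p + str(v) for v in values]


def postfix(values, p):
    return [str(v) + p for v in values]


def apply_chain(values, steps):
    """Apply (function, arguments) pairs in order, as in a chained transformation."""
    for func, args in steps:
        values = func(values, *args)
    return values


keys = [1, 2, 3]
print(apply_chain(keys, [(skp, (2,)), (scl, (123,)), (prefix, ("id-",))]))
# ['id-124', 'id-126']
```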
At the mapping level, the <#FunctionMap> declaration shown earlier instructs Squerall to scale all values of the attribute producer by 123. The same transformations as in the list above are available there; use the transformation name with the 'grel' namespace, e.g., grel:prefix.