The STIRData harmonisation toolset consists of two data transformation tools, LinkedPipes ETL and D2RML. The Czech business registry data harmonisation pipeline and the validation service are implemented using LinkedPipes ETL, while the harmonisation of the other business registries is handled by D2RML. OpenLink Virtuoso Open-Source is used and recommended as the triplestore for storing the resulting data; however, other triplestores can be used as well, provided that the data harmonisation processes are adjusted accordingly. To deploy the data harmonisation workflow, the data harmonisation tools need to be deployed and populated with the prepared data harmonisation pipelines and mappings.
In this section, we describe how to deploy the data harmonisation tools using Docker.
The Docker-based deployment of LinkedPipes ETL requires Docker Compose. It can then be deployed from the main branch like this:
curl https://raw.githubusercontent.com/linkedpipes/etl/main/docker-compose.yml | docker-compose -f - up
When deployed, LP-ETL runs on http://localhost:8080 and is ready to import pipelines.
For custom deployments, see the full deployment documentation. Once deployed, see the user documentation and tutorials. The documentation of the individual LP-ETL components is also available directly from the component's configuration dialog.
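As a minimal sketch of a more permanent setup, the Compose file can be pinned locally and run detached. The LP_ETL_DOMAIN variable shown here for setting the public URL of the instance is an assumption based on the LP-ETL deployment documentation and should be verified there; etl.example.org is a placeholder domain:
# download the Compose file once so the deployment is reproducible
curl -o lp-etl.docker-compose.yml https://raw.githubusercontent.com/linkedpipes/etl/main/docker-compose.yml
# run detached; LP_ETL_DOMAIN (assumed variable name) sets the public URL of the instance
LP_ETL_DOMAIN=https://etl.example.org docker-compose -f lp-etl.docker-compose.yml up -d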
We created a CLI for executing D2RML transformations and dockerized the whole environment so that users can easily execute D2RML transformations without having to set up complex environments.
You can find the Docker image for D2RML here: d2rml-cli
To run it, use the docker run command as follows:
docker run -e "args=d2rml:test.ttl max_file_size:1000 output:test_out=axc" -v ${PWD}/DATA_DIRECTORY/:/data/ stirdata/d2rml-cli:latest
The args parameter contains all the arguments that we want to pass to the CLI tool. The following parameters are supported:
- d2rml: the D2RML document to be executed (absolute path or URL)
- param: arguments providing parameter values in case the D2RML document is parametric
- output: arguments specifying where the generated triples will be saved (the folder must exist)
- max_file_size: the maximum number of triples in each generated output file; the default is 10000
- temp_folder: a temporary folder where downloaded files will be extracted if needed; it must exist
Each parameter is given by providing its name and its value, as in the commands above.
Also, the user needs to add a volume to make the folder containing the input and output files accessible inside the Docker container. This is done using Docker's -v flag, giving the full path of the folder and mapping it to the /data/ folder of the container:
-v ${PWD}/DATA_DIRECTORY/:/data/
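For instance, a minimal local run could look like the following sketch; it assumes that a D2RML document test.ttl has been placed in DATA_DIRECTORY and that the relative output and temp folder names are resolved against the mounted /data/ folder, which is why they are created beforehand:
# create the output and temp folders required by the output: and temp_folder: arguments
mkdir -p DATA_DIRECTORY/out DATA_DIRECTORY/tmp
docker run -e "args=d2rml:test.ttl output:out=result temp_folder:tmp max_file_size:50000" -v ${PWD}/DATA_DIRECTORY/:/data/ stirdata/d2rml-cli:latest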
Examples of STIRData executions:
Execution of Greek business registry 'Business registry data mapping' (non parametric mapping):
docker run -e "args=d2rml:https://stirdata-semantic.ails.ece.ntua.gr/api/content/el/mapping/bdd4413c-45ec-47b8-b25a-56880a0a0b6e output:br=greece max_file_size:100000 temp_folder:tmp" -v ${PWD}/playground:/data/ stirdata/d2rml-cli:latest
Execution of Cypriot business registry 'GLEIF alignment mapping' (parametric mapping):
docker run -e "args=d2rml:https://stirdata-semantic.ails.ece.ntua.gr/api/content/cy/mapping/616e882e-b1d5-4171-8713-457a6d659828 param:ID_PATTERN=[0-9]+ param:RAC_CODE=RA000181 param:ORGANIZATION_PREFIX=http://ee.data.stirdata.eu/resource/organization/ output:br=cyprus_gleif temp_folder:tmp max_file_size:100000" -v ${PWD}/playground:/data/ stirdata/d2rml-cli:latest
Both examples use the folders "br" and "tmp" as the output and temp folders, respectively. These folders must exist under the "playground" folder, which is mounted inside the Docker container executing D2RML. Inside the output folder, the names of all generated .trig files (more than one may be produced, depending on max_file_size) will begin with the prefix "greece" in the first example and with the prefix "cyprus_gleif" in the second example.
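The generated .trig files can then be loaded into the recommended OpenLink Virtuoso Open-Source triplestore, for example with Virtuoso's bulk loader via the isql client. The following is a minimal sketch, assuming a Virtuoso instance listening on port 1111 whose DirsAllowed setting includes the output folder; the path and the target graph IRI are placeholders:
# register the generated files for bulk loading (path and graph IRI are placeholders)
isql 1111 dba "$DBA_PASSWORD" exec="ld_dir('/path/to/playground/br', 'greece*.trig', 'http://example.org/graph/el');"
# load the registered files and persist the result
isql 1111 dba "$DBA_PASSWORD" exec="rdf_loader_run(); checkpoint;"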
In this section, we describe how to deploy the data harmonisation workflows in the deployed tools.
The harmonisation of the Czech business registry dataset is documented and the pipelines published in a separate repository. The individual pipelines can be directly imported using their raw GitHub URLs and LP-ETL's import pipeline from URL functionality:
- Source data to Czech ontology
- Czech ontology to STIRData specification
- SKOSification of CZ-NACE codes
- Mapping of Czech companies to NACE codes
- Load data to Virtuoso - this needs to be adjusted to point to the target triplestore instance.
The pipelines can be configured to run periodically using, e.g., cron, curl and the LinkedPipes ETL API.
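As an illustration, a crontab entry along the following lines could trigger a pipeline once a week via curl. It assumes an LP-ETL executions endpoint that accepts the pipeline IRI as a query parameter, and the pipeline IRI is a placeholder to be copied from your own instance; consult the LinkedPipes ETL API documentation for the exact call supported by your version:
# run the harmonisation pipeline every Sunday at 03:00 (pipeline IRI is a placeholder)
0 3 * * 0 curl -X POST "http://localhost:8080/resources/executions?pipeline=http://localhost:8080/resources/pipelines/<PIPELINE_ID>"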
The harmonisation of the Greek business registry dataset is done by the following D2RML mappings:
The harmonisation of the Belgian business registry dataset is done by the following D2RML mappings:
- Business registry data mapping (credentials for accessing https://kbopub.economie.fgov.be/kbo-open-data are supplied via the KBO_USERNAME and KBO_PASSWORD parameters)
- Agencies data
- Dataset metadata
- GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]{4}\.[0-9]{3}\.[0-9]{3}, RAC_CODE = RA000025, ORGANIZATION_PREFIX = http://be.data.stirdata.eu/resource/organization/); an example invocation is sketched below.
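As an illustration, the Belgian GLEIF alignment mapping can be executed with the D2RML CLI in the same way as the Cypriot example above. The mapping URL placeholder must be replaced by the URL of the published Belgian mapping, and the output prefix belgium_gleif is only an illustrative choice:
docker run -e "args=d2rml:<BELGIAN_GLEIF_MAPPING_URL> param:ID_PATTERN=[0-9]{4}\.[0-9]{3}\.[0-9]{3} param:RAC_CODE=RA000025 param:ORGANIZATION_PREFIX=http://be.data.stirdata.eu/resource/organization/ output:br=belgium_gleif temp_folder:tmp max_file_size:100000" -v ${PWD}/playground:/data/ stirdata/d2rml-cli:latest
For the Belgian business registry data mapping itself, the KBO_USERNAME and KBO_PASSWORD credentials mentioned above would be passed analogously via param: arguments.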
The harmonisation of the Cypriot business registry dataset is done by the following D2RML mappings:
- Business registry data mapping
- Agencies data
- Dataset metadata
- GLEIF alignment mapping (parameter values: ID_PATTERN = [OBCNP][0-9]+, RAC_CODE = RA000161, ORGANIZATION_PREFIX = http://cy.data.stirdata.eu/resource/organization/).
The harmonisation of the Estonian business registry dataset is done by the following D2RML mappings:
- Business registry data mapping
- Agencies data
- Dataset metadata
- GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]+, RAC_CODE = RA000181, ORGANIZATION_PREFIX = http://ee.data.stirdata.eu/resource/organization/).
The harmonisation of the Finnish business registry dataset is done by the following D2RML mappings:
- Business registry data mapping
- Agencies data
- Dataset metadata
- GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]{7}\-[0-9], RAC_CODE = RA000188, ORGANIZATION_PREFIX = http://fi.data.stirdata.eu/resource/organization/).
The harmonisation of the French business registry dataset is done by the following D2RML mappings:
- Business registry data mapping (Legal entities)
- Business registry data mapping (Establishments)
- Agencies data
- Dataset metadata
- GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]{9}, RAC_CODE = RA000189, ORGANIZATION_PREFIX = http://fr.data.stirdata.eu/resource/organization/).
The harmonisation of the Latvian business registry dataset is done by the following D2RML mappings:
- Business registry data mapping
- Agencies data
- Dataset metadata
- GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]{11}, RAC_CODE = RA000423, ORGANIZATION_PREFIX = http://lv.data.stirdata.eu/resource/organization/).
The harmonisation of the Moldovan business registry dataset is done by the following D2RML mappings:
- Business registry data mapping
- Agencies data
- Dataset metadata
- GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]+, RAC_CODE = RA000451, ORGANIZATION_PREFIX = http://md.data.stirdata.eu/resource/organization/).
The harmonisation of the Dutch business registry dataset is done by the following D2RML mappings:
The harmonisation of the Norwegian business registry dataset is done by the following D2RML mappings:
- Business registry data mapping (Main units)
- Business registry data mapping (Subunits)
- Agencies data
- Dataset metadata
- GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]{9}, RAC_CODE = RA000472, ORGANIZATION_PREFIX = http://no.data.stirdata.eu/resource/organization/).
The harmonisation of the Romanian business registry dataset is done by the following D2RML mappings:
- Business registry data mapping
- Agencies data
- Dataset metadata
- GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]{8}, RAC_CODE = RA000497, ORGANIZATION_PREFIX = http://ro.data.stirdata.eu/resource/organization/).
The harmonisation of the United Kingdom business registry dataset is done by the following D2RML mappings:
- Business registry data mapping
- Agencies data
- Dataset metadata
- GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]{8}, RAC_CODE = RA000585, ORGANIZATION_PREFIX = http://uk.data.stirdata.eu/resource/organization/).
A validation service profiling the published datasets with regard to the STIRData specification is deployed using LinkedPipes ETL and periodically produces a validation report in HTML, RDF Turtle and CSV.
The validation service pipeline is also directly deployable in a LinkedPipes ETL instance.
Again, the validation pipeline can be configured to run periodically using, e.g., cron, curl and the LinkedPipes ETL API.