The STIRData harmonisation toolset consists of two data transformation tools, LinkedPipes ETL and D2RML. The Czech business registry data harmonisation pipeline and the validation service are implemented using LinkedPipes ETL, while the harmonisation of the other business registries is handled by D2RML. OpenLink Virtuoso Open-Source is used and recommended as the triplestore for storing the resulting data; however, other triplestores can be used as well, provided that the data harmonisation processes are adjusted accordingly. To deploy the data harmonisation workflow, the data harmonisation tools need to be deployed and populated with the prepared data harmonisation pipelines and mappings.
In this section, we describe how to deploy the data harmonisation tools using Docker.
The Docker-based deployment of LinkedPipes ETL requires Docker Compose. It can then be deployed from the main branch like this:
curl https://raw.githubusercontent.com/linkedpipes/etl/main/docker-compose.yml | docker-compose -f - up
When deployed, LP-ETL runs on http://localhost:8080 and is ready to import pipelines.
For custom deployments, see the full deployment documentation. Once deployed, see the user documentation and tutorials. The documentation of the individual LP-ETL components is also available directly from the component's configuration dialog.
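As a minimal sketch of a more permanent setup, the Compose file can be pinned locally and run detached. The LP_ETL_DOMAIN variable shown here for setting the public URL of the instance is an assumption based on the LP-ETL deployment documentation and should be verified there; etl.example.org is a placeholder domain:
# download the Compose file once so the deployment is reproducible
curl -o lp-etl.docker-compose.yml https://raw.githubusercontent.com/linkedpipes/etl/main/docker-compose.yml
# run detached; LP_ETL_DOMAIN (assumed variable name) sets the public URL of the instance
LP_ETL_DOMAIN=https://etl.example.org docker-compose -f lp-etl.docker-compose.yml up -d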
We created a CLI for executing D2RML transformations and dockerized the whole environment so that users can easily execute D2RML transformations without having to set up complex environments.
You can find the Docker image for D2RML here: d2rml-cli
To run it, use the docker run command as follows:
docker run -e "args=d2rml:test.ttl max_file_size:1000 output:test_out=axc" -v ${PWD}/DATA_DIRECTORY/:/data/ stirdata/d2rml-cli:latest
The args parameter contains all the arguments that we want to pass to the CLI tool. The following parameters are supported:
- d2rml: the D2RML document to be executed (absolute path or URL)
- param: arguments providing parameter values in case the D2RML document is parametric
- output: arguments specifying where the generated triples will be saved (the folder must exist)
- max_file_size: the maximum number of triples in each generated output file; the default is 10000
- temp_folder: a temporary folder where downloaded files will be extracted if needed; it must exist
Each parameter is given by providing its name and its value, as in the commands above.
Also, the user needs to add a volume to make the folder containing the input and output files accessible inside the Docker container. This is done using Docker's -v flag, giving the full path of the folder and mapping it to the /data/ folder of the container:
-v ${PWD}/DATA_DIRECTORY/:/data/
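For instance, a minimal local run could look like the following sketch; it assumes that a D2RML document test.ttl has been placed in DATA_DIRECTORY and that the relative output and temp folder names are resolved against the mounted /data/ folder, which is why they are created beforehand:
# create the output and temp folders required by the output: and temp_folder: arguments
mkdir -p DATA_DIRECTORY/out DATA_DIRECTORY/tmp
docker run -e "args=d2rml:test.ttl output:out=result temp_folder:tmp max_file_size:50000" -v ${PWD}/DATA_DIRECTORY/:/data/ stirdata/d2rml-cli:latest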
Examples of STIRData executions:
Execution of Greek business registry 'Business registry data mapping' (non parametric mapping):
docker run -e "args=d2rml:https://stirdata-semantic.ails.ece.ntua.gr/api/content/el/mapping/bdd4413c-45ec-47b8-b25a-56880a0a0b6e output:br=greece max_file_size:100000 temp_folder:tmp" -v ${PWD}/playground:/data/ stirdata/d2rml-cli:latest
Execution of Cypriot business registry 'GLEIF alignment mapping' (parametric mapping):
docker run -e "args=d2rml:https://stirdata-semantic.ails.ece.ntua.gr/api/content/cy/mapping/616e882e-b1d5-4171-8713-457a6d659828 param:ID_PATTERN=[0-9]+ param:RAC_CODE=RA000181 param:ORGANIZATION_PREFIX=http://ee.data.stirdata.eu/resource/organization/ output:br=cyprus_gleif temp_folder:tmp max_file_size:100000" -v ${PWD}/playground:/data/ stirdata/d2rml-cli:latest
Both examples use the folders "br" and "tmp" as the output and temp folders, respectively. These folders must exist under the "playground" folder, which is mounted inside the Docker container executing D2RML. Inside the output folder, the names of all generated .trig files (more than one may be produced, depending on max_file_size) will begin with the prefix "greece" in the first example and with the prefix "cyprus_gleif" in the second example.
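The generated .trig files can then be loaded into the recommended OpenLink Virtuoso Open-Source triplestore, for example with Virtuoso's bulk loader via the isql client. The following is a minimal sketch, assuming a Virtuoso instance listening on port 1111 whose DirsAllowed setting includes the output folder; the path and the target graph IRI are placeholders:
# register the generated files for bulk loading (path and graph IRI are placeholders)
isql 1111 dba "$DBA_PASSWORD" exec="ld_dir('/path/to/playground/br', 'greece*.trig', 'http://example.org/graph/el');"
# load the registered files and persist the result
isql 1111 dba "$DBA_PASSWORD" exec="rdf_loader_run(); checkpoint;"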
In this section, we describe how to deploy the data harmonisation workflows in the deployed tools.
The harmonisation of the Czech business registry dataset is documented and the pipelines published in a separate repository. The individual pipelines can be directly imported using their raw GitHub URLs and LP-ETL's import pipeline from URL functionality:
- Source data to Czech ontology
- Czech ontology to STIRData specification
- SKOSification of CZ-NACE codes
- Mapping of Czech companies to NACE codes
- Load data to Virtuoso - this needs to be adjusted to point to the target triplestore instance.
The pipelines can be configured to run periodically using, e.g., cron, curl and the LinkedPipes ETL API.
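As an illustration, a crontab entry along the following lines could trigger a pipeline once a week via curl. It assumes an LP-ETL executions endpoint that accepts the pipeline IRI as a query parameter, and the pipeline IRI is a placeholder to be copied from your own instance; consult the LinkedPipes ETL API documentation for the exact call supported by your version:
# run the harmonisation pipeline every Sunday at 03:00 (pipeline IRI is a placeholder)
0 3 * * 0 curl -X POST "http://localhost:8080/resources/executions?pipeline=http://localhost:8080/resources/pipelines/<PIPELINE_ID>"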
The harmonisation of the Greek business registry dataset is done by the following D2RML mappings:
The harmonisation of the Belgian business registry dataset is done by the following D2RML mappings:
- Business registry data mapping (credentials for accessing https://kbopub.economie.fgov.be/kbo-open-data are supplied via the KBO_USERNAME and KBO_PASSWORD parameters)
- Agencies data
- Dataset metadata
- GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]{4}\.[0-9]{3}\.[0-9]{3}, RAC_CODE = RA000025, ORGANIZATION_PREFIX = http://be.data.stirdata.eu/resource/organization/); an example invocation is sketched below.
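As an illustration, the Belgian GLEIF alignment mapping can be executed with the D2RML CLI in the same way as the Cypriot example above. The mapping URL placeholder must be replaced by the URL of the published Belgian mapping, and the output prefix belgium_gleif is only an illustrative choice:
docker run -e "args=d2rml:<BELGIAN_GLEIF_MAPPING_URL> param:ID_PATTERN=[0-9]{4}\.[0-9]{3}\.[0-9]{3} param:RAC_CODE=RA000025 param:ORGANIZATION_PREFIX=http://be.data.stirdata.eu/resource/organization/ output:br=belgium_gleif temp_folder:tmp max_file_size:100000" -v ${PWD}/playground:/data/ stirdata/d2rml-cli:latest
For the Belgian business registry data mapping itself, the KBO_USERNAME and KBO_PASSWORD credentials mentioned above would be passed analogously via param: arguments.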
The harmonisation of the Cypriot business registry dataset is done by the following D2RML mappings:
- Business registry data mapping
- Agencies data
- Dataset metadata
- GLEIF alignment mapping (parameter values: ID_PATTERN = [OBCNP][0-9]+, RAC_CODE = RA000161, ORGANIZATION_PREFIX = http://cy.data.stirdata.eu/resource/organization/).
The harmonisation of the Estonian business registry dataset is done by the following D2RML mappings:
- Business registry data mapping
- Agencies data
- Dataset metadata
- GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]+, RAC_CODE = RA000181, ORGANIZATION_PREFIX = http://ee.data.stirdata.eu/resource/organization/).
The harmonisation of the Finnish business registry dataset is done by the following D2RML mappings:
- Business registry data mapping
- Agencies data
- Dataset metadata
- GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]{7}\-[0-9], RAC_CODE = RA000188, ORGANIZATION_PREFIX = http://fi.data.stirdata.eu/resource/organization/).
The harmonisation of the French business registry dataset is done by the following D2RML mappings:
- Business registry data mapping (Legal entities)
- Business registry data mapping (Establishments)
- Agencies data
- Dataset metadata
- GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]{9}, RAC_CODE = RA000189, ORGANIZATION_PREFIX = http://fr.data.stirdata.eu/resource/organization/).
The harmonisation of the Latvian business registry dataset is done by the following D2RML mappings:
- Business registry data mapping
- Agencies data
- Dataset metadata
- GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]{11}, RAC_CODE = RA000423, ORGANIZATION_PREFIX = http://lv.data.stirdata.eu/resource/organization/).
The harmonisation of the Moldovan business registry dataset is done by the following D2RML mappings:
- Business registry data mapping
- Agencies data
- Dataset metadata
- GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]+, RAC_CODE = RA000451, ORGANIZATION_PREFIX = http://md.data.stirdata.eu/resource/organization/).
The harmonisation of the Dutch business registry dataset is done by the following D2RML mappings:
The harmonisation of the Norwegian business registry dataset is done by the following D2RML mappings:
- Business registry data mapping (Main units)
- Business registry data mapping (Subunits)
- Agencies data
- Dataset metadata
- GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]{9}, RAC_CODE = RA000472, ORGANIZATION_PREFIX = http://no.data.stirdata.eu/resource/organization/).
The harmonisation of the Romanian business registry dataset is done by the following D2RML mappings:
- Business registry data mapping
- Agencies data
- Dataset metadata
- GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]{8}, RAC_CODE = RA000497, ORGANIZATION_PREFIX = http://ro.data.stirdata.eu/resource/organization/).
The harmonisation of the United Kingdom business registry dataset is done by the following D2RML mappings:
- Business registry data mapping
- Agencies data
- Dataset metadata
- GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]{8}, RAC_CODE = RA000585, ORGANIZATION_PREFIX = http://uk.data.stirdata.eu/resource/organization/).
A validation service profiling the published datasets with regard to the STIRData specification is deployed using LinkedPipes ETL and periodically produces a validation report in HTML, RDF Turtle and CSV.
The validation service pipeline is also directly deployable in a LinkedPipes ETL instance.
Again, the validation pipeline can be configured to run periodically using, e.g., cron, curl and the LinkedPipes ETL API.