This repository contains documentation of the data harmonisation transformations developed in STIRData

STIRData/data-harmonisation

STIRData data harmonisation toolset

The STIRData harmonisation toolset consists of two data transformation tools, LinkedPipes ETL and D2RML. The Czech business registry data harmonisation pipeline and the validation service are implemented using LinkedPipes ETL, while the harmonisation of the other business registries is handled by D2RML. OpenLink Virtuoso Open-Source is used and recommended as the triplestore for the resulting data; other triplestores can be used as well, provided that the data harmonisation processes are adjusted accordingly. To deploy the data harmonisation workflow, the tools need to be deployed and then populated with the prepared data harmonisation pipelines and mappings.

Deployment of the tools

In this section, we describe how to deploy the data harmonisation tools using Docker.

LinkedPipes ETL

The Docker-based deployment of LinkedPipes ETL requires Docker Compose. With Docker Compose installed, LP-ETL can be deployed from the main branch like this:

curl https://raw.githubusercontent.com/linkedpipes/etl/main/docker-compose.yml | docker-compose -f - up

When deployed, LP-ETL runs on http://localhost:8080 and is ready to import pipelines.

For custom deployments, see the full deployment documentation. Once deployed, see the user documentation and tutorials. The documentation of the individual LP-ETL components is also available directly from the component's configuration dialog.

D2RML

We created a command-line interface (CLI) for executing D2RML transformations and dockerized the whole environment, so that a user can execute D2RML transformations without having to set up a complex environment.

You can find the Docker image for D2RML here: d2rml-cli

To run it, use the docker run command as follows:

docker run -e "args=d2rml:test.ttl max_file_size:1000 output:test_out=axc" -v ${PWD}/DATA_DIRECTORY/:/data/ stirdata/d2rml-cli:latest

The args parameter contains all the arguments passed to the CLI tool. The following parameters are available:

d2rml: the D2RML document to be executed (absolute path or URL)
param: a parameter value, given as NAME=value, for a parametric D2RML document; repeat it once per parameter
output: where the generated triples will be saved, given as folder=prefix (the folder must exist)
max_file_size: the maximum number of triples in each generated output file (default: 10000)
temp_folder: a temporary folder where downloaded files are extracted if needed (the folder must exist)

Each parameter is given as its name followed by a colon and its value, e.g. max_file_size:1000.

The folder containing the input and output files must also be made accessible inside the Docker container. To do this, use the -v flag of docker run to map the full path of the folder to the /data/ folder of the container:

-v ${PWD}/DATA_DIRECTORY/:/data/
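The way the name:value pairs and the volume mapping fit together can be sketched as a small shell wrapper. This is only a minimal sketch and not part of d2rml-cli: the function name run_d2rml and its argument order are hypothetical, and it assumes that no individual value contains a space.

```shell
# Hypothetical wrapper around the docker run command shown above.
# First argument: host folder to mount as /data/ (full path).
# Remaining arguments: name:value pairs for the CLI, e.g. d2rml:test.ttl.
run_d2rml() {
    data_dir=$1
    shift
    # "$*" joins the remaining arguments with single spaces, producing the
    # single string that the container reads from the "args" variable.
    docker run -e "args=$*" -v "${data_dir}:/data/" stirdata/d2rml-cli:latest
}

# Usage (equivalent to the command above):
# run_d2rml "${PWD}/DATA_DIRECTORY/" d2rml:test.ttl max_file_size:1000 output:test_out=axc
```

Keeping the mounted folder as a separate first argument makes it harder to forget the volume, without which the container cannot see the input and output files.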

Examples of STIRData executions:

Execution of Greek business registry 'Business registry data mapping' (non-parametric mapping):

docker run -e "args=d2rml:https://stirdata-semantic.ails.ece.ntua.gr/api/content/el/mapping/bdd4413c-45ec-47b8-b25a-56880a0a0b6e output:br=greece max_file_size:100000 temp_folder:tmp" -v ${PWD}/playground:/data/ stirdata/d2rml-cli:latest

Execution of Cypriot business registry 'GLEIF alignment mapping' (parametric mapping):

docker run -e "args=d2rml:https://stirdata-semantic.ails.ece.ntua.gr/api/content/cy/mapping/616e882e-b1d5-4171-8713-457a6d659828 param:ID_PATTERN=[OBCNP][0-9]+ param:RAC_CODE=RA000161 param:ORGANIZATION_PREFIX=http://cy.data.stirdata.eu/resource/organization/ output:br=cyprus_gleif temp_folder:tmp max_file_size:100000" -v ${PWD}/playground:/data/ stirdata/d2rml-cli:latest

Both examples use the folders "br" and "tmp" as the output and temp folders, respectively. Both must exist under the "playground" folder, which is mounted inside the Docker container executing D2RML. Inside the output folder, the names of all generated .trig files (there may be more than one, depending on max_file_size) begin with the prefix "greece" in the first example and with the prefix "cyprus_gleif" in the second.
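Because the output and temp folders must exist before the container starts, it can help to create them up front. The following is a minimal sketch; the function name prepare_d2rml_dirs is hypothetical, and the folder names "br" and "tmp" are simply the ones used in the two examples.

```shell
# Hypothetical helper: create the output ("br") and temp ("tmp") folders
# under the folder that will be mounted as /data/.
prepare_d2rml_dirs() {
    base=$1
    # -p creates missing parent folders and does not fail if they exist
    mkdir -p "${base}/br" "${base}/tmp"
}

# Usage, run from the folder that contains "playground":
# prepare_d2rml_dirs "${PWD}/playground"
```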

Deployment of data harmonisation pipelines and mappings

In this section, we describe how to deploy the data harmonisation workflows in the deployed tools.

Czech business registry

The harmonisation of the Czech business registry dataset is documented, and its pipelines are published, in a separate repository. The individual pipelines can be imported directly using their raw GitHub URLs and LP-ETL's import pipeline from URL functionality:

  1. Source data to Czech ontology
  2. Czech ontology to STIRData specification
  3. SKOSification of CZ-NACE codes
  4. Mapping of Czech companies to NACE codes
  5. Load data to Virtuoso - this needs to be adjusted to point to the target triplestore instance.

The pipelines can be configured to run periodically using, e.g., cron, curl and the LinkedPipes ETL API.
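A periodic run could be set up roughly as follows. This is a sketch under stated assumptions rather than a verified recipe: it assumes that the LP-ETL instance starts an execution when it receives a POST request to /resources/executions with the pipeline IRI in the pipeline query parameter, and the function name, script path and schedule are hypothetical. Check the LinkedPipes ETL API documentation for your version before relying on it.

```shell
# Hypothetical helper that asks an LP-ETL instance to execute one pipeline.
# Assumption: POST /resources/executions?pipeline=<pipeline IRI> starts a run.
run_lp_etl_pipeline() {
    base_url=$1      # e.g. http://localhost:8080
    pipeline_iri=$2  # IRI of an imported pipeline, as shown in the LP-ETL UI
    curl -sS -X POST "${base_url}/resources/executions?pipeline=${pipeline_iri}"
}

# Example crontab entry (daily at 02:00), assuming a script at the
# hypothetical path /opt/stirdata/run-pipelines.sh calls the function above:
# 0 2 * * * /opt/stirdata/run-pipelines.sh
```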

Greek (Athens area) business registry

The harmonization of the Greek business registry dataset is done by the following D2RML mappings:

  1. Business registry data mapping
  2. Agencies data
  3. Dataset metadata

Belgian business registry

The harmonization of the Belgian business registry dataset is done by the following D2RML mappings:

  1. Business registry data mapping (credentials for accessing https://kbopub.economie.fgov.be/kbo-open-data are supplied by the parameters KBO_USERNAME, KBO_PASSWORD)
  2. Agencies data
  3. Dataset metadata
  4. GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]{4}\.[0-9]{3}\.[0-9]{3}, RAC_CODE = RA000025, ORGANIZATION_PREFIX = http://be.data.stirdata.eu/resource/organization/)
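Before running a GLEIF alignment mapping, an ID_PATTERN value can be sanity-checked against a sample registration number. The sketch below assumes that ID_PATTERN is an extended regular expression that a complete identifier must match; how the mapping itself applies the pattern is not documented here, so this is an assumption. The function name and the sample identifier are hypothetical.

```shell
# Hypothetical check: succeeds if the whole value matches the given pattern.
matches_id_pattern() {
    pattern=$1
    value=$2
    # Anchor the pattern so that partial matches are rejected
    printf '%s\n' "$value" | grep -Eq "^(${pattern})\$"
}

# Usage with the Belgian pattern and a made-up identifier:
# matches_id_pattern '[0-9]{4}\.[0-9]{3}\.[0-9]{3}' '0123.456.789' && echo "pattern matches"
```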

Cypriot business registry

The harmonization of the Cypriot business registry dataset is done by the following D2RML mappings:

  1. Business registry data mapping
  2. Agencies data
  3. Dataset metadata
  4. GLEIF alignment mapping (parameter values: ID_PATTERN = [OBCNP][0-9]+, RAC_CODE = RA000161, ORGANIZATION_PREFIX = http://cy.data.stirdata.eu/resource/organization/)

Estonian business registry

The harmonization of the Estonian business registry dataset is done by the following D2RML mappings:

  1. Business registry data mapping
  2. Agencies data
  3. Dataset metadata
  4. GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]+, RAC_CODE = RA000181, ORGANIZATION_PREFIX = http://ee.data.stirdata.eu/resource/organization/)

Finnish business registry

The harmonization of the Finnish business registry dataset is done by the following D2RML mappings:

  1. Business registry data mapping
  2. Agencies data
  3. Dataset metadata
  4. GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]{7}\-[0-9], RAC_CODE = RA000188, ORGANIZATION_PREFIX = http://fi.data.stirdata.eu/resource/organization/)

French business registry

The harmonization of the French business registry dataset is done by the following D2RML mappings:

  1. Business registry data mapping (Legal entities)
  2. Business registry data mapping (Establishments)
  3. Agencies data
  4. Dataset metadata
  5. GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]{9}, RAC_CODE = RA000189, ORGANIZATION_PREFIX = http://fr.data.stirdata.eu/resource/organization/)

Latvian business registry

The harmonization of the Latvian business registry dataset is done by the following D2RML mappings:

  1. Business registry data mapping
  2. Agencies data
  3. Dataset metadata
  4. GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]{11}, RAC_CODE = RA000423, ORGANIZATION_PREFIX = http://lv.data.stirdata.eu/resource/organization/)

Moldovan business registry

The harmonization of the Moldovan business registry dataset is done by the following D2RML mappings:

  1. Business registry data mapping
  2. Agencies data
  3. Dataset metadata
  4. GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]+, RAC_CODE = RA000451, ORGANIZATION_PREFIX = http://md.data.stirdata.eu/resource/organization/)

Dutch business registry

The harmonization of the Dutch business registry dataset is done by the following D2RML mappings:

  1. Business registry data mapping
  2. Dataset metadata

Norwegian business registry

The harmonization of the Norwegian business registry dataset is done by the following D2RML mappings:

  1. Business registry data mapping (Main units)
  2. Business registry data mapping (Subunits)
  3. Agencies data
  4. Dataset metadata
  5. GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]{9}, RAC_CODE = RA000472, ORGANIZATION_PREFIX = http://no.data.stirdata.eu/resource/organization/)

Romanian business registry

The harmonization of the Romanian business registry dataset is done by the following D2RML mappings:

  1. Business registry data mapping
  2. Agencies data
  3. Dataset metadata
  4. GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]{8}, RAC_CODE = RA000497, ORGANIZATION_PREFIX = http://ro.data.stirdata.eu/resource/organization/)

United Kingdom business registry

The harmonization of the United Kingdom business registry dataset is done by the following D2RML mappings:

  1. Business registry data mapping
  2. Agencies data
  3. Dataset metadata
  4. GLEIF alignment mapping (parameter values: ID_PATTERN = [0-9]{8}, RAC_CODE = RA000585, ORGANIZATION_PREFIX = http://uk.data.stirdata.eu/resource/organization/)

Validation service

A validation service that profiles the published datasets with regard to the STIRData specification is deployed using LinkedPipes ETL and periodically produces a validation report in HTML, RDF Turtle and CSV.

The validation service pipeline is also directly deployable in a LinkedPipes ETL instance. Again, the validation pipeline can be configured to run periodically using, e.g., cron, curl and the LinkedPipes ETL API.
