Develop and deploy an ETL (Extract, Transform, Load) Python script that extracts datafiles from Zenodo and stores the data in a Neo4j database
Developmental process/requirements:
Data Extraction:
Implement incremental extraction to reduce extraction time and resource utilization: use the ‘modified’ data field and run a daily batch process that extracts datafiles modified on the previous day.
Select data related to “soil”; use “soil” as the keyword/parameter.
Ensure extraction of all required data fields for each data file ("created", "modified", "doi", "doi_url", "title", "publication_date", "description", "access_right", "creators", "keywords", "language", "license").
Extract data in Dublin Core format. The process should probably run in 3 steps: (1) extract all datafiles about soil that were modified yesterday; (2) get the list of ids; (3) send each id to the API to retrieve the record in Dublin Core format: https://zenodo.org/records/id/export/dublincore
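A minimal sketch of steps 2 and 3, assuming step 1 has already returned the “soil” records as a list of dicts with an `id` and an ISO 8601 `modified` timestamp (with a numeric offset such as `+00:00`). The function names are hypothetical, and the HTTP calls to Zenodo are deliberately left out; only the URL pattern comes from the text above.

```python
from datetime import date, datetime, timedelta

# URL pattern taken from the issue text; {id} is the Zenodo record id.
DUBLINCORE_URL = "https://zenodo.org/records/{id}/export/dublincore"


def previous_day(today: date) -> date:
    """Return the day the nightly batch should cover."""
    return today - timedelta(days=1)


def ids_modified_on(records: list, day: date) -> list:
    """Step 2: collect ids of records whose 'modified' timestamp falls on `day`.

    Assumption: each record is a dict with 'id' and an ISO 8601 'modified' value.
    """
    ids = []
    for rec in records:
        modified = datetime.fromisoformat(rec["modified"])
        if modified.date() == day:
            ids.append(rec["id"])
    return ids


def dublincore_urls(ids: list) -> list:
    """Step 3: build one Dublin Core export URL per record id."""
    return [DUBLINCORE_URL.format(id=record_id) for record_id in ids]
```

Filtering on `modified` after the search keeps the sketch independent of any particular search query syntax; pushing the date filter into the search request itself would reduce traffic if the API supports it.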
Data Transformation:
none at the moment
Data Loading:
load data into Neo4j graph database:
for the time being, use the following nodes and relationships (later on, use the selected soil ontology/data standard):
SoilwiseDataSource - PUBLISHED_ON - PublicationDate,
SoilwiseDataSource - TYPE_OF_DATASOURCE - DataSourceType,
SoilwiseDataSource - HAS_REPO_SOURCE - Source,
SoilwiseDataSource - HAS_TITLE - Title,
SoilwiseDataSource - IS_CREATED_BY - Creator,
SoilwiseDataSource - HAS_SUBJECT - Subject,
SoilwiseDataSource - HAS_TITLE_KEYWORDS - TitleKeyword,
SoilwiseDataSource - IS_LOCATED_AT - SpatialEntity,
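The node/relationship list above could be turned into Cypher `MERGE` statements roughly as follows. This is a sketch under several assumptions: the mapping from record fields to relationships (for example `keywords` to `HAS_SUBJECT`), the property names `identifier` and `value`, the use of the DOI as identifier, and creators being plain strings are all illustrative choices, not part of the requirements. Only four of the eight relationships are covered.

```python
# (relationship type, target label) pairs from the list above, keyed by the
# record field that is ASSUMED to feed them.
FIELD_MAP = {
    "publication_date": ("PUBLISHED_ON", "PublicationDate"),
    "title": ("HAS_TITLE", "Title"),
    "creators": ("IS_CREATED_BY", "Creator"),
    "keywords": ("HAS_SUBJECT", "Subject"),
}


def merge_statements(record: dict) -> list:
    """Build (cypher, params) pairs that MERGE one record into the graph."""
    statements = []
    for field, (rel, label) in FIELD_MAP.items():
        values = record.get(field)
        if values is None:
            continue
        if not isinstance(values, list):
            values = [values]
        for value in values:
            # Labels and relationship types cannot be Cypher parameters, so
            # they are interpolated from the fixed FIELD_MAP, never from data.
            cypher = (
                "MERGE (d:SoilwiseDataSource {identifier: $identifier}) "
                f"MERGE (t:{label} {{value: $value}}) "
                f"MERGE (d)-[:{rel}]->(t)"
            )
            params = {"identifier": record["doi"], "value": value}
            statements.append((cypher, params))
    return statements
```

With the official Neo4j Python driver, each pair could then be passed to `session.run(cypher, params)`. Using `MERGE` rather than `CREATE` means re-running the load for the same record does not create duplicate nodes or relationships.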
Clean up
remove duplicate datasets in the database caused by modification of a datafile in the Zenodo repository; if two datafiles with the same identifier occur, remove the one with the oldest (i.e. furthest in the past) ‘modified’ data field
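The clean-up rule can be illustrated in plain Python before deciding how to express it against the database. The sketch below assumes each datafile is a dict with an `identifier` key and an ISO 8601 `modified` timestamp; both key names and the function name are hypothetical.

```python
from datetime import datetime


def keep_latest(records: list) -> list:
    """Keep, per identifier, only the record with the most recent 'modified'."""
    latest = {}
    for rec in records:
        key = rec["identifier"]
        current = latest.get(key)
        if current is None or (
            datetime.fromisoformat(rec["modified"])
            > datetime.fromisoformat(current["modified"])
        ):
            latest[key] = rec
    return list(latest.values())
```

Applying this to the extracted batch before loading avoids inserting older versions in the first place; duplicates already present in the database would still need a separate removal step.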
Automate script
Python script
run the ETL script in a Docker image every morning at 02:00; automatically extract from Zenodo and load (append) data to the Neo4j database
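If the schedule is handled inside the Python process rather than by an external scheduler, the next 02:00 run time could be computed as in the sketch below. The function name is hypothetical, and `now` is assumed to be a naive datetime in the container's local time.

```python
from datetime import datetime, time, timedelta


def next_run(now: datetime, at: time = time(2, 0)) -> datetime:
    """Return the next occurrence of `at` (default 02:00) strictly after `now`."""
    candidate = datetime.combine(now.date(), at)
    if candidate <= now:
        # Today's slot has already passed (or is right now): use tomorrow's.
        candidate += timedelta(days=1)
    return candidate
```

A cron entry with the expression `0 2 * * *` that starts the container is a common alternative and keeps scheduling out of the script entirely.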
Definition of Done:
Code changes for the ETL process are reviewed and approved by at least one other team member.
The automated incremental job runs successfully for 5 consecutive days.
Datafiles from the previous day can be queried and identified in the Neo4j database UI.
No datafiles with the same identifier occur.
The absolute number of datafiles with a ‘modified’ data field more recent than the ‘created’ data field never decreases on consecutive days.
Or
Acceptance criteria (more from user perspective):
I want to find data files from Zenodo that were uploaded or modified yesterday
I don’t want to find duplicates, but only the most recent version of a datafile
I want to obtain information about "created", "modified", "doi", "doi_url", "title", "publication_date", "description", "access_right", "creators", "keywords", "language", "license" for each data file