Test issue - Zenodo ETL Process #2

Open
8 tasks
BerkvensNick opened this issue May 14, 2024 · 0 comments


BerkvensNick commented May 14, 2024

Description:

Develop and deploy an ETL (Extract, Transform, Load) Python script that extracts data files from Zenodo and stores the data in a Neo4j database.

Developmental process/requirements:

  1. Data Extraction:
  • Implement incremental extraction to reduce extraction time and resource use: filter on the "modified" data field and run a daily batch process that extracts only the data files modified on the previous day.
  • Use the Zenodo REST API: https://zenodo.org/api/records
  • Select data related to "soil"; use "soil" as the search keyword/parameter.
  • Ensure extraction of all required data fields for each data file ("created", "modified", "doi", "doi_url", "title", "publication_date", "description", "access_right", "creators", "keywords", "language", "license").
  • Extract data in Dublin Core format. The process should probably run in three steps: (1) query all data files about soil that were modified yesterday; (2) collect the list of ids; (3) send the ids to the API to retrieve the records in Dublin Core format: https://zenodo.org/records/id/export/dublincore
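
The three extraction steps above could be sketched roughly as follows. This is a minimal sketch, not a tested integration: the helper names are hypothetical, the response shape (`hits.hits[].id`) and the range filter on `modified` are assumptions that should be checked against the Zenodo API documentation, and pagination is not handled.

```python
import json
import urllib.parse
import urllib.request
from datetime import date, timedelta

ZENODO_API = "https://zenodo.org/api/records"  # from the issue
EXPORT_URL = "https://zenodo.org/records/{id}/export/dublincore"  # from the issue


def build_search_params(today, keyword="soil", date_field="modified", size=100):
    """Step 1: query parameters for records modified on the day before `today`."""
    yesterday = (today - timedelta(days=1)).isoformat()
    # Assumption: the search accepts a Lucene-style range on `date_field`;
    # the searchable field name may differ from the one used in the issue.
    query = f"{keyword} AND {date_field}:[{yesterday} TO {yesterday}]"
    return {"q": query, "size": size, "page": 1}


def extract_ids(search_response):
    """Step 2: collect record ids from a decoded search response."""
    return [hit["id"] for hit in search_response.get("hits", {}).get("hits", [])]


def dublincore_url(record_id):
    """Step 3: export URL for one record, following the pattern in the issue."""
    return EXPORT_URL.format(id=record_id)


def fetch(url, params=None):
    """HTTP GET returning the response body as text (performs network I/O)."""
    if params:
        url = f"{url}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def extract_yesterday(today=None):
    """Run the three steps; returns {record id: Dublin Core export text}."""
    today = today or date.today()
    search = json.loads(fetch(ZENODO_API, build_search_params(today)))
    return {rid: fetch(dublincore_url(rid)) for rid in extract_ids(search)}
```

Keeping the query construction and id extraction separate from the network call lets both be unit-tested without contacting Zenodo.
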
  2. Data Transformation:
  • None at the moment.
  3. Data Loading:
  • Load the data into a Neo4j graph database.
  • For the time being, use the following nodes and relationships (to be replaced later by the selected soil ontology/data standard):
    SoilwiseDataSource - PUBLISHED_ON - PublicationDate,
    SoilwiseDataSource - TYPE_OF_DATASOURCE - DataSourceType,
    SoilwiseDataSource - HAS_REPO_SOURCE - Source,
    SoilwiseDataSource - HAS_TITLE - Title,
    SoilwiseDataSource - IS_CREATED_BY - Creator,
    SoilwiseDataSource - HAS_SUBJECT - Subject,
    SoilwiseDataSource - HAS_TITLE_KEYWORDS - TitleKeyword,
    SoilwiseDataSource - IS_LOCATED_AT - SpatialEntity,
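
As an illustration of the loading step, the sketch below builds Cypher statements for a subset of the relationships listed above. The node labels and relationship types come from the list; the property names (`identifier`, `modified`, `value`), the record keys, and the function name are assumptions made for the example. Each modified version gets its own `SoilwiseDataSource` node, so the load behaves as an append and older versions are left for the clean-up step.

```python
# (relationship type, target label, record key). Labels and relationship
# types are from the issue; the record keys are assumed for illustration.
RELATIONSHIPS = [
    ("PUBLISHED_ON", "PublicationDate", "publication_date"),
    ("HAS_TITLE", "Title", "title"),
    ("IS_CREATED_BY", "Creator", "creators"),
    ("HAS_SUBJECT", "Subject", "keywords"),
]


def build_statements(record):
    """Return (cypher, parameters) pairs that load one record."""
    source = {"identifier": record["doi"], "modified": record["modified"]}
    statements = [(
        # MERGE on identifier + modified keeps re-runs idempotent while
        # still adding a new node for every modified version.
        "MERGE (d:SoilwiseDataSource "
        "{identifier: $identifier, modified: $modified})",
        dict(source),
    )]
    for rel_type, label, key in RELATIONSHIPS:
        values = record.get(key) or []
        if isinstance(values, str):
            values = [values]
        for value in values:
            statements.append((
                "MATCH (d:SoilwiseDataSource "
                "{identifier: $identifier, modified: $modified}) "
                # Labels and relationship types cannot be query parameters,
                # so they are interpolated from the fixed list above.
                f"MERGE (n:{label} {{value: $value}}) "
                f"MERGE (d)-[:{rel_type}]->(n)",
                {**source, "value": value},
            ))
    return statements
```

With the official Neo4j Python driver, each pair could then be passed to `session.run(cypher, parameters)` inside a session.
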
  4. Clean up:
  • Remove duplicate datasets that appear in the database when a data file is modified in the Zenodo repository: if two data files share the same identifier, remove the one with the oldest (i.e. furthest in the past) "modified" data field.
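
The clean-up rule can be expressed as a small selection function. This is a sketch under assumptions: records are plain dictionaries with `identifier` and `modified` keys (hypothetical names), and all `modified` values use the same ISO 8601 form that `datetime.fromisoformat` accepts.

```python
from datetime import datetime


def stale_versions(records):
    """Return the records to remove: all but the newest version per identifier."""
    newest = {}
    stale = []
    for record in records:
        key = record["identifier"]
        current = newest.get(key)
        if current is None:
            newest[key] = record
            continue
        # Compare parsed timestamps rather than raw strings.
        record_time = datetime.fromisoformat(record["modified"])
        current_time = datetime.fromisoformat(current["modified"])
        if record_time > current_time:
            stale.append(current)
            newest[key] = record
        else:
            stale.append(record)
    return stale
```

The returned records are the ones a delete query would target; on a tie, the version seen first is kept.
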
  5. Automate script:
  • Python script
  • Run the ETL script in a Docker image every morning at 02:00 to automatically extract from Zenodo and load (append) the data into the Neo4j database.
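
The issue does not say how the 02:00 run is triggered. One option, sketched here as an assumption, is a long-running Python loop inside the container; a cron entry on the host or an orchestrator schedule would be equally valid. The function names are hypothetical, and the time is taken from the container's local clock.

```python
import time
from datetime import datetime, timedelta


def next_run(now, hour=2, minute=0):
    """Return the next datetime strictly after `now` that falls on hour:minute."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def run_daily(job, hour=2, minute=0):
    """Sleep until the next scheduled time, run `job`, and repeat forever."""
    while True:
        now = datetime.now()
        wait_seconds = (next_run(now, hour, minute) - now).total_seconds()
        time.sleep(max(wait_seconds, 0))
        job()
```

The container entry point could then call `run_daily(...)` with the ETL function as `job`.
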

Definition of Done:

  • Code changes for the ETL process are reviewed and approved by at least one other team member.
  • The automated incremental job has run successfully for 5 consecutive days.
  • Data files from the previous day can be queried and identified in the Neo4j database UI.
  • No two data files in the database share the same identifier.
  • The absolute number of data files whose "modified" data field is more recent than their "created" data field never decreases from one day to the next.

Or

Acceptance criteria (more from user perspective):

  • I want to find data files uploaded to or modified on Zenodo yesterday
  • I don’t want to find duplicates, but only the most recent version of a datafile
  • I want to obtain information about "created", "modified", "doi", "doi_url", "title", "publication_date", "description", "access_right", "creators", "keywords", "language", "license" for each data file
@BerkvensNick BerkvensNick changed the title Zenodo ETL Process Test issue - Zenodo ETL Process May 14, 2024