Test issue - Zenodo ETL Process #2

BerkvensNick · 2024-05-14T09:43:11Z

Develop and deploy an ETL (Extract, Transform, Load) Python script that extracts datafiles from Zenodo and stores the data in a Neo4j database

Implement incremental extraction to reduce extraction time and resource utilization: use the ‘modified’ data field to run a batch process that extracts only the datafiles modified on the previous day.
Use the Zenodo REST API: https://zenodo.org/api/records
Select data related to “soil”; use “soil” as the search keyword/parameter
Ensure extraction of all required data fields for each data file ("created", "modified", "doi", "doi_url", "title", "publication_date", "description", "access_right", "creators", "keywords", "language", "license").
Extract data in Dublin Core format; the process should probably run in 3 steps: (1) extract all datafiles about soil that were modified yesterday; (2) get the list of their ids; (3) send the ids to the API to retrieve the records in Dublin Core format (replace id in the URL with the record id): https://zenodo.org/records/id/export/dublincore
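The three steps above could be sketched roughly as follows. This is a minimal sketch, not a finished extractor: it assumes each search result is a dict with an "id" and an ISO 8601 "modified" timestamp, and it filters on "modified" client-side instead of relying on a server-side date query. The function names and the SEARCH_PARAMS constant are hypothetical, and fetching the result pages from the API is left out.

```python
from datetime import date, datetime, timedelta, timezone

SEARCH_URL = "https://zenodo.org/api/records"  # endpoint named in this issue
EXPORT_URL = "https://zenodo.org/records/{id}/export/dublincore"  # pattern from this issue
SEARCH_PARAMS = {"q": "soil"}  # assumption: "soil" is passed as the free-text query


def yesterday(today: date) -> date:
    """Return the day before `today` (the extraction window)."""
    return today - timedelta(days=1)


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp and normalise it to UTC."""
    # A trailing "Z" is rewritten because older Python versions reject it.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def select_ids(records, day: date):
    """Steps 1 and 2: keep the ids of records modified on `day`."""
    return [r["id"] for r in records if parse_timestamp(r["modified"]).date() == day]


def export_urls(ids):
    """Step 3: build one Dublin Core export URL per record id."""
    return [EXPORT_URL.format(id=record_id) for record_id in ids]
```

Keeping the date filter in a pure function makes the "modified yesterday" rule testable without calling the API; if the search endpoint can filter on the modification date directly, that would reduce the number of pages to fetch.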

Remove duplicate datasets that appear in the database when a datafile is modified in the Zenodo repository; if two datafiles with the same identifier occur, remove the one with the ‘oldest’ (i.e. furthest in the past) modified data field
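The deduplication rule could look something like this before loading. It is a sketch under assumptions: the "doi" field is used as the identifier (the issue does not say which field identifies a datafile), all "modified" values share the same ISO 8601 format, and keep_latest is a hypothetical helper name.

```python
from datetime import datetime


def keep_latest(records, key="doi"):
    """Return one record per identifier, keeping the most recently modified one."""
    latest = {}
    for record in records:
        identifier = record[key]
        current = latest.get(identifier)
        # Replace the stored record only if this one was modified later.
        if current is None or (
            datetime.fromisoformat(record["modified"])
            > datetime.fromisoformat(current["modified"])
        ):
            latest[identifier] = record
    return list(latest.values())
```

The same rule could instead be enforced inside the database, for example by updating an existing node in place when its identifier already exists; which approach fits better depends on how the load step is written.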

Python script
Run the ETL script in a Docker image every morning at 02:00; automatically extract from Zenodo and load (append) the data to the Neo4j database
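One way to get the 02:00 schedule inside a long-running container is a small loop in the script itself. This is only a sketch: next_run and run_daily are hypothetical names, the container's local clock is assumed to be the intended time zone, and a cron entry or an external scheduler would work just as well.

```python
import datetime as dt
import time


def next_run(now, at=dt.time(2, 0)):
    """Return the next datetime strictly after `now` that falls on the `at` time of day."""
    candidate = dt.datetime.combine(now.date(), at, tzinfo=now.tzinfo)
    if candidate <= now:
        # Today's slot has already passed, so schedule for tomorrow.
        candidate += dt.timedelta(days=1)
    return candidate


def run_daily(job):
    """Sleep until the next 02:00, run `job`, and repeat forever."""
    while True:
        now = dt.datetime.now()
        time.sleep((next_run(now) - now).total_seconds())
        job()
```

Here job would be the function that performs one extract-and-load pass; run_daily(job) then becomes the container's entry point.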

Code changes for the ETL process are reviewed and approved by at least one other team member.
Automated incremental job runs successfully for 5 consecutive days.
Datafiles from the previous day can be queried and identified in the Neo4j database UI.
No datafiles with the same identifier occur.
The absolute number of datafiles with a modified data field more recent than the created data field never decreases from one day to the next.
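The last two criteria can be checked mechanically. The sketch below assumes the records have been read back from the database as dicts with "doi", "created" and "modified" fields (ISO 8601 strings in one consistent format); using "doi" as the identifier and all three function names are assumptions.

```python
from collections import Counter
from datetime import datetime


def duplicate_identifiers(records, key="doi"):
    """Return the identifiers that occur more than once (should be empty)."""
    counts = Counter(record[key] for record in records)
    return sorted(identifier for identifier, n in counts.items() if n > 1)


def count_modified_after_creation(records):
    """Count records whose 'modified' timestamp is later than 'created'."""
    return sum(
        1
        for record in records
        if datetime.fromisoformat(record["modified"])
        > datetime.fromisoformat(record["created"])
    )


def never_decreases(daily_counts):
    """True if each day's count is at least the previous day's count."""
    return all(a <= b for a, b in zip(daily_counts, daily_counts[1:]))
```

Storing the result of count_modified_after_creation once per day and passing the series to never_decreases gives a simple pass/fail signal for the final criterion.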

Or, phrased as user stories:

I want to find data files from Zenodo that were uploaded or modified yesterday
I don’t want to find duplicates, but only the most recent version of a datafile
I want to obtain information about "created", "modified", "doi", "doi_url", "title", "publication_date", "description", "access_right", "creators", "keywords", "language", "license" for each data file
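Pulling the listed fields out of a record could be sketched as below. The layout is an assumption: the sketch looks for each field at the top level of the record first and then under a nested "metadata" dict, and returns None for anything it cannot find. The name pick_fields is hypothetical.

```python
REQUIRED_FIELDS = (
    "created", "modified", "doi", "doi_url", "title", "publication_date",
    "description", "access_right", "creators", "keywords", "language", "license",
)


def pick_fields(record):
    """Return a dict with exactly the required fields; missing ones map to None."""
    # Assumption: some fields may live under a nested "metadata" dict.
    metadata = record.get("metadata", {})
    return {name: record.get(name, metadata.get(name)) for name in REQUIRED_FIELDS}
```

Returning None for missing fields keeps every loaded record the same shape, which makes gaps easy to spot when querying the database later.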

BerkvensNick changed the title ~~Zenodo ETL Process~~ Test issue - Zenodo ETL Process May 14, 2024
