Harvesting metadata from ESDAC

ESDAC is a Drupal website with dedicated sections for datasets, maps and documents. This folder contains two scripts which together bring the ESDAC records into SWR.

fetch.py

Fetches the HTML pages and stores them in the PostgreSQL database.

For datasets, the first 5 list pages are collected. From each listing, the relevant page links are scraped, and then each link is fetched.
For maps (EUDASM) and documents, there are no child pages, so the metadata is scraped directly from the list page.
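
The link-scraping step for dataset listings can be sketched as follows. This is a minimal illustration using only the standard library, not the actual code in fetch.py. The sample HTML, the `https://example.org` base URL, the `/content/` path filter and the names `LinkCollector` and `extract_links` are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs of <a> tags whose href contains a given fragment."""

    def __init__(self, base_url, must_contain):
        super().__init__()
        self.base_url = base_url
        self.must_contain = must_contain  # hypothetical filter for "relevant" links
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.must_contain in href:
            url = urljoin(self.base_url, href)
            if url not in self.links:  # keep first occurrence only
                self.links.append(url)


def extract_links(html, base_url, must_contain):
    """Return the relevant detail-page links found in one listing page."""
    collector = LinkCollector(base_url, must_contain)
    collector.feed(html)
    collector.close()
    return collector.links


# Made-up listing page; the real ESDAC markup will differ.
SAMPLE_LISTING = """
<ul>
  <li><a href="/content/example-dataset-a">Example dataset A</a></li>
  <li><a href="/content/example-dataset-b">Example dataset B</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""

links = extract_links(SAMPLE_LISTING, "https://example.org", "/content/")
print(links)
```

Each URL returned this way would then be fetched and its HTML stored, while for maps and documents the listing HTML itself would be stored and parsed.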

parse.py

The parse script reads the stored HTML from the database and parses the content into Dublin Core metadata, which is written back to the database.
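
As a rough sketch of the HTML to Dublin Core step, the example below maps a page title and a meta description onto `dc:` keys. The choice of source elements, the key names and the helper names `DublinCoreExtractor` and `html_to_dublin_core` are assumptions for illustration. The mapping used by parse.py is not described here and may be quite different.

```python
from html.parser import HTMLParser


class DublinCoreExtractor(HTMLParser):
    """Pull a title and a description out of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.description = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def html_to_dublin_core(html, source_url):
    """Return a dict of Dublin Core style fields (assumed mapping)."""
    extractor = DublinCoreExtractor()
    extractor.feed(html)
    extractor.close()
    record = {"dc:identifier": source_url}
    title = "".join(extractor.title_parts).strip()
    if title:
        record["dc:title"] = title
    if extractor.description:
        record["dc:description"] = extractor.description
    return record


# Made-up page; real ESDAC pages will have different markup.
SAMPLE_PAGE = """
<html><head>
  <title>Example soil dataset</title>
  <meta name="description" content="Hypothetical description.">
</head><body></body></html>
"""

record = html_to_dublin_core(SAMPLE_PAGE, "https://example.org/content/example")
print(record)
```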

Resume parameter

The harvest process should resume where it left off last time. This mechanism is controlled by the environment variable HV_RESUME (default: true). URLs requested in previous runs are read from the database, and each URL about to be requested is checked against this list, so that URLs already fetched are not requested again.
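
The resume check could look roughly like the sketch below. Only the variable name HV_RESUME and its default of true come from this README. The set of values treated as "false", the function names and the in-memory list standing in for the database query are assumptions.

```python
import os


def resume_enabled(environ=os.environ):
    """Read HV_RESUME; anything other than an explicit 'false'-like value enables resume.

    The accepted spellings of false are an assumption, not taken from fetch.py.
    """
    value = environ.get("HV_RESUME", "true").strip().lower()
    return value not in ("false", "0", "no")


def urls_to_fetch(candidates, already_fetched, resume):
    """Return the candidate URLs that still need to be requested."""
    if not resume:
        return list(candidates)
    seen = set(already_fetched)  # set lookup keeps the check cheap for long lists
    return [url for url in candidates if url not in seen]


# Stand-in for the URLs stored in the database by earlier runs.
previous = ["https://example.org/content/a"]
candidates = ["https://example.org/content/a", "https://example.org/content/b"]

todo = urls_to_fetch(candidates, previous, resume_enabled({"HV_RESUME": "true"}))
print(todo)
```

Setting HV_RESUME to false in this sketch makes every candidate URL eligible again, which corresponds to a full re-harvest.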
