This folder provides tools to import into MongoDB the three datasets involved in the generation of the Covid-On-The-Web RDF dataset.
These are the MongoDB collections created:
- Dataset CORD-19:
  - `cord19_metadata`: `metadata_fixed.csv` (created from `metadata.csv` by `metadata_fix.sh`)
  - `cord19_json`: per-article JSON files
  - `cord19_json_light`: lightened and filtered version of `cord19_json`
- Named entities:
  - `entityfishing`: full files generated by Entity-fishing
  - `entityfishing_abstract`: copy of collection `entityfishing` where only titles and abstracts are kept (bodies are removed); see the sketch after this list
  - `entityfishing_*`: full files generated by Entity-fishing, split into collections of max 30000 documents
  - `entityfishing_*_body`: copy of collections `entityfishing_*` where only bodies are kept
  - `spotlight`: full files generated by DBpedia Spotlight
  - `spotlight_abstract`: copy of collection `spotlight` where only titles and abstracts are kept (bodies are removed)
  - `ncbo`: full files generated by NCBO Bioportal Annotator where only titles and abstracts are kept (bodies are removed)
  - `ncbo_*`: full files generated by NCBO Bioportal Annotator, split into collections of max 5000 documents
- Arguments extracted by the ACTA platform:
  - `acta`: full set of documents generated by ACTA
  - `acta_attack`: only attack relations
  - `acta_support`: only support relations
  - `acta_components_*`: PICO elements per type
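As an illustration, a derived collection such as `entityfishing_abstract` can be produced by an aggregation that drops the article bodies and writes the result to a new collection. The sketch below is only indicative: the database name (`cord19`) and the field holding the body (`body_text`) are assumptions, not necessarily the names used by the project's `*.js` scripts.

```bash
# Minimal sketch (assumed database and field names): derive an "_abstract"
# collection by removing the article bodies from the full collection.
mongo cord19 --quiet --eval '
  db.entityfishing.aggregate([
    { $project: { body_text: 0 } },     // drop the body, keep titles/abstracts
    { $out: "entityfishing_abstract" }  // write the result to a new collection
  ])
'
```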
Script `import-cord19.sh` is the entry point. Uncomment the lines at the end of the script as needed to import the datasets. It loads the datasets and creates the derived collections using the `*.js` files.
Note that the entity-fishing files are loaded twice: once into a single collection, used to generate the NEs extracted from the titles and abstracts, and once into multiple collections, used to generate the NEs extracted from the bodies.
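To give an idea of what this looks like, the commented-out lines at the end of the script could resemble the following; the helper names (`import_json_files`, `import_json_files_split`), directory variables and database name are illustrative placeholders, not the script's actual contents.

```bash
# Illustrative sketch only; the real commands live at the end of import-cord19.sh.
# Uncomment the lines corresponding to the datasets you want to (re)import.

#mongoimport --db cord19 --collection cord19_metadata --type csv --headerline --file metadata_fixed.csv
#import_json_files "$CORD19_DIR/document_parses" cord19_json     # hypothetical helper from import-tools.sh
#mongo cord19 cord19_json_light.js                               # build the lightened/filtered collection

# Entity-fishing output is loaded twice:
#import_json_files "$EF_DIR" entityfishing                       # single collection -> NEs from titles/abstracts
#import_json_files_split "$EF_DIR" entityfishing 30000           # collections of max 30000 docs -> NEs from bodies
```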
Script `import-tools.sh` defines functions to load groups of JSON files into MongoDB. These are needed because CORD-19 (as well as the Entity-fishing, Spotlight and Bioportal Annotator datasets) consists of 50,000+ JSON files, some of them large, that cannot all be loaded into MongoDB at once (at most 16 MB per import). Hence the need to split the files into multiple smaller groups.
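A minimal sketch of such a loading function is shown below, assuming one JSON document per file; the function name, arguments and progress reporting are illustrative, not necessarily what `import-tools.sh` actually does.

```bash
# Illustrative batched loader (assumed names): imports every *.json file of a
# directory into one MongoDB collection, one file at a time, so that a single
# mongoimport call never has to ingest the whole 50,000+ file corpus.
load_json_dir() {
    local dir="$1" db="$2" collection="$3"
    local count=0
    for f in "$dir"/*.json; do
        mongoimport --quiet --db "$db" --collection "$collection" --file "$f"
        count=$((count + 1))
        # Report progress every 1000 files
        if (( count % 1000 == 0 )); then
            echo "Imported $count files into $db.$collection"
        fi
    done
}

# Example usage (placeholder paths and names):
# load_json_dir ./cord19/document_parses/pdf_json cord19 cord19_json
```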