mongo

This folder provides tools to import, into MongoDB, the three datasets involved in the generation of the Covid-On-The-Web RDF dataset.

The following MongoDB collections are created:

  • Dataset CORD-19:
    • cord19_metadata: metadata_fixed.csv (created from metadata.csv by metadata_fix.sh)
    • cord19_json: per-article JSON files
    • cord19_json_light: lightened and filtered version of cord19_json
  • Named entities:
    • entityfishing: full files generated by Entity-fishing
    • entityfishing_abstract: copy of collection entityfishing where only titles and abstracts are kept (bodies are removed)
    • entityfishing_*: full files generated by Entity-fishing, split into collections of max 30000 documents
    • entityfishing_*_body: copy of collections entityfishing_* where only bodies are kept
    • spotlight: full files generated by DBpedia Spotlight
    • spotlight_abstract: copy of collection spotlight where only titles and abstracts are kept (bodies are removed)
    • ncbo: full files generated by NCBO Bioportal Annotator where only titles and abstracts are kept (bodies are removed)
    • ncbo_*: full files generated by NCBO Bioportal Annotator, split into collections of max 5000 documents
  • Arguments extracted by the ACTA platform:
    • acta: full set of documents generated by ACTA
    • acta_attack: only attack relations
    • acta_support: only support relations
    • acta_components_*: PICO elements per type

Script import-cord19.sh is the entry point: it loads the datasets and creates the derived collections using the *.js files. Uncomment the lines at the end of the script as needed to import the datasets.
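
For illustration only, the end of the entry point could look like the sketch below; the function names, database name and *.js file name are assumptions, not the actual contents of the script:

    # Hypothetical sketch of the end of import-cord19.sh: each line imports one
    # dataset or builds a derived collection; comment lines in or out as needed.
    DB=cord19                              # assumed database name

    # import_cord19_metadata "$DB"        # metadata_fixed.csv -> cord19_metadata
    # import_cord19_json     "$DB"        # per-article JSON   -> cord19_json
    # mongo "$DB" cord19_json_light.js    # derive cord19_json_light from cord19_json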

Note that the Entity-fishing files are loaded twice: once into a single collection, used to generate the NEs extracted from the titles and abstracts, and once into multiple collections, used to generate the NEs extracted from the bodies.
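
To give an idea of how such a derived collection can be built, the sketch below copies entityfishing into entityfishing_abstract while dropping the bodies; the database name and the body_text field name are assumptions, not taken from the actual *.js files:

    # Hedged sketch: create entityfishing_abstract from entityfishing by removing
    # the body annotations ("body_text" is an assumed field name).
    mongo cord19 --eval '
      db.entityfishing.aggregate([
        { $project: { body_text: 0 } },      // drop bodies, keep titles/abstracts
        { $out: "entityfishing_abstract" }   // write the result to the derived collection
      ])'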

Script import-tools.sh defines functions to load groups of JSON files into MongoDB. These are necessary because CORD-19 (as well as the Entity-fishing, Spotlight and Bioportal Annotator datasets) contains 50,000+ JSON files, some of them large, that cannot all be loaded into MongoDB in a single operation (at most 16 MB at a time). Hence the need to split the files into multiple smaller groups.
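
A minimal sketch of such a function is shown below, assuming the files are grouped into bounded batches, wrapped into a JSON array with jq, and loaded with mongoimport --jsonArray; the function name, batch size and paths are illustrative, not the actual import-tools.sh code:

    # Hedged sketch of a batched import (not the actual import-tools.sh).
    # batch_size must be small enough to keep each group under the 16 MB limit
    # that mongoimport enforces on --jsonArray input.
    import_json_dir() {
        local db=$1 collection=$2 dir=$3
        local batch_size=100
        local files=("$dir"/*.json)

        for ((i = 0; i < ${#files[@]}; i += batch_size)); do
            # Slurp one group of files into a single JSON array and import it.
            jq -s '.' "${files[@]:i:batch_size}" |
                mongoimport --db "$db" --collection "$collection" --jsonArray
        done
    }

    # Example (illustrative): import the CORD-19 per-article JSON files.
    # import_json_dir cord19 cord19_json ./document_parses/pdf_json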