mongo

This folder provides tools to import, into MongoDB, the three datasets involved in the generation of the Covid-On-The-Web RDF dataset.

The following MongoDB collections are created:

  • Dataset CORD-19:
    • cord19_metadata: metadata_fixed.csv (created from metadata.csv by metadata_fix.sh)
    • cord19_json: per-article JSON files
    • cord19_json_light: lightened and filtered version of cord19_json
  • Named entities:
    • entityfishing: full files generated by Entity-fishing
    • entityfishing_abstract: copy of collection entityfishing where only titles and abstracts are kept (bodies are removed)
    • entityfishing_*: full files generated by Entity-fishing, split into collections of max 30000 documents
    • entityfishing_*_body: copy of collections entityfishing_* where only bodies are kept
    • spotlight: full files generated by DBpedia Spotlight
    • spotlight_abstract: copy of collection spotlight where only titles and abstracts are kept (bodies are removed)
    • ncbo: full files generated by NCBO Bioportal Annotator where only titles and abstracts are kept (bodies are removed)
    • ncbo_*: full files generated by NCBO Bioportal Annotator, split into collections of max 5000 documents
  • Arguments extracted by the ACTA platform:
    • acta: full set of documents generated by ACTA
    • acta_attack: only attack relations
    • acta_support: only support relations
    • acta_components_*: PICO elements per type

Script import-cord19.sh is the entry point: it loads the datasets and creates the derived collections using the *.js files. Uncomment the lines at the end of the script as needed to import the datasets.
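
For illustration only, the end of the entry point could look like the sketch below; the function names, database name and *.js file name are assumptions, not the actual contents of the script:

    # Hypothetical sketch of the end of import-cord19.sh: each line imports one
    # dataset or builds a derived collection; comment lines in or out as needed.
    DB=cord19                              # assumed database name

    # import_cord19_metadata "$DB"        # metadata_fixed.csv -> cord19_metadata
    # import_cord19_json     "$DB"        # per-article JSON   -> cord19_json
    # mongo "$DB" cord19_json_light.js    # derive cord19_json_light from cord19_json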

Note that the Entity-fishing files are loaded twice: once into a single collection, used to generate the NEs extracted from the titles and abstracts, and once into multiple collections, used to generate the NEs extracted from the bodies.
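
To give an idea of how such a derived collection can be built, the sketch below copies entityfishing into entityfishing_abstract while dropping the bodies; the database name and the body_text field name are assumptions, not taken from the actual *.js files:

    # Hedged sketch: create entityfishing_abstract from entityfishing by removing
    # the body annotations ("body_text" is an assumed field name).
    mongo cord19 --eval '
      db.entityfishing.aggregate([
        { $project: { body_text: 0 } },      // drop bodies, keep titles/abstracts
        { $out: "entityfishing_abstract" }   // write the result to the derived collection
      ])'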

Script import-tools.sh defines functions to load groups of JSON files into MongoDB. These are necessary because CORD-19 (as well as the Entity-fishing, Spotlight and Bioportal Annotator datasets) contains 50,000+ JSON files, some of them large, that cannot all be loaded into MongoDB in a single operation (at most 16 MB at a time). Hence the need to split the files into multiple smaller groups.
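
A minimal sketch of such a function is shown below, assuming the files are grouped into bounded batches, wrapped into a JSON array with jq, and loaded with mongoimport --jsonArray; the function name, batch size and paths are illustrative, not the actual import-tools.sh code:

    # Hedged sketch of a batched import (not the actual import-tools.sh).
    # batch_size must be small enough to keep each group under the 16 MB limit
    # that mongoimport enforces on --jsonArray input.
    import_json_dir() {
        local db=$1 collection=$2 dir=$3
        local batch_size=100
        local files=("$dir"/*.json)

        for ((i = 0; i < ${#files[@]}; i += batch_size)); do
            # Slurp one group of files into a single JSON array and import it.
            jq -s '.' "${files[@]:i:batch_size}" |
                mongoimport --db "$db" --collection "$collection" --jsonArray
        done
    }

    # Example (illustrative): import the CORD-19 per-article JSON files.
    # import_json_dir cord19 cord19_json ./document_parses/pdf_json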