biolitmine

Public Repository for BioLitMine Source Code

BioLitMine is a tool that leverages PubMed literature to find relationships between genes, MeSH terms, pathways, and authors. This repository contains code used to download and extract information from various sources to find these relationships.

Download Data

To start, all PubMed literature is downloaded from the NLM FTP site as zipped XML files. NLM provides two types of PubMed data:

| Type | Data | BioLitMine Download Schedule |
| --- | --- | --- |
| `baseline` | Baseline set of MEDLINE/PubMed citation records | End of the year |
| `updatefiles` | Daily file updates that include new, revised, and deleted citations | End of the month |

MeSH Descriptor information and MeSH tree numbers are also downloaded on an annual basis.

All data are downloaded into a directory called `data_input`, and the download code is located in `download`.
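As a rough illustration of the download step, the sketch below builds the URL for one `baseline` or `updatefiles` archive and saves it under `data_input`. It is not the code in `download`. The host URL, the `pubmedYYnNNNN.xml.gz` naming pattern, and the helper names `pubmed_file_name`, `pubmed_file_url`, and `download_file` are assumptions made for this example and should be checked against the current NLM documentation.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumed location of the NLM PubMed files (HTTPS access to the FTP host).
BASE_URL = "https://ftp.ncbi.nlm.nih.gov/pubmed"


def pubmed_file_name(year: int, number: int) -> str:
    """Build a file name such as pubmed25n0001.xml.gz (assumed naming pattern)."""
    return f"pubmed{year % 100:02d}n{number:04d}.xml.gz"


def pubmed_file_url(kind: str, year: int, number: int) -> str:
    """Return the URL of one file; kind is 'baseline' or 'updatefiles'."""
    if kind not in ("baseline", "updatefiles"):
        raise ValueError(f"unknown data type: {kind}")
    return f"{BASE_URL}/{kind}/{pubmed_file_name(year, number)}"


def download_file(kind: str, year: int, number: int, out_dir: str = "data_input") -> Path:
    """Download one zipped XML file into out_dir and return its local path."""
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / pubmed_file_name(year, number)
    urlretrieve(pubmed_file_url(kind, year, number), str(target))
    return target
```

Calling `download_file("baseline", 2025, 1)` would attempt to fetch the first baseline file of the 2025 set; a loop over file numbers would cover a full set.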

Extract XML Information

The XML files are then extracted into tabular format using the R scripts in `rscripts`.

The default directories can be inspected and changed in two configuration files: `rscripts/config_baseline.R` and `rscripts/config_update.R`.

The extraction creates four types of output files, each containing different information:

| Output Files | Information |
| --- | --- |
| `brief_*.txt` | Literature overview, including PMID, journal title, publication/MEDLINE date, and publication type |
| `detail_*.txt` | Literature details, including PMID, journal title, publication/MEDLINE date, and abstract |
| `mesh_*.txt` | Literature MeSH details, including PMID and MeSH Descriptor name |
| `tempauthor_*.txt` | Author details, including first name, last name, affiliations, address, email, and associated PMIDs |
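To make the XML-to-table step concrete, here is a minimal Python sketch of the idea. The repository itself does this in R, so this is an illustration rather than the actual extraction code. It parses a tiny hand-written record that follows the PubMed XML element names and produces rows resembling the `brief_*.txt` and `mesh_*.txt` outputs. The sample record, the column order, the tab delimiter, and the helper names `extract_rows` and `to_tsv` are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hand-written sample record (not real data) using PubMed XML element names.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">12345</PMID>
      <Article>
        <Journal><Title>Example Journal</Title></Journal>
        <PublicationTypeList>
          <PublicationType>Journal Article</PublicationType>
        </PublicationTypeList>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Drosophila</DescriptorName></MeshHeading>
        <MeshHeading><DescriptorName>Signal Transduction</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_rows(xml_text):
    """Return (brief_rows, mesh_rows) as lists of tuples, one citation at a time."""
    root = ET.fromstring(xml_text)
    brief_rows, mesh_rows = [], []
    for citation in root.iter("MedlineCitation"):
        pmid = citation.findtext("PMID", default="")
        journal = citation.findtext("Article/Journal/Title", default="")
        pub_types = [
            pt.text or ""
            for pt in citation.iterfind("Article/PublicationTypeList/PublicationType")
        ]
        # One overview row per citation (assumed column order).
        brief_rows.append((pmid, journal, "; ".join(pub_types)))
        # One row per (PMID, MeSH Descriptor name) pair.
        for descriptor in citation.iterfind("MeshHeadingList/MeshHeading/DescriptorName"):
            mesh_rows.append((pmid, descriptor.text or ""))
    return brief_rows, mesh_rows


def to_tsv(rows):
    """Join rows into tab-separated text (assumed output delimiter)."""
    return "\n".join("\t".join(row) for row in rows)
```

Real baseline files are large, so a production version would likely stream records (for example with `ET.iterparse`) instead of loading a whole file into memory.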

Further Data Preparation

Finally, the extracted information is processed further by the scripts in `extract`. The steps include filtering out publications unrelated to the study of model organisms or humans and merging the extracted files.
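The filtering and merging described here could look roughly like the following sketch, which keeps only publications tagged with an organism of interest and joins their overview and MeSH rows by PMID. The descriptor set `KEEP_DESCRIPTORS`, the row layouts, and the function name `filter_and_merge` are hypothetical; the actual criteria live in the `extract` scripts.

```python
# Hypothetical set of MeSH Descriptors marking publications to keep.
KEEP_DESCRIPTORS = {"Humans", "Drosophila", "Mice"}


def filter_and_merge(brief_rows, mesh_rows, keep=KEEP_DESCRIPTORS):
    """Keep publications with at least one descriptor in `keep`, merged by PMID.

    brief_rows: iterable of (pmid, journal) tuples (assumed layout)
    mesh_rows:  iterable of (pmid, descriptor) tuples (assumed layout)
    """
    mesh_rows = list(mesh_rows)
    # PMIDs that carry at least one descriptor of interest.
    kept_pmids = {pmid for pmid, descriptor in mesh_rows if descriptor in keep}
    # Collect every MeSH term of each kept publication.
    terms_by_pmid = {}
    for pmid, descriptor in mesh_rows:
        if pmid in kept_pmids:
            terms_by_pmid.setdefault(pmid, []).append(descriptor)
    # Merge the overview rows with their MeSH terms.
    return [
        {"pmid": pmid, "journal": journal, "mesh_terms": terms_by_pmid[pmid]}
        for pmid, journal in brief_rows
        if pmid in kept_pmids
    ]


merged = filter_and_merge(
    brief_rows=[("1", "Journal A"), ("2", "Journal B")],
    mesh_rows=[("1", "Drosophila"), ("1", "Wings, Animal"), ("2", "Zea mays")],
)
```

In this example, `merged` holds a single record for PMID `1`, since PMID `2` has no descriptor in the kept set.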

