Supporting material for the paper "Analysis of 6.4 million SARS-CoV-2 genomes identifies mutations associated with fitness" (medRxiv). Figures and supplementary data for that paper are in the paper/ directory.
This code is open source, but we do not intend to support its use by outside groups. To use the outputs of this model, we recommend ingesting the tables strains.tsv and mutations.tsv.
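As a rough illustration of ingesting those tables, here is a minimal sketch using only the Python standard library. It assumes each file is tab-separated with a header row and sits in the current working directory; both points are assumptions, so adjust the paths to wherever the files live in your checkout. The helper names `load_tsv` and `describe` are hypothetical and not part of this package. The sketch does not assume any column names and simply reports whatever header the file contains.

```python
import csv
from pathlib import Path


def load_tsv(path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))


def describe(rows):
    """Return (row count, column names) without assuming any schema."""
    columns = list(rows[0].keys()) if rows else []
    return len(rows), columns


if __name__ == "__main__":
    # Assumed locations; change these to the actual paths of the tables.
    for name in ("strains.tsv", "mutations.tsv"):
        if Path(name).exists():
            count, columns = describe(load_tsv(name))
            print(f"{name}: {count} rows, columns = {columns}")
        else:
            print(f"{name}: not found in the current directory")
```

All values are loaded as strings, so any numeric columns would need converting before analysis.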
Clone this repository:
git clone git@github.com:broadinstitute/pyro-cov
cd pyro-cov
Install this Python package:
pip install -e .
Work with GISAID to get a data agreement. Define the following environment variables:
GISAID_USERNAME
GISAID_PASSWORD
GISAID_FEED
For example, my username is fritz and my GISAID feed is broad2.
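Before starting a download, it can help to confirm that the three variables are actually defined. The following is a small sketch of such a check, not something this repository provides: the helper name `missing_gisaid_vars` is hypothetical, and treating an empty value as missing is an assumption about what the download step needs.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = ("GISAID_USERNAME", "GISAID_PASSWORD", "GISAID_FEED")


def missing_gisaid_vars(environ=None):
    """Return the required variable names that are unset or empty.

    Pass a dict to check something other than the real process environment.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Report the state of the current environment without printing any secrets.
missing = missing_gisaid_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All GISAID environment variables are set.")
```

The check only prints variable names, never their values, so it is safe to run in a shared terminal or log.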
The following command downloads data from GISAID and clones repositories for other data sources:
make update
Preprocessing takes under an hour. Results are cached in the results/ directory, so re-running on newly pulled data should be able to reuse alignment and PANGO lineage classification work.
make preprocess
make analyze
Plots and tables are generated by running various notebooks.
If you use this software or the predictions in the paper/ directory, please consider citing:
@article {Obermeyer2021.09.07.21263228,
author = {Obermeyer, Fritz and
Schaffner, Stephen F. and
Jankowiak, Martin and
Barkas, Nikolaos and
Pyle, Jesse D. and
Park, Daniel J. and
MacInnis, Bronwyn L. and
Luban, Jeremy and
Sabeti, Pardis C. and
Lemieux, Jacob E.},
title = {Analysis of 2.1 million SARS-CoV-2 genomes identifies mutations associated with transmissibility},
elocation-id = {2021.09.07.21263228},
year = {2021},
doi = {10.1101/2021.09.07.21263228},
publisher = {Cold Spring Harbor Laboratory Press},
URL = {https://www.medrxiv.org/content/early/2021/09/13/2021.09.07.21263228},
eprint = {https://www.medrxiv.org/content/early/2021/09/13/2021.09.07.21263228.full.pdf},
journal = {medRxiv}
}