impuSARS: SARS-CoV-2 whole-genome Imputation

This repository contains impuSARS, a novel tool that imputes whole-genome sequences from partially sequenced SARS-CoV-2 samples. impuSARS also reports the lineage associated with the imputed sequence.

Table of contents

Installation
Quick start
Output
Example
Panel creation
Data
Dependencies
Citation
Version history

Installation

impuSARS has two installation modes: (i) Docker image or (ii) Conda environment. For the Docker image, you only need Docker installed. For the Conda environment, you need conda and curl or wget installed (see Dependencies for details). In both cases, impuSARS can be installed by running the following commands:

git clone https://github.com/babelomics/impuSARS
cd impuSARS
./install_impuSARS --mode <docker/conda>

where --mode takes the value docker or conda. Docker mode automatically builds the impuSARS Docker image, whereas Conda mode creates an impuSARS conda environment with all dependencies installed.

Quick start

Docker mode

An all-in-one script is available for Unix users. The script initializes the Docker container. Imputation can be run by executing the following command:

./impuSARS --infile /path/to/<file_fasta_or_vcf> \
           --outprefix <output_prefix> \
           [--reference <reference_fasta>]
           [--panel <panel_m3vcf>]
           [--threads <num_threads>]

where:

<file_fasta_or_vcf>: Input file in FASTA or VCF format. For FASTA files, unknown regions in the genome must be masked with Ns. For VCF files, genotypes from both known variant (1) and known reference (0) positions must be included.
<output_prefix>: Prefix given to output files. Output files are generated in the same directory as the input file.
<reference_fasta>: (Optional) FASTA file including reference sequence. If not included, SARS-CoV-2 reference will be considered (Default).
<panel_m3vcf>: (Optional) Trained reference panel in M3VCF format for imputation. By default, the SARS-CoV-2 reference panel is used. Users can create their own reference panel with the impuSARS_reference command.
<num_threads>: (Optional) Number of CPUs used for imputation. Default: 1.
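
Because FASTA input must have its unknown regions masked with Ns, it can be useful to check how much of a sequence is masked before imputing it. The following is a minimal sketch, not part of impuSARS: the function name n_fraction is hypothetical, and it simply counts N/n characters over all sequence characters in the file.

```shell
# Hypothetical helper (not shipped with impuSARS): print the fraction of
# masked bases (N or n) among all sequence characters in a FASTA file.
n_fraction() {
    awk '
        /^>/ { next }                      # skip FASTA header lines
        {
            total  += length($0)           # all sequence characters on this line
            masked += gsub(/[Nn]/, "")     # gsub returns the number of Ns removed
        }
        END {
            if (total == 0) { print "0.0000"; exit }
            printf "%.4f\n", masked / total
        }
    ' "$1"
}
```

For instance, n_fraction example/sequence.fa would print the masked fraction of the example sequence shipped with the repository.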

Experienced users, or users on other operating systems, can also build this image themselves (once the repository has been cloned) and run impuSARS directly from Docker as:

# Build image (only once)
docker build -t impusars .

# Run docker
docker run -it --rm -v <input_path>:/data impusars impuSARS \
           --infile /data/<file_fasta_or_vcf>  \
           --outprefix <output_prefix> \
           [--reference <reference_fasta>]
           [--panel <panel_m3vcf>]
           [--threads <num_threads>]

where arguments are detailed above and, additionally:

<input_path>: Directory where input file is located and output files will be generated. This directory will be mounted in the docker instance.

Conda mode

As with Docker, users who prefer the conda installation can run imputation from the conda environment as:

conda activate impusars
impuSARS --infile /path/to/<file_fasta_or_vcf> \
         --outprefix <output_prefix> \
         [--reference <reference_fasta>]
         [--panel <panel_m3vcf>]
         [--threads <num_threads>] 
conda deactivate

where arguments are equivalent to those in Docker mode.
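
When several samples need to be imputed, the command above can be wrapped in a loop. The sketch below is an illustration under stated assumptions, not part of impuSARS: the function name impute_dir and the IMPUSARS_CMD override variable are hypothetical, input files are assumed to end in .fa, and only the documented --infile, --outprefix and --threads options are used.

```shell
# Hypothetical wrapper (not shipped with impuSARS): run imputation on every
# *.fa file in a directory, using the file name (without .fa) as output prefix.
# IMPUSARS_CMD is an assumed override hook; it defaults to the conda-mode
# command "impuSARS". In Docker mode it could be set to ./impuSARS instead.
impute_dir() {
    dir="$1"
    threads="${2:-1}"
    for fasta in "$dir"/*.fa; do
        [ -e "$fasta" ] || continue          # no match: skip the literal glob
        prefix="$(basename "$fasta" .fa)"    # e.g. sample1.fa -> sample1
        "${IMPUSARS_CMD:-impuSARS}" --infile "$fasta" \
            --outprefix "$prefix" \
            --threads "$threads"
    done
}
```

In conda mode, this would be called as impute_dir /path/to/samples 4 after conda activate impusars. Setting IMPUSARS_CMD=echo prints the commands without running them.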

Output

After imputation, impuSARS returns two files:

<output_prefix>.impuSARS.sequence.fa: FASTA file including the whole-genome consensus sequence obtained from imputation.
<output_prefix>.impuSARS.lineage.csv: Lineage assigned with Pangolin to the previously imputed sequence.
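
To pull the assigned lineage out of the CSV file in a script, something like the following could be used. This is a sketch under assumptions: it supposes the file has a header row containing a column named lineage (as in Pangolin lineage reports), the function name lineage_of is hypothetical, and the parsing is naive (fields containing quoted commas before the lineage column are not handled).

```shell
# Hypothetical helper (not shipped with impuSARS): print the value of the
# "lineage" column for every data row of a comma-separated file.
# Assumption: the first line is a header that names a column "lineage".
lineage_of() {
    awk -F',' '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "lineage") col = i
            if (!col) { print "no lineage column found" > "/dev/stderr"; exit 1 }
            next
        }
        { print $col }                       # one lineage per data row
    ' "$1"
}
```

With the example from this README, the call would be lineage_of imputation.impuSARS.lineage.csv.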

Example

A simple example is provided for testing purposes. After installation, run:

# Docker mode
./impuSARS --infile example/sequence.fa \
           --outprefix imputation 
# Conda mode
conda activate impusars
impuSARS --infile example/sequence.fa \
         --outprefix imputation
conda deactivate

The example SARS-CoV-2 sequence has been internally sequenced and is available under the ENA Accession PRJEB43882 (see Data for details). This sequence includes a high rate of missing regions (Ns). Therefore, impuSARS will return a completely imputed genome sequence (FASTA file) and its corresponding assigned lineage (CSV file).

Panel creation

impuSARS now includes another all-in-one script that lets users create their own reference panel for SARS-CoV-2 or any other viral sequences to impute. Reference panels can be created as follows:

# Docker mode
./impuSARS_reference --name <reference_prefix> \
                     --output_path <output_path> \
                     --input_fasta <input_fasta> \
                     --genome_fasta <genome_fasta> \
                     [--unknown_nn <unknown_nn>]
                     [--threads <num_threads>]
# Conda mode
conda activate impusars
impuSARS_reference --name <reference_prefix> \
                   --output_path <output_path> \
                   --input_fasta <input_fasta> \
                   --genome_fasta <genome_fasta> \
                   [--unknown_nn <unknown_nn>]
                   [--threads <num_threads>]
conda deactivate

where:

<output_path>: Directory where the custom reference panel will be generated.
<reference_prefix>: Prefix given to the output reference panel, without extension. The reference panel is written to <reference_prefix>.m3vcf.gz.
<input_fasta>: FASTA file including the alignment of all sequences used to train and generate the reference panel.
<genome_fasta>: FASTA file with the reference genome for the virus to impute. For example, SARS-CoV-2 reference.
<unknown_nn>: (Optional) Special character used in alignment for missing nucleotides, if any. Default: "n".
<num_threads>: (Optional) Number of CPUs used for imputation. Default: 1.
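
Before building a panel, it may help to sanity-check the input alignment. The sketch below is not part of impuSARS and rests on an assumption: that "alignment" here means a multiple sequence alignment in which every FASTA record has the same length. The function name check_alignment is hypothetical.

```shell
# Hypothetical helper (not shipped with impuSARS): verify that all records in
# a FASTA file have the same length, as expected for a multiple alignment.
# Prints each record whose length differs from the first one and returns 1
# if any such record is found, 0 otherwise.
check_alignment() {
    awk '
        function check_record() {
            if (!have_record) return          # nothing read yet
            if (!have_width) { width = len; have_width = 1 }
            else if (len != width) {
                bad++
                print name ": length " len " (expected " width ")"
            }
        }
        /^>/ { check_record(); name = substr($0, 2); len = 0; have_record = 1; next }
        { len += length($0) }                 # records may span several lines
        END { check_record(); exit (bad > 0) }
    ' "$1"
}
```

A call such as check_alignment <input_fasta> && echo "alignment looks consistent" could precede impuSARS_reference.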

As before, experienced users can run the script directly using Docker as:

docker run -it --rm -v <input_path>:/data -v <ref_path>:/ref -v <output_path>:/output impusars \
       impuSARS_reference --name <reference_prefix> \
                          --output_path /output/ \
                          --input_fasta /data/<input_fasta_basename> \
                          --genome_fasta /ref/<genome_fasta_basename> \
                          [--unknown_nn <unknown_nn>] \
                          [--threads <num_threads>]

where <input_path> and <ref_path> are the directories containing <input_fasta> and <genome_fasta>, respectively, and <input_fasta_basename> and <genome_fasta_basename> are the basenames of those files (without path).
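
The split into directories and basenames can be derived from full file paths with dirname and basename. The following sketch only prints the resulting docker command so it can be inspected before running; the function name panel_docker_cmd is hypothetical, the optional flags are omitted, and paths are assumed to be absolute and free of spaces.

```shell
# Hypothetical helper (not shipped with impuSARS): print the docker command
# for impuSARS_reference given full paths to the input files.
# Arguments: <reference_prefix> <input_fasta> <genome_fasta> <output_path>
panel_docker_cmd() {
    name="$1"
    input_fasta="$2"
    genome_fasta="$3"
    output_path="$4"
    # Mount each file's directory, then refer to the file by basename inside
    # the container (/data, /ref and /output as in the command above).
    printf 'docker run -it --rm -v %s:/data -v %s:/ref -v %s:/output impusars impuSARS_reference --name %s --output_path /output/ --input_fasta /data/%s --genome_fasta /ref/%s\n' \
        "$(dirname "$input_fasta")" "$(dirname "$genome_fasta")" "$output_path" \
        "$name" "$(basename "$input_fasta")" "$(basename "$genome_fasta")"
}
```

Printing the command rather than executing it keeps the helper safe to try; the output can be copied into a terminal once it looks right.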

Data

Nine internally sequenced SARS-CoV-2 samples are available at the following repositories for validation purposes:

Raw sequencing data and consensus sequences: ENA Dataset Accession ID PRJEB43882.
impuSARS imputed sequences and lineages: Zenodo repository.

Also, impuSARS uses the hCoV-19/Wuhan/WIV04/2019 sequence as the official reference sequence, which is available here.

Finally, impuSARS was initially trained with a reference panel containing 239,301 sequences from GISAID (downloaded by January 7, 2021). We gratefully acknowledge all the laboratories and sequence contributors that made it possible to create this reference panel (acknowledgment). Current reference version (v2.1) contains 899,447 sequences (updated by June 17th, 2021).

Dependencies

impuSARS internally uses the following software:

BCFTools (v1.11)
Muscle (v3.8.31)
Minimac4 (v1.0.2)
Pangolin (v3.1.3)

Since impuSARS is encapsulated in a Docker image to facilitate distribution, only a Docker installation is required. Docker can be downloaded for any operating system at Get Docker. If the conda installation is preferred, two command-line tools are required:

Conda
curl or wget for downloading dependencies.

Citation

If you use impuSARS, please cite our publication:

Francisco M Ortuño, Carlos Loucera, Carlos S. Casimiro-Soriguer, Jose A. Lepe, Pedro Camacho Martinez, Laura Merino Diaz, Adolfo de Salazar, Natalia Chueca, Federico García, Javier Perez-Florido, Joaquin Dopazo. Highly accurate whole-genome imputation of SARS-CoV-2 from partial or low-quality sequences. GigaScience, 10(12):giab078, 2021. (https://academic.oup.com/gigascience/article/10/12/giab078/6448505)

Version history

V1.0 (2021-03-13): First release
V2.0 (2021-06-17): Update reference panel (v2.1) and pangolin (v3.1.3).
V3.0 (2021-10-07): Update reference panel (v3.0) and pangolin (v3.1.14). Indel imputation is now supported by the new reference panel.
V3.1 (2021-11-10): impuSARS is now supported from a conda environment.
V4.0 (2022-06-17): Update reference panel (v4.0) and pangolin (v4.0.6)

For additional version details, please go to Releases.
