The (metagenomic) virus analysis toolkit
</style>News: in the current version 1.2 of Virana we fixed several smaller bugs such as a recent incompatibility with FTPutil
3.0 and introduced a new algorithmical backend (clustered interval trees) for vhom
using bx-python
. As a result, vhom
is now two to three orders of magnitude faster on WGS data. Thanks for everyone who reported bugs and please send any more such findings to [email protected].
Virana, the virus analysis toolkit, is a Python-based software toolkit for analyzing metatranscriptomic (and, to a degree, metagenomic) sequence data in a context of clinical metagenomics in order to:
- identify microbial nucleotide sequences in transcriptomic and genomic short read data with very high computational performance of 50M reads/hour/core (for RNA-Seq data and against all viral references, a little slower if mapping against all human, viral, and bacterial references and using WGS-data.).
- identify microbial transcripts diverged from known references at high sensitivity even at 10% nucleotide divergence or more (yes, that is a lot for short read mappers)
- assemble identified microbial genomes and transcripts based on known references and put them in a context of homologous human factors (splice variants, genomic loci)
- identify nucleotide sequences with very low abundances by pooling reads across multiple samples and using reference-based assembly of syntenic regions
- produce concise visualizations of all homologous contexts as well as an array of other output options that facilitate in-depth analysis of putative microbial signatures.
Despite its name, Virana does work not only for viruses (although it's there where it can use its advantages to the fullest) but also for other microbial organisms such as bacteria, fungi, and protists. Virana can also applied to search for viral nucleotides in model organisms other than human or in sequence data of organisms for which no sequence assembly exists; however, Virana is best used for organisms with a good reference as well as with known splice site annotations and known cDNA sequences; only in this case many of the unique features (i.e., the analysis of multimapping reads) can be used to full extent.
A recent analysis of human tumor transcriptomes displays the ability of Virana to detect diverged viral nucleotide sequences and human-viral homologs with high sensitivity.
Virana has been build from the ground up to detect nucleotide signatures of oncogenic viruses in deep sequencing data. Due to several reasons (see manuscript for some virological and oncological details), oncogenic viruses are especially difficuly to find using the present sequencing technology. What virana allows you to do is to process your cancer genome or transcriptome data as part of your normal workflow in order to identify any known viruses (or, more gnerally, microbes). Since virana can produce standard-compliant bam
as part of the analysis, the outputs generated by virana
are not lost but can used in differential expression or SNV/SV pipelines. In contrast to competing approaches, Virana does not rely on discarding reads matching to the human genome; instead, we map all reads against a combined reference index of human and micoribial genomes. Reads matching to only microbes or to microbes as well as human factors are pooled, delineated by taxnomic family, assembled based on known genomes and transcripts, and put into their homologous context. The latter is then visualized for the user to allow informed decisions about the presence of microbial signatures in the short read data. By following this approach, we gain three major advantages:
- No filtering against the human reference, which is faulty anyway and requires specifying homology cutoffs which are arbitrary and may result in loosing human-viral homologs, such as oncogenes of human origin or fusion transcripts originating from viral integration
- Faster processing since no multi-tier analysis and subtraction of human-homologus reads has to be undertaken. Instead, we employ only one primary mapping.
- Using only one mapping has as additional advantages that human-microbial chimeric transcripts (as resulting from viral integration) are directly visible from the short read mapping and alignments indicating such structural variations can be exported as part of the virana analysis at no additional computational costs.
Analysis of short read data with virana is simple and only requires four command line calls (see tutorial for detailled explanations):
# Download human and viral reference genomes
$ vref fasta -r Homo_sapiens -r Viruses -o my_genome_db.fa
# Make a short read index
$ vmap rnaindex -r my_genome_db.fa -i my_rna_index_dir
# Map short read data and find putative microbial reads
$ vmap rnamap --index my_rna_index_dir --zipped \
--reads=viral_reads_1.fq.gz --reads=viral_reads_2.fq.gz \
--virana_hits=hits.bz2
# Put all reads into their homologous context and assemble microbial genomes
$ vhom regions --references=my_genome_db.fa --virana_hits=hits.bz2 \
--output_dir=my_families
That's it. The result is a directory structure containing detailled information on interesting sequence regions of each of the taxonomic family of the microbial classes (viruses, bacteria, protists, fungi) under investigation.
Virana is currently available in two distributions: first, as a set of command-line based python modules that wrap around binaries. Second, as a tool for the web-based workflow environment Galaxy. The first version allows for greater adabtablity, more parameter choices, integration in custom pipelines, and use of the Python modules as libraries (it's all proper object oriented code). The Galaxy tool distribution, on the other hand, automatically installs all dependencies (courtesy of the Galaxy toolshed), features a graphical user-interface, and provides more options for html-based, dynamic visualization of results.
The command-line version of Virana can be installed by first cloning the GIT repository by issuing $ git clone https://github.com/schelhorn/virana.git
. If you have git installed, this will produce a new directory virana
.
$ cd virana
$ python setup.py install
then installs the python distribution into the usual location and makes four command line executables (really just wrappers around python scripts) called vsim
, vref
, vmap
, vhom
and vplot
available in your $PATH
. See the tutorial for information on how to proceed from there.
Installation of the Galaxy version happens via the Galaxy toolshed; also, you have to be an admin in your Galaxy installation. See the galaxy project documentation for how to install toolshed repos in general. The virana toolshed repository is located here and can also be found from within your own galaxy installation by searching for the term virana in the public toolshed. For the advances users, a branch of the aforementioned virana github repository also contains the Galaxy tools. If you are an advanced Galaxy user, you know what to do.
For correct functioning, virana is depending on a couple of Python packages as well as on several binaries. The Python packages should usually be installed automatically; otherwise, pip
and easy_install
are your friends. The Python packages are:
plumbum
numpy
biopython>=1.61
pysam
ftputil>=2.4
HTSeq
matplotlib
bx-python
The binaries are used by the four virana executables are STAR version 2.3.1 or higher, BWA-mem, LASTZ, RazerS3, and Jalview. For the latter, only the Java packages jalview.jar
is required; the other binaries are commonly called STAR
, bwa
, and lastz
in case you want them to locate
on your GNU-Linux system. In case you receive a FATAL INPUT ERROR: unrecoginzed parameter name "outSAMprimaryFlag"
, then you probably do not have the latest STAR
version, which you may obtain here. Another short read mapper, SMALT, is currently not fully supported but may be added in the future. For mapping, either STAR
or BWA-mem
are required, but not both. Similarly, for construction of homologous regions, either razers3
or lastz
are required, but not both. All these binaries should be installed (see documentation for each of these binaries, they usually only require ./configure
and make
) and placed into the $PATH
by you or your administrator. If placing them into the $PATH
is not convenient, however, each virana component also allows specifying the location of the binaries via command line parameters (see the tutorial).
As indicated before, virana consists of five components:
- Virana simulator,
vsim
, a very simple (or even naive) metagenomic short read simulator that is convenient to use for microbial genomes - Virana reference,
vref
, a tool that automatically obtains viral, human, fungi, and bacterial references sequences as well as their taxonomic annotation from reference archives - Virana mapper,
vmap
, a tool that employs highly sensitive and fast short read mappers to align transcriptomic or genomic reads to annotated references and supports a variety of output formats - Virana homology,
vhom
, a tool that analyzes the homology relationships within mapped reads in order to extract homologous regions, i.e. nucleotide stretches that display high sequence similarity to a pathogen and, optionally, also to human factors - Virana plot,
vplot
, a tool that generates plots visualizing statistics of determined homologous regions. This component is not featured in the tutorial.
A typical analysis may consists of three steps. First, let us obtain metagenomic short read data. Since we do not want to download large data sets in order to demonstrate virana's features, we will simulate fastq files by using the virana metagenomic short read simulator, vsim
. Note that virana is really optimized towards analyzing mixed human-viral short read data, for example as originating from RNA-Seq or WGS of human tumors. While for the sake of this tutorial we will concentrate on viral data in order to allow swift processing, all command line calls are equally true for human-viral metagenomic short read data.
For the sake of this tutorial, let us employ several viral genomes that vsim
automatically obtains from NCBI. For this example, we will use two viruses with close human homologs, the Abelson murine leukemia virus gi:9626953
and the bovine herpesvirus 4 gi:13095578
, as well as two suspected (but not yet confirmed) tumor viruses, the TT virus gi:29502191
as well as the proviral insert of XMRV gi:336087897
(which very likely is a lab contaminant, but that is another story). Let's simulate reads for the first two viruses at low coverages of 5
-fold and for the second two at higher coverages of 50
-fold. Let's use a read length of 2x200
bp and write the reads with the output prefix viral_reads
.
$ vsim reads -r 9626953:5 -r 13095578:5 -r 29502191:50 -r 336087897:50 \
--read_length=200 -o viral_reads
If you are feeling adventurous or have time to spend, you can even add human data by appending -r 224589814:2
for a two-fold coverage of human chromosome 22. This will add ~100 Mbp of short reads to the analysis which may aid in masking the faint viral signatures that are generated by the other parameters, thus making virana's task a little more realistic.
The fastq outputs of this call will be written to viral_reads_1.fq.gz
and viral_reads_2.fq.gz
, respectively and contain randomly shuffled input reads (while retaining paired order, of course) at default error rates and the specified viral genome coverages. There are more options that you may want so specify later on, such as insert size and error rates. If you're interested in learning more, run $ vsim reads -h
for a full list of arguments. Also, the --debug
flag may help you seeing what is going on when you run the command.
First, a reference database is obtained from Ensemble and NCBI Refseq in fasta format that contains the human reference sequences Homo_sapiens
(i.e., the assembled chromosomes and unplaced sequences) as well as all known viral refseq sequences Viruses
(i.e., complete genomes). We store all fasta records into a common fasta file named my_reference_db.fa
.
$ vref fasta -r Homo_sapiens -r Viruses -o my_genome_db.fa
If you encounter problems, try adding the --debug
flag. By the way, Virana supports more reference other options for references are rRNA
(from Silva), Fungi
, Plasmids
, Protozoa
, Homo_sapiens_cDNA
, and Bacteria
. Please note that the reference database of all known bacterial genomes consumes several gigabytes of storage (about 20) and that some short read mappers may have problems with that large indexes. In addition, while Virana supports analyses on cellular organisms, it is really optimized towards viruses.
For later analyses with virana, we may also want to obtain human transcripts Homo_sapiens_cDNA
(coding as well as noncoding) in order to compare them with metatranscriptomic reads. We therefore run:
$ vref fasta -r Homo_sapiens_cDNA -o my_transcriptome_db.fa
By the way, virana also supports downloading blast reference databases; see $ vref -h
and $ vref blast -h
.
Next, we would like to combine the genomic reference database into an optimized index usable by a short read mapper. Virana vmap
currently supports two short read mappers: the transcriptomic mapper STAR and the genomic mapper BWA-mem. SMALT, third short read mapper optimized towards highly polymorphic genomes (or erroneous reads) is implemented and accessible in virana but not fully supported. Please let us know if you would like to see SMALT supported.
In this tutorial, we assume that the short reads generated in the simulation step really originated from an RNA-Seq experiment, so we employ vmap rnaindex
which internally calls the STAR indexer. We store the STAR index in a folder my_index_dir
and use 16
compute threads. The option --sparse
constructs a smaller but slower index; you may leave this parameter out if you have enough storage and RAM to hold a ~30GB index in memory. The advantage of a non-sparse index are higher mapping speeds (50M reads/hour/core). Please note that the index can become quite large, about 30 GB. If you encounter problems, try the --debug
flag.
$ vmap rnaindex -r my_genome_db.fa -i my_rna_index_dir --threads 16 --sparse
Alternatively or additionally, of course, a BWA-mem index can be generated; see $ vmap dnaindex -h
for details. If you run the tutorial with the dna or the rna pipeline makes little difference; for real experimental data, this choice should be made based on the sequencing library.
Next, we map the simulated metagenomic short read RNA-Seq data against the reference index. As an output, we obtain a standard (unsorted) bam
file, a summary statistics taxonomy.txt
that contains tabular information about the taxa in the reference database that received alignments, as well as a special hit file hits.bz2
that captures homology relationships between multi-mapping reads. We store a --sample_id
within the hits file in order to be able to pool samples later on.
Mapping for RNA-Seq data is done by star
, a particularly fast and feature-rich mapper. The path to the star binary (which often is called STAR
) is specified by --star-path
. If the bianrg is in your $PATH
, you may ommit this option. Similarly, the path to the samtools
binary os specified.
We run a particulare --sensitive
mapping with several compute --threads
. Note that the --reads
argument is given twice; the order matters due to the paired end nature of the read data. If we had only single-end data, one --read
argument would have sufficed. Input reads are --zipped
(alternatively, --bzipped
for *.bz2
compressed reads or --fqz=[fqz_binary]
for .fqz
compressed reads would allow us using these compression schemes). In this particular example, we restrict our hit output to reads that are alignable to Viruses
and have sufficient matches to the references (--min_continiously_matching
and --max_relative_mismatches
). Note that these filters only apply to the hit file - the taxonomy and the bam
file report all alignments. If you would like to allow multiple kinds of pathogens, just specify the --virana_hit_filter
multiple times (such as once again for Bacteria
. Again, if you encounter problems, try adding the --debug
flag to see what's going on.
Optionally, you could also add a --splice_junctions
parameter that requires a proper splice junction annotation file for the human genome in gtf format in order to increase sensitivity of detected splice junctions during mapping. Please email the authors if you need such a file.
$ vmap rnamap --index_dir my_rna_index_dir \
--reads=viral_reads_1.fq.gz \
--reads=viral_reads_2.fq.gz \
--zipped \
--star-path='STAR' \
--samtools_path='samtools' \
--taxonomy=taxonomy.txt \
--bam=mapping.bam \
--threads=8 \
--sensitive \
--virana_hits=hits.bz2 \
--virana_hit_filter=Viruses \
--sample_id=sample_1 \
--min_continiously_matching=25 \
--max_relative_mismatches=0.2
Here, we left out human --splice_junctions
that may be relevant if your input reads contain human transcripts. Note that the bam
file can be directly used within other analysis pipelines; you may want to $ samtools sort
it, though since it is currently sorted by input read order. We may also mention additional outputs; for example, providing --unmapped_end_1
and --unmapped_end_2
allow for outputting unmapped reads as fastq files, and --chimeric_mappings
activates output of chimerically mapping reads in SAM format. See $ vmap rnamap -h
for a list of other options.
If you have chosen to go with a genomic mapping instead of a transcriptomic mapping, $ vmap dnamap -h
is your friend. In this case most options stay the same, but the mapper is not star
but bwa-mem
and the path to the mapper binary specified via the --bwa-path
option (again, ommit if in $PATH
). For DNA data, additional input formats are available: for interleaved input read data you may use a single read file if specifying --interleaved
; similarly, --bam_input
allows you to use BAM alignments from a single BAM file. Note, however, that reads in the bam are used in the order they are stored in the file - this may cause problems if the BAM contains mapped and coordinate-sorted data as bwa-mem
will then estimate the insert size distribution based on reads mapped to telomer ends of the chromosome).
Last, the hit file generated in the previous step can be analyzed with regard to the homologous (i.e., transcriptomic and genomic) contexts of the reads therein. This greatly facilitates delineation of microbial from human sequence regions (i.e., syntenic sequence stretches consisting of assembled reads that align well to several references). This is enabled by the virana homology module vhom
, running on one or several hit files (the --virana_hits
argument can be supplied multiple times and the mappings therein will be pooled prior to analysis. Alternatively, a wildcard such as --virana_hits=/path/to/*_hits.bz2
will also result in all matching files to be pooled for analysis). In order to identify homologus relationships, genomic --references
and transcriptomic --cdna
fasta files as generated before are employed. The homologus regions generated will be restricted to contain a --min_read_number
of aligned short reads, will have a certain --max_gap_length
basepairs of gaps that are not covered by any sequence reads, and the whole region including gaps will be at least --min_region_length
basepairs long. The alignment of homologous regions is conducted either using razers3
(very fast and very high sensitivity, recommended) or lastz
(significantly slower on whole-genome sequencing data but even more sensitive). Specifying the path to one of these tools (by setting either --razers3_path
or --lastz_path
) selects one of these mappers. If choosing lastz
, an additonal parameter --word_length
is available that controls the sensitivity of the sequence alignment of short reads to references. Choosing --word_length=7
gives a very sensitive result for the construction of homologous regions. In this aprticular example, we also would like to see alignment plots of all homologous regions, so we specify a --jalview_jar_dir
where the jalview.jar file of the Jalview distribution is located on your system. vhom
will generate a directory structure within the --output_dir
where each subdirectory correponds to a taxonomic family (see next Section). Last, a tabular summary statistic --region_stats
will be generated that contains high-level information about the homologous regions.
$ vhom regions \
--references=my_genome_db.fa \
--cdna=my_transcriptome_db.fa \
--virana_hits=hits.bz2 \
--min_read_number=5 \
--max_gap_length=50 \
--min_region_length=100 \
--razers3_path='razers3' \
--jalview_jar_dir=/usr/lib/jalview/
--output_dir=my_families \
--region_stats=my_families/stats.txt
Again, there are more options, especially regarding paths to binaries used by vhom
. $ vhom regions -h
and adding the --debug
flag may help.
Since we restricted the hit file to Viruses in the previous step, viral taxonomic families it is. Within the output directory viral_families
, subdirectories are generated for each viral taxonomic family that the reads you specified earlier map to. Each family directory, in turn, contains a number of homologous regions that are represented by aligned, multi-fasta consensus sequences and visualizations of these consensus sequences using jalview
. Additionaly, plotting the contents of the --region_stats
by using R
and ggplot
allows for graphics as in the published manuscript.
Virana's computational performance is mostly bound by the mapping step vmap
and, to a lesser degree and only if the sample is rich in viral nucleotides, by the homologous region constructions undertaken by vhom
. In contrast, vsim
and vref
are often only used one to generate test and reference data, respectively.
For vmap
, the limiting factor preventing very high scalability is the Python-based, single-process filtering step of the alignments that is required to produce the hit files. While modern short read mappers such as RNA-STAR are able to map on the order of 50M reads/hour/core with high scalability even in --sensitive
mode, the python-based filtering currently can use only one core and tops out at about 100-150M reads/hour. Therefore, assigning more than about two cores to the short read mappers does not make computational sense if you want to produce virana hit files using RNA-STAR. In contrast, genomic data can (and perhaps should) be mapped by Virana using bwa-mem
by configuring vmap
approbriately. In this case, and if mapping is undertaken against all bacterial and viral references, less reads are mapped per core and hour (about 10 Mreads/hour), so specifying 8-16 cores for mapping may make sense in this case.
If your main aim is to produce --sam
files or --bam
files, you can specify vmap
to do just that and enjoy practically unlimited scalability (16 cores and 0.8G reads/hour). Still, even with two cores one can analyze a typical human RNA-Seq sample in one hour or a WGS sample in about ten hours. In addition, nothing prevents you from running multiple samples in parallel (except perhaps memory demands - our use of STAR currently employs serparate memory allocations for each run. using shared memory is possible, but does not work on all systems. See STAR manual if you would like to know more). Noteably, however, there are additional parameters to vmap
that may reduce performance: --filter_complexity
, for example, reduces the speed of the Python component to about 65M reads/hour, and --quality
may reduce these values even further. Parsing of the SAM
output of the short read mappers is already maxed out by the pysam module using fifos (at around 250M reads/hour in a single thread). Still, the Python based filtering could be further improved using multiprocessing.
The execution speed of vhom
is usually on the order of minutes since only few (compared to the number of the whole read data) viral reads are expected to be of possible microbial origin. Also, it consumes very little RAM. Still, since vhom
is single-threaded, future performance imporvements may be gained if Python-based multiprocessing would be introduced. Let us know if you would require this.
We geenrally recommend using the parameters shown in the tutorial for searching for oncogenic viruses; these parameters vary only slightly from the ones used in the manuscript and should produce almost identical results. Alternatively, the values from the manscuript can be used. While most parameters of vsim
, vref
, and vmap
should be self-explanatory to users with a deep-sequencing background. For other users, a search of Biostar and SEQanswers may produce the answers.
However, several parameters of vhom
have effects on the results that perhaps are not immediately clear to the user. --references
, for example, determines which nucleotide reference are used to construct the homologous context of putative microbial reads. While use of arbitrary genomic references makes sense for genomic reads, the use of host genomic references for RNA-Seq data may lead to split regions (due to splicing or the lack thereof in genomic data) and consequently lower interpretability of the data. Therefore, construction of a purely microbial --references
set for the use of vhom
can make sense in these cases (host references should still be used in vmap
in order to allow the proper use of --cdna
in vhom
). Next, several --virana_hits
can be pooled in vhom
by specifying the parameter multiple times. This will pool these data for analysis while still retaining identities of the pooled data (by using the --sample_id
provided in vmap
). As a consequence, the filter --min_read_number
may be affected since this filter applies on the pooled data in order to exclude spurious homologous regions that attracted only very few reads attributable to biological background noise. Increasing this filter may thus make sense with many pooled data or high sequencing depths. Similarly, --min_region_length
discards very short regions that may also be attributable to noise (or a very interesting miRNA, who knows...) The --max_gap_length
controls how long gaps within homologous regions that are not covered by any reads are allowed to be; while larger values facilitate merging of regions and thus often increase interpretability, too high values encourage artifical chimeras. The --word_length
specifies how sensitive the matching of reads and references in the homologous region construction is. Lower values result in increases in spurious matches (decrease in specificity at low gains of sensitivity) while higher values increase processing speed at cost of sensitivity. Still, higher values may make sense if you have large data sets to process. Last, have a look at the command-line references by calling $ vmap regions -h
; especially the use of --region_stats
and the Jalview-based visualizations can aid in analyzing large number of samples.
Virana has been developed for and validated in the following publication:
Schelhorn S-E, Fischer M, Tolosi L, Altmueller J, Nuernberg P, et al. (2013)
Sensitive Detection of Viral Transcripts in Human Tumor Transcriptomes.
PLoS Comput Biol 9(10): e1003228. doi:10.1371/journal.pcbi.1003228
Since publication of the manuscript, we have streamlined virana
and did away with the additional BLAST processing step in the homologous region generation step implemented by vhom
. Basically, we are now using the primary mapping a little smarter and make lastz
work harder, thus gaining processing speed without loosing sensitivity. Several other aspects of the analysis pipeline featured in the manscuript are not presently part of the Virana distribution; among these are de-novo transcriptome assembly using Oases and Trinity, or TBLASTX-based annotation. However, it is straightforward to replicate these pipeliens using our parameters using the standard binaries of these tools.
Validation data that has been used in the manuscript is not available on GitHub due to size limitations. Many of these files are available in public archives as indicated in the manuscript itself. Please contact us at [email protected] for access to the remaining files that are not available elsewhere.
Several alternative approaches for detecting microbial nucleotide signatures in general and oncogenic viruses in particular have been proposed. Among these are PathSeq, RINS, CaPSID, and READSCAN. All have their advantages and disadvantages, a (perhaps biased) view of which is detailled in our manuscript.
If you have problems with Virana or suggestions, please use the github bug tracking features or send an email to [email protected].