Skip to content
pd3 edited this page Dec 2, 2014 · 20 revisions

Installing samtools and bcftools


    git clone --branch=develop git://github.com/samtools/htslib.git
    git clone --branch=develop git://github.com/samtools/bcftools.git
    git clone --branch=develop git://github.com/samtools/samtools.git
    cd bcftools; make
    cd ../samtools; make 

Variant calling via mpileup
samtools mpileup -uf ref.fa aln.bam | bcftools call -mv -Oz > calls.vcf.gz
Note that the calling can be made more sensitive/restrictive by using different prior,
the default is bcftools call -P1e-3.

Variant filtering
Variant filtering is an area of active development and is not easy. The annotations produced by variant callers provide only indirect hints about quality of the calls, an approach which worked for one dataset may not work for another. Nowadays most powerful seem machine learning approaches such as SVM (not implemented in bcftools), see an example of SVM filtering pipeline here.

The least one can do is to apply fixed-threshold filtering. One can get a feel for typical distributions of annotations from a set of known good calls vs all calls. If you don’t already have a pipeline for this, use bcftools query to extract the annotations and plot manually.

A good measure of the callset quality is often ts/tv, the ratio of transitions and transversions. Try the following command with different QUAL thresholds
bcftools filter -i'%QUAL>20' calls.vcf.gz | bcftools stats | grep TSTV

Another useful metrics is sequencing depth (DP bigger than twice the average depth indicates problematic regions and is often enriched for artefacts); the minimum number of high-quality non-reference reads (DV); proximity to indels (-g), etc. To give a concrete example, the following filter seemed to work quite well for one particular dataset (human data, exomes):

bcftools filter -sLowQual -g3 -G10 \
-e'%QUAL<10 || (RPB<0.1 && %QUAL<15) || (AC<2 && %QUAL<15) || %MAX(DV)<=3 || %MAX(DV)/%MAX(DP)<=0.3' \
calls.vcf.gz

Consensus calling

  • via vcf2fq:
    samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq
  • via bcftools consensus:
    samtools mpileup -uf ref.fa aln.bam | bcftools call -mv -Oz -o calls.vcf.gz
    tabix calls.vcf.gz
    cat ref.fa | bcftools consensus calls.vcf.gz > cns.fa

Obtaining the list (or number) of samples from a VCF
To get the list of samples:
bcftools query -l file.vcf.gz
To get the number of samples:
bcftools query -l file.vcf.gz | wc -l

Querying a VCF
Retrieve the CHROM, POS and FORMAT/GT field from a VCF, include only lines where ID column is annotated
bcftools query -i'%ID!="."' -f'%CHROM %POS[ %GT]\n' file.vcf.gz

Add contig or missing INFO/FORMAT lines to a header
Some bcftools commands complain if there are missing contig, INFO or FORMAT lines in the header of your VCF. One way to add them is using bcftools annotate. Make a file with the header lines to be appended, then
bcftools annotate -h header_lines.txt file.vcf.gz
This will append the lines to your header. It will not replace or edit IDs that already exist.

Add INFO tags from a BED file
Suppose we want to create new INFO tags FOO and BAR, say FOO are strings
and BAR are integers. Create a BED file annots.bed which contains:
1 752566 752567 FooValue1 12345
1 752721 752722 FooValue2 67890
(Note that the same chromosome naming convention must be used in both BED and VCF file.)

Now we need to tell bcftools how is FOO and BAR defined. Create a header file
annots.hdr which contains:
##INFO=<ID=FOO,Number=1,Type=String,Description="Some description">
##INFO=<ID=BAR,Number=1,Type=Integer,Description="Some description">

Finally index the BED file and let bcftools do its job:
bgzip annots.bed
tabix -p bed annots.bed.gz
bcftools -a annots.bed.gz -h annots.hdr -c CHROM,FROM,TO,FOO,BAR in.vcf.gz

Clone this wiki locally