Skip to content
Allyson Dekovich edited this page Sep 27, 2022 · 9 revisions

We are following GATK Best practices described here: https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-

Add Read Groups to BAM file.

This is adding something to the bam files, so we'll do this in your 3_bwa folder

Load java and samtools

Use picard tools to add read groups

java -jar /pickett_shared/software/picard-2.27.4/picard.jar AddOrReplaceReadGroups \
	I=<your_read_set>_sorted.bam \
	O=<your_read_set>_sorted.RG.bam \
	RGSM=<your_read_set> \
	RGLB=<your_read_set> \
	RGPL=illumina \
	RGPU=<your_read_set>
  • I = input file (bam or sam)
  • O = output file (sam or bam or cram)
  • RGSM = Read-Group sample name Required.
  • RGLB = Read-Group library Required.
  • RGPL = Read-Group platform (e.g. ILLUMINA, SOLID) Required.
  • RGPU = Read-Group platform unit (eg. run barcode) Required.

[Lots more information on this step] (https://gatk.broadinstitute.org/hc/en-us/articles/360035890671?id=6472)

We are cheating by using sample name for barcode, because it does not matter in our downstream analysis

Check size. If all looks good, you can now remove the original bam file.

Now index

samtools index <your_read_set>_sorted.RG.bam

An example loop to process all samples:

for f in *_sorted.bam
do
        BASE=$( basename $f | sed 's/_sorted.bam*//g' )
        echo "BASE $BASE"
        
	java -jar /pickett_shared/software/picard-2.27.4/picard.jar \
		AddOrReplaceReadGroups \
		I=${BASE}_sorted.bam \
		O=${BASE}_sorted.RG.bam \
		RGSM=$BASE \
		RGLB=$BASE \
		RGPL=illumina \
		RGPU=$BASE
	samtools index ${BASE}_sorted.RG.bam
done

GATK4 has a combined tool to mark and sort duplicates called MarkDuplicateSpark tool. The sorts and marks duplicates based on query name and is different from Picard MarkDuplicate tool as it is coordinate sorted. * We will not mark dups because of the data type*.

GATK

mkdir 4_gatk
cd 4_gatk

# Create a links to BAM file with read groups, its index, and the reference genome
ln -s ../3_bwa/<your_read_set>_sorted.RG.bam .
ln -s ../3_bwa/<your_read_set>_sorted.RG.bam.bai .
ln -s /pickett_shared/teaching/EPP622_Fall2022/raw_data/solenopsis_invicta/genome/UNIL_Sinv_3.0.fasta

Prepare Reference for GATK

/pickett_shared/software/gatk-4.2.6.1/gatk CreateSequenceDictionary -R UNIL_Sinv_3.0.fasta
samtools faidx UNIL_Sinv_3.0.fasta

Outputs a .dict file

Call SNPs and indels on a single sample.

/pickett_shared/software/gatk-4.2.6.1/gatk HaplotypeCaller \
        -R UNIL_Sinv_3.0.fasta \
        -I <your_read_set>_sorted.RG.bam \
        -O <your_read_set>.vcf

Uses 4 threads by default.

But what we actually want to do is call the SNPs/indels using GVCF mode. This will allow us to later recall all the SNPs/indels using information combined across samples. We will also request a bam file with the realigned reads to view.

/pickett_shared/software/gatk-4.2.6.1/gatk HaplotypeCaller \
        -R UNIL_Sinv_3.0.fasta \
        -I <your_read_set>_sorted.RG.bam \
        -O <your_read_set>.g.vcf \
        -ERC GVCF \
        -bamout <your_read_set>_sorted.RG.realigned.bam

Compare the .vcf and the .g.vcf files. How are they different?

Copy your gvcf output file to a common directory

cp <your_read_set>.g.vcf /pickett_shared/teaching/EPP622_Fall2022/analysis/GVCFs

Example for loop:

for f in *_sorted.RG.bam
do
        BASE=$( basename $f | sed 's/-trimmed_sorted.RG.bam*//g' )
        echo "BASE $BASE"

        /pickett_shared/software/gatk-4.2.6.1/gatk HaplotypeCaller \
        -R UNIL_Sinv_3.0.fasta \
        -I $f \
        -O ${BASE}.g.vcf \
        -ERC GVCF \
        -bamout ${BASE}_sorted.RG.realigned.bam

done

IGV

Copy these files to your computer

  • reference genome
  • reference genome index (*fai)
  • <your_read_set>_sorted.RG.bam
  • <your_read_set>_sorted.RG.bam.bai
  • <your_read_set>_sorted.RG.realigned.bam
  • <your_read_set>_sorted.RG.realigned.bam.bai
  • <your_read_set>.vcf

View on the online IGV.

  • Are the SNPs in the vcf file supported by read evidence?
  • Can you find an example of read realignment by GATK?

Combine and recall SNPs across the whole group of samples

mkdir 5_gatk_combine
cd 5_gatk_combine/
ln -s /pickett_shared/teaching/EPP622_Fall2022/raw_data/solenopsis_invicta/genome/UNIL_Sinv_3.0.fasta
ln -s ../4_gatk/UNIL_Sinv_3.0.fasta.fai
ln -s ../4_gatk/UNIL_Sinv_3.0.dict .

Lets look at how many samples we have.

ls /pickett_shared/teaching/EPP622_Fall2022/analysis/GVCFs

We are going to use CombineGVCFs. But if you have a very large study, you should consider GenomicsDBImport.

CombineGVCFs wants every individual gvcf listed in a specific way with its own parameter flag. Lets write a for loop to generate that text

ln -s /pickett_shared/teaching/EPP622_Fall2022/analysis/GVCFs/*g.vcf .

echo "/pickett_shared/software/gatk-4.2.6.1/gatk CombineGVCFs \\
-R UNIL_Sinv_3.0.fasta \\"

for f in *g.vcf 
do
        echo "-variant $f \\"
done

echo "-O solenopsis_combined.g.vcf.gz"

That should give you something similar to

/pickett_shared/software/gatk-4.2.6.1/gatk CombineGVCFs \
-R UNIL_Sinv_3.0.fasta \
-variant SRR6922148.g.vcf \
-variant SRR6922194.g.vcf \
-variant SRR6922233.g.vcf \
-variant SRR6922241.g.vcf \
-variant SRR6922276.g.vcf \
-variant SRR6922278.g.vcf \
-variant SRR6922291.g.vcf \
-variant SRR6922294.g.vcf \
-variant SRR6922306.g.vcf \
-variant SRR6922308.g.vcf \
-variant SRR6922309.g.vcf \
-variant SRR6922311.g.vcf \
-variant SRR6922314.g.vcf \
-variant SRR6922315.g.vcf \
-variant SRR6922316.g.vcf \
-variant SRR6922318.g.vcf \
-variant SRR6922319.g.vcf \
-variant SRR6922320.g.vcf \
-variant SRR6922321.g.vcf \
-variant SRR6922354.g.vcf \
-variant SRR6922399.g.vcf \
-variant SRR6922446.g.vcf \
-variant SRR6922447.g.vcf \
-variant SRR6922448.g.vcf \
-variant SRR6922449.g.vcf \
-variant SRR6922451.g.vcf \
-variant SRR6922454.g.vcf \
-variant SRR6922470.g.vcf \
-O solenopsis_combined.g.vcf.gz

Run that. You should now have a solenopsis_combined.g.vcf.gz file. Lets recall all the variants

/pickett_shared/software/gatk-4.2.6.1/gatk --java-options "-Xmx10g" GenotypeGVCFs \
   -R UNIL_Sinv_3.0.fasta  \
   -V solenopsis_combined.g.vcf.gz \
  -O solenopsis_combined.vcf.gz

Lets look at how many snps we have

spack load bcftools
bcftools stats solenopsis_combined.vcf.gz > solenopsis_combined.vcf.stats.txt

These SNPs are unfiltered and probably have many false positives. The next recommended step is variant quality score recalibration (VQSR), however, that depends on high-quality sets of known variants to use as training. We don't have that, so we will do some hard threshold filtering. Lets start with a basic depth and quality value filtering

/pickett_shared/software/gatk-4.2.6.1/gatk VariantFiltration \
        -R UNIL_Sinv_3.0.fasta  \
        -V solenopsis_combined.vcf.gz \
        -O solenopsis_combined.Basicfilters.vcf.gz \
        -filter-name "basic_filter" -filter "QUAL > 20.0 && DP > 5"

Here are recommended filtering parameters from GATK with more explanation here. We don't have time but you would ideally split out indels and snps, then apply independent filters as described in the article.

/pickett_shared/software/gatk-4.2.6.1/gatk VariantFiltration \
        -R UNIL_Sinv_3.0.fasta  \
        -V solenopsis_combined.vcf.gz \
        -O solenopsis_combined.GATKfilters.vcf.gz \
        -filter-name "QD_filter" -filter "QD < 2.0" \
        -filter-name "FS_filter" -filter "FS > 60.0" \
        -filter-name "MQ_filter" -filter "MQ < 40.0" \
        -filter-name "SOR_filter" -filter "SOR > 4.0" \
        -filter-name "MQRankSum" -filter "MQRankSum < -12.5" \
        -filter-name "ReadPosRankSum" -filter "ReadPosRankSum < -8.0" 

The latter two give a warning, but its still working fine according to the GATK forum.

  • QD = QualByDepth
  • FS = FisherStrand - filter for strand bias
  • SOR = StrandOddsRatio
  • MQ = RMSMappingQuality
  • MQRankSum = MappingQualityRankSumTest
  • ReadPosRankSum

Use bcftools stats to look at your filtered files. how are they different from the unfiltered and from each ohter? The tool only marks sites that do not pass, it does not remove them from the file. So lets get stats on only passing variants

bcftools stats -f "PASS,." solenopsis_combined.vcf.gz >solenopsis_combined.vcf.stats.txt
bcftools stats -f "PASS,." solenopsis_combined.Basicfilters.vcf.gz >solenopsis_combined.Basicfilters.vcf.stats.txt
bcftools stats -f "PASS,." solenopsis_combined.GATKfilters.vcf.gz >solenopsis_combined.GATKfilters.vcf.stats.txt
grep 'number of SNPs:' *stats.txt
Clone this wiki locally