Skip to content

Latest commit

 

History

History
112 lines (92 loc) · 3.95 KB

variant-calling.md

File metadata and controls

112 lines (92 loc) · 3.95 KB

Variant Calling

The variantcalling.nf nextflow script will take any number of samples with paired-end reads in FASTQ format, map reads using Bowtie2, process BAM files, and finally call variants using BCFtools v1.21 and/or Freebayes v1.3.6. If part of the pipeline is unsuccessful for a sample then these errors are ignored.

Pipeline flowchart:

graph TD;
    FASTQ-->|Bowtie2| a1[Align Reads to Reference Genome];
    a1-->|GATK4| p2[Process BAM Files];
    p2-->|BCFtools| b1[Call Variants];
    p2-->|Freebayes| f1[Call Variants];
Loading

Dependencies (version tested)

  • Nextflow (24.04.4)
  • Java (18.0.2.1)
  • Python (3.10)
  • Perl (5.32.1)
  • Bowtie2 (2.5.3)
  • SAMtools (1.19.2)
  • GATK4 (4.5)
  • BCFtools (1.21)
  • Freebayes (1.3.6)

Conda Environment

Create environment using conda:
conda env create -f ./nextflow-pipelines/env/variantcalling.yml

Activate conda environment:
conda activate variantcalling or source activate variantcalling

Setup

Sample Sheet

The sample sheet is a CSV file with five columns. Make sure the column headings are exactly the same as below.

Column 1: sample (sample name)
Column 2: library (sample name if individual libraries built for each sample)
Column 3: run (if samples were sequenced on different runs use this column to specify)
Column 4: read1 (absolute path to read1.fq.gz)
Column 5: read2 (absolute path to read2.fq.gz)

Example of a sample_sheet.csv:

sample library run read1 read2
sample_01 sample_01 Dec_2022 /home/path/read1.fq.gz /home/path/read2.fq.gz
sample_02 sample_02 Dec_2022 /home/path/read1.fq.gz /home/path/read2.fq.gz
sample_03 sample_03 Jun_2023 /home/path/read1.fq.gz /home/path/read2.fq.gz
sample_04 sample_04 Jun_2023 /home/path/read1.fq.gz /home/path/read2.fq.gz

Usage

#!/bin/bash

# Activate conda environment
conda activate variantcalling

# Variables
cpus=20
sampleSheet=/path/to/sample/sheet.csv
genome=/path/to/reference/genome.fasta
outdir=/path/to/output/directory

# Index reference genome
bowtie2-build ${genome} ${genome}

# Run pipeline
nextflow run ~/nextflow-pipelines/src/variantcalling.nf \
    --sampleSheet ${sampleSheet} \
    --genome ${genome} \
    --outdir ${outdir} \
    --variantCaller "both" \
    --vcf "variants" \
    --cpus 20
Parameter                            Description
--sampleSheet sample sheet in CSV format
--genome path to reference genome (index genome before running pipeline)
--outdir path to output directory
--variantCaller "bcftools", "freebayes", or "both"
--bcftools_mpileup string of additional arguments passed to bcftools mpileup (e.g. "--min-MQ 40 --min-BQ 30")
--bcftools_call string of additional arguments passed to bcftools call (e.g. "--ploidy 2 --multiallelic-caller --variants-only")
--freebayes_params string of additional arguments passed to freebayes (e.g. "-p 2 --min-mapping-quality 40 --min-base-quality 30 --min-alternate-count 5 -g 200 --genotype-qualities")
--vcf string denoting the output VCF prefix
--test prints out a tuple of the sample ID and paths to the input paired reads (dry run)
--cpus integer denoting the number of cpus (default: 16)

Input

$ ls sample_sheet/
sample_sheet.csv

$ ls genome/
GCA_040759855.1_ASM4075985v1_genomic.fna
GCA_040759855.1_ASM4075985v1_genomic.fna.1.bt2
GCA_040759855.1_ASM4075985v1_genomic.fna.2.bt2
GCA_040759855.1_ASM4075985v1_genomic.fna.3.bt2
GCA_040759855.1_ASM4075985v1_genomic.fna.4.bt2
GCA_040759855.1_ASM4075985v1_genomic.fna.rev.1.bt2
GCA_040759855.1_ASM4075985v1_genomic.fna.rev.2.bt2

Output

$ ls ${outdir}
variants_bcftools.vcf.gz
variants_freebayes.vcf.gz