Nanoflow is a pipeline written in Snakemake that automates many of the steps of quality control, de novo assembly, and genome annotation in whole-genome sequencing analysis using Oxford Nanopore sequencing data.
20190104: add github page
- Install the Conda environment manager, and make sure the ~/.condarc file is in your home directory.
nano ~/.condarc
Copy the following to the file.
channels:
- bioconda
- conda-forge
- defaults
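If you would rather not edit the file by hand, the same channel list can be written from the shell. This is a minimal sketch: `write_condarc` is a hypothetical helper name, not part of Nanoflow, and it deliberately refuses to overwrite an existing file.

```shell
# Sketch: write the channel list shown above to a conda config file.
# write_condarc is a hypothetical helper; pass the target path explicitly.
write_condarc() {
    target="$1"
    # Refuse to overwrite an existing file so a customised config is not lost.
    if [ -e "$target" ]; then
        echo "refusing to overwrite existing $target" >&2
        return 1
    fi
    cat > "$target" <<'EOF'
channels:
  - bioconda
  - conda-forge
  - defaults
EOF
}

# Example: write_condarc "$HOME/.condarc"
```

If a ~/.condarc already exists, merge the channel list into it manually instead.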
- Install GCC5 by cloning Jesse's conda-gcc5 repository and creating a new conda environment named nanoflow.
cd ~
git clone https://github.com/ressy/conda-gcc5.git
cd conda-gcc5
bash setup.sh nanoflow
- Clone this repository into a local directory and install the packages into the nanoflow environment.
git clone https://github.com/zhaoc1/nanoflow.git nanoflow
cd nanoflow
source activate nanoflow
conda install -n nanoflow -c bioconda snakemake=4.8.1
conda env update --name=nanoflow --file env.yml
- Clone Ryan Wick's Basecalling-comparison repository
mkdir local
cd local
git clone https://github.com/rrwick/Basecalling-comparison.git
- Download the other packages into the local directory
## Canu 1.8
wget https://github.com/marbl/canu/archive/v1.8.tar.gz
tar -xvf v1.8.tar.gz
cd canu-1.8/src
make -j 4
## Nanopolish v0.9.0
git clone --recursive https://github.com/jts/nanopolish.git
cd nanopolish
make
## Unicycler
git clone https://github.com/rrwick/Unicycler.git
cd Unicycler
python3 setup.py install
## set up for Quast
git clone https://github.com/lucian-ilie/E-MEM.git
cd E-MEM
make
- Basecalling: the raw fast5 signal data files are basecalled using ONT's Albacore command-line tool (v2.2.7), with barcode demultiplexing and FASTQ output. You can run the basecalling step either through Snakemake or with the run_albacore.sh bash script, after supplying the proper directory information.
snakemake --configfile all_basecalling
- Preprocess: quality-filter the long reads, keep those that were confidently binned, and subsample them
snakemake --configfile config.yml --cores 8 all_qc
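The all_qc target handles the filtering for you. As a rough illustration of what a read filter does, here is a plain length filter over FASTQ records. It is a simpler stand-in, not the quality filter the pipeline applies, and `filter_fastq_by_length` is a hypothetical helper name.

```shell
# filter_fastq_by_length MIN_LEN < in.fastq > out.fastq
# Hypothetical helper: keeps 4-line FASTQ records whose sequence has
# at least MIN_LEN bases. Assumes unwrapped (4 lines per read) FASTQ.
filter_fastq_by_length() {
    awk -v min="$1" '
        NR % 4 == 1 { header = $0 }   # @read_id line
        NR % 4 == 2 { seq = $0 }      # bases
        NR % 4 == 3 { plus = $0 }     # separator line
        NR % 4 == 0 {                 # quality string completes the record
            if (length(seq) >= min) print header "\n" seq "\n" plus "\n" $0
        }
    '
}

# Example: filter_fastq_by_length 1000 < reads.fastq > reads.min1kb.fastq
```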
- Hybrid assembly option 1: Canu + Nanopolish (+ Circlator + Pilon)
- Long-reads-only product: a long-reads-only assembly polished with signal data, which can be used by hybrid assembly option 3.
snakemake --configfile config.yaml --cores 8 all_draft1
- Hybrid assembly option 2: Unicycler (default mode)
- depth=X in the FASTA header: preserves the relative depths of the contigs. This is mainly useful for plasmid sequences, which should be more highly represented in the reads than the chromosomal sequence.
snakemake --configfile config.yaml --cores 8 all_draft2
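To inspect the depth=X values after the assembly finishes, the FASTA headers can be scanned from the shell. This is a sketch under assumptions: `list_contig_depths` is a hypothetical helper, and it assumes headers of the form `>1 length=5386 depth=1.00x circular=true`, with space-separated key=value fields.

```shell
# list_contig_depths < assembly.fasta
# Hypothetical helper: prints "<contig name><TAB><depth value>" for
# every FASTA header that carries a depth= field.
list_contig_depths() {
    awk '/^>/ {
        name = substr($1, 2)              # drop the leading ">"
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^depth=/) {
                value = $i
                sub(/^depth=/, "", value) # keep only the value, e.g. 1.00x
                print name "\t" value
            }
        }
    }'
}

# Example: list_contig_depths < assembly.fasta
```

Contigs with a depth well above that of the chromosome are candidates for multi-copy plasmids.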
- Hybrid assembly option 3: Unicycler (existing long reads assembly option)
snakemake --configfile config.yaml --cores 8 all_draft3
- For the final draft genome, a common practice is to choose two of the assembly results you are happy with, assess them against the provided reference genome, compare one to the other, and map reads back to the draft genomes to calculate the coverage. All of these tasks are implemented in assembly.rules.
- We sequenced C. diff isolates at PCMP, and therefore the run_prokka rule uses the genus-level Prokka database. If you study a different organism, please build the Prokka genus database yourself and change the corresponding lines in the run_prokka rule.
snakemake --configfile config.yaml --cores 8 all_final
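The coverage step can also be reproduced by hand on a single draft. The sketch below is not the code in assembly.rules. It assumes minimap2 and samtools are installed, the file names are placeholders, and `mean_depth` is a hypothetical helper that averages the depth column of `samtools depth` output.

```shell
# mean_depth < depth.tsv
# Hypothetical helper: averages column 3 of a three-column table
# (contig, position, depth), the layout printed by `samtools depth`.
mean_depth() {
    awk '{ total += $3 }
         END { if (NR > 0) printf "%.2f\n", total / NR; else print "NA" }'
}

# Example (placeholder file names; requires minimap2 and samtools):
#   minimap2 -ax map-ont draft.fasta reads.fastq | samtools sort -o aln.bam -
#   samtools depth -a aln.bam | mean_depth
```

The `-a` flag makes `samtools depth` report positions with zero coverage too, so the mean is taken over the whole draft rather than only the covered bases.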
- Assembly assessment and comparison
- Metrics description
- Misjoins: locations where two adjacent sequences in the assembly should be split apart and placed at distinct locations in order to match the reference.
- Relocation: a misjoin where a segment needs to be moved elsewhere on the chromosome.
- Misassemblies: QUAST categorizes misassemblies as either local (less than 1 kbp discrepancy) or extensive (more than 1 kbp discrepancy).
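The local versus extensive split is a simple threshold on the discrepancy size. The sketch below only illustrates that rule and is not QUAST code. `classify_discrepancy` is a hypothetical helper, and treating a discrepancy of exactly 1000 bp as extensive is an assumption, since the description above leaves that boundary case open.

```shell
# classify_discrepancy LENGTH_BP
# Hypothetical helper: applies the 1 kbp threshold described above.
# Assumption: exactly 1000 bp is reported as "extensive".
classify_discrepancy() {
    if [ "$1" -lt 1000 ]; then
        echo "local"
    else
        echo "extensive"
    fi
}

# Example: classify_discrepancy 250
```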
- A good reference guide for interpreting the dot plot is available here.
- Some good tutorials 😳
- Align two draft sequences using MUMmer.
- Evaluate the assembly using MUMmer.
- Assembly evaluation with QUAST.
- Multiple assemblies comparison using QUAST.
- Highly similar sequences with rearrangements using run-mummer3 [TODO].
- Assembly to assembly comparisons using Minimap2 [TODO].
- Microbial genomics tutorials using PacBio long reads from ABRPI-Training.
- de.NBI Nanopore Training Course.
- Wish you knew sooner 😔
- Minimap2 and the future of BWA, from Heng Li's blog.
- Long reads assembly: indels cause interrupted genes, from Mick Watson's blog. I also have an example of this issue: demo_interrupted_genes.
- This paper discusses the commonly incorrect use of the max_target_seqs parameter of BLAST.
- Two optional features provided by Nanoflow:
- assess draft genomes using QUAST
snakemake --configfile config.yaml _all_quast --use-conda
- IGV: short/long reads mapped to draft assembly
snakemake --configfile config.yaml _all_igv
- To generate a bioinformatics report, refer to bioinfo_report.Rmd. An example output is shown in bioinfo_report.pdf.