- |2018-09-21| Rename wtdbg-1.2.8 to wtdbg2
- |2018-09-19| GFA supports
wtdbg-dot2gfa.pl dbg.3.dot >dbg.3.gfa
- |2018-09-01| New consensus module: POACNS
wtpoa-cns -t 0 -i dbg.ctg.lay -fo dbg.ctg.poacns.fa
1, wtpoa-cns implements POA to generate MSA from reads fragments within an edge. Used SSE instructions, but seems still can be further improved in speed
2, performs realignment on MSA.
3, recalibrates homopolymers, and make consensus sequence for an edge. Most reamining small errors come from homopolymer, more efforts are needed
4, joins edges' sequences into a contig.
Welcome to test POACNS and feedback.
A fuzzy Bruijn graph (FBG) approach to long noisy reads assembly
A challenge in assembling long noisy reads from third generation sequencing (TGS) is reducing its requirement of computing resource, especially for large genomes. To address this issue, I developed a novel sequence alignment algorithm and a new assembly graph for efficiently assembling large genomes using TGS data.
-
Alignment
KBM: Kmer-BIN-Mapping.
KBM groups k-mers from each non-overlapped sliding 256 bp fragments in long reads into bins. Bins of which most k-mers are high frequency, are filtered as highly repetitive ones. Then, KBM searches synteny of matched bin pairs in sequences in a dynamic programming way. A matched bin pair in two sequences is defined as two bins different by original but share a set of k-mers. The result of alignments in KBM have the same features of traditional sequence alignment, excepting the unit of KBM alignments is 256 bp bin instead of single base. -
Assembly
FBG: Fuzzy Bruijn Graph.
FBG is composed of vertices in length of 1024 bp from reads, and edges connecting vertices in their order on read paths. Comparing with DBG, the size of vertices in FBG are much bigger, thus won¡¯t be sensitive to small repeat. To tolerate high sequencing errors, FBG's vertices are found using gapped sequence alignments from KBM or other aligners, comparing with searching identical k-mers in DBG. -
Why choose wtdbg in genome assembly
There are many assemblers for long noisy reads assembly, e.g. FALCON, CANU, miniasm, and SMARTdenovo (progenitor of wtdbg). If you have a genome of 10G bp or bigger in size, wtdbg is your first or even the only option. For small but complicated genomes (< 3 G), wtdbg was often reported to yield better assembly by my friends. Besides, KBM is easy to use when you are setting up a web-server for long reads mapping (see Example 2). -
Limitation
Max read length is 0x0003FFFFU (256 Kb), longer reads will be split.
Max number of reads is 0x03FFFFFFU (64 M). If your data volume exceeds, please filter relative shorter reads.
In KBM, max read length is 0xFFFFFFFFU (4 Gb), max number of reads is 0x0FFFFFFFU (256 M).
Max number of threads is 4096.
Cannot parallelly run in multiple nodes. But you can implement it simplely usingkbm
andwtdbg --load-alignments
Developed and tested in Linux-GCC only.
Only accepts fasta/fastq format for input, '.gz' suffixed files will be piped bygzip -dc
.
git clone https://github.com/ruanjue/wtdbg2.git
cd wtdbg2
make
Supposes you have hg19.fa
as reference sequences, and reads.fq.gz
as query sequences.
kbm -t 64 -d hg19.fa -i reads.fa.gz -o reads.kbmap
- COL1
qry_name
- COL2
qry_strand
- COL3
qry_length
- COL4
qry_beg
- COL5
qry_end
- COL6
ref_name
- COL7
ref_strand
(always equals+
) - COL8
ref_length
- COL9
ref_beg
- COL10
ref_end
- COL11
match_len
(length of matched k-mers) - COL12
align_len
(length of aligned) - COL13
#kcnt
(number of matched k-mers) - COL14
#gap
(number of gapped BINs) - COL15
cigar
(256 x SAM's cigar)
Suitable for online tools, or frequently used references
kbm -t 64 -d hg19.fa -i /dev/null -W hg19.kbmidx
kbm -R hg19.kbmidx start
Now, hg19.kbmidx is cached in memory for further call.
kbm -R hg19.kbmidx -t 64 -i reads.fa -o reads.kbmap
Please note that, hg19.kbmidx can be multilple called by any processes in the same computer.
kbm -R hg19.kbmidx stop
echo "manual"
run_wtdbg_assembly.sh -h
echo "generating shell script"
run_wtdbg_assembly.sh -t 0 -i reads.fa.gz -o dbg -T >run.sh
run.sh
is ready for invoked
wtdbg2 -h
wtdbg2 --help
-t <int> Number of threads, 0: all cores, [0]
-i <string> Long reads sequences file, + *
-I <string> Error-free sequences file, +
-o <string> Prefix of output files, *
-f Force overwrite
-k <int> Kmer fsize, 0 <= k <= 25, [0]
-p <int> Kmer psize, 0 <= p <= 25, [21]
k + p <= 25, seed is <k-mer>+<p-homopolymer-compressed>
-K <float> Filter high frequency kmers, maybe repetitive, [1000]
if K >= 1, take the integer value as cutoff
else, mask the top fraction part high frequency kmers
-E <int> Min kmer frequency, [2]
-F Filter low frequency kmers by a 4G-bytes array (max_occ=3 2-bits). Here, -E must greater than 1
-S <int> Subsampling kmers, 1/(<-S>) kmers are indexed, [4]
-S is very useful in saving memeory and speeding up
please note that subsampling kmers will have less matched length
-X <int> Max number of bin(256bp) in one gap, [4]
-Y <int> Max number of bin(256bp) in one deviation, [4]
-x <int> penalty for BIN gap, [-7]
-y <int> penalty for BIN deviation, [-21]
-l <float> Min length of alignment, [2048]
-m <float> Min matched, [200]
-s <float> Max length variation of two aligned fragments, [0.2]
-q Quiet
-v Verbose, +
--help Show more options
Decrease -p
. Try -p 19
or -p 17
Decrease -S
. Try -S 2
or -S 1
Both will increase computing time.
Increase --edge-min
. Try --edge-min 4
, or higher.
Decrease --edge-min
. Try --edge-min 2 --rescue-low-cov-edges
.
--tidy-reads 5000
. Will filtered shorter sequences. If names in format of \/\d+_\d+$
, will selected the longest subread.
Suppose the prefix is dbg
- dbg.1.dot
DOT file for initialized graph - dbg.1.nodes
nodes and their positions in reads - dbg.1.reads
reads and their nodes - dbg.2.dot
DOT file after transitive reduction - dbg.3.dot
DOT file after merging bubble and remove tips - dbg.alignments
KBMAP file, all vs all alignments - dbg.binkmer
Distribution of number of k-mers in a BIN - dbg.closed_bins
Filtered BINs - dbg.clps
Reads clip information.
COL1 read_name
COL2 read_length
COL3 keep_offset
COL4 keep_length - dbg.ctg.dot
DOT file for contigs - dbg.ctg.lay
Contigs layout file. Will be read bywtdbg-cns
. This file is the main result ofwtdbg
Format:
>ctg(\d+) nodes=(\d+) len=(\d+)
E <OFFSET> <NODE1> <STRAND1> <NODE2> <STRAND2>
S <READ_NAME> <STRAND> <REG_OFFSET> <REG_LENGTH> <REG_SEQ>
S <READ_NAME> <STRAND> <REG_OFFSET> <REG_LENGTH> <REG_SEQ>
S <READ_NAME> <STRAND> <REG_OFFSET> <REG_LENGTH> <REG_SEQ>
...
E ...
...
One contig contains many edges (starting with 'E'), each edge contains many regions inside reads.
Please note that one read often contains many REGs.
- dbg.events
Log file of graph simplification - dbg.frg.dot
DOT file for unitigs - dbg.frg.nodes
unitigs and their nodes - dbg.kmerdep
Distribution of k-mer depth - STDERR stream
wtdbg print runtime information on progrom's STDERR stream.--quiet
to disiable it
wtdbg-cns -t 64 -i dbg.ctg.lay -o dbg.ctg.lay.fa
The output file dbg.ctg.lay.fa
is ready for further polished by PILON
or QUIVER
.
wtpoa-cns -t 64 -i dbg.ctg.lay -o dbg.ctg.lay.fa
wtpoa-cns is slower than wtdbg-cns, but offer more accurate consensus sequences. I will update it in following development.
- Data Source http://datasets.pacb.com/2014/Human54x/fasta.html
- Command
wtdbg2 -t 96 -i pb.fa -fo dbg --tidy-reads 5000 --edge-min 2 --rescue-low-cov-edges
- Contigs
TOT 2978536704, CNT 8752, AVG 340327, MAX 11662848, N50 1925120, L50 453, N90 400128, L90 1727, Min 5120
- Runtime
real 6131.803 sec, user 201836.200 sec, sys 33956.790 sec, maxrss 117281672.0 kB, maxvsize 202422172.0 kB
- Data Source http://www.ebi.ac.uk/ena/data/view/PRJNA246220
- Command
wtdbg2 -t 96 -i wt.fa -fo dbg --tidy-reads 5000 --edge-min 4 --rescue-low-cov-edges
- Contigs
TOT 2964872448, CNT 1909, AVG 1553103, MAX 105310208, N50 23586816, L50 34, N90 3326976, L90 158, Min 5120
- Runtime
real 16806.534 sec, user 681278.770 sec, sys 79371.630 sec, maxrss 264956752.0 kB, maxvsize 443356532.0 kB
- Command
wtdbg2 -t 96 -i ../rawdata/pacbio.fa.gz -p 21 -S 2 --aln-noskip --rescue-low-cov-edges --tidy-reads 5000 -fo axolotl
- Contigs
TOT 27375160576, CNT 115355, AVG 237313, MAX 7812608, N50 606976, L50 12527, N90 144896, L90 47295, Min 5120
- Runtime
real 190237.591 sec, user 10994200.800 sec, sys 488715.030 sec, maxrss 1671005352.0 kB, maxvsize 2365400208.0 kB
- Data Source https://github.com/nanopore-wgs-consortium/NA12878
- Command
wtdbg2 -t 64 -i NA12878-ONT.fa.gz -fo dbg -S 2 --edge-min 2 --rescue-low-cov-edges
-
Contigs
TOT 2827644928, CNT 19473, AVG 145209, MAX 31366400, N50 4540672, L50 162, N90 172800, L90 1111, Min 5120
-
Runtime
real 14992.925 sec, user 649202.270 sec, sys 61638.300 sec, maxrss 256840096.0 kB, maxvsize 356668088.0 kB
To be published.
URL https://github.com/ruanjue/wtdbg2/
Jue Ruan [email protected]
Jue Ruan [email protected]