-
Notifications
You must be signed in to change notification settings - Fork 41
Home
Here you can find a basic for using wgd
to investigate potential genome
duplications in genomic data. This tutorial is still in the making, but an
incomplete guide is better than no guide.
First, installation. I have used wgd with Python3.5+ (3.5, 3.6, 3.7 and 3.8). I generally recommend to use a virtual environment for wgd, that way you won't have dependency issues or clashes. The following should suffice to install the wgd package in a Ubuntu-ish Linux environment.
sudo apt-get install python3-pip
pip3 install virtualenv --user
virtualenv venv -p python3
source venv/bin/activate
git clone https://github.com/arzwa/wgd.git
cd wgd
pip3 install .
If all went well, you should try running wgd
(venv) $ wgd
Usage: wgd [OPTIONS] COMMAND [ARGS]...
Welcome to the wgd command line interface!
_______
\ ___ `'.
_ _ .--./) ' |--.\ \
/\ \\ ///.''\\ | | \ '
`\\ //\\ //| | | | | | | '
\`// \'/ \`-' / | | | |
\| |/ /("'` | | ' .'
' \ '---. | |___.' /'
/'""'.\ /_______.'/
|| ||\_______|/
\'. __//
`'---'
wgd Copyright (C) 2018 Arthur Zwaenepoel
This program comes with ABSOLUTELY NO WARRANTY;
This is free software, and you are welcome to redistribute it
under certain conditions;
Contact: [email protected]
Options:
-v, --verbosity [info|debug] Verbosity level, default = info.
-l, --logfile TEXT File to write logs to (optional)
--version Print version number
-h, --help Show this message and exit.
Commands:
dmd All-vs.-all diamond blastp + MCL clustering.
kde Fit a KDE to a Ks distribution.
ksd Ks distribution construction.
mcl All-vs.-all blastp + MCL clustering.
mix Mixture modeling of Ks distributions.
pre Check and optionally rename CDS files Example usage (renaming) wgd...
syn Co-linearity analyses.
viz Plot histograms/densities (interactively).
wf1 Standard workflow whole paranome Ks.
wf2 Standard workflow one-vs-one ortholog Ks.
Note that this will not install external programs used in wgd
like mcl
,
codeml
, blastp
, diamond
, fasttree
, i-adhore
, ... (also note that you
typically don't need all these programs for doing analyses with wgd
;
wgd
will throw an error when it cannot find some required software
tool before running analyses, so you don't really have to worry before
getting such a 'can not run ' error message).
The basic input data is a bunch of CDS sequences. Here I'll use the CDS data available for the bladderwort (Utricularia gibba). To download the data from PLAZA:
wget ftp://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_04/Fasta/cds.selected_transcript.ugi.fasta.gz
gunzip *gz
mv cds.*.ugi.fasta ugi.fasta
There is a little tool in wgd to quickly check the input data. Since
wgd pre
to partition a fasta file in
everything that is nicely translatable from start to stop codon and all the
rest.
wgd pre ugi.fasta
This will spit a lot to the screen:
$ wgd pre ugi.fasta
2020-03-07 19:42:14: INFO (0) Checking ugi.fasta
2020-03-07 19:42:14: ERROR Translation error (First codon 'CCG' is not a start codon) in sequence UGI.ctg09795.24681.1
2020-03-07 19:42:14: ERROR Translation error (First codon 'AAA' is not a start codon) in sequence UGI.ctg09872.24683.1
2020-03-07 19:42:14: ERROR Translation error (Sequence length 638 is not a multiple of three) in sequence UGI.ctg09892.24686.1
2020-03-07 19:42:14: ERROR Translation error (First codon 'TAG' is not a start codon) in sequence UGI.ctg09909.24689.1
[...]
2020-03-07 19:42:21: INFO 24145/25930 (93.12%) sequences are perfect CDS (in ugi.fasta.pre.good)
2020-03-07 19:42:21: INFO 1785/25930 (6.88%) sequences are not perfect CDS (in ugi.fasta.pre.bad)
About 93% of the input data is textbook CDS sequence. The rest has some issue,
and is written to the .bad
file. Of course, not all CDS sequences start with a
start codon, and if you trust your data, you should definitely not throw those
away! (you should, however, not trust any sequence with a length that is no
multiple of three!). Here I'll continue working with the 'proper' CDS data, now
in the file ugi.fasta.pre.good
.
Note: wgd pre
can also be used to rename your sequences. If you have the kind
of fasta files with very long and awkward headers, I would recommend this, as
it will make files obtained later clearer. See the --rename
and --prefix
options.
The first step is to obtain the paranome, or the collection of all paralogous
genes. This is esentially a big graph, where the nodes are all genes in the
genome, and edges represent homology (paralogy) relationships. Of course, in
the absence of different genomes to compare to, homology is a rather
ill-defined concept, as we may as well assume all genes trace back to a common
ancestor and are thus homologous. The goal of paranome inference is of course
not to lump everything together, but simply to cluster the genome in reasonably
fine-grained paralogous gene families. Ideally, we'd like to obtain paralogous
gene families with a most recent common ancestor (MRCA) that is within the time
frame where
In general, common gene family clustering methods using Markov graph clustering
(MCL) work well for this task. In wgd
an approach based on all-vs.-all
protein similarity searches and MCL clustering is implemented. Both blastp
,
and the much faster diamond
are supported. The following will run diamond
+ mcl
to obtain the paranome:
wgd dmd ugi.fasta.pre.good
If one wishes to use the full data (not only the textbook-CDS sequences), the
--nostrictcds
and --ignorestop
options can be used. Other parameters of
interest are the eee-value threshold used to construct the sequence-similarity
graph and the inflation factor for MCL, governing the coarseness of the
inferred clusters. To see all options for wgd dmd
run wgd dmd --help
.
The paranome consists of how many families?
$ wc -l wgd_dmd/ugi.fasta.pre.good.mcl
3118 wgd_dmd/ugi.fasta.pre.good.mcl
I'll rename the file for convenience to ugi.mcl
.
To compute the
Since this takes quite a bit of time (codeml performs a rather expensive numerical optimization step), one would usually run this on a computing cluster on a couple of CPUs. This data set is still feasible to process on a laptop with 4 cores if we filter out some big families.
If you have the required tools installed, you should be able to run the following:
wgd ksd ./ugi.mcl ./ugi.fasta.pre.good -mp 1000
You might want to run this on some computing cluster, since this will take
a while and could use some resources. Note that the -mp 1000
option filters
out all families with more than a thousand pairs of sequences, further decrease
this number if you want to quickly test the whole thing, since with mp 1000 it
will still take about two hours). This will run the analysis in parallel on
four CPU cores. It starts with the biggest families (which take more time). We
can monitor the analysis in the terminal, with output that looks like this.
$ wgd ksd ./wgd_dmd/ugi.mcl ./ugi.fasta.pre.good
[...]
2020-03-07 19:01:43: INFO Performing analysis on gene family GF_003081
2020-03-07 19:01:44: INFO Performing analysis on gene family GF_003082
2020-03-07 19:01:44: INFO Performing analysis on gene family GF_003086
[...]
2020-03-07 19:01:45: INFO Analysis done
2020-03-07 19:01:45: INFO Making results data frame
2020-03-07 19:02:17: INFO Removing tmp directory
2020-03-07 19:02:18: INFO Computing weights, outlier cut-off at Ks > 5
2020-03-07 19:02:19: INFO Generating plots
2020-03-07 19:02:19: INFO Will plot **node-weighted** histograms
2020-03-07 19:02:20: INFO Done
The generated output is in the wgd_ksd
directory
$ tree wgd_ksd
wgd_ksd
├── ugi.fasta.pre.good.ks.svg
└── ugi.fasta.pre.good.ks.tsv
The main output is the .tsv
file, which contains all the results computed in
the wgd ksd
pipeline:
$ head wgd_ksd/ugi.fasta.pre.good.ks.tsv
AlignmentCoverage AlignmentIdentity AlignmentLength AlignmentLengthStripped Distance Family Ka Ks Node Omega Paralog1 Paralog2 WeightOutliersIncluded WeightOutliersExcluded
UGI.Scf00428.21433.1__UGI.Scf00767.18891.1 0.20142 0.71127 2115.0 426.0 0.51238 GF_002963 0.3227 0.6327 2.0 0.51UGI.Scf00428.21433.1 UGI.Scf00767.18891.1 1.0 1.0
UGI.Scf00223.11926.1__UGI.Scf00720.18573.1 0.31967 0.74359 732.0 234.0 0.48336 GF_002829 0.2658 0.5949 2.0 0.4467 UGI.Scf00223.11926.1 UGI.Scf00720.18573.1 1.0 1.0
UGI.Scf00013.1922.1__UGI.Scf00108.8260.1 0.84677 0.87302 372.0 315.0 0.12595 GF_002055 0.0797 0.3366 2.0 0.2369 UGI.Scf00013.1922.1 UGI.Scf00108.8260.1 1.0 1.0
UGI.Scf00015.2084.1__UGI.Scf00046.4873.1 0.30266 0.78667 1239.0 375.0 0.32056 GF_000243 0.2379 1.5895 10.0 0.1497 UGI.Scf00015.2084.1 UGI.Scf00046.4873.1 1.0 1.0
UGI.Scf00015.2084.1__UGI.Scf00051.5189.1 0.24455 0.56766 1239.0 303.0 0.99481 GF_000243 0.6446 118.4869 11.00.0054 UGI.Scf00051.5189.1 UGI.Scf00015.2084.1 0.5 0.0
UGI.Scf00046.4873.1__UGI.Scf00051.5189.1 0.23002 0.5193 1239.0 285.0 1.07317 GF_000243 4.2678 71.3586 11.0 0.0598 UGI.Scf00051.5189.1 UGI.Scf00046.4873.1 0.5 0.0
UGI.Scf00051.5189.1__UGI.Scf00090.7423.1 0.23729 0.53061 1239.0 294.0 1.09756 GF_000243 13.2138 38.8342 12.0 0.3403 UGI.Scf00090.7423.1 UGI.Scf00051.5189.1 0.33333 0.0
UGI.Scf00015.2084.1__UGI.Scf00090.7423.1 0.3293 0.625 1239.0 408.0 0.61336 GF_000243 16.5071 17.2326 12.0 0.9579 UGI.Scf00090.7423.1 UGI.Scf00015.2084.1 0.33333 0.0
UGI.Scf00046.4873.1__UGI.Scf00090.7423.1 0.33898 0.62381 1239.0 420.0 0.69171 GF_000243 3.9448 34.5095 12.0 0.1143 UGI.Scf00090.7423.1 UGI.Scf00046.4873.1 0.33333 0.0
The default plot outputted by wgd (in wgd_ksd/*.svg
) is rather ugly, but gives
a clear overview of the parameter estimates for all gene duplication events,
being Ks, Ka and ω (i.e. the synonymous distance, nonsynonyous distance and
nonsynonymous to synonymous substitution rate ratio respectively).
From this distribution it's quite clear there has been some stuff going on with this genome.