-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME
207 lines (160 loc) · 11.9 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
TERMS
Event - the change of character states on a branch of phylogenetic tree,
e.g. mutation in a genome site or the change of phenotypic states
Site - the integer identifier for the set of events. It corresponds to a
genome or a protein site where mutations occurred or to the set of
specific phenotypic states, e.g. possible phenotypic reactions on a drug
treatment.
DESCRIPTION
This is a stub module that contains command-line utilities for searching
for associations between events on a phylogenetic tree. Events may be
mutations in genome sites for which pairwise epistatic statistics would
be calculated or mutations and changes of phenotypic states for which
associations of mutations and phenotype changes are calculated.
The method is based on the following publications:
1. Kryazhimsky S., et. al. Prevalence of Epistasis in the Evolution of Influenza A Surface Proteins. 2011 PLoS Genetics 7(2):e1001301
2. Neverov A.D., et al. Coordinated Evolution of Influenza A Surface Proteins. 2015 PLoS Genetics 11(8):e1005404
3. Neverov A.D., et al. Episodic evolution of coadapted sets of amino acid sites in mitochondrial proteins. 2021 PLoS Genetics 17(1):e1008711.
4. Neverov A.D., et al. Coordinated evolution at amino acid sites of SARS-CoV-2 spike. Elife. 2023 Feb 8;12:e82516. doi: 10.7554/eLife.82516.
Pipeline scripts:
estimate_tau.pl - estimates average waiting time for an event along a phylogenetic lineage. This parameter is used for parametrization of epistatic statistics.
run_epistat.pl - starts the search for associations. The script requires the phylogenetic tree in the branch list format (XPARR) where for each branch the sets of events in the background and in the foreground are provided.
All events within common sites are aggregated, e.g. mutations in the common genome site or phenotype changes for the common phenotype ID.
The main result of processing the data is a list of all site pairs with corresponding p-values calculated according to the null-model which assumes that events in each site are inpependent of other sites.
estimate_fdr.pl - estimates FDR for p-values for pairs of sites.
mk_coevolmtx.pl - calculates pairwise pseudocorrelation statistics, e.i. normalized statistics that are analogs of covariances of random descrete variables on the phylogeneetic tree.
minvert/cor2pcor.R - calculates paiwise partial correlations for the matrix of pseudocorrelation statistics using the cor2pcor package for R
minvert/psicov.R - inverts pseudocorrelation matrix using Lasso.
Scrips that typically are not used directly but required for the
analysis:
epistat.pl - the main script in the pipeline, it calculates paiwise association statictics for pairs of sites.
This script is typically called from run_epistat.pl, however, it could be called separately for drawing events on the tree.
shuffle_incidence_matrix.R - randomly shuffles events on the branches of the tree preserving the number of events in each site and on each branch.
sample2xparr.pl - converts output of shuffle_incidence_matrix.R into trees in XPARR format.
fake_sample.pl - calculates p-value statistics for the tree with randomly shuffled events on branches. Used for the estimating of the false discovery rates (FDR).
Scripts command line templates and usage parameters are provided in the
header of each script file as comments. The documentation in the POD
format for scripts is in the preparation.
Input data format: The input data is a phylogenetic tree with a list of
substitutions on branches in XPARR format:
Header
RootID
BranchLine+
Header="child\tparent\tlength"
RootID="\w+" - any string containing letters or digits
BranchLine = "NodeID\tParentNodeID\tBranchLength\tSynSiteList\tNonSynMutList\tNonSynMutList"
site=\d+ - site in the reference sequence, positive number
SynSiteList = "site(;site)*" - a list of sites, in which synonymous substitutions has occured on this edge
NonSynMutList = "mutation(;mutation)*" - alist of amino acid substitutions on this edge
mutation = "AsiteB"
A and B is one letter amino acid code
You can find an example of XPARR tree in epistat_examples, e.g.
epistat_examples/intragene/alleles/sars2_s/S.xparr
XPARR formatted trees can be generated from the output of pylognetic
reconstruction program which can reconstruct both the tree and the
ancestor states. The mk_xparr.pl script can be used to generate XPARR
from tree and alignment:
mk_xparr.pl -t <tree_NWK> [options] <alignment_FASTA>
Here <tree_NWK> is a newick formated tree, where the bootstrap-slot is
filled with internal nodes' identifiers, and <alignment_FASTA> is a
FASTA alignment of both terminal and ancestor sequences.
AUTHOR
Alexey Dmitrievich Neverov neva_2000 (at) mail.ru
LICENCE
GPL v. 3.0
INSTALLATION
Install perl 5 (you can use perlbrew if you like)
Install R
Install python 2 (typically system python for linux, you can also use
pyenv with miniconda)
Install GNU Parallel
Install R packages:
BiRewire package for R:
To install this package, start R console (version "4.0") and enter:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("BiRewire")
For older versions of R, please refer to the appropriate Bioconductor release.
Enter at the R console: install.packages("corpcor").
Enter at the R console: install.packages("glasso").
Enter at the R console: install.packages("getopt").
If R packages have been installed in the local directory add the path to the directory to the R_LIBS environment variable or add the following to your .bash_profile:
if [ -n $R_LIBS ]; then
export R_LIBS=/usr/me/local/R/library:$R_LIBS
else
export R_LIBS=/usr/me/local/R/library
fi
Install Epistat:
Clone the repository or download and unzip the release archive.
NOTE: The Build.PL uses Pod::Select, part of the Pod-Parser distribution which will be removed from the Perl core since version 5.32.
In the case of occurring of the following compilation error: "Can't locate Pod/Select.pm in @INC" try
cpan install Pod::Select
type:
cd /path/to/unpacked/epistat/distributive
perl Build.PL
./Build installdeps
./Build build
./Build test
./Build install --install_base=/where/to/installed/epistat
Create two system variables:
EPISTAT_HOME pointing to '/where/to/installed/epistat/bin' directory and
EPISTAT_LIB pointing to the '/where/to/installed/epistat/lib/perl5'.
On linux you can modify your .bash_profile, for example:
export EPISTAT_HOME=/where/to/installed/epistat/bin
export EPISTAT_LIB=/where/to/installed/epistat/lib/perl5
QUICK_START
These instructions are for linux.
chmod -R 775 /where/to/installed/epistat/epistat_examples
cd /where/to/installed/epistat/epistat_examples
#example 1: The search for intragenic concordant and discordant evolution
cd intragene/alleles/sars2_s
Run these commands from the shell:
#This script estimates 'tau' parameter and prints it to S.tau. Copy the value of "TAU=" from S.tau to the field "TAU=" in S.epistat.prm.
$EPISTAT_HOME/estimate_tau.pl -b 0.95 S.xparr > S.tau
#It launches the analysis
$EPISTAT_HOME/run_epistat.pl -x S.xparr -m 2 -p 50000 S.epistat.prm run_epistat.prm
#It estimates FDRs for upper p-values
$EPISTAT_HOME/estimate_fdr.pl S.upper.pvalue.unord_pairs.sites.fdr.prm > S.upper.pvalue.unord_pairs.sites.fdr
#It estimates FDRs for lower p-values
$EPISTAT_HOME/estimate_fdr.pl S.lower.pvalue.unord_pairs.sites.fdr.prm > S.lower.pvalue.unord_pairs.sites.fdr
#It calculates the pseudocorrelations for site pairs. Negative values are ignored (NonNegativeElementFilter="1" in the S.mk_coevolmtx.prm).
$EPISTAT_HOME/mk_coevolmtx.pl –m Z-score S.mk_coevolmtx.prm
#It calculates the pseudocorrelations for site pairs. Does not ignore the negative values (NonNegativeElementFilter="0" in the S.all.mk_coevolmtx.prm).
$EPISTAT_HOME/mk_coevolmtx.pl –m Z-score S.all.mk_coevolmtx.prm
#It calculates the partial correlations for the matrix of positive pseudocorrelation statistics.
$EPISTAT_HOME/minvert/cor2pcor.R -f S.block.mtx -l 0.9 -n 0 -s .cor2pcor.n0.l90.R.out
#It builds the table of concordantly evolving pairs of sites having upper p-values<4.4e-4
#To start, you need to modify the field "Path=" in the file S.up00044.FDR10.mk_summary.cfg setting up to the actual path to the example folder, e.g.
# Path="/where/to/installed/epistat/epistat_examples/intragene/alleles/sars2_s"
$EPISTAT_HOME/mk_summary.pl -u -a S.up00044.FDR10.mk_summary.cfg S.cor2pcor.n0.l90.R.out>S.cor2pcor.n0.l90.up00044.FDR10.tab
#It builds the table of discordantly evolving pairs of sites having lower p-values<5.2e-4
#To start, you need to modify the field "Path=" in the file S.lp0052.FDR10.mk_summary.cfg setting up to the actual path to the example folder, e.g.
# Path="/where/to/installed/epistat/epistat_examples/intragene/alleles/sars2_s"
$EPISTAT_HOME/mk_summary.pl -u -a S.lp0052.FDR10.mk_summary.cfg -s 1 S.all.mtx>S.zscores.lp0052.FDR10.tab
#
#example 2: The search for associations of mutations with the changes of susceptible/resistant phenotypic states on the tree branches
cd /where/to/installed/epistat/epistat_examples/intergene/mtb_gwas/Amikacin
#Copy the value of "TAU=" from Amikacin.tau to the field "TAU=" in Amikacin.epistat.prm
$EPISTAT_HOME/estimate_tau.pl -b 0.95 Amikacin.xparr > Amikacin.tau
$EPISTAT_HOME/run_epistat.pl -x Amikacin.xparr -m 2 -p 10000 Amikacin.epistat.prm run_epistat.prm
#Estimate FDRs for upper and lower p-values at once
parallel $EPISTAT_HOME/estimate_fdr.pl "Amikacin.{}.pvalue.fdr.prm>Amikacin.{}.pvalue.fdr" ::: upper lower
#
#example 3: The search for concordant evolution of sites with control for their associations with phenotypes for 9 anti-tuberculosis drugs
# For multiple phenotype analysis the additional file with p-values of associations of sites and phenotypes is required - *.site2pheno
# Note: this analysis in longer than the analysis without control of phenotypes by the factor of <the_number_of_phenotypes>
cd /where/to/installed/epistat/epistat_examples/intragene/alleles/mtb/phen
$EPISTAT_HOME/estimate_tau.pl -b 0.95 9drugs.xparr > 9drugs.tau
$EPISTAT_HOME/run_epistat.pl -x 9drugs.xparr -m 2 -p 10000 epistat.phen.prm run_epistat.prm
#Estimate FDRs for upper and lower p-values at once
cat estimate_fdr.txt|parallel $EPISTAT_HOME/estimate_fdr.pl "{}.prm>{}"
$EPISTAT_HOME/mk_coevolmtx.pl –m Z-score 9drugs.mk_coevolmtx.prm
$EPISTAT_HOME/mk_coevolmtx.pl –m Z-score 9drugs.all.mk_coevolmtx.prm
$EPISTAT_HOME/minvert/cor2pcor.R -f 9drugs.block.mtx -l 0.9 -n 0 -s .cor2pcor.n0.l90.R.out
#Before the start modify files
# 9drugs.pairs2.upper.FDR5.mk_summary.cfg and
# 9drugs.pairs2.lower.FDR5.mk_summary.cfg
# setting up the thresholds corresponding to the 5% FDR for the upper pvalue (in the first file) an the lower p-value (in the second file)
$EPISTAT_HOME/mk_summary.pl -u -a 9drugs.pairs2.upper.FDR5.mk_summary.cfg 9drugs.cor2pcor.n0.l90.R.out>9drugs.cor2pcor.n0.l90.FDR5.tab
$EPISTAT_HOME/mk_summary.pl -u -a 9drugs.pairs2.lower.FDR5.mk_summary.cfg -s 1 9drugs.all.mtx>9drugs.z-score.FDR5.tab
1;