## Usage
A simple workflow would be to:

- Compare methods with the `testDA` function
- Check the results with `summary` (and `plot`)
- Choose the method with the highest Score, as long as it is above 0
- Run your data through the chosen test with the `DA."test"` function, where "test" is the name of the test (see list of methods here)
- Check out your final results.
`testDA` requires two inputs: The first is a data.frame with the count-table, with samples as columns and features as rows. The data are by default assumed to be raw counts and compositional. If the data have been normalized externally or represent absolute abundances, see here. It is advised to trim low-abundant features from the dataset before running these analyses; see here for a convenient function. The second input is the `predictor` of interest (e.g. control/treatment). This should be a vector in the same order as the columns in the count-table.
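The convenient trimming function referred to above is `preDA` (also mentioned under pruning further down). A minimal sketch of pre-filtering; the argument names `min.samples` and `min.reads` and the thresholds are illustrative assumptions about its interface, check `?preDA`:

```r
library(DAtest)

# Hedged sketch: prune low-abundant features before testing.
# Thresholds are illustrative, not recommendations.
df <- preDA(df,
            min.samples = 5,  # keep features present in at least 5 samples
            min.reads   = 10) # and with at least 10 reads in total
```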
If `df` is the data.frame with the raw counts and `vec` is the predictor, the workflow would be:

```r
test <- testDA(df, predictor = vec)
summary(test)
```
From the output of `summary(test)` we can determine the best method (highest Score), and then run that method to get our final results using the same input. For example, if the t-test was the best:

```r
final <- DA.ttt(df, predictor = vec)
```

`final` then contains your final results.
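To pull out the significant features, one could then filter on the adjusted p-values. This is a sketch: the column name `pval.adj` is an assumption about the output data.frame, so inspect `names(final)` first:

```r
# Hedged sketch: subset to features significant after multiple-testing correction
sig <- final[!is.na(final$pval.adj) & final$pval.adj < 0.05, ]
sig
```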
It is often the case that several methods appear equally good based on the output from the `testDA` function. There are two auxiliary functions which can help in this situation.
First, one could check if the best methods detect the same significant features:

```r
testa <- allDA(df, predictor = vec)
```

`testa$adj` contains adjusted p-values for all methods. `testa$est` contains estimates from all methods.

You can then make Venn diagrams to easily check how many of the same features are detected as significant by different methods (`tests` denotes which methods you want to compare):

```r
vennDA(testa, tests = c("ttt", "wil", "ltt"))
```

See results from a single method (e.g. t-test, "ttt"): `View(testa$results$ttt)`
Second, a power analysis with `powerDA` could help to differentiate methods that appear equally good from the output of `testDA`:

```r
po.ttt <- powerDA(df, predictor = vec, test = "ttt")
plot(po.ttt)
summary(po.ttt)
```
Methods are compared with "False Discovery Rate" (FDR), "Area Under the (Receiver Operator) Curve" (AUC), "Empirical power" (Power), and "False Positive Rate" (FPR).

By shuffling the predictor variable and spiking a subset of the features, we know a priori which features should be significant (the spiked features) and which should not (the non-spiked features).
FPR indicates the proportion of non-spiked features that were found to be significant (nominal p-value below 0.05). With randomly generated data the p-value distribution should be uniform (flat) and 5% of the raw p-values are expected to be below 0.05. We therefore want an FPR at 0.05 or lower.
FDR indicates the proportion of significant features (after multiple correction) that were not spiked and therefore shouldn't be significant. This should be as low as possible.
AUC is estimated by ranking all features by their respective p-value, and finding the area under the ROC curve. For a good method, p-values from the spiked features should be among the lowest. For a perfect method, with AUC = 1, all the lowest p-values are spiked features. An AUC of 0.5 means that p-values from the spiked features are randomly spread across the spectrum. An AUC below 0.5 means that p-values from spiked features in general are higher than for non-spiked features. Therefore, we want an AUC as high as possible; spiked features should have low p-values.
Power is the proportion of spiked features that are significant after multiple correction of the p-values. It is therefore the proportion of features you would expect to detect in a regular analysis. The higher the power, the better.
To compare methods, a Score is calculated for each method as follows:

    Score = (Area Under the ROC Curve - 0.5) * Power - False Discovery Rate

The higher the Score, the better the method is estimated to be.
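As a worked example with made-up numbers: a method with AUC = 0.9, Power = 0.8 and FDR = 0.05 would get a Score of 0.27:

```r
# Worked example of the Score formula (all values are hypothetical)
auc   <- 0.9
power <- 0.8
fdr   <- 0.05
(auc - 0.5) * power - fdr  # = 0.4 * 0.8 - 0.05 = 0.27
```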
With `summary` the 90% confidence limits of the Scores are computed. If these overlap, it means that the methods cannot be differentiated. The choice then relies on the trade-off between specificity (low FDR) and sensitivity (high Power). An asterisk (*) in the `summary` output means that a method is as good as the best method.

If the best Score is zero or below, it means that no method can reliably identify significant features with the set effect size. You can then run the test again, either with a higher `effectSize` for the spike-ins or with a pruned dataset (see `preDA`).
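A sketch of both options; the `effectSize` value of 10 and the `preDA` threshold are arbitrary choices for illustration:

```r
# Option 1: spike in a larger effect size (10 is arbitrary)
test <- testDA(df, predictor = vec, effectSize = 10)

# Option 2: prune rare features first, then re-run (threshold is illustrative)
df.pruned <- preDA(df, min.samples = 5)
test <- testDA(df.pruned, predictor = vec)
```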
In short, the package's main functions are:

- `testDA`: Shuffles the predictor, spikes in data, runs all methods and compares their performance
- `powerDA`: Shuffles the predictor, spikes in data at different effect sizes, runs one method to evaluate its performance
- `allDA`: Runs all methods on the input data to compare their final results (significance calling and effect sizes)
- `DA."test"`: Collection of functions that run one method (as defined by "test") on the input data
If you have dependencies in your data which you want to control for, e.g. repeated samples from the same patients, or control/treatment samples inside blocks, this functionality will run paired t-tests, paired Wilcoxon tests, random-effects linear models, etc. See the section on implemented methods for which models accept a `paired` argument.

The `paired` argument should be a factor with the ID of the patient/subject/block (in the same order as the columns in `data`):

```r
test <- testDA(df, predictor = vec, paired = subjectID)
```
With some methods, additional covariates can be included in the model. These covariates will be included in the model as provided, unshuffled. The `covars` argument should be a named list with the covariables (in the same order as the columns in `data`):

```r
test <- testDA(df, predictor = vec, covars = list(Age = subject.age, Date = exp.date))
```
There are generally two ways to output results with a categorical predictor with multiple levels: either there is one p-value indicating whether the categories are similar or different (e.g. as in ANOVA), or there is one p-value for each level, where the first level is set as intercept and the remaining levels are tested against the intercept (e.g. as in linear regression). For some methods you can choose which option fits you, with others you cannot, but it is crucial that this choice is the same in the `testDA` function as in the final analysis.

By default, for all methods where it is possible, you get one p-value for multi-class predictor variables. For ANOVA and linear models (+ GLMs) you can subsequently run post-hoc tests for all pairwise comparisons (see extra features).
For linear models (lrm, llm, llm2, lma, lmc), GLMs (poi, neb, qpo, zpo, znb), limma models (vli, lim, lli, lli2, lia, lic), edgeR (erq, erq2) and DESeq (ds2, ds2x), you can use the `out.all` argument to toggle how multi-class predictors are treated. If `out.all = TRUE` (default for multi-class predictors) you get one p-value for the predictor (no matter how many levels/categories it has). If `out.all = FALSE` you get the p-value for the level of the predictor specified by `coeff` tested against the intercept (the default `coeff` is the second level). Use this if you are interested in testing two or more levels against a baseline, and remember to set the levels of the `predictor` such that the baseline is the first level (e.g. `factor(predictor, levels = c("baseline","treatment1","treatment2"))`).

Important: log2FC is by default the fold change between the second and first level of the `predictor`, no matter what `out.all` is set to. Use the `coeff` argument to change this.
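For example, to test treatment levels against a baseline with a linear model (a sketch; `lrm` is chosen arbitrarily from the list above, and the level names are hypothetical):

```r
# Make sure the baseline is the first factor level
vec <- factor(vec, levels = c("baseline", "treatment1", "treatment2"))

# p-value and log2FC for "treatment1" vs. the baseline (coefficient 2)
final <- DA.lrm(df, predictor = vec, out.all = FALSE, coeff = 2)
```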
Linear models (lrm, llm, llm2, lma, lmc) and GLMs (poi, neb, qpo, zpo, znb) use `anova`/`drop1` if `out.all = TRUE`.

Limma models run moderated F-tests if `out.all = TRUE` and moderated t-tests otherwise. You can set `allResults = TRUE` and use `topTable` on the output to get the desired results.
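A sketch of that, assuming `allResults = TRUE` makes `DA.lim` return the fitted limma model object that `limma::topTable` accepts:

```r
library(limma)

# Fit with a limma model and keep the full model object
fit <- DA.lim(df, predictor = vec, allResults = TRUE)

# Moderated t-test results for the second coefficient (second predictor level)
topTable(fit, coef = 2, number = Inf)
```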
MetagenomeSeq ZIG is set to always output p-values from the second level of the `predictor`. For your final analysis, use the `by` argument to change this, or set `allResults = TRUE` and then use `MRtable` on the output to get the desired results.
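Similarly for ZIG, assuming `allResults = TRUE` makes `DA.zig` return the fitted object that `metagenomeSeq::MRtable` accepts:

```r
library(metagenomeSeq)

# Fit metagenomeSeq's zero-inflated Gaussian model, keep the full fit
fit <- DA.zig(df, predictor = vec, allResults = TRUE)

# Results for the third model coefficient (the choice of 3 is illustrative)
MRtable(fit, by = 3)
```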
DESeq2 runs a Likelihood Ratio Test (LRT) if `out.all = TRUE` and a Wald test otherwise. log2FoldChange is by default the fold change between the second and first level of the predictor for LRT, and the fold change between the second level and either the first level or the overall mean for the Wald test (depending on the DESeq2 package version); use the `coeff` argument to change this. For your final analysis you can set `allResults = TRUE` and use `results` on the output, e.g. for getting pairwise comparisons or p-values/log2FC for covariates.
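A sketch for DESeq2, assuming `allResults = TRUE` makes `DA.ds2` return the fitted object that `DESeq2::results` accepts; the design variable name "predictor" and the level names in the contrast are assumptions:

```r
library(DESeq2)

# Fit DESeq2 and keep the full fitted object
fit <- DA.ds2(df, predictor = vec, allResults = TRUE)

# Hypothetical pairwise comparison between two predictor levels
results(fit, contrast = c("predictor", "treatment2", "treatment1"))
```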
For SAMSeq, ANOVAs, the Quade test, the Friedman test and the Kruskal-Wallis test you always get one p-value for the `predictor`.
For externally normalized or absolute abundances, e.g. normalized protein abundance, normalized metabolite abundance, or absolute microbiome abundance:

```r
test <- testDA(data, predictor = vec, relative = FALSE)
```

Setting `relative = FALSE` will disable the internal normalizations; it is therefore meant for externally normalized abundances or absolute abundances.
If `relative = FALSE`, the values in `data` will NOT be normalized by the sum of the samples for "ttt", "ltt", "ltt2", "wil", "per", "lrm", "llm", "llm2", "lim", "lli", "lli2", "pea", "spe", "aov", "lao", "lao2" and "kru", and there will NOT be an offset of log(librarySize) for "neb", "poi", "qpo", "zpo" and "znb". Furthermore, tests that are specifically designed to handle sequence data are turned off: DESeq2, EdgeR, MetagenomeSeq, ALDEx2, baySeq, and the CLR/ALR methods. The data are therefore analysed as provided, except for the tests that log-transform the data: "ltt", "llm", "lli" and "lao".
## If you have more than 5k features (or 2k for paired analysis)
Runtime of the different methods can vary quite a lot, and some methods are simply unfeasible for datasets with several thousands of features. The `runtimeDA` function can estimate the runtime of the different methods on your dataset: it runs on small subsets of the data and then extrapolates the runtime from these subsets. It will not be super precise if you have hundreds of thousands of features or more, but it should give a decent estimate. The function prints how many minutes it will take to run each method one time. If you only use 1 core, the actual runtime will be the sum of all the methods. With more cores the runtime will decrease and approach the runtime of the slowest method:

```r
runtimeDA(df, predictor = vec)
```
`data` can also be a phyloseq object. In this case, the `predictor`, `paired` and `covars` arguments are the names of the variables in `sample_data(data)`:

```r
test <- testDA(data, predictor = "Time", paired = "Patient", covars = c("Age","ExpDate"))
```