
Typical workflow

A simple workflow would be to:

  • Compare methods with the testDA function
    • Check the results with summary (and plot)
    • Choose the method with the highest Score, as long as it is above 0
  • Run your data through the chosen test with the DA."test" function, where "test" is the name of the test (see the list of methods here)
    • Check out your final results.

Details

testDA requires two inputs: The first is a data.frame with the count table, with samples as columns and features as rows. The data are by default assumed to be raw counts and compositional. If the data have been normalized externally or represent absolute abundances, see here. It is advised to trim low-abundance features from the dataset before running these analyses; see here for a convenient function. The second is the predictor of interest (e.g. control/treatment). This should be a vector in the same order as the columns in the count table.

If df is the data.frame with the raw counts and vec is the predictor this would be the workflow:

test <- testDA(df, predictor = vec)
summary(test)

From the output of summary(test) we can determine the best method (highest Score), and then run that method to get our final results using the same input. For example, if the t-test was the best:

final <- DA.ttt(df, predictor = vec)

final then contains your final results.

Several methods appear equally good

It is often the case that several methods appear equally good based on the output from the testDA function. There are two auxiliary functions which can help in this situation:

allDA runs several methods to compare their results:

For example, one could check if the best methods detect the same significant features:

testa <- allDA(df, predictor = vec) 

testa$adj contains adjusted p-values for all methods. testa$est contains estimates from all methods.
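
For instance, a minimal sketch of pulling out overlapping significant features, assuming testa$adj is a data.frame with a Feature column plus one column of adjusted p-values per method (the column names used here are assumptions):

# Assumed structure: testa$adj has a "Feature" column and one column per method
sig.ttt <- testa$adj$Feature[which(testa$adj$ttt < 0.05)]
sig.wil <- testa$adj$Feature[which(testa$adj$wil < 0.05)]
intersect(sig.ttt, sig.wil)  # features called significant by both methods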

You can then make Venn diagrams to easily check how many of the same features are detected as significant by different methods (tests denotes which methods you want to compare):

vennDA(testa, tests = c("ttt","wil","ltt"))

See results from a single method (e.g. t.test "ttt"): View(testa$results$ttt)

powerDA runs one method with spike-ins with different effect sizes:

This can help differentiate methods that appear equally good in the output of testDA:

po.ttt <- powerDA(df, predictor = vec, test = "ttt")
plot(po.ttt)
summary(po.ttt)

Understand output of the testDA function

Methods are compared with "False Discovery Rate" (FDR), "Area Under the (Receiver Operating Characteristic) Curve" (AUC), "Empirical power" (Power), and "False Positive Rate" (FPR).

By shuffling the predictor variable and spiking a subset of the features, we know a priori which features should be significant (the spiked features) and which features shouldn't be (the non-spiked features).

FPR indicates the proportion of non-spiked features that were found to be significant (nominal p-value below 0.05). With randomly generated data the p-value distribution should be uniform (flat) and 5% of the raw p-values are expected to be below 0.05. We therefore want an FPR at 0.05 or lower.

FDR indicates the proportion of significant features (after multiple correction) that were not spiked and therefore shouldn't be significant. This should be as low as possible.

AUC is estimated by ranking all features by their respective p-value, and finding the area under the ROC curve. For a good method, p-values from the spiked features should be among the lowest. For a perfect method, with AUC = 1, all the lowest p-values are spiked features. An AUC of 0.5 means that p-values from the spiked features are randomly spread across the spectrum. An AUC below 0.5 means that p-values from spiked features in general are higher than for non-spiked features. Therefore, we want an AUC as high as possible; spiked features should have low p-values.

Power is the proportion of spiked features that are significant after multiple correction of the p-values. It is therefore the proportion of features you would expect to detect in a regular analysis. The higher the power, the better.

To compare methods, a Score is calculated for each method as follows: Score = (AUC - 0.5) * Power - FDR

The higher the Score, the better the method is estimated to be.
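
As an illustration with made-up numbers, the Score is simply:

# Made-up numbers, just to illustrate the formula
auc <- 0.85; power <- 0.60; fdr <- 0.05
(auc - 0.5) * power - fdr  # Score = 0.16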

With summary the 90% confidence limits of the Scores are computed. If these overlap, the methods cannot be differentiated; the choice then relies on the trade-off between specificity (low FDR) and sensitivity (high Power). An asterisk (*) in the summary output means that a method is as good as the best method.

If the best Score is zero or below, it means that no method can reliably identify significant features with the set effect size. You can then run the test again with either a higher effectSize for the spike-ins or with a pruned dataset (see preDA).
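
For example (the effectSize value and the preDA cut-off below are arbitrary, and the preDA argument name is an assumption; check ?preDA):

# Spike with a larger effect size (arbitrary value)
test.strong <- testDA(df, predictor = vec, effectSize = 20)

# Or prune rare features first (argument name assumed; see ?preDA)
df.pruned <- preDA(df, min.samples = 5)
test.pruned <- testDA(df.pruned, predictor = vec)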

Main functions

  • testDA: Shuffles the predictor, spikes in data, runs all methods and compares their performance
  • powerDA: Shuffles the predictor, spikes in data at different effect sizes, and runs one method to evaluate its performance
  • allDA: Runs all methods on the input data to compare their final results (significance calling and effect sizes)
  • DA."test": A collection of functions that each run one method (as defined by "test") on the input data.

Paired or blocked data

Use the paired argument if you have dependencies in your data which you want to control for, e.g. repeated samples from the same patients, or control/treatment samples inside blocks. This functionality will run paired t-tests, paired Wilcoxon tests, random-effects linear models, etc. See the section on implemented methods for which models accept a paired argument.

The paired argument should be a factor with the ID of the patient/subject/block (in the same order as columns in data):

test <- testDA(df, predictor = vec, paired = subjectID)

Covariates

With some methods, additional covariates can be included in the model. These covariates will be included in the model as provided, unshuffled. The covars argument should be a named list with the covariates (each in the same order as the columns in data):

test <- testDA(df, predictor = vec, covars = list(Age = subject.age, Date = exp.date))

Multi-class predictors

There are generally two ways to output results with a categorical predictor with multiple levels: either there is one p-value indicating whether the categories are similar or different (e.g. as in ANOVA), or there is one p-value for each level, where the first level is set as the intercept and the remaining levels are tested against the intercept (e.g. as in linear regression). For some methods you can choose which of these you want, for others you cannot, but it is crucial that this choice is the same in the testDA function as in the final analysis.

By default, for all methods where it is possible, you get one p-value for multi-class predictor variables. For ANOVA and linear models (+ GLMs) you can subsequently run post-hoc tests for all pairwise comparisons (see extra features).

For linear models (lrm, llm, llm2, lma, lmc), GLMs (poi, neb, qpo, zpo, znb), limma models (vli, lim, lli, lli2, lia, lic), edgeR (erq, erq2) and DESeq2 (ds2, ds2x), you can use the out.all argument to toggle how multi-class predictors are treated. If out.all = TRUE (the default for multi-class predictors) you get one p-value for the predictor (no matter how many levels/categories it has). If out.all = FALSE you get the p-value for the level of the predictor specified by coeff tested against the intercept (the default coeff is the second level). Use this if you are interested in testing two or more levels against a baseline, and remember to set the levels of the predictor such that the baseline is the first level (e.g. factor(predictor, levels = c("baseline","treatment1","treatment2"))).

Important: log2FC is by default the fold change between the second and first level of the predictor, no matter what out.all is set to. Use the coeff argument to change this.
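
A minimal sketch of the two modes with a three-level predictor (the level names and the choice of a linear model, lrm, are only for illustration):

# Make sure the baseline is the first factor level (level names are placeholders)
vec <- factor(vec, levels = c("baseline", "treatment1", "treatment2"))

test <- testDA(df, predictor = vec)                               # out.all = TRUE by default: one p-value per feature
final <- DA.lrm(df, predictor = vec, out.all = FALSE, coeff = 2)  # "treatment1" tested against the baseline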

Below is a description of how the different methods treat multi-class predictors:

Linear models (lrm, llm, llm2, lma, lmc) and GLMs (poi, neb, qpo, zpo, znb) use anova/drop1 if out.all=TRUE.

Limma models run moderated F-tests if out.all=TRUE and moderated t-tests otherwise. You can set allResults = TRUE and use topTable on the output to get the desired results.
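
For example, a hedged sketch with the lim method (the exact topTable call may need adjusting for your design):

# Get the full limma fit and extract all features with topTable
library(limma)
fit <- DA.lim(df, predictor = vec, allResults = TRUE)
topTable(fit, number = Inf)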

metagenomeSeq ZIG is set to always output p-values for the second level of the predictor. For your final analysis use the by argument to change this, or set allResults = TRUE and then use MRtable on the output to get the desired results.
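
A hedged sketch, assuming DA.zig with allResults = TRUE returns the fitted metagenomeSeq model that MRtable accepts:

# Full metagenomeSeq ZIG fit, then extract coefficients/p-values with MRtable
library(metagenomeSeq)
zig.fit <- DA.zig(df, predictor = vec, allResults = TRUE)
MRtable(zig.fit, number = nrow(df))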

DESeq2 runs a Likelihood Ratio Test (LRT) if out.all=TRUE and a Wald test otherwise. log2FoldChange is by default the fold change between the second and first level of the predictor for the LRT, and the fold change between the second level and either the first level or the overall mean for the Wald test (depending on the DESeq2 package version); use the coeff argument to change this. For your final analysis you can set allResults = TRUE and use results on the output, e.g. for getting pairwise comparisons or p-values/log2FC for covariates.
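
A hedged sketch, assuming the allResults = TRUE output is a DESeqDataSet that DESeq2::results accepts:

# Full DESeq2 object; list the available coefficients, then extract one
library(DESeq2)
dds <- DA.ds2(df, predictor = vec, allResults = TRUE)
resultsNames(dds)                          # see which contrasts/coefficients are available
results(dds, name = resultsNames(dds)[2])  # pick the contrast of interest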

For SAMSeq, ANOVAs, the Quade test, the Friedman test and the Kruskal-Wallis test you always get one p-value for the predictor.

Absolute abundances or external normalization

E.g. normalized protein abundance, normalized metabolite abundance, or absolute microbiome abundance

test <- testDA(data, predictor = vec, relative = FALSE)

Setting relative=FALSE disables the internal normalizations. It is therefore intended for externally normalized abundances or absolute abundances.

If relative=FALSE the values in data will NOT be normalized by the sum of the samples for "ttt", "ltt", "ltt2", "wil", "per", "lrm", "llm", "llm2", "lim", "lli", "lli2", "pea", "spe", "aov", "lao", "lao2" and "kru", and there will NOT be an offset of log(librarySize) for "neb", "poi", "qpo", "zpo" and "znb". Furthermore, tests that are specifically designed to handle sequence data are turned off: DESeq2, edgeR, metagenomeSeq, ALDEx2, baySeq, CLR/ALR methods. Therefore, the data are analysed as provided, except for the tests that log-transform the data: "ltt", "llm", "lli" and "lao".

Large datasets

If you have more than 5k features (or 2k for paired analysis)

Runtime of the different methods can vary quite a lot, and some methods are simply infeasible for datasets with several thousand features. The runtimeDA function can estimate the runtime of the different methods on your dataset: it runs on small subsets of the data and then extrapolates the runtime from these subsets. It will not be very precise if you have hundreds of thousands of features or more, but it should give a decent estimate. The function prints how many minutes it will take to run each method one time. If you only use 1 core, the actual runtime will be the sum across all the methods. With more cores the runtime will decrease and approach the runtime of the slowest method:

runtimeDA(df, predictor = vec)

If you have a phyloseq object

data can also be a phyloseq object. In this case, the predictor, paired and covars arguments are the names of the variables in sample_data(data):

test <- testDA(data, predictor = "Time", paired = "Patient", covars = c("Age","ExpDate"))