diff --git a/.gitignore b/.gitignore index 88133c6..4923aa2 100644 --- a/.gitignore +++ b/.gitignore @@ -10,3 +10,6 @@ tmp/ *.gcno revdep/ /README.html +inst/doc +/doc/ +/Meta/ diff --git a/DESCRIPTION b/DESCRIPTION index 18763e4..a749b43 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -15,7 +15,8 @@ LinkingTo: Rcpp Suggests: MASS, testthat (>= 0.8.1), - knitr + knitr, + rmarkdown Author: R Hackathon et al. (alphabetically: Ben Bolker, Marguerite Butler, Peter Cowan, Damien de Vienne, Dirk Eddelbuettel, Mark Holder, Thibaut Jombart, Steve Kembel, Francois Michonneau, David Orme, Brian O'Meara, diff --git a/vignettes/phylobase.Rmd b/vignettes/phylobase.Rmd new file mode 100644 index 0000000..643b6ca --- /dev/null +++ b/vignettes/phylobase.Rmd @@ -0,0 +1,633 @@ +--- +title: "The phylo4 classes and methods" +author: + - Ben Bolker + - Peter Cowan + - François Michonneau +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{The phylo4 classes and methods} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + +```{r setup} +library(phylobase) +``` + +## Introduction + +This document describes the new 'phylo4' S4 classes and methods, which +are intended to provide a unifying standard for the representation of +phylogenetic trees and comparative data in R. The `phylobase` package +was developed to help both end users and package developers by providing +a common suite of tools likely to be shared by all packages designed for +phylogenetic analysis, facilities for data and tree manipulation, and +standardization of formats. + +This standardization will benefit *end-users* by making it easier to +move data and compare analyses across packages, and to keep comparative +data synchronized with phylogenetic trees. Users will also benefit from +a repository of functions for tree manipulation, for example tools for +including or excluding subtrees (and associated phenotypic data) or +improved tree and data plotting facilities. `phylobase` will benefit +*developers* by freeing them to put their programming effort into +developing new methods rather than into re-coding base tools. We (the +`phylobase` developers) hope `phylobase` will also facilitate code +validation by providing a repository for benchmark tests, and more +generally that it will help catalyze community development of +comparative methods in R. + +A more abstract motivation for developing `phylobase` was to improve +data checking and abstraction of the tree data formats. `phylobase` can +check that data and trees are associated in the proper fashion, and +protects users and developers from accidently reordering one, but not +the other. It also seeks to abstract the data format so that commonly +used information (for example, branch length information or the ancestor +of a particular node) can be accessed without knowledge of the +underlying data structure (i.e., whether the tree is stored as a matrix, +or a list, or a parenthesis-based format). This is achieved through +generic `phylobase` functions which which retrieve the relevant +information from the data structures. The benefits of such abstraction +are multiple: (1) *easier access to the relevant information* via a +simple function call (this frees both users and developers from learning +details of complex data structures), (2) *freedom to optimize data +structures in the future without breaking code.* Having the generic +functions in place to "translate" between the data structures and the +rest of the program code allows program and data structure development +to proceed somewhat independently. The alternative is code written for +specific data structures, in which modifications to the data structure +requires rewriting the entire package code (often exacting too high a +price, which results in the persistence of less-optimal data +structures). (3) *providing broader access to the range of tools in +`phylobase`*. Developers of specific packages can use these new tools +based on S4 objects without knowing the details of S4 programming. + +The base 'phylo4' class is modeled on the the `phylo` class in `ape`. +and extend the 'phylo4' class to include data or multiple trees +respectively. In addition to describing the classes and methods, this +vignette gives examples of how they might be used. + +## Package overview + +The phylobase package currently implements the following functions and +data structures: + +- Data structures for storing a single tree and multiple trees: and ? + +- A data structure for storing a tree with associated tip and node + data: + +- A data structure for storing multiple trees with one set of tip + data: + +- Functions for reading nexus files into the above data structures + +- Functions for converting between the above data structures and + objects as well as `phylog` objects (although the latter are now + deprecated ...) + +- Functions for editing trees and data (i.e., subsetting and + replacing) + +- Functions for plotting trees and trees with data + +## Using the S4 help system + +The help system works similarly to the help system with some small +differences relating to how methods are written. The function is a good +example. When we type we are provided the help for the default plotting +function which expects `x` and `y`. `R` also provides a way to smartly +dispatch the right type of plotting function. In the case of an object +(a class object) `R` evaluates the class of the object and finds the +correct functions, so the following works correctly. + + +```{r randtree, fig.keep="none"} +library(ape) +set.seed(1) ## set random-number seed +rand_tree <- rcoal(10) ## Make a random tree with 10 tips +plot(rand_tree) +``` + +However, typing still takes us to the default `plot` help. We have to +type to find what we are looking for. This is because generics are +simply functions with a dot and the class name added. + +The generic system is too complicated to describe here, but doesn't +include the same dot notation. As a result doesn't work, `R` still finds +the right plotting function. + +```{r convtree, fig.keep="none"} +library(phylobase) +# convert rand_tree to a phylo4 object +rand_p4_tree <- as(rand_tree, "phylo4") +plot(rand_p4_tree) +``` + +All fine and good, but how to we find out about all the great features +of the `phylobase` plotting function? `R` has two nifty ways to find it, +the first is to simply put a question mark in front of the whole call: + +```{r doc0, eval=FALSE, purl=FALSE} +`?`(plot(rand_p4_tree)) +``` + +`R` looks at the class of the object and takes us to the correct help +file (note: this only works with objects). The second ways is handy if +you already know the class of your object, or want to compare to +generics for different classes: + +```{r doc1, eval=FALSE, purl=FALSE} +`?`(method, plot("phylo4")) +``` + +More information about how documentation works can be found in the +methods package, by running the following command. + +```{r doc2, eval=FALSE, purl=FALSE} +help('Documentation', package=\"methods\") +``` + +## Trees without data + +You can start with a tree --- an object of class `phylo` from the `ape` +package (e.g., read in using the `read.tree()` or `read.nexus()` +functions), and convert it to a `phylo4` object. + +For example, load the raw *Geospiza* data: + +```{r + because +they are not defined: + +```{r nodelabelgeodata} +nodeLabels(g1) +``` + +You can also retrieve the node labels with \texttt{labels(g1,"internal")}. + +A simple way to assign the node numbers as labels (useful for various checks) is + +```{r assignnode_name} +nodeLabels(g1) <- paste("N", nodeId(g1, "internal"), sep="") +head(g1, 5) +``` + +The \texttt{summary} method gives a little extra information, including +information on the distribution of branch lengths: + +```{r sumgeodata} +summary(g1) +``` + +Print tip labels: + +```{r tiplabelgeodata} +tipLabels(g1) +``` + +(`labels(g1,"tip")` would also work.) + +You can modify labels and other aspects of the tree --- for example, to convert +all the labels to lower case: + +```{r modlabelsgeodata} +tipLabels(g1) <- tolower(tipLabels(g1)) +``` + +You could also modify selected labels, e.g. to modify the labels in positions 11 +and 13 (which happen to be the only labels with uppercase letters): + +```{r modifylabels, eval=FALSE, purl=FALSE} +tipLabels(g1)[c(11, 13)] <- c("platyspiza", "pinaroloxias") +``` + +Note that for a given tree, `phylobase` always return the `tipLabels` +in the same order. + +Print node numbers (in edge matrix order): + +```{r nodenumbergeodata} +nodeId(g1, type='all') +``` + +Does it have information on branch lengths? + +```{r hasbrlengeodata} +hasEdgeLength(g1) +``` + +It does! What do they look like? + +```{r edgeLength-geodata} +edgeLength(g1) +``` + +Note that the root has `` as its length. + +Print edge labels (also empty in this case --- therefore all `NA`): + +```{r edgelabelgeodata} +edgeLabels(g1) +``` + +You can also use this function to label specific edges: + +```{r edgelabel-assign-geodata} +edgeLabels(g1)["23-24"] <- "an edge" +edgeLabels(g1) +``` + +The edge labels are named according to the nodes they connect +(ancestor-descendant). You can get the edge(s) associated with a particular +node: + +```{r getEdge-geodata} +getEdge(g1, 24) # default uses descendant node +getEdge(g1, 24, type="ancestor") # edges using ancestor node +``` + +These results can in turn be passed to the function \texttt{edgeLength} to +retrieve the length of a given set of edges: + +```{r getEdge-edgeLength} +edgeLength(g1)[getEdge(g1, 24)] +edgeLength(g1)[getEdge(g1, 24, "ancestor")] +``` + +Is it rooted? + +```{r rootedgeodata} +isRooted(g1) +``` + +Which node is the root? + +```{r rootnodegeodata} +rootNode(g1) +``` + +Does it contain any polytomies? + +```{r polygeodata} +hasPoly(g1) +``` + +Is the tree ultrametric? + +```{r ultrametric-geodata} +isUltrametric(g1) +``` + +You can also get the depth (distance from the root) of any given node or the +tips: + +```{r nodeDepth-geodata} +nodeDepth(g1, 23) +depthTips(g1) +``` + +## Trees with data + +The `phylo4d` class matches trees with data, or combines them with a data +frame to make a `phylo4d` (tree-with-data) object. + +Now we'll take the _Geospiza_ data from `geospiza_raw$data` +and merge it with the tree. First, let's prepare the data: + +```{r} +g1 <- as(geospiza_raw$tree, "phylo4") +geodata <- geospiza_raw$data +``` + +However, since *G. olivacea* is included in the tree but not in the data +set, we will initially run into some trouble: + + +```{r geomergedata, error=TRUE, purl=FALSE} +g2 <- phylo4d(g1, geodata) +``` + +```{r echo=FALSE, results='hide'} +geodata <- geospiza_raw$data +``` + +To deal with *G. olivacea* missing from the data, we have a few choices. +The easiest is to use to allow to create the new object with a warning +(you can also use to proceed without warnings): + +```{r geomerge2, tidy=FALSE, warning=TRUE, purl=FALSE} +g2 <- phylo4d(g1, geodata, missing.data="warn") +``` + +```{r, echo=FALSE, results='hide'} +g2 <- phylo4d(g1, geodata, missing.data="OK", extra.data="OK") +``` + +Another way to deal with this would be to use `prune()` to drop the +offending tip from the tree first: + +```{r \>= summary(g2) @ + +Or use `tdata()` to extract the data (i.e., `tdata(g2)`). By default, +`tdata()` will retrieve tip data, but you can also get internal node +data only () or --- if the tip and node data have the same format --- +all the data combined (). + +If you want to plot the data (e.g. for checking the input), +`plot(tdata(g2))` will create the default plot for the data --- in this +case, since it is a data frame, this will be a `pairs` plot of the +data. + +## Subsetting + +The `subset` command offers a variety of ways of extracting portions of +a `phylo4` or `phylo4d` tree, keeping any tip/node data consistent. + +tips.include + +: give a vector of tips (names or numbers) to retain + +tips.exclude + +: give a vector of tips (names or numbers) to drop + +mrca + +: give a vector of node or tip names or numbers; extract the clade + containing these taxa + +node.subtree + +: give a node (name or number); extract the subtree starting from this + node + +Different ways to extract the *fuliginosa*-*scandens* clade: + +```{r geoextract, results='hide'} +subset(g2, tips.include=c("fuliginosa", "fortis", "magnirostris", + "conirostris", "scandens")) +subset(g2, node.subtree=21) +subset(g2, mrca=c("scandens", "fortis")) +``` + +One could drop the clade by doing + +```{r geodrop, results='hide'} +subset(g2, tips.exclude=c("fuliginosa", "fortis", "magnirostris", + "conirostris", "scandens")) +subset(g2, tips.exclude=names(descendants(g2, MRCA(g2, c("difficilis", +"fortis"))))) +``` + + +## Tree-walking + +`phylobase` provides many functions that allows users to explore +relationships between nodes on a tree (tree-walking and tree traversal). +Most functions work by specifying the `phylo4` (or `phylo4d`) object as +the first argument, the node numbers/labels as the second argument +(followed by some additional arguments). + +`getNode` allows you to find a node based on its node number or its +label. It returns a vector with node numbers as values and labels as +names: + +```{r getnode} +data(geospiza) +getNode(geospiza, 10) +getNode(geospiza, "pauper") +``` + +If no node is specified, they are all returned, and if a node can't be +found it's returned as a `NA`. It is possible to control what happens +when a node can't be found: + +```{r getnode2} +getNode(geospiza) +getNode(geospiza, 10:14) +getNode(geospiza, "melanogaster", missing="OK") # no warning +getNode(geospiza, "melanogaster", missing="warn") # warning! +``` + +`children` and `ancestor` give the immediate neighboring nodes: + +```{r children} +children(geospiza, 16) +ancestor(geospiza, 16) +``` + +while `descendants` and `ancestors` can traverse the tree up to the tips +or root respectively: + +```{r descendants} +descendants(geospiza, 16) # by default returns only the tips +descendants(geospiza, "all") # also include the internal nodes +ancestors(geospiza, 20) +ancestors(geospiza, 20, "ALL") # uppercase ALL includes self +``` + +`siblings` returns the other node(s) associated with the same ancestor: + +```{r siblings} +siblings(geospiza, 20) +siblings(geospiza, 20, include.self=TRUE) +``` + +`MRCA` returns the most common recent ancestor for a set of tips, and +shortest path returns the nodes connecting 2 nodes: + +```{r mrca} +MRCA(geospiza, 1:6) +shortestPath(geospiza, 4, "pauper") +``` + +## multiPhylo4 classes + +`multiPhylo4` classes are not yet implemented but will be coming soon. + +## Examples + +### Constructing a Brownian motion trait simulator + +This section will describe a way of constructing a simulator that +generates trait values for extant species (tips) given a tree with +branch lengths, assuming a model of Brownian motion. + +We can use to coerce the tree into a variance-covariance matrix form, +and then use `mvrnorm` from the `MASS` package to generate a set of +multivariate normally distributed values for the tips. (A benefit of +this approach is that we can very quickly generate a very large number +of replicates.) This example illustrates a common feature of working +with `phylobase` --- combining tools from several different packages to +operate on phylogenetic trees with data. + +We start with a randomly generated tree using `rcoal()` from `ape` to +generate the tree topology and branch lengths: + +```{r rtree2} +set.seed(1001) +tree <- as(rcoal(12), "phylo4") +``` + +Next we generate the phylogenetic variance-covariance matrix (by +coercing the tree to a `phylo4vcov` object) and pick a single set of +normally distributed traits (using to pick a multivariate normal deviate +with a variance-covariance matrix that matches the structure of the +tree). + +```{r vcvphylo} +vmat <- as(tree, "phylo4vcov") +vmat <- cov2cor(vmat) +library(MASS) +trvec <- mvrnorm(1, mu=rep(0, 12), Sigma=vmat) +``` + +The last step (easy) is to convert the `phylo4vcov` object back to a +`phylo4d` object: + +```{r plotvcvphylo} +treed <- phylo4d(tree, tip.data=as.data.frame(trvec)) +plot(treed) +``` + +## Definitions/slots + +This section details the internal structure of the `phylo4`, +`multiphylo4` (coming soon!), `phylo4d`, and `multiphylo4d` (coming +soon!) classes. The basic building blocks of these classes are the +`phylo4` object and a dataframe. The `phylo4` tree format is largely +similar to the one used by `phylo` class in the package `ape`[^1]. + +We use "edge" for ancestor-descendant relationships in the phylogeny +(sometimes called "branches") and "edge lengths" for their lengths +("branch lengths"). Most generally, "nodes" are all species in the tree; +species with descendants are "internal nodes" (we often refer to these +just as "nodes", meaning clear from context); "tips" are species with no +descendants. The "root node" is the node with no ancestor (if one +exists). + +### phylo4 + +Like `phylo`, the main components of the `phylo4` class are: + +edge + +: a 2-column matrix of integers, with $N$ rows for a rooted tree or + $N-1$ rows for an unrooted tree and column names `ancestor` and + `descendant`. Each row contains information on one edge in the tree. + See below for further constraints on the edge matrix. + +edge.length + +: numeric list of edge lengths (length $N$ (rooted) or $N-1$ + (unrooted) or empty (length 0)) + +tip.label + +: character vector of tip labels (required), with length=# of tips. + Tip labels need not be unique, but data-tree matching with + non-unique labels will cause an error + +node.label + +: character vector of node labels, length=# of internal nodes or 0 (if + empty). Node labels need not be unique, but data-tree matching with + non-unique labels will cause an error + +order + +: character: "preorder", "postorder", or "unknown" (default), + describing the order of rows in the edge matrix. , "pruningwise" and + "cladewise" are accepted for compatibility with `ape` + +The edge matrix must not contain `NA`s, with the exception of the root +node, which has an `NA` for `ancestor`. `phylobase` does not enforce an +order on the rows of the edge matrix, but it stores information on the +current ordering in the slot --- current allowable values are "unknown" +(the default), "preorder" (equivalent to "cladewise" in `ape`) or +"postorder" [^2]. + +The basic criteria for the edge matrix are similar to those of `ape`, as +documented it's tree specification[^3]. This is a modified version of +those rules, for a tree with $n$ tips and $m$ internal nodes: + +- Tips (no descendants) are coded $1,\ldots, n$, and internal nodes + ($\ge 1$ descendant) are coded $n + 1, \ldots , n + m$ ($n + 1$ is + the root). Both series are numbered with no gaps. + +- The first (ancestor) column has only values $> n$ (internal nodes): + thus, values $\le n$ (tips) appear only in the second (descendant) + column + +- all internal nodes (not including the root) must appear in the + first (ancestor) column at least once [unlike `ape`, which + nominally requires each internal node to have at least two + descendants (although it doesn't absolutely prohibit them and has a + function to get rid of them), `phylobase` does allow these + "singleton nodes" and has a method `hasSingle` for detecting them]. + Singleton nodes can be useful as a way of representing changes along + a lineage; they are used this way in the `ouch` package. + +- the number of occurrences of a node in the first column is related + to the nature of the node: once if it is a singleton, twice if it is + dichotomous (i.e., of degree 3 [counting ancestor as well as + descendants]), three times if it is trichotomous (degree 4), and so + on. + +`phylobase` does not technically prohibit reticulations (nodes or tips +that appear more than once in the descendant column), but they will +probably break most of the methods. Disconnected trees, cycles, and +other exotica are not tested for, but will certainly break the methods. + +We have defined basic methods for `phylo4`:`show`, `print`, and a +variety of accessor functions (see help files). `summary` does not seem +to be terribly useful in the context of a "raw" tree, because there is +not much to compute. + +### phylo4d + +The `phylo4d` class extends `phylo4` with data. Tip data, and (internal) +node data are stored separately, but can be retrieved together or +separately with `tdata(x,"tip")`, `tdata(x,"internal")` or +`tdata(x,"all")`. There is no separate slot for edge data, but these can +be stored as node data associated with the descendant node. + +[^1]: + +[^2]: see for more + information on orderings. (`ape`'s "pruningwise" is "bottom-up" + ordering). + +[^3]: [ape.mpl.ird.fr/misc/FormatTreeR_28July2008.pdf](ape.mpl.ird.fr/misc/FormatTreeR_28July2008.pdf){.uri} diff --git a/vignettes/phylobase.Rnw b/vignettes/phylobase.Rnw deleted file mode 100644 index 4f81ba6..0000000 --- a/vignettes/phylobase.Rnw +++ /dev/null @@ -1,674 +0,0 @@ -\documentclass{article} -%\VignetteEngine{knitr::knitr} -%\VignetteIndexEntry{phylo4: classes and methods for phylogenetic trees and data} - -\usepackage[utf8]{inputenc} % for UTF-8/single quotes from sQuote() -\usepackage{graphicx} -\usepackage{array} -\usepackage{url} - - -%% Use a little bit more of the page -%% borrowed from Rd.sty, of r-project.org -\addtolength{\textheight}{12mm} -\addtolength{\topmargin}{-9mm} % still fits on US paper -\addtolength{\textwidth}{24mm} % still fits on US paper -\setlength{\oddsidemargin}{10mm} -\setlength{\evensidemargin}{\oddsidemargin} - -\newcommand{\code}[1]{{{\tt #1}}} - -\title{The \code{phylo4} S4 classes and methods} -\author{Ben Bolker, Peter Cowan \& Fran\c{c}ois Michonneau} -\date{\today} - -\begin{document} - - -<>= -library(knitr) -opts_chunk$set( - fig.keep='none', dev='pdf', fig.width=6, fig.height=6, - latex.options.color="usenames,dvipsnames" -) -@ - - -\maketitle -\tableofcontents - -\section{Introduction} - -This document describes the new \code{phylo4} S4 classes and methods, which are -intended to provide a unifying standard for the representation of phylogenetic -trees and comparative data in R. The \code{phylobase} package was developed to -help both end users and package developers by providing a common suite of tools -likely to be shared by all packages designed for phylogenetic analysis, -facilities for data and tree manipulation, and standardization of formats. - -This standardization will benefit \emph{end-users} by making it easier to move -data and compare analyses across packages, and to keep comparative data -synchronized with phylogenetic trees. Users will also benefit from a repository -of functions for tree manipulation, for example tools for including or excluding -subtrees (and associated phenotypic data) or improved tree and data plotting -facilities. \code{phylobase} will benefit \emph{developers} by freeing them to -put their programming effort into developing new methods rather than into -re-coding base tools. We (the \code{phylobase} developers) hope \code{phylobase} -will also facilitate code validation by providing a repository for benchmark -tests, and more generally that it will help catalyze community development of -comparative methods in R. - -A more abstract motivation for developing \code{phylobase} was to improve data -checking and abstraction of the tree data formats. \code{phylobase} can check -that data and trees are associated in the proper fashion, and protects users and -developers from accidently reordering one, but not the other. It also seeks to -abstract the data format so that commonly used information (for example, branch -length information or the ancestor of a particular node) can be accessed without -knowledge of the underlying data structure (i.e., whether the tree is stored as -a matrix, or a list, or a parenthesis-based format). This is achieved through -generic \code{phylobase} functions which which retrieve the relevant information -from the data structures. The benefits of such abstraction are multiple: (1) -\emph{easier access to the relevant information} via a simple function call -(this frees both users and developers from learning details of complex data -structures), (2) \emph{freedom to optimize data structures in the future without - breaking code.} Having the generic functions in place to ``translate'' between -the data structures and the rest of the program code allows program and data -structure development to proceed somewhat independently. The alternative is code -written for specific data structures, in which modifications to the data -structure requires rewriting the entire package code (often exacting too high a -price, which results in the persistence of less-optimal data structures). (3) -\emph{providing broader access to the range of tools in - \code{phylobase}}. Developers of specific packages can use these new tools -based on S4 objects without knowing the details of S4 programming. - -The base \code{phylo4} class is modeled on the the \code{phylo} class in -\code{ape}. \code{phylo4d} and \code{multiphylo4} extend the \code{phylo4} -class to include data or multiple trees respectively. In addition to describing -the classes and methods, this vignette gives examples of how they might be used. - -\section{Package overview} - -The phylobase package currently implements the following functions and data structures: - -\begin{itemize} -\item Data structures for storing a single tree and multiple trees: - \code{phylo4} and \code{multiPhylo4}? -\item A data structure for storing a tree with associated tip and node data: - \code{phylo4d} -\item A data structure for storing multiple trees with one set of tip data: - \code{multiPhylo4d} -\item Functions for reading nexus files into the above data structures -\item Functions for converting between the above data structures and \code{ape - phylo} objects as well as \code{ade4} \code{phylog} objects (although the - latter are now deprecated \ldots) -\item Functions for editing trees and data (i.e., subsetting and replacing) -\item Functions for plotting trees and trees with data -\end{itemize} - -\section{Using the S4 help system} - -The \code{S4} help system works similarly to the \code{S3} help system with some -small differences relating to how \code{S4} methods are written. The -\code{plot()} function is a good example. When we type \code{?plot} we are -provided the help for the default plotting function which expects \code{x} and -\code{y}. \code{R} also provides a way to smartly dispatch the right type of -plotting function. In the case of an \code{ape phylo} object (a \code{S3} class -object) \code{R} evaluates the class of the object and finds the correct -functions, so the following works correctly. - -<>= -library(ape) -set.seed(1) ## set random-number seed -rand_tree <- rcoal(10) ## Make a random tree with 10 tips -plot(rand_tree) -@ - -However, typing \code{?plot} still takes us to the default \code{plot} help. We -have to type \code{?plot.phylo} to find what we are looking for. This is -because \code{S3} generics are simply functions with a dot and the class name -added. - -The \code{S4} generic system is too complicated to describe here, but doesn't -include the same dot notation. As a result \code{?plot.phylo4} doesn't work, -\code{R} still finds the right plotting function. - -<>= -library(phylobase) -# convert rand_tree to a phylo4 object -rand_p4_tree <- as(rand_tree, "phylo4") -plot(rand_p4_tree) -@ - -All fine and good, but how to we find out about all the great features of the -\code{phylobase} plotting function? \code{R} has two nifty ways to find it, the -first is to simply put a question mark in front of the whole call: - -<>= -`?`(plot(rand_p4_tree)) -@ - -\code{R} looks at the class of the \code{rand\_p4\_tree} object and takes us to -the correct help file (note: this only works with \code{S4} objects). The -second ways is handy if you already know the class of your object, or want to -compare to generics for different classes: - -<>= -`?`(method, plot("phylo4")) -@ - -More information about how \code{S4} documentation works can be found in the -methods package, by running the following command. - -<>= -help('Documentation', package="methods") -@ - -\section{Trees without data} - -You can start with a tree --- an object of class \code{phylo} from the -\code{ape} package (e.g., read in using the \code{read.tree()} or -\code{read.nexus()} functions), and convert it to a \code{phylo4} object. - -For example, load the raw \emph{Geospiza} data: -<>= -library(phylobase) -data(geospiza_raw) -## what does it contain? -names(geospiza_raw) -@ - -Convert the \code{S3} tree to a \code{S4 phylo4} object using the \code{as()} -function: - -<>= -(g1 <- as(geospiza_raw$tree, "phylo4")) -@ - -The (internal) nodes appear with labels \verb++ because -they are not defined: - -<>= -nodeLabels(g1) -@ - -You can also retrieve the node labels with \code{labels(g1,"internal")}). - -A simple way to assign the node numbers as labels (useful for various checks) is - -<<>>= -nodeLabels(g1) <- paste("N", nodeId(g1, "internal"), sep="") -head(g1, 5) -@ - -The \code{summary} method gives a little extra information, including -information on the distribution of branch lengths: - -<>= -summary(g1) -@ - -Print tip labels: -<>= -tipLabels(g1) -@ - -(\code{labels(g1,"tip")} would also work.) - -You can modify labels and other aspects of the tree --- for example, to convert -all the labels to lower case: - -<>= -tipLabels(g1) <- tolower(tipLabels(g1)) -@ - -You could also modify selected labels, e.g. to modify the labels in positions 11 -and 13 (which happen to be the only labels with uppercase letters): - -<>= -tipLabels(g1)[c(11, 13)] <- c("platyspiza", "pinaroloxias") -@ - -Note that for a given tree, \code{phylobase} always return the \code{tipLabels} -in the same order. - -Print node numbers (in edge matrix order): -<>= -nodeId(g1, type='all') -@ - -Does it have information on branch lengths? -<>= -hasEdgeLength(g1) -@ - -It does! What do they look like? -<>= -edgeLength(g1) -@ - -Note that the root has \verb++ as its length. - -Print edge labels (also empty in this case --- therefore all \code{NA}): - -<>= -edgeLabels(g1) -@ - -You can also use this function to label specific edges: -<>= -edgeLabels(g1)["23-24"] <- "an edge" -edgeLabels(g1) -@ - -The edge labels are named according to the nodes they connect -(ancestor-descendant). You can get the edge(s) associated with a particular -node: - -<>= -getEdge(g1, 24) # default uses descendant node -getEdge(g1, 24, type="ancestor") # edges using ancestor node -@ - -These results can in turn be passed to the function \code{edgeLength} to -retrieve the length of a given set of edges: - -<>= -edgeLength(g1)[getEdge(g1, 24)] -edgeLength(g1)[getEdge(g1, 24, "ancestor")] -@ - -Is it rooted? - -<>= -isRooted(g1) -@ - -Which node is the root? -<>= -rootNode(g1) -@ - -Does it contain any polytomies? -<>= -hasPoly(g1) -@ - -Is the tree ultrametric? -<>= -isUltrametric(g1) -@ - -You can also get the depth (distance from the root) of any given node or the -tips: -<>= -nodeDepth(g1, 23) -depthTips(g1) -@ - -\section{Trees with data} - -The \code{phylo4d} class matches trees with data, or combines them with a data -frame to make a \code{phylo4d} (tree-with-data) object. - -Now we'll take the \emph{Geospiza} data from \verb+geospiza_raw$data+ and merge -it with the tree. First, let's prepare the data: - -<>= -g1 <- as(geospiza_raw$tree, "phylo4") -geodata <- geospiza_raw$data -@ - - -However, since \emph{G. olivacea} is included in the tree but -not in the data set, we will initially run into some trouble: - -<>= -g2 <- phylo4d(g1, geodata) -@ - -<>= -geodata <- geospiza_raw$data -@ - -To deal with \emph{G. olivacea} missing from the data, we have a few -choices. The easiest is to use \code{missing.data="warn"} to allow \code{R} to -create the new object with a warning (you can also use \code{missing.data="OK"} -to proceed without warnings): - -<>= -g2 <- phylo4d(g1, geodata, missing.data="warn") -@ - -<>= -g2 <- phylo4d(g1, geodata, missing.data="OK", extra.data="OK") -@ - -Another way to deal with this would be to use \code{prune()} to drop the -offending tip from the tree first: - -<>= -g1sub <- prune(g1, "olivacea") -g1B <- phylo4d(g1sub, geodata) -@ - -The difference between the two objects is that the species \emph{G. olivacea} is -still present in the tree but has no data (i.e., \verb+NA+) associated with -it. In the other case, \textit{G. olivacea} is not included in the tree -anymore. The approach you choose depends on the goal of your analysis. - -You can summarize the new object with the function \code{summary}. It breaks -down the statistics about the traits based on whether it is associated with the -tips for the internal nodes: -<>= -summary(g2) -@ - -Or use \code{tdata()} to extract the data (i.e., \code{tdata(g2)}). By default, -\code{tdata()} will retrieve tip data, but you can also get internal node data -only (\code{tdata(tree, "internal")}) or --- if the tip and node data have the -same format --- all the data combined (\code{tdata(tree, "allnode")}). - -If you want to plot the data (e.g. for checking the input), -\code{plot(tdata(g2))} will create the default plot for the data --- in this -case, since it is a data frame [\textbf{this may change in future versions but - should remain transparent}] this will be a \code{pairs} plot of the data. - -\section{Subsetting} - -The \code{subset} command offers a variety of ways of extracting portions of a -\code{phylo4} or \code{phylo4d} tree, keeping any tip/node data consistent. - -\begin{description} -\item[tips.include]{give a vector of tips (names or numbers) to retain} -\item[tips.exclude]{give a vector of tips (names or numbers) to drop} -\item[mrca]{give a vector of node or tip names or numbers; extract the clade - containing these taxa} -\item[node.subtree]{give a node (name or number); extract the subtree starting - from this node} -\end{description} - -Different ways to extract the \emph{fuliginosa}-\emph{scandens} clade: - -<>= -subset(g2, tips.include=c("fuliginosa", "fortis", "magnirostris", - "conirostris", "scandens")) -subset(g2, node.subtree=21) -subset(g2, mrca=c("scandens", "fortis")) -@ - -One could drop the clade by doing - -<>= -subset(g2, tips.exclude=c("fuliginosa", "fortis", "magnirostris", - "conirostris", "scandens")) -subset(g2, tips.exclude=names(descendants(g2, MRCA(g2, c("difficilis", - "fortis"))))) - -@ - -% This isn't implemented yet - -% Another approach is to pick the subtree graphically, by plotting the tree and -% using \code{identify}, which returns the identify of the node you click on -% with the mouse. -% -% <>= -% plot(g1) -% n1 <- identify(g1) -% subset(g2,node.subtree=n1) -% @ - -\section{Tree-walking} - -\code{phylobase} provides many functions that allows users to explore -relationships between nodes on a tree (tree-walking and tree traversal). Most -functions work by specifying the \code{phylo4} (or \code{phylo4d}) object as the -first argument, the node numbers/labels as the second argument (followed by some -additional arguments). - -\code{getNode} allows you to find a node based on its node number or its -label. It returns a vector with node numbers as values and labels as names: - -<>= -data(geospiza) -getNode(geospiza, 10) -getNode(geospiza, "pauper") -@ - -If no node is specified, they are all returned, and if a node can't be found -it's returned as a \verb+NA+. It is possible to control what happens when a node -can't be found: - -<>= -getNode(geospiza) -getNode(geospiza, 10:14) -getNode(geospiza, "melanogaster", missing="OK") # no warning -getNode(geospiza, "melanogaster", missing="warn") # warning! -@ - -\code{children} and \code{ancestor} give the immediate neighboring nodes: - -<>= -children(geospiza, 16) -ancestor(geospiza, 16) -@ - -while \code{descendants} and \code{ancestors} can traverse the tree up to the -tips or root respectively: - -<>= -descendants(geospiza, 16) # by default returns only the tips -descendants(geospiza, "all") # also include the internal nodes -ancestors(geospiza, 20) -ancestors(geospiza, 20, "ALL") # uppercase ALL includes self -@ - -\code{siblings} returns the other node(s) associated with the same ancestor: - -<>= -siblings(geospiza, 20) -siblings(geospiza, 20, include.self=TRUE) -@ - -\code{MRCA} returns the most common recent ancestor for a set of tips, and -shortest path returns the nodes connecting 2 nodes: - -<>= -MRCA(geospiza, 1:6) -shortestPath(geospiza, 4, "pauper") -@ - -\section{multiPhylo4 classes} - -\code{multiPhylo4} classes are not yet implemented but will be coming soon. - -\section{Examples} - -\subsection{Constructing a Brownian motion trait simulator} - -This section will describe a way of constructing a simulator that generates -trait values for extant species (tips) given a tree with branch lengths, -assuming a model of Brownian motion. - -We can use \code{as(tree,"phylo4vcov")} to coerce the tree into a -variance-covariance matrix form, and then use \code{mvrnorm} from the -\code{MASS} package to generate a set of multivariate normally distributed -values for the tips. (A benefit of this approach is that we can very quickly -generate a very large number of replicates.) This example illustrates a common -feature of working with \code{phylobase} --- combining tools from several -different packages to operate on phylogenetic trees with data. - -We start with a randomly generated tree using \code{rcoal()} from \code{ape} to -generate the tree topology and branch lengths: - -<>= -set.seed(1001) -tree <- as(rcoal(12), "phylo4") -@ - -Next we generate the phylogenetic variance-covariance matrix (by coercing the -tree to a \code{phylo4vcov} object) and pick a single set of normally -distributed traits (using \code{MASS:mvrnorm} to pick a multivariate normal -deviate with a variance-covariance matrix that matches the structure of the -tree). - -<>= -vmat <- as(tree, "phylo4vcov") -vmat <- cov2cor(vmat) -library(MASS) -trvec <- mvrnorm(1, mu=rep(0, 12), Sigma=vmat) -@ - -The last step (easy) is to convert the \code{phylo4vcov} object back to a -\code{phylo4d} object: - -<>= -treed <- phylo4d(tree, tip.data=as.data.frame(trvec)) -plot(treed) -@ - -% \subsubsection{The hard way} - -% <>= -% ## add node labels so we can match to data -% nodeLabels(tree) <- as.character(nodeId(tree, "internal")) -% ## ordering will make sure that we have ancestor value -% ## defined before descendant -% tree <- reorder(tree, "preorder") -% edgemat <- edges(tree) -% ## set aside space for values -% nodevals <- numeric(nrow(edgemat)) -% ## label data in edge matrix order -% names(nodevals) <- labels(tree, "all")[nodeId(tree, "all")] -% ## variance is proportional to edge length; drop first -% ## element of edge length, which is NA -% dvals <- rnorm(nrow(edgemat) - 1, sd=edgeLength(tree)[-1]^2) -% ## indexing: ind[node number] gives position in edge matrix -% ind <- order(nodeId(tree, "all")) -% for (i in 2:nrow(edgemat)) { -% ## value of ancestor node plus change -% nodevals[i] <- nodevals[ind[edgemat[i, 1]]] + dvals[i - 1] -% } -% nodevals <- data.frame(nodevals) -% treed2 <- phylo4d(tree, all.data=nodevals) -% @ - - -% ======================================== -% = Table of commands, worth the effort? = -% ======================================== -% \begin{tabular}{>{\tt}ll} -% \hline -% \rm Method & Description\\ -% \hline -% tdata & Retrieve tip data\\ -% plot & plot tree with data if present\\ -% \hline -% \end{tabular} - - -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -%%% Appendices %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - -\appendix -\section{Definitions/slots} - -This section details the internal structure of the \code{phylo4}, -\code{multiphylo4} (coming soon!), \code{phylo4d}, and \code{multiphylo4d} -(coming soon!) classes. The basic building blocks of these classes are the -\code{phylo4} object and a dataframe. The \code{phylo4} tree format is largely -similar to the one used by \code{phylo} class in the package -\code{ape}\footnote{\url{http://ape.mpl.ird.fr/}}. - -We use ``edge'' for ancestor-descendant relationships in the phylogeny -(sometimes called ``branches'') and ``edge lengths'' for their lengths (``branch -lengths''). Most generally, ``nodes'' are all species in the tree; species with -descendants are ``internal nodes'' (we often refer to these just as ``nodes'', -meaning clear from context); ``tips'' are species with no descendants. The -``root node'' is the node with no ancestor (if one exists). - -\subsection{phylo4} -Like \code{phylo}, the main components of -the \code{phylo4} class are: -\begin{description} -\item[edge]{a 2-column matrix of integers, - with $N$ rows for a rooted tree or - $N-1$ rows for an unrooted tree and - column names \code{ancestor} and \code{descendant}. - Each row contains information on one edge in the tree. - See below for further constraints on the edge matrix.} -\item[edge.length]{numeric list of edge lengths - (length $N$ (rooted) or $N-1$ (unrooted) or empty (length 0))} -\item[tip.label]{character vector of tip labels (required), with length=\# of - tips. Tip labels need not be unique, but data-tree matching with non-unique - labels will cause an error} -\item[node.label]{character vector of node labels, length=\# of internal nodes - or 0 (if empty). Node labels need not be unique, but data-tree matching - with non-unique labels will cause an error} -\item[order]{character: ``preorder'', ``postorder'', or ``unknown'' (default), - describing the order of rows in the edge matrix. , ``pruningwise'' and - ``cladewise'' are accepted for compatibility with \code{ape}} -\end{description} - -The edge matrix must not contain \code{NA}s, with the exception of the root -node, which has an \code{NA} for \code{ancestor}. \code{phylobase} does not -enforce an order on the rows of the edge matrix, but it stores information on -the current ordering in the \code{@order} slot --- current allowable values are -``unknown'' (the default), ``preorder'' (equivalent to ``cladewise'' in -\code{ape}) or ``postorder'' \footnote{see - \url{http://en.wikipedia.org/wiki/Tree_traversal} for more information on - orderings. (\code{ape}'s ``pruningwise'' is ``bottom-up'' ordering).}. - -The basic criteria for the edge matrix are similar to those of \code{ape}, as -documented it's tree -specification\footnote{\url{ape.mpl.ird.fr/misc/FormatTreeR_28July2008.pdf}}. This -is a modified version of those rules, for a tree with $n$ tips and $m$ internal -nodes: - -\begin{itemize} -\item Tips (no descendants) are coded $1,\ldots, n$, - and internal nodes ($\ge 1$ descendant) - are coded $n + 1, \ldots , n + m$ - ($n + 1$ is the root). - Both series are numbered with no gaps. -\item The first (ancestor) - column has only values $> n$ (internal nodes): thus, values $\le n$ - (tips) appear only in the second (descendant) column) -\item all internal nodes [not including the root] must appear in the first - (ancestor) column at least once [unlike \code{ape}, which nominally requires - each internal node to have at least two descendants (although it doesn't - absolutely prohibit them and has a \code{collapse.singles} function to get rid - of them), \code{phylobase} does allow these ``singleton nodes'' and has a - method \code{hasSingle} for detecting them]. Singleton nodes can be useful as - a way of representing changes along a lineage; they are used this way in the - \code{ouch} package. - -\item the number of occurrences of a node in the first column is related to the - nature of the node: once if it is a singleton, twice if it is dichotomous - (i.e., of degree 3 [counting ancestor as well as descendants]), three times if - it is trichotomous (degree 4), and so on. -\end{itemize} - -\code{phylobase} does not technically prohibit reticulations (nodes or tips that -appear more than once in the descendant column), but they will probably break -most of the methods. Disconnected trees, cycles, and other exotica are not -tested for, but will certainly break the methods. - -We have defined basic methods for \code{phylo4}:\code{show}, \code{print}, and a -variety of accessor functions (see help files). \code{summary} does not seem to -be terribly useful in the context of a ``raw'' tree, because there is not much -to compute. - -\subsection{phylo4d} - -The \code{phylo4d} class extends \code{phylo4} with data. Tip data, and -(internal) node data are stored separately, but can be retrieved together or -separately with \code{tdata(x,"tip")}, \code{tdata(x,"internal")} or -\code{tdata(x,"all")}. There is no separate slot for edge data, but these can be -stored as node data associated with the descendant node. - - -% \subsection{multiphylo4} - -\end{document}