Given an outgroup among your UCE data, and a fossil calibration for the split between your outgroup and ingroup, this program estimates the MRCA of your ingroup based on sequence divergence. To roughly account for rate variation among the taxa (and incomplete lineage sorting!), if the divergence between the ingroup species is greater than that between the outgroup/ingroup species, these loci are ditched i.e. only loci where "divergence between ingroup species" < "divergence between ingroup and outgroup species" are used, because these should be loci that are evolving in a more clock-like manner (which is what we are assuming by not explicitly accounting for rate variation between lineages/gene-tree and species-tree discordance, as this is a tree-free approach). Because of this, your ingroup and outgroup also need to be reciprocally monophyletic, and no additional out-outgroups should be included. See section below on removing unwanted sequences...
This program is designed to be run inside a folder that has a separate *.fasta or *.nex file for each of your UCE loci. Within that folder you should place your UCE_divergence.settings file (described below), UCE_divergence.sh, shortennames.R (the library stringr needs to have previously been installed as an R-dependency using install.packages("stringr")), and UCE_divergence.R. You can then execute UCE_divergence_times by bash UCE_divergence.sh.
First line: Your UCE_divergence.settings file is just a simple text file with a different option on each line. The first of these is whether your data is *.fasta or *.nexus. If it is *.nexus, is the extension *.nexus or *.nex (some of the programs care about the specific extension name)? First line = fasta, nexus or nex e.g.
nexus
Second line: Your next option is which program you would rather use to estimate the pairwise divergences between your UCE loci: fastphylo or paup. -- fastphylo has a correction for saturation of sequences which paup doesn't have, which can be specified by "F". I have tested a few values and 0.95 seems to give the best minimum clade age estimates. However, it is more limited in distance models than paup: (possible values=JC, K2P, TN93, HAMMING,K2P) -- paup has a lot of options for distances. For the distance model itself you have JC, F81, TAJNEI, K2P, F84, HKY85, K3P, TAMNEI, GTR, ML, LOGDET, UPHOLT, NEILI. You can have RATES=EQUAL RATES=GAMMA. GAMMA rates not allowed with LOGDET. You can have SHAPE = real-value, PINVAR = real-value, and many more options as specified in paup (http://www.ucl.ac.uk/cecd/downloads/PAUP%20command%20reference%20manual.pdf)
nexus
fastphylo
If you select paup, make sure you have it installed and within your path. If you select fastphylo, make sure you have this installed and within your path. Additionally, if you are starting from *.nexus (or *.nex = same thing, different extension name) files, and want to use fastphylo, make sure you have seqmagick installed and in your path (this is going to convert from the *.nexus to *.fasta files that fastphylo can deal with). You also need to make sure that the path to the seqmagick python libs are in your PYTHONPATH e.g. PYTHONPATH="${PYTHONPATH}:/home/a499a400/bin/seqmagick/lib/python" export PYTHONPATH
Third line: Specifying the options that you want to use with whatever evolutionary distance program you have selected
Paup options example: DISTANCE=NEILI RATES=GAMMA SHAPE=300 MISSDIST=IGNORE
fastphylo options example: -D HAMMING -F 0.95
Note: I believe that the MISSDIST=IGNORE option in Paup is probably critical for this pipeline to run correctly if you have missing data.
nexus
fastphylo
-D HAMMING -F 0.95
Fourth line: Here you need your fossil calibration in whatever units you want the final estimate of TMRCA of your ingroup in.
nexus
fastphylo
-D HAMMING -F 0.95
85
Fifth line and onwards: Here, you need to specify the outgroups that are present in your sequence (*.fasta, *.nexus) files. If you have multiple outgroups, each of these should go on a separate line e.g.
nexus
fastphylo
-D HAMMING -F 0.95
85
mantheyus_phuwanensis_fmnh255495b
mantheyus_phuwanensis_fmnh262580
If you've got multiple non-monophyletic outgroups, you'll need to jettison these for the program to run correctly. To do this, you'll need to get a "species_assignment" file together (as described in step 3) from https://github.com/laninsky/Phase_hybrid_from_next_gen/tree/master/post-processing. We need to run the remove_uncertains_UCE.R code in this repository, rather than the remove_uncertains.R code in the Phas_hybrid_from_next_gen repository, because we do not want to jettison any sequences that don't have sequence data.
This code works on fasta files, so if you are coming from *.nex or *.nexus files you'll need to run the first two blocks of the following code before you can run the remove_uncertains_UCE.R code. For your species_assignments file (see link above for more detail), you don't need to worry about designating any species as 'hybrid' for our application here, but do make sure the sequences you want to get rid of are labelled as uncertain.
if [ "$filetype" = "nexus" ]; then
for f in `ls *.nexus`; do
newname=`echo $f | sed "1s/.nexus//"`
mv $newname.nexus $newname.nex
done
fi
if [ "$filetype" != "fasta" ]; then
for f in `ls *.nex*`; do
newname=`echo $f | sed "1s/.nex//"`
seqmagick convert --alphabet dna-ambiguous $newname.nex $newname.fasta
done
fi
unset i
for i in `ls *.fasta`;
do mv $i temp;
Rscript remove_uncertains_UCE.R
mv temp.fa $i;
rm -rf temp;
done;
Once you have done this, your sequences are now in *.fasta format, so make sure you change the first line in your UCE_divergence.settings file to reflect this before attempting to run the UCE_divergence_times pipeline.
You should cite any of these awesome programs that you use (depending on the options you specify in your UCE_divergence.settings file)
R: R Core Team. 2015. R: A language and environment for statistical computing. URL http://www.R-project.org/. R Foundation for Statistical Computing, Vienna, Austria. https://www.r-project.org/
Stringr: Hadley Wickham (2012). stringr: Make it easier to work with strings.. R package version 0.6.2. http://CRAN.R-project.org/package=stringr (for up-to-date citation information run citation("stringr" in R)
PAUP: Swofford, David L. "{PAUP*. Phylogenetic analysis using parsimony (* and other methods). Version 4.}." (2003). http://people.sc.fsu.edu/~dswofford/paup_test/
Fastphylo: Khan, Mehmood A., Isaac Elias, Erik Sjölund, Kristina Nylander, Roman V. Guimera, Richard Schobesberger, Peter Schmitzberger, Jens Lagergren, and Lars Arvestad. "Fastphylo: Fast tools for phylogenetics." BMC bioinformatics 14, no. 1 (2013): 334. http://fastphylo.sourceforge.net/
Seqmagick: seqmagick is written and maintained by the Matsen Group at the Fred Hutchinson Cancer Research Center. http://seqmagick.readthedocs.org/en/latest/
If you want to cite the pipeline:
Alexander, A. 2015. UCE_divergence_times v1.0.0. Available from https://github.com/laninsky/UCE_divergence_times
Grismer, J.L., Schulte, J.A., Alexander, A., Wagner, P., Travers, S.L., Buehler, M.D., Welton, L.J. and Brown, R.M., 2016. The Eurasian invasion: phylogenomic data reveal multiple Southeast Asian origins for Indian Dragon Lizards. BMC evolutionary biology, 16(1), p.43.