A tutorial on multiple sequence alignment
All phylogeny-inference methods require sets of homologous characters as input. When nucleotide sequences are used for phylogenetic analyses, a first step is therefore usually to infer which nucleotides in the sequences of different taxa are homologous to each other, so that differences among these nucleotides result only from changes that occurred in the evolution of the sequences. This inference of homology among nucleotides of different sequences is most often done by methods that fall into the class of "multiple sequence alignment".
- Outline
- Dataset
- Requirements
- Alignment and visualization with MAFFT and AliView
- Automated alignment filtering with BMGE
- Identification of additional homologous sequences
- Codon-based manual alignment curation
In this tutorial, I will present the use of one of the fastest and most popular tools for multiple sequence alignment, the program MAFFT (Katoh and Standley 2013). I will further demonstrate how to detect and exclude alignment regions within which nucleotide homology may be questionable, how to identify additional homologous sequences using a public sequence database, NCBI's GenBank, and how to use these sequences to complement existing datasets.
The dataset used in this tutorial is a small subset of the data used in Matschiner et al. (2017) to estimate divergence times of African and Neotropical cichlid fishes in relation to the break-up of the Gondwanan continents India, Madagascar, Africa, and South America. Based on the full dataset as well as 149 fossils used for time calibration, we concluded in that study that cichlids, even though they are with few exceptions strictly freshwater fishes, must have traversed the Atlantic Ocean at a time when it was already hundreds of kilometers wide to reach South America, and that they most likely also crossed the Indian Ocean as well as the Mozambique Channel between Africa and Madagascar.
The dataset used here includes sequences for two genes; the mitochondrial 16S gene coding for 16S ribosomal RNA and the nuclear RAG1 gene coding for recombination activating protein 1. The sequences represent the 41 species listed in the table below.
ID | Species | Group |
---|---|---|
Acanthapomotis | Acantharchus pomotis | Non-cichlid |
Acropomjaponic | Acropoma japonicum | Non-cichlid |
Aeoliscstrigat | Aeoliscus strigatus | Non-cichlid |
Ambassispcxxxx | Ambassis sp. | Non-cichlid |
Aplochepanchax | Aplocheilus panchax | Non-cichlid |
Ateleopjaponic | Ateleopus japonicus | Non-cichlid |
Beryxxxsplende | Beryx splendens | Non-cichlid |
Cichlaxtemensi | Cichla temensis | Neotropical cichlids |
Chloropagassiz | Chlorophthalmus agassizi | Non-cichlid |
Danioxxrerioxx | Danio rerio | Non-cichlid |
Ectodusdescamp | Ectodus descampsii | African cichlids |
Etroplumaculat | Etroplus maculatus | Indian cichlids |
Gasteroaculeat | Gasterosteus aculeatus | Non-cichlid |
Geophagbrasili | Geophagus brasiliensis | Neotropical cichlids |
Herichtcyanogu | Herichthys cyanoguttatus | Neotropical cichlids |
Kurtusxgullive | Kurtus gulliveri | Non-cichlid |
Latesxxcalcari | Lates calcarifer | Non-cichlid |
Maylandzebraxx | Maylandia zebra | African cichlids |
Monocirpolyaca | Monocirrhus polyacanthus | Non-cichlid |
Mugilxxcephalu | Mugil cephalus | Non-cichlid |
Muraenomarmora | Muraenolepis marmorata | Non-cichlid |
Neolampbrichar | Neolamprologus brichardi | African cichlids |
Niphonxspinosu | Niphon spinosus | Non-cichlid |
Oncorhyketaxxx | Oncorhynchus keta | Non-cichlid |
Opistogaurifro | Opistognathus aurifrons | Non-cichlid |
Oreochrnilotic | Oreochromis niloticus | African cichlids |
Oryziaslatipes | Oryzias latipes | Non-cichlid |
Paratilpolleni | Paratilapia polleni | Malagasy cichlids |
Paretromaculat | Paretroplus maculatus | Malagasy cichlids |
Percalanovemac | Percalates novemaculeata | Non-cichlid |
Percichtruchax | Percichthys trucha | Non-cichlid |
Percopstransmo | Percopsis transmontana | Non-cichlid |
Polymixjaponic | Polymixia japonica | Non-cichlid |
Ptychocgrandid | Ptychochromis grandidieri | Malagasy cichlids |
Pundaminyerere | Pundamilia nyererei | African cichlids |
Synbranmarmora | Synbranchus marmoratus | Non-cichlid |
Tautogoadspers | Tautogolabrus adspersus | Non-cichlid |
Triacananomalu | Triacanthodes anomalus | Non-cichlid |
Tylochrpolylep | Tylochromis polylepis | African cichlids |
Zancluscornutu | Zanclus cornutus | Non-cichlid |
Zeusxxxfaberxx | Zeus faber | Non-cichlid |
-
MAFFT: Installation instructions and precompiled versions of MAFFT are available on the MAFFT webpage. While the installation of the program should be easy on all operating systems, all steps of this tutorial can also be conducted using the server version of MAFFT; installation of this software is therefore optional.
-
AliView: To visualize sequence alignments, the software AliView (Larsson 2014) is recommended. The installation of AliView is described at http://www.ormbunkar.se/aliview/ and should be possible on all operating systems.
-
BMGE: The program BMGE (Block Mapping and Gathering with Entropy) (Criscuolo and Gribaldo 2010) is useful to identify and remove poorly aligned regions of sequence alignments. The latest version of BMGE is provided as a Java jar file at ftp://ftp.pasteur.fr/pub/gensoft/projects/BMGE/ (choose login as guest to access the ftp server). Place this file in a convenient location on your analysis directory.
We'll start by aligning sequences of the mitochondrial 16S gene with the program MAFFT and we will visualize and improve the alignment using the software AliView.
-
Download the file
16s.fasta
containing 16S sequences to your analysis directory. Have a look at the file in a text editor, or on the command line, using, for example, theless
command:less 16s.fasta
You'll see that each record consists of an ID and a sequence, of which the ID is always on a single line that starts with a ">" symbol, followed by lines containing the sequence. The sequences are not aligned yet; this is the reason why they contain no gaps and differ in length. Instead of the 14-character IDs used in this file, other naming schemes could be applied; however, I strongly recommend the use of short and simple IDs because in phylogenetic analyses, many programs or scripts may not work if you use actual latin or common species names that contain spaces or hyphens.
-
Open the website for the online version of MAFFT. This site provides a web interface to the MAFFT alignment program (Katoh et al. 2017). Instead of using the website, you may also use MAFFT on your computer if you succeeded in installing it.
-
Under the heading "Advanced settings" on the MAFFT server website (scroll down to see it) you'll find the available alignment options. In the first gray box with the title "Strategy" you can choose between global and local alignment methods. The "G-INS-i" method implements the global Needleman-Wunsch algorithm (Needleman and Wunsch 1970), the "L-INS-i" method implements the local "Smith-Waterman" algorithm (Smith and Waterman 1981). For simplicity, keep the default "Auto" option. If you use the command-line version of MAFFT on your own computer instead of the MAFFT server, the equivalent command would be this:
mafft --auto 16s.fasta > 16s_aln.fasta
-
In the third gray box in the "Advanced settings" section, which has the title "Parameters", you could change the scoring matrix. For amino-acid sequences you could choose any of the BLOSUM matrices which are equivalent to PAM matrices. For nucleotide sequences, one can choose among "1PAM / K=2", "20PAM / K=2", and "200PAM / K=2". It's unclear to me why they list PAM matrices here also for nucleotide sequences and this is also not explained in their paper or the manual, but at least the MAFFT manual makes it clear that the K (kappa) parameter represents the ratio of scores for transitions and transversions. For now, keep all default options. Click the "Submit" button.
-
After a few seconds, the alignment should be ready, and a new page will open that shows the alignment in HTML format.
Question 1: Can you find out which of the MAFFT alignment algorithms (FFT-NS-1, FFT-NS-2, G-INS-1, FFT-NS-i, E-INS-i, L-INS-i, G-INS-i, or Q-INS-i) has been used? (see answer)
-
Download the alignment in Fasta format to your computer. To do so, right-click the link to "Fasta format" at the very top of the page. Name the file
16s_aln.fasta
. -
Repeat the same, this time with a gap-opening penalty of 2 instead of the default value of 1.53. Name the alignment file resulting from this analysis
16s_op2_aln.fasta
. If you use the command-line version of MAFFT, the equivalent command would be this:mafft --auto --op 2 16s.fasta > 16s_op2_aln.fasta
-
Open file
16s_aln.fasta
in AliView. -
To get an overview of the alignment, zoom in and out using the
+
and-
keys on your keyboard (you may also have to hold the shift key). -
Without closing the AliView window, open the file
16s_op2_aln.fasta
in a second AliView window. Compare the total alignment lengths shown in the status bar at the bottom right.Question 2: Which alignment is longer? (see answer)
-
In both AliView windows, scroll to the region between positions 1250 and 1350.
Question 3: What main difference between the alignments do you notice? (see answer)
-
Close the window for file
16s_op2_aln.fasta
. -
In the window for
16s_aln.fasta
, identify a poorly aligned region (e.g. around positions 1020 to 1040) and try to realign it. To do so, select the region by clicking in the ruler at the top of the alignment as shown in the below screenshot.
-
With the poorly aligned region selected, click on "Realign selected block" in AliView's "Align" menu.
Question 4: Does the alignment in this region look better now? (see answer)
As you can see, the alignment of 16S sequences contains a mix of highly variable as well as conserved regions. Thus, the homology of nucleotides is rather obvious in some parts of the gene but can be ambiguous in other parts. To avoid issues resulting from alignment errors in the downstream phylogenetic analyses, we will identify poorly aligned regions based on the proportion of gaps and the genetic variation found within these regions, and we will exclude them from the alignment.
-
To exclude unreliably aligned regions from the 16S alignment, use the software BMGE. To check if the program works on your computer, and to see the available options, open a command-line window (e.g. the Terminal application on Mac OSX) and type the following command:
java -jar BMGE.jar -?
-
If the above worked, type this command:
java -jar BMGE.jar -i 16s_aln.fasta -t DNA -of 16s_filtered.fasta -oh 16s_filtered.html
With the above command, BMGE writes a filtered alignment in Fasta format in file
16s_filtered.fasta
, and visualizes the filtered alignment in HTML format in file16s_filtered.html
. -
Open file
16s_filtered.html
in your browser. Scroll through the alignment and note the black alignment blocks. -
At the very top of the alignment, you'll see two values plotted for each site in light gray and black. The gap proportion is shown with light gray equal signs and ranges from 0 to 1. Black colons indicate what the authors of BMGE call a "smoothed entropy-like score" (Criscuolo and Gribaldo 2010). Basically, this is a measure of the nucleotide diversity at this site (see Criscuolo and Gribaldo 2010 for more information). You'll note that the black alignment blocks coincide with regions of low gap proportion and low entropy, which are the most suitable alignment positions for phylogenetic inference. Our selection of alignment blocks is based on default settings of BMGE for the entropy score cut-off (option
-h
), the gap-rate cut-off (-g
), and the minimum block size (-b
). By default, BMGE selects sites with an entropy score below 0.5 (-h 0.5
) and a gap proportion below 0.2 (-g 0.2
), and only if these form a block of at least 5 sites with these properties (-b 5
). -
Repeat the BMGE block selection with custom settings for entropy-score cut-off, gap-rate cut-off, and minimum block size, and note how this changes the overall number of selected sites and the distribution of selected blocks in the alignment. For example, increase the allowed proportion of gaps using
-g 0.3
:java -jar BMGE.jar -i 16s_aln.fasta -t DNA -g 0.3 -of 16s_g03_filtered.fasta -oh 16s_g03_filtered.html
-
The standard output of BMGE to the Terminal tells you how many sites (characters) remain selected. Note the difference between the last two runs.
-
In addition to file
16s_filtered.html
, also open file16s_g03_filtered.html
in a separate browser window. Scroll through the alignment. You'll note that many more regions are now marked in black due to the increase in the allowed proportion of gaps per site. -
Open the file
16s_filtered.fasta
in AliView. Note that it is now shorter and appears much more compressed than the previous alignment. Save the file in Phylip format as16s_filtered.phy
, using the "Save as Phylip (full names & padded)" option in AliView's "File" menu. Also save the file in Nexus format as16s_filtered.nex
with the "Save as Nexus" option. -
Open the Phylip and Nexus files in a text editor to see the differences between the file formats.
As mentioned above, the sequence data used in this tutorial are part of the dataset of Matschiner et al. (2017). While some of the sequence data used in that study were generated by us, most sequences were taken from publicly available databases such as GenBank and BOLD (Barcode of Life Data System, which in some cases holds sequences of the mitochondrial COI gene that are not available on GenBank). The second set of sequences used in this tutorial, which you will find in file rag1.fasta
, consists of sequences for the nuclear RAG1 gene; however, compared to the set of 16S sequences, the three species listed below are not yet represented.
ID | Species | Group |
---|---|---|
Cichlaxtemensi | Cichla temensis | Neotropical cichlids |
Geophagbrasili | Geophagus brasiliensis | Neotropical cichlids |
Herichtcyanogu | Herichthys cyanoguttatus | Neotropical cichlids |
- To fill in these missing sequence data you can retrieve homologous RAG1 sequences for Neotropical cichlids from GenBank, using the sequence of an African cichlid species as a query in a BLAST search for the most closely matching sequence in the database. I suggest to use the 16S sequence of Oreochromis niloticus (ID "Oreochrnilotic") from file
rag1.fasta
as the query. On the BLAST webpage, you can use the "Organism" field to limit the search only to sequences of a particular species. Retrieve the longest sequences for each of the three Neotropical cichlid species listed above.
- Add the three retrieved sequences to file
rag1.fasta
, and rename the sequences, giving them the IDs "Cichlaxtemensi", "Geophagbrasili", and "Herichtcyanogu".
If anything should go wrong in this part of the tutorial, you can find a complete version of the RAG1 dataset that already includes the sequences of the three Neotropical cichlid species in file rag1_combined.fasta
.
For protein-coding sequences, it can often be useful to identify the reading frame (i.e. which positions of the alignment correspond to first, second, and third codon positions) and the translation of the nucleotide code into amino-acid sequences. As insertions or deletions ("indels") with lengths that are not multiples of three would disrupt the reading frame and thus cause large changes to the protein structure, these are usually strongly selected against and rarely found in alignments of protein-coding sequences. Similarly, indels observed in protein-coding sequences often begin with a first codon position and end with a third codon position. Thus, in cases where the placement of indels is ambiguous, information about the reading frame can be used to optimize the positioning of these indels.
-
Open again the website for the online version of MAFFT if the program could not be installed on your computer.
-
Upload the file
rag1.fasta
, including the three sequences that you added in the previous part of this tutorial, to the MAFFT server. Keep all default options and click the "Submit" button. If you use the command-line version of MAFFT on your own computer, execute the following command:mafft --auto rag1.fasta > rag1_aln.fasta
-
When the alignment is ready, find out which of MAFFT's algorithms was used this time.
-
Download the alignment in Fasta format to your computer and name the file
rag1_aln.fasta
. -
Open file
rag1_aln.fasta
in AliView. -
Click the fourth icon from the left in the tool bar to translate nucleotide sequence to amino acids (or use the Command-T key combination on a Mac).
-
Click the sigma-sign icon in the center of the tool bar to count stop codons, as in the screenshot shown below.
-
Change the reading frame using the box to the right of the sigma-sign icon and also count stop codons for the second and the third reading frame. Note how the overall color pattern of the alignment changes when you select different reading frames.
Question 5: Does one reading frame appear more conserved than the other two? (see answer) Use this reading frame from now on.
-
Notice how the reading frame is interrupted between positions 1980 and 2000 by two insertions in the sequence of Muraenolepis marmorata. This means that both insertions would have had to evolve at the same time, otherwise the change of the reading frame would have affected the entire rest of the protein in Muraenolepis marmorata, for the time between the first and the second insertion event. A more likely explanation may be that only a single insertion occurred and that the other is an artifact resulting from misalignment.
-
Select the region from position 1981 to 1998, again by clicking in the ruler above the alignment. Click on "Realign selected block" in AliView's "Align" menu. This should result in an alternative alignment of the region, with only a single insertion from position 1989 to position 1991.
-
The precise location of the insertion may still be a bit ambiguous. Therefore, cut out the six base pairs from position 1987 to position 1992. To do so, select these six base pairs by clicking in the ruler above the alignment, and choose "Delete selected" from AliView's "Edit" menu.
-
Also remove the regions with many gaps at the end and the beginning of the alignment. Obviously, these gaps do not result from insertions or deletions, but from missing data. At the end of the alignment, trim all positions after (and including) position 2998. At the beginning of the alignment, trim all positions before (and including) position 1629. The alignment should then start with a codon that is "GGT" or "GGC" for most species, and it should end with a codon that is "AAA" or "AAG" for most species. It should also include not a single stop codon anymore. Verify if this is the case by again clicking on the sigma-sign icon at the center of the tool bar.
-
Save the alignment in Fasta, Phylip, and Nexus format. Use the names
rag1_filtered.fasta
,rag1_filtered.phy
, andrag1_filtered.nex
. These will be used again in tutorial Substitution Model Selection.
- Question 1: As specified directly above the first gray box and also near the end of the page, the accurate but comparatively slow "L-INS-i" algorithm for local alignment with the Smith-Waterman algorithm was used.
- Question 2: The alignment in file
16s_aln.fasta
should be 1,839 bp long while that in file16s_op2_aln.fasta
should be slightly shorter with 1,832 bp. This is to be expected when the penalty for gap openings is increased.
- Question 3: More indels were inferred in file
16s_aln.fasta
compared to file16s_op2_aln.fasta
. In both versions, this part of the alignment appears to be unreliable.
- Question 4: Even after realigning, I would consider the region between positions 1,020 and 1,040 unreliably aligned even though it looks better afterwards.
- Question 5: Yes, while only a single stop codon is found with the first reading frame, 892 and 1,214 stop codons should be detected with the second and third reading frames, respectively.