Full article access https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10761335/
✨ Precision: Nail the classification of closely related viral strains and recombinants.
✨ Speed: Rapid analysis and classification of large N viral genome datasets.
✨ Needs low-resource settings.
In the world of viral genomics, CGRphylo stands out by accurately classifying closely related viral strains, including tricky recombinants. Imagine its power during epidemic outbreaks, where thousands of viral sequences overwhelm resources. CGRphylo is the superhero pipeline that steps in – efficient, accessible, and designed for both high and low-resource settings.
🚀 CGRphylo processed 69 SARS-CoV-2 genomes 5 times faster than Clustal-Omega. 🌐 But wait, there's more! For a dataset of 106 genomes, CGRphylo outpaced Clustal-Omega by an incredible 13.7 times.
In the world of MSAs (Clustal-Omega), computational costs skyrocket as datasets grow. Not for CGRphylo! Adding a sequence is a breeze – just one frequency matrix calculation, breaking free from the computational intensity that others face. For MSA, the computational cost increases as additional sequences require pairwise comparisons with all other sequences, making the process more computationally intensive as the dataset grows.
Thind Singh Amarinder and Sinha Somdatta*, Using Chaos-Game-Representation for Analysing the SARS-CoV-2 Lineages, Newly Emerging Strains and Recombinants, Current Genomics 2023; 24 (3) . https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10761335/
Line-to-line Rscript is available in cgrPhlyo.r (and CGRphylo.rmd) script. You can find out what to aspect in the o/p by following cgrPhlyo.pdf (or CGRphylo.html).
Input required is a set of two or more genome sequences in FASTA format. The other input required by the user is “word length (K value)” for which the frequencies of all the words in the sequences are calculated. Users can also specify the Out-group for the construction of the Neighbor-Joining Tree.
These Sequences are then used to calculate frequencies of words at user-specified word lengths. , each sequence gets its own frequency matrices. There is also an option to visualize the CGR plots. Based on these word frequencies the pair-wise distance (Euclidean-default, or Square Euclidean, or Manhattan) is calculated between all pairs of sequences in input data. Package integrate other packages where distance matrices can be converted to NJ tree or other and visualized. We provide the option to convert distance matrix into MEGA and phylip distance formats, which is a widely used program for visualization.
The DNA sequences to be analyzed can be uploaded in the form fasta file. All the sequences must be in FASTA format. Check the example file (Input_recom_SARS_cov2.fasta) for details.
setwd(".")
source('cgat_function.r')
source('distances_n_other.r')
source('cgrplot.r')
file <- "Input_recom_SARS_cov2.fasta"
library("seqinr")
fastafile <- seqinr::read.fasta(file = file, seqtype = "DNA", as.string = TRUE, set.attributes = FALSE)
library(stringr)
N_filter <- 50 ## filter sequence with n bases > this value
fasta_filtered <- fastafile_new(fastafile, N_filter) ## create filtered sequence file
#write fasta file from filtered sequences
seqinr::write.fasta(sequences=fasta_filtered,names =names(fasta_filtered),file.out=paste("recombinant_XBB.1_Filter",N_filter,".fasta",sep = ''))
If not interested in exploring, Box/CGR plots etc; proceed directly for section ##Create frequency object for sequences for specific "Word Length"
create_meta
function extracts various types of information from the sequences and stores them into data frame.
meta <- create_meta(fastafile, N_filter) ## create seq features information
print(paste("std dev for seq length is",sd(meta$length),sep=" "))
print(paste("Median of the seq length is",median(meta$length),sep=" "))
print("Range of the seq length")
range(meta$length)
#boxplot(meta$length, ylab="Sequence length") ## overall
dotchart(meta$length, labels = meta$name, xlab = "Sequence length", pch = 21, bg = "green", pt.cex = 1, cex = 0.7)
dotchart(meta$GC_content, labels = meta$name, xlab = "GC content", pch = 21, bg = "green", pt.cex = 1, cex = 0.7)
# In this example the first part of the sequence name {i.e. beforere_ } is the strain name.
meta$strains <- as.character(lapply(meta$name, function(x) strsplit(x, '_')[[1]][1])) ## split strains names
library(ggplot2)
ggplot(meta, aes(x = strains, y =length, color = strains )) + geom_boxplot()+ ylab("Sequence length (group level)")+coord_flip()
#boxplot(log2(meta$length+1))
len_trim <- min(meta$length)
## Save meta
#write.csv(meta,paste("length_and_names",N_filter,".csv"))
CGRs for each sequence can be visualized by selecting the sequence. cgrplot
function creates the 'x' and 'y' coordinates for each base pair (to plot on CRG plot).
source('cgrplot.r') ## Load CGR plot function
cgr1 <- cgrplot(1) ## enter the number of sequence from "fasta_filtered"
plot(cgr1[,1],cgr1[,2], main=paste("CGR plot of", names(fasta_filtered)[1],sep=''),
xlab = "", ylab = "", cex=0.2, pch = 4, frame = TRUE)
#compare 2 CGR plots
cgr2 <- cgrplot(4)
library('RColorBrewer')
pal <- brewer.pal(8,"Dark2")
par(mfrow=c(1,2))
plot(cgr1[,1],cgr1[,2], main=paste("CGR plot of ", names(fasta_filtered)[1],sep=''),
xlab = "", ylab = "", cex.main=0.5, cex=0.4, pch = 4, frame = TRUE)
plot(cgr2[,1],cgr2[,2], main=paste("CGR plot of ", names(fasta_filtered)[2],sep=''),
xlab = "", ylab = "",cex.main=0.5,cex=0.4,pch = 4, frame = TRUE)
The clustering of the sequences is based on the distances calculated from the frequencies of DNA words. The word length to be used for the calculation can be specified. This default word length used is 6. cgat
function does this job.
k_mer <- 6 ## define the value of K
Freq_mat_obj <- list()
sequence_new <- names(fasta_filtered)
for(n in 1:length(fasta_filtered)) {
##skip passing whole data to the function ##increase speed ## "fasta_filtered" name is Locked.
Freq_mat_obj[[n]] <- cgat(k_mer,n, len_trim) # executing one seq at a time #k-mer,seq_length,trimmed_length
print(paste("processing sequence : ",n , sequence_new[n], sep=" "))
}
names(Freq_mat_obj) <- sequence_new
Users can use any of Euclidean (default), Square Euclidean, or Manhattan distance for the distance matrices. matrixDistance
function takes inputs in the following way for distance calculations.
j <- length(sequence_new)
distance <- matrix(0,j,j) ## defining the empty matrix from distance calculations
row.names(distance) <- sequence_new[1:j]
colnames(distance) <- sequence_new[1:j]
for(n in 1:j) {
for(cr in 1:j) {
distance[n,cr] <- matrixDistance(Freq_mat_obj[[n]],Freq_mat_obj[[cr]],distance_type ='Euclidean' ) ##'Euclidean','S_Euclidean' and Manhattan
}}
distance <- round(distance, 5)
distance[1:5,1:3]
#plot(hclust(as.dist(distance)))
For tree visualization with third-party tools, results can be saved into Mega or Phylip format using saveMegaDistance
and savePhylipDistance
respectively.
Mega_file_name <- paste('Recombi_N_filtering_',N_filter,'_mega_distance_file_for_k_',k_mer,'.meg')
### save
saveMegaDistance(Mega_file_name, distance) ## inputs are file name (with path) and distance
PhylipFile_name <- paste('Phylip_N_filtering_',N_filter,'distances_k_',k_mer,'.txt')
## inputs are file name (with path) and distance
savePhylipDistance(PhylipFile_name, distance, mode= 'original') ## relaxed or original/other
##typical phylip allows 10 charcter for texa name ##new relaxed formart allows 250
## http://www.phylo.org/index.php/help/relaxed_phylip
Distance Matrix (as shown below) contains a pairwise distance matrix generated using input sequences. By default, CGRphylo uses the Euclidean distance method to calculate pairwise distances between multiple whole genome sequences.
Creating and Saving tree into Newick, Nexus format (pipeline use ape
and treeio
Bioconductor package)
Outtree, users can save outtree files created by the NJ method of Bioconductor package 'ape'. These files contain information about trees generated in standard NEWICK format or other. These files are compatible with standard bioinformatics tools like TreeView, MEGA, etc. to create, view, edit, and customize trees.
tree <-write.tree(my_nj, file = "", append = FALSE, digits = 10, tree.names = FALSE)
ape::write.tree(my_nj, file='Newick_NJ_tree.txt') #for Newick format #write out trees in a text file
ape::write.nexus(my_nj, file='Nexus_NJ_tree.nex') ##for Nexus format
## reading tree from file
newick.tree <- ape::read.tree("Newick_NJ_tree.txt")
nexus.tree <- ape::read.nexus("Nexus_NJ_tree.nex")
Fig: Cladogram of the recombinant XBB of Omicron sub lineages (BJ.1 and BA.2.7.5) and its parental sequences (A) k=5, (B) k=6, (C) Multiple sequence alignment (MSA) and NJ clustering, and (D) MSA-Maximum likelihood method.
Fig Cladogram of 8 different strains of SARS Cov-2 using - (A) k=4, (B) k=5, (C) k=6, (D) k=7, (E) Multiple sequence alignment (MSA) and NJ clustering, and (F) MSA-Maximum likelihood method.
Chaos Game Representation (CGR) is an iterative mapping technique to construct a two-dimensional representation of genomic sequences (Jeffrey, 1990). CGRs have been conventionally used to visualize large nucleotide sequences. However, apart from visualization, CGRs can be used to compare DNA sequences, construct cladograms and address various biological problems.
It efficiently classifies sequences based on both inter-species and intra-species variation in a computationally less intense manner. It analyses whole genome variations using an alignment-free and scale-invariant method resulting in trees that can be used to interpret similarity between multiple whole genome sequences, even when they are closely related.
Word Frequency is the frequency of all different k-letter words corresponding to the CGR map. The following figure shows various K-letter words (above) and their calculated frequencies (below) at k=3
Another fascinating property of the CGR plot is its fractal nature. The iterative process of plotting points on the CGR plot creates intricate and self-similar patterns at different scales. This means that each square box in the plot contains a smaller version of the entire plot, exhibiting similarity to the overall pattern. This characteristic of self-replication is typical of fractals, complex geometric structures that reveal repeating patterns at various levels of magnification.
We acknowledge the National Network for Mathematical and Computational Biology (NNMCB), DST, India for the internship programme at IISER Mohali for the initial part of the project.