GeomScale · SotirisTouliopoulos · Aug 24, 2024 · Aug 24, 2024 · Aug 24, 2024 · Aug 24, 2024
diff --git a/_posts/2024-08-24-GSoC2024-dingo.md b/_posts/2024-08-24-GSoC2024-dingo.md
@@ -0,0 +1,230 @@
+---
+layout: single
+title: "Enhancing dingo: from metabolic models reduction to pathways prediction"
+date: 2024-08-24
+author: Sotiris Touliopoulos
+author_profile: true
+read_time: true
+comments: true
+share: true
+related: true
+hidden: false
+---
+
+
+# Enhancing dingo: from metabolic models reduction to pathways prediction
+
+
+<div style="text-align:center;">
+  <br>
+  <b>Sotiris Touliopoulos</b>
+  <br>
+  <b>Contributor to Google Summer of Code 2024 with GeomScale</b>
+  <br>
+</div>
+
+
+> #### Mentors of this project: Vissarion Fisikopoulos, Apostolos Chalkis and Haris Zafeiropoulos.
+
+
+## Goal of the project
+
+The goal of this project is to enhance dingo by incorporating pre- and post-sampling features to leverage the increased statistical value of flux sampling.
+
+A preprocessing class, which reduces metabolic networks by removing specific reactions, is integrated into dingo, along with a post-processing function that calculates correlation metrics from pairwise reaction fluxes.
+
+Additionally, functions for clustering and generating graphs from the correlation matrix are implemented to group reactions and predict metabolic pathways.
+
+Separate functions for visualizing matrices, dendrograms, and graphs are also provided.
+
+
+## Preprocess for metabolic models reduction
+
+- Large metabolic models contain numerous reactions and metabolites. Sampling the flux space of such models requires a significant computational time due to the high dimensionsionality of the convex polytope. 
+- Model preprocessing can mitigate this issue by removing specific reactions, thereby reducing the dimensional space and computational time.
+- Reduced models have decreased complexity and can help researchers better understand basic principles of metabolism. The reduction process is automated and unbiased when a specific objective function is set.
+
+When an object is initialized from the `PreProcess` class, users can call the `reduce` function on this object to remove three types of reactions:
+
+- Blocked reactions: cannot carry a flux in any condition.
+- Zero-flux reactions: cannot carry a flux while maintaining at least 90% of the maximum growth rate.
+- Metabolically less-efficient reactions: require a reduction in growth rate if used <a href="#ref-1">[1]</a>.
+
+Users can choose to remove an additional set of reactions, by setting the `extend` parameter  of the `reduce` function to `True`. These reactions do not affect the value of the objective function when removed.
+
+Code that illustrates how to use this class:
+```python
+cobra_model = load_json_model("ext_data/e_coli_core.json")
+
+obj = PreProcess(cobra_model, 
+                 # tolerance value to identify significant flux changes
+                 tol = 1e-6, 
+                 # boolean variable that define whether to open all exchange reactions to very high flux ranges.
+                 open_exchanges = False)
+
+removed_reactions, reduced_dingo_model = obj.reduce(extend = False)
+```
+
+Reduction with the `PreProcess` class has been tested with various models from the BiGG database <a href="#ref-2">[2]</a>. Figures below show the number of remained reactions, after applying the `reduce` function:
+
+<center>
+    <img src="https://sotiristouliopoulos.github.io/dingo/img/reduction_results_plot.png"><br><br>
+    <img src="https://sotiristouliopoulos.github.io/dingo/img/reduction_results_plot2.png"><br><br>
+</center>
+
+This process significantly reduces model complexity. In some cases, it yields core models, which are useful for examining essential metabolic pathways.
+
+
+## Identification of correlated reactions
+
+- Reactions in biochemical pathways can be positively correlated, negatively correlated, or uncorrelated.
+- Positive correlation means that if reaction A is active, then reaction B is also active and vice versa.
+- Negative correlation means that if reaction A is active, then reaction B is inactive and vice versa.
+- Zero correlation means that the status of reaction A is independent of the status of reaction B and vice versa.
+
+Users can sample the flux space of a selected metabolic model using dingo's `PolytopeSampler` class. Then, the `correlated_reactions` function can help them identify possible correlations in pairs of reactions. This function generates a correlation matrix based on the pearson correlation coefficient of pairwise reactions fluxes. 
+
+Users can apply cutoff values to filter the matrix, replacing all values that do not meet the threshold with 0. For instance, if a pearson cutoff is set to 0.80, all correlation values below this are replaced with 0. However, pearson is not the only available correlation metric.
+
+Users can also filter correlated reactions using copulas. If you're unfamiliar, copulas allow you to examine how two random variables (in this case, reaction fluxes) are related. You can read more about them [here](https://waterprogramming.wordpress.com/2017/11/11/an-introduction-to-copulas/). Below is an example plot of a copula showing the relationship between two random variables:
+
+![](https://www.researchgate.net/publication/362369405/figure/fig4/AS:11431281085484712@1663777261143/Distribution-of-five-types-of-copulas-a-PDF-of-Gaussian-Copula-b-PDF-of-t-Copula.jpg)
+
+Copulas are a useful statistical tool for validating significant correlations. However, manually inspecting copula plots can introduce user bias and be time-consuming given the number of paired reactions. An alternative to manual curation is using a metric called the "indicator," which sums the probability values across each copula's diagonal and divides them. The value of the indicator, whether positive or negative, helps validate or reject a possible correlation. Users can set an indicator cutoff that functions similarly to the pearson cutoff.
+
+In addition to calculating correlations, a `plot_corr_matrix` function is available for visualizing the correlation matrix as a heatmap. This visualization helps users easily identify significant correlations.
+
+Code that illustrates how to use these functions:
+```python
+dingo_model = MetabolicNetwork.from_json("ext_data/e_coli_core.json")
+
+sampler = PolytopeSampler(dingo_model)
+steady_states = sampler.generate_steady_states()
+
+corr_matrix, indicator_dict = correlated_reactions(
+                              steady_states,
+                              pearson_cutoff = 0.99,
+                              indicator_cutoff = 2, 
+                              # number of cells to compute the copula
+                              cells = 10,
+                              # value that defines the width of the copula’s diagonal
+                              cop_coeff = 0.3,
+                              # boolean variable that when True, keeps only the lower triangular matrix
+                              lower_triangle = False)
+
+plot_corr_matrix(corr_matrix,
+                 # list containing reactions’ names
+                 reactions,
+                 # parameter that defines image saving format
+                 format = "svg")
+```
+
+The `e_coli_core.json` model we use represents the core metabolism of the E. coli. You can find the model and related data [here](http://bigg.ucsd.edu/models/e_coli_core).
+
+Here’s an example of a heatmap created from a symmetrical correlation matrix without cutoffs:
+<center>
+    <img src="https://sotiristouliopoulos.github.io/dingo/img/corr_matrix_no_cutoffs.png"><br><br>
+</center>
+
+We can focus on correlations from reactions that belong to the glycolytic and pentose pathways to get some insights from the heatmap. If you're unfamiliar with these reactions' topology, you can visit [ESCHER](https://escher.github.io/#/) <a href="#ref-3">[3]</a> and load the core metabolism map of E. coli.
+
+We observe strong pairwise correlations among all reactions in the pentose phosphate pathway (`G6PDH2R, PGL, GND, RPE, RPI, TKT1, TALA, TKT2`). However, for the glycolytic pathway (`PGI, PFK, FBA, TPI, GAPD, PGK, PGM, ENO, PYK`), two reactions, `PFK` and `PYK`, show decreased correlation with the others. 
+
+From the literature, it is known that `PFK` and `PYK` are catalyzed by enzymes that regulate glycolysis. These enzymes do not contribute to gluconeogenesis (the reverse pathway of glycolysis) because their reactions are unidirectional. When gluconeogenesis occurs, these enzymes have little or no activity, while other reactions function in reverse. This explains their lower correlation compared to other glycolytic reactions.
+
+The gluconeogenic reactions that convert the products of `PFK` and `PYK` back into their substrates are `FBP` and `PPS`, respectively. These reactions may become active when there is an excess of `D-fructose-1,6-bisphosphate` or `pyruvate`, and the model must reduce their concentrations to achieve a steady-state condition.
+
+This example shows how examining the correlation matrix can help researchers study metabolic pathways. However, this process can be challenging in genome-scale models due to the large number of reactions and pathways. Clustering and graph analysis can help automate this process.
+
+
+## Clustering of the correlation matrix
+
+- Clustering based on a dissimilarity matrix reveals groups of reactions with similar correlation values across the matrix.
+- Reactions within the same cluster may contribute to the same pathways.
+
+The `cluster_corr_reactions` function hierarchically clusters the correlation matrix, and the `plot_dendrogram` function plots the resulting dendrogram.
+
+Code that illustrates how to use these functions:
+```python
+dissimilarity_matrix, labels, clusters = cluster_corr_reactions(
+                                         corr_matrix,
+                                         # list containing reactions’ names
+                                         reactions,
+                                         # defines the type of clustering linkage (options: single, average, complete, ward)
+                                         linkage = "ward",
+                                         # defines a threshold to cut the dendrogram at a specific height
+                                         t = 10.0,
+                                         # A boolean variable; if True, the dissimilarity matrix is calculated by subtracting absolute values from 1
+                                         correction = True)
+
+plot_dendrogram(dissimilarity_matrix,
+                reactions,
+                # specifies whether reaction names will appear on the x-axis
+                plot_labels = True,
+                t = 10.0,
+                linkage = "ward")
+```
+
+Here is an example of a dendrogram created from a correlation matrix with a pearson cutoff of 0.9999:
+<center>
+    <img src="https://sotiristouliopoulos.github.io/dingo/img/dendrogram_pearson.png"><br><br>
+</center>
+
+We observe well-separated and distinct clusters. Some clusters contain reactions from the glycolytic and pentose phosphate pathways, as discussed earlier. Additionally, there are clusters with reactions from the citric acid cycle and other lesser-known pathways.
+
+In the cluster representing the pentose phosphate pathway, we notice that `PGI`, a glycolytic reaction, is included. This is likely because it shares a common metabolite (`D-glucose-6-phosphate`) with `G6PDH2r`, the first reaction in the pentose phosphate pathway.
+
+However, applying a stricter pearson cutoff (e.g., 0.99999) removes `PGI` from this cluster, leaving only the reactions of the pentose phosphate pathway.
+
+This is an early but promising indication that clustering the correlation matrix can reveal groups of reactions that contribute to the same pathways.
+
+In the future, other clustering methods such as `k-means`, `HDBSCAN`, and `biclustering` at the sample level will be explored to identify more specific clusters. Let me briefly explain: Some pathways have distinct phases, each serving a unique role in metabolism. Both glycolysis and the pentose phosphate pathways can be divided into two phases.
+
+For example, the pentose phosphate pathway has an oxidative and a non-oxidative phase. Reactions such as `G6PDH2R`, `PGL`, and `GND` belong to the oxidative phase, which produces `NADPH`. Hierarchical clustering may not distinguish these phases, but using alternative methods not solely based on the correlation matrix could reveal two distinct clusters representing the two phases.
+
+
+## Graphs from the correlation matrix
+
+- Creating graphs can reveal networks of correlated reactions, which may correspond to metabolic pathways.
+- These graphs are generated from the correlation matrix, where reactions are represented as nodes and correlation coefficients as edges.
+
+The `graph_corr_matrix` function generates graphs from the correlation matrix, and the `plot_graph` function plots these graphs.
+
+Code that illustrates how to use these functions:
+```python
+graphs, layouts = graph_corr_matrix(corr_matrix,
+                                    reactions,
+                                    # boolean value; if True, transforms the correlation matrix values into absolute values
+                                    correction = True,
+                                    # nested list containing sublists of reactions within the same cluster, created by the `cluster_corr_reactions` function.
+                                    clusters = clusters)
+
+plot_graph(
+           # graph object returned by the graph_corr_matrix function
+           graph,
+           # layout corresponding to the given graph object 
+           layout)
+
+```
+Here’s an example of a graph created from a correlation matrix without cutoffs:
+<center>
+    <img src="https://sotiristouliopoulos.github.io/dingo/img/graph_no_cutoffs.png"><br><br>
+</center>
+
+The width of each edge represents the correlation strength, scaled from 0 to 1. In addition to the initial graph, users can create and plot subgraphs that represent reaction subnetworks. These subnetworks can serve as supplementary information to clustering results, allowing researchers to identify similarities and differences between the two methods and adjust the structure of desired clusters accordingly.
+
+However, the insights gained from graphs extend beyond pathway prediction. The more connected a node (reaction) is, the more essential its role likely is in metabolism. Conversely, nodes with fewer connections may represent reactions that can be removed without significantly affecting the system.
+
+
+## Conclusion
+
+Flux sampling offers significant statistical value, enhancing research in metabolic models. In this blog, we explored the pre- and post-sampling features integrated into the dingo package. From model reduction to pathway prediction, these features are designed to assist researchers in studying fundamental aspects of metabolic function.
+
+
+## References
+
+- <a id="ref-1"></a> [1] Lewis NE, Hixson KK, Conrad TM, et al. Omic data from evolved E. coli are consistent with computed optimal growth from genome‐scale models. Molecular Systems Biology. 2010;6(1). doi:https://doi.org/10.1038/msb.2010.47
+‌
+- <a id="ref-2"></a> [2] King ZA, Lu J, Dräger A, et al. BiGG Models: A platform for integrating, standardizing and sharing genome-scale models. Nucleic Acids Research. 2015;44(D1):D515-D522. doi:https://doi.org/10.1093/nar/gkv1049
+‌
+- <a id="ref-3"></a> [3] King ZA, Dräger A, Ebrahim A, Sonnenschein N, Lewis NE, Palsson BO. Escher: A Web Application for Building, Sharing, and Embedding Data-Rich Visualizations of Biological Pathways. Gardner PP, ed. PLOS Computational Biology. 2015;11(8):e1004321. doi:https://doi.org/10.1371/journal.pcbi.1004321