-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GSoC 2024 dingo blog #35
base: gh-pages
Are you sure you want to change the base?
Changes from 16 commits
4321e4e
9ddeb06
5b3dd1a
f3c9572
5d2d8f0
d391d7f
a10d723
dbb44c1
14a5d4a
ff5bc0e
c612267
9dd10bd
2e6d556
6bd9d22
d262f46
32bb529
44433b1
9071e7c
c6cb069
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,230 @@ | ||
--- | ||
layout: single | ||
title: "Enhancing dingo: from metabolic models reduction to pathways prediction" | ||
date: 2024-08-24 | ||
author: Sotiris Touliopoulos | ||
author_profile: true | ||
read_time: true | ||
comments: true | ||
share: true | ||
related: true | ||
hidden: false | ||
--- | ||
|
||
|
||
# Enhancing dingo: from metabolic models reduction to pathways prediction | ||
|
||
|
||
<div style="text-align:center;"> | ||
<br> | ||
<b>Sotiris Touliopoulos</b> | ||
<br> | ||
<b>Contributor to Google Summer of Code 2024 with GeomScale</b> | ||
<br> | ||
</div> | ||
|
||
|
||
SotirisTouliopoulos marked this conversation as resolved.
Show resolved
Hide resolved
|
||
> #### Mentors of this project: Vissarion Fisikopoulos, Apostolos Chalkis and Haris Zafeiropoulos. | ||
|
||
|
||
## Goal of the project | ||
|
||
The goal of this project is to enhance dingo by incorporating pre- and post-sampling features to leverage the increased statistical value of flux sampling. | ||
SotirisTouliopoulos marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
A preprocessing class, which reduces metabolic networks by removing specific reactions, is integrated into dingo, along with a post-processing function that calculates correlation metrics from pairwise reaction fluxes. | ||
|
||
Additionally, functions for clustering and generating graphs from the correlation matrix are implemented to group reactions and predict metabolic pathways. | ||
|
||
Separate functions for visualizing matrices, dendrograms, and graphs are also provided. | ||
|
||
|
||
## Preprocess for metabolic models reduction | ||
|
||
- Large metabolic models contain numerous reactions and metabolites. Sampling the flux space of such models requires a significant computational time due to the high dimensionsionality of the convex polytope. | ||
SotirisTouliopoulos marked this conversation as resolved.
Show resolved
Hide resolved
|
||
- Model preprocessing can mitigate this issue by removing specific reactions, thereby reducing the dimensional space and computational time. | ||
- Reduced models have decreased complexity and can help researchers better understand basic principles of metabolism. The reduction process is automated and unbiased when a specific objective function is set. | ||
|
||
When an object is initialized from the `PreProcess` class, users can call the `reduce` function on this object to remove three types of reactions: | ||
SotirisTouliopoulos marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
- Blocked reactions: cannot carry a flux in any condition. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is the meaning of "carry a flux"? Is it standard? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, it is standard. It is mentioned in this article: https://doi.org/10.1038/msb.2010.47 |
||
- Zero-flux reactions: cannot carry a flux while maintaining at least 90% of the maximum growth rate. | ||
- Metabolically less-efficient reactions: require a reduction in growth rate if used <a href="#ref-1">[1]</a>. | ||
|
||
Users can choose to remove an additional set of reactions, by setting the `extend` parameter of the `reduce` function to `True`. These reactions do not affect the value of the objective function when removed. | ||
|
||
Code that illustrates how to use this class: | ||
```python | ||
cobra_model = load_json_model("ext_data/e_coli_core.json") | ||
SotirisTouliopoulos marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
obj = PreProcess(cobra_model, | ||
# tolerance value to identify significant flux changes | ||
tol = 1e-6, | ||
# boolean variable that define whether to open all exchange reactions to very high flux ranges. | ||
SotirisTouliopoulos marked this conversation as resolved.
Show resolved
Hide resolved
|
||
open_exchanges = False) | ||
|
||
removed_reactions, reduced_dingo_model = obj.reduce(extend = False) | ||
``` | ||
|
||
Reduction with the `PreProcess` class has been tested with various models from the BiGG database <a href="#ref-2">[2]</a>. Figures below show the number of remained reactions, after applying the `reduce` function: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there a reason to have two instead of one plot? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For better visibility. Lines overlapped when all reactions were in a single plot |
||
|
||
<center> | ||
<img src="https://sotiristouliopoulos.github.io/dingo/img/reduction_results_plot.png"><br><br> | ||
<img src="https://sotiristouliopoulos.github.io/dingo/img/reduction_results_plot2.png"><br><br> | ||
</center> | ||
|
||
This process significantly reduces model complexity. In some cases, it yields core models, which are useful for examining essential metabolic pathways. | ||
|
||
|
||
## Identification of correlated reactions | ||
|
||
- Reactions in biochemical pathways can be positively correlated, negatively correlated, or uncorrelated. | ||
- Positive correlation means that if reaction A is active, then reaction B is also active and vice versa. | ||
- Negative correlation means that if reaction A is active, then reaction B is inactive and vice versa. | ||
- Zero correlation means that the status of reaction A is independent of the status of reaction B and vice versa. | ||
|
||
Users can sample the flux space of a selected metabolic model using dingo's `PolytopeSampler` class. Then, the `correlated_reactions` function can help them identify possible correlations in pairs of reactions. This function generates a correlation matrix based on the pearson correlation coefficient of pairwise reactions fluxes. | ||
|
||
Users can apply cutoff values to filter the matrix, replacing all values that do not meet the threshold with 0. For instance, if a pearson cutoff is set to 0.80, all correlation values below this are replaced with 0. However, pearson is not the only available correlation metric. | ||
|
||
Users can also filter correlated reactions using copulas. If you're unfamiliar, copulas allow you to examine how two random variables (in this case, reaction fluxes) are related. You can read more about them [here](https://waterprogramming.wordpress.com/2017/11/11/an-introduction-to-copulas/). Below is an example plot of a copula showing the relationship between two random variables: | ||
SotirisTouliopoulos marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
![](https://www.researchgate.net/publication/362369405/figure/fig4/AS:11431281085484712@1663777261143/Distribution-of-five-types-of-copulas-a-PDF-of-Gaussian-Copula-b-PDF-of-t-Copula.jpg) | ||
|
||
Copulas are a useful statistical tool for validating significant correlations. However, manually inspecting copula plots can introduce user bias and be time-consuming given the number of paired reactions. An alternative to manual curation is using a metric called the "indicator," which sums the probability values across each copula's diagonal and divides them. The value of the indicator, whether positive or negative, helps validate or reject a possible correlation. Users can set an indicator cutoff that functions similarly to the pearson cutoff. | ||
|
||
In addition to calculating correlations, a `plot_corr_matrix` function is available for visualizing the correlation matrix as a heatmap. This visualization helps users easily identify significant correlations. | ||
|
||
Code that illustrates how to use these functions: | ||
```python | ||
SotirisTouliopoulos marked this conversation as resolved.
Show resolved
Hide resolved
|
||
dingo_model = MetabolicNetwork.from_json("ext_data/e_coli_core.json") | ||
|
||
sampler = PolytopeSampler(dingo_model) | ||
steady_states = sampler.generate_steady_states() | ||
|
||
corr_matrix, indicator_dict = correlated_reactions( | ||
steady_states, | ||
pearson_cutoff = 0.99, | ||
indicator_cutoff = 2, | ||
# number of cells to compute the copula | ||
cells = 10, | ||
# value that defines the width of the copula’s diagonal | ||
cop_coeff = 0.3, | ||
# boolean variable that when True, keeps only the lower triangular matrix | ||
lower_triangle = False) | ||
|
||
plot_corr_matrix(corr_matrix, | ||
# list containing reactions’ names | ||
reactions, | ||
# parameter that defines image saving format | ||
format = "svg") | ||
``` | ||
|
||
The `e_coli_core.json` model we use represents the core metabolism of the E. coli. You can find the model and related data [here](http://bigg.ucsd.edu/models/e_coli_core). | ||
|
||
Here’s an example of a heatmap created from a symmetrical correlation matrix without cutoffs: | ||
SotirisTouliopoulos marked this conversation as resolved.
Show resolved
Hide resolved
|
||
<center> | ||
<img src="https://sotiristouliopoulos.github.io/dingo/img/corr_matrix_no_cutoffs.png"><br><br> | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It would be nice to have figures of high resolution; when you are trying to zoom in to check the reaction names you get pixels now. |
||
</center> | ||
|
||
We can focus on correlations from reactions that belong to the glycolytic and pentose pathways to get some insights from the heatmap. If you're unfamiliar with these reactions' topology, you can visit [ESCHER](https://escher.github.io/#/) <a href="#ref-3">[3]</a> and load the core metabolism map of E. coli. | ||
SotirisTouliopoulos marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
We observe strong pairwise correlations among all reactions in the pentose phosphate pathway (`G6PDH2R, PGL, GND, RPE, RPI, TKT1, TALA, TKT2`). However, for the glycolytic pathway (`PGI, PFK, FBA, TPI, GAPD, PGK, PGM, ENO, PYK`), two reactions, `PFK` and `PYK`, show decreased correlation with the others. | ||
SotirisTouliopoulos marked this conversation as resolved.
Show resolved
Hide resolved
SotirisTouliopoulos marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
From the literature, it is known that `PFK` and `PYK` are catalyzed by enzymes that regulate glycolysis. These enzymes do not contribute to gluconeogenesis (the reverse pathway of glycolysis) because their reactions are unidirectional. When gluconeogenesis occurs, these enzymes have little or no activity, while other reactions function in reverse. This explains their lower correlation compared to other glycolytic reactions. | ||
|
||
The gluconeogenic reactions that convert the products of `PFK` and `PYK` back into their substrates are `FBP` and `PPS`, respectively. These reactions may become active when there is an excess of `D-fructose-1,6-bisphosphate` or `pyruvate`, and the model must reduce their concentrations to achieve a steady-state condition. | ||
|
||
This example shows how examining the correlation matrix can help researchers study metabolic pathways. However, this process can be challenging in genome-scale models due to the large number of reactions and pathways. Clustering and graph analysis can help automate this process. | ||
|
||
|
||
## Clustering of the correlation matrix | ||
|
||
- Clustering based on a dissimilarity matrix reveals groups of reactions with similar correlation values across the matrix. | ||
- Reactions within the same cluster may contribute to the same pathways. | ||
|
||
The `cluster_corr_reactions` function hierarchically clusters the correlation matrix, and the `plot_dendrogram` function plots the resulting dendrogram. | ||
|
||
Code that illustrates how to use these functions: | ||
```python | ||
dissimilarity_matrix, labels, clusters = cluster_corr_reactions( | ||
corr_matrix, | ||
# list containing reactions’ names | ||
reactions, | ||
# defines the type of clustering linkage (options: single, average, complete, ward) | ||
linkage = "ward", | ||
# defines a threshold to cut the dendrogram at a specific height | ||
t = 10.0, | ||
# A boolean variable; if True, the dissimilarity matrix is calculated by subtracting absolute values from 1 | ||
correction = True) | ||
|
||
plot_dendrogram(dissimilarity_matrix, | ||
reactions, | ||
# specifies whether reaction names will appear on the x-axis | ||
plot_labels = True, | ||
t = 10.0, | ||
linkage = "ward") | ||
``` | ||
|
||
Here is an example of a dendrogram created from a correlation matrix with a pearson cutoff of 0.9999: | ||
SotirisTouliopoulos marked this conversation as resolved.
Show resolved
Hide resolved
|
||
<center> | ||
<img src="https://sotiristouliopoulos.github.io/dingo/img/dendrogram_pearson.png"><br><br> | ||
</center> | ||
|
||
We observe well-separated and distinct clusters. Some clusters contain reactions from the glycolytic and pentose phosphate pathways, as discussed earlier. Additionally, there are clusters with reactions from the citric acid cycle and other lesser-known pathways. | ||
|
||
In the cluster representing the pentose phosphate pathway, we notice that `PGI`, a glycolytic reaction, is included. This is likely because it shares a common metabolite (`D-glucose-6-phosphate`) with `G6PDH2r`, the first reaction in the pentose phosphate pathway. | ||
|
||
However, applying a stricter pearson cutoff (e.g., 0.99999) removes `PGI` from this cluster, leaving only the reactions of the pentose phosphate pathway. | ||
|
||
This is an early but promising indication that clustering the correlation matrix can reveal groups of reactions that contribute to the same pathways. | ||
|
||
In the future, other clustering methods such as `k-means`, `HDBSCAN`, and `biclustering` at the sample level will be explored to identify more specific clusters. Let me briefly explain: Some pathways have distinct phases, each serving a unique role in metabolism. Both glycolysis and the pentose phosphate pathways can be divided into two phases. | ||
|
||
For example, the pentose phosphate pathway has an oxidative and a non-oxidative phase. Reactions such as `G6PDH2R`, `PGL`, and `GND` belong to the oxidative phase, which produces `NADPH`. Hierarchical clustering may not distinguish these phases, but using alternative methods not solely based on the correlation matrix could reveal two distinct clusters representing the two phases. | ||
|
||
|
||
## Graphs from the correlation matrix | ||
|
||
- Creating graphs can reveal networks of correlated reactions, which may correspond to metabolic pathways. | ||
- These graphs are generated from the correlation matrix, where reactions are represented as nodes and correlation coefficients as edges. | ||
|
||
The `graph_corr_matrix` function generates graphs from the correlation matrix, and the `plot_graph` function plots these graphs. | ||
|
||
Code that illustrates how to use these functions: | ||
```python | ||
graphs, layouts = graph_corr_matrix(corr_matrix, | ||
reactions, | ||
# boolean value; if True, transforms the correlation matrix values into absolute values | ||
correction = True, | ||
# nested list containing sublists of reactions within the same cluster, created by the `cluster_corr_reactions` function. | ||
clusters = clusters) | ||
|
||
plot_graph( | ||
# graph object returned by the graph_corr_matrix function | ||
graph, | ||
# layout corresponding to the given graph object | ||
layout) | ||
|
||
``` | ||
Here’s an example of a graph created from a correlation matrix without cutoffs: | ||
<center> | ||
<img src="https://sotiristouliopoulos.github.io/dingo/img/graph_no_cutoffs.png"><br><br> | ||
</center> | ||
|
||
The width of each edge represents the correlation strength, scaled from 0 to 1. In addition to the initial graph, users can create and plot subgraphs that represent reaction subnetworks. These subnetworks can serve as supplementary information to clustering results, allowing researchers to identify similarities and differences between the two methods and adjust the structure of desired clusters accordingly. | ||
|
||
However, the insights gained from graphs extend beyond pathway prediction. The more connected a node (reaction) is, the more essential its role likely is in metabolism. Conversely, nodes with fewer connections may represent reactions that can be removed without significantly affecting the system. | ||
SotirisTouliopoulos marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
|
||
## Conclusion | ||
|
||
Flux sampling offers significant statistical value, enhancing research in metabolic models. In this blog, we explored the pre- and post-sampling features integrated into the dingo package. From model reduction to pathway prediction, these features are designed to assist researchers in studying fundamental aspects of metabolic function. | ||
|
||
|
||
## References | ||
|
||
SotirisTouliopoulos marked this conversation as resolved.
Show resolved
Hide resolved
|
||
- <a id="ref-1"></a> [1] Lewis NE, Hixson KK, Conrad TM, et al. Omic data from evolved E. coli are consistent with computed optimal growth from genome‐scale models. Molecular Systems Biology. 2010;6(1). doi:https://doi.org/10.1038/msb.2010.47 | ||
| ||
- <a id="ref-2"></a> [2] King ZA, Lu J, Dräger A, et al. BiGG Models: A platform for integrating, standardizing and sharing genome-scale models. Nucleic Acids Research. 2015;44(D1):D515-D522. doi:https://doi.org/10.1093/nar/gkv1049 | ||
| ||
- <a id="ref-3"></a> [3] King ZA, Dräger A, Ebrahim A, Sonnenschein N, Lewis NE, Palsson BO. Escher: A Web Application for Building, Sharing, and Embedding Data-Rich Visualizations of Biological Pathways. Gardner PP, ed. PLOS Computational Biology. 2015;11(8):e1004321. doi:https://doi.org/10.1371/journal.pcbi.1004321 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @SotirisTouliopoulos! That looks good. A general comment is that the text is more like a gsoc final submission. I would make a few modifications so that it would be more like a blog-post.