Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GSoC 2024 dingo blog #35

Open
wants to merge 19 commits into
base: gh-pages
Choose a base branch
from
Open
Changes from 16 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
230 changes: 230 additions & 0 deletions _posts/2024-08-24-GSoC2024-dingo.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,230 @@
---
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @SotirisTouliopoulos! That looks good. A general comment is that the text is more like a gsoc final submission. I would make a few modifications so that it would be more like a blog-post.

layout: single
title: "Enhancing dingo: from metabolic models reduction to pathways prediction"
date: 2024-08-24
author: Sotiris Touliopoulos
author_profile: true
read_time: true
comments: true
share: true
related: true
hidden: false
---


# Enhancing dingo: from metabolic models reduction to pathways prediction


<div style="text-align:center;">
<br>
<b>Sotiris Touliopoulos</b>
<br>
<b>Contributor to Google Summer of Code 2024 with GeomScale</b>
<br>
</div>


SotirisTouliopoulos marked this conversation as resolved.
Show resolved Hide resolved
> #### Mentors of this project: Vissarion Fisikopoulos, Apostolos Chalkis and Haris Zafeiropoulos.


## Goal of the project

The goal of this project is to enhance dingo by incorporating pre- and post-sampling features to leverage the increased statistical value of flux sampling.
SotirisTouliopoulos marked this conversation as resolved.
Show resolved Hide resolved

A preprocessing class, which reduces metabolic networks by removing specific reactions, is integrated into dingo, along with a post-processing function that calculates correlation metrics from pairwise reaction fluxes.

Additionally, functions for clustering and generating graphs from the correlation matrix are implemented to group reactions and predict metabolic pathways.

Separate functions for visualizing matrices, dendrograms, and graphs are also provided.


## Preprocess for metabolic models reduction

- Large metabolic models contain numerous reactions and metabolites. Sampling the flux space of such models requires a significant computational time due to the high dimensionsionality of the convex polytope.
SotirisTouliopoulos marked this conversation as resolved.
Show resolved Hide resolved
- Model preprocessing can mitigate this issue by removing specific reactions, thereby reducing the dimensional space and computational time.
- Reduced models have decreased complexity and can help researchers better understand basic principles of metabolism. The reduction process is automated and unbiased when a specific objective function is set.

When an object is initialized from the `PreProcess` class, users can call the `reduce` function on this object to remove three types of reactions:
SotirisTouliopoulos marked this conversation as resolved.
Show resolved Hide resolved

- Blocked reactions: cannot carry a flux in any condition.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the meaning of "carry a flux"? Is it standard?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is standard. It is mentioned in this article: https://doi.org/10.1038/msb.2010.47

- Zero-flux reactions: cannot carry a flux while maintaining at least 90% of the maximum growth rate.
- Metabolically less-efficient reactions: require a reduction in growth rate if used <a href="#ref-1">[1]</a>.

Users can choose to remove an additional set of reactions, by setting the `extend` parameter of the `reduce` function to `True`. These reactions do not affect the value of the objective function when removed.

Code that illustrates how to use this class:
```python
cobra_model = load_json_model("ext_data/e_coli_core.json")
SotirisTouliopoulos marked this conversation as resolved.
Show resolved Hide resolved

obj = PreProcess(cobra_model,
# tolerance value to identify significant flux changes
tol = 1e-6,
# boolean variable that define whether to open all exchange reactions to very high flux ranges.
SotirisTouliopoulos marked this conversation as resolved.
Show resolved Hide resolved
open_exchanges = False)

removed_reactions, reduced_dingo_model = obj.reduce(extend = False)
```

Reduction with the `PreProcess` class has been tested with various models from the BiGG database <a href="#ref-2">[2]</a>. Figures below show the number of remained reactions, after applying the `reduce` function:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to have two instead of one plot?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For better visibility. Lines overlapped when all reactions were in a single plot


<center>
<img src="https://sotiristouliopoulos.github.io/dingo/img/reduction_results_plot.png"><br><br>
<img src="https://sotiristouliopoulos.github.io/dingo/img/reduction_results_plot2.png"><br><br>
</center>

This process significantly reduces model complexity. In some cases, it yields core models, which are useful for examining essential metabolic pathways.


## Identification of correlated reactions

- Reactions in biochemical pathways can be positively correlated, negatively correlated, or uncorrelated.
- Positive correlation means that if reaction A is active, then reaction B is also active and vice versa.
- Negative correlation means that if reaction A is active, then reaction B is inactive and vice versa.
- Zero correlation means that the status of reaction A is independent of the status of reaction B and vice versa.

Users can sample the flux space of a selected metabolic model using dingo's `PolytopeSampler` class. Then, the `correlated_reactions` function can help them identify possible correlations in pairs of reactions. This function generates a correlation matrix based on the pearson correlation coefficient of pairwise reactions fluxes.

Users can apply cutoff values to filter the matrix, replacing all values that do not meet the threshold with 0. For instance, if a pearson cutoff is set to 0.80, all correlation values below this are replaced with 0. However, pearson is not the only available correlation metric.

Users can also filter correlated reactions using copulas. If you're unfamiliar, copulas allow you to examine how two random variables (in this case, reaction fluxes) are related. You can read more about them [here](https://waterprogramming.wordpress.com/2017/11/11/an-introduction-to-copulas/). Below is an example plot of a copula showing the relationship between two random variables:
SotirisTouliopoulos marked this conversation as resolved.
Show resolved Hide resolved

![](https://www.researchgate.net/publication/362369405/figure/fig4/AS:11431281085484712@1663777261143/Distribution-of-five-types-of-copulas-a-PDF-of-Gaussian-Copula-b-PDF-of-t-Copula.jpg)

Copulas are a useful statistical tool for validating significant correlations. However, manually inspecting copula plots can introduce user bias and be time-consuming given the number of paired reactions. An alternative to manual curation is using a metric called the "indicator," which sums the probability values across each copula's diagonal and divides them. The value of the indicator, whether positive or negative, helps validate or reject a possible correlation. Users can set an indicator cutoff that functions similarly to the pearson cutoff.

In addition to calculating correlations, a `plot_corr_matrix` function is available for visualizing the correlation matrix as a heatmap. This visualization helps users easily identify significant correlations.

Code that illustrates how to use these functions:
```python
SotirisTouliopoulos marked this conversation as resolved.
Show resolved Hide resolved
dingo_model = MetabolicNetwork.from_json("ext_data/e_coli_core.json")

sampler = PolytopeSampler(dingo_model)
steady_states = sampler.generate_steady_states()

corr_matrix, indicator_dict = correlated_reactions(
steady_states,
pearson_cutoff = 0.99,
indicator_cutoff = 2,
# number of cells to compute the copula
cells = 10,
# value that defines the width of the copula’s diagonal
cop_coeff = 0.3,
# boolean variable that when True, keeps only the lower triangular matrix
lower_triangle = False)

plot_corr_matrix(corr_matrix,
# list containing reactions’ names
reactions,
# parameter that defines image saving format
format = "svg")
```

The `e_coli_core.json` model we use represents the core metabolism of the E. coli. You can find the model and related data [here](http://bigg.ucsd.edu/models/e_coli_core).

Here’s an example of a heatmap created from a symmetrical correlation matrix without cutoffs:
SotirisTouliopoulos marked this conversation as resolved.
Show resolved Hide resolved
<center>
<img src="https://sotiristouliopoulos.github.io/dingo/img/corr_matrix_no_cutoffs.png"><br><br>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have figures of high resolution; when you are trying to zoom in to check the reaction names you get pixels now.

</center>

We can focus on correlations from reactions that belong to the glycolytic and pentose pathways to get some insights from the heatmap. If you're unfamiliar with these reactions' topology, you can visit [ESCHER](https://escher.github.io/#/) <a href="#ref-3">[3]</a> and load the core metabolism map of E. coli.
SotirisTouliopoulos marked this conversation as resolved.
Show resolved Hide resolved

We observe strong pairwise correlations among all reactions in the pentose phosphate pathway (`G6PDH2R, PGL, GND, RPE, RPI, TKT1, TALA, TKT2`). However, for the glycolytic pathway (`PGI, PFK, FBA, TPI, GAPD, PGK, PGM, ENO, PYK`), two reactions, `PFK` and `PYK`, show decreased correlation with the others.
SotirisTouliopoulos marked this conversation as resolved.
Show resolved Hide resolved
SotirisTouliopoulos marked this conversation as resolved.
Show resolved Hide resolved

From the literature, it is known that `PFK` and `PYK` are catalyzed by enzymes that regulate glycolysis. These enzymes do not contribute to gluconeogenesis (the reverse pathway of glycolysis) because their reactions are unidirectional. When gluconeogenesis occurs, these enzymes have little or no activity, while other reactions function in reverse. This explains their lower correlation compared to other glycolytic reactions.

The gluconeogenic reactions that convert the products of `PFK` and `PYK` back into their substrates are `FBP` and `PPS`, respectively. These reactions may become active when there is an excess of `D-fructose-1,6-bisphosphate` or `pyruvate`, and the model must reduce their concentrations to achieve a steady-state condition.

This example shows how examining the correlation matrix can help researchers study metabolic pathways. However, this process can be challenging in genome-scale models due to the large number of reactions and pathways. Clustering and graph analysis can help automate this process.


## Clustering of the correlation matrix

- Clustering based on a dissimilarity matrix reveals groups of reactions with similar correlation values across the matrix.
- Reactions within the same cluster may contribute to the same pathways.

The `cluster_corr_reactions` function hierarchically clusters the correlation matrix, and the `plot_dendrogram` function plots the resulting dendrogram.

Code that illustrates how to use these functions:
```python
dissimilarity_matrix, labels, clusters = cluster_corr_reactions(
corr_matrix,
# list containing reactions’ names
reactions,
# defines the type of clustering linkage (options: single, average, complete, ward)
linkage = "ward",
# defines a threshold to cut the dendrogram at a specific height
t = 10.0,
# A boolean variable; if True, the dissimilarity matrix is calculated by subtracting absolute values from 1
correction = True)

plot_dendrogram(dissimilarity_matrix,
reactions,
# specifies whether reaction names will appear on the x-axis
plot_labels = True,
t = 10.0,
linkage = "ward")
```

Here is an example of a dendrogram created from a correlation matrix with a pearson cutoff of 0.9999:
SotirisTouliopoulos marked this conversation as resolved.
Show resolved Hide resolved
<center>
<img src="https://sotiristouliopoulos.github.io/dingo/img/dendrogram_pearson.png"><br><br>
</center>

We observe well-separated and distinct clusters. Some clusters contain reactions from the glycolytic and pentose phosphate pathways, as discussed earlier. Additionally, there are clusters with reactions from the citric acid cycle and other lesser-known pathways.

In the cluster representing the pentose phosphate pathway, we notice that `PGI`, a glycolytic reaction, is included. This is likely because it shares a common metabolite (`D-glucose-6-phosphate`) with `G6PDH2r`, the first reaction in the pentose phosphate pathway.

However, applying a stricter pearson cutoff (e.g., 0.99999) removes `PGI` from this cluster, leaving only the reactions of the pentose phosphate pathway.

This is an early but promising indication that clustering the correlation matrix can reveal groups of reactions that contribute to the same pathways.

In the future, other clustering methods such as `k-means`, `HDBSCAN`, and `biclustering` at the sample level will be explored to identify more specific clusters. Let me briefly explain: Some pathways have distinct phases, each serving a unique role in metabolism. Both glycolysis and the pentose phosphate pathways can be divided into two phases.

For example, the pentose phosphate pathway has an oxidative and a non-oxidative phase. Reactions such as `G6PDH2R`, `PGL`, and `GND` belong to the oxidative phase, which produces `NADPH`. Hierarchical clustering may not distinguish these phases, but using alternative methods not solely based on the correlation matrix could reveal two distinct clusters representing the two phases.


## Graphs from the correlation matrix

- Creating graphs can reveal networks of correlated reactions, which may correspond to metabolic pathways.
- These graphs are generated from the correlation matrix, where reactions are represented as nodes and correlation coefficients as edges.

The `graph_corr_matrix` function generates graphs from the correlation matrix, and the `plot_graph` function plots these graphs.

Code that illustrates how to use these functions:
```python
graphs, layouts = graph_corr_matrix(corr_matrix,
reactions,
# boolean value; if True, transforms the correlation matrix values into absolute values
correction = True,
# nested list containing sublists of reactions within the same cluster, created by the `cluster_corr_reactions` function.
clusters = clusters)

plot_graph(
# graph object returned by the graph_corr_matrix function
graph,
# layout corresponding to the given graph object
layout)

```
Here’s an example of a graph created from a correlation matrix without cutoffs:
<center>
<img src="https://sotiristouliopoulos.github.io/dingo/img/graph_no_cutoffs.png"><br><br>
</center>

The width of each edge represents the correlation strength, scaled from 0 to 1. In addition to the initial graph, users can create and plot subgraphs that represent reaction subnetworks. These subnetworks can serve as supplementary information to clustering results, allowing researchers to identify similarities and differences between the two methods and adjust the structure of desired clusters accordingly.

However, the insights gained from graphs extend beyond pathway prediction. The more connected a node (reaction) is, the more essential its role likely is in metabolism. Conversely, nodes with fewer connections may represent reactions that can be removed without significantly affecting the system.
SotirisTouliopoulos marked this conversation as resolved.
Show resolved Hide resolved


## Conclusion

Flux sampling offers significant statistical value, enhancing research in metabolic models. In this blog, we explored the pre- and post-sampling features integrated into the dingo package. From model reduction to pathway prediction, these features are designed to assist researchers in studying fundamental aspects of metabolic function.


## References

SotirisTouliopoulos marked this conversation as resolved.
Show resolved Hide resolved
- <a id="ref-1"></a> [1] Lewis NE, Hixson KK, Conrad TM, et al. Omic data from evolved E. coli are consistent with computed optimal growth from genome‐scale models. Molecular Systems Biology. 2010;6(1). doi:https://doi.org/10.1038/msb.2010.47
- <a id="ref-2"></a> [2] King ZA, Lu J, Dräger A, et al. BiGG Models: A platform for integrating, standardizing and sharing genome-scale models. Nucleic Acids Research. 2015;44(D1):D515-D522. doi:https://doi.org/10.1093/nar/gkv1049
- <a id="ref-3"></a> [3] King ZA, Dräger A, Ebrahim A, Sonnenschein N, Lewis NE, Palsson BO. Escher: A Web Application for Building, Sharing, and Embedding Data-Rich Visualizations of Biological Pathways. Gardner PP, ed. PLOS Computational Biology. 2015;11(8):e1004321. doi:https://doi.org/10.1371/journal.pcbi.1004321