feat: fuzzy kcat matching #188

edkerk · 2022-12-19T15:59:24Z

Main improvements in this PR:

Added the classical GECKO 2 behavior to GECKO 3:

If the EC Codes field is missing in the model, the EC codes will be filled into the model.ec structure using the classical GECKO 2 method.
It is now also possible to use the fuzzyKcatMatching function to get kcats from BRENDA etc. using the fuzzy matching from GECKO2.

I hereby confirm that I have:

Tested my code with all requirements for running GECKO
Selected devel as a target branch (top left drop-down menu)

geckomat/get_enzyme_data/loadBRENDAdata.m

edkerk · 2022-12-20T13:19:00Z

I copied commit 4866112 as part of PR #189, to separate the two issues (that commit can already be reviewed, while the fuzzy matching might need more refactoring & discussion). Once #189 is accepted it can also be merged into this PR, but at least #189 will not be unneccessarily delayed.
We should avoid to have too much model-specific files and duplication. See above comment, instead give loadDatabase a flag to only keep a selection of the loaded data.
Related to the above, phylDist.mat is duplicate of the file in databases.
The protDatabase.mat is partially superseded by the Uniprot text files that are obtained from loadDatabases. I'll refactor something similar for KEGG, so that we can avoid storing *.mat files in the repo (changes in these binary files cannot be tracked and they can quickly increase the size of the repo).

edkerk · 2022-12-21T11:40:58Z

Set this as draft as I'm currently refactoring the work, including the Uniprot and KEGG database files.

I'm not working on updating the max_*.txt files, which should potentially be done using DLKcat code. This is probably material for a separate PR.

And directly add eccodes to model.ec.eccodes.

To be done in separate function.

And rename getECCodes to getECfromDatabase

edkerk · 2022-12-21T21:19:51Z

I'll soon rebase this PR to a separate branch, to avoid having additional .mat files becoming part of the git history.

The original refactoring as done by @johan-gson is in the feat/fuzzyMatchingOriginal branch for reference.

Based on the implementation by @johan-gson, I've made the following changes:

Removed ProtDatabase.mat and instead use flat-text uniprot (implemented earlier) and KEGG database files (implemented here). This avoids the usage of binary .mat files.
Remove population of model.ec.eccodes from makeEcModel. This is only needed for fuzzyKcatMatching; is slow if the KEGG and/or UniProt databases are not yet constructed; and the user might want to make a selection of ec-codes derived from KEGG/UniProt and already defined in model.eccodes.
Do not make model.ec.substrates and model.ec.substrCoeffs fields. This is just used by fuzzyKcatMatching and can easily be parsed from the model at that point. This reduces replicating in the model structure and makes it easier to I/O GECKO models as yml-file.
getECCodes renamed getECfromDatabases, and directly populates model.ec.eccodes, using the new databases (see above).
New getECfromGEM instead gathers ec-codes from model.eccodes to populate model.ec.eccodes. The two getEC* functions can act on a selection of the reactions, so they can be combined to reach maximum coverage (advice: first run getECfromAnnotation, and run getECfromDatabases for cases where no ec-code could be found).
fuzzyKcatMatching outputs a kcatList that can be used by selectKcatValue to populate model.ec.kcat in a similar way as with DLKcat-derived values.

Not addressed is the actual fuzzy matching, or updated max_*.txt files. The latter could require some refactoring of the code, but this will be in a separate PR.

Yu-sysbio · 2022-12-22T03:46:14Z

Not addressed is the actual fuzzy matching, or updated max_*.txt files. The latter could require some refactoring of the code, but this will be in a separate PR.

Regarding the updated max_*.txt files, I already have codes in my private repo, which can extract enzyme information from the latest Brenda database. The output files are however different from what GECKO1&2 needs, but I can modify the files afterward.
It would be better to clarify in the separate PR: 1) do we need all three max_*.txt files, i.e., max_KCAT.txt, max_SA.txt and max_MW.txt in GECKO3? 2) should we keep the updated max_*.txt files in GECKO3 exactly the same structure as in GECKO1&2?

edkerk · 2022-12-22T11:12:22Z

Regarding the updated max_*.txt files, I already have codes in my private repo, which can extract enzyme information from the latest Brenda database. The output files are however different from what GECKO1&2 needs, but I can modify the files afterward. It would be better to clarify in the separate PR: 1) do we need all three max_*.txt files, i.e., max_KCAT.txt, max_SA.txt and max_MW.txt in GECKO3? 2) should we keep the updated max_*.txt files in GECKO3 exactly the same structure as in GECKO1&2?

Points 1) and 2) can be discussed in #157.

# Conflicts: # src/geckomat/change_model/makeEcModel.m # src/geckomat/gather_kcats/fuzzyKcatMatching.m # src/geckomat/gather_kcats/writeDLKcatInput.m # src/geckomat/get_enzyme_data/findECInDB.m # src/geckomat/get_enzyme_data/findInDBOld._obsolete.m # src/geckomat/get_enzyme_data/findInDB_obsolete.m # src/geckomat/get_enzyme_data/getECfromDatabase.m # src/geckomat/get_enzyme_data/getECfromGEM.m # src/geckomat/get_enzyme_data/getEnzymeCodes.m # src/geckomat/get_enzyme_data/loadBRENDAdata.m # src/geckomat/get_enzyme_data/loadDatabases.m # src/geckomat/get_enzyme_data/updateDatabases.m # src/geckomat/utilities/loadDatabases.m # userData/ecYeastGEM/data/ProtDatabase.mat # userData/ecYeastGEM/data/uniprot.tab

edkerk · 2022-12-23T01:08:52Z

Ready for review!

This can now generate a BRENDA fuzzy matching based ec-model. Some code is available in tutorials/protocol.m. I have not checked whether the assigned kcat values are correct (should be reporting same as in GECKO1&2, as the matching approach hasn't changed). But the model can at least run FBA (without protein pool constraint, so same growth rate is reached as non-ec-model).

Note that protein databases (KEGG, UniProt) are not distributed with GECKO but are rather reconstructed with loadDatabases. These files get quite large, particularly when we start distributing ecHumanGEM. Meanwhile, the databases are mostly required for making the initial model and finding EC numbers, so when also distributing a model file the user will likely not even require these files anew.

To do in next PR:

Write a function that can choose which of the fuzzy matched kcat values should be used in the model (use kcatList.origin and kcatList.wildcardLvl).
Correct documentation for each function.

johan-gson · 2022-12-23T04:16:14Z

geckomat/get_enzyme_data/loadDatabases.m

+if nargin<2
+    selectDatabase = 'both';
+end
+geckoPath = findGECKOroot();


I realize this code is just copied from my previous code, but I think it would be better to move the retrieval of this path to the base model, so it can be overridden. This way, we could create a small test taxon placed in the test folder when we do test cases, and a TestModelAdapter would then point to that folder instead of the standard one. That also makes it possible to provide a custom such database if desired.

johan-gson · 2022-12-23T04:18:07Z

geckomat/get_enzyme_data/loadDatabases.m

+
+%% Uniprot
+if any(strcmp(selectDatabase,{'uniprot','both'}))
+    filePath = fullfile(geckoPath,'databases','uniprot',[num2str(taxonID) '.tsv']);


I'm thinking then that this would be replaced by something like

filePath = fullfile(modelAdapter.getUniprotPath(),[num2str(taxonID) '.tsv']);

And in the base adapter for example (not tested):

function path = getUniprotPath(obj) geckoPath = findGECKOroot(); path = fullfile(geckoPath,'databases','uniprot'); end

Someone should think through if there should be just one function for all paths like I did now (yes I was lazy :) ) or one function per path, I don't have a strong opinion here.

This way, that function could be overridden in a TestModelAdapter

johan-gson · 2022-12-23T04:28:27Z

geckomat/get_enzyme_data/loadDatabases.m

+
+%% KEGG
+if any(strcmp(selectDatabase,{'kegg','both'}))
+    filePath = fullfile(geckoPath,'databases','kegg',[keggID '.tsv']);


Same thing here.

johan-gson · 2022-12-23T04:49:45Z

Good changes to this code Ed! I had a comment about moving some things to the adapter - we can either do that now or later. Since we start with the old GECKO code and gradually improve it, we cannot expect to fix all things at once, everything that is an improvement passes as I see it! But in the final perfect product, we should have thought through how to be able to test everything.

johan-gson

So, I accidentally left my comments outside of the review, but see them as optional.

Yu-sysbio

Well done Ed! Looks good in general. I ran the script tutorials/protocol.m smoothly. Besides that, I set ub for the total protein pool reaction at 0.1 and got an infeasible solution. This is due to extremely low kcat values in the model as when increasing from 0.1 to 10 I got 1e-4 growth rate. This means that we indeed need tuning in the next stage.

I actually got some errors or found some bugs when making ecGEMs for other organisms e.g., E. coli. These might be not so relevant here and we could discuss them later maybe in test/debug processes?

edkerk · 2022-12-23T10:42:17Z

@Yu-sysbio Yes, I'd prefer to merge this PR, so that further work can built on this code already, while separately making smaller bug-fixes.

@johan-gson Excellent suggestion, will implement this in later PR. Just one thing, you seemed to have commented on an earlier commit, not on the final code:

GECKO/src/geckomat/get_enzyme_data/loadDatabases.m

Lines 30 to 38 in b2f72b1

    
           if nargin<2 
        
               global GECKOModelAdapter 
        
               params  = checkGECKOModelAdapter(GECKOModelAdapter); 
        
               keggID  = params.keggID; 
        
               taxonID = params.taxonID; 
        
               filePath = fullfile(params.path,'data'); 
        
           else 
        
               params.uniprotGeneIdField = geneIdField; 
        
           end

Your argument still holds though!

edkerk mentioned this pull request Dec 19, 2022

feat: Implemented GECKO Light functionality in GECKO 3. #180

Merged

2 tasks

edkerk commented Dec 20, 2022

View reviewed changes

geckomat/get_enzyme_data/loadBRENDAdata.m Outdated Show resolved Hide resolved

edkerk and others added 3 commits December 20, 2022 23:27

refactor: loadDatabases UniProt and KEGG

cfd5fb8

feat: Added the classic fuzzy logic matching from GECKO2

f3bf38f

fix: loadDatabase UniProt skip unused columns

36409b7

edkerk marked this pull request as draft December 21, 2022 11:39

edkerk added 6 commits December 21, 2022 14:52

refactor: getECCodes with new protein DBs

6cf133c

And directly add eccodes to model.ec.eccodes.

refactor: loadBRENDAdata find database files

80793d8

feat: makeEcModel do not set model.ec.eccodes

332bece

To be done in separate function.

fix: various minor bugs

7601038

feat: fuzzyKcatMatching using new DBs

79c06a1

feat: getECfromGEM to copy from model.eccodes

ad3f39a

And rename getECCodes to getECfromDatabase

edkerk marked this pull request as ready for review December 21, 2022 21:21

edkerk marked this pull request as draft December 21, 2022 21:21

edkerk force-pushed the feat/fuzzyMatching branch from 2b44678 to ad3f39a Compare December 21, 2022 21:23

edkerk marked this pull request as ready for review December 21, 2022 21:24

edkerk added the gecko3 label Dec 21, 2022

This was referenced Dec 21, 2022

assign and use EC codes from model.eccodes #175

Closed

fuzzyKcatMatching: refactoring of BRENDA-based approach #158

Closed

This was linked to issues Dec 21, 2022

fuzzyKcatMatching: refactoring of BRENDA-based approach #158

Closed

assign and use EC codes from model.eccodes #175

Closed

feat: fuzzyKcatMatching makes kcatList output

89c0a6c

edkerk added 3 commits December 22, 2022 21:56

feat: loadDatabase manual taxonID and keggID

7206b90

fix: minor bugfixes

cf56471

doc: sketch of draft protocol

b2f72b1

johan-gson reviewed Dec 23, 2022

View reviewed changes

johan-gson approved these changes Dec 23, 2022

View reviewed changes

Yu-sysbio approved these changes Dec 23, 2022

View reviewed changes

edkerk merged commit 6010dba into devel3 Dec 23, 2022

edkerk deleted the feat/fuzzyMatching branch December 23, 2022 11:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: fuzzy kcat matching #188

feat: fuzzy kcat matching #188

edkerk commented Dec 19, 2022 •

edited by johan-gson

Loading

edkerk commented Dec 20, 2022

edkerk commented Dec 21, 2022

edkerk commented Dec 21, 2022 •

edited

Loading

Yu-sysbio commented Dec 22, 2022

edkerk commented Dec 22, 2022

edkerk commented Dec 23, 2022 •

edited

Loading

johan-gson Dec 23, 2022

johan-gson Dec 23, 2022

johan-gson Dec 23, 2022

johan-gson Dec 23, 2022

johan-gson Dec 23, 2022

johan-gson commented Dec 23, 2022

johan-gson left a comment

Yu-sysbio left a comment

edkerk commented Dec 23, 2022

feat: fuzzy kcat matching #188

feat: fuzzy kcat matching #188

Conversation

edkerk commented Dec 19, 2022 • edited by johan-gson Loading

Main improvements in this PR:

edkerk commented Dec 20, 2022

edkerk commented Dec 21, 2022

edkerk commented Dec 21, 2022 • edited Loading

Yu-sysbio commented Dec 22, 2022

edkerk commented Dec 22, 2022

edkerk commented Dec 23, 2022 • edited Loading

johan-gson Dec 23, 2022

Choose a reason for hiding this comment

johan-gson Dec 23, 2022

Choose a reason for hiding this comment

johan-gson Dec 23, 2022

Choose a reason for hiding this comment

johan-gson Dec 23, 2022

Choose a reason for hiding this comment

johan-gson Dec 23, 2022

Choose a reason for hiding this comment

johan-gson commented Dec 23, 2022

johan-gson left a comment

Choose a reason for hiding this comment

Yu-sysbio left a comment

Choose a reason for hiding this comment

edkerk commented Dec 23, 2022

edkerk commented Dec 19, 2022 •

edited by johan-gson

Loading

edkerk commented Dec 21, 2022 •

edited

Loading

edkerk commented Dec 23, 2022 •

edited

Loading