Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: fuzzy kcat matching #188

Merged
merged 14 commits into from
Dec 23, 2022
Merged

feat: fuzzy kcat matching #188

merged 14 commits into from
Dec 23, 2022

Conversation

edkerk
Copy link
Member

@edkerk edkerk commented Dec 19, 2022

Main improvements in this PR:

Added the classical GECKO 2 behavior to GECKO 3:

  • If the EC Codes field is missing in the model, the EC codes will be filled into the model.ec structure using the classical GECKO 2 method.
  • It is now also possible to use the fuzzyKcatMatching function to get kcats from BRENDA etc. using the fuzzy matching from GECKO2.

I hereby confirm that I have:

  • Tested my code with all requirements for running GECKO
  • Selected devel as a target branch (top left drop-down menu)

@edkerk
Copy link
Member Author

edkerk commented Dec 20, 2022

  1. I copied commit 4866112 as part of PR #189, to separate the two issues (that commit can already be reviewed, while the fuzzy matching might need more refactoring & discussion). Once #189 is accepted it can also be merged into this PR, but at least #189 will not be unneccessarily delayed.

  2. We should avoid to have too much model-specific files and duplication. See above comment, instead give loadDatabase a flag to only keep a selection of the loaded data.

  3. Related to the above, phylDist.mat is duplicate of the file in databases.

  4. The protDatabase.mat is partially superseded by the Uniprot text files that are obtained from loadDatabases. I'll refactor something similar for KEGG, so that we can avoid storing *.mat files in the repo (changes in these binary files cannot be tracked and they can quickly increase the size of the repo).

@edkerk edkerk marked this pull request as draft December 21, 2022 11:39
@edkerk
Copy link
Member Author

edkerk commented Dec 21, 2022

Set this as draft as I'm currently refactoring the work, including the Uniprot and KEGG database files.

I'm not working on updating the max_*.txt files, which should potentially be done using DLKcat code. This is probably material for a separate PR.

@edkerk
Copy link
Member Author

edkerk commented Dec 21, 2022

I'll soon rebase this PR to a separate branch, to avoid having additional .mat files becoming part of the git history.

The original refactoring as done by @johan-gson is in the feat/fuzzyMatchingOriginal branch for reference.

Based on the implementation by @johan-gson, I've made the following changes:

  • Removed ProtDatabase.mat and instead use flat-text uniprot (implemented earlier) and KEGG database files (implemented here). This avoids the usage of binary .mat files.
  • Remove population of model.ec.eccodes from makeEcModel. This is only needed for fuzzyKcatMatching; is slow if the KEGG and/or UniProt databases are not yet constructed; and the user might want to make a selection of ec-codes derived from KEGG/UniProt and already defined in model.eccodes.
  • Do not make model.ec.substrates and model.ec.substrCoeffs fields. This is just used by fuzzyKcatMatching and can easily be parsed from the model at that point. This reduces replicating in the model structure and makes it easier to I/O GECKO models as yml-file.
  • getECCodes renamed getECfromDatabases, and directly populates model.ec.eccodes, using the new databases (see above).
  • New getECfromGEM instead gathers ec-codes from model.eccodes to populate model.ec.eccodes. The two getEC* functions can act on a selection of the reactions, so they can be combined to reach maximum coverage (advice: first run getECfromAnnotation, and run getECfromDatabases for cases where no ec-code could be found).
  • fuzzyKcatMatching outputs a kcatList that can be used by selectKcatValue to populate model.ec.kcat in a similar way as with DLKcat-derived values.

Not addressed is the actual fuzzy matching, or updated max_*.txt files. The latter could require some refactoring of the code, but this will be in a separate PR.

@edkerk edkerk marked this pull request as ready for review December 21, 2022 21:21
@edkerk edkerk marked this pull request as draft December 21, 2022 21:21
@edkerk edkerk marked this pull request as ready for review December 21, 2022 21:24
@edkerk edkerk added the gecko3 label Dec 21, 2022
@Yu-sysbio
Copy link
Collaborator

Not addressed is the actual fuzzy matching, or updated max_*.txt files. The latter could require some refactoring of the code, but this will be in a separate PR.

Regarding the updated max_*.txt files, I already have codes in my private repo, which can extract enzyme information from the latest Brenda database. The output files are however different from what GECKO1&2 needs, but I can modify the files afterward.
It would be better to clarify in the separate PR: 1) do we need all three max_*.txt files, i.e., max_KCAT.txt, max_SA.txt and max_MW.txt in GECKO3? 2) should we keep the updated max_*.txt files in GECKO3 exactly the same structure as in GECKO1&2?

@edkerk
Copy link
Member Author

edkerk commented Dec 22, 2022

Regarding the updated max_*.txt files, I already have codes in my private repo, which can extract enzyme information from the latest Brenda database. The output files are however different from what GECKO1&2 needs, but I can modify the files afterward. It would be better to clarify in the separate PR: 1) do we need all three max_*.txt files, i.e., max_KCAT.txt, max_SA.txt and max_MW.txt in GECKO3? 2) should we keep the updated max_*.txt files in GECKO3 exactly the same structure as in GECKO1&2?

Points 1) and 2) can be discussed in #157.

# Conflicts:
#	src/geckomat/change_model/makeEcModel.m
#	src/geckomat/gather_kcats/fuzzyKcatMatching.m
#	src/geckomat/gather_kcats/writeDLKcatInput.m
#	src/geckomat/get_enzyme_data/findECInDB.m
#	src/geckomat/get_enzyme_data/findInDBOld._obsolete.m
#	src/geckomat/get_enzyme_data/findInDB_obsolete.m
#	src/geckomat/get_enzyme_data/getECfromDatabase.m
#	src/geckomat/get_enzyme_data/getECfromGEM.m
#	src/geckomat/get_enzyme_data/getEnzymeCodes.m
#	src/geckomat/get_enzyme_data/loadBRENDAdata.m
#	src/geckomat/get_enzyme_data/loadDatabases.m
#	src/geckomat/get_enzyme_data/updateDatabases.m
#	src/geckomat/utilities/loadDatabases.m
#	userData/ecYeastGEM/data/ProtDatabase.mat
#	userData/ecYeastGEM/data/uniprot.tab
@edkerk
Copy link
Member Author

edkerk commented Dec 23, 2022

Ready for review!

This can now generate a BRENDA fuzzy matching based ec-model. Some code is available in tutorials/protocol.m. I have not checked whether the assigned kcat values are correct (should be reporting same as in GECKO1&2, as the matching approach hasn't changed). But the model can at least run FBA (without protein pool constraint, so same growth rate is reached as non-ec-model).

Note that protein databases (KEGG, UniProt) are not distributed with GECKO but are rather reconstructed with loadDatabases. These files get quite large, particularly when we start distributing ecHumanGEM. Meanwhile, the databases are mostly required for making the initial model and finding EC numbers, so when also distributing a model file the user will likely not even require these files anew.

To do in next PR:

  • Write a function that can choose which of the fuzzy matched kcat values should be used in the model (use kcatList.origin and kcatList.wildcardLvl).
  • Correct documentation for each function.

if nargin<2
selectDatabase = 'both';
end
geckoPath = findGECKOroot();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this code is just copied from my previous code, but I think it would be better to move the retrieval of this path to the base model, so it can be overridden. This way, we could create a small test taxon placed in the test folder when we do test cases, and a TestModelAdapter would then point to that folder instead of the standard one. That also makes it possible to provide a custom such database if desired.


%% Uniprot
if any(strcmp(selectDatabase,{'uniprot','both'}))
filePath = fullfile(geckoPath,'databases','uniprot',[num2str(taxonID) '.tsv']);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking then that this would be replaced by something like

filePath = fullfile(modelAdapter.getUniprotPath(),[num2str(taxonID) '.tsv']);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And in the base adapter for example (not tested):

    function path = getUniprotPath(obj)
        geckoPath = findGECKOroot();
        path = fullfile(geckoPath,'databases','uniprot');
    end

Someone should think through if there should be just one function for all paths like I did now (yes I was lazy :) ) or one function per path, I don't have a strong opinion here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way, that function could be overridden in a TestModelAdapter


%% KEGG
if any(strcmp(selectDatabase,{'kegg','both'}))
filePath = fullfile(geckoPath,'databases','kegg',[keggID '.tsv']);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing here.

@johan-gson
Copy link
Collaborator

Good changes to this code Ed! I had a comment about moving some things to the adapter - we can either do that now or later. Since we start with the old GECKO code and gradually improve it, we cannot expect to fix all things at once, everything that is an improvement passes as I see it! But in the final perfect product, we should have thought through how to be able to test everything.

Copy link
Collaborator

@johan-gson johan-gson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I accidentally left my comments outside of the review, but see them as optional.

Copy link
Collaborator

@Yu-sysbio Yu-sysbio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done Ed! Looks good in general. I ran the script tutorials/protocol.m smoothly. Besides that, I set ub for the total protein pool reaction at 0.1 and got an infeasible solution. This is due to extremely low kcat values in the model as when increasing from 0.1 to 10 I got 1e-4 growth rate. This means that we indeed need tuning in the next stage.

I actually got some errors or found some bugs when making ecGEMs for other organisms e.g., E. coli. These might be not so relevant here and we could discuss them later maybe in test/debug processes?

@edkerk
Copy link
Member Author

edkerk commented Dec 23, 2022

@Yu-sysbio Yes, I'd prefer to merge this PR, so that further work can built on this code already, while separately making smaller bug-fixes.

@johan-gson Excellent suggestion, will implement this in later PR. Just one thing, you seemed to have commented on an earlier commit, not on the final code:

if nargin<2
global GECKOModelAdapter
params = checkGECKOModelAdapter(GECKOModelAdapter);
keggID = params.keggID;
taxonID = params.taxonID;
filePath = fullfile(params.path,'data');
else
params.uniprotGeneIdField = geneIdField;
end

Your argument still holds though!

@edkerk edkerk merged commit 6010dba into devel3 Dec 23, 2022
@edkerk edkerk deleted the feat/fuzzyMatching branch December 23, 2022 11:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

assign and use EC codes from model.eccodes fuzzyKcatMatching: refactoring of BRENDA-based approach
5 participants