Different substrate annotations in dbCAN-PUL and run_dbcan4 #132

Open
Russel88 opened this issue Oct 11, 2023 · 2 comments

@Russel88

Hi developers,

The substrate annotations in dbCAN-PUL seem to differ from those in the output of run_dbcan. I'm not sure whether this is intended or an error. For example, PUL0291 has lactose as its substrate in dbCAN-PUL, which also fits what is stated in the associated reference paper. However, in the file "dbCAN-PUL_07-01-2022.txt" used by run_dbcan4, the substrate for PUL0291 is "human milk oligosaccharide".

I also found this file "dbCAN-PUL.substrate.mapping.xls" on the FTP server, which has both a "curated substrate" and two "updated substrate" columns, but only the curated one seems to be correct. Am I misunderstanding something?
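To make the comparison concrete, here is a minimal sketch of how one might list the PULs whose substrate differs between the two tables. The tab-separated layout and the column names (`pul_id`, `curated_substrate`, `substrate`) are assumptions for illustration, not the actual format of either file; only the PUL0291 values are taken from this thread.

```python
import csv
import io

# Toy stand-ins for the two tables. Layout and column names are assumptions;
# only the PUL0291 values come from this thread.
MAPPING_TSV = "pul_id\tcurated_substrate\nPUL0291\tlactose\n"
RUN_DBCAN_TSV = "pul_id\tsubstrate\nPUL0291\thuman milk oligosaccharide\n"


def load_substrates(text, id_col, substrate_col):
    """Return {PUL ID: lower-cased substrate} from tab-separated text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row[id_col]: row[substrate_col].strip().lower() for row in reader}


def find_mismatches(curated, used):
    """Return {PUL ID: (curated, used)} for shared IDs whose substrates differ."""
    return {
        pul: (curated[pul], used[pul])
        for pul in sorted(curated.keys() & used.keys())
        if curated[pul] != used[pul]
    }


curated = load_substrates(MAPPING_TSV, "pul_id", "curated_substrate")
used = load_substrates(RUN_DBCAN_TSV, "pul_id", "substrate")
mismatches = find_mismatches(curated, used)
for pul, (cur, run) in mismatches.items():
    print(f"{pul}: curated={cur!r}, run_dbcan={run!r}")
```

With the real files, the loading step would need to be adapted to their actual format and headers.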

Russel88 changed the title from "Missing agreement between dbCAN-PUL and run_dbcan4" to "Different substrate annotations in dbCAN-PUL and run_dbcan4" on Oct 11, 2023
@yinlabniu
Collaborator

Thanks for the question. Substrate curation is not an easy job, because glycans appear in the literature under different names and are defined at different levels (e.g., lactose is a human milk oligosaccharide, and HMOs are host glycans), and we are not happy with our curation either. dbCAN-PUL.substrate.mapping.xls is the most complete reference table and is actively maintained, but dbCAN-PUL_07-01-2022.txt is what run_dbcan uses. The updated substrate columns in dbCAN-PUL.substrate.mapping.xls reflect our continuing efforts to group substrates. You are correct that the "curated_substrate (07/01/2022)" column was extracted directly from the literature; it was used to derive the "updated_substrate" columns, which define glycans at higher levels.

In short, there is no easy solution for now, and we are continuously working on a hierarchical classification to define glycans.

Yanbin

@Weilan2011

Weilan2011 commented Apr 15, 2024

I have similar questions. I tested the annotation on the server with several "well-documented genomes", and it gave reasonable substrate predictions. So I ran a batch annotation of 500 genomes, following "Run from Raw Reads: Automated CAZyme and Glycan Substrate Annotation in Microbiomes: A Step-by-Step Protocol". However, the "run_dbcan" outputs for the same genomes were quite different from what I got from the server, which left me confused. Is it necessary to manually curate the "run_dbcan" output using dbCAN-PUL.substrate.mapping.xls?
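If manual curation does turn out to be needed, one way to do it might be to attach the literature-curated substrate to each prediction by PUL ID while keeping the original label for comparison. The sketch below assumes each prediction is a dict with the hypothetical keys `pul_id` and `substrate`, and that a PUL-to-curated-substrate lookup has already been built from the mapping table; neither reflects the real run_dbcan output format. It does not explain why the server and local results differ.

```python
# Hypothetical lookup built from the mapping table: PUL ID -> curated substrate.
# Only the PUL0291 entry is taken from this thread.
CURATED_BY_PUL = {"PUL0291": "lactose"}


def relabel(predictions, curated_by_pul):
    """Add a 'curated_substrate' field to each prediction, keeping the original.

    Each prediction is a dict with the assumed keys 'pul_id' and 'substrate'.
    PULs missing from the lookup get None so that gaps stay visible.
    """
    out = []
    for pred in predictions:
        row = dict(pred)  # copy so the input list is left unchanged
        row["curated_substrate"] = curated_by_pul.get(pred["pul_id"])
        out.append(row)
    return out


predictions = [{"pul_id": "PUL0291", "substrate": "human milk oligosaccharide"}]
relabeled = relabel(predictions, CURATED_BY_PUL)
print(relabeled)
```

Keeping both fields side by side makes it easy to see where the label used by run_dbcan and the curated label disagree.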
