Different substrate annotations in dbCAN-PUL and run_dbcan4 #132
Comments
Thanks for the question. Substrate curation is not an easy job, as glycans appear under different names and are defined at different levels in the literature (e.g., lactose is a human milk oligosaccharide, and HMOs are host glycans), and we are not happy with our curation either. dbCAN-PUL.substrate.mapping.xls is the most complete reference table and is actively maintained, but dbCAN-PUL_07-01-2022.txt is what run_dbcan uses. dbCAN-PUL.substrate.mapping.xls has updated substrate columns reflecting our continuous efforts in grouping substrates. But you are correct that the "curated_substrate (07/01/2022)" column was extracted directly from the literature and was then used to derive the "updated_substrate" columns (which define glycans at higher levels). In short, there is no easy solution for now, and we are continuously working on a hierarchical classification to define glycans. Yanbin
I have similar questions. I tested the annotation with several "well-documented genomes" on the server, which gave reasonable substrate predictions. So I ran a batch annotation of 500 genomes following the "Run from Raw Reads: Automated CAZyme and Glycan Substrate Annotation in Microbiomes: A Step-by-Step Protocol". However, the outputs from "run_dbcan" for the same genomes were quite different from what I got from the server, and these differences confused me. Is it necessary to manually curate the output from "run_dbCAN" with dbCAN-PUL.substrate.mapping.xls?
Hi developers,
There seem to be different substrate annotations in dbCAN-PUL and in the output from run_dbcan. I am not sure whether this is intended or an error. For example, PUL0291 has lactose as its substrate in dbCAN-PUL, which also matches what is stated in the associated reference paper. However, in the file "dbCAN-PUL_07-01-2022.txt" used by run_dbcan4, the substrate for PUL0291 is "human milk oligosaccharide".
I also found this file "dbCAN-PUL.substrate.mapping.xls" on the FTP server, which has both a "curated substrate" and two "updated substrate" columns, but only the curated one seems to be correct. Am I misunderstanding something?
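For anyone who wants to list such discrepancies systematically, here is a minimal sketch of how the two tables could be compared. It assumes both files have been exported to tab-separated text with a header row; the column names `pul_id`, `substrate`, and `curated_substrate` are hypothetical placeholders, not the real headers, and the in-memory tables below are stand-ins (only the PUL0291 values come from this issue, `PUL_EXAMPLE` is made up).

```python
import csv
import io


def load_substrates(handle, id_col, substrate_col, delimiter="\t"):
    """Return {PUL ID: lower-cased substrate} from a delimited table with a header row."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return {row[id_col].strip(): row[substrate_col].strip().lower() for row in reader}


def diff_substrates(a, b):
    """Return sorted (id, substrate_a, substrate_b) for IDs present in both tables that disagree."""
    shared_ids = a.keys() & b.keys()
    return sorted((pid, a[pid], b[pid]) for pid in shared_ids if a[pid] != b[pid])


# In-memory stand-ins for the two files; column names are assumptions.
run_dbcan_table = io.StringIO(
    "pul_id\tsubstrate\n"
    "PUL0291\thuman milk oligosaccharide\n"
    "PUL_EXAMPLE\txylan\n"  # made-up row for illustration
)
mapping_table = io.StringIO(
    "pul_id\tcurated_substrate\n"
    "PUL0291\tlactose\n"
    "PUL_EXAMPLE\tXylan\n"  # differs only in case, so it is not reported
)

used = load_substrates(run_dbcan_table, "pul_id", "substrate")
curated = load_substrates(mapping_table, "pul_id", "curated_substrate")
mismatches = diff_substrates(used, curated)

for pul_id, used_substrate, curated_substrate in mismatches:
    print(f"{pul_id}\t{used_substrate}\t{curated_substrate}")
```

To run this on the real files, replace the `io.StringIO` objects with opened file handles and adjust the column names to whatever the actual headers are. The comparison is a plain case-insensitive string match, so it will also flag entries that differ only in the level of the glycan definition.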