Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbcan_utils CGC_substrate_abund and dbcan_utils CGC_abund error #179

Open
Ben-41 opened this issue Jun 18, 2024 · 19 comments
Open

dbcan_utils CGC_substrate_abund and dbcan_utils CGC_abund error #179

Ben-41 opened this issue Jun 18, 2024 · 19 comments
Labels
bug Something isn't working

Comments

@Ben-41
Copy link

Ben-41 commented Jun 18, 2024

Report

hi, I have encounter issues with the estimation of CGC substrate abundance and CGC abundance.
I followed all the steps from the manual and it ran smoothly, including dbcan_utils fam_abund and dbcan_utils fam_substrate_abund, however, when I ran dbcan_utils CGC_substrate_abund and dbcan_utils CGC_abund, error raise:

You are estimating the abundance of CGC/CGC substrate!
Reads are single end!
Total reads count: 218847!
Traceback (most recent call last):
File "/home/cdd/anaconda3/envs/dbcan/bin/dbcan_utils", line 10, in
sys.exit(main())
File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 621, in main
PUL_abundance(args)
File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 492, in PUL_abundance
PUL_abund = CAZyme_Abundance_estimate(paras)
File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 254, in init
seqid2dbcan_annotation,cgcid2cgc_standard = Read_cgc_standard_out(parameters.PUL_annotation)
File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 203, in Read_cgc_standard_out
tmp_record = cgc_standard_line(line.rstrip().split("\t"))
File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 191, in init
self.gene_start = int(lines[4])
ValueError: invalid literal for int() with base 10: 'Gene Start'

Version information

No response

@Ben-41 Ben-41 added the bug Something isn't working label Jun 18, 2024
@ZhengJinfang1220
Copy link

Report

hi, I have encounter issues with the estimation of CGC substrate abundance and CGC abundance. I followed all the steps from the manual and it ran smoothly, including dbcan_utils fam_abund and dbcan_utils fam_substrate_abund, however, when I ran dbcan_utils CGC_substrate_abund and dbcan_utils CGC_abund, error raise:

You are estimating the abundance of CGC/CGC substrate! Reads are single end! Total reads count: 218847! Traceback (most recent call last): File "/home/cdd/anaconda3/envs/dbcan/bin/dbcan_utils", line 10, in sys.exit(main()) File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 621, in main PUL_abundance(args) File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 492, in PUL_abundance PUL_abund = CAZyme_Abundance_estimate(paras) File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 254, in init seqid2dbcan_annotation,cgcid2cgc_standard = Read_cgc_standard_out(parameters.PUL_annotation) File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 203, in Read_cgc_standard_out tmp_record = cgc_standard_line(line.rstrip().split("\t")) File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 191, in init self.gene_start = int(lines[4]) ValueError: invalid literal for int() with base 10: 'Gene Start'

Version information

No response

It seems the issue happens when the script reads the file "cgc_standard.out". Can you share this file here? So I can debug the code.

Jinfang

@Ben-41
Copy link
Author

Ben-41 commented Jun 19, 2024

Hi Jinfang, the cgc_standard.out looks like this:
$ head cgc_standard.out
CGC# Gene Type Contig ID Protein ID Gene Start Gene Stop Direction Protein Family
CGC1 TC Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_47 44478 45509 - 1.A.33.1.5
CGC1 CAZyme Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_48 45726 48518 + GH2|GH2_e50
CGC1 STP Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_65 72550 73521 - SIS+CBS+CBS
CGC1 CAZyme Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_71 77847 79862 - GH36|GH36_e10
CGC1 TC Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_76 87219 88157 + 3.A.2.1.7
CGC1 TC Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_80 89523 91055 + 3.A.2.1.7
CGC1 TC Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_81 91077 91949 + 3.A.2.1.1
CGC1 TC Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_82 92024 93445 + 3.A.2.1.2
CGC1 STP Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_87 96421 97605 + Aminotran_1_2

During the prediction of CGCs, the manual said I need to have my own gff file, so I modified the gff file from Prodigal output, which change :
1 Group_1_2_bin.20_contig-100_0 Prodigal_v2.6.3 CDS 3 449 62.4 + 0 ID=0_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.651;conf=100.00;score=62.40;cscore=59.18;sscore=3.22;rsco re=0.00;uscore=0.00;tscore=3.22;
2 Group_1_2_bin.20_contig-100_0 Prodigal_v2.6.3 CDS 658 1245 11.3 - 0 ID=0_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.505;conf=93.04;score=11.28;cscore=6.79;sscore=4.49;rsc ore=1.54;uscore=0.03;tscore=3.57;
to
1 Group_1_2_bin.20_contig-100_0 Prodigal_v2.6.3 CDS 3 449 62.4 + 0 ID=Group_1_2_bin.20_contig-100_0_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.651;conf=100.00;score=62.40;cscore=59.18;sscore=3.22;rsco re=0.00;uscore=0.00;tscore=3.22;
2 Group_1_2_bin.20_contig-100_0 Prodigal_v2.6.3 CDS 658 1245 11.3 - 0 ID=Group_1_2_bin.20_contig-100_0_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.505;conf=93.04;score=11.28;cscore=6.79;sscore=4.49;rsc ore=1.54;uscore=0.03;tscore=3.57;

I don't know if this is wrong.

@ZhengJinfang1220
Copy link

Hi Jinfang, the cgc_standard.out looks like this: $ head cgc_standard.out CGC# Gene Type Contig ID Protein ID Gene Start Gene Stop Direction Protein Family CGC1 TC Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_47 44478 45509 - 1.A.33.1.5 CGC1 CAZyme Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_48 45726 48518 + GH2|GH2_e50 CGC1 STP Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_65 72550 73521 - SIS+CBS+CBS CGC1 CAZyme Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_71 77847 79862 - GH36|GH36_e10 CGC1 TC Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_76 87219 88157 + 3.A.2.1.7 CGC1 TC Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_80 89523 91055 + 3.A.2.1.7 CGC1 TC Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_81 91077 91949 + 3.A.2.1.1 CGC1 TC Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_82 92024 93445 + 3.A.2.1.2 CGC1 STP Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_87 96421 97605 + Aminotran_1_2

Thank you, Ben,
It looks normal in your input file.
Can you check the function "Read_cgc_standard_out", this function should defined at line 199 in file "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py". The code in line 201 in the same file means to open and read the file "cgc_standard.out" by line. It should skip the 1st line which is the header. Like in the following screenshot. Can you check what these are like in your code?

20240619094256

@ZhengJinfang1220
Copy link

Group_1_2_bin.20_contig-100_0_47

Yes, you did the correct modification on gff file. And you got the output file "cgc_standard.out". Otherwise, you can not get this output.

@Ben-41
Copy link
Author

Ben-41 commented Jun 19, 2024

image

@ZhengJinfang1220
Copy link

'Gene Start

It seems the codes also look normal. So, what happens? Could you check the input file again, to look for another string "Gene Start" except for 1st line? If this still does not work.

Can you send me all the input files? I will debug on my PC.
Here is my email: [email protected]

@powerby66
Copy link

Hello,i want to ask you some questions please,
dbcan_utils fam_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPMoutput/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
dbcan_utils CGC_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
dbcan_utils CGC_substrate_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
bash: dbcan_utils: command not found...
(/data2/jpc/env/dbcan414) jpc 18:09:17 ~/project/db_test/output/EscheriaColiK12MG1655_abund
dbcan_utils fam_substrate_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
bash: dbcan_utils: command not found...
(/data2/jpc/env/dbcan414) jpc 18:09:18 ~/project/db_test/output/EscheriaColiK12MG1655_abund
dbcan_utils CGC_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPMbash: dbcan_utils: command not found...
(/data2/jpc/env/dbcan414) jpc 18:09:18 ~/project/db_test/output/EscheriaColiK12MG1655_abund
dbcan_utils CGC_substrate_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
bash: dbcan_utils: command not found...
but i cant find the utils.py in /data2/jpc/env/dbcan414/lib/python3.7/site-packages/dbcan/utils,how can i solve this question?
1718964724642

@PaolaDiGianvito
Copy link

Hi, I have found a similar issue.
I'm trying ti follow the tutorial on raw reads. I have shotgun sequencing. I arrived in the tutorial at this poin: P13. dbcan_utils to calculate the abundance of CAZyme families, subfamilies, CGCs, and substrates (i have skipped the point P12 because I don't need a particular region, is it correct?...
when i run this command:
dbcan_utils fam_abund -bt IS1_EF.depth.txt -i ../subs/IS1_ef.dbCAN -a TPM
i have this error:
you are estimating the abundance of CAZyme!
Reads are single end!
Total read count: 156453394!
Can not find read count information for CAZyme: k141_10018_1.
In the directory IS3_ef.dbCAN i have all the 17 files...Can you help me?

@powerby66
Copy link

Hi, I have found a similar issue. I'm trying ti follow the tutorial on raw reads. I have shotgun sequencing. I arrived in the tutorial at this poin: P13. dbcan_utils to calculate the abundance of CAZyme families, subfamilies, CGCs, and substrates (i have skipped the point P12 because I don't need a particular region, is it correct?... when i run this command: dbcan_utils fam_abund -bt IS1_EF.depth.txt -i ../subs/IS1_ef.dbCAN -a TPM i have this error: you are estimating the abundance of CAZyme! Reads are single end! Total read count: 156453394! Can not find read count information for CAZyme: k141_10018_1. In the directory IS3_ef.dbCAN i have all the 17 files...Can you help me?

Hello,Have you successfully solved this problem?I met the question too

@PaolaDiGianvito
Copy link

PaolaDiGianvito commented Jul 1, 2024 via email

@ZhengJinfang1220
Copy link

Not yet, really. I have used metaeuk for genes prediction. Could be this the problem? Il Lun 1 Lug 2024, 11:11 powerby66 @.> ha scritto:

Hi, I have found a similar issue. I'm trying ti follow the tutorial on raw reads. I have shotgun sequencing. I arrived in the tutorial at this poin: P13. dbcan_utils to calculate the abundance of CAZyme families, subfamilies, CGCs, and substrates (i have skipped the point P12 because I don't need a particular region, is it correct?... when i run this command: dbcan_utils fam_abund -bt IS1_EF.depth.txt -i ../subs/IS1_ef.dbCAN -a TPM i have this error: you are estimating the abundance of CAZyme! Reads are single end! Total read count: 156453394! Can not find read count information for CAZyme: k141_10018_1. In the directory IS3_ef.dbCAN i have all the 17 files...Can you help me? Hello,Have you successfully solved this problem?I met the question too — Reply to this email directly, view it on GitHub <#179 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/BBFTCGZWCPNEMRKC3EXOXMLZKEMMFAVCNFSM6AAAAABJQCXDWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJZGYZTIMRRGM . You are receiving this because you commented.Message ID: @.
>

Not yet, really. I have used metaeuk for genes prediction. Could be this the problem? Il Lun 1 Lug 2024, 11:11 powerby66 @.> ha scritto:

Hi, I have found a similar issue. I'm trying ti follow the tutorial on raw reads. I have shotgun sequencing. I arrived in the tutorial at this poin: P13. dbcan_utils to calculate the abundance of CAZyme families, subfamilies, CGCs, and substrates (i have skipped the point P12 because I don't need a particular region, is it correct?... when i run this command: dbcan_utils fam_abund -bt IS1_EF.depth.txt -i ../subs/IS1_ef.dbCAN -a TPM i have this error: you are estimating the abundance of CAZyme! Reads are single end! Total read count: 156453394! Can not find read count information for CAZyme: k141_10018_1. In the directory IS3_ef.dbCAN i have all the 17 files...Can you help me? Hello,Have you successfully solved this problem?I met the question too — Reply to this email directly, view it on GitHub <#179 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/BBFTCGZWCPNEMRKC3EXOXMLZKEMMFAVCNFSM6AAAAABJQCXDWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJZGYZTIMRRGM . You are receiving this because you commented.Message ID: @.
>

Hi, guys. We have fixed the bug in the updated version of dbCAN(several months ago). If still use the older version. please follow the steps:
1 please check "line 93, utils.py", the definition of ReadBedtoos
2 please modify line 95: seqid2info = {line.split()[0]:bedtools_read_count(line.split()) for line in lines[1:]}
to seqid2info = {line.split()[0]:bedtools_read_count(line.split()) for line in lines[0:]}, a bug ignoring the first line of read count information.

@PaolaDiGianvito
Copy link

Thank you, i have another question.
I have used MetaEuk gor genes prediction, consequently i have to generate file.ffn with bedtools, what file is better to use? The output of metaeuk or the profigal.gff files generated at the substrate prediction? Than you
Paola

@yinlabniu
Copy link
Collaborator

yinlabniu commented Jul 1, 2024 via email

@PaolaDiGianvito
Copy link

PaolaDiGianvito commented Jul 1, 2024 via email

@PaolaDiGianvito
Copy link

PaolaDiGianvito commented Jul 1, 2024 via email

@yinlabniu
Copy link
Collaborator

yinlabniu commented Jul 1, 2024 via email

@PaolaDiGianvito
Copy link

PaolaDiGianvito commented Jul 2, 2024 via email

@PaolaDiGianvito
Copy link

Good morning,
i have tried to run my data from contigs of megahit avoiding metaeuk step and using file.fna and meta for CAZyme annotation, followed by ffn generation from prodigal gffr files and stp 8 and 11. At sep 13 I have ever this error
dbcan_utils fam_abund -bt GC1_D2.depth.txt -i ../GC1_D2.CAZyme -a TPM
You are estimating the abundance of CAZyme!
Reads are single end!
Total reads count: 43824341!
Can not find read count information for CAZyme: k141_111_4
...
I have changed line 95 in utils.py as suggested
def ReadBedtoos(filename):
lines = open(filename).readlines()
seqid2info = {line.split()[0]:bedtools_read_count(line.split()) for line in lines[0:]}
normalized_tpm = 0.
for seqid in seqid2info:
seqid_depth = seqid2info[seqid]
normalized_tpm += seqid_depth.read_count/seqid_depth.length
return seqid2info,normalized_tpm

Can you help me?

@powerby66
Copy link

没错。对于基于 CAZyme 的丰度分析,您只需要预测 CAZymes(在 p5 中提供您自己的 faa),在 p8 和 p11 中需要 ffn。可以跳过使用 contigs 的任何过程。
________________________________ 来自:Paola88 @.> 发送时间:2024 年 7 月 1 日星期一上午 11:31 收件人:linnabrown/run_dbcan @.> 抄送:Yanbin Yin @.>;评论 @.> 主题:回复:[linnabrown/run_dbcan] dbcan_utils CGC_substrate_abund 和 dbcan_utils CGC_abund 错误(问题#179)警告:非 NU 电子邮件抱歉提出这个问题,但如果我不生成文件,我如何估计丰度,如果我不明白,我需要深度文件。谢谢 2024 年 7 月 1 日星期一,17:38 Paola Di Gianvito @.> 写道:
谢谢你的回答。如果我明白,我必须执行教程中的步骤 p5 和 p9 之后的步骤,对吗?2024 年 7 月 1 日,16:45 Yanbin Yin @.
> 写道: > Paola, > > 我们不建议使用 run_dbcan 进行 CGC 预测和基于 CGC 的丰度分析。原因是真核生物中不存在 CGC/PUL 概念。从 metaeuk 生成的 gff 包含外显子,这些外显子在 run_dbcan 中会被错误地视为单独的 CDS/基因。但是,您仍然可以使用 run_dbcan 进行 CAZyme 预测和基于 CAZyme 的丰度分析,因为不会使用 gff 文件。 > > Yanbin > ________________________________ > 发件人:Paola88 @.> > 发送时间:2024 年 7 月 1 日星期一 9:24 AM > 收件人:linnabrown/run_dbcan @.> > 抄送:已订阅 @.> > 主题:回复:[linnabrown/run_dbcan] dbcan_utils CGC_substrate_abund 和 dbcan_utils CGC_abund 错误(问题#179) > > 警告:非 NU 电子邮件 > > > 谢谢,我还有一个问题。 > 我已经使用 MetaEuk gor 基因预测,因此我必须用 bedtools 生成 file.ffn,最好使用哪个文件?metaeuk 的输出还是在底物预测时生成的 profigal.gff 文件?谢谢你 > Paola > > — > 直接回复此电子邮件、在 GitHub 上查看< > #179 (评论) >,> 或取消订阅< > https://github.com/notifications/unsubscribe-auth/AEXNKZS2ENS74QD5QZC5FLLZKFRBRAVCNFSM6AAAAABJQCXDWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMBQGI4TSMJRGQ >。 > > 您收到此邮件是因为您订阅了此主题。 > > — > 直接回复此电子邮件、在 GitHub 上查看 > < #179 (评论) >, > 或取消订阅 > < https://github.com/notifications/unsubscribe-auth/BBFTCG4ZUQ44P6QVQJVT4WTZKFTSLAVCNFSM6AAAAABJQCXDWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMBQGM2TONJZHA > > . > 您收到此邮件是因为您发表了评论。消息 ID: > @.> >
— 直接回复此电子邮件,在 GitHub 上查看< #179 (评论) >,或取消订阅< https://github.com/notifications/unsubscribe-auth/AEXNKZRH75FTSOGCUWDQI6TZKF76TAVCNFSM6AAAAABJQCXDWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMBQGU4DONZQGQ >。您收到此邮件是因为您发表了评论。

May I ask if the GFF and FFN annotation files predicted by Prokka for fungi and protozoa are reliable?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants