
Where are you Sulfobacillus? #255

Open
DntBScrdDv opened this issue Feb 9, 2023 · 33 comments

@DntBScrdDv

Hi,

I'm running make p_compressed, but I think it may be stuck? I modified the Makefile to use 12 threads instead of one (which was taking forever), but it's been stuck here for nearly two hours... is this normal, or should I kill it and try again?

SYSTEM CALL: jellyfish count --if tmp_tmp_0_562.testingKmer -o tmp_tmp_0_562.jf -m 53 -s 5000000 -C -t 8 tmp_0_562_2460.fa
finished
SYSTEM CALL: jellyfish dump tmp_tmp_0_562.jf > tmp_tmp_0_562.jf_dump
finished

Thanks!

@mourisl
Collaborator

mourisl commented Feb 9, 2023

That's very strange. You should see a message like "Selecting two genomes to merge" soon after this step. Are you running it on a cluster server or a local machine? Thank you.

@DntBScrdDv
Author

Hi, thanks for the reply. I'm running it on a local machine (wsl: 14 threads, 24GB RAM).

I also tried running the non-compressed version, which failed with a memory limitation error. Perhaps I need to try limiting the mem usage?

@mourisl
Collaborator

mourisl commented Feb 11, 2023

I think 24GB is not enough for index building. It is quite a memory-intensive task, so we usually create the index on a computing server with a large amount of memory. Would the pre-built index work for you?

@DntBScrdDv
Author

Ah, I wondered. I ran it again and got the following error:

real 780m2.875s
user 3428m2.890s
sys 148m14.046s
mv: cannot stat 'reference-sequences/all-compressed-bacteria.fna.tmp.fa': No such file or directory
make[1]: *** [Makefile:229: reference-sequences/all-compressed-bacteria.fna] Error 1
make[1]: Leaving directory '/home/shep/centrifuge/indices'
make: *** [Makefile:137: p_compressed] Error 2

@DntBScrdDv
Author

How old is the pre-built index?

For the full background, I used the WIMP classifier in EPI2ME to analyse some MINION data.

I have a very simple community in one of the samples, comprising Leptospirillum ferriphilum, Acidithiobacillus caldus and Sulfobacillus thermosulfidooxidans.

When I did de novo genome assembly on the data, I got all three organisms. WIMP just picks up the first two and doesn't hit Sulfobacillus at all.

My only thought is that the WIMP centrifuge index is out of date?

@mourisl
Collaborator

mourisl commented Feb 13, 2023

Sorry for the delayed reply. Somehow Sulfobacillus is not in the database, even though NCBI RefSeq lists it as a complete genome. Maybe it is an issue with my system; I'll check that later. There is a more recent database at: https://zenodo.org/record/3732127 , would you please give it a try? It probably takes about 21G of memory, so your local machine should handle it.

@DntBScrdDv
Author

No worries - thanks again for the replies. I am recalling the fast5 files, and will then test with your database and let you know.

Yes - weird about Sulfobacillus... it's a really important genus for us as well! :)

@DntBScrdDv
Author

Hmmm....

Reading ftab (1048577): 09:17:34
Reading eftab (20): 09:17:34
Reading offs (4418934637 64-bit words): 09:17:34
(ERR): centrifuge-class died with signal 9 (KILL)

Maybe I can't give it enough memory? Maybe it's not possible to run on a laptop?

@DntBScrdDv
Author

Ah - I hadn't allocated enough RAM to WSL. :)

@DntBScrdDv DntBScrdDv changed the title make p_compressed stuck? Where are you Sulfobacillus? Feb 16, 2023
@DntBScrdDv
Author

Right, so I re-ran the analysis using the database you suggested, and still no Sulfobacillus...

@DntBScrdDv
Author

PS - is it normal that the calculated abundance in the tsv file is always 0.0??

@mourisl
Collaborator

mourisl commented Feb 16, 2023

I just checked the assembly_summary file downloaded from NCBI RefSeq, and it seems there is no complete genome for Sulfobacillus; they are all at the Scaffold or Contig level. Therefore, they are ignored in the default index building.

GCF_001280565.1 PRJNA224116 SAMN03894429 LGRO00000000.1 na 28034 28034 Sulfobacillus thermosulfidooxidans strain=CBAR-13 latest Scaffold Major Full 2015/09/09 ASS.CBAR13.1 Fundacion Ciencia & Vida GCA_001280565.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/280/565/GCF_001280565.1_ASS.CBAR13.1 na
GCF_001953275.1 PRJNA224116 SAMN05590342 MDZE00000000.1 na 28034 28034 Sulfobacillus thermosulfidooxidans strain=ZBY latest Contig Major Full 2017/01/17 ASM195327v1 Central South University GCA_001953275.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/953/275/GCF_001953275.1_ASM195327v1 na
GCF_001953285.1 PRJNA224116 SAMN05590279 MDZD00000000.1 na 28034 28034 Sulfobacillus thermosulfidooxidans strain=DX latest Contig Major Full 2017/01/17 ASM195328v1 Central South University GCA_001953285.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/953/285/GCF_001953285.1_ASM195328v1 na
GCF_001953295.1 PRJNA224116 SAMN05590356 MDZF00000000.1 na 28034 28034 Sulfobacillus thermosulfidooxidans strain=ZJ latest Contig Major Full 2017/01/17 ASM195329v1 Central South University GCA_001953295.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/953/295/GCF_001953295.1_ASM195329v1 na
GCF_900176145.1 PRJNA224116 SAMN00768000 FWWY00000000.1 representative genome 929705 28034 Sulfobacillus thermosulfidooxidans DSM 9293 strain=DSM 9293 latest Scaffold Major Full 2017/04/07 IMG-taxon 2506210005 annotated assembly DOE - JOINT GENOME INSTITUTE GCA_900176145.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/176/145/GCF_900176145.1_IMG-taxon_2506210005_annotated_assembly assembly from type material na
GCF_000294425.1 PRJNA224116 SAMN02471698 ALWJ00000000.1 na 1214914 28034 Sulfobacillus thermosulfidooxidans str. Cutipay strain=Cutipay latest Scaffold Major Full 2012/09/04 ASM29442v1 Mathomics GCA_000294425.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/294/425/GCF_000294425.1_ASM29442v1 na
GCF_000497695.1 PRJNA224116 SAMN02251434 AWEW00000000.1 na 1366420 28034 Sulfobacillus thermosulfidooxidans ST strain=strain ST latest Scaffold Major Full 2013/11/20 SulTherST Central South University GCA_000497695.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/497/695/GCF_000497695.1_SulTherST na
GCF_002903155.1 PRJNA224116 SAMN07661080 NWUL00000000.1 na 2039167 2039167 Sulfobacillus sp. hq2 strain=hq2 latest Scaffold Major Full 2018/01/29 ASM290315v1 Chinese Academy of Sciences GCA_002903155.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/903/155/GCF_002903155.1_ASM290315v1 na
GCF_012933365.1 PRJNA224116 SAMN14676984 JABBVZ000000000.1 na 2729629 2729629 Sulfobacillus sp. DSM 109850 strain=DSM 109850 latest Contig Major Full 2020/04/29 ASM1293336v1 Federal Institute for Geosciences and Natural Resources GCA_012933365.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/012/933/365/GCF_012933365.1_ASM1293336v1 na

One way is to build an index for only the three species, which I think your local machine should be able to handle.
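A minimal sketch of that selection step, assuming the standard tab-separated assembly_summary.txt layout (field 7 is species_taxid, field 20 is ftp_path); the two demo rows below are fabricated stand-ins for real summary lines:

```shell
# Two fabricated demo rows in the standard 20-field assembly_summary layout
# (field 7 = species_taxid, field 20 = ftp_path):
printf 'GCF_X\tPRJX\tSAMNX\tWGSX\tna\t28034\t28034\tSulfobacillus thermosulfidooxidans\tstrain=X\tna\tlatest\tScaffold\tMajor\tFull\t2015/09/09\tASM1\tLab\tGCA_X\tidentical\thttps://example/sulfo\n' > assembly_summary.demo.txt
printf 'GCF_Y\tPRJY\tSAMNY\tWGSY\tna\t562\t562\tEscherichia coli\tstrain=Y\tna\tlatest\tContig\tMajor\tFull\t2016/01/01\tASM2\tLab\tGCA_Y\tidentical\thttps://example/ecoli\n' >> assembly_summary.demo.txt

# Keep only assemblies whose species_taxid is 28034 and print the download URL:
awk -F '\t' '$7 == "28034" { print $20 }' assembly_summary.demo.txt
# prints https://example/sulfo
```

In practice you would run the awk line against the real assembly_summary.txt from NCBI and feed the resulting URLs to a downloader.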

@DntBScrdDv
Author

Ah-ha. Thanks so much for all this help.

I can't build a database just for those species as others are important as well, but I should almost always detect Sulfobacillus spp. in my samples.

Is it possible to build a database just of 16S genes? (i.e. maybe use the NCBI 16S curated db - I can't remember the name just now)

Should I forget local classification?

Thanks again for all your time.

@mourisl
Collaborator

mourisl commented Feb 16, 2023

Yes, I built a database for SILVA several years ago at: ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data . It may have Sulfobacillus.

I think the SILVA database is not very large, so you may be able to build it locally, though it is not directly supported by the "make" database script in Centrifuge. Once you use centrifuge-download to get the taxonomy tree and taxonomy names, you can create your own sequence-id-to-taxonomy-id conversion table, then run centrifuge-build to create the index.
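The recipe sketched in this comment might look roughly as follows, using centrifuge-download's taxonomy mode and centrifuge-build's documented options; filenames are illustrative, and how you map SILVA sequence IDs to NCBI taxonomy IDs is the manual step left to you:

```shell
# Fetch the NCBI taxonomy tree and names (writes taxonomy/nodes.dmp and
# taxonomy/names.dmp):
centrifuge-download -o taxonomy taxonomy

# seqid2taxid.map must hold one "sequence-id<TAB>taxonomy-id" line per
# sequence in the FASTA; building this mapping for SILVA is up to you.

# Build the index (filenames and thread count are illustrative):
centrifuge-build -p 4 \
    --conversion-table seqid2taxid.map \
    --taxonomy-tree taxonomy/nodes.dmp \
    --name-table taxonomy/names.dmp \
    silva_sequences.fa silva_index
```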

@DntBScrdDv
Author

Thanks again. I downloaded your SILVA db, and re-ran the sample... To be fair, it has Sulfobacillus, which is great, but the overall results of the run don't make much sense; it should be the three species (Acidithiobacillus caldus, Leptospirillum ferriphilum and Sulfobacillus thermosulfidooxidans) alone (or at least hugely dominant), but ranking by read count gives the following top ten:

name taxID taxRank genomeSize numReads numUniqueReads abundance
Brassica juncea var. tumida 323352 varietas 58062 10842 10238 0.0
[Pseudomonas] geniculata 86188 species 47237 10776 8942 0.0
uncultured bacterium 77133 species 384153777 11643 3105 0.0
Ensete ventricosum 4639 species 13962 3382 2300 0.0
Hevea brasiliensis 3981 species 42518 2919 2007 0.0
Streptococcus pneumoniae 1313 species 4015 1994 1329 0.0
Leptospirillum ferriphilum YSK 1441628 leaf 1563 1827 979 0.0
Acidobacteria bacterium JGI 0001001-H03 1235986 species 1498 495 495 0.0
uncultured Tremellaceae 364294 species 59794 1113 461 0.0
Larimichthys crocea 215358 species 10858 552 422 0.0

I'm wondering if there is actually a solution for our needs... :-/

@mourisl
Collaborator

mourisl commented Feb 16, 2023

You can inspect the classification result file to check the assignment score for each read. I think you can use a score threshold of 128 or 256 to filter out many false positives, then run "centrifuge-kreport" on the filtered classification file to obtain a summary of abundances and see whether the composition becomes cleaner.
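A hedged sketch of that filtering step. In Centrifuge's per-read classification output the assignment score is column 4 (columns: readID, seqID, taxID, score, 2ndBestScore, hitLength, queryLength, numMatches); the demo reads below are fabricated:

```shell
# Fabricated demo classification file in Centrifuge's output shape:
printf 'readID\tseqID\ttaxID\tscore\t2ndBestScore\thitLength\tqueryLength\tnumMatches\n' >  demo_result.out
printf 'read1\tseqA\t28034\t4225\t0\t80\t1000\t1\n'                                      >> demo_result.out
printf 'read2\tseqB\t4639\t36\t0\t21\t900\t1\n'                                          >> demo_result.out

# Keep the header plus reads with an assignment score of at least 128:
awk 'NR == 1 || $4 >= 128' demo_result.out > demo_filtered.out

# The filtered file would then be summarized with, e.g.:
#   centrifuge-kreport -x <index-prefix> demo_filtered.out > filtered.kreport
```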

@DntBScrdDv
Author

DntBScrdDv commented Feb 17, 2023

Thanks, I gave this a try (with either 128 or 256):

awk 'NR==1 || $4 >= 128' sil_biomx_c_result.out > filter_result_128.out

But the results are still crazy. It at least picks up my species, but loads of others too, including things like salmon, Chinese mustard and Ethiopian banana as the dominant organisms (!) and Pseudomonas geniculata as the dominant bacterium.

(I just used your SILVA db)

As a temporary work-around, is it possible to append the Sulfobacillus genomes to the index, even if they're not "complete"? (i.e. add the scaffold-level assemblies?)

@mourisl
Collaborator

mourisl commented Feb 19, 2023

Centrifuge does not support adding sequences to an existing index on the fly :( . I think another approach is to run the classification against the original full index first, then run the SILVA db on the unclassified reads.

@DntBScrdDv
Author

So, I tried a different approach, with some trial and error! I pulled the taxids for the organisms I'm likely to see (bacteria and archaea) and used these to limit centrifuge-download, which I asked to pull Complete Genomes and Scaffolds. I think this was around 550 sequences, which my laptop managed to process...

Using this index gave me far more reasonable results: it allowed me to detect my Sulfobacillus species...

The report.tsv still has just zeros for the abundance; is this normal?

Thanks again for all the help!
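The taxid-limited approach described above might look something like the sketch below, using centrifuge-download's documented -d / -a / -t options. The taxid 28034 (Sulfobacillus thermosulfidooxidans, from earlier in this thread) stands in for the poster's full taxid list, which is not shown:

```shell
# Download selected taxids from RefSeq; centrifuge-download prints the
# seqid-to-taxid map on stdout and saves sequences under the -o folder.
# -a "Any" relaxes the default "Complete Genome" assembly-level filter.
centrifuge-download -o library -d "bacteria,archaea" -a "Any" \
    -t 28034 refseq > seqid2taxid.map   # append the rest of your taxids to -t

# Fetch the taxonomy tree and names:
centrifuge-download -o taxonomy taxonomy

# Concatenate the downloaded sequences and build the index:
cat library/*/*.fna > input-sequences.fna
centrifuge-build -p 8 \
    --conversion-table seqid2taxid.map \
    --taxonomy-tree taxonomy/nodes.dmp \
    --name-table taxonomy/names.dmp \
    input-sequences.fna custom_index
```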

@mourisl
Collaborator

mourisl commented Feb 19, 2023

Glad you figured out a way to resolve this!

For the abundance, do all the species in the report have 0 abundance, or are there still some taxonomy IDs with a nonzero abundance value?

@DntBScrdDv
Author

They are all zero. I don't think it's linked to my custom index, as it was the same when I tried your pre-compiled database at the start.

@mourisl
Collaborator

mourisl commented Feb 23, 2023

That is a known issue in Centrifuge, but somehow I could not reproduce this error. The only time I got close was when there was a very short genome in the index; the abundance estimation algorithm gives it a high value because of the length normalization factor. I think you can run "centrifuge-kreport" or another summary that uses unique assignments to approximate the abundance.
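A hedged sketch of that unique-read approximation, using the report column layout shown earlier in this thread (column 6 is numUniqueReads); the two demo rows are fabricated, with genome sizes and counts invented for illustration:

```shell
# Fabricated demo report in the layout from earlier in this thread:
printf 'name\ttaxID\ttaxRank\tgenomeSize\tnumReads\tnumUniqueReads\tabundance\n' >  demo_report.tsv
printf 'Sulfobacillus thermosulfidooxidans\t28034\tspecies\t3800000\t1000\t800\t0.0\n' >> demo_report.tsv
printf 'Leptospirillum ferriphilum YSK\t1441628\tspecies\t2400000\t500\t200\t0.0\n'    >> demo_report.tsv

# Approximate relative abundance as each taxon's share of unique reads:
awk -F '\t' 'NR > 1 { unique[$1] = $6; total += $6 }
             END { for (name in unique) printf "%s\t%.4f\n", name, unique[name] / total }' demo_report.tsv
```

This sidesteps the zeroed abundance column entirely by renormalizing numUniqueReads, at the cost of ignoring multi-mapped reads and genome-length effects.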

@DntBScrdDv
Author

Ah, thank you. If you like, I can share my data & index with you?

A stupid question: when using Pavian to view the kreport, are the read counts shown on the Sankey diagram for each OTU unique reads? i.e. an indicator of relative abundance?

@mourisl
Collaborator

mourisl commented Feb 23, 2023

Yes, it would be very helpful to have your data!

I'm not sure which column Pavian loads to do the visualization. Maybe you can ask its author Florian?

@DntBScrdDv
Author

DntBScrdDv commented Feb 23, 2023 via email

@mourisl
Collaborator

mourisl commented Feb 23, 2023

My email is [email protected]; maybe a Google Drive link, a Dropbox link, or any way you prefer. Thank you!

@DntBScrdDv
Author

DntBScrdDv commented Feb 23, 2023 via email

@mourisl
Collaborator

mourisl commented Feb 23, 2023

Thank you! I will check it.

@mansi-aai

Hello @mourisl @DntBScrdDv !

Command :
make p_compressed # bacterial genomes compressed at the species level [~4.2G]

I am running on an AWS c6a.2xlarge instance (8 vCPUs, 16GB memory). The command has been running for 2 days and I now think it is stuck (screenshot attached). Can anyone guide me on what I should do? Stop this and rerun on a larger instance, or just wait? Thank you!

@mourisl
Collaborator

mourisl commented Oct 21, 2023

Index building is very memory-consuming, and you may need around 400G of memory to build the index for a recent prokaryotic database. Furthermore, the compression step might be too slow for a recent database as well, so I think you may just want to use "make p".

@mansi-aai

@mourisl Thank you !
Have you had the opportunity to identify the root cause and a potential solution for the issue mentioned in this thread?
I've been working to obtain abundance info for ONT data, and I would greatly appreciate your insights. Once again, thank you!

@mourisl
Collaborator

mourisl commented Oct 30, 2023

I think one reason is that there might be some super-small genomes in the database, such as vectors. Since the abundance estimation takes genome length into account, the small genomes will get a lot of abundance in the end. In your output, do you see all the abundances at 0, or is there one with >99 abundance?

@mansi-aai

@mourisl All 0.0. I'm attaching the output from Centrifuge here; let me know if you need more info from my end. Thank you so much for your response!
DRR225043_classification_report.txt
