
Explain what capped 8 / 16 for the kraken DBs means #33

Open
paulzierep opened this issue Aug 1, 2024 · 4 comments

@paulzierep

Could you kindly explain how the capped kraken2 DBs are produced? Is the input randomly subsampled, or the DB itself?

@ChillarAnand

The DB itself is capped at 8/16 GB. That's why the size of those DBs is limited to 8 GB / 16 GB.

https://benlangmead.github.io/aws-indexes/k2
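For context (not stated in this thread): Kraken2's own build script exposes a `--max-db-size` option, given in bytes, that downsamples minimizers until the hash table fits under the requested limit. The capped DBs can presumably be reproduced along these lines. A minimal sketch, assuming `kraken2-build` is on the PATH; the DB name and cap value below are illustrative only:

```python
import subprocess

# Hypothetical example: build a Kraken2 standard DB with the hash table
# capped at 8 GiB. --max-db-size takes the cap in bytes; if the full table
# would exceed it, kraken2-build downsamples minimizers to fit.
cap_bytes = 8 * 1024**3  # 8 GiB cap (illustrative)

subprocess.run(
    [
        "kraken2-build",
        "--standard",                # use the standard reference library
        "--db", "k2_standard_08gb",  # output DB directory (hypothetical name)
        "--threads", "16",
        "--max-db-size", str(cap_bytes),
    ],
    check=True,
)
```

Whether the published capped indices were built exactly this way is what issue #31 asks the maintainers to confirm.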

@paulzierep
Author

Thank you very much for the response, but could you explain how this is done technically? FYI, I have a student who is investigating the performance of Kraken2 DBs, and we are also looking into the effect of the capped DBs; it would be good if we could explain what the technical difference is.

@ChillarAnand

I have already requested that the scripts that are used to build these indices be shared.

#31

There has been no response yet. Once the scripts are available, we will know exactly how the DBs are capped.

I also build a wide variety of Kraken indices and created kraken-db-builder to speed up index building. You can take a look if you are interested.

https://github.com/AvilPage/kraken-db-builder

@incoherentian

Hey! The RAM-friendly DBs are indexed the same way as the other full DBs. @BenLangmead et al. then subsample the resulting k-mers from each genome until they fit within the variously size-constrained indices. So the more genomes included, the smaller the k-mer subsample for each included genome... hope that makes sense? That's my interpretation at least; hopefully it's correct.
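To make that subsampling idea concrete, here is a small Python sketch (my own illustration, not the maintainers' code): each minimizer is kept only if its hash falls below a threshold set by the target fraction, which gives an effectively uniform random downsample across all genomes. As far as I understand, this hash-threshold approach is roughly what Kraken2's `--max-db-size` reduction does; treat the numbers and function names as assumptions.

```python
import hashlib

def keep_minimizer(minimizer: str, keep_fraction: float) -> bool:
    """Keep roughly `keep_fraction` of distinct minimizers via a
    deterministic hash threshold (illustrative, not Kraken2's exact code)."""
    # Map the minimizer to a pseudo-random value in [0, 1).
    h = int.from_bytes(
        hashlib.blake2b(minimizer.encode(), digest_size=8).digest(), "big"
    )
    return (h / 2**64) < keep_fraction

# Illustrative numbers only: if the full table would need ~60 GB but must
# fit in 8 GB, only about 8/60 of the minimizers can be kept.
keep_fraction = 8 / 60

minimizers = ["ACGTACGTACGTACG", "TTGACCAGTACGTAA", "GGGCATTACAGGTCA"]
kept = [m for m in minimizers if keep_minimizer(m, keep_fraction)]
print(f"kept {len(kept)} of {len(minimizers)} example minimizers")
```

Because the decision depends only on the minimizer's hash, every genome is thinned at the same rate, which matches the point above: adding more genomes shrinks each genome's share of the capped table.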

It would probably make sense to first reduce the number of input genomes for over-represented species, but that would require some subjective choices and way, way too much manual curation. Curious what your student turns up, @paulzierep.

I originally found this page when googling who to thank for these prebuilt DBs, as they've saved me a lot of effort and high-mem node queuing over the last couple of years. ("This project is maintained by BenLangmead" in the corner of the project site did not initially clue me in, so I'm clearly not very brilliant.) I am thankful though: thanks, DB maintainers!
