Enable "streaming" bam reads in read.smartseq2.bams #21

Open · 1 commit to merge into base: master

pdexheimer

Currently, read.smartseq2.bams splits the list of BAM files to be read into n.cores sublists, then spawns n.cores processes to read the data. Each process first reads all of the BAM files assigned to it, then works through all of their reads, assigning them to genes and read classes (exonic, intronic, etc.). The problem is that all of the input data is held in memory at the same time, which can quickly exhaust physical memory (cf. #14).

This PR adds an optional parameter, stream.bams, to read.smartseq2.bams. When enabled, each child process reads and processes only one BAM file at a time. I expect this to slow the calculation down, but it makes much larger data sets tractable: with this change, I was able to read in 494 BAM files totaling ~100GB on my desktop computer with 16GB of RAM. The parameter defaults to false, which keeps the existing behavior.
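To make the memory difference concrete, here is a small language-agnostic sketch in Python. It is not the package's R code. `FakeReads`, `split_into_chunks`, `process_batch`, `process_streaming`, and `peak_loaded` are hypothetical names invented for the illustration, the round-robin split is an assumption about how the file list might be divided, and the live-object counting relies on CPython's reference counting.

```python
from typing import Callable, Dict, List, Sequence


class FakeReads:
    """Stand-in for the reads of one BAM file; counts how many are alive."""

    live = 0  # objects currently in memory
    peak = 0  # highest value `live` has reached

    def __init__(self, path: str) -> None:
        self.path = path
        FakeReads.live += 1
        FakeReads.peak = max(FakeReads.peak, FakeReads.live)

    def __del__(self) -> None:
        FakeReads.live -= 1


def split_into_chunks(paths: Sequence[str], n_workers: int) -> List[List[str]]:
    """Divide the file list into one sublist per worker (round-robin, assumed)."""
    return [list(paths[i::n_workers]) for i in range(n_workers)]


def summarize(reads: FakeReads) -> int:
    """Placeholder for assigning reads to genes and read classes."""
    return len(reads.path)


def process_batch(paths: Sequence[str]) -> Dict[str, int]:
    """Load every file in the chunk first, then summarize them all."""
    loaded = [FakeReads(p) for p in paths]  # whole chunk is in memory here
    return {r.path: summarize(r) for r in loaded}


def process_streaming(paths: Sequence[str]) -> Dict[str, int]:
    """Load, summarize, and release one file at a time."""
    results: Dict[str, int] = {}
    for p in paths:
        reads = FakeReads(p)
        results[p] = summarize(reads)
        del reads  # release this file before loading the next one
    return results


def peak_loaded(process: Callable[[Sequence[str]], Dict[str, int]],
                paths: Sequence[str]) -> int:
    """Run `process` and report the most files held in memory at once."""
    FakeReads.live = 0
    FakeReads.peak = 0
    process(paths)
    return FakeReads.peak
```

For a chunk of three files, `peak_loaded(process_batch, chunk)` reports 3 while `peak_loaded(process_streaming, chunk)` reports 1, and both functions return the same results. Per-worker peak memory therefore scales with the largest single file rather than with the whole chunk.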

I also added an early termination condition for when the sample names are not specified on input, because I got burned by forgetting to specify them.
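The idea of the early check can be sketched as follows, again in Python rather than the package's R. `check_sample_names` is a hypothetical helper, and the length comparison is an extra assumption that this PR does not describe.

```python
from typing import List, Optional, Sequence


def check_sample_names(bam_files: Sequence[str],
                       sample_names: Optional[Sequence[str]]) -> List[str]:
    """Fail before any file is read if sample names are missing."""
    if not sample_names:
        raise ValueError("sample names must be specified for the input files")
    # Assumption for illustration only: expect one name per input file.
    if len(sample_names) != len(bam_files):
        raise ValueError("expected one sample name per input file")
    return list(sample_names)
```

Running the check first means a forgotten argument surfaces immediately instead of after a long read of all the input files.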
