Enable "streaming" bam reads in read.smartseq2.bams #21
Currently, `read.smartseq2.bams` divides the list of bams to be read into `n.cores` different lists, then spawns `n.cores` processes to actually read the data. Each of those processes reads in all of the bams requested of it, then works through all of the reads from those bam files, assigning them to genes and read classes (exonic, intronic, etc.). This is a problem because it means that all of the input data is in memory at the same time, which can quickly exhaust physical memory (cf. #14).

This PR adds an optional parameter, `stream.bams`, to `read.smartseq2.bams`. If enabled, this parameter will cause each of the child processes to only read and process one bam at a time. I expect this to slow calculations down, but it makes much larger data sets calculable. With this change, I was able to read in 494 bam files totaling ~100GB on my desktop computer with 16GB of RAM. The parameter defaults to false, which keeps the existing behavior.

I also added an early termination condition if the sample names are not specified on input, because I got burned by forgetting that