Enable "streaming" bam reads in read.smartseq2.bams #21

pdexheimer · 2018-03-27T19:18:42Z

Currently, read.smartseq2.bams splits the list of bams to be read into n.cores sublists, then spawns n.cores processes to actually read the data. Each process reads in all of the bams assigned to it, then works through all of the reads from those files, assigning them to genes and read classes (exonic, intronic, etc.). This is a problem because all of the input data is in memory at the same time, which can quickly exhaust physical memory (cf. #14).

This PR adds an optional parameter, stream.bams, to read.smartseq2.bams. When enabled, each child process reads and processes only one bam at a time. I expect this to slow the calculation down, but it makes much larger data sets tractable. With this change, I was able to read in 494 bam files totaling ~100GB on my desktop computer with 16GB of RAM. The parameter defaults to false, which keeps the existing behavior.
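As a rough illustration of the difference between the two modes, here is a language-neutral sketch in Python. It is not the package's R code: `load_bam`, `count_reads`, `process_batch`, and `process_streaming` are hypothetical names, a "bam" is faked as a list of gene names (one per read), and "peak" simply counts how many reads are held at once within a single worker.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for bam files: path -> gene hit by each read.
FakeFiles = Dict[str, List[str]]
Counts = Dict[str, Dict[str, int]]


def load_bam(path: str, files: FakeFiles) -> List[str]:
    """Hypothetical loader: return every read of one file as a list."""
    return list(files[path])


def count_reads(reads: List[str]) -> Dict[str, int]:
    """Tally reads per gene for a single file."""
    counts: Dict[str, int] = {}
    for gene in reads:
        counts[gene] = counts.get(gene, 0) + 1
    return counts


def process_batch(paths: List[str], files: FakeFiles) -> Tuple[Counts, int]:
    """Load every file first, then count: all reads are held at once."""
    loaded = {p: load_bam(p, files) for p in paths}
    peak = sum(len(reads) for reads in loaded.values())
    counts = {p: count_reads(reads) for p, reads in loaded.items()}
    return counts, peak


def process_streaming(paths: List[str], files: FakeFiles) -> Tuple[Counts, int]:
    """Load and count one file at a time: only one file's reads are held."""
    counts: Counts = {}
    peak = 0
    for p in paths:
        reads = load_bam(p, files)  # replaces the previous file's reads
        peak = max(peak, len(reads))
        counts[p] = count_reads(reads)
    return counts, peak
```

Both functions return the same per-file counts. The only difference is the peak number of reads held by a worker: the sum over all of its files in the batch case, versus the size of its largest single file in the streaming case.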

I also added an early termination condition for when the sample names are not specified on input, because I got burned by forgetting that.

Added stream.bam parameter to read.smartseq2.bams

8e29d10
