Issue merging many large bsseq objects with biscuiteer::unionize() #25
Comments
This is an interesting point -- normally when I'm prepping a large biscuit/bsseq analysis, I jointly call all of the variants into one large VCF and then turn that into a merged BED for loading. In principle, it should be possible to merge everything on the fly, either by reading directly into HDF5 backing files and stacking the columns, or merging the VCFs and BEDs on the fly prior to writing to a joint HDF5 (possibly a better plan).
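Roughly, the joint-loading path looks like this. This is a minimal sketch only: the file paths are made up, and the readBiscuit() argument names should be checked against ?readBiscuit for the version you have installed.

```r
library(biscuiteer)

## jointly-called, tabixed outputs covering all samples (hypothetical paths)
joint_bed <- "all_samples_pileup.bed.gz"
joint_vcf <- "all_samples_pileup.vcf.gz"

## one pass over the joint files yields a single multi-sample bsseq object;
## 'merged' should match how the BED was produced (see ?readBiscuit), and
## some versions also expose an HDF5-style argument to keep the large
## methylation/coverage matrices on disk rather than in RAM
bs <- readBiscuit(BEDfile = joint_bed,
                  VCFfile = joint_vcf,
                  merged  = TRUE)
```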
biscuiteer::unionize() exists for exactly this purpose, but it is slow, and I would recommend against using it (although when we benchmarked it against the faster intersect-based merging approach, the union-based merge gave a clear and substantial improvement in DMR calling performance, e.g. for eRRBS-based data).
With the BioC 3.11 release coming up, we might want to stub in some functions and API-type endpoints to support this, regardless of whether they can be fully tested in time -- there are a handful of other cleanups that need to happen, so maybe now is the time.
Tim
________________________________________
From: deanpettinga
Sent: Sunday, April 19, 2020 10:41 PM
Subject: [trichelab/biscuiteer] Issue merging many large bsseq objects with biscuiteer::unionize() (#25)
Working on my first analysis w/ BISCUIT/biscuiteer, but I’ve encountered some issues handling the data. I have 20 gzip/tabix’d VCFs (15-20 GB each) with accompanying bed.gz files. Biscuiteer seems to be working just fine with small/toy datasets. However, I’ve been having issues merging all these samples into a single bsseq object. I think part of the issue is simply due to the large sample number and the amount of data for each sample. I have attempted to solve this issue with two approaches that have failed thus far:
1. biscuiteer::readBiscuit() for each sample individually and then use biscuiteer::unionize() to get a single object.
2. Merge vcf.gz and bed.gz files on the command line and then import together using biscuiteer::readBiscuit()
Do you have any advice for a better/ideal approach in this situation?
thanks in advance!
Dean,
When you report failures of this sort, could you include a sessionInfo(), the amount of RAM you have, and the error message from each failure?
A minimal reproducible example will be helpful, although I'm reasonably sure I know what's going on here and how it can be fixed.
That said, intuition is no substitute for actual data points.
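Something along these lines is plenty, using only base R; the unionize() call below is just a stand-in for whichever call actually failed on your end, and bs1/bs2 are placeholder objects.

```r
sessionInfo()   # R version, platform, and loaded package versions
gc()            # memory currently in use by the R session

## capture the exact error text from the failing step --
## substitute your real readBiscuit()/unionize() call here
err <- tryCatch(unionize(bs1, bs2),
                error = function(e) conditionMessage(e))
err
```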
Thanks,
--t
So -- I can vouch for unionize() working, IFF there is enough RAM: we merged something like 100 eRRBS runs this way before realizing that joint calling and joint loading tended to be vastly faster. The unionize runs did succeed when we tested them; they were just excruciatingly slow, which is why I moved to joint VCFs and joint BEDs thereafter (it turns out you usually lose a tremendous amount of information if you just intersect).
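For reference, the per-sample path we used back then looked roughly like the sketch below. It is only a sketch: the file names are hypothetical, and whether unionize() takes the objects as separate arguments or as a list is worth confirming against ?unionize in your installed version.

```r
library(biscuiteer)

## per-sample tabixed outputs (hypothetical file names)
beds <- Sys.glob("sample_*.bed.gz")
vcfs <- Sys.glob("sample_*.vcf.gz")

## read each sample into its own bsseq object -- every one of these,
## plus the merged result, has to fit in RAM at the same time
bs_list <- mapply(function(b, v) readBiscuit(BEDfile = b, VCFfile = v),
                  beds, vcfs, SIMPLIFY = FALSE)

## union-based merge of all samples: slow, but keeps loci seen in any sample
merged <- do.call(unionize, unname(bs_list))
```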
Anyways, it would be nice to see the error message, even if my recommendation for users is not to use unionize(). If it doesn't work at all in the current build, we need to either deprecate the function or fix it -- leaving dead code in Bioconductor is against the rules (our problem, not yours).
Thanks for opening the issue and I look forward to more feedback from you & the BBC!
--t
________________________________________
From: deanpettinga
Sent: Monday, April 20, 2020 11:25 AM
Subject: Re: [trichelab/biscuiteer] Issue merging many large bsseq objects with biscuiteer::unionize() (#25)
Tim,
I suppose this isn't so much of an issue as a question, hence my lack of sessionInfo() and error message. I think the package is working as intended; I was just hoping to understand best practice when it comes to improving performance/speed.
I'll move forward with your suggestion of jointly calling variants with BISCUIT into a single VCF. Feel free to close this issue unless you'd like further info from my experience.
thanks much!
Dean
Just to clarify, I don't have any errors. I'm not used to handling objects of this magnitude in R, so I was just looking for direction regarding an optimal approach :)