Issue merging many large bsseq objects with biscuiteer::unionize() #25
Comments
This is an interesting point -- normally when I'm prepping a large biscuit/bsseq analysis, I jointly call all of the variants into one large VCF and then turn that into a merged BED for loading. In principle, it should be possible to merge everything on the fly, either by reading directly into HDF5 backing files and stacking the columns, or merging the VCFs and BEDs on the fly prior to writing to a joint HDF5 (possibly a better plan).
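Roughly, the joint-loading path looks like this. This is a minimal sketch only: the file paths are made up, and the readBiscuit() argument names should be checked against ?readBiscuit for the version you have installed.

```r
library(biscuiteer)

## jointly-called, tabixed outputs covering all samples (hypothetical paths)
joint_bed <- "all_samples_pileup.bed.gz"
joint_vcf <- "all_samples_pileup.vcf.gz"

## one pass over the joint files yields a single multi-sample bsseq object;
## 'merged' should match how the BED was produced (see ?readBiscuit), and
## some versions also expose an HDF5-style argument to keep the large
## methylation/coverage matrices on disk rather than in RAM
bs <- readBiscuit(BEDfile = joint_bed,
                  VCFfile = joint_vcf,
                  merged  = TRUE)
```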
biscuiteer::unionize() exists for exactly this purpose, but it is slow, and I would recommend against using it (although when we benchmarked it against the faster intersect-based merging approach, the union-based merge gave a clear and substantial improvement in DMR calling performance, e.g. for eRRBS-based data).
With the BioC 3.11 release coming up, we might want to stub in some functions and API-type endpoints to support this, regardless of whether they can be fully tested in time -- there are a handful of other cleanups that need to happen, so maybe now is the time.
Tim
________________________________________
From: deanpettinga
Sent: Sunday, April 19, 2020 10:41 PM
Subject: [trichelab/biscuiteer] Issue merging many large bsseq objects with biscuiteer::unionize() (#25)
Working on my first analysis w/ BISCUIT/biscuiteer, but I’ve encountered some issues handling the data. I have 20 gzip/tabix’d VCFs (15-20 GB each) with accompanying bed.gz files. Biscuiteer seems to be working just fine with small/toy datasets. However, I’ve been having issues merging all these samples into a single bsseq object. I think part of the issue is simply due to the large sample number and the amount of data for each sample. I have attempted to solve this issue with two approaches that have failed thus far:
1. biscuiteer::readBiscuit() for each sample individually and then use biscuiteer::unionize() to get a single object.
2. Merge vcf.gz and bed.gz files on the command line and then import together using biscuiteer::readBiscuit()
Do you have any advice for a better/ideal approach in this situation?
thanks in advance!
Dean,
When you report failures of this sort, could you include a sessionInfo(), the amount of RAM you have, and the error message from each failure?
A minimal reproducible example will be helpful, although I'm reasonably sure I know what's going on here and how it can be fixed.
That said, intuition is no substitute for actual data points.
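Something along these lines is plenty, using only base R; the unionize() call below is just a stand-in for whichever call actually failed on your end, and bs1/bs2 are placeholder objects.

```r
sessionInfo()   # R version, platform, and loaded package versions
gc()            # memory currently in use by the R session

## capture the exact error text from the failing step --
## substitute your real readBiscuit()/unionize() call here
err <- tryCatch(unionize(bs1, bs2),
                error = function(e) conditionMessage(e))
err
```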
Thanks,
--t
So -- I can vouch for unionize() working, IFF there is enough RAM: we merged something like 100 eRRBS runs this way before realizing that joint calling and joint loading tended to be vastly faster. The unionize runs did succeed when we tested them; they were just excruciatingly slow, which is why I moved to joint VCFs and joint BEDs thereafter (it turns out you usually lose a tremendous amount of information if you just intersect).
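For reference, the per-sample path we used back then looked roughly like the sketch below. It is only a sketch: the file names are hypothetical, and whether unionize() takes the objects as separate arguments or as a list is worth confirming against ?unionize in your installed version.

```r
library(biscuiteer)

## per-sample tabixed outputs (hypothetical file names)
beds <- Sys.glob("sample_*.bed.gz")
vcfs <- Sys.glob("sample_*.vcf.gz")

## read each sample into its own bsseq object -- every one of these,
## plus the merged result, has to fit in RAM at the same time
bs_list <- mapply(function(b, v) readBiscuit(BEDfile = b, VCFfile = v),
                  beds, vcfs, SIMPLIFY = FALSE)

## union-based merge of all samples: slow, but keeps loci seen in any sample
merged <- do.call(unionize, unname(bs_list))
```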
Anyways, it would be nice to see the error message, even if my recommendation for users is not to use unionize(). If it doesn't work at all in the current build, we need to either deprecate the function or fix it -- leaving dead code in Bioconductor is against the rules (our problem, not yours).
Thanks for opening the issue and I look forward to more feedback from you & the BBC!
--t
________________________________________
From: deanpettinga
Sent: Monday, April 20, 2020 11:25 AM
Subject: Re: [trichelab/biscuiteer] Issue merging many large bsseq objects with biscuiteer::unionize() (#25)
Tim,
I suppose this isn't so much of an issue as a question, hence my lack of sessionInfo() and error message. I think the package is working as intended; I was just hoping to understand best practice when it comes to improving performance/speed.
I'll move forward with your suggestion of jointly calling variants with BISCUIT into a single VCF. Feel free to close this issue unless you'd like further info from my experience.
thanks much!
Dean
Just to clarify, I don't have any errors. I'm not used to handling objects of this magnitude in R, so I was just looking for direction regarding an optimal approach :)