Ideas for speeding up prune_graph.pl #39
Full run command is here:
I have recently merged #31, adding another pruning script in python (prune_ngsLD.py).
Thanks Filipe. I've got the new python script running now. I'm trying it on a large contig, and while it seems to be running faster than the perl version, it has been running for 3 hours now and only pruned 3500 nodes, with 2,814,000 remaining. Here's the output so far:
If you want to speed it up further, you can split your input file by chromosome and launch each separately.
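Something like this would do the split (just a rough sketch; it assumes the ngsLD output is tab-separated with positions formatted as `chr:pos` in the first two columns, that both SNPs of a pair are always on the same chromosome, and the file name `input.ld.gz` is hypothetical):

```python
# Rough sketch: split an ngsLD output file into one file per chromosome.
# Assumes a tab-separated input with the first column holding positions
# formatted as "chr:pos"; input/output names are hypothetical.
import gzip

handles = {}  # chromosome -> open output file handle

with gzip.open("input.ld.gz", "rt") as ld:
    for line in ld:
        chrom = line.split("\t", 1)[0].split(":", 1)[0]
        if chrom not in handles:
            handles[chrom] = open(f"{chrom}.ld", "w")
        handles[chrom].write(line)

for fh in handles.values():
    fh.close()
```

Each per-chromosome file can then be fed to its own pruning job (e.g. as a job array).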
I have actually been taking this approach, splitting by linkage group. It still seems to take a long time for the longest chromosomes, however. For example, for my largest chromosome, the script has been running for over 2 days and has pruned ~90,000 SNPs, but there is still a long way to go. The last output was:
In theory, the script could be parallelized, but it would require some code refactoring. As of now, the script iterates over nodes, and my guess would be that the slowest step on each iteration is finding the heaviest node (lines 171 to 174 in ecee582).
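Roughly, that step amounts to something like the following (an illustration only, written against networkx with made-up names; not the actual prune_ngsLD.py code):

```python
# Illustration of the step described above: on every iteration, scan all
# remaining nodes for the one with the largest summed edge weight (weighted
# degree), remove it, and repeat until no edges are left.
import networkx as nx

def prune_heaviest(graph: nx.Graph) -> list:
    while graph.number_of_edges() > 0:
        # This full scan happens on *every* iteration, which is why it
        # dominates the runtime on graphs with millions of nodes.
        heaviest = max(graph.nodes, key=lambda n: graph.degree(n, weight="weight"))
        graph.remove_node(heaviest)
    return list(graph.nodes)  # the remaining, unlinked nodes
```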
We could iterate over "clusters" of linked nodes and, for each cluster, find the heaviest node. This way, we could thread the cluster part (and each thread would find the heaviest node in that cluster). @zjnolen, what do you think? Do you see a better/easier/faster way?
Hi all. I wanted to post an update. It turns out that some chromosomes were taking an inordinate amount of time to run because of a silly parsing bug in my code. I have since fixed the issue and ran an array job across LGs (12 LGs running at a time) using prune_ngsLD.py. It completed the run, trimming 9.02 million SNPs down to 6.70 million unlinked sites in a little over 5 hours. Thanks for all the help! The parallelization you describe above still sounds like a cool idea!
Great to hear that you got it working for you, @jtoy7! 😄 @fgvieira, the way you describe sounds great to me! I think it should be quite possible to implement - there's a function to pull out the largest disconnected component from the graph, and it seems to identify these rather quickly. Each component could then be pulled from the graph and pruned in its own thread from the start. The only problem I see here would be that if it starts out with no disconnected components, then it will essentially be the same as no parallelization. But a regular check could be done within the threads, and if multiple components are found, they are pulled out and added to a thread queue. I think this can be done with the multiprocessing module in python, but I've not really used it before, so my understanding of it is rather vague. I can give it a try at some point when I have some time, especially if the current version ends up still slow for some datasets.
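Something along these lines is what I have in mind (a very rough sketch using networkx and multiprocessing.Pool; the function names are made up, and the periodic re-check for components that split apart mid-run is left out):

```python
# Very rough sketch: pull each connected component out of the graph and
# prune it in its own worker process. Assumes a networkx graph with edge
# "weight" attributes; names are hypothetical.
import networkx as nx
from multiprocessing import Pool

def prune_component(component: nx.Graph) -> list:
    # Same serial "remove the heaviest node" loop, confined to one component.
    while component.number_of_edges() > 0:
        heaviest = max(component.nodes, key=lambda n: component.degree(n, weight="weight"))
        component.remove_node(heaviest)
    return list(component.nodes)

def prune_parallel(graph: nx.Graph, processes: int = 4) -> list:
    # One independent (picklable) subgraph per connected component.
    components = [graph.subgraph(c).copy() for c in nx.connected_components(graph)]
    with Pool(processes) as pool:
        kept = pool.map(prune_component, components)
    return [node for part in kept for node in part]
```

As noted above, this only pays off when there are several sizeable components; a chromosome that is one big connected cluster would still end up on a single worker.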
I just made a more efficient alternative to the pruning scripts. It is threaded, but I haven't seen it use more than 3 or 4 threads.
Hi @fgvieira, I've just gotten around to trying out prune_graph as I've gotten back into some analyses where I'm pruning again. I gave one of the larger chromosomes a try using both prune_graph and prune_ngsLD.py, and your new prune_graph method has definitely been much more memory efficient, using less than 10 GB RAM while prune_ngsLD.py was using ~55 GB for this chromosome. I do find that prune_graph has been slower than the python script, which finished this chromosome (403,147 nodes with 304,191,755 edges; 86,869,875 edges <50 kb and >0.1 r2) in 2 days 6 hours, while prune_graph is about to reach 4 days with 140,000 nodes remaining (the python one pruned to ~4,000 nodes before removing all edges). One thing I've noticed is that the threading hasn't really been working for the chromosomes I've tried, which might explain why it has been slower. I've been using
Thanks for giving it a try and comparing! Glad to hear that it is more memory efficient, but I'd also expect it to be faster. Can you let the

As for the threads, I implemented what I described above, but it only really makes a difference if you have a lot of large clusters (where it takes a lot of time to find the heaviest node). Since you have a single chromosome, it could be that (as you said) you have a single cluster throughout the run. You can give it a try with several chromosomes and see if it uses more than one thread, but most of the time it never uses more than 2 or 3 threads due to few or small clusters.
For sure, I'll give comparing all three a go on this chromosome. I usually break up my analyses into roughly equal-sized 'chunks' that are around the size of the largest chromosome, so I also have some LD files for chunks that contain multiple chromosomes but have a similar total size; those should be able to take advantage of the threads. I'll compare the three prunings on one of those as well and let you know how things go for each.
This took a little longer than expected! Our HPC has been having a rough month, so the perl runs kept getting killed before finishing. I ran this on two beagle files, one made up of one chromosome and one made up of two, with both beagles totaling roughly 225,000 SNPs. In both cases there were about 15,000,000 edges to prune after filtering. The three methods had slightly different numbers of edges after filtering (r2 >= 0.1 & maxkb <= 50000), which I suppose might be related to the precision issues that were mentioned in the other issue? I was a bit surprised, as we tested that the perl and python scripts should at least arrive at the same outputs, but maybe this larger dataset had more room for precision variations to come out. While the three scripts didn't return the same sites for this reason, I did get roughly similar outputs, so I don't find any bugs there:
Memory-wise, prune_graph really improves things:
For runtime, the multi-threading in prune_graph helped when multiple linkage groups were in the beagle, but, especially for a single linkage group, it was slower than the python script:
I think the run time differences might end up being quite dataset- and filter-setting-specific. The python script might still be useful for datasets where linkage groups are large and SNP density is high, like I seem to have here. Outside of these situations, though, I would think prune_graph would still be the most efficient both in terms of memory and runtime.
Thanks @zjnolen for the thorough comparison! Do you think you can share the first dataset? I'd like to take a closer look at where the bottleneck might be.
Yes, I expected the same, so I was surprised as well! I'll attach the beagle file for the test data here. Let me know if there's any info about the dataset or study organism I can share with you that might be useful.
I was referring to the LD output file. Can you also send it?
Sure thing, it's too large for here I think, so here's a link: |
Did some tests and, just like you found with the python script, the current parallelization is actually slower than no parallelization! Another approach would be to parallelize the calculation of the weights.
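As an illustration of what I mean (this is only a python sketch of the idea with hypothetical names, not the actual prune_graph code): fan the per-node weight sums out over a process pool and pick the heaviest node from the partial results.

```python
# Python sketch of the idea only: compute per-node summed edge weights in
# parallel chunks, then take the maximum. Assumes a networkx graph with
# "weight" edge attributes; all names here are hypothetical.
import networkx as nx
from functools import partial
from multiprocessing import Pool

def chunk_weights(nodes, graph: nx.Graph) -> dict:
    # Weighted degree (sum of incident edge weights) for one chunk of nodes.
    return {n: graph.degree(n, weight="weight") for n in nodes}

def heaviest_node(graph: nx.Graph, processes: int = 4):
    nodes = list(graph.nodes)
    size = max(1, len(nodes) // processes)
    chunks = [nodes[i:i + size] for i in range(0, len(nodes), size)]
    with Pool(processes) as pool:
        partials = pool.map(partial(chunk_weights, graph=graph), chunks)
    weights = {n: w for part in partials for n, w in part.items()}
    return max(weights, key=weights.get)
```

In a real implementation the workers would share the graph in memory rather than receive a pickled copy on every call; the sketch is just to show where the parallelism would go.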
Made a new version without the components, adding parallelization to the calculation of the node weights, and it seems to be faster. Here are the numbers I got on 10M edges:
Can you run your tests again and see if you see the same trends?
Ah ha, that would make sense then as to why it was slower! I gave it a try yesterday for both of the datasets I previously tested with. Run times before, for comparison:
And now:
So it seems to be looking much faster on my end as well! I set it to 4 threads here, though I don't see it using more than 2 or 3 on average, so I think that, like with your timings, the speed improvement will diminish after a certain number of threads. I can also send off a few jobs with other levels of threading to get an idea of how it scales on datasets of this size.
I'm wondering if you have recommendations for speeding up LD pruning with prune_graph.pl. I have been running the script for almost 3 days now, and it is progressing quite slowly. I have a large number of SNPs, but the max_kb_dist I set is pretty small (10 kb). At this point it has only trimmed 1,000 SNPs of 4.1 million, so this doesn't seem like a feasible rate. I assume that this step would not be helped much by parallelization because it is iterative in nature? I'm running it on a 256 GB, 44-core node and the memory usage seems fine. The most recent output is below. Thanks for any thoughts!