Parallelisation support for analyse #435
Conversation
That's a pity that …
Added some additional changes. The main one here is that I changed analyse to transfer data into the sub-processes via disk instead of via the network. I must admit I am a little unsure about this, as I don't know how reliable parallel file reads are, though it seems to work fine in my testing so far… I also added a …
Co-authored-by: Isaac Gravestock <[email protected]> Signed-off-by: Craig Gower-Page <[email protected]>
@gravesti - It's just dawned on me that using RDS files to load data into the parallel processes means we are limiting parallelisation to local machines only. I think (at least from the documentation) that PSOCK supports running parallel processes on remote machines, which wouldn't work in this case. The RDS loading saves roughly 3-4 seconds, so I'm not sure whether to switch back to network loading, just put a big warning in the documentation pages, or provide a toggleable option.
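To make the trade-off concrete, here is a minimal sketch of the two transfer strategies using base R's `parallel` package. This is illustrative only, not rbmi's actual internals; `big_object` and `path` are hypothetical names:

```r
library(parallel)

cl <- makePSOCKcluster(2)
big_object <- matrix(rnorm(1e6), ncol = 100)

# Option A: network transfer. Works even for remote worker nodes, but
# serialises the object over the socket connection, which can be slow.
clusterExport(cl, "big_object")

# Option B: disk transfer. Each worker reads the object from a file.
# Faster locally, but assumes all workers can see the same filesystem,
# which breaks for remote PSOCK nodes.
path <- tempfile(fileext = ".rds")
saveRDS(big_object, path)
clusterExport(cl, "path")
clusterEvalQ(cl, big_object <- readRDS(path))

stopCluster(cl)
```

Option B is what this PR currently does, hence the restriction to a single machine.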
@gowerc Have we been testing on a realistic analysis size? If so, then maybe it is OK to be limited to a single machine and just have a warning in the docs. I guess the same argument applies to not using RDS at all: accept the few seconds' penalty in exchange for less complexity, no restrictions, and no extra arguments.
Talking to Marcel and Alessandro, I think a realistic sample size would be in the order of 1000-2000, e.g. our test code is likely in the right region (20-30 seconds when run sequentially). Though I think a common use case for this will be in tipping point analyses, where the same call to `analyse()` is repeated many times.

My current assumption would be that people who are using rbmi's internal parallelisation are likely just speeding up individual runs on their local machines. If a user is getting to the stage where they need to run across different nodes (say for multiple sensitivity analyses on different sub-endpoints), they are more likely not to be using rbmi's internal parallelisation at all and instead be parallelising the different runs of rbmi externally.

Given how inefficient the internal parallelisation of rbmi is for small jobs, remote clusters may not be that useful here anyway. Then again, I fear people will just assume this functionality is fully compatible with parallel's remote PSOCK setup and run it anyway, i.e. we are breaking that interface...

I think on balance I'm inclined to leave it as is and add a warning note in the documentation.
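For reference, "parallelising externally to rbmi" could look something like the following sketch, distributing tipping point deltas across workers with `parLapply`. Everything here is a hypothetical placeholder (`run_analysis` is not an rbmi function), just to show the shape of the approach:

```r
library(parallel)

deltas <- seq(0, 5, by = 0.5)

# Hypothetical wrapper: runs one full (sequential) rbmi analysis for a
# given delta adjustment and returns the estimate of interest. The real
# body would call rbmi's analyse/pool functions.
run_analysis <- function(delta) {
  delta  # placeholder return value
}

# Each worker handles a different delta, so each rbmi run stays
# sequential and no data needs to be shared mid-analysis.
cl <- makePSOCKcluster(4)
results <- parLapply(cl, deltas, run_analysis)
stopCluster(cl)
```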
@gowerc Will you merge the lintr branch first? Do you still need to update the docs in this PR? |
@gravesti - Have merged in the lintr PR and have added an additional warning regarding the PSOCK network limitation.
Thanks for all the effort on this one @gowerc !
Urgh, just seen there's a bunch of NOTEs that I need to resolve:
Ok, I think that last commit should address the NOTE.
Closes #370

Some general notes:

- I didn't end up using `future()` in the end. I just ran into edge case after edge case that made it too difficult to use and test.
- Added `make_rbmi_cluster()` to make the cluster setup process as smooth as possible.
- For `lm`-based analyses, currently I am only seeing a 30% gain, and only when using large samples (e.g. >2000); below that it often takes longer to run in parallel. This is mostly due to the IO exchange by the looks of it. R's PSOCK clusters just transfer data exceedingly slowly :(
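A rough usage sketch of how the helper might be called. The argument names and the `impute_obj` input are assumptions based on this thread, not checked against the final API, so consult the package documentation for the actual signatures:

```r
library(rbmi)

# Spin up a local cluster via the new helper, pass it to analyse(),
# then shut it down when finished.
cl <- make_rbmi_cluster(2)
ana_obj <- analyse(impute_obj, fun = ancova, ncores = cl)
parallel::stopCluster(cl)
```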