Network size exceeds the DRAM capacity and program gets killed when exporting the network with nrnbbcore_write #943
Comments
This is the original use case for CoreNEURON (i.e. the model is too large for NEURON to build all at once). CoreNEURON requires 7-fold less memory than NEURON for large models. At least that was the case a few years ago; since then most of the effort has gone into performance improvements. @pramodk can speak to the most current memory usage results. Anyway, the strategy is to have NEURON build a sequence of model subsets: generate the files for each subset, destroy the subset, and go on to the next subset in the sequence. It is up to you how many subsets to divide the model into. On a parallel machine, setup efficiency is best if the model is divided into at least nhost subsets, and load balance may be best served if the number of subsets is a multiple of nhost. This is a fairly straightforward NEURON programming problem, as most parallel models are already cell gid based in terms of distribution on the machine, and a process generally only creates a model subset based on its list of gids. Whether you create subsets the size of a single cell or a million cells is up to you and your memory resources. The only issue that is a little out of the ordinary is the destruction of the model after writing its files. The key is to first release all the gids with pc.gid_clear(), then destroy the netcons, then the cells. |
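As a rough illustration of the subset bookkeeping described above, here is a minimal pure-Python sketch. The function name `partition_gids` and the `subsets_per_host` parameter are hypothetical and not part of NEURON. It assumes gids are the integers 0..ngid-1 and distributes them round-robin, so the number of subsets is a multiple of nhost.

```python
def partition_gids(ngid, nhost, subsets_per_host=1):
    """Split gids 0..ngid-1 into nhost * subsets_per_host round-robin subsets.

    Returns a list indexed by rank; each entry is the list of gid subsets
    that rank would build, write, and destroy one after another.
    (Hypothetical helper for illustration; not a NEURON API.)
    """
    nsubset = nhost * subsets_per_host
    # Subset i holds gids i, i + nsubset, i + 2*nsubset, ...
    subsets = [list(range(i, ngid, nsubset)) for i in range(nsubset)]
    # Subset i is assigned to rank i % nhost.
    return [subsets[rank::nhost] for rank in range(nhost)]


if __name__ == "__main__":
    for rank, work in enumerate(partition_gids(10, nhost=2, subsets_per_host=2)):
        print(rank, work)
```

Raising `subsets_per_host` lowers the peak memory per process, since fewer cells exist at once, at the cost of more build and write passes.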
@HolyLow : before going into details, just a question: if you have multiple machines, you can run NEURON on multiple machines to generate the model and then run CoreNEURON on the same number of machines. Is this how you are running now? From the wording I got the impression that you run NEURON on a single machine and then run CoreNEURON on a single machine or multiple machines. If you could clarify this, that would be helpful. |
@pramodk Yes, currently I am running NEURON on a single machine and CoreNEURON on multiple machines, for some reason. So are you suggesting that the NEURON export procedure could also be carried out on multiple machines, and that if I did so, the memory problem could be solved? |
Yes. Are you running NEURON with MPI already or just threads? Like CoreNEURON, you can also run NEURON on multiple compute nodes / machines, and then there will be more memory available to finish the model building step.
|
Is it the case that your model setup on an MPI cluster does not need global collective communication? I.e. can one even envision building each subset of the model, writing the files, and destroying the subset, without requiring that the entire model exist at once? Anyway, one strategy is
I did not execute this, so there may be syntax errors, but the idea is sound. I need to follow through with a complete example for the ringtest or some other standard example model to be sure I got it right. |
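The build, write, destroy loop discussed in this thread can be sketched in plain Python as follows. This is not the code from the comment above, only a minimal illustration under stated assumptions: `export_in_subsets`, `destroy_model`, and the `build`, `write`, and `gid_clear` callbacks are hypothetical names, and the model is represented as a dict of Python lists. In a real script the callbacks would wrap the NEURON calls named in the thread (`nrnbbcore_write` and `pc.gid_clear()`).

```python
def destroy_model(model, gid_clear):
    """Tear down a subset in the order given in the thread."""
    gid_clear()               # assumption: wraps pc.gid_clear() in real use
    model["netcons"].clear()  # then drop the netcons
    model["cells"].clear()    # then drop the cells


def export_in_subsets(subsets, build, write, destroy):
    """Build, write, and destroy one gid subset at a time.

    Only one subset exists in memory at any moment.
    Returns the number of subsets processed.
    """
    count = 0
    for gids in subsets:
        model = build(gids)   # hypothetical: create cells/netcons for these gids only
        write(model)          # hypothetical: wraps the nrnbbcore_write call
        destroy(model)        # free the subset before building the next one
        count += 1
    return count


if __name__ == "__main__":
    # Stand-in callbacks that only record what would happen.
    def build(gids):
        return {"cells": list(gids), "netcons": [object() for _ in gids]}

    def write(model):
        print("writing", len(model["cells"]), "cells")

    def destroy(model):
        destroy_model(model, gid_clear=lambda: None)

    export_in_subsets([[0, 2], [1, 3]], build, write, destroy)
```

Passing the steps in as callbacks keeps the loop independent of how a particular model creates its cells, which is the part that differs between models.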
@nrnhines do we still need to merge neuronsimulator/ringtest#18 ? Looks like this issue can be closed following #964. |
I believe we do. https://github.com/neuronsimulator/ringtest/pull/18/files has test_submodel.py, whose line 39 makes use of #964. |
I am trying to export a large network with nrnbbcore_write, but the program gets killed because it requires more memory than the machine's DRAM can provide.
So if the network grows so large that it cannot be generated on a single machine, what should I do to support such a large network with NEURON?
In the simulation phase, I can use CoreNEURON to distribute the simulation across a number of machines. But in the network export phase (with nrnbbcore_write), is it possible to distribute the export procedure across different machines? How could I do that?