
electronWannierTransport calculations crash #225

Open
SiyuChen opened this issue Oct 5, 2024 · 6 comments

SiyuChen commented Oct 5, 2024

Hi, I am doing an electron transport calculation with Phoebe, but unfortunately it crashed.

The following is the screen output before it crashed:

Started parsing of el-ph interaction.
Allocating 0.08497238 (GB) (per MPI process) for the el-ph coupling matrix.
Finished parsing of el-ph interaction.

Computing electronic band structure.

Statistical parameters for the calculation
Fermi level: 8.30632019 (eV)
Index, temperature, chemical potential, doping concentration
iCalc = 0, T = 150.000000 (K), mu = 8.431200 (eV), n = -2.680797e+21 (cm^-3)

Applying a population window discarding states with df/dT < 1.000000e-10.
Window selection reduced electronic band structure from 1755000 to 421334 states.
Symmetries reduced electronic band structure from 421334 to 107650 states.
Done computing electronic band structure.

Snapshot of Phoebe's memory usage:
VM: 106.0459 (GB). RSS: 51.3745 (GB)

Computing phonon band structure.
Allocating 0.0981 GB (per MPI process).

After this, it crashed, throwing the following error:

terminate called after throwing an instance of 'std::length_error'
what(): vector::_M_fill_insert
terminate called after throwing an instance of 'std::length_error'
what(): vector::_M_fill_insert

Do you have any clue regarding this error?

@jcoulter12
Collaborator

Hi Siyu,

Glad to hear you're using the code. Let's see if we can get to the bottom of this.

First, can you confirm that you have the most recent version of the code (just run git pull to be sure), and give me a little info about the resources you used to run this calculation?

Also, does this happen if you use a smaller k-point mesh? I just want to make sure nothing is overflowing or running out of memory first.

Thanks,
Jenny

jcoulter12 self-assigned this Oct 5, 2024

SiyuChen commented Oct 6, 2024

Hi Jenny

Thank you! I confirm that I am using the most recent version of the code.

git log
commit 9b667516b70f1baaad33dce1ae86acc884a225a2 (HEAD -> develop, origin/develop, origin/HEAD)
Merge: 4f0e6195 0a85a21f
Author: Jenny Coulter <[email protected]>
Date:   Mon Aug 12 17:31:17 2024 -0400

    Merge pull request #221 from mir-group/sgplibCMakeFix

    Update spglib to use FetchContent in CMake

I also confirm that the calculation can finish properly if a smaller k-mesh is used. For example, with kMesh = [15,27,39], the job can be done using 5 compute nodes (more specifically, 15 MPI processes x 18 OpenMP threads). I am using a cluster in which each node has 56 CPUs and 384 GiB of RAM, that is, 6840 MB per CPU.

However, with kMesh = [20,36,52], Phoebe always crashes with the above-mentioned error, no matter how much memory I give it. I have tried launching the job with 10 nodes (10 MPI processes x 56 OpenMP threads, maximizing the memory available to each MPI process), but it still does not work.

Happy to provide more information if needed.

Best wishes
Siyu

jcoulter12 added the bug label Oct 6, 2024
@jcoulter12
Collaborator

Hi Siyu,

Indeed, the code should be able to scale that far in kMesh (as well as pretty far beyond that -- I've been able to run kMesh = [350,350,350] for some materials). The memory footprint of your job also doesn't seem very large.

I have some ideas about what might be happening, and likely it's going to be a super minor fix on my part. I can probably fix this in the next day or so.

I'm sure you don't want to share your data broadly, but if you are willing to let me look at your files, we can communicate by email. My email address is listed on my GitHub account page under my name and photo -- please write and I'll provide a place for you to upload the data.

Thanks for reporting this, we appreciate it when users let us know about these things.
Jenny

@jcoulter12
Collaborator

Hi Siyu,

Would you mind checking out the branch named activeBandsVelocitiesOverflowBug?
You can do this by going to your phoebe directory and typing:

git pull
git checkout activeBandsVelocitiesOverflowBug
cd build
make phoebe

(The last two lines rebuild Phoebe in your build directory.) Let me know if this somehow does not fix your issue, but I was able to reproduce the crash and confirm that it is fixed on my machine.

Basically, it was as I suspected -- your system is so big that you managed to overflow an integer variable storing the number of band velocities for the phonon band structure :). I just had to change it from int to size_t.
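
For the curious, here is a minimal sketch of the arithmetic (the indexing layout below is my assumption for illustration, not a copy of Phoebe's actual code):

#include <climits>
#include <cstdio>

// Rough illustration of the overflow: the number of phonon velocity entries
// in this calculation exceeds what a 32-bit int can hold. The indexing
// (q-point, band, band, Cartesian direction) is an assumed layout.
int main() {
  long long numQPoints = 426465; // q-points in this calculation
  long long numBands   = 180;    // phonon bands in this calculation

  long long numEntries = numQPoints * numBands * numBands * 3; // ~4.1e10

  std::printf("entries = %lld\n", numEntries);
  std::printf("INT_MAX = %d\n", INT_MAX); // 2147483647
  std::printf("overflows int? %s\n", numEntries > INT_MAX ? "yes" : "no");
  // Storing this count (or an index derived from it) in a plain int wraps
  // around, which is why the variable was widened to size_t.
  return 0;
}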

There is a chance you could encounter some other such error just because you have so many phonon bands. Let me know if something else fails; usually these issues are quite fast to find (and ideally fix).

Best,
Jenny


SiyuChen commented Oct 9, 2024

Hi Jenny,

I have done what you suggested. However, Phoebe now throws segmentation faults:

srun: error: cpu-p-252: task 10: Segmentation fault
srun: error: cpu-p-252: task 9: Segmentation fault
srun: error: cpu-p-252: task 11: Segmentation fault
srun: error: cpu-p-251: task 8: Segmentation fault
srun: error: cpu-p-597: task 12: Segmentation fault

Does your Phoebe output show "started computing scattering matrix"? My screen output still gets stuck at "Computing phonon band structure. Allocating 0.0502 GB (per MPI process)."

@jcoulter12
Collaborator

Hi Siyu,

Ok, I was able to reproduce this -- for me, it's not a seg fault but a very reasonable out of memory error.
This is because you have 180 phonon bands and 426465 q-points, and each group velocity component (3 Cartesian directions) is a complex number (16 bytes); storing all of the band velocities therefore requires allocating a container of about 663 gigabytes.
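
As a rough sanity check on that number (assuming, for illustration only, that the container holds a full numBands x numBands complex velocity matrix per q-point -- this may not be exactly how Phoebe lays it out):

#include <complex>
#include <cstdio>

// Back-of-the-envelope estimate reproducing the ~663 GB figure above.
int main() {
  double numQPoints    = 426465.0;
  double numBands      = 180.0;
  double bytesPerEntry = sizeof(std::complex<double>); // 16 bytes

  double bytes = numQPoints * numBands * numBands * 3.0 * bytesPerEntry;
  std::printf("~%.0f GB\n", bytes / 1e9); // ~663 GB
  return 0;
}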

This takes a bit more work to get around, but it could possibly be done. A fast workaround may be to reduce the size of the population window, like this:

windowType = "population"
windowPopulationLimit = 1e-6

However, this can be dangerous if one wants to use the Wigner correction, since for that, contributions can come from states far from the Fermi energy (I think this is noted in the tutorial as well), and in general one should then also converge the result with respect to the window population limit. It would still give you an idea of whether the RTA result is already converged here.

I think there is a workaround to this that I've wanted to implement anyway. Let me investigate the difficulty of that change and get back to you in ~ a day.

Jenny
