Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a VASP handler (warning) for too high NBANDS #324

Merged
merged 16 commits into from
Jun 24, 2024

Conversation

Andrew-S-Rosen
Copy link
Member

@Andrew-S-Rosen Andrew-S-Rosen commented Mar 12, 2024

Summary

Closes #224.

If NBANDS is set to an unphysically high value, the resulting electronic structure properties can be completely erroneous. This is becoming more of a problem lately because VASP will automatically set NBANDS to the number of cores on a machine for small systems, and machines nowadays can have many cores (Perlmutter has 128 cores per node). Running a molecule like CO with 128 NBANDS is a major problem.

This handler checks to see if VASP automatically changed NBANDS due to parallelization. If it did, then it checks if NBANDS > 2 times the NELECT value. If so, we raise a warning. Unfortunately, we can't do much more than that because VASP will override the INCAR even if the user manually specifies NBANDS. The solution is for the user to rerun with fewer cores.

 -----------------------------------------------------------------------------
|                                                                             |
|           W    W    AA    RRRRR   N    N  II  N    N   GGGG   !!!           |
|           W    W   A  A   R    R  NN   N  II  NN   N  G    G  !!!           |
|           W    W  A    A  R    R  N N  N  II  N N  N  G       !!!           |
|           W WW W  AAAAAA  RRRRR   N  N N  II  N  N N  G  GGG   !            |
|           WW  WW  A    A  R   R   N   NN  II  N   NN  G    G                |
|           W    W  A    A  R    R  N    N  II  N    N   GGGG   !!!           |
|                                                                             |
|     The number of bands has been changed from the values supplied in        |
|     the INCAR file. This is a result of running the parallel version.       |
|     The orbitals not found in the WAVECAR file will be initialized with     |
|     random numbers, which is usually adequate. For correlated               |
|     calculations, however, you should redo the groundstate calculation.     |
|     I found NBANDS = 4. Now, NBANDS = 64.                                   |
|                                                                             |
 -----------------------------------------------------------------------------

@yang-ruoxi
Copy link
Member

thanks @Andrew-S-Rosen ! There are a few things I would like to comment:

  1. "Large number of NBANDS causes erroneous results": Is this true? Adding empty orbitals in principle shouldn't matter. From what I understood, it is the opposite, where a lot of empty bands are needed to obtain converged results for many-body effects. The convergence for sure will be slow.

  2. The problem with CO here is that, before mentioning the unphysical NBANDS, is running a small molecule with too many cores (128!) for only 10 valence electrons. If one is using the default NCORE = 1, i.e. 1 core per orbital, it forces VASP to use 128 bands, that is over-parallelization resulting in inefficiency.

  3. NBANDS are not only dependent on NELECT, but also the total number of cores, and NCORE.

  4. Since NBANDS and NCORE are closely tied, a sanity check would be checking how many electrons are in the system, and hence the VASP command with respect to NELECT first. It ensure a sensible number of cores to run VASP with. That way, setting NCORE to a default value (1 or 2) would not lead NBANDS to explode.

@Andrew-S-Rosen
Copy link
Member Author

Andrew-S-Rosen commented Mar 12, 2024

Thanks for your input, @yang-ruoxi! All of your points are valid. That said, I'm still trying to figure out the best way to handle this in Custodian. Custodian can't handle anything about compute architecture, but we still want to make sure users are at the very least notified of potentially spurious results. I elaborate a bit more below.

"Large number of NBANDS causes erroneous results": Is this true? Adding empty orbitals in principle shouldn't matter. From what I understood, it is the opposite, where a lot of empty bands are needed to obtain converged results for many-body effects. The convergence for sure will be slow.

I will take some time to reproduce what I observed many months ago, but in short, I did observe erroneous behavior at times. In the CO example, there would be many spurious states in the DOS that would appear if NBANDS was >> NELECT. I believe @mkhorton had also observed some odd behavior in the past, but he'll have to chime in about that.

@esoteric-ephemera mentioned it might be a combination of too high NBANDS with not a high enough ENCUT. I'll explore that avenue as well.

The problem with CO here is that, before mentioning the unphysical NBANDS, is running a small molecule with too many cores (128!) for only 10 valence electrons. If one is using the default NCORE = 1, i.e. 1 core per orbital, it forces VASP to use 128 bands, that is over-parallelization resulting in inefficiency.

Yes, it is clear in this scenario that such calculations would be overly parallelized. Unfortunately, this is pretty common in high-throughput campaigns though because if you blindly run a bunch of calculations with a full node, you might do so on a small system inadvertently. Inefficiency is really a user problem though, so I'm not concerned about that from Custodian's perspective. Ignoring the NBANDS business, I don't think running with too many cores influences the DOS at all. It just makes the calculation slow.

NBANDS are not only dependent on NELECT, but also the total number of cores, and NCORE.

Yes. This is why I have checked the OUTCAR for the reported NBANDS value, which is always the one used by VASP (to the best of my knowledge). This should inherently take into account all of the factors you describe above. @mkhorton proposed doing a check to see if NBANDS > 2*NELECT simply as a rule-of-thumb for being "too high", especially in the scenario where VASP has automatically modified NBANDS.

Since NBANDS and NCORE are closely tied, a sanity check would be checking how many electrons are in the system, and hence the VASP command with respect to NELECT first. It ensure a sensible number of cores to run VASP with. That way, setting NCORE to a default value (1 or 2) would not lead NBANDS to explode.

I agree that the ideal approach would be for the user to run with fewer cores. However, Custodian doesn't have the ability to modify this in any clean way. That's why I have simply raised a warning.

@Andrew-S-Rosen
Copy link
Member Author

@yang-ruoxi: Do you think raising NCORE in such a scenario might be an appropriate workaround? We could fetch the number of compute cores by parsing the OUTCAR file.

@yang-ruoxi
Copy link
Member

@Andrew-S-Rosen, I see. It would be good to know what triggers the odd behaviors with high NBANDS to know what exactly needs to be done. But in reality it's hard to exhaust all possible scenarios, so warning is fine before it is pined down.
And true, in the scope of custodian, it's hard to fine tune the vasp_cmd , so I suppose the considerations I mentioned are relevant but would apply outside of Custodian.
I agree a workaround can be raising the NCORE and checking the OUTCAR, but I fear it would be an over correction in the scenario where high NBANDS wouldn't matter too much to the results. In cases we feel confident it should be corrected though (e.g. wrong combination of NBANDS + ENCUT), this could be a feasible solution.

@Andrew-S-Rosen
Copy link
Member Author

Thanks for the input, @yang-ruoxi! That all makes sense. I'll go ahead and try to reproduce the spurious behavior with too high NBANDS, and we can take it from there. You're right that it'll be worthwhile to dig into the underlying cause.

@JaGeo
Copy link
Member

JaGeo commented Mar 12, 2024

I have also seen problems with this: convergence issues in general and also weird VASP failures.

@Andrew-S-Rosen Andrew-S-Rosen changed the title [WIP] Add a VASP handler for too high NBANDS Add a VASP handler (warning) for too high NBANDS May 21, 2024
@Andrew-S-Rosen
Copy link
Member Author

Andrew-S-Rosen commented May 21, 2024

Since it's not immediately clear to me the best way to correct such a calculation and since @yang-ruoxi has given me the blessing of "warning is fine before it is pined down", I'll call this PR ready to go. It doesn't do anything other than raise a warning if your number of bands is 2x higher than your number of electrons (unlikely to happen intentionally; usually only happens when running a small system with a large number of cores in a high-throughput campaign). If people see spurious warnings in their calculations, we can easily adjust this. In my view, this PR will basically do no harm and may potentially save someone a bit of headache.

@esoteric-ephemera
Copy link
Contributor

It might be possible to correct this with custodian but a warning is probably more than enough. You'd have to read off the system info from OUTCAR (whether it's CPU or GPU, how many MPI ranks are used, how many OpenMP threads are used), and then try to modify the combination of {NCORE or NPAR, KPAR, NBANDS} on CPU builds and {NSIM, KPAR, NBANDS} on GPU builds. But playing around with parallelization can introduce other unexpected errors

A warning might be too low visibility but we can also add this to something like the pymatgen.io.validation add-on

@Andrew-S-Rosen
Copy link
Member Author

Thanks for the input, @esoteric-ephemera. That was the exact conclusion I came to. It probably can be done, but I did not want to bother dealing with the messiness of parallelization. If someone cares more about this, I encourage them to give it a go 😅

Perhaps it will be too low visibility, but low is better than none! Not a super satisfying conclusion, but I unfortunately don't have the time to put into giving this a true fix. External validation could be interesting too.

@Andrew-S-Rosen
Copy link
Member Author

@janosh: While we're on it, this one is ready to go. It is a very painless PR. It just raises a warning.

@shyuep shyuep merged commit 2412b87 into materialsproject:master Jun 24, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Introduce a rule for too high NBANDS
5 participants