
Training on GPU? #26

Closed
alikhamze opened this issue Sep 19, 2023 · 6 comments

@alikhamze

Hello,

I am testing DeePTB on a set of ~5000 structures. I copied an input file from the first step of the BN example and made only small changes for my dataset (paths, atomic species). (I know this set of parameters is not great; for now I am just trying to make sure the code runs on my dataset--it already worked on the example.)

Because of the dataset size, training on the CPU (the default) is very slow, so I tried training on the GPU by adding "device": "cuda" to the input JSON under "common_options".
I then train using dptb train -sk input.json -o ./first > log_first 2>&1 &.
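
For clarity, the change amounts to the following (a minimal sketch using Python's json module; in practice it is a one-line edit to input.json, and the rest of the file is identical to the BN example):

import json

# Sketch of the only change relative to the BN example's input file:
# add "device": "cuda" under "common_options".
with open("input.json") as f:
    cfg = json.load(f)

cfg.setdefault("common_options", {})["device"] = "cuda"

with open("input.json", "w") as f:
    json.dump(cfg, f, indent=4)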

When I do so, DeePTB crashes with the error RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Here's the full traceback:

Traceback (most recent call last):
  File "/home/ali/venv3.10-deeptb/bin/dptb", line 8, in <module>
    sys.exit(main())
  File "/home/ali/venv3.10-deeptb/lib/python3.10/site-packages/dptb/entrypoints/main.py", line 317, in main
    train(**dict_args)
  File "/home/ali/venv3.10-deeptb/lib/python3.10/site-packages/dptb/entrypoints/train.py", line 276, in train
    trainer.run(trainer.num_epoch)
  File "/home/ali/venv3.10-deeptb/lib/python3.10/site-packages/dptb/nnops/base_trainer.py", line 52, in run
    self.train()
  File "/home/ali/venv3.10-deeptb/lib/python3.10/site-packages/dptb/nnops/train_nnsk.py", line 245, in train
    self.optimizer.step(closure)
  File "/home/ali/venv3.10-deeptb/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/ali/venv3.10-deeptb/lib/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/home/ali/venv3.10-deeptb/lib/python3.10/site-packages/torch/optim/optimizer.py", line 33, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/ali/venv3.10-deeptb/lib/python3.10/site-packages/torch/optim/adam.py", line 121, in step
    loss = closure()
  File "/home/ali/venv3.10-deeptb/lib/python3.10/site-packages/dptb/nnops/train_nnsk.py", line 229, in closure
    pred, label = self.calc(*data, decompose=self.decompose)
  File "/home/ali/venv3.10-deeptb/lib/python3.10/site-packages/dptb/nnops/train_nnsk.py", line 179, in calc
    self.hamileig.get_hs_blocks(bonds_onsite=bond_onsites,
  File "/home/ali/venv3.10-deeptb/lib/python3.10/site-packages/dptb/hamiltonian/hamil_eig_sk_crt.py", line 276, in get_hs_blocks
    onsiteH, onsiteS, bonds_onsite = self.get_hs_onsite(bonds_onsite=bonds_onsite, onsite_envs=onsite_envs)
  File "/home/ali/venv3.10-deeptb/lib/python3.10/site-packages/dptb/hamiltonian/hamil_eig_sk_crt.py", line 161, in get_hs_onsite
    sub_hamil_block[ist:ist+norbi, ist:ist+norbi] = th.eye(norbi, dtype=self.dtype, device=self.device) * self.onsiteEs[ib][indx]
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
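
For what it's worth, the message itself is the generic PyTorch complaint raised whenever an operation mixes a CUDA tensor with a CPU tensor; a minimal sketch that reproduces the same error (plain PyTorch, nothing DeePTB-specific, assuming a CUDA device is available):

import torch

# The failing line in the traceback multiplies a CUDA identity matrix by a
# tensor that is still on the CPU; the same situation is reproduced here.
a = torch.eye(3, device="cuda")  # on cuda:0
b = torch.ones(3)                # on the CPU
c = a * b                        # RuntimeError: Expected all tensors to be on the same device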

Here is my full input file (renamed to .txt because GitHub does not allow me to upload .json):
input.txt

Is there an error in my input file? Or is the way to train on a GPU different from the one I tried?

Thanks!
-Ali

@floatingCatty
Member

Hello, thanks for bringing this to our attention.

A quick answer is: the cuda option is not supported in the current version.

Here are the reasons:
GPU acceleration of DeePTB requires some additional parallelization over the dataset. We tried simply moving all the calculations onto the GPU by brute force, but without that additional optimization the GPU does not bring a significant speed-up. (If you want this feature, I can restore the GPU support without data parallelization for testing.)

We are currently working on the parallelized GPU acceleration, which will be released very shortly.

Here are some suggestions for current usage:

Usually, training a system does not need that many configurations, thanks to the strong generalization capacity of symmetry-preserving neural networks and the SKTB formalism. You can randomly pick a few tens of configurations and try the training to see whether the speed meets your needs.

Besides, what kind of system are you working on? It would be helpful if you could share the systems and settings, thanks!

@alikhamze
Author

Hello, thank you for your prompt reply, and apologies for my late one; I had a few things to take care of before I could try this again. I have made changes based on your feedback, and below I will describe them and ask some follow-up questions.
Before that, however, I'm sorry for clogging up your issues with this--perhaps this discussion would be better suited to a discussion board on GitHub? I don't see a link to one on this project, though.

I am looking forward to the release of the GPU version; thank you for letting me know it is not currently supported. I also appreciate the tip about the efficiency of training DeePTB--I have more background in training force fields, where more data is usually better!

I am interested in systems with multiple possible coordinations, so I am especially interested in the environmental correction. For my more recent tests, I am trying BN because it can be sp2 (as in h-BN) or sp3 (as in c-BN), because I already have some data for it, and because it was used in your example.
I have trained on a single unit cell each of h-BN and c-BN with third-nearest neighbors included, largely following the tutorial but with my added c-BN data. The training took some time but was reasonably fast. The model accuracy at this stage is not very good, but I believe that is because the environment correction is not yet included--does that seem like a reasonable explanation, given that I'm training on both h-BN and c-BN?

I then added the environment correction (using the same dataset) following the silicon example, which seems to run correctly, but when I try to calculate the band structure from the resulting model, I get this error (missing keys omitted for brevity):

Missing key(s) in state_dict: [[long list of omitted keys here]]
Unexpected key(s) in state_dict: "hopping_net.layer1", "hopping_net.layer2", "onsite_net.layer1", "onsite_net.layer2".
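
For reference, a generic way to see which parameter names a checkpoint actually contains is to load it with plain PyTorch and print the keys (the file name below is a placeholder, not a DeePTB convention):

import torch

# List the parameter names stored in a checkpoint (generic PyTorch inspection;
# the path is a placeholder). Some checkpoints wrap the state dict under an
# extra key, so unwrap it if present.
ckpt = torch.load("checkpoint.pth", map_location="cpu")
state = ckpt.get("model_state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
for name in state:
    print(name)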

Do you have any suggestions for how to successfully train the model with the environment correction?
I am also wondering if you have any suggestions for speeding up the CPU training besides reducing the dataset size.
Finally, is there an example of the TBPLaS API anywhere? I would like to try that as well, but did not see one included in the current repo.

This is a very exciting project, thank you for sharing it and your help with it!

@floatingCatty
Member

Thank you for the feedback; we are delighted to help. When using nnsk (the model without environmental correction), the accuracy depends on the following factors: 1. the dataset you are fitting, 2. the fitting procedure, and 3. the number of neighbours included and the onsite corrections applied.

  1. For the first factor: to get a good starting point for the fit, the model should first be trained on the perfect crystal structure; it can then transfer to more complex scenarios such as MD structures or strained structures. Could you give more detail about the h-BN and c-BN dataset in your example? How many frames does it have, and how were they generated?

  2. The fitting procedure plays a vital role in model training. One reason is that the model has too many degrees of freedom when third-nearest neighbours are enabled together with onsite corrections such as uniform or strain, and directly fitting with this setting often results in overfitting. In this case, the general procedure would be: a. start training with only the first-nearest neighbours and the "none" onsite correction; b. add the "uniform" or "strain" onsite correction and train by reloading the previous checkpoint; c. increase the bond cutoff to include more neighbours and improve accuracy (see the sketch after this list). Usually step a gives a general fit with the correct shape and correspondence of bands, and step b already gives a pretty good fit. You can try the fitting following these steps. You can also send the data to us via email or here on GitHub; we are more than happy to help.

  3. About the potential bug in the environmental correction, we are not sure whether it is a bug or a problem with the configuration file. If possible, could you share the checkpoint and the input configuration file with us so we can reproduce the error? That would be most helpful for addressing this problem.
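
As a rough illustration of the staged settings in step 2, here is a sketch of how the relevant options change from stage to stage. The key names ("onsitemode", "bond_cutoff"), their placement in the input file, and the cutoff values are assumptions based on the hBN/silicon examples; please check the example inputs for the exact names in your version:

# Staged nnsk settings (illustrative only; key names and values are guesses
# based on the examples, not a verified DeePTB schema).
stage_a = {"onsitemode": "none",   "bond_cutoff": 1.6}  # a: first-nearest neighbours, no onsite correction
stage_b = {"onsitemode": "strain", "bond_cutoff": 1.6}  # b: add the onsite correction, reload the stage-a checkpoint
stage_c = {"onsitemode": "strain", "bond_cutoff": 3.2}  # c: enlarge the cutoff to include more neighbours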

About speeding up: you could try reducing the batch size in the current version; we often use 1. The gradient of each optimization step is computed over the batch, so reducing the batch size also reduces the computational cost of a single optimization step. Meanwhile, you can reduce the number of k points (and the corresponding eigenvalues) in the data file: a sparser k path or k mesh means fewer eigendecompositions, which also reduces the cost.
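
For example, a generic way to thin out a reference k path before building the training data (plain NumPy; the file names and array shapes are placeholders, not DeePTB's data format):

import numpy as np

# Keep every other k point along the path, together with the matching rows of
# reference eigenvalues. File names and array layout are placeholders only.
kpoints = np.load("kpoints.npy")          # hypothetical, shape (n_k, 3)
eigenvalues = np.load("eigenvalues.npy")  # hypothetical, shape (n_k, n_bands)
np.save("kpoints_sparse.npy", kpoints[::2])
np.save("eigenvalues_sparse.npy", eigenvalues[::2])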

The TBPLaS API is not yet included in the current version, but it will be released shortly. We are also happy to send you the script for using TBPLaS with DeePTB's output Hamiltonian; please tell us if you need it.

Thank you for using this software and for your valuable feedback. Please feel free to contact us if further help is needed. I have also opened the Discussions section for convenience.

@alikhamze
Author

Thank you for the detailed reply! I'm going to close this issue and will move any future questions to the Discussions section--thank you for opening it!

I would really appreciate it if you could share the TBPLaS script! That feature is one of the most exciting parts of this project for me, so being able to test it would be very helpful in deciding whether to devote more time to this package. Perhaps the easiest way would be to share it with me in a separate repo? I could also share my email if that's preferable for you; please let me know.

For now, I am following your recommended training procedure and trying several models: one for crystalline h-BN (which should work as in the example), one for crystalline c-BN, and a third model for both. I will also try the environmental correction for these 3 configurations after they finish training, and if I encounter the same error as before, I will open a question in the discussion section with my inputs and training data.

Thank you again!

@alikhamze
Author

Hi there, I just wanted to follow up about the TBPLaS script, @floatingCatty!

I also wanted to let you know the issue with the environmental correction was an error on my part and it is working now.

@floatingCatty
Member

tbplas.zip
Hello!

This is a brief guide on how to use the TBPLaS interfaces with the current DeePTB code. Please feel free to comment if further help is needed.

We are glad to know the environmental correction is working now! Thank you very much for using our code and for the valuable advice!
