Training on GPU? #26
Comments
Hello, thanks for bringing this to our attention. The quick answer is: the cuda option is not supported in the current version. The reason is that we are still working on the parallelized GPU acceleration, which will be released very shortly. Some suggestions for current usage: training a system usually does not need that many configurations, thanks to the strong generalization capacity of symmetry-preserving neural networks and the SKTB formula. You can randomly pick tens of configurations and try the training to see if the speed meets your demands. Also, what kind of system are you working on? It would be helpful if you could share the systems and settings, thanks!
Hello, thank you for your prompt reply and apologies for my late one; I had a few things to take care of before I could try this again. I have made changes based on your feedback, which I describe below along with some follow-up questions. I am looking forward to the release of the GPU version; thank you for letting me know it is not currently supported. I also appreciate the tip about the efficiency of training DeePTB; I have more background in training force fields, where more data is usually better! I am interested in systems with multiple possible coordinations, so I am especially interested in the environmental correction. For my more recent tests, I am trying BN because it can be sp2 (as in h-BN) or sp3 (as in c-BN), because I already have some data for it, and because it was used in your example. I then added the environmental correction (using the same dataset) following the silicon example, which seems to run correctly, but when I try to calculate the band structure from the resulting model, I get this error (missing keys omitted for brevity):
Do you have any suggestions for how to successfully train the model with the environmental correction? This is a very exciting project; thank you for sharing it and for your help with it!
Thank you for the feedback; we are glad to help with your questions. When using nnsk (the model without environmental correction), the accuracy depends on the following factors: (1) the dataset you are fitting, (2) the fitting procedure, and (3) the number of neighbours included and the onsite corrections applied.
About speeding up: you could try reducing the batch size in the current version; we often use 1. The gradient of each optimization step is computed over the batch, so a smaller batch size reduces the computational cost of a single optimization step. You can also reduce the number of k-points and corresponding eigenvalues in the data file: a sparser k-path or k-mesh reduces the number of eigendecompositions, which also lowers the cost. The API of TBPLaS is not yet included in the current version, but it will be released shortly. We are also happy to send you the script for using TBPLaS with DeePTB's output Hamiltonian; please tell us if you need it. Thank you for using this software and for your valuable feedback. Please feel free to contact us if further help is needed. I have also opened the discussion section for convenience.
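As a side note on why sparser k-sampling helps: the eigenvalue fitting performs one Hermitian eigendecomposition per k-point, so the cost of that step scales roughly linearly with the number of k-points. A minimal NumPy illustration of that scaling (not DeePTB code; the function name and shapes here are made up for the sketch):

```python
import numpy as np

def band_eigenvalues(hamiltonians):
    """One Hermitian eigendecomposition per k-point: cost grows with nk."""
    # hamiltonians: shape (nk, n, n), one Hermitian matrix per k-point
    return np.array([np.linalg.eigvalsh(h) for h in hamiltonians])

rng = np.random.default_rng(0)
nk, n = 8, 4                           # 8 k-points, 4 orbitals
a = rng.standard_normal((nk, n, n))
hk = a + a.transpose(0, 2, 1)          # symmetrize: real Hermitian matrices
eigs = band_eigenvalues(hk)            # shape (nk, n); halving nk halves this work
```

Each entry of `eigs` is one band energy at one k-point, which is why thinning the k-path in the data file directly reduces the per-step cost.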
Thank you for the detailed reply! I'm going to close this issue and will move any future questions to the discussion section; thank you for opening it! I would really appreciate it if you could share the TBPLaS script! That feature is one of the most exciting parts of this project for me, so testing it would be very helpful in determining whether I should devote more time to using this package. Perhaps the easiest way would be to share it with me in a separate repo? I could also share my email if that is preferable for you; please let me know. For now, I am following your recommended training procedure and trying several models: one for crystalline h-BN (which should work, as in the example), one for crystalline c-BN, and a third for both. I will also try the environmental correction for these three configurations after they finish training, and if I encounter the same error as before, I will open a question in the discussion section with my inputs and training data. Thank you again!
Hi there, I just wanted to follow up about the TBPLaS script, @floatingCatty! I also wanted to let you know that the issue with the environmental correction was an error on my part, and it is working now.
tbplas.zip This is a brief guide on how to use the TBPLaS interfaces with the current DeePTB code. Please feel free to comment if further help is needed. It is great to know the environmental correction is working now! Thank you very much for using our code and for the valuable advice!
Details
Hello,
I am testing DeePTB on a set of ~5000 structures. I copied an input file from the first step of the BN example and made only small changes for my dataset (paths, atomic species). (I know this set of parameters is not great; for now I am just trying to make sure the code works on my dataset, and it already worked on the example.)
Because of the dataset size, training on the CPU (the default) is very slow, so I tried training on the GPU by adding

"device" : "cuda"

to the input JSON under "common_options".
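For illustration, here is a minimal sketch of where that option would sit in the input file (only the "device" key comes from this thread; any other keys in the file stay as they are in the BN example):

```json
{
    "common_options": {
        "device": "cuda"
    }
}
```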
I then train using

dptb train -sk input.json -o ./first > log_first 2>&1 &

When I do so, DeePTB crashes with the error
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
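For context, this error is generic PyTorch behavior rather than something DeePTB-specific: it typically appears when the model's parameters live on the GPU while some input tensor is still on the CPU. A minimal sketch of the cause and the usual fix, in plain PyTorch (not DeePTB internals):

```python
import torch

# Pick the GPU when available; fall back to the CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4, 2).to(device)  # parameters now live on `device`

x = torch.randn(1, 4)       # created on the CPU by default
# y = model(x)              # on a GPU this raises the device-mismatch RuntimeError
y = model(x.to(device))     # fix: move the input to the model's device first
```

In DeePTB's case this move has to happen inside the code that builds the input tensors, which is consistent with the maintainers' reply above that cuda is not yet supported in this version.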
Here's the full traceback:
Here is my full input file (renamed to .txt because GitHub does not allow uploading .json files):
input.txt
Is there an error in my input file? Or is the way to train on a GPU different from the one I tried?
Thanks!
-Ali