Bug summary
While loading the trained model in a Python environment and calling the DeepPot.eval_descriptor function on my test LabeledSystem, there is an OOM error on my A100-40G hardware:
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 39.42 GiB of which 3.56 MiB is free. Process 276556 has 30.06 MiB memory in use. Including non-PyTorch memory, this process has 39.38 GiB memory in use. Of the allocated memory 38.05 GiB is allocated by PyTorch, and 25.29 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/mps/liuzq/FeCHO-dpa2/300rc0/v2-7-direct-10p/desc-val/oom-example/calc_desc.py", line 53, in <module>
desc = descriptor_from_model(onedata, model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mps/liuzq/FeCHO-dpa2/300rc0/v2-7-direct-10p/desc-val/oom-example/calc_desc.py", line 26, in descriptor_from_model
predict = model.eval_descriptor(coords, cells, atypes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mps/miniconda3/envs/deepmd-3rc0/lib/python3.11/site-packages/deepmd/infer/deep_eval.py", line 445, in eval_descriptor
descriptor = self.deep_eval.eval_descriptor(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mps/miniconda3/envs/deepmd-3rc0/lib/python3.11/site-packages/deepmd/pt/infer/deep_eval.py", line 649, in eval_descriptor
self.eval(
File "/home/mps/miniconda3/envs/deepmd-3rc0/lib/python3.11/site-packages/deepmd/pt/infer/deep_eval.py", line 292, in eval
out = self._eval_func(self._eval_model, numb_test, natoms)(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mps/miniconda3/envs/deepmd-3rc0/lib/python3.11/site-packages/deepmd/pt/infer/deep_eval.py", line 364, in eval_func
return self.auto_batch_size.execute_all(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mps/miniconda3/envs/deepmd-3rc0/lib/python3.11/site-packages/deepmd/utils/batch_size.py", line 197, in execute_all
n_batch, result = self.execute(execute_with_batch_size, index, natoms)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mps/miniconda3/envs/deepmd-3rc0/lib/python3.11/site-packages/deepmd/utils/batch_size.py", line 120, in execute
raise OutOfMemoryError(
deepmd.utils.errors.OutOfMemoryError: The callable still throws an out-of-memory (OOM) error even when batch size is 1!
But dp test with the same model on the same LabeledSystem dataset completes, using ~39 GB of memory at first and then dropping to ~28 GB.
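For context, the core of calc_desc.py boils down to roughly the following (a minimal sketch rather than the exact script from the archive; the data path, format, and array preparation here are placeholders inferred from the traceback):

```python
import dpdata
from deepmd.infer import DeepPot

# Placeholder path/format: the real LabeledSystems are in the attached data/ directory
system = dpdata.LabeledSystem("data/one_system", fmt="deepmd/npy")
model = DeepPot("model.pth")

nframes = system.get_nframes()
coords = system.data["coords"].reshape(nframes, -1)  # (nframes, natoms*3)
cells = system.data["cells"].reshape(nframes, -1)    # (nframes, 9)
atypes = system.data["atom_types"].tolist()

# This call raises the CUDA OOM shown above, even though DeePMD-kit's
# auto-batching reduces the batch size all the way down to 1.
desc = model.eval_descriptor(coords, cells, atypes)
print(desc.shape)
```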
DeePMD-kit Version
DeePMD-kit v3.0.0rc1.dev0+g0ad42893.d20250106
Backend and its version
PyTorch 2.5.1
How did you download the software?
pip
Input Files, Running Commands, Error Log, etc.
All reference files can be accessed via the attached dp_eval_desc_oom.tar.gz or through Nutstore Cloud:
https://www.jianguoyun.com/p/DV3eAQoQrZ-XCRim4ecFIAA (code: unpbns)
Steps to Reproduce
The archive contains the following scripts and files:
- calc_desc.py: uses DeepPot.eval_descriptor() to generate the descriptor of a LabeledSystem
- desc_all.sh: reads the MultiSystems from data and calls calc_desc.py iteratively (a rough Python equivalent of this driver loop is sketched below)
- data: directory containing the LabeledSystems that trigger the OOM in eval_descriptor
- model.pth: the model in use
- test.sh: the script calling dp --pt test and writing test.log
- desc.log: the OOM stderr print-out
- descriptors: the directory meant to hold the output descriptors; it stays empty because of the OOM

One can reproduce the OOM problem directly with these files.
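desc_all.sh itself is a shell script; assuming calc_desc.py takes the system directory as its only argument (the exact interface is in the archive), its driver loop amounts to:

```python
import glob
import subprocess

# Run calc_desc.py once per LabeledSystem subdirectory under data/.
# The command-line interface of calc_desc.py is assumed here; see the
# actual scripts in the attached archive for the real invocation.
for system_dir in sorted(glob.glob("data/*")):
    subprocess.run(["python", "calc_desc.py", system_dir], check=True)
```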
Further Information, Files, and Links
Related issue: #4533
If one directly uses the DeepPot.eval() function in Python code, a similar OOM problem also emerges (from my previous test in Jan 2024), so I guess there is some difference between running dp test on the command line and calling the evaluation interface directly from Python; a sketch of that eval() call follows.
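The eval() call from that earlier test looked roughly like this (a sketch with the same placeholder data preparation as in the eval_descriptor example above):

```python
import dpdata
from deepmd.infer import DeepPot

# Placeholder system path; inputs are flattened the same way as for eval_descriptor.
system = dpdata.LabeledSystem("data/one_system", fmt="deepmd/npy")
nframes = system.get_nframes()
coords = system.data["coords"].reshape(nframes, -1)
cells = system.data["cells"].reshape(nframes, -1)
atypes = system.data["atom_types"].tolist()

model = DeepPot("model.pth")
# The plain energy/force/virial evaluation showed the same OOM behaviour.
energy, force, virial = model.eval(coords, cells, atypes)
```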