
[BUG] An error during an MC simulation in LAMMPS #4207

Open
Zch102xjtumse opened this issue Oct 12, 2024 · 2 comments


Bug summary

Hello everyone. I encountered an error when using a fine-tuned DPA2 model in a LAMMPS MC simulation. The error message is below; I don't know what caused it. I'd appreciate any help with this.

DeePMD-kit Version

DeePMD-kit v3.0.0b4

Backend and its version

PyTorch v2.0.0.post200-gc263bd43e8e

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

ERROR on proc 2: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/torch/deepmd/pt/model/model/ener_model.py", line 56, in forward_lower
comm_dict: Optional[Dict[str, Tensor]]=None) -> Dict[str, Tensor]:
_5 = (self).need_sorted_nlist_for_lower()
model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, comm_dict, _5, )
~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_6 = (self).get_fitting_net()
model_predict = annotate(Dict[str, Tensor], {})
File "code/torch/deepmd/pt/model/model/ener_model.py", line 213, in forward_common_lower
cc_ext, _36, fp, ap, input_prec, = _35
atomic_model = self.atomic_model
atomic_ret = (atomic_model).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, comm_dict, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_37 = (self).atomic_output_def()
training = self.training
File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 50, in forward_common_atomic
ext_atom_mask = (self).make_atom_mask(extended_atype, )
_3 = torch.where(ext_atom_mask, extended_atype, 0)
ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, )
~~~~~~~~~~~~~~~~~~~~ <--- HERE
ret_dict0 = (self).apply_out_stat(ret_dict, atype, )
_4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc)
File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 93, in forward_atomic
pass
descriptor = self.descriptor
_16 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, comm_dict, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
descriptor0, rot_mat, g2, h2, sw, = _16
fitting_net = self.fitting_net
File "code/torch/deepmd/pt/model/descriptor/dpa2.py", line 84, in forward
repformers3 = self.repformers
_17 = nlist_dict[_1(_16, (repformers3).get_nsel(), )]
_18 = (repformers1).forward(_17, extended_coord, extended_atype, g11, mapping0, comm_dict0, )
~~~~~~~~~~~~~~~~~~~~ <--- HERE
g12, g2, h2, rot_mat, sw, = _18
concat_output_tebd = self.concat_output_tebd
File "code/torch/deepmd/pt/model/descriptor/repformers.py", line 226, in forward
_32 = torch.tensor(nloc)
_33 = torch.tensor(torch.sub(nall, nloc))
ret = ops.deepmd.border_op(_25, _26, _27, _28, _29, g10, _31, _32, _33)
~~~~~~~~~~~~~~~~~~~~ <--- HERE
g1_ext, comm_dict6, mapping6 = torch.unsqueeze(ret[0], 0), comm_dict7, mapping2
_34 = (_00).forward(g1_ext, g23, h2, nlist0, nlist_mask, sw1, )

Traceback of TorchScript, original code (most recent call last):
File "/home/zhaochenhao/soft/deepmd3.0b3/lib/python3.10/site-packages/deepmd/pt/model/model/ener_model.py", line 109, in forward_lower
comm_dict: Optional[Dict[str, torch.Tensor]] = None,
):
model_ret = self.forward_common_lower(
~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
extended_coord,
extended_atype,
File "/home/zhaochenhao/soft/deepmd3.0b3/lib/python3.10/site-packages/deepmd/pt/model/model/make_model.py", line 261, in forward_common_lower
)
del extended_coord, fparam, aparam
atomic_ret = self.atomic_model.forward_common_atomic(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
cc_ext,
extended_atype,
File "/home/zhaochenhao/soft/deepmd3.0b3/lib/python3.10/site-packages/deepmd/pt/model/atomic_model/base_atomic_model.py", line 241, in forward_common_atomic

    ext_atom_mask = self.make_atom_mask(extended_atype)
    ret_dict = self.forward_atomic(
               ~~~~~~~~~~~~~~~~~~~ <--- HERE
        extended_coord,
        torch.where(ext_atom_mask, extended_atype, 0),

File "/home/zhaochenhao/soft/deepmd3.0b3/lib/python3.10/site-packages/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 189, in forward_atomic
if self.do_grad_r() or self.do_grad_c():
extended_coord.requires_grad_(True)
descriptor, rot_mat, g2, h2, sw = self.descriptor(
~~~~~~~~~~~~~~~ <--- HERE
extended_coord,
extended_atype,
File "/home/zhaochenhao/soft/deepmd3.0b3/lib/python3.10/site-packages/deepmd/pt/model/descriptor/dpa2.py", line 652, in forward
g1 = g1_ext
# repformer
g1, g2, h2, rot_mat, sw = self.repformers(
~~~~~~~~~~~~~~~ <--- HERE
nlist_dict[
get_multiple_nlist_key(
File "/home/zhaochenhao/soft/deepmd3.0b3/lib/python3.10/site-packages/deepmd/pt/model/descriptor/repformers.py", line 480, in forward
assert "recv_num" in comm_dict
assert "communicator" in comm_dict
ret = torch.ops.deepmd.border_op(
~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
comm_dict["send_list"],
comm_dict["send_proc"],
RuntimeError: Trying to create tensor with negative dimension -1873441304: [-1873441304]
(/home/conda/feedstock_root/build_artifacts/deepmd-kit_1722057353391/work/source/lmp/pair_deepmd.cpp:586)
Last command: run 150000

MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

Steps to Reproduce

The LAMMPS input file is as follows:
label i
variable i loop 2
variable ts equal 0+300*$i
variable ta equal 0+300*$i
shell mkdir dpav1-${ta}
units metal
boundary p p p
atom_style atomic
timestep 0.001
read_data min.data
pair_style deepmd ../dpav1.pth
pair_coeff * * x x x
compute 1 all temp
compute Ek all ke/atom
compute Ep all pe/atom
compute_modify 1 dynamic yes
thermo_style custom step dt time temp ke pe etotal press lx ly lz vol
thermo 100
dump 1 all custom 5000 dpav1-${ta}/dumpthermo.atom.* id type x y z c_Ek c_Ep
velocity all create ${ts} 82765577 rot yes dist gaussian
fix r2 all npt temp ${ta} ${ta} 0.1 iso 0.0 0.0 1.0
fix mc4 all atom/swap 20 5 82765577 ${ts} types 1 2
fix mc5 all atom/swap 20 5 82765577 ${ts} types 1 3
fix mc6 all atom/swap 20 5 82765577 ${ts} types 2 3
run 100000
min_style cg
minimize 1.0e-6 1.0e-7 10000 10000

clear
next i
jump SELF i
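
As a side check (not part of the reported input), the model can also be evaluated outside LAMMPS with the DeePMD-kit Python inference API, to confirm that the .pth file itself loads and runs; the sketch below uses placeholder coordinates, a placeholder cell, and a hypothetical type assignment rather than the real contents of min.data:

import numpy as np
from deepmd.infer import DeepPot

dp = DeepPot("../dpav1.pth")                  # same model file used by pair_style deepmd
coords = np.random.rand(1, 108 * 3)           # 108 atoms (count reported later in this thread), flattened xyz; placeholder values
cell = (10.0 * np.eye(3)).reshape(1, 9)       # placeholder 10 Angstrom cubic box
atom_types = [0] * 36 + [1] * 36 + [2] * 36   # hypothetical assignment over the 3 types swapped in the input

e, f, v = dp.eval(coords, cell, atom_types)   # energy, force, virial
print("energy:", e)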

Further Information, Files, and Links

No response

njzjz (Member) commented Oct 13, 2024

How many atoms are there? It looks like an integer overflow bug. Could you provide files to reproduce the bug?
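
For illustration only (not DeePMD-kit code): a negative size such as -1873441304 is what a count looks like after it overflows 32-bit signed arithmetic or is read from corrupted memory, and PyTorch then rejects it as a tensor dimension. A minimal Python sketch of that failure mode, assuming NumPy and PyTorch are installed:

import numpy as np
import torch

# Adding 1 to INT32_MAX in 32-bit arithmetic wraps to a large negative value
# (NumPy prints an overflow warning but still returns the wrapped result).
n = np.int32(2_147_483_647) + np.int32(1)
print(int(n))  # -2147483648

# Passing such a value as a tensor size yields the same kind of error
# reported above from ops.deepmd.border_op.
try:
    torch.empty(int(n))
except RuntimeError as err:
    print(err)  # Trying to create tensor with negative dimension -2147483648 ...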

Zch102xjtumse (Author) commented:

> How many atoms are there? It looks like an integer overflow bug. Could you provide files to reproduce the bug?

108 atoms. Here are the files: file.zip
