[BUG] Tensor Shape Mismatch During DeepMD Training #3949

PhelanShao · 2024-07-04T03:36:54Z

Bug summary

When training with DeePMD-kit, a ValueError is raised while the input statistics are being computed from the training data: the value fed for the tensor d_sea_t_natoms:0 has shape (6,), but the tensor expects shape (7,). This issue might be related to the configuration of the magnetic spin parameters in the input files and the corresponding training data.

I suspect the training data generated from CP2K does not contain magnetic spin data, whereas the provided example may use data derived from VASP's OSZICAR output, which does contain spin data.

DeePMD-kit Version

registry.dp.tech/dptech/deepmd-kit:2024Q1-d23cf3e

Backend and its version

registry.dp.tech/dptech/deepmd-kit:2024Q1-d23cf3e

How did you download the software?

Others (write below)

Input Files, Running Commands, Error Log, etc.

Traceback (most recent call last):
  File "/opt/mamba/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/main.py", line 807, in main
    deepmd_main(args)
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/tf/entrypoints/main.py", line 72, in main
    train_dp(**dict_args)
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/tf/entrypoints/train.py", line 153, in train
    _do_work(jdata, run_opt, is_compress)
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/tf/entrypoints/train.py", line 265, in _do_work
    model.build(train_data, stop_batch, origin_type_map=origin_type_map)
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/tf/train/trainer.py", line 284, in build
    self.model.data_stat(data)
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/tf/model/ener.py", line 128, in data_stat
    self._compute_input_stat(
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/tf/model/ener.py", line 147, in _compute_input_stat
    self.descrpt.compute_input_stats(
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/tf/descriptor/se_a.py", line 373, in compute_input_stats
    sysr, sysr2, sysa, sysa2, sysn = self._compute_dstats_sys_smth(
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/tf/descriptor/se_a.py", line 841, in _compute_dstats_sys_smth
    dd_all = run_sess(
  File "/opt/mamba/lib/python3.10/site-packages/deepmd/tf/utils/sess.py", line 31, in run_sess
    return sess.run(*args, **kwargs)
  File "/opt/mamba/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 972, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/opt/mamba/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1189, in _run
    raise ValueError(
ValueError: Cannot feed value of shape (6,) for Tensor d_sea_t_natoms:0, which has shape (7,)

Steps to Reproduce

{
    "model": {
        "type_map": ["C", "O", "H", "S"],
        "descriptor": {
            "type": "se_e2_a",
            "sel": [16, 46, 92, 52],
            "rcut_smth": 1.0,
            "rcut": 5.0,
            "neuron": [25, 50, 100],
            "resnet_dt": false,
            "axis_neuron": 16,
            "seed": 930070626
        },
        "fitting_net": {
            "neuron": [240, 240, 240],
            "resnet_dt": true,
            "seed": 3301444140
        },
        "spin": {
            "use_spin": [false, false, false, true],
            "virtual_len": [0.4],
            "spin_norm": [0.0, 0.0, 0.0, 1.0],
            "_comment4": " that's all"
        }
    },
    "learning_rate": {
        "type": "exp",
        "start_lr": 0.001,
        "decay_steps": 10000
    },
    "loss": {
        "type": "ener_spin",
        "start_pref_e": 0.1,
        "limit_pref_e": 2,
        "start_pref_fr": 1000,
        "limit_pref_fr": 1.0,
        "start_pref_fm": 10000,
        "limit_pref_fm": 10.0,
        "start_pref_v": 0,
        "limit_pref_v": 0,
        "_comment7": " that's all"
    },
    "training": {
        "stop_batch": 50000,
        "disp_file": "lcurve.out",
        "disp_freq": 500,
        "numb_test": 1,
        "save_freq": 500,
        "save_ckpt": "model.ckpt",
        "disp_training": true,
        "time_training": true,
        "profiling": false,
        "profiling_file": "timeline.json",
        "_comment": "that's all",
        "training_data": {
            "systems": [
                "../data.init/init/final1/training_data",
                "../data.init/init/final2/training_data",
                "../data.init/init/final3/training_data",
                "../data.iters/iter.000000/02.fp/data.000",
                "../data.iters/iter.000000/02.fp/data.001",
                "../data.iters/iter.000000/02.fp/data.002",
                "../data.iters/iter.000001/02.fp/data.000",
                "../data.iters/iter.000001/02.fp/data.001",
                "../data.iters/iter.000001/02.fp/data.002",
                "../data.iters/iter.000002/02.fp/data.000",
                "../data.iters/iter.000002/02.fp/data.001",
                "../data.iters/iter.000002/02.fp/data.002",
                "../data.iters/iter.000003/02.fp/data.000",
                "../data.iters/iter.000003/02.fp/data.001",
                "../data.iters/iter.000003/02.fp/data.002",
                "../data.iters/iter.000004/02.fp/data.000",
                "../data.iters/iter.000004/02.fp/data.001",
                "../data.iters/iter.000004/02.fp/data.002",
                "../data.iters/iter.000005/02.fp/data.000",
                "../data.iters/iter.000005/02.fp/data.001",
                "../data.iters/iter.000005/02.fp/data.002",
                "../data.iters/iter.000006/02.fp/data.000",
                "../data.iters/iter.000006/02.fp/data.001",
                "../data.iters/iter.000006/02.fp/data.002",
                "../data.iters/iter.000007/02.fp/data.000",
                "../data.iters/iter.000007/02.fp/data.001",
                "../data.iters/iter.000007/02.fp/data.002",
                "../data.iters/iter.000009/02.fp/data.000",
                "../data.iters/iter.000010/02.fp/data.000",
                "../data.iters/iter.000011/02.fp/data.000",
                "../data.iters/iter.000011/02.fp/data.002",
                "../data.iters/iter.000012/02.fp/data.000",
                "../data.iters/iter.000012/02.fp/data.002",
                "../data.iters/iter.000013/02.fp/data.000",
                "../data.iters/iter.000013/02.fp/data.002",
                "../data.iters/iter.000014/02.fp/data.000",
                "../data.iters/iter.000014/02.fp/data.002",
                "../data.iters/iter.000015/02.fp/data.000",
                "../data.iters/iter.000015/02.fp/data.002",
                "../data.iters/iter.000016/02.fp/data.000",
                "../data.iters/iter.000016/02.fp/data.002",
                "../data.iters/iter.000017/02.fp/data.000",
                "../data.iters/iter.000017/02.fp/data.002",
                "../data.iters/iter.000018/02.fp/data.000",
                "../data.iters/iter.000018/02.fp/data.002",
                "../data.iters/iter.000019/02.fp/data.000",
                "../data.iters/iter.000019/02.fp/data.002",
                "../data.iters/iter.000020/02.fp/data.000",
                "../data.iters/iter.000020/02.fp/data.002",
                "../data.iters/iter.000021/02.fp/data.000",
                "../data.iters/iter.000021/02.fp/data.002",
                "../data.iters/iter.000022/02.fp/data.000"
            ],
            "batch_size": [
                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
            ]
        },
        "seed": 2826335128
    }
}

Further Information, Files, and Links

No response

njzjz · 2024-07-06T02:01:59Z

This error happens when the model.ntypes=5, len(model.type_map)=4, and data.ntypes=4. @iProzd What are the expected values here?
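For readers wondering where the 6 and 7 in the error message come from: as far as I understand the TensorFlow backend, the natoms vector has ntypes + 2 entries (local atom count, total atom count, then one count per type). The sketch below is only an illustration under that assumption; `natoms_vector` and the example frame are hypothetical and not DeePMD-kit code.

```python
def natoms_vector(atom_types, ntypes):
    """Build a natoms-style vector of length ntypes + 2 (assumed layout)."""
    nloc = len(atom_types)
    counts = [0] * ntypes
    for t in atom_types:
        counts[t] += 1
    # [local atoms, total atoms, count of type 0, count of type 1, ...]
    return [nloc, nloc] + counts


# Hypothetical frame: 2 C, 1 O, 4 H, 1 S with type indices 0..3
atom_types = [0, 0, 1, 2, 2, 2, 2, 3]

# The data side knows 4 types, so it builds a vector with 4 + 2 entries.
data_vec = natoms_vector(atom_types, ntypes=4)

# The model side counts one extra virtual spin type, so it expects 5 + 2 entries.
model_len = 5 + 2

print(len(data_vec), model_len)  # 6 7
```

With data.ntypes=4 the fed value has 6 entries, while a model with ntypes=5 expects 7, which matches the shapes in the ValueError.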

iProzd · 2024-07-06T17:22:40Z

> This error happens when the model.ntypes=5, len(model.type_map)=4, and data.ntypes=4. @iProzd What are the expected values here?

@PhelanShao Without access to your training data, it's hard to be certain, but the issue might be related to a strong assumption about type_map in the TensorFlow spin model. Specifically, element types with spin must appear before those without spin in the type_map. For instance, you should use type_map: ["S", "C", "O", "H"] instead of ["C", "O", "H", "S"], and keep that order consistent everywhere (e.g., in sel, use_spin, spin_norm, etc.).

Additionally, the type.raw in your training data should reflect this type index. For example, you would use 0, 1, 2, 3 for S, C, O, H elements and 4 for the virtual spin atom of S.

@hztttt Am I correct? If so, we should clarify this requirement in the TensorFlow documentation.

@PhelanShao You can also now use the spin model in PyTorch, see here. Note that the PyTorch and TensorFlow data formats differ, which is also detailed in the doc above.
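The type index change described above could be sketched as follows. This is a minimal illustration, assuming type.raw holds one integer index per atom; `remap_types` and the example atom list are hypothetical, and the sketch does not add the virtual spin atoms (type 4).

```python
old_type_map = ["C", "O", "H", "S"]
new_type_map = ["S", "C", "O", "H"]  # spin-carrying type first


def remap_types(type_indices, old_map, new_map):
    """Translate per-atom type indices from old_map order to new_map order."""
    lookup = {old_i: new_map.index(name) for old_i, name in enumerate(old_map)}
    return [lookup[t] for t in type_indices]


# Hypothetical per-atom types: C C O H H S in the old indexing
old_types = [0, 0, 1, 2, 2, 3]
new_types = remap_types(old_types, old_type_map, new_type_map)

print(new_types)  # [1, 1, 2, 3, 3, 0]
```

Only the indices change here; the atom order in the coordinate and force files stays as it was.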

njzjz · 2024-07-06T20:55:19Z

I believe there will be something wrong when a system does not contain atoms with spin.

njzjz · 2024-07-06T23:36:45Z

Here, None or a type_map with 5 types should be passed to DeepmdDataSystem. Currently, it passes a type_map with 4 types.

deepmd-kit/deepmd/tf/entrypoints/train.py

Lines 212 to 216 in 29db791

    type_map = model.model.get_type_map()
    if len(type_map) == 0:
        ipt_type_map = None
    else:
        ipt_type_map = type_map

The question here is what type_map.raw is expected for data with spin, or whether no type_map.raw is expected at all (in which case we should pass None to DeepmdDataSystem). This should also be explained in #3760.
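One way to picture the "pass None" option is the sketch below. It is not the actual DeePMD-kit code or a proposed patch; `choose_data_type_map` is a hypothetical helper that only illustrates falling back to None when the number of names in type_map does not match the number of types the model uses.

```python
def choose_data_type_map(type_map, model_ntypes):
    """Hypothetical helper: return None unless type_map covers every model type."""
    if len(type_map) == 0 or len(type_map) != model_ntypes:
        # Empty or too short (e.g. 4 names for 5 model types): let the data
        # system fall back to the raw type indices instead of a name mapping.
        return None
    return list(type_map)


print(choose_data_type_map(["C", "O", "H", "S"], model_ntypes=5))  # None
print(choose_data_type_map(["C", "O", "H", "S"], model_ntypes=4))  # ['C', 'O', 'H', 'S']
```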

hztttt · 2024-07-07T02:43:07Z

> > This error happens when the model.ntypes=5, len(model.type_map)=4, and data.ntypes=4. @iProzd What are the expected values here?
>
> @PhelanShao Without access to your training data, it's hard to be certain, but the issue might be related to a strong assumption about type_map in the TensorFlow spin model. Specifically, element types with spin must appear before those without spin in the type_map. For instance, you should use type_map: ["S", "C", "O", "H"] instead of ["C", "O", "H", "S"], and keep that order consistent everywhere (e.g., in sel, use_spin, spin_norm, etc.).
>
> Additionally, the type.raw in your training data should reflect this type index. For example, you would use 0, 1, 2, 3 for S, C, O, H elements and 4 for the virtual spin atom of S.
>
> @hztttt Am I correct? If so, we should clarify this requirement in the TensorFlow documentation.
>
> @PhelanShao You can also now use the spin model in PyTorch, see here. Note that the PyTorch and TensorFlow data formats differ, which is also detailed in the doc above.

Yeah. In the TensorFlow backend spin model, element types with spin must appear before those without spin in the type_map. Also, we only need to specify spin_norm for the magnetic atoms. For instance, you should use spin_norm: [1.0] instead of spin_norm: [1.0, 0.0, 0.0, 0.0] after adjusting type_map. Moreover, you should put the magnetic force after the atomic force in force.raw to support training with spin.
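Applied to the input file from this issue, the reordering might look like the sketch below. It only permutes the per-type lists already present in the posted config and trims spin_norm to the magnetic types; whether any other key needs adjusting is not settled in this thread, so treat it as an illustration rather than a verified input.

```python
import json

# Per-type settings copied from the input file posted in this issue
old = {
    "type_map": ["C", "O", "H", "S"],
    "sel": [16, 46, 92, 52],
    "use_spin": [False, False, False, True],
    "spin_norm": [0.0, 0.0, 0.0, 1.0],
}

# Stable sort: types with use_spin=True come first, the rest keep their order
order = sorted(range(len(old["type_map"])), key=lambda i: not old["use_spin"][i])

new = {
    "type_map": [old["type_map"][i] for i in order],
    "sel": [old["sel"][i] for i in order],
    "use_spin": [old["use_spin"][i] for i in order],
    # spin_norm is kept only for the magnetic types
    "spin_norm": [old["spin_norm"][i] for i in order if old["use_spin"][i]],
}

print(json.dumps(new, indent=4))
```

Here `order` comes out as [3, 0, 1, 2], so the result has type_map ["S", "C", "O", "H"], sel [52, 16, 46, 92], use_spin [true, false, false, false] and spin_norm [1.0].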

hztttt · 2024-07-07T02:45:14Z

> Here, None or a type_map with 5 types should be passed to DeepmdDataSystem. Currently, it passes a type_map with 4 types.
>
> deepmd-kit/deepmd/tf/entrypoints/train.py
>
> Lines 212 to 216 in 29db791
>
>     type_map = model.model.get_type_map()
>     if len(type_map) == 0:
>         ipt_type_map = None
>     else:
>         ipt_type_map = type_map
>
> The question here is what type_map.raw is expected for data with spin, or whether no type_map.raw is expected at all (in which case we should pass None to DeepmdDataSystem). This should also be explained in #3760.

No type_map.raw is expected.

PhelanShao · 2024-07-07T02:54:03Z

Thank you! Actually, there is no sulfur (S) element in the system. It is an oxygen (O) element with a magnetic moment. To differentiate it when marking with fp, I was worried that naming it O_1 or O_A might affect its transmission in dpgen, so I replaced it with S. However, the type_map order in the dpgen process using dpdata conversion also needs adjustment, right? Should I rename this element to ensure it is ordered before C, O, and H?

njzjz · 2024-07-07T07:17:45Z

> No type_map.raw is expected.

This is not a good behavior, anyway.

PhelanShao added the bug label Jul 4, 2024

njzjz added upstream and removed upstream labels Jul 17, 2024
