Export error after training #3770

Open
ip2016 opened this issue Jul 26, 2024 · 6 comments
ip2016 commented Jul 26, 2024

I'm trying to train a yolox_tiny model on my image dataset with a single additional category. Training and testing complete successfully, but exporting fails with the error "Argument 1 and 2 element types must match." I'm using the otx[xpu] extension and an ARC 750 GPU for training.

Steps to Reproduce

  1. Training:
    otx train --config recipe/detection/yolox_tiny.yaml --data_root Datasets/my-dataset --work_dir yolox-model

Epoch 15/199 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8/8 0:00:03 • 0:00:00 2.48it/s v_num: 0 train/loss_cls: 0.452 train/loss_bbox: 1.580 train/loss_obj: 0.928 train/loss: 2.960
train/data_time: 0.022 train/iter_time: 0.423 val/map: 0.692 val/map_50: 1.000 val/map_75: 1.000
val/map_small: -1.000 val/map_medium: -1.000 val/map_large: 0.692 val/mar_1: 0.720 val/mar_10:
0.720 val/mar_100: 0.720 val/mar_small: -1.000 val/mar_medium: -1.000 val/mar_large: 0.720
val/map_per_class: -1.000 val/mar_100_per_class: -1.000 val/classes: 0.000 val/f1-score: 1.000
Elapsed time: 0:01:37.700299

  2. Testing:
    otx test --config yolox-model/20240726_144135/configs.yaml --data_root Datasets/my-dataset --checkpoint yolox-model/20240726_144135/last.ckpt

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric ┃ DataLoader 0 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ test/classes │ 0.0 │
│ test/f1-score │ 0.8888888955116272 │
│ test/map │ 0.49603959918022156 │
│ test/map_50 │ 0.7920792102813721 │
│ test/map_75 │ 0.7920792102813721 │
│ test/map_large │ 0.49603959918022156 │
│ test/map_medium │ -1.0 │
│ test/map_per_class │ -1.0 │
│ test/map_small │ -1.0 │
│ test/mar_1 │ 0.5 │
│ test/mar_10 │ 0.5 │
│ test/mar_100 │ 0.5 │
│ test/mar_100_per_class │ -1.0 │
│ test/mar_large │ 0.5 │
│ test/mar_medium │ -1.0 │
│ test/mar_small │ -1.0 │
└───────────────────────────┴───────────────────────────┘
Testing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 0:00:07 • 0:00:00 0.00it/s
Elapsed time: 0:00:30.886884

  3. Exporting:
    otx export --config yolox-model/20240726_144135/configs.yaml --data_root Datasets/my-dataset --checkpoint yolox-model/20240726_144135/last.ckpt

/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/core/model/detection.py:268: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
shape = (int(inputs.shape[2]), int(inputs.shape[3]))
/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/core/model/detection.py:275: TracerWarning: Using len to get tensor shape might cause the trace to be incorrect. Recommended usage would be tensor.shape[0]. Passing a tensor of different shape might lead to errors or silently give incorrect results.
meta_info_list = [meta_info] * len(inputs)
/mnt/d/Projects/venv/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /build/pytorch/aten/src/ATen/native/TensorShape.cpp:3526.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/algo/common/utils/nms.py:248: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
iou_threshold = torch.tensor([iou_threshold], dtype=torch.float32)
/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/algo/common/utils/nms.py:249: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
score_threshold = torch.tensor([score_threshold], dtype=torch.float32)
/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/algo/common/utils/utils.py:142: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
k = torch.tensor(k, device=input.device, dtype=torch.long)
/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/algo/common/utils/nms.py:387: TracerWarning: Converting a tensor to a Python float might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
score_threshold = float(score_threshold)
/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/algo/common/utils/nms.py:388: TracerWarning: Converting a tensor to a Python float might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
iou_threshold = float(iou_threshold)
/mnt/d/Projects/venv/lib/python3.10/site-packages/torch/onnx/symbolic_opset9.py:5856: UserWarning: Exporting aten::index operator of advanced indexing in opset 11 is achieved by combination of multiple ONNX operators, including Reshape, Transpose, Concat, and Gather. If indices include negative values, the exported graph will produce incorrect results.
warnings.warn(
/mnt/d/Projects/venv/lib/python3.10/site-packages/torch/onnx/utils.py:1686: UserWarning: The exported ONNX model failed ONNX shape inference. The model will not be executable by the ONNX Runtime. If this is unintended and you believe there is a bug, please report an issue at https://github.com/pytorch/pytorch/issues. Error reported by strict ONNX shape inference: [ShapeInferenceError] (op_type:Where, node name: /Where_2): Y has inconsistent type tensor(float) (Triggered internally at /build/pytorch/torch/csrc/jit/serialization/export.cpp:1415.)
_C._check_onnx_proto(proto)
2024-07-26 08:27:07,083 - root - INFO - Converting to ONNX is done.

GeneralFailure: Check 'error_message.empty()' failed at src/frontends/onnx/frontend/src/frontend.cpp:122:
FrontEnd API failed with GeneralFailure:
Errors during ONNX translation:
While validating ONNX node '<Node(Where): /Where_2>':
Check 'element::Type::merge(result_et, get_input_element_type(1), get_input_element_type(2))' failed at src/core/src/op/select.cpp:68:
While validating node 'opset1::Select Select_2595 (opset1::Equal /Equal[0]:boolean[1,..200], opset1::Tile /Tile_1[0]:i64[1,..200], opset1::Constant /Constant_112[0]:f32[1]) -> (dynamic[...])' with friendly_name 'Select_2595':
Argument 1 and 2 element types must match.

Traceback (most recent call last):
File "/mnt/d/Projects/venv/bin/otx", line 8, in <module>
sys.exit(main())
File "/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/cli/__init__.py", line 17, in main
OTXCLI()
File "/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/cli/cli.py", line 60, in __init__
self.run()
File "/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/cli/cli.py", line 531, in run
fn(**fn_kwargs)
File "/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/engine/engine.py", line 585, in export
exported_model_path = self.model.export(
File "/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/algo/detection/yolox.py", line 99, in export
return super().export(output_dir, base_name, export_format, precision, to_exportable_code)
File "/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/core/model/base.py", line 647, in export
return self._exporter.export(
File "/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/core/exporter/base.py", line 108, in export
return self.to_openvino(model, output_dir, base_model_name, precision)
File "/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/core/exporter/native.py", line 80, in to_openvino
exported_model = openvino.convert_model(
File "/mnt/d/Projects/venv/lib/python3.10/site-packages/openvino/tools/ovc/convert.py", line 100, in convert_model
ov_model, _ = _convert(cli_parser, params, True)
File "/mnt/d/Projects/venv/lib/python3.10/site-packages/openvino/tools/ovc/convert_impl.py", line 535, in _convert
raise e
File "/mnt/d/Projects/venv/lib/python3.10/site-packages/openvino/tools/ovc/convert_impl.py", line 477, in _convert
ov_model = driver(argv, {"conversion_parameters": non_default_params})
File "/mnt/d/Projects/venv/lib/python3.10/site-packages/openvino/tools/ovc/convert_impl.py", line 228, in driver
ov_model = moc_emit_ir(prepare_ir(argv), argv)
File "/mnt/d/Projects/venv/lib/python3.10/site-packages/openvino/tools/ovc/convert_impl.py", line 177, in prepare_ir
ov_model = moc_pipeline(argv, moc_front_end)
File "/mnt/d/Projects/venv/lib/python3.10/site-packages/openvino/tools/ovc/moc_frontend/pipeline.py", line 244, in moc_pipeline
ov_model = moc_front_end.convert(input_model)
File "/mnt/d/Projects/venv/lib/python3.10/site-packages/openvino/frontend/frontend.py", line 18, in convert
converted_model = super().convert(model)
openvino._pyopenvino.GeneralFailure: Check 'error_message.empty()' failed at src/frontends/onnx/frontend/src/frontend.cpp:122:
FrontEnd API failed with GeneralFailure:
Errors during ONNX translation:
While validating ONNX node '<Node(Where): /Where_2>':
Check 'element::Type::merge(result_et, get_input_element_type(1), get_input_element_type(2))' failed at src/core/src/op/select.cpp:68:
While validating node 'opset1::Select Select_2595 (opset1::Equal /Equal[0]:boolean[1,..200], opset1::Tile /Tile_1[0]:i64[1,..200], opset1::Constant /Constant_112[0]:f32[1]) -> (dynamic[...])' with friendly_name 'Select_2595':
Argument 1 and 2 element types must match.
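
To make the mismatch easier to see, here is a minimal sketch that pulls the operand types out of the validation message above. It only parses the text of the log line; the regular expression and the helper name `select_operands` are illustrative assumptions, not part of OTX or OpenVINO.

```python
import re

# Error line copied from the export log above.
ERROR_LINE = (
    "While validating node 'opset1::Select Select_2595 "
    "(opset1::Equal /Equal[0]:boolean[1,..200], "
    "opset1::Tile /Tile_1[0]:i64[1,..200], "
    "opset1::Constant /Constant_112[0]:f32[1]) -> (dynamic[...])' "
    "with friendly_name 'Select_2595':"
)

# Matches one operand such as "opset1::Tile /Tile_1[0]:i64[1,..200]" and
# captures the producing op, the output name, and the element type.
OPERAND_RE = re.compile(r"opset\d+::(\w+) (\S+?)\[\d+\]:(\w+)\[")


def select_operands(line):
    """Return (op, name, element_type) for each input of the failing Select."""
    # Only look inside the parenthesised input list, not at the node itself.
    inputs = line[line.index("(") + 1 : line.index(") ->")]
    return OPERAND_RE.findall(inputs)


operands = select_operands(ERROR_LINE)
for role, (op, name, etype) in zip(("condition", "then", "else"), operands):
    print(f"{role:9} {op:8} {name:14} {etype}")

# Arguments 1 and 2 are the two value branches of the Select.
then_type, else_type = operands[1][2], operands[2][2]
print("branch types match:", then_type == else_type)
```

The two value branches come out as `i64` and `f32`, which is the pair the "Argument 1 and 2 element types must match" check rejects. The earlier ONNX shape-inference warning about `/Where_2` points at the same node.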

Environment:

  • OS: Windows11, WSL2 (Ubuntu 22.04)
  • Framework version:
    torch==2.1.0.post2+cxx11.abi
    intel-extension-for-pytorch==2.1.30+xpu
  • Python version: 3.10
  • OpenVINO version:
    openvino==2024.0.0
    openvino-dev==2024.0.0
  • OTX version: 2.2.0
  • GPU model and memory: ARC 750 8GB

python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"

2.1.0.post2+cxx11.abi
2.1.30+xpu
[0]: _DeviceProperties(name='Intel(R) Graphics [0x56a1]', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=0, total_memory=7934MB, max_compute_units=448, gpu_eu_count=448)

hwinfo --display

07: PCI 4bfb0000.0: 0302 3D controller
[Created at pci.386]
Unique ID: +JEX.TMx8hlOLi40
SysFS ID: /devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/035448d6-4bfb-4b22-bbe5-9a3bb13c8f15/pci4bfb:00/4bfb:00:00.0
SysFS BusID: 4bfb:00:00.0
Hardware Class: graphics card
Model: "Microsoft 3D controller"
Vendor: pci 0x1414 "Microsoft Corporation"
Device: pci 0x008e
Driver: "dxgkrnl"
Driver Modules: "dxgkrnl", "dxgkrnl"
Module Alias: "pci:v00001414d0000008Esv00000000sd00000000bc03sc02i00"
Config Status: cfg=new, avail=yes, need=no, active=unknown

clinfo -l

Platform #0: Intel(R) FPGA Emulation Platform for OpenCL(TM)
-- Device #0: Intel(R) FPGA Emulation Device
Platform #1: Intel(R) OpenCL
-- Device #0: Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz
Platform #2: Intel(R) OpenCL Graphics
-- Device #0: Intel(R) Graphics [0x56a1]

@harimkang
Contributor

@sovrasov Who is it appropriate to assign this to? (ARC GPU issue)

@ip2016
Author

ip2016 commented Jul 30, 2024

> @sovrasov Who is it appropriate to assign this to? (ARC GPU issue)

I'm not sure that this is an ARC GPU-specific issue. I'm observing the same error with CPU training/validation/export.

@sovrasov
Contributor

> > @sovrasov Who is it appropriate to assign this to? (ARC GPU issue)
>
> I'm not sure that this is an ARC GPU-specific issue. I'm observing the same error with CPU training/validation/export.

You're right, it's ARC-specific. otx[xpu] installs a patched torch + IPEX, which sometimes messes up output types. Currently, the workaround is to run the export in a CPU or CUDA environment (i.e. use upstream torch).
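
Before exporting, it may help to confirm which kind of environment is active. The sketch below is only a heuristic under one assumption: that an importable `intel_extension_for_pytorch` module means this is the otx[xpu] environment with the patched torch. The helper name `describe_export_env` is hypothetical.

```python
import importlib.util
from importlib import metadata


def describe_export_env():
    """Report whether IPEX is importable and which torch build is installed."""
    # find_spec returns None when the top-level module cannot be found.
    ipex_installed = (
        importlib.util.find_spec("intel_extension_for_pytorch") is not None
    )
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        torch_version = None  # torch is not installed in this environment
    return {"ipex_installed": ipex_installed, "torch_version": torch_version}


env = describe_export_env()
print(env)
if env["ipex_installed"]:
    # Assumption: IPEX present implies the patched torch from otx[xpu].
    print("IPEX found: consider running the export in another environment.")
else:
    print("IPEX not found: this environment is not the otx[xpu] setup.")
```

In the reporter's setup this would show IPEX as installed with torch 2.1.0.post2+cxx11.abi, so the export would be moved to a separate environment.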

@ip2016
Author

ip2016 commented Jul 30, 2024

> > > @sovrasov Who is it appropriate to assign this to? (ARC GPU issue)
> >
> > I'm not sure that this is an ARC GPU-specific issue. I'm observing the same error with CPU training/validation/export.
>
> You're right, it's ARC-specific. otx[xpu] installs a patched torch + IPEX, which sometimes messes up output types. Currently, the workaround is to run the export in a CPU or CUDA environment (i.e. use upstream torch).

Thanks. I'll try it out.

@ip2016
Author

ip2016 commented Jul 30, 2024

Update: I get a different error when trying to train on CPU with the otx[base] package:

RuntimeError: "nms_kernel" not implemented for 'BFloat16'

@sovrasov
Contributor

> Update: I get a different error when trying to train on CPU with the otx[base] package:
>
> RuntimeError: "nms_kernel" not implemented for 'BFloat16'

Training with upstream torch is not required: the checkpoint trained on ARC with IPEX should work in upstream torch as well.
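
In practice that means reusing the existing run directory and only repeating the export step. A minimal sketch, assuming the run directory keeps the `configs.yaml` and `last.ckpt` layout shown in the original report; `build_export_command` is a hypothetical helper, and the flags are the ones from the reporter's own `otx export` call.

```python
import shlex
from pathlib import Path


def build_export_command(run_dir, data_root):
    """Build the `otx export` argv for a finished training run directory."""
    run_dir = Path(run_dir)
    return [
        "otx", "export",
        "--config", (run_dir / "configs.yaml").as_posix(),
        "--data_root", Path(data_root).as_posix(),
        "--checkpoint", (run_dir / "last.ckpt").as_posix(),
    ]


# Paths taken from the original report.
cmd = build_export_command("yolox-model/20240726_144135", "Datasets/my-dataset")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from an environment that has upstream torch installed, with the ARC-trained checkpoint copied over unchanged.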
