Export error after training #3770
Comments
@sovrasov Who would be the appropriate person to assign this to? (ARC GPU issue)
I'm not sure that this is an ARC GPU-specific issue. I'm observing the same error with CPU training/validation/export.
You're right, it's ARC-specific. otx[xpu] installs a patched torch + IPEX, which sometimes messes up output types. For now, the workaround is to run the export in a CPU or CUDA environment (i.e., with upstream torch).
Thanks. I'll try it out.
Update: I get a different error when trying to train on CPU with the otx[base] package: RuntimeError: "nms_kernel" not implemented for 'BFloat16'
Training with upstream torch is not required: the checkpoint trained on ARC with IPEX should work in upstream torch as well.
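A minimal sketch of the workaround discussed above: reuse the checkpoint trained on ARC and re-run only the export step from an environment that has upstream torch installed. The paths are the ones from the reproduction steps; the `export_model` helper and the `DRY_RUN` variable are hypothetical names introduced here for illustration, not part of otx. By default the helper only prints the command so it can be checked before running.

```shell
# Sketch: re-run only `otx export` for an existing run directory.
# Assumption: this shell is the CPU or CUDA environment (upstream torch),
# not the otx[xpu] environment that was used for training.
export_model() {
  local run_dir="$1" data_root="$2"
  # Same arguments as the export command in the reproduction steps.
  set -- otx export \
    --config "$run_dir/configs.yaml" \
    --data_root "$data_root" \
    --checkpoint "$run_dir/last.ckpt"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run (default): print the command instead of executing it.
    echo "$*"
  else
    "$@"
  fi
}

export_model "yolox-model/20240726_144135" "Datasets/my-dataset"
```

Setting `DRY_RUN=0` before the call runs the export for real; training does not need to be repeated.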
I'm trying to train the yolox_tiny model on my image dataset with a single additional category. Training and testing complete successfully, but exporting fails with the error "Argument 1 and 2 element types must match." I'm using the otx[xpu] extension and an ARC 750 GPU for training.
Steps to Reproduce
otx train --config recipe/detection/yolox_tiny.yaml --data_root Datasets/my-dataset --work_dir yolox-model
otx test --config yolox-model/20240726_144135/configs.yaml --data_root Datasets/my-dataset --checkpoint yolox-model/20240726_144135/last.ckpt
otx export --config yolox-model/20240726_144135/configs.yaml --data_root Datasets/my-dataset --checkpoint yolox-model/20240726_144135/last.ckpt
Environment:
torch==2.1.0.post2+cxx11.abi
intel-extension-for-pytorch==2.1.30+xpu
openvino==2024.0.0
openvino-dev==2024.0.0
python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"
hwinfo --display
clinfo -l