
CUDA errors when using models that have been imported from HF and trained with SentenceTransformers #324

Open
RobertHua96 opened this issue Jul 26, 2020 · 6 comments

Comments


RobertHua96 commented Jul 26, 2020

Hi,

Expected behaviour: When I create a SentenceTransformer model by importing a HF model and fine-tuning it with the NLI code example, it should encode text without errors.

Actual behaviour: CUDA errors occur when trying to embed text.
[screenshot of the CUDA error raised during encoding]

The pretrained models from the SentenceTransformers package are able to embed this text without errors.

How the model was initialised:

[screenshot of the model initialisation code]

This error still occurs even for models trained from scratch without layer freezing.

Could someone let me know what could be going wrong?
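Because CUDA kernels launch asynchronously, the Python stack trace for errors like this often points at an unrelated call. A common first debugging step (not mentioned in the thread, but documented PyTorch/CUDA behaviour) is to force synchronous kernel launches so the reported location matches the kernel that actually failed. A minimal sketch:

```python
import os

# Must be set BEFORE torch initialises CUDA (i.e. before `import torch`),
# otherwise it has no effect on the already-created CUDA context.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # then re-run the failing encode()/fit() call;
# the traceback should now point at the actual failing operation.
```

Running the failing script this way usually turns an opaque cuBLAS error into a traceback at the real offending op, which helps distinguish a shape/index bug from a broken CUDA install.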


yuwon commented Jul 28, 2020

I have the same issue.

As mentioned in https://github.com/UKPLab/sentence-transformers#training, I first downloaded the NLI and STS data and tried to run training_nli.py.

However, I got the following error:

Iteration:   0%|          | 33/58880 [00:05<2:57:22,  5.53it/s]
Epoch:   0%|          | 0/1 [00:05<?, ?it/s]
Traceback (most recent call last):
  File "/home/user/workspace/embedding_cluster/scripts/train_nli.py", line 73, in <module>
    model.fit(train_objectives=[(train_dataloader, train_loss)],
  File "/home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 407, in fit
    loss_value.backward()
  File "/home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/autograd/__init__.py", line 98, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)` (gemm<float> at /pytorch/aten/src/ATen/cuda/CUDABlas.cpp:165)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f4d33142536 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xf4bc97 (0x7f4d344e9c97 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x13f589d (0x7f4d3499389d in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: THCudaTensor_addmm + 0x5c (0x7f4d3499d44c in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x1041c58 (0x7f4d345dfc58 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xf65018 (0x7f4d34503018 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x10c2780 (0x7f4d70e72780 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x2c9b47e (0x7f4d72a4b47e in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x10c2780 (0x7f4d70e72780 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::Tensor::mm(at::Tensor const&) const + 0xf0 (0x7f4d70a35930 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x28e6b5c (0x7f4d72696b5c in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::generated::MmBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x151 (0x7f4d72697961 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x2d89705 (0x7f4d72b39705 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7f4d72b36a03 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7f4d72b377e2 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::thread_init(int) + 0x39 (0x7f4d72b2fe59 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7f4d7f4735f8 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #17: <unknown function> + 0xbd6df (0x7f4d802ff6df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #18: <unknown function> + 0x76db (0x7f4d822886db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #19: clone + 0x3f (0x7f4d81fb1a3f in /lib/x86_64-linux-gnu/libc.so.6)

CUDA: 10.2
Nvidia Driver: 440.95.01
pytorch: 1.5.1
transformers: 3.0.2
sentence-transformers: 0.3.2
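A frequent cause of `CUBLAS_STATUS_EXECUTION_FAILED` is a mismatch between the CUDA version the PyTorch wheel was built against (`torch.version.cuda`) and the toolkit/driver installed on the machine. A minimal stdlib sketch of that sanity check, with the version strings above as illustrative inputs (the wheel-side value would come from `torch.version.cuda` on a real install):

```python
def cuda_versions_match(system_cuda: str, wheel_cuda: str) -> bool:
    """Compare 'major.minor' CUDA version strings, e.g. '10.2' vs '10.1'."""
    parse = lambda v: tuple(int(x) for x in v.split(".")[:2])
    return parse(system_cuda) == parse(wheel_cuda)

# Environment from the comment above; "10.2" as the wheel's CUDA version
# is an assumption for illustration.
print(cuda_versions_match("10.2", "10.2"))  # matching build
print(cuda_versions_match("10.2", "10.1"))  # mismatched build
```

In yuwon's case the versions apparently matched, so the mismatch check rules out only one class of cause; the later comments point at the PyTorch build itself.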

@nreimers
Member

Hi @yuwon,
I tested the script with CUDA 9.2 and CUDA 10.1, and it works with both. For CUDA 10.2, my installed driver is sadly too old to test.

Sadly, I don't know where the error comes from; it appears to be some issue with CUDA/PyTorch. Maybe you can try a different CUDA/PyTorch version?

Best
Nils Reimers

@braaannigan

Hi @yuwon, I had lots of problems like this. I've moved to developing in a Docker container with an official PyTorch CUDA base image and have never had problems since. Blog post on developing in Docker here: http://braaannigan.github.io/software/2020/07/26/dev_in_docker.html
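A minimal sketch of the kind of setup described, using an official `pytorch/pytorch` base image whose bundled CUDA runtime is guaranteed to match the wheel. The image tag and package version below are illustrative assumptions (taken from the versions mentioned in this thread), not from the blog post:

```dockerfile
# Illustrative sketch only: tag and versions are assumptions.
FROM pytorch/pytorch:1.5.1-cuda10.1-cudnn7-runtime

RUN pip install --no-cache-dir sentence-transformers==0.3.2

WORKDIR /workspace
COPY . /workspace
CMD ["python", "training_nli.py"]
```

The key point is that the base image pins PyTorch, CUDA runtime, and cuDNN together, so only the host driver needs to be new enough.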


yuwon commented Jul 30, 2020

Thanks @braaannigan. Yes, I've also tried PyTorch's official Docker image, but that failed as well.


olastor commented Feb 15, 2022

I also encountered the same error as @yuwon (CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm...). I have CUDA 11.6 and had torch==1.8.2+cu111 installed. After uninstalling PyTorch and installing the nightly version as suggested in allenai/allennlp#5064 (comment), it now seems to work with CUDA 11.
