
CUDA errors when using models that have been imported from HF and trained with SentenceTransformers #324

Open
RobertHua96 opened this issue Jul 26, 2020 · 6 comments

Comments


RobertHua96 commented Jul 26, 2020

Hi,

Expected behaviour: When I create a SentenceTransformer model by importing a HF model and fine-tuning it with the NLI code example, it should encode text without errors.

Actual behaviour: CUDA errors occur when trying to embed text.
[screenshot of the CUDA error raised during encoding]

The pretrained models from the SentenceTransformers package are able to embed this text without errors.

How the model was initialised:

[screenshot of the model initialisation code]

This error still occurs even for models trained from scratch without layer freezing.

Could someone let me know what could be going wrong?
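Because CUDA kernels launch asynchronously, the Python stack trace for errors like this often points at an unrelated call. A common first debugging step (not mentioned in the thread, but documented PyTorch/CUDA behaviour) is to force synchronous kernel launches so the reported location matches the kernel that actually failed. A minimal sketch:

```python
import os

# Must be set BEFORE torch initialises CUDA (i.e. before `import torch`),
# otherwise it has no effect on the already-created CUDA context.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # then re-run the failing encode()/fit() call;
# the traceback should now point at the actual failing operation.
```

Running the failing script this way usually turns an opaque cuBLAS error into a traceback at the real offending op, which helps distinguish a shape/index bug from a broken CUDA install.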


yuwon commented Jul 28, 2020

I have the same issue.

As mentioned in https://github.com/UKPLab/sentence-transformers#training, I first downloaded the NLI and STS data and tried to run training_nli.py.

However, I got the following error:

Iteration:   0%|          | 33/58880 [00:05<2:57:22,  5.53it/s]
Epoch:   0%|          | 0/1 [00:05<?, ?it/s]
Traceback (most recent call last):
  File "/home/user/workspace/embedding_cluster/scripts/train_nli.py", line 73, in <module>
    model.fit(train_objectives=[(train_dataloader, train_loss)],
  File "/home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 407, in fit
    loss_value.backward()
  File "/home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/autograd/__init__.py", line 98, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)` (gemm<float> at /pytorch/aten/src/ATen/cuda/CUDABlas.cpp:165)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f4d33142536 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xf4bc97 (0x7f4d344e9c97 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x13f589d (0x7f4d3499389d in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: THCudaTensor_addmm + 0x5c (0x7f4d3499d44c in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x1041c58 (0x7f4d345dfc58 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xf65018 (0x7f4d34503018 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x10c2780 (0x7f4d70e72780 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x2c9b47e (0x7f4d72a4b47e in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x10c2780 (0x7f4d70e72780 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::Tensor::mm(at::Tensor const&) const + 0xf0 (0x7f4d70a35930 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x28e6b5c (0x7f4d72696b5c in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::generated::MmBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x151 (0x7f4d72697961 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x2d89705 (0x7f4d72b39705 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7f4d72b36a03 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7f4d72b377e2 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::thread_init(int) + 0x39 (0x7f4d72b2fe59 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7f4d7f4735f8 in /home/use/anaconda3/envs/torch_py38/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #17: <unknown function> + 0xbd6df (0x7f4d802ff6df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #18: <unknown function> + 0x76db (0x7f4d822886db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #19: clone + 0x3f (0x7f4d81fb1a3f in /lib/x86_64-linux-gnu/libc.so.6)

CUDA: 10.2
Nvidia Driver: 440.95.01
pytorch: 1.5.1
transformers: 3.0.2
sentence-transformers: 0.3.2
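A frequent cause of `CUBLAS_STATUS_EXECUTION_FAILED` is a mismatch between the CUDA version the PyTorch wheel was built against (`torch.version.cuda`) and the toolkit/driver installed on the machine. A minimal stdlib sketch of that sanity check, with the version strings above as illustrative inputs (the wheel-side value would come from `torch.version.cuda` on a real install):

```python
def cuda_versions_match(system_cuda: str, wheel_cuda: str) -> bool:
    """Compare 'major.minor' CUDA version strings, e.g. '10.2' vs '10.1'."""
    parse = lambda v: tuple(int(x) for x in v.split(".")[:2])
    return parse(system_cuda) == parse(wheel_cuda)

# Environment from the comment above; "10.2" as the wheel's CUDA version
# is an assumption for illustration.
print(cuda_versions_match("10.2", "10.2"))  # matching build
print(cuda_versions_match("10.2", "10.1"))  # mismatched build
```

In yuwon's case the versions apparently matched, so the mismatch check rules out only one class of cause; the later comments point at the PyTorch build itself.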

@nreimers
Member

Hi @yuwon,
I tested the script with CUDA 9.2 and CUDA 10.1, and it works with both. For CUDA 10.2, my installed driver is sadly too old to test.

Sadly, I don't know where the error comes from; it appears to be some issue with CUDA/PyTorch. Maybe you can try a different CUDA/PyTorch version?

Best
Nils Reimers

@braaannigan

Hi @yuwon, I had lots of problems like this. I've moved to developing in a Docker container with an official PyTorch CUDA base image and have never had problems since. Blog post on developing in Docker here: http://braaannigan.github.io/software/2020/07/26/dev_in_docker.html
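A minimal sketch of the kind of setup described, using an official `pytorch/pytorch` base image whose bundled CUDA runtime is guaranteed to match the wheel. The image tag and package version below are illustrative assumptions (taken from the versions mentioned in this thread), not from the blog post:

```dockerfile
# Illustrative sketch only: tag and versions are assumptions.
FROM pytorch/pytorch:1.5.1-cuda10.1-cudnn7-runtime

RUN pip install --no-cache-dir sentence-transformers==0.3.2

WORKDIR /workspace
COPY . /workspace
CMD ["python", "training_nli.py"]
```

The key point is that the base image pins PyTorch, CUDA runtime, and cuDNN together, so only the host driver needs to be new enough.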


yuwon commented Jul 30, 2020

Thanks @braaannigan. Yes, I've also tried PyTorch's official Docker image, but that failed as well.


olastor commented Feb 15, 2022

I also encountered the same error as @yuwon (CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm...). I have CUDA 11.6 and had torch==1.8.2+cu111 installed. After uninstalling PyTorch and installing the nightly version as suggested in allenai/allennlp#5064 (comment), it now seems to work with CUDA 11.
