LLM: 8 bit quantization occasional matrix multiplication error #145

Open
kyriediculous opened this issue Aug 6, 2024 · 1 comment
Labels: bug (Something isn't working)

@kyriediculous (Contributor)

Describe the bug

When using 8-bit quantization with the LLM pipeline on a multi-GPU setup, it mostly runs fine.

After a seemingly random number of requests, however, the pipeline starts failing and requires a restart.

More investigation is needed into why this bug occurs.

Related: bitsandbytes-foundation/bitsandbytes#162
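
For context, here is a minimal sketch of the kind of setup involved (not the actual llm_generate.py code, and the model id is a placeholder): a model loaded in 8-bit with bitsandbytes and sharded across the visible GPUs via device_map="auto".

```python
# Sketch of the failing configuration (illustrative only, not the pipeline code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",           # shards layers across all visible GPUs
    torch_dtype=torch.float16,
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```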

Full error trace:

```
Exception in thread Thread-18 (model_generate_wrapper):
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/app/app/pipelines/llm_generate.py", line 180, in model_generate_wrapper
    self.model.generate(**kwargs)
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/generation/utils.py", line 1989, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/generation/utils.py", line 2932, in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1141, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 944, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 677, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 562, in forward
    value_states = self.v_proj(hidden_states)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 817, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py", line 556, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py", line 415, in forward
    output += torch.matmul(subA, state.subB)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x1 and 5x1024)
```
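
The failing line is the fp16 "outlier" branch of LLM.int8(): subA holds the outlier columns of the current input in fp16 and state.subB the matching weight rows. The 1x1 vs 5x1024 mismatch would be consistent with the current input producing a different number of outlier columns than the cached state expects. A rough illustration of that mismatch (shapes taken from the error message, not the actual bitsandbytes code):

```python
# Illustration only: reproduce the shape error from the trace above.
import torch

sub_a = torch.randn(1, 1)            # 1 outlier column found in the current input
cached_sub_b = torch.randn(5, 1024)  # weight rows cached for 5 outlier columns

torch.matmul(sub_a, cached_sub_b)
# RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x1 and 5x1024)
```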

Reproduction steps

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behaviour

No response

Severity

Minor

Screenshots / Live demo link

(screenshot attached)

OS

Linux

Running on

Docker

AI-worker version

experimental: llm-pipeline

Additional context

This error only occurs when using 8-bit quantization. It is unclear whether other methods of lowering precision are supported that would be compatible with the model; switching to one of those might also solve the issue.
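
For what it is worth, bitsandbytes also exposes 4-bit (NF4) quantization through transformers, which goes through a different matmul path than the 8-bit LLM.int8() code that fails above. Untested here, and the model id is again a placeholder:

```python
# Untested alternative: 4-bit NF4 quantization instead of 8-bit LLM.int8().
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```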

kyriediculous added the bug (Something isn't working) label on Aug 6, 2024
@kyriediculous (Contributor, Author)

Likely caused by one GPU using an incompatible architecture (Pascal); bitsandbytes is only compatible with Volta and newer.
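
If that is the cause, one way to confirm it (a sketch, not part of the pipeline) is to print each visible GPU's compute capability with PyTorch and flag anything below 7.0 (Volta), e.g. Pascal cards at 6.x, so they can be excluded via CUDA_VISIBLE_DEVICES before loading in 8-bit:

```python
# Sketch: list each visible GPU's compute capability; pre-Volta (6.x) devices
# could then be excluded before loading the model with 8-bit quantization.
import torch

for idx in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(idx)
    name = torch.cuda.get_device_name(idx)
    status = "ok" if major >= 7 else "pre-Volta"
    print(f"GPU {idx}: {name} (sm_{major}{minor}) -> {status}")
```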
