
Issue: Performance degradation with int8 quantization in multi-batch scenarios #218

kakarotzzz opened this issue Jan 14, 2025 · 2 comments


@kakarotzzz

When using int8 weight-only quantization, there is a significant performance drop in multi-batch inference compared to single-batch inference: single-batch performance is good, but throughput does not scale well as the batch size increases.

import torch
import torch.nn.functional as F


class WeightOnlyInt8Linear(torch.nn.Module):
    __constants__ = ['in_features', 'out_features']
    in_features: int
    out_features: int
    weight: torch.Tensor

    def __init__(self, in_features: int, out_features: int, bias: bool = True,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # int8 weights plus per-output-channel bf16 scales (weight-only quantization)
        self.register_buffer("weight", torch.empty((out_features, in_features), dtype=torch.int8))
        self.register_buffer("scales", torch.ones(out_features, dtype=torch.bfloat16))

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        # dequantize on the fly: cast the int8 weight to the activation dtype, matmul, then rescale
        return F.linear(input, self.weight.to(dtype=input.dtype)) * self.scales

Current Behavior

  1. The explicit .to(dtype=input.dtype) creates a separate type-conversion kernel.
  2. In the single-batch case, Inductor can successfully fuse this conversion with the GEMM.
  3. In the multi-batch case, the fusion fails and we get:
    • one kernel for the int8->fp16 conversion,
    • another kernel for the GEMM computation,
    • which leads to extra memory traffic and lower performance (see the profiling sketch below).
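A quick way to confirm the kernel split is to profile one compiled call at each batch size and inspect the launched CUDA kernels. A minimal sketch (the 4096x4096 shape is a placeholder, and it times only the quantized linear in isolation, not the full model):

import torch
from torch.profiler import profile, ProfilerActivity

lin = WeightOnlyInt8Linear(4096, 4096).cuda()  # placeholder shape
compiled = torch.compile(lin)

for bs in (1, 2):
    x = torch.randn(bs, 4096, dtype=torch.bfloat16, device="cuda")
    compiled(x)  # warm-up: trigger compilation for this shape
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        compiled(x)
    print(f"batch_size={bs}")
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
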
@cpuhrsch
Contributor

cpuhrsch commented Jan 15, 2025

@kakarotzzz You might be able to fuse this using torch._inductor.config.use_mixed_mm = True depending on the PyTorch version you are using. On that note, which version of PyTorch are you using?
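For reference, a minimal sketch of applying that flag (placeholder shape; the flag has to be set before torch.compile traces the model, and mode="max-autotune" is optional):

import torch

torch._inductor.config.use_mixed_mm = True  # ask Inductor to try a fused mixed-dtype matmul

lin = WeightOnlyInt8Linear(4096, 4096).cuda()  # placeholder shape
compiled = torch.compile(lin, mode="max-autotune")

x = torch.randn(2, 4096, dtype=torch.bfloat16, device="cuda")
out = compiled(x)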

@kakarotzzz
Author

kakarotzzz commented Jan 15, 2025

I'm using PyTorch 2.5.0 and have enabled these optimization configurations:

torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.triton.unique_kernel_names = True
torch._inductor.config.fx_graph_cache = True 
torch._inductor.config.use_mixed_mm = True

Adding use_mixed_mm = True didn't bring any performance improvement, and performance still degrades significantly even with batch_size = 2, testing on an RTX 3090.
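A rough way to reproduce that comparison (placeholder 4096x4096 shape; this times only the compiled quantized linear rather than the full model) is CUDA-event timing at a few batch sizes:

import torch

def bench_ms(fn, x, iters=100):
    # average milliseconds per call for an already-warmed-up callable
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

lin = torch.compile(WeightOnlyInt8Linear(4096, 4096).cuda())  # placeholder shape
for bs in (1, 2, 4):
    x = torch.randn(bs, 4096, dtype=torch.bfloat16, device="cuda")
    lin(x)  # warm up / compile for this batch size
    print(f"batch_size={bs}: {bench_ms(lin, x):.3f} ms/iter")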
