When using int8 quantization, single-batch inference performs well, but performance does not scale with batch size: multi-batch inference shows a significant drop compared to single-batch inference.
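A minimal sketch of the kind of setup being measured (not the original model or benchmark; the hand-rolled layer, shapes, and batch sizes below are illustrative assumptions, and a CUDA device is assumed):

```python
# Illustrative sketch only: a hand-rolled int8 weight-only linear, compiled with
# torch.compile and timed at several batch sizes. Shapes, batch sizes, and the
# module itself are assumptions for demonstration.
import time
import torch
import torch.nn.functional as F

class Int8WeightOnlyLinear(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        w = torch.randn(out_features, in_features)
        scale = w.abs().amax(dim=1, keepdim=True) / 127.0
        self.register_buffer("weight", torch.round(w / scale).to(torch.int8))
        self.register_buffer("scale", scale)

    def forward(self, x):
        # Dequantize the int8 weight to the activation dtype, then matmul.
        return F.linear(x, self.weight.to(dtype=x.dtype) * self.scale.to(dtype=x.dtype))

@torch.no_grad()
def bench(model, x, iters=50):
    for _ in range(10):          # warmup (also triggers compilation)
        model(x)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3  # ms per call

if __name__ == "__main__":
    layer = Int8WeightOnlyLinear(4096, 4096).cuda().half()
    layer = torch.compile(layer)
    for bs in (1, 8, 32):
        x = torch.randn(bs, 4096, device="cuda", dtype=torch.float16)
        print(f"batch={bs}: {bench(layer, x):.3f} ms")
```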
@kakarotzzz You might be able to fuse this using torch._inductor.config.use_mixed_mm = True, depending on the PyTorch version you are using. On that note, which version of PyTorch are you using?
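For reference, that flag is a one-line inductor config change set before compiling. Note that use_mixed_mm only exists in certain PyTorch 2.x releases (it was later removed), so this is a sketch rather than a guaranteed fix:

```python
# Sketch: ask inductor to fuse the int8 -> float cast into the matmul via the
# mixed-dtype mm path. use_mixed_mm exists only in some PyTorch 2.x releases,
# so guard the assignment; on versions without it this is a no-op.
import torch
import torch._inductor.config as inductor_config

if hasattr(inductor_config, "use_mixed_mm"):
    inductor_config.use_mixed_mm = True

# Compile the quantized model as usual after setting the flag, e.g.:
# model = torch.compile(model)
```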
Current Behavior
The .to(dtype=input.dtype) cast creates a separate type conversion kernel instead of being fused into the matmul.
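One way to check this (a debugging sketch, assuming a recent PyTorch 2.x with Triton on a CUDA device; shapes and names are illustrative) is to dump inductor's generated code and look for a standalone cast kernel next to the matmul:

```python
# Debugging sketch: compile a tiny function containing the int8 -> activation-dtype
# cast and inspect inductor's generated code. Run with:
#   TORCH_LOGS="output_code" python this_script.py
import torch

def int8_weight_mm(x, w_int8, scale):
    # The cast below mirrors the .to(dtype=input.dtype) step from the issue.
    return x @ (w_int8.to(dtype=x.dtype) * scale)

x = torch.randn(32, 4096, device="cuda", dtype=torch.float16)
w_int8 = torch.randint(-128, 127, (4096, 4096), device="cuda", dtype=torch.int8)
scale = torch.rand(4096, device="cuda", dtype=torch.float16)

compiled = torch.compile(int8_weight_mm)
compiled(x, w_int8, scale)
# In the logged output, an unfused cast typically appears as its own elementwise
# kernel over the whole weight tensor, executed before the mm/matmul call.
```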