Commit
[CPU] enable MLP & QKV on non-PA case and minor fixes (openvinotoolkit#26103)

### Details:
- Enable the MLP & QKV optimization in all cases (the previous PR enabled it only for the PageAttention case). This brings ~30% first-token latency reduction without any regression in second-token latency.
- Relax the restriction on the K dimension size from an integer multiple of 256 down to 32, allowing more LLMs (such as Qwen1.5) to benefit from this optimization.
- Reduce the total memory footprint by converting fp16 weights directly into bf16 (previously the node required bf16 weights, which duplicated the weight memory in bf16 format since most LLMs ship fp16 weights). A rough sketch of such a direct conversion is shown below.
- Allocating many small sub-weight tensors introduces significant page-fault overhead; allocating one big weight tensor reduces it considerably thanks to huge pages.
- Prepack the weight tensor into AMX B-tile format directly from the gate & up tensors, without an intermediate combination step, which reduces first-inference latency.
- Fix scratch-buffer bugs from the previous PR; scratch buffers are now truly shared among layers.

### Tickets:
- *ticket-id*
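For illustration only, a minimal standalone sketch of converting fp16 weights directly into bf16 in a single pass, so no separate plain-bf16 copy of the weights has to be kept alive alongside the fp16 original. This is not the actual plugin code; the helper names (`fp16_to_f32`, `f32_to_bf16`, `convert_fp16_to_bf16`) are hypothetical, and the real implementation writes into the prepacked (AMX B-tile laid out) buffer and is vectorized.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Decode an IEEE-754 binary16 value into a float.
static float fp16_to_f32(uint16_t h) {
    uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        if (man == 0) {
            bits = sign;                       // signed zero
        } else {                               // subnormal: normalize it
            exp = 127 - 15 + 1;
            while ((man & 0x400u) == 0) { man <<= 1; --exp; }
            man &= 0x3FFu;
            bits = sign | (exp << 23) | (man << 13);
        }
    } else if (exp == 0x1Fu) {                 // inf / NaN
        bits = sign | 0x7F800000u | (man << 13);
    } else {                                   // normal number
        bits = sign | ((exp - 15 + 127) << 23) | (man << 13);
    }
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Round a float to bf16 (round-to-nearest-even on the upper 16 bits;
// NaN handling omitted for brevity).
static uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint32_t lsb = (bits >> 16) & 1u;
    bits += 0x7FFFu + lsb;
    return static_cast<uint16_t>(bits >> 16);
}

// One-pass fp16 -> bf16 conversion; the destination can be the prepacked
// weight buffer, so no intermediate plain bf16 tensor is materialized.
void convert_fp16_to_bf16(const uint16_t* src_fp16, uint16_t* dst_bf16, size_t count) {
    for (size_t i = 0; i < count; ++i)
        dst_bf16[i] = f32_to_bf16(fp16_to_f32(src_fp16[i]));
}
```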
Showing 8 changed files with 609 additions and 181 deletions.