
[GPU] Relax UnsqueezeBroadcastReshapeSDPAFusion #27515

Conversation

ceciliapeng2011
Contributor

@ceciliapeng2011 ceciliapeng2011 commented Nov 12, 2024

Details:

  • By relaxing UnsqueezeBroadcastReshapeSDPAFusion, the GQA pattern is enabled and the overhead of the Broadcast nodes in the key and value paths is removed, which significantly improves the performance of the GLM4 model (see the sketch below).
  • Fix for GLM4V, whose initial state shape (-1, 0, 0, 0) made shape inference fail.
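For context, here is a minimal sketch of the key/value subgraph that this fusion matches, using the GQA geometry visible in the GLM4 error log below (16 query heads, 4 KV heads, head size 128); the builder code is illustrative only, not the plugin's actual graph:

```cpp
#include <memory>
#include <vector>
#include <openvino/openvino.hpp>
#include <openvino/op/ops.hpp>

// Key/value path of a GQA attention block before the fusion:
// [B, 4, S, 128] -Unsqueeze-> [B, 4, 1, S, 128] -Broadcast-> [B, 4, 4, S, 128]
// -Reshape-> [B, 16, S, 128], i.e. 4 KV heads replicated to match 16 Q heads.
std::shared_ptr<ov::Node> make_gqa_kv_broadcast_chain() {
    auto kv = std::make_shared<ov::op::v0::Parameter>(
        ov::element::f16, ov::PartialShape{-1, 4, -1, 128});

    auto axes = ov::op::v0::Constant::create(ov::element::i64, ov::Shape{1}, {2});
    auto unsqueeze = std::make_shared<ov::op::v0::Unsqueeze>(kv, axes);

    auto target = ov::op::v0::Constant::create(ov::element::i64, ov::Shape{5},
                                               std::vector<int64_t>{1, 4, 4, 1, 1});
    auto broadcast = std::make_shared<ov::op::v3::Broadcast>(
        unsqueeze, target, ov::op::BroadcastType::BIDIRECTIONAL);

    auto shape = ov::op::v0::Constant::create(ov::element::i64, ov::Shape{4},
                                              std::vector<int64_t>{0, 16, -1, 128});
    // The relaxed fusion folds this whole chain into the SDPA node, so the GPU
    // plugin consumes K/V directly as [B, 4, S, 128] with no materialized copy.
    return std::make_shared<ov::op::v1::Reshape>(broadcast, shape, true);
}
```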

Tickets:

  • CVS-157263

@ceciliapeng2011 ceciliapeng2011 requested review from a team as code owners November 12, 2024 08:08
@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Nov 12, 2024
@TianmengChen

Hi Cecilia,
We tried to run glm4v with this PR, but got this error log:

Traceback (most recent call last):
  File "C:\chen\zhipu\glm4v\test_v_ov.py", line 20, in <module>
    model = OvGLM4v(MODEL_PATH, "GPU")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\chen\zhipu\glm4v\glm4v_helper.py", line 407, in __init__
    compiled_model = core.compile_model(self.model, device)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\chen\genai\Lib\site-packages\openvino\runtime\ie_api.py", line 543, in compile_model
    super().compile_model(model, device_name, {} if config is None else config),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Exception from src\inference\src\cpp\core.cpp:107:
Exception from src\inference\src\dev\plugin.cpp:53:
Exception from src\core\src\pass\graph_rewrite.cpp:295:
[MATCHER] UnsqueezeBroadcastReshapeSDPAFusionnode: gpu_opset::SDPA __module.model.encoder.layers.0.self_attention.core_attention/aten::scaled_dot_product_attention/ScaledDotProductAttention (opset1::Reshape aten::flatten/Reshape[0]:f16[?,16,?,128], opset1::Reshape __module.model.encoder.layers.0.self_attention/aten::view/Reshape_2[0]:f16[?,16,?,128], opset1::Reshape __module.model.encoder.layers.0.self_attention/aten::view/Reshape_4[0]:f16[?,16,?,128], opset1::Select __module.model.encoder.layers.0.self_attention.core_attention/aten::masked_fill_/Select[0]:f16[?,1,?,?]) -> (f16[?,16,?,128]) callback has thrown: Check 'value_input_correctness' failed at src\core\shape_inference\include\scaled_dot_product_attention_shape_inference.hpp:68:
While validating node 'gpu_opset::SDPA SDPA_58065 (opset1::Reshape aten::flatten/Reshape[0]:f16[?,16,?,128], gpu_opset::KVCache __module.model.encoder.layers.0.self_attention/aten::cat/Concat[0]:f16[?,?,?,?], gpu_opset::KVCache __module.model.encoder.layers.0.self_attention/aten::cat/Concat_1[0]:f16[?,4,?,128], opset1::Select __module.model.encoder.layers.0.self_attention.core_attention/aten::masked_fill_/Select[0]:f16[?,1,?,?]) -> ()' with friendly_name 'SDPA_58065':
Shape inference input shapes {[?,16,?,128],[?,?,?,?],[?,4,?,128]}
Value input shape not compatible with other inputs.

Could you have a look? Thank you.

@@ -42,7 +38,7 @@ UnsqueezeBroadcastReshapeSDPAFusion::UnsqueezeBroadcastReshapeSDPAFusion() {
         return rank_equals(4)(output) && consumers_count(1);
     };
 
-    auto input_a_m = any_input(not_reshape);
+    auto input_a_m = any_input();
Contributor Author

@ceciliapeng2011 ceciliapeng2011 Nov 21, 2024


@sshlyapn May I know why "not_reshape" was required here previously? Would there be any problem if I remove this check?

Contributor

Originally it was copied from the UnsqueezeBroadcastReshapeMatmulFusion transformation, but it seems okay to me to relax this.
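For reference, the dropped predicate most likely looked like the sketch below; the lambda body is reconstructed from the pattern-matcher usage shown in the diff and should be treated as an assumption, not a quote of the source:

```cpp
#include <memory>
#include <openvino/op/reshape.hpp>
#include <openvino/pass/pattern/op/label.hpp>

void sketch_relaxed_matcher() {
    using namespace ov::pass::pattern;

    // Assumed shape of the removed check: accept an output only when its
    // producer is NOT an opset1::Reshape.
    auto not_reshape = [](const ov::Output<ov::Node>& output) -> bool {
        return std::dynamic_pointer_cast<ov::op::v1::Reshape>(
                   output.get_node_shared_ptr()) == nullptr;
    };

    // Before: a query produced by a Reshape could never match the fusion.
    auto input_a_old = any_input(not_reshape);
    // After: any producer is accepted, so Reshape-fed query inputs
    // (common in GQA models such as GLM4) can trigger the fusion too.
    auto input_a_new = any_input();
}
```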

Contributor Author
> Seems that you allowed the query input's reshape, and it seems we need to check whether both sdpa_opt and sdpa_micro support a dynamically padded query input.
> E.g., fused QKV GEMM => VariadicSplit (crop + optimized out) => Reshape (optimized out) => SDPA query input.
> I'm not quickly sure which model contains such a pattern.
> Maybe you can just create a functional test that has the above pattern, and then check that the values are correct.

@yeonbok This relaxation is a GQA pattern optimization that removes the Broadcast nodes from the key and value input paths. The SDPA GPU node was already in the exec graph before this optimization, so the correctness of the special case you mentioned should already have been assured. A sketch of such a functional test follows.
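Following up on the functional-test suggestion above, a hedged sketch of a model with exactly that pattern: a fused QKV GEMM split by VariadicSplit, a Reshape feeding the SDPA query input, and the Unsqueeze/Broadcast/Reshape chain on K/V so the fusion can trigger. All sizes and helper names are illustrative assumptions:

```cpp
#include <memory>
#include <vector>
#include <openvino/openvino.hpp>
#include <openvino/op/ops.hpp>

// Toy model: fused QKV GEMM -> VariadicSplit -> Reshape -> SDPA query input.
// Geometry mirrors the GLM4 logs: 16 Q heads, 4 KV heads, head size 128.
std::shared_ptr<ov::Model> make_qkv_split_sdpa_model() {
    const int64_t hidden = 2048, q_w = 16 * 128, kv_w = 4 * 128;

    auto x = std::make_shared<ov::op::v0::Parameter>(
        ov::element::f32, ov::PartialShape{-1, -1, hidden});
    auto w = ov::op::v0::Constant::create(
        ov::element::f32, ov::Shape{size_t(hidden), size_t(q_w + 2 * kv_w)},
        std::vector<float>(size_t(hidden) * (q_w + 2 * kv_w), 0.01f));
    auto qkv = std::make_shared<ov::op::v0::MatMul>(x, w);

    auto axis = ov::op::v0::Constant::create(ov::element::i64, ov::Shape{}, {-1});
    auto lens = ov::op::v0::Constant::create(
        ov::element::i64, ov::Shape{3}, std::vector<int64_t>{q_w, kv_w, kv_w});
    auto split = std::make_shared<ov::op::v1::VariadicSplit>(qkv, axis, lens);

    // [B, S, H*128] -> [B, S, H, 128] -> [B, H, S, 128]
    auto to_bhsd = [](const ov::Output<ov::Node>& t, int64_t heads) {
        auto sh = ov::op::v0::Constant::create(
            ov::element::i64, ov::Shape{4}, std::vector<int64_t>{0, 0, heads, 128});
        auto r = std::make_shared<ov::op::v1::Reshape>(t, sh, true);
        auto perm = ov::op::v0::Constant::create(
            ov::element::i64, ov::Shape{4}, std::vector<int64_t>{0, 2, 1, 3});
        return std::make_shared<ov::op::v1::Transpose>(r, perm)->output(0);
    };
    // Unsqueeze -> Broadcast -> Reshape: replicate 4 KV heads to 16.
    auto expand_kv = [](const ov::Output<ov::Node>& kv) {
        auto ax = ov::op::v0::Constant::create(ov::element::i64, ov::Shape{1}, {2});
        auto u = std::make_shared<ov::op::v0::Unsqueeze>(kv, ax);
        auto tgt = ov::op::v0::Constant::create(
            ov::element::i64, ov::Shape{5}, std::vector<int64_t>{1, 4, 4, 1, 1});
        auto b = std::make_shared<ov::op::v3::Broadcast>(
            u, tgt, ov::op::BroadcastType::BIDIRECTIONAL);
        auto sh = ov::op::v0::Constant::create(
            ov::element::i64, ov::Shape{4}, std::vector<int64_t>{0, 16, -1, 128});
        return std::make_shared<ov::op::v1::Reshape>(b, sh, true)->output(0);
    };

    auto q = to_bhsd(split->output(0), 16);            // Reshape feeds the query
    auto k = expand_kv(to_bhsd(split->output(1), 4));  // broadcast chains that the
    auto v = expand_kv(to_bhsd(split->output(2), 4));  // fusion should fold into SDPA

    auto sdpa = std::make_shared<ov::op::v13::ScaledDotProductAttention>(q, k, v, true);
    return std::make_shared<ov::Model>(ov::OutputVector{sdpa->output(0)},
                                       ov::ParameterVector{x});
}
```

Compiling this model on GPU and comparing its outputs against a reference device would then confirm that the relaxed matcher handles a dynamically padded query input correctly.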

// Sometimes input0's shape has zeros (or even dynamic dims) in several dimensions; for
// example, when concatenating [-1, 0, 0, 0] + [-1, 4, -1, 128] along axis 2, we could (and
// should) infer the dim values of axes 1 and 3 (4 and 128) in this case.
for (int64_t i = 0; i < static_cast<int64_t>(out_shapes[0].size()); ++i) {
Contributor
I think we can just initialize out_shapes[0] with input_shapes[1] instead of input_shapes[0], since input_shapes[1] is always the "new_token" input and input_shapes[0] is the "past" state. The [1] shape should always be more detailed.

Contributor Author

done
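For illustration, a hedged sketch of the shape-merging idea settled on in this thread; the function name and structure are hypothetical, not the actual KVCache shape-inference code:

```cpp
#include <cstddef>
#include <openvino/core/partial_shape.hpp>

// "past" may carry zeros or fully dynamic dims (e.g. GLM4V's (-1, 0, 0, 0));
// "new_token" is usually the more detailed shape, so start the output from it.
ov::PartialShape infer_kv_cache_shape(const ov::PartialShape& past,
                                      const ov::PartialShape& new_token,
                                      std::size_t concat_axis) {
    ov::PartialShape out = new_token;  // initialize from input_shapes[1]
    // Only the concat axis accumulates: past length + new-token length.
    out[concat_axis] = past[concat_axis] + new_token[concat_axis];
    return out;
}

// infer_kv_cache_shape({-1, 0, 0, 0}, {-1, 4, -1, 128}, 2) yields {-1, 4, -1, 128}:
// axes 1 and 3 are recovered from new_token, and the concat axis stays dynamic.
```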

@sshlyapn sshlyapn changed the title relax UnsqueezeBroadcastReshapeSDPAFusion with no need to ask querry … [GPU] Relax UnsqueezeBroadcastReshapeSDPAFusion Nov 22, 2024
@p-durandin p-durandin added this pull request to the merge queue Nov 22, 2024
Merged via the queue into openvinotoolkit:master with commit c801f4e Nov 22, 2024
154 checks passed
NishantPrabhuFujitsu pushed a commit to NishantPrabhuFujitsu/openvino that referenced this pull request Nov 26, 2024
### Details:
- By relaxing UnsqueezeBroadcastReshapeSDPAFusion, the GQA pattern is
enabled and the overhead of the Broadcast nodes in the key and value
paths is removed, which significantly improves the performance of the
GLM4 model.
- Fix for GLM4V, whose initial state shape (-1, 0, 0, 0) made shape
inference fail.

### Tickets:
 - *CVS-157263*

---------

Co-authored-by: Chen Peter <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Nov 26, 2024