
[GPU] Relax UnsqueezeBroadcastReshapeSDPAFusion #27515

Conversation

ceciliapeng2011
Contributor

@ceciliapeng2011 ceciliapeng2011 commented Nov 12, 2024

Details:

  • By relaxing UnsqueezeBroadcastReshapeSDPAFusion, the GQA pattern is enabled and the overhead of the Broadcast nodes in the key and value paths is removed, which significantly improves the performance of the GLM4 model (see the sketch below).
  • Fix for GLM4V, whose initial state shape (-1, 0, 0, 0) made shape inference fail.
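For context, here is a minimal sketch of the key/value subgraph that this fusion matches, using the GQA geometry visible in the GLM4 error log below (16 query heads, 4 KV heads, head size 128); the builder code is illustrative only, not the plugin's actual graph:

```cpp
#include <memory>
#include <vector>
#include <openvino/openvino.hpp>
#include <openvino/op/ops.hpp>

// Key/value path of a GQA attention block before the fusion:
// [B, 4, S, 128] -Unsqueeze-> [B, 4, 1, S, 128] -Broadcast-> [B, 4, 4, S, 128]
// -Reshape-> [B, 16, S, 128], i.e. 4 KV heads replicated to match 16 Q heads.
std::shared_ptr<ov::Node> make_gqa_kv_broadcast_chain() {
    auto kv = std::make_shared<ov::op::v0::Parameter>(
        ov::element::f16, ov::PartialShape{-1, 4, -1, 128});

    auto axes = ov::op::v0::Constant::create(ov::element::i64, ov::Shape{1}, {2});
    auto unsqueeze = std::make_shared<ov::op::v0::Unsqueeze>(kv, axes);

    auto target = ov::op::v0::Constant::create(ov::element::i64, ov::Shape{5},
                                               std::vector<int64_t>{1, 4, 4, 1, 1});
    auto broadcast = std::make_shared<ov::op::v3::Broadcast>(
        unsqueeze, target, ov::op::BroadcastType::BIDIRECTIONAL);

    auto shape = ov::op::v0::Constant::create(ov::element::i64, ov::Shape{4},
                                              std::vector<int64_t>{0, 16, -1, 128});
    // The relaxed fusion folds this whole chain into the SDPA node, so the GPU
    // plugin consumes K/V directly as [B, 4, S, 128] with no materialized copy.
    return std::make_shared<ov::op::v1::Reshape>(broadcast, shape, true);
}
```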

Tickets:

  • CVS-157263

@ceciliapeng2011 ceciliapeng2011 requested review from a team as code owners November 12, 2024 08:08
@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Nov 12, 2024
@TianmengChen

Hi Cecilia,
We tried to run glm4v with this PR, but got this error log:

Traceback (most recent call last):
  File "C:\chen\zhipu\glm4v\test_v_ov.py", line 20, in <module>
    model = OvGLM4v(MODEL_PATH, "GPU")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\chen\zhipu\glm4v\glm4v_helper.py", line 407, in __init__
    compiled_model = core.compile_model(self.model, device)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\chen\genai\Lib\site-packages\openvino\runtime\ie_api.py", line 543, in compile_model
    super().compile_model(model, device_name, {} if config is None else config),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Exception from src\inference\src\cpp\core.cpp:107:
Exception from src\inference\src\dev\plugin.cpp:53:
Exception from src\core\src\pass\graph_rewrite.cpp:295:
[MATCHER] UnsqueezeBroadcastReshapeSDPAFusionnode: gpu_opset::SDPA __module.model.encoder.layers.0.self_attention.core_attention/aten::scaled_dot_product_attention/ScaledDotProductAttention (opset1::Reshape aten::flatten/Reshape[0]:f16[?,16,?,128], opset1::Reshape __module.model.encoder.layers.0.self_attention/aten::view/Reshape_2[0]:f16[?,16,?,128], opset1::Reshape __module.model.encoder.layers.0.self_attention/aten::view/Reshape_4[0]:f16[?,16,?,128], opset1::Select __module.model.encoder.layers.0.self_attention.core_attention/aten::masked_fill_/Select[0]:f16[?,1,?,?]) -> (f16[?,16,?,128]) callback has thrown: Check 'value_input_correctness' failed at src\core\shape_inference\include\scaled_dot_product_attention_shape_inference.hpp:68:
While validating node 'gpu_opset::SDPA SDPA_58065 (opset1::Reshape aten::flatten/Reshape[0]:f16[?,16,?,128], gpu_opset::KVCache __module.model.encoder.layers.0.self_attention/aten::cat/Concat[0]:f16[?,?,?,?], gpu_opset::KVCache __module.model.encoder.layers.0.self_attention/aten::cat/Concat_1[0]:f16[?,4,?,128], opset1::Select __module.model.encoder.layers.0.self_attention.core_attention/aten::masked_fill_/Select[0]:f16[?,1,?,?]) -> ()' with friendly_name 'SDPA_58065':
Shape inference input shapes {[?,16,?,128],[?,?,?,?],[?,4,?,128]}
Value input shape not compatible with other inputs.

Could you have a look? Thank you.

@@ -42,7 +38,7 @@ UnsqueezeBroadcastReshapeSDPAFusion::UnsqueezeBroadcastReshapeSDPAFusion() {
         return rank_equals(4)(output) && consumers_count(1);
     };
 
-    auto input_a_m = any_input(not_reshape);
+    auto input_a_m = any_input();
Contributor Author

@ceciliapeng2011 ceciliapeng2011 Nov 21, 2024


@sshlyapn May I know why "not_reshape" was required here previously? Would there be any problem if I remove this check?

Contributor

Originally it was copied from the UnsqueezeBroadcastReshapeMatmulFusion transformation, but it seems okay to me to relax this.
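For reference, the dropped predicate most likely looked like the sketch below; the lambda body is reconstructed from the pattern-matcher usage shown in the diff and should be treated as an assumption, not a quote of the source:

```cpp
#include <memory>
#include <openvino/op/reshape.hpp>
#include <openvino/pass/pattern/op/label.hpp>

void sketch_relaxed_matcher() {
    using namespace ov::pass::pattern;

    // Assumed shape of the removed check: accept an output only when its
    // producer is NOT an opset1::Reshape.
    auto not_reshape = [](const ov::Output<ov::Node>& output) -> bool {
        return std::dynamic_pointer_cast<ov::op::v1::Reshape>(
                   output.get_node_shared_ptr()) == nullptr;
    };

    // Before: a query produced by a Reshape could never match the fusion.
    auto input_a_old = any_input(not_reshape);
    // After: any producer is accepted, so Reshape-fed query inputs
    // (common in GQA models such as GLM4) can trigger the fusion too.
    auto input_a_new = any_input();
}
```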

Contributor Author
> Seems that you allowed the query input's reshape, and it seems we need to check whether both sdpa_opt and sdpa_micro support a dynamically padded query input.
> E.g., fused QKV GEMM => VariadicSplit (crop + optimized out) => Reshape (optimized out) => SDPA query input.
> I'm not quickly sure which model contains such a pattern.
> Maybe you can just create a functional test that has the above pattern, and then check that the values are correct.

@yeonbok This relaxation is a GQA pattern optimization that removes the Broadcast nodes from the key and value input paths. The SDPA GPU node was already in the exec graph before this optimization, so the correctness of the special case you mentioned should already have been assured. A sketch of such a functional test follows.
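Following up on the functional-test suggestion above, a hedged sketch of a model with exactly that pattern: a fused QKV GEMM split by VariadicSplit, a Reshape feeding the SDPA query input, and the Unsqueeze/Broadcast/Reshape chain on K/V so the fusion can trigger. All sizes and helper names are illustrative assumptions:

```cpp
#include <memory>
#include <vector>
#include <openvino/openvino.hpp>
#include <openvino/op/ops.hpp>

// Toy model: fused QKV GEMM -> VariadicSplit -> Reshape -> SDPA query input.
// Geometry mirrors the GLM4 logs: 16 Q heads, 4 KV heads, head size 128.
std::shared_ptr<ov::Model> make_qkv_split_sdpa_model() {
    const int64_t hidden = 2048, q_w = 16 * 128, kv_w = 4 * 128;

    auto x = std::make_shared<ov::op::v0::Parameter>(
        ov::element::f32, ov::PartialShape{-1, -1, hidden});
    auto w = ov::op::v0::Constant::create(
        ov::element::f32, ov::Shape{size_t(hidden), size_t(q_w + 2 * kv_w)},
        std::vector<float>(size_t(hidden) * (q_w + 2 * kv_w), 0.01f));
    auto qkv = std::make_shared<ov::op::v0::MatMul>(x, w);

    auto axis = ov::op::v0::Constant::create(ov::element::i64, ov::Shape{}, {-1});
    auto lens = ov::op::v0::Constant::create(
        ov::element::i64, ov::Shape{3}, std::vector<int64_t>{q_w, kv_w, kv_w});
    auto split = std::make_shared<ov::op::v1::VariadicSplit>(qkv, axis, lens);

    // [B, S, H*128] -> [B, S, H, 128] -> [B, H, S, 128]
    auto to_bhsd = [](const ov::Output<ov::Node>& t, int64_t heads) {
        auto sh = ov::op::v0::Constant::create(
            ov::element::i64, ov::Shape{4}, std::vector<int64_t>{0, 0, heads, 128});
        auto r = std::make_shared<ov::op::v1::Reshape>(t, sh, true);
        auto perm = ov::op::v0::Constant::create(
            ov::element::i64, ov::Shape{4}, std::vector<int64_t>{0, 2, 1, 3});
        return std::make_shared<ov::op::v1::Transpose>(r, perm)->output(0);
    };
    // Unsqueeze -> Broadcast -> Reshape: replicate 4 KV heads to 16.
    auto expand_kv = [](const ov::Output<ov::Node>& kv) {
        auto ax = ov::op::v0::Constant::create(ov::element::i64, ov::Shape{1}, {2});
        auto u = std::make_shared<ov::op::v0::Unsqueeze>(kv, ax);
        auto tgt = ov::op::v0::Constant::create(
            ov::element::i64, ov::Shape{5}, std::vector<int64_t>{1, 4, 4, 1, 1});
        auto b = std::make_shared<ov::op::v3::Broadcast>(
            u, tgt, ov::op::BroadcastType::BIDIRECTIONAL);
        auto sh = ov::op::v0::Constant::create(
            ov::element::i64, ov::Shape{4}, std::vector<int64_t>{0, 16, -1, 128});
        return std::make_shared<ov::op::v1::Reshape>(b, sh, true)->output(0);
    };

    auto q = to_bhsd(split->output(0), 16);            // Reshape feeds the query
    auto k = expand_kv(to_bhsd(split->output(1), 4));  // broadcast chains that the
    auto v = expand_kv(to_bhsd(split->output(2), 4));  // fusion should fold into SDPA

    auto sdpa = std::make_shared<ov::op::v13::ScaledDotProductAttention>(q, k, v, true);
    return std::make_shared<ov::Model>(ov::OutputVector{sdpa->output(0)},
                                       ov::ParameterVector{x});
}
```

Compiling this model on GPU and comparing its outputs against a reference device would then confirm that the relaxed matcher handles a dynamically padded query input correctly.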

// Sometimes input0's shape has zeros (or even dynamic dims) in several dimensions; for
// example, when concatenating [-1, 0, 0, 0] + [-1, 4, -1, 128] along axis 2, we could (and
// should) infer the dim values of axes 1 and 3 (4 and 128) in this case.
for (int64_t i = 0; i < static_cast<int64_t>(out_shapes[0].size()); ++i) {
Contributor
I think we can just initialize out_shapes[0] with input_shapes[1] instead of input_shapes[0], since input_shapes[1] is always the "new_token" input and input_shapes[0] is the "past" state. The [1] shape should always be more detailed.

Contributor Author

done
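For illustration, a hedged sketch of the shape-merging idea settled on in this thread; the function name and structure are hypothetical, not the actual KVCache shape-inference code:

```cpp
#include <cstddef>
#include <openvino/core/partial_shape.hpp>

// "past" may carry zeros or fully dynamic dims (e.g. GLM4V's (-1, 0, 0, 0));
// "new_token" is usually the more detailed shape, so start the output from it.
ov::PartialShape infer_kv_cache_shape(const ov::PartialShape& past,
                                      const ov::PartialShape& new_token,
                                      std::size_t concat_axis) {
    ov::PartialShape out = new_token;  // initialize from input_shapes[1]
    // Only the concat axis accumulates: past length + new-token length.
    out[concat_axis] = past[concat_axis] + new_token[concat_axis];
    return out;
}

// infer_kv_cache_shape({-1, 0, 0, 0}, {-1, 4, -1, 128}, 2) yields {-1, 4, -1, 128}:
// axes 1 and 3 are recovered from new_token, and the concat axis stays dynamic.
```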

@sshlyapn sshlyapn changed the title relax UnsqueezeBroadcastReshapeSDPAFusion with no need to ask querry … [GPU] Relax UnsqueezeBroadcastReshapeSDPAFusion Nov 22, 2024
@p-durandin p-durandin added this pull request to the merge queue Nov 22, 2024
Merged via the queue into openvinotoolkit:master with commit c801f4e Nov 22, 2024
154 checks passed
NishantPrabhuFujitsu pushed a commit to NishantPrabhuFujitsu/openvino that referenced this pull request Nov 26, 2024
### Details:
- By relaxing UnsqueezeBroadcastReshapeSDPAFusion, the GQA pattern is
enabled and the overhead of the Broadcast nodes in the key and value
paths is removed, which significantly improves the performance of the
GLM4 model.
- Fix for GLM4V, whose initial state shape (-1, 0, 0, 0) made shape
inference fail.

### Tickets:
 - *CVS-157263*

---------

Co-authored-by: Chen Peter <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Nov 26, 2024