[Snippets] Move BrgemmCopyB repacking logic outside the Subgraph #27007
@a-sidorova @IvanNovoselov could you please review the PR? Thanks
@@ -69,6 +67,9 @@ void CPURuntimeConfigurator::update(const ov::snippets::lowered::LinearIRCPtr& l
    if (linear_ir->is_dynamic()) {
        update_loop_args(linear_ir);
    }
    update_data_offsets();
    m_final_runtime_optimizers.run(*linear_ir);
    m_config->m_latest_shapes = std::move(m_config->shapes);
Do you think `std::move` will work here? What will `m_config->shapes` be after this command? What if we need to access it from another method?
I rely on the assumption that this is always performed as the last step of `update`. We could probably even move this logic outside the `update` method. BTW, the initial config update section can be extracted too:
m_config->master_shape = linear_ir->get_master_shape();
m_config->io_shapes = extract_shapes();
m_config->io_layouts = extract_layouts();
What do you think?
I think we can keep the `master_shape`, `io_shapes`, and `io_layouts` update inside `update()`, since it aligns with `initialization()`, for example.
My concern here is that `update` returns a config in an invalid state, i.e. with stolen `io_shapes`. So I propose to move the `latest_shapes` initialization (and the `io_shapes` corruption) to the latest possible moment. In this case, that is just before the return from `get_updated_config`.
Good idea, I moved the `m_config->shapes` invalidation to `get_updated_config`. Also, once the runtime_optimizers pipeline was implemented, it became possible to reuse `RuntimeConfigurator::update` in the CPU configurator, so I did that to avoid code duplication.
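A minimal sketch of the concern in this thread, not the plugin's actual code: `ToyConfig`, `ToyConfigurator`, and `seen_in_update` are hypothetical names, and only the two shape fields from the discussion are modeled. After `std::move`, a `std::vector` is left in a valid but unspecified state, so the move is deferred to the very end of `get_updated_config`, where nothing else reads `io_shapes` anymore.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the runtime config (assumption: shapes are vectors of dims).
using Shapes = std::vector<std::vector<std::size_t>>;

struct ToyConfig {
    Shapes io_shapes;
    Shapes latest_shapes;
};

class ToyConfigurator {
public:
    // Runs the update, then moves the shapes as the very last step before returning.
    const ToyConfig& get_updated_config(const Shapes& new_shapes) {
        update(new_shapes);
        // From here until the next update(), io_shapes must not be read.
        m_config.latest_shapes = std::move(m_config.io_shapes);
        return m_config;
    }

    // Number of shapes that update() was able to read from io_shapes.
    std::size_t seen_in_update() const { return m_seen; }

private:
    void update(const Shapes& new_shapes) {
        m_config.io_shapes = new_shapes;
        // Anything called from update() still sees valid io_shapes.
        m_seen = m_config.io_shapes.size();
    }

    ToyConfig m_config;
    std::size_t m_seen = 0;
};
```

With this ordering, every helper invoked from `update` can rely on `io_shapes`, and only the caller of `get_updated_config` observes the moved-from field.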
if (m_kernel < min_kernel_m)
    break;
Why do we iterate up to `m_dim` if we know that we'll `break` here for sufficiently big divisors? Shouldn't we iterate up to something like `m_dim / min_kernel` and remove this condition?
You are right, but please let me address this later: after perf validation (WIP), I will probably have to change this code significantly anyway.
> Shouldn't we iterate up to smth like m_dim/min_kernel and remove this condition?
BTW, I think so 🤔 If `m_dim % min_kernel != 0`, just take the nearest integer quotient of the division.
The SplitM heuristic has changed after perf validation, please take a look.
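The reviewers' suggestion can be sketched as follows. This is only an illustration of bounding the loop instead of breaking out of it, not the heuristic that was eventually merged; `Split` and `candidate_splits` are hypothetical names. Because integer division rounds down, every divisor up to `m_dim / min_kernel_m` already yields a kernel of at least `min_kernel_m` rows, so no early `break` is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: M is split into batch_part * kernel_m.
struct Split {
    std::size_t batch_part;
    std::size_t kernel_m;
};

// Lists every exact split of m_dim whose kernel part is at least min_kernel_m.
std::vector<Split> candidate_splits(std::size_t m_dim, std::size_t min_kernel_m) {
    std::vector<Split> result;
    if (min_kernel_m == 0 || m_dim < min_kernel_m)
        return result;
    // Rounded-down quotient: divisor <= max_divisor implies m_dim / divisor >= min_kernel_m.
    const std::size_t max_divisor = m_dim / min_kernel_m;
    for (std::size_t divisor = 1; divisor <= max_divisor; ++divisor) {
        if (m_dim % divisor == 0)
            result.push_back({divisor, m_dim / divisor});
    }
    return result;
}
```

For example, `candidate_splits(256, 64)` only visits divisors 1 to 4 rather than 1 to 256.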
@a-sidorova @IvanNovoselov the PR is ready for the 2nd review
splited.second = divisor_1;
break;
// TODO: should we limit minimal kernel_m?
const size_t min_kernel_m = 4;
I'm not sure, but I believe that `min_kernel_m` should be at least 32-64. For me, `4` is too low a value. However, let's wait for the perf validation 😃
Please take a look at the updated heuristic.
Great job 👍🏼
// Ideal case #2: M is divisible by optimal parallel work amount, and the new_m_dim is big enough
// In this case, each thread will execute the Snippets kernel 'batch_dim' times
if (m_dim % optimal_parallelism_work_amount == 0) {
    const auto new_m_dim = m_dim / optimal_parallelism_work_amount;
    const size_t min_kernel_m = 64;
    if (new_m_dim >= min_kernel_m) {
        splited.first = optimal_parallelism_work_amount;
        splited.second = new_m_dim;
        OPENVINO_ASSERT(splited.first * splited.second == m_dim, "Incorrect dimension M splitting!");
        return splited;
    }
}
I think we need to return to this algorithm soon, because I'm still not sure that the current implementation covers our needs.
Just imagine a shape `[1,5,16384,64]` (SD) and `optimal_parallelism_work_amount = 18` (our workstation).
Ideal case #1 will be skipped because `18 / 5 = 3.6` is not an integer.
Ideal case #2 will be skipped too because `16384 / 18 = 910.(2)` is not an integer.
We will go to the next step and get non-optimal scheduling. I'd expect that in this case `new_m_dim` will be `32` or `64` (small) and `new_batch_dim` will be `512` or `256`. Yes, not all threads will process the same number of kernels, but there will be so many kernels that we shouldn't notice this imbalance. This is my opinion and I'm not 100% sure either. But I just changed the thread count for SD, and Ideal case #2 is broken now.
Agreed, I reflected this point in ticket 157339.
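The arithmetic in the example above can be checked with a small sketch. The condition for Ideal case #2 mirrors the quoted snippet; the condition for Ideal case #1 is an assumption inferred from the comment (the work amount must be divisible by the batch), since that code is not shown here. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: Ideal case #1 applies when the work amount is divisible by the batch.
bool ideal_case_1_applies(std::size_t batch_dim, std::size_t work_amount) {
    return batch_dim != 0 && work_amount % batch_dim == 0;
}

// Mirrors the quoted condition: M divisible by the work amount and the new M big enough.
bool ideal_case_2_applies(std::size_t m_dim, std::size_t work_amount,
                          std::size_t min_kernel_m = 64) {
    return work_amount != 0 && m_dim % work_amount == 0 &&
           m_dim / work_amount >= min_kernel_m;
}

// Batch multiplier obtained when M is cut into kernels of a fixed size.
std::size_t batch_multiplier(std::size_t m_dim, std::size_t kernel_m) {
    return m_dim / kernel_m;
}
```

For batch 5, M = 16384, and 18 threads, both predicates return false, while a fixed kernel size of 32 or 64 gives the multipliers 512 and 256 mentioned in the comment. With 16 threads instead of 18, Ideal case #2 would apply.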
…nvinotoolkit#27007)
### Details:
Currently, CopyB repacking is always performed inside the Subgraph. When the batch of the B Matmul input is significantly smaller than the batch of the A Matmul input and the parallel work amount is large enough, this can lead to inefficient execution: the B input is repacked in every parallel task, whereas a single repacking iteration per B batch is enough. In this PR, CopyB repacking is moved outside the Snippets kernel and performed via a common reorder primitive just before the Snippets kernel executes.
### Tickets:
- *CVS-154383*
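To make the motivation concrete, here is a toy cost model, not the plugin's code. It assumes that with in-kernel repacking every parallel task repacks its B operand once, and that the number of parallel tasks is the A batch times the number of M blocks; all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: one parallel task per (A batch, M block) pair.
std::size_t parallel_tasks(std::size_t a_batch, std::size_t m_blocks) {
    return a_batch * m_blocks;
}

// In-kernel repacking: each parallel task repacks B once.
std::size_t repacks_inside_subgraph(std::size_t a_batch, std::size_t m_blocks) {
    return parallel_tasks(a_batch, m_blocks);
}

// External repacking: each B batch is repacked once before the kernel runs.
std::size_t repacks_outside_subgraph(std::size_t b_batch) {
    return b_batch;
}
```

Under these assumptions, an A batch of 16 with 4 M blocks and a B batch of 1 means 64 repacking iterations inside the Subgraph versus 1 outside it.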