
Introduce matmul-based IP for training #2377

Merged: 3 commits merged into main on Jan 21, 2025

Conversation

densamoilov
Contributor

@densamoilov densamoilov commented Jan 11, 2025

This PR introduces a matmul-based IP implementation for training.

  • When IP is created for forward_training, plain layouts are always used so that they stay consistent with the corresponding tensors used for backward_data and backward_weights.
  • To support bias gradients for the backward_weights case, the matmul primitive has been extended with an internal feature that enables matmul to reduce matrix A. The reduction is fused (see the sketch after this list).
  • Verbose output was intentionally not extended to cover the new internal feature because it would not provide any helpful information; the bias information reported for IP is already enough.
  • This matmul-based implementation for training is meant to be a full replacement for the existing brgemm-based one. The latter will stay available through the primitive descriptor iterator for some time, but no further development is planned for it and it will eventually be removed.
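For reference, a minimal sketch (plain C++, with hypothetical function and variable names; not the actual oneDNN internals) of the math behind the fused reduction: when diff_weights = diff_dst^T x src is computed as a matmul with A = diff_dst^T, reducing A over the minibatch (K) dimension yields exactly the bias gradient, which is why fusing the reduction into matmul covers the backward_weights-with-bias case.

```cpp
#include <cstddef>
#include <vector>

// Reference-style sketch of IP backward_weights with bias:
//   diff_weights = diff_dst^T x src          (the matmul, with A = diff_dst^T)
//   diff_bias    = reduce(diff_dst over MB)  (the reduction of matrix A)
void ip_backward_weights_ref(const std::vector<float> &src,       // MB x IC
                             const std::vector<float> &diff_dst,  // MB x OC
                             std::vector<float> &diff_weights,    // OC x IC
                             std::vector<float> &diff_bias,       // OC
                             std::size_t MB, std::size_t IC, std::size_t OC) {
    for (std::size_t oc = 0; oc < OC; ++oc) {
        for (std::size_t ic = 0; ic < IC; ++ic) {
            float acc = 0.f;
            for (std::size_t mb = 0; mb < MB; ++mb)
                acc += diff_dst[mb * OC + oc] * src[mb * IC + ic];
            diff_weights[oc * IC + ic] = acc;
        }
        // Reduction of matrix A (diff_dst^T) over the minibatch dimension;
        // in this PR the analogous reduction is fused into the matmul primitive.
        float bias_acc = 0.f;
        for (std::size_t mb = 0; mb < MB; ++mb)
            bias_acc += diff_dst[mb * OC + oc];
        diff_bias[oc] = bias_acc;
    }
}
```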

The performance data below was collected using a custom benchmark that mimics a training scenario, defined as follows:

  • brgemm-based: reorder weights (plain -> blocked) -> forward -> backward data -> backward weights with bias -> reorder weights (blocked -> plain)
  • matmul-based: forward -> backward data -> backward weights with bias; no reorders are required because the weights are used as is (in the plain layout) - see the sketch after this list
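Below is a minimal sketch of the matmul-based scenario using the public oneDNN C++ API (assuming the v3.x primitive descriptor constructors; the MB/IC/OC values are hypothetical). All three primitive descriptors are created with plain layouts, which is why no weight reorders appear in the sequence.

```cpp
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    // Hypothetical problem sizes.
    const memory::dim MB = 128, IC = 1024, OC = 1024;

    // Plain (row-major) layouts everywhere: the same weights tensor is
    // consumed by all three primitives, so no reorders are needed.
    auto src_md = memory::desc({MB, IC}, memory::data_type::f32, memory::format_tag::nc);
    auto wei_md = memory::desc({OC, IC}, memory::data_type::f32, memory::format_tag::oi);
    auto bia_md = memory::desc({OC}, memory::data_type::f32, memory::format_tag::a);
    auto dst_md = memory::desc({MB, OC}, memory::data_type::f32, memory::format_tag::nc);

    // Forward (training).
    auto fwd_pd = inner_product_forward::primitive_desc(eng,
            prop_kind::forward_training, src_md, wei_md, bia_md, dst_md);

    // Backward data: diff_src = diff_dst x weights.
    auto bwd_d_pd = inner_product_backward_data::primitive_desc(eng,
            src_md, wei_md, dst_md, fwd_pd);

    // Backward weights with bias: diff_weights = diff_dst^T x src,
    // diff_bias = reduction of diff_dst over the minibatch.
    auto bwd_w_pd = inner_product_backward_weights::primitive_desc(eng,
            src_md, wei_md, bia_md, dst_md, fwd_pd);

    (void)bwd_d_pd;
    (void)bwd_w_pd;
    return 0;
}
```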

The parameters are taken from the data set used in the performance testing. The following models are covered:

  • BERT Large
  • DLRM-v2
  • DLRM
  • MaskRCNN
  • Transformer LT

[image: performance data]

@densamoilov densamoilov requested review from a team as code owners January 11, 2025 07:48
@github-actions github-actions bot added the platform:cpu-x64 (Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64) and component:api (Codeowner: @oneapi-src/onednn-arch) labels Jan 11, 2025
@densamoilov
Contributor Author

make test
enable device_cpu
disable device_gpu

@densamoilov densamoilov force-pushed the dsamoylo/main/matmul-ip-fwd-and-bwd branch 5 times, most recently from 3130efe to 0c4f319 on January 14, 2025 01:03
@densamoilov
Contributor Author

make test
enable device_cpu
disable device_gpu

@densamoilov densamoilov force-pushed the dsamoylo/main/matmul-ip-fwd-and-bwd branch from 0c4f319 to de99ef5 on January 14, 2025 21:56
@densamoilov densamoilov force-pushed the dsamoylo/main/matmul-ip-fwd-and-bwd branch from de99ef5 to 18a7dd3 on January 15, 2025 22:20
@densamoilov
Contributor Author

make test
enable device_cpu
disable device_gpu
set test_set=NIGHTLY

Contributor

@mgouicem mgouicem left a comment

General comment:

  • why not reuse the reduction jit generator in src/cpu/x64/jit_uni_reduction_kernel.cpp?
  • given the asks to have internal APIs for basic blocks (and external ones), it would make sense to untie this reduction from brgemm and have a standalone reduction building block that could be reused everywhere (and that we could expose through the ukernel API).
  • Finally, for efficiency, would it make sense to put it in maybe_pre_process_data (same as upconversions) instead of a post-op? I guess this would allow using the reduction computation as prefetching? Maybe @ankalinin has some thoughts on that.

@densamoilov
Contributor Author

densamoilov commented Jan 17, 2025

@mgouicem,

  • why not reuse the reduction jit generator in src/cpu/x64/jit_uni_reduction_kernel.cpp?
  • given the asks to have internal APIs for basic blocks (and external ones), it would make sense to untie this reduction from brgemm and have a standalone reduction building block that could be reused everywhere (and that we could expose through the ukernel API).

Originally, I enabled matmul-based IP for more formats than the brgemm-based one supports, which required enabling reduction for matrix B. I did that by reusing, almost verbatim, the existing reducer used in brgemm-based IP. But to support the formats supported by brgemm-based IP I had to extend the reducer to support reduction for matrix A. In the end I decided that for now we only need to support the formats required to replace the brgemm-based IP, so I disabled reduction for matrix B; it can be enabled later if needed. So, long story short, it was easier given the initial goals.

Regarding the standalone reduction building block: I had thought about the idea and I think it makes sense to introduce it. However, I believe we need to approach it the other way around, that is, define what the building block should look like, make sure it can indeed be used everywhere (externally and internally), implement it, and only then enable it internally, assuming it doesn't introduce performance regressions. Doing that work as part of this task doesn't seem to be the most efficient way to achieve our goals. I also think it should be straightforward to replace a call to one API (the current reducer) with another (the future building block).

  • Finally, for efficiency, would it make sense to put it in maybe_pre_process_data (same as upconversions) instead of a post-op? I guess this would allow using the reduction computation as prefetching? Maybe @ankalinin has some thoughts on that.

In most cases when the reduction is performed we do copy A, so I don't know if doing the reduction prior to calling brgemm will make any difference. The only case when we don't copy A is when A is transposed and M == 1.

@mgouicem
Contributor

In most cases when the reduction is performed we do copy A, so I don't know if doing the reduction prior to calling brgemm will make any difference. The only case when we don't copy A is when A is transposed and M == 1.

Exactly. I would expect the reduction to happen either in the copy routine (for non-AMX) or in the brgemm kernel itself when the copy routine is fused (for AMX, as part of tile_load preprocessing). This is somewhat similar to the int8 compensation computation.

This way the reduction happens as A is loaded into cache. When doing it as a post-op after the brgemm kernel call, I guess there is no guarantee that the A blocks are still in cache.

Regarding the standalone reduction building block: I had thought about the idea and I think it makes sense to introduce it. However, I believe we need to approach it the other way around, that is, define what the building block should look like, make sure it can indeed be used everywhere (externally and internally), implement it, and only then enable it internally, assuming it doesn't introduce performance regressions.

Makes sense. Is there anything you could share from this work to help with enabling that feature? I guess the sooner we start, the fewer redundant code paths we will end up with.

Doing that work as part of this task doesn't seem to be the most efficient way to achieve our goals. I also think it should be straightforward to replace a call to one API (the current reducer) with another (the future building block).

Well, from past experience, it seems that it does not happen. Take brgemm or jit_io for example: as each was introduced, no existing implementation was changed to retrofit it. Instead we just added a new one, leaving the other implementations as legacy.

I get that it is often faster to have a point solution, but we need to start thinking about a more scalable approach, otherwise we will quickly end up with non-uniform support of features across multiple implementations and redundant kernel generators.

Contributor

@mgouicem mgouicem left a comment

Would it make sense to add your benchmark as part of the examples (in a similar way to matmul_perf)?

@densamoilov
Contributor Author

Exactly. I would expect the reduction to happen either in the copy routine (for non-AMX) or in the brgemm kernel itself when the copy routine is fused (for AMX, as part of tile_load preprocessing). This is somewhat similar to the int8 compensation computation.

This way the reduction happens as A is loaded into cache. When doing it as a post-op after the brgemm kernel call, I guess there is no guarantee that the A blocks are still in cache.

I think it depends on the size of the M chunk, which usually consists of multiple blocks, but in general, I think, we try to reuse the chunk data as much as possible (see here). In any case, I didn't see that the reduction was a bottleneck, so I didn't see much sense in trying to optimize it further. We can explore these ideas when/if it becomes a bottleneck.

Makes sense. Is there anything you could share from this work to help with enabling that feature? I guess the sooner we start, the fewer redundant code paths we will end up with.

Unfortunately, not much. I guess you have in mind an extension to brgemm that would enable brgemm to do the reduction inside. I think it would make sense to go that route because the current external reduction kernel depends on some of the brgemm parameters anyway. Also, if we go that way then we'll need to decide whether we allow a reduction only for A or B, or whether we would be OK with doing a reduction for A and B simultaneously (the same topic that we've discussed regarding the matmul extensions).

I get that it is often faster to have a point solution, but we need to start thinking about a more scalable approach, otherwise we will quickly end up with non-uniform support of features across multiple implementations and redundant kernel generators.

I'm with you on this one.

Would it make sense to add your benchmark as part of the examples (in a similar way to matmul_perf)?

I'm not sure it will be very useful for anyone because most of the frameworks have already moved to matmul and we actually encourage users to use matmul. And the example would basically be just 3 inner product primitives called sequentially (we don't need reorders now). If you think it makes sense I can add it, let me know.

@densamoilov densamoilov force-pushed the dsamoylo/main/matmul-ip-fwd-and-bwd branch from 18a7dd3 to 6cc9d8b on January 21, 2025 05:49
@densamoilov densamoilov merged commit 22037c4 into main Jan 21, 2025
6 of 16 checks passed
@densamoilov densamoilov deleted the dsamoylo/main/matmul-ip-fwd-and-bwd branch January 21, 2025 05:57
@mgouicem
Contributor

Unfortunately, not much. I guess you have in mind an extension to brgemm that would enable brgemm to do the reduction inside.

Not necessarily. I think keeping it separate from brgemm would actually make the API simpler and more flexible, as it would be usable in non-brgemm settings.

I think it would make sense to go that route because the current external reduction kernel depends on some of the brgemm parameters anyway.

Are you referring to the reduction kernel you introduced? Does it depend on more than block size, stride and datatype?

Also, if we go that way then we'll need to decide whether we allow a reduction only for A or B, or whether we would be OK with doing a reduction for A and B simultaneously (the same topic that we've discussed regarding the matmul extensions).

I guess to support int8 computation and zero point, both A and B need to be supported (so row-reduction and column-reduction). In that case keeping the reduction separate from brgemm would force the reduction to happen as part of the copy routine, but both can rely on the same building block.

I'm not sure it will be very useful for anyone because most of the frameworks have already moved to matmul and we actually encourage users to use matmul. And the example would basically be just 3 inner product primitives called sequentially (we don't need reorders now). If you think it makes sense I can add it, let me know.

I guess we need a way to track global optimizations (in this case, layout selection). If we get enough feedback from frameworks on that, then we should be good without adding this benchmark. If we feel that is not enough, then it might make sense to add that benchmark to our examples and track its performance.
