
Introduce matmul-based IP for training #2377

Merged: 3 commits merged into main on Jan 21, 2025

Conversation

densamoilov
Contributor

@densamoilov densamoilov commented Jan 11, 2025

This PR introduces a matmul-based IP implementation for training.

  • When IP is created for forward_training, plain layouts are always used so that they stay consistent with the corresponding tensors used for backward_data and backward_weights.
  • To support bias gradients for the backward_weights case, the matmul primitive has been extended with an internal feature that enables matmul to reduce matrix A. The reduction is fused (see the sketch after this list).
  • Verbose output was intentionally not extended to cover the new internal feature because it would not provide any helpful information; the bias information reported for IP is already enough.
  • This matmul-based implementation for training is meant to be a full replacement for the existing brgemm-based one. The latter will stay available through the primitive descriptor iterator for some time, but no further development is planned for it and it will eventually be removed.
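For reference, a minimal sketch (plain C++, with hypothetical function and variable names; not the actual oneDNN internals) of the math behind the fused reduction: when diff_weights = diff_dst^T x src is computed as a matmul with A = diff_dst^T, reducing A over the minibatch (K) dimension yields exactly the bias gradient, which is why fusing the reduction into matmul covers the backward_weights-with-bias case.

```cpp
#include <cstddef>
#include <vector>

// Reference-style sketch of IP backward_weights with bias:
//   diff_weights = diff_dst^T x src          (the matmul, with A = diff_dst^T)
//   diff_bias    = reduce(diff_dst over MB)  (the reduction of matrix A)
void ip_backward_weights_ref(const std::vector<float> &src,       // MB x IC
                             const std::vector<float> &diff_dst,  // MB x OC
                             std::vector<float> &diff_weights,    // OC x IC
                             std::vector<float> &diff_bias,       // OC
                             std::size_t MB, std::size_t IC, std::size_t OC) {
    for (std::size_t oc = 0; oc < OC; ++oc) {
        for (std::size_t ic = 0; ic < IC; ++ic) {
            float acc = 0.f;
            for (std::size_t mb = 0; mb < MB; ++mb)
                acc += diff_dst[mb * OC + oc] * src[mb * IC + ic];
            diff_weights[oc * IC + ic] = acc;
        }
        // Reduction of matrix A (diff_dst^T) over the minibatch dimension;
        // in this PR the analogous reduction is fused into the matmul primitive.
        float bias_acc = 0.f;
        for (std::size_t mb = 0; mb < MB; ++mb)
            bias_acc += diff_dst[mb * OC + oc];
        diff_bias[oc] = bias_acc;
    }
}
```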

The performance data below was collected using a custom benchmark that mimics a training scenario, defined as follows:

  • brgemm-based: reorder weights (plain -> blocked) -> forward -> backward data -> backward weights with bias -> reorder weights (blocked -> plain)
  • matmul-based: forward -> backward data -> backward weights with bias; no reorders are required because the weights are used as is (in the plain layout) - see the sketch after this list
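Below is a minimal sketch of the matmul-based scenario using the public oneDNN C++ API (assuming the v3.x primitive descriptor constructors; the MB/IC/OC values are hypothetical). All three primitive descriptors are created with plain layouts, which is why no weight reorders appear in the sequence.

```cpp
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    // Hypothetical problem sizes.
    const memory::dim MB = 128, IC = 1024, OC = 1024;

    // Plain (row-major) layouts everywhere: the same weights tensor is
    // consumed by all three primitives, so no reorders are needed.
    auto src_md = memory::desc({MB, IC}, memory::data_type::f32, memory::format_tag::nc);
    auto wei_md = memory::desc({OC, IC}, memory::data_type::f32, memory::format_tag::oi);
    auto bia_md = memory::desc({OC}, memory::data_type::f32, memory::format_tag::a);
    auto dst_md = memory::desc({MB, OC}, memory::data_type::f32, memory::format_tag::nc);

    // Forward (training).
    auto fwd_pd = inner_product_forward::primitive_desc(eng,
            prop_kind::forward_training, src_md, wei_md, bia_md, dst_md);

    // Backward data: diff_src = diff_dst x weights.
    auto bwd_d_pd = inner_product_backward_data::primitive_desc(eng,
            src_md, wei_md, dst_md, fwd_pd);

    // Backward weights with bias: diff_weights = diff_dst^T x src,
    // diff_bias = reduction of diff_dst over the minibatch.
    auto bwd_w_pd = inner_product_backward_weights::primitive_desc(eng,
            src_md, wei_md, bia_md, dst_md, fwd_pd);

    (void)bwd_d_pd;
    (void)bwd_w_pd;
    return 0;
}
```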

The parameters are taken from the data set used in the performance testing. The following models are covered:

  • BERT Large
  • DLRM-v2
  • DLRM
  • MaskRCNN
  • Transformer LT

[image: performance data]

@densamoilov densamoilov requested review from a team as code owners January 11, 2025 07:48
@github-actions github-actions bot added the platform:cpu-x64 (Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64) and component:api (Codeowner: @oneapi-src/onednn-arch) labels Jan 11, 2025
@densamoilov
Contributor Author

make test
enable device_cpu
disable device_gpu

@densamoilov densamoilov force-pushed the dsamoylo/main/matmul-ip-fwd-and-bwd branch 5 times, most recently from 3130efe to 0c4f319 on January 14, 2025 01:03
@densamoilov
Contributor Author

make test
enable device_cpu
disable device_gpu

@densamoilov densamoilov force-pushed the dsamoylo/main/matmul-ip-fwd-and-bwd branch from 0c4f319 to de99ef5 on January 14, 2025 21:56
@densamoilov densamoilov force-pushed the dsamoylo/main/matmul-ip-fwd-and-bwd branch from de99ef5 to 18a7dd3 on January 15, 2025 22:20
@densamoilov
Contributor Author

make test
enable device_cpu
disable device_gpu
set test_set=NIGHTLY

Contributor

@mgouicem mgouicem left a comment

General comment:

  • why not reuse the reduction jit generator in src/cpu/x64/jit_uni_reduction_kernel.cpp?
  • given the asks to have internal APIs for basic blocks (and external ones), it would make sense to untie this reduction from brgemm and have a standalone reduction building block that could be reused everywhere (and that we could expose through the ukernel API).
  • Finally, for efficiency, would it make sense to put it in maybe_pre_process_data (same as upconversions) instead of a post-op? I guess this would allow using the reduction computation as prefetching? Maybe @ankalinin has some thoughts on that.

@densamoilov
Contributor Author

densamoilov commented Jan 17, 2025

@mgouicem,

  • why not reuse the reduction jit generator in src/cpu/x64/jit_uni_reduction_kernel.cpp?
  • given the asks to have internal APIs for basic blocks (and external ones), it would make sense to untie this reduction from brgemm and have a standalone reduction building block that could be reused everywhere (and that we could expose through the ukernel API).

Originally, I enabled matmul-based IP for more formats than the brgemm-based one supports, which required enabling reduction for matrix B. I did that by reusing, almost verbatim, the existing reducer used in brgemm-based IP. But to support the formats supported by brgemm-based IP I had to extend the reducer to support reduction for matrix A. In the end I decided that for now we only need to support the formats required to replace the brgemm-based IP, so I disabled reduction for matrix B; it can be enabled later if needed. So, long story short, it was easier given the initial goals.

Regarding the standalone reduction building block: I had thought about the idea and I think it makes sense to introduce it. However, I believe we need to approach it the other way around, that is, define what the building block should look like, make sure it can indeed be used everywhere (externally and internally), implement it, and only then enable it internally, assuming it doesn't introduce performance regressions. Doing that work as part of this task doesn't seem to be the most efficient way to achieve our goals. I also think it should be straightforward to replace a call to one API (the current reducer) with another (the future building block).

  • Finally, for efficiency, would it make sense to put it in maybe_pre_process_data (same as upconversions) instead of a post-op? I guess this would allow using the reduction computation as prefetching? Maybe @ankalinin has some thoughts on that.

In most cases when the reduction is performed we do copy A, so I don't know if doing the reduction prior to calling brgemm will make any difference. The only case when we don't copy A is when A is transposed and M == 1.

@mgouicem
Contributor

In most cases when the reduction is performed we do copy A, so I don't know if doing the reduction prior to calling brgemm will make any difference. The only case when we don't copy A is when A is transposed and M == 1.

Exactly. I would expect the reduction to happen either in the copy routine (for non-AMX) or in the brgemm kernel itself when the copy routine is fused (for AMX, as part of tile_load preprocessing). This is somewhat similar to the int8 compensation computation.

This way the reduction happens as A is loaded into cache. When doing it as a post-op after the brgemm kernel call, I guess there is no guarantee that the A blocks are still in cache.

Regarding the standalone reduction building block: I had thought about the idea and I think it makes sense to introduce it. However, I believe we need to approach it the other way around, that is, define what the building block should look like, make sure it can indeed be used everywhere (externally and internally), implement it, and only then enable it internally, assuming it doesn't introduce performance regressions.

Makes sense. Is there anything you could share from this work to help with enabling that feature? I guess the sooner we start, the fewer redundant code paths we will end up with.

Doing that work as part of this task doesn't seem to be the most efficient way to achieve our goals. I also think it should be straightforward to replace a call to one API (the current reducer) with another (the future building block).

Well, from past experience, it seems that it does not happen. Take brgemm or jit_io for example: as each was introduced, no existing implementation was changed to retrofit it. Instead we just added a new one, leaving the other implementations as legacy.

I get that it is often faster to have a point solution, but we need to start thinking about a more scalable approach, otherwise we will quickly end up with non-uniform support of features across multiple implementations and redundant kernel generators.

Contributor

@mgouicem mgouicem left a comment

Would it make sense to add your benchmark as part of the examples (in a similar way to matmul_perf)?

@densamoilov
Contributor Author

Exactly. I would expect the reduction to happen either in the copy routine (for non-AMX) or in the brgemm kernel itself when the copy routine is fused (for AMX, as part of tile_load preprocessing). This is somewhat similar to the int8 compensation computation.

This way the reduction happens as A is loaded into cache. When doing it as a post-op after the brgemm kernel call, I guess there is no guarantee that the A blocks are still in cache.

I think it depends on the size of the M chunk, which usually consists of multiple blocks, but in general, I think, we try to reuse the chunk data as much as possible (see here). In any case, I didn't see that the reduction was a bottleneck, so I didn't see much sense in trying to optimize it further. We can explore these ideas when/if it becomes a bottleneck.

Makes sense. Is there anything you could share from this work to help with enabling that feature? I guess the sooner we start, the fewer redundant code paths we will end up with.

Unfortunately, not much. I guess you have in mind an extension to brgemm that would enable brgemm to do the reduction inside. I think it would make sense to go that route because the current external reduction kernel depends on some of the brgemm parameters anyway. Also, if we go that way then we'll need to decide whether we allow a reduction only for A or B, or whether we would be OK with doing a reduction for A and B simultaneously (the same topic that we've discussed regarding the matmul extensions).

I get that it is often faster to have a point solution, but we need to start thinking about a more scalable approach, otherwise we will quickly end up with non-uniform support of features across multiple implementations and redundant kernel generators.

I'm with you on this one.

Would it make sense to add your benchmark as part of the examples (in a similar way to matmul_perf)?

I'm not sure it will be very useful for anyone because most of the frameworks have already moved to matmul and we actually encourage users to use matmul. And the example would basically be just 3 inner product primitives called sequentially (we don't need reorders now). If you think it makes sense I can add it, let me know.

@densamoilov densamoilov force-pushed the dsamoylo/main/matmul-ip-fwd-and-bwd branch from 18a7dd3 to 6cc9d8b on January 21, 2025 05:49
@densamoilov densamoilov merged commit 22037c4 into main Jan 21, 2025
6 of 16 checks passed
@densamoilov densamoilov deleted the dsamoylo/main/matmul-ip-fwd-and-bwd branch January 21, 2025 05:57
@mgouicem
Contributor

Unfortunately, not much. I guess you have in mind an extension to brgemm that would enable brgemm to do the reduction inside.

Not necessarily. I think keeping it separate from brgemm would actually make the API simpler and more flexible, as it would be usable in non-brgemm settings.

I think it would make sense to go that route because the current external reduction kernel depends on some of the brgemm parameters anyway.

Are you referring to the reduction kernel you introduced? Does it depend on more than block size, stride and datatype?

Also, if we go that way then we'll need to decide whether we allow a reduction only for A or B, or whether we would be OK with doing a reduction for A and B simultaneously (the same topic that we've discussed regarding the matmul extensions).

I guess to support int8 computation and zero point, both A and B need to be supported (so row-reduction and column-reduction). In that case keeping the reduction separate from brgemm would force the reduction to happen as part of the copy routine, but both can rely on the same building block.

I'm not sure it will be very useful for anyone because most of the frameworks have already moved to matmul and we actually encourage users to use matmul. And the example would basically be just 3 inner product primitives called sequentially (we don't need reorders now). If you think it makes sense I can add it, let me know.

I guess we need a way to track global optimizations (in this case, layout selection). If we get enough feedback from frameworks on that, then we should be good without adding this benchmark. If we feel that is not enough, then it might make sense to add that benchmark to our examples and track its performance.
