Introduce matmul-based IP for training #2377
Conversation
General comment:
- Why not reuse the reduction JIT generator in src/cpu/x64/jit_uni_reduction_kernel.cpp?
- Given the asks to have internal APIs for basic blocks (and an external one), it would make sense to untie this reduction from brgemm and have a standalone reduction building block that could be reused everywhere (and that we could expose with the ukernel API).
- Finally, for efficiency, would it make sense to put it in maybe_pre_process_data (same as upconversions) instead of as a post-op? I guess this would allow using the reduction computation as prefetching? Maybe @ankalinin has some thoughts on that.
Originally, I enabled matmul-based IP for more formats than the brgemm-based one supports, and that required enabling reduction for matrix B, which I did by reusing the existing reducer (the one used in brgemm-based IP) almost verbatim. But to support the formats that brgemm-based IP supports, I had to extend the reducer to handle reduction for matrix A. In the end I decided that for now we only need to support the formats required to replace the brgemm-based IP, so I disabled reduction for matrix B, which can be enabled later if needed. Long story short, it was easier, given the initial goals.
Regarding the standalone reduction building block: I had thought about the idea and I think it makes sense to introduce it. However, I believe we need to approach it the other way around, that is, define what the building block should look like, make sure that it indeed can be used everywhere (externally and internally), implement it, and only then enable it internally, assuming it doesn't introduce performance regressions. Doing that work as part of this task doesn't seem to be the most efficient way to achieve our goals. I also think it should be straightforward to replace a call to one API (the current reducer) with another one (the future building block).
In most cases when the reduction is performed we do copy A, so I don't know if doing the reduction prior to calling brgemm will make any difference. The only case when we don't copy A is when A is transposed and M == 1.
Exactly. I would expect the reduction to either happen in the copy routine (for non-AMX), or in the brgemm kernel itself when the copy routine is fused (for AMX, as part of tile_load preprocessing). This is a somewhat similar approach to the int8 compensation computation as well. This way the reduction happens as A is loaded into cache. When doing it as a post-op after the brgemm kernel call, I guess there is no guarantee that the A blocks are still in cache. (See the sketch after this exchange.)
Makes sense. Is there anything you could share from this work to help with enabling that feature? I guess the sooner we start, the fewer redundant code paths we will end up with.
Well, from past experience, it seems that it does not happen. Take brgemm or jit_io for example: as each was introduced, no existing implementation was changed to retrofit it. Instead we just added a new one, leaving the other implementations as legacy. I get that it is often faster to have a point solution, but we need to start thinking about a more scalable approach, otherwise we will quickly end up with non-uniform support of features across multiple implementations, and redundant kernel generators.
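Below is a minimal, hypothetical sketch of the fusion idea discussed above (not code from this PR), assuming a plain row-major A block and f32 data: a copy routine that packs A for the GEMM kernel and accumulates the per-row reduction of A in the same pass, so the reduction reads the data while it is hot in cache. All names are illustrative only.

```cpp
#include <cstddef>

// Hypothetical copy routine that packs an M x K block of A (row-major,
// leading dimension lda) into a packed buffer (leading dimension ldp)
// and, in the same pass, accumulates the per-row reduction of A.
// a_reduce[m] accumulates sum_k A[m][k] across all K blocks processed.
void copy_a_with_reduction(const float *A, std::size_t lda, float *A_packed,
        std::size_t ldp, float *a_reduce, std::size_t M, std::size_t K) {
    for (std::size_t m = 0; m < M; ++m) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < K; ++k) {
            const float v = A[m * lda + k];
            A_packed[m * ldp + k] = v; // regular copy for the GEMM kernel
            acc += v;                  // reduction fused with the copy
        }
        a_reduce[m] += acc; // accumulate the partial sum for this K block
    }
}
```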
Would it make sense to add your benchmark as part of the examples (in a similar way to matmul_perf)?
I think it depends on the size of the M chunk, which usually consists of multiple blocks, but in general, I think, we try to reuse the chunk data as much as possible (see here). In any case, I didn't see the reduction being a bottleneck, so I didn't see much sense in trying to optimize it further. We can explore these ideas when/if it becomes a bottleneck.
Unfortunately, not much. I guess you have in mind an extension to brgemm that would enable brgemm to do the reduction inside. I think it would make sense to go that route because the current external reduction kernel depends on some of the brgemm parameters anyway. Also, if we go that way, then we'll need to decide whether we allow a reduction only for A or B, or whether we would be OK with doing a reduction for A and B simultaneously (the same topic that we've discussed regarding the matmul extensions).
I'm with you on this one.
I'm not sure it will be very useful for anyone, because most of the frameworks have already moved to matmul and we actually encourage users to use matmul. And the example is basically just 3 inner product primitives called sequentially (we don't need reorders now). If you think it makes sense I can add it, let me know.
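For reference, a minimal sketch of what such an example could look like, in the spirit of matmul_perf but for inner product: it creates the three primitives (forward training, backward data, backward weights with bias) with plain layouts and executes them sequentially via the public oneDNN C++ API. The shapes are illustrative, data initialization and timing are omitted, and this is only a sketch under those assumptions, not the benchmark used for the numbers in this PR.

```cpp
#include <unordered_map>
#include "dnnl.hpp"

using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    // Illustrative sizes: mini-batch MB, input channels IC, output channels OC.
    const memory::dim MB = 256, IC = 1024, OC = 1024;
    memory::desc src_md({MB, IC}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc wei_md({OC, IC}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc bia_md({OC}, memory::data_type::f32, memory::format_tag::a);
    memory::desc dst_md({MB, OC}, memory::data_type::f32, memory::format_tag::ab);

    // Forward (training) primitive.
    auto fwd_pd = inner_product_forward::primitive_desc(eng,
            prop_kind::forward_training, src_md, wei_md, bia_md, dst_md);
    auto fwd = inner_product_forward(fwd_pd);

    // Backward data primitive (uses the forward pd as a hint).
    auto bwd_d_pd = inner_product_backward_data::primitive_desc(
            eng, src_md, wei_md, dst_md, fwd_pd);
    auto bwd_d = inner_product_backward_data(bwd_d_pd);

    // Backward weights (with bias) primitive.
    auto bwd_w_pd = inner_product_backward_weights::primitive_desc(
            eng, src_md, wei_md, bia_md, dst_md, fwd_pd);
    auto bwd_w = inner_product_backward_weights(bwd_w_pd);

    // Plain-layout memory objects; with the matmul-based IP no weight
    // reorders are needed between the three steps.
    memory src(src_md, eng), wei(wei_md, eng), bia(bia_md, eng), dst(dst_md, eng);
    memory diff_src(src_md, eng), diff_wei(wei_md, eng), diff_bia(bia_md, eng),
            diff_dst(dst_md, eng);

    // forward -> backward data -> backward weights with bias
    fwd.execute(strm, {{DNNL_ARG_SRC, src}, {DNNL_ARG_WEIGHTS, wei},
                              {DNNL_ARG_BIAS, bia}, {DNNL_ARG_DST, dst}});
    bwd_d.execute(strm, {{DNNL_ARG_DIFF_DST, diff_dst}, {DNNL_ARG_WEIGHTS, wei},
                                {DNNL_ARG_DIFF_SRC, diff_src}});
    bwd_w.execute(strm, {{DNNL_ARG_SRC, src}, {DNNL_ARG_DIFF_DST, diff_dst},
                                {DNNL_ARG_DIFF_WEIGHTS, diff_wei},
                                {DNNL_ARG_DIFF_BIAS, diff_bia}});
    strm.wait();
    return 0;
}
```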
Not necessarily. I think keeping it separate from brgemm would actually make the API simpler and more flexible, as it would be usable in non-brgemm-related settings.
Are you referring to the reduction kernel you introduced? Does it depend on more than block size, stride, and data type?
I guess to support int8 computation and zero points, both A and B need to be supported (so row-reduction and column-reduction; see the sketch after this comment). In that case keeping the reduction separate from brgemm would force the reduction to happen as part of the copy routine, but both can rely on the same building block.
I guess we need a way to track global optimization (in this case, layout selection). If we get enough feedback from frameworks on that, then we should be good without adding this benchmark. If we feel that is not enough, then it might make sense to add that benchmark to our examples and track its performance.
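To make the int8 point concrete, here is a small reference sketch (hypothetical names, plain row-major layouts) of the two reduction orientations mentioned above: with C[m][n] = sum_k (A[m][k] - a_zp) * (B[k][n] - b_zp), the zero-point correction terms need the row sums of A and the column sums of B, so a single standalone building block supporting both orientations would cover this case.

```cpp
#include <cstddef>
#include <cstdint>

// Row-reduction of A (M x K, row-major): row_sum[m] = sum_k A[m][k].
// Needed for the b_zp correction term.
void reduce_rows(const std::int8_t *A, std::size_t lda, std::int32_t *row_sum,
        std::size_t M, std::size_t K) {
    for (std::size_t m = 0; m < M; ++m) {
        std::int32_t acc = 0;
        for (std::size_t k = 0; k < K; ++k) acc += A[m * lda + k];
        row_sum[m] = acc;
    }
}

// Column-reduction of B (K x N, row-major): col_sum[n] = sum_k B[k][n].
// Needed for the a_zp correction term.
void reduce_cols(const std::int8_t *B, std::size_t ldb, std::int32_t *col_sum,
        std::size_t K, std::size_t N) {
    for (std::size_t n = 0; n < N; ++n) col_sum[n] = 0;
    for (std::size_t k = 0; k < K; ++k)
        for (std::size_t n = 0; n < N; ++n) col_sum[n] += B[k * ldb + n];
}
```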
This PR introduces a matmul-based IP implementation for training.
- forward_training: plain layouts are always used to be consistent with the corresponding tensors used for backward_data and backward_weights.
- backward_weights: for this case the matmul primitive has been extended with an internal feature that enables matmul to reduce matrix A. The reduction is fused.
The performance data below is collected using a custom benchmark that mimics a training scenario defined as follows:
- reorder weights (plain -> blocked) -> forward -> backward data -> backward weights with bias -> reorder weights (blocked -> plain)
- forward -> backward data -> backward weights with bias - no reorders are required because the weights are used as is (in the plain layout)
The parameters are taken from the data set used in the performance testing. The following models are covered: