Optimize the rotm kernel with RVV intrinsic. #5038

Closed
wants to merge 3 commits into from

tingboliao

Starting from the scalar implementation of rotm, we optimized the kernel using RVV 1.0 intrinsics.
We then developed test cases to verify functionality and performance on K230 and K1.
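
For readers unfamiliar with rotm, here is a minimal scalar sketch of the modified Givens rotation as defined by reference BLAS. It is not the code from this PR: the name `ref_drotm` is hypothetical, and the sketch assumes positive strides for brevity. Each loop iteration is independent of the others, which is the property a vector implementation would rely on.

```c
#include <stddef.h>

/* Hypothetical reference routine (not an OpenBLAS symbol).
 * Applies the modified Givens rotation H to each pair (x[i], y[i]):
 *   x' = h11*x + h12*y
 *   y' = h21*x + h22*y
 * param[0] is the flag; param[1..4] hold h11, h21, h12, h22.
 * Assumes incx > 0 and incy > 0. */
void ref_drotm(size_t n, double *x, size_t incx,
               double *y, size_t incy, const double param[5])
{
    const double flag = param[0];
    double h11, h12, h21, h22;

    if (flag == -2.0) {
        return;                     /* H is the identity: nothing to do */
    } else if (flag < 0.0) {        /* flag == -1: full matrix */
        h11 = param[1]; h21 = param[2];
        h12 = param[3]; h22 = param[4];
    } else if (flag == 0.0) {       /* unit diagonal */
        h11 = 1.0;      h21 = param[2];
        h12 = param[3]; h22 = 1.0;
    } else {                        /* flag == 1: fixed off-diagonal */
        h11 = param[1]; h21 = -1.0;
        h12 = 1.0;      h22 = param[4];
    }

    for (size_t i = 0; i < n; i++) {
        const double xi = x[i * incx];
        const double yi = y[i * incy];
        x[i * incx] = h11 * xi + h12 * yi;
        y[i * incy] = h21 * xi + h22 * yi;
    }
}
```

With `param = {-1, 2, 3, 4, 5}`, the pair `(x, y) = (1, 3)` becomes `(2*1 + 4*3, 3*1 + 5*3) = (14, 18)`.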

The performance data are shown below.

Parameter setting: OPENBLAS_LOOPS = 10000.

  1. K230 [C908, vlen = 128] @ 1.6 GHz:

    | Case | Scalar (MFlops) | Optimized RVV (MFlops) |
    | --- | --- | --- |
    | srotm.goto | 875.57 | 1536.78 |
    | drotm.goto | 799.77 | 1408.70 |

  2. K1 [C908, vlen = 256] @ 1.6 GHz:

    | Case | Scalar (MFlops) | Optimized RVV (MFlops) |
    | --- | --- | --- |
    | srotm.goto | 880.02 | 1490.44 |
    | drotm.goto | 811.13 | 1541.92 |

Higher values indicate better performance.

@martin-frbg
Collaborator

Thanks - the numbers are very compelling, but I'm not entirely sure having that much architecture-specific code at the interface level is a good idea. At least I don't think we've done this before, and if every architecture ifdef'd their specific intrinsics implementation into it, the file would get unwieldy rather quickly. (Need some time to think about alternatives though - not sure if it's easy to add a kernel mapping for just riscv64 either...)

@tingboliao
Author

Thanks, we will look into alternatives and submit a new pull request (PR) later if possible.
