DPLASMA Warmup -- 2nd try #69
Merged
Conversation
Done: all PTG tests that are CUDA-enabled. To do:
As discussed on 03/31/23 we need to
… in testers, to provide a simple way to get consistent performance results.
- Implementation of warmup_zpotrf in testing_zpotrf.c.
- testing_zpotrf: use zplghe rather than zplrnt to initialize symmetric positive definite matrices; call the warmup function, as it has been validated experimentally.
- Port warmup to testing_zpoinv.
- Port warmup to QR (PTG). Looks like CUDA-QR is having some issues.
- Support warmup for GEMM: only assign a preferred device in zgemm_NN_gpu.jdf if the upper level has not assigned one, allowing users to control finely where tasks execute if they want (and the warmup process definitely wants that control).
- Update to current PaRSEC, and enable warmup in testing_zgebrd_ge2g.
- Port warmup to testing_zgelqf_hqr.
- Port testing_zgelqf_systolic.
- Fix some bugs in testing_zpotrf.c's warmup.
- Add warmup for zgetrf*, zpoinv, and zpotrf_dtd*.
- Use the same zgeqrf warmup for dtd tests.
- Use the same warmup for testing_zgemm and testing_zgemm_dtd.
- Port warmup to zgelqf.
- Add loop and warmup to testing zheev.
- Add warmup and performance measurement loop to GEQRF HQR and Systolic.
- Implement a new warmup strategy for when no known GPU implementation exists (a minimal sketch follows this list): if there is a known GPU implementation, just assume we need to warm up once per device; if there is none, iterate over the task classes and check whether a GPU implementation exists, and if so, run a warmup for each device of that type. That codepath will be skipped until someone implements a GPU version for all operations... Worst case, it will not work properly and will not break the test; best case, we will not forget to do warmup for GPU cases.
- Add the warmup/loop to ZGESVD.
- Fix GEMM warmups: GEMM uses reshaping to support ScaLAPACK + TILED data representations, and the data collection wrapper does not work well with the hack of changing the rank_of function in the source data collections. Simply do a 1D distribution of A and C over all the ranks to ensure that all processes initialize GEMM in the warmup.
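For orientation, here is a minimal, self-contained C sketch of the two-branch warmup strategy from the list above. Every type and helper in it (`device_t`, `task_class_t`, `run_warmup_on_device`, the `op_has_known_gpu_impl` flag) is a hypothetical stand-in for illustration, not the real PaRSEC/DPLASMA API:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef enum { DEV_CPU, DEV_CUDA, DEV_HIP } device_type_t;

typedef struct {
    const char   *name;
    device_type_t type;
} device_t;

typedef struct {
    const char   *name;
    device_type_t gpu_chore;   /* DEV_CPU means "no GPU implementation" */
} task_class_t;

static void run_warmup_on_device(const device_t *dev)
{
    /* Stand-in for running one small instance of the operation pinned to
     * this device, so its kernels, handles, and streams get initialized. */
    printf("warmup run on %s\n", dev->name);
}

static void warmup(const task_class_t *classes, size_t nclasses,
                   const device_t *devices, size_t ndevices,
                   bool op_has_known_gpu_impl)
{
    if (op_has_known_gpu_impl) {
        /* Known GPU implementation: assume one warmup per device suffices. */
        for (size_t d = 0; d < ndevices; d++)
            run_warmup_on_device(&devices[d]);
        return;
    }
    /* No known GPU implementation: iterate over the task classes and check
     * whether any of them carries a GPU chore; if so, warm up every device
     * of that type. Dead code today, but future GPU ports hit it for free. */
    for (size_t c = 0; c < nclasses; c++) {
        if (classes[c].gpu_chore == DEV_CPU) continue;
        for (size_t d = 0; d < ndevices; d++)
            if (devices[d].type == classes[c].gpu_chore)
                run_warmup_on_device(&devices[d]);
    }
}

int main(void)
{
    device_t     devs[] = { { "cpu0", DEV_CPU }, { "cuda0", DEV_CUDA } };
    task_class_t tcs[]  = { { "potrf_diag", DEV_CUDA }, { "trsm", DEV_CPU } };
    warmup(tcs, 2, devs, 2, false);   /* exercises the fallback branch */
    return 0;
}
```

The explicit `op_has_known_gpu_impl` flag is purely for illustration; in the PR the distinction comes from whether the operation is known to have a GPU implementation.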
This is done in #89
Signed-off-by: Aurelien Bouteiller <[email protected]>
abouteiller approved these changes Jun 22, 2023
Signed-off-by: Aurelien Bouteiller <[email protected]>
This is a second try for solving the warmup issue in DPLASMA (especially in CUDA codes).
Here are some performance measurements of the approach proposed in this PR, on Leconte (8x V100):
'gflops/avg' is the ratio of a run's GFLOP/s to the appropriate average: on runs without warmup, the average excludes the outlier; on runs with warmup, the average covers all measured points.
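As a worked definition (the notation $G_i$ and $S$ is introduced here, not in the PR): for run $i$ with measured rate $G_i$, and $S$ the set of runs included in the appropriate average,

$$\text{gflops/avg}_i = \frac{G_i}{\frac{1}{|S|}\sum_{j \in S} G_j},$$

so a value near 1 means the run matches the steady-state average, while a value well below 1 flags a cold run.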
There is still an unidentified warmup problem at large tile sizes (512, 1024): for 1 to 4 GPUs, the first actual run is still slower than the others at small problem sizes. The source of the issue is unclear at this point, but the warmup patch fixes most of the CUDA/cuBLAS warmup issues.
The goal of the current code is to cover all tests that feature both a CUDA implementation and timing: TRSM is the last kernel with a CUDA implementation, and its testers do not include timing.
Aurelien pointed out during the discussion an issue with HIP: memory allocation on the HIP device was lazy at some point, and first-touch allocation is a significant part of the warmup overhead of the HIP runs. We decided that this should be solved at the PaRSEC level, during memory allocation, and not at the DPLASMA warmup level.
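To make the first-touch point concrete, here is a minimal HIP sketch (an illustration of the idea under discussion, not the actual PaRSEC fix; `alloc_device_eager` is a hypothetical helper) that forces the lazy allocation cost to be paid at allocation time rather than inside the first timed run:

```c
#include <hip/hip_runtime.h>
#include <stddef.h>

/* Illustration only: force the HIP runtime to materialize a device
 * allocation immediately by touching it, so the first-touch cost is
 * paid here instead of inside the first timed run. */
static void *alloc_device_eager(size_t bytes)
{
    void *ptr = NULL;
    if (hipMalloc(&ptr, bytes) != hipSuccess)
        return NULL;
    /* Touch the whole allocation, then wait for it to complete. */
    if (hipMemset(ptr, 0, bytes) != hipSuccess ||
        hipDeviceSynchronize() != hipSuccess) {
        (void)hipFree(ptr);
        return NULL;
    }
    return ptr;
}
```

Paying this cost once, inside PaRSEC's allocator, keeps it out of every tester's timing loop.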