[LAPACK][CUSOLVER] Add potrf and getrs batch functions to cuSolver #209
Conversation
Remaining batch functions are implemented using the same method as the potrf and getrs functions. The getrf group batched operation is commented out and set to unimplemented due to a test segfaulting; the error appears to be related to LAPACK rather than oneMKL, and further investigation will be done on it. Test results: some non-batched tests are failing locally, and this occurs with and without this patch.
Thanks for the PR @AidanBeltonS. Can you provide more details on the non-batch failures? Setting the environment variable CTEST_OUTPUT_ON_FAILURE=1 with ctest can provide more details. Unusual errors are sometimes due to linking against Netlib libraries compiled with 32-bit integers; can you verify your reference libraries were compiled for 64-bit integers? I did see some batch failures relating to illegal memory accesses and invalid parameters passed to CUDA; log attached here. I didn't see anything immediately wrong in the parameter passing, but I'll continue to look.
I have attached the results below with ctest output on failures. I have double-checked that I built the Netlib libraries for 64-bit integers and am linking with the correct ones.
Thanks for the log. I will look into the failing tests.
@ericlars I found the issue with
I have not been able to reproduce the errors you have found; however, I have looked further into the issue of the non-batched operations failing. The problem appears to be the addition of multiple CUDA streams per SYCL queue. This likely interferes with the cuSolver scope handler, which assumes there is one stream per device per thread. It may be that the batch test failures you are seeing are also caused by the multi-stream changes. Would you be able to re-run the tests on a commit before the addition of multiple streams per queue? The DPC++ commit just before multi-streams is d149ec39e7791a2d70858a7cf10261d6353b01be. I have attached the test logs below. Commit: dd418459868a976cd2eeae367fea6b92795ea611
@AidanBeltonS My logs were from a build of llvm from 3/19, and it looks like the multistream patch was committed on 5/17, but for sanity I'll try building from your suggested commits. Thanks for the sleuthing.
GEQRF_GROUP_LAUNCHER_SCRATCH(double, cusolverDnDgetrf_bufferSize)
GEQRF_GROUP_LAUNCHER_SCRATCH(std::complex<float>, cusolverDnCgetrf_bufferSize)
GEQRF_GROUP_LAUNCHER_SCRATCH(std::complex<double>, cusolverDnZgetrf_bufferSize)
getrf -> geqrf
Found the source of the geqrf failures.
Thank you for catching the mistake! I have fixed the scratch functions.
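Presumably the fix swaps each launcher over to the geqrf scratch query rather than the getrf one; a sketch of what the corrected lines would look like (the exact macro lines are an assumption, though cusolverDnDgeqrf_bufferSize and friends are the real cuSolver geqrf scratch queries):

GEQRF_GROUP_LAUNCHER_SCRATCH(double, cusolverDnDgeqrf_bufferSize)
GEQRF_GROUP_LAUNCHER_SCRATCH(std::complex<float>, cusolverDnCgeqrf_bufferSize)
GEQRF_GROUP_LAUNCHER_SCRATCH(std::complex<double>, cusolverDnZgeqrf_bufferSize)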
Is GetrsBatchStride still failing? I found a mistake where there was a call to free in the queue submit rather than in the host_task; this was fixed in a previous commit.
New failures confirmed as resulting from multi-streams, although it is really a bug in oneMKL, because interop streams are not synced. Proposed fix here: #215
I have updated this PR to use the new functionality introduced in #215
Looks good to me!
// Creates list of matrix/vector pointers from initial ptr and stride
// Note: user is responsible for deallocating memory
template <typename T>
T **create_ptr_list_from_stride(T *ptr, int64_t ptr_stride, int64_t batch_size) {
It is introduced, but never used?
@Alexander-Kleymenov I believe this function is used in cusolver_batch.cpp in the same directory for translating the strided APIs.
ok, I see
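For context, a minimal sketch of what the body of such a helper might look like (hypothetical; the actual body is not shown in this hunk, and the plain malloc allocation is an assumption):

#include <cstdlib>

// Builds an array of batch_size per-matrix pointers from a base pointer and
// a stride, so strided-batch arguments can be handed to pointer-array APIs.
// The caller owns the returned array and must free it.
template <typename T>
T **create_ptr_list_from_stride(T *ptr, int64_t ptr_stride, int64_t batch_size) {
    T **ptr_list = static_cast<T **>(std::malloc(batch_size * sizeof(T *)));
    for (int64_t i = 0; i < batch_size; ++i)
        ptr_list[i] = ptr + i * ptr_stride; // each entry offset by the stride
    return ptr_list;
}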
/* batched helpers */

// Creates list of matrix/vector pointers from initial ptr and stride
// Note: user is responsible for deallocating memory
Looks like none of the usages follow this?
Thanks for catching this. I have updated the code to free the memory in all of the places where this helper allocates it.
cusolverStatus_t err;

// Uses scratch so sync between each cuSolver call
for (int64_t i = 0; i < batch_size; ++i) {
I don't know how strict it is, but int64_t here is used without the std namespace, which is a bit different from everywhere else, and a 32-bit int type is used rather than int32_t.
I have added the std namespace.
Regarding "a 32-bit int type is used rather than int32_t": could you clarify what you are referring to? Which 32-bit int type is not used as int32_t, and in what way?
The comment regarding the legacy API makes the use of 32-bit int a special case versus the use of int64_t, so this special case could be explicitly highlighted with the explicit fixed-width type int32_t rather than the implementation-dependent int type. In the current build environment int equals int32_t; using int32_t would be purely for code readability (highlighting the special case) and consistency (using only fixed-width integer types). This is just for your consideration.
I agree it would be better to have this explicitly stated as int32_t rather than int. I think it should be a separate PR, as the non-batched implementation also uses int in this way, so it would be out of scope of this PR. I am happy to make a follow-up PR to change this for both implementations.
sycl::event done_casting = queue.submit([&](sycl::handler &cgh) {
    cgh.depends_on(done);
    cgh.parallel_for(sycl::range<1>{ ipiv_size }, [=](sycl::id<1> index) {
        ipiv[index] = static_cast<std::int64_t>(ipiv32[index]);
The original ipiv could be passed as an int32_t *, and then the resulting 32-bit values could be expanded to 64-bit in place.
Could you explain further how the 32-bit values can be expanded to 64-bit in place? Are you suggesting using the 64-bit memory as if it were 32-bit, then sequentially shifting each value to its correct place?
Yes, this is the idea. It would reduce the amount of code (especially in the array-of-pointers case) and the runtime overhead.
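A minimal sketch of the idea (hypothetical code, not from the PR): the buffer is sized for 64-bit entries but initially holds 32-bit pivots written by the legacy call, and the widening walks backward so no 64-bit store overwrites a 32-bit value that has not yet been read. A single-threaded form keeps the aliasing simple; a parallel kernel would need the same back-to-front ordering guarantee.

#include <cstdint>

// In-place 32->64-bit widening: ipiv points at memory large enough for
// ipiv_size int64_t values, whose front currently holds int32_t pivots.
void widen_pivots_in_place(std::int64_t *ipiv, std::int64_t ipiv_size) {
    std::int32_t *ipiv32 = reinterpret_cast<std::int32_t *>(ipiv);
    // Walk backward: the 64-bit store at index i only touches bytes at or
    // beyond 32-bit entries that have already been consumed.
    for (std::int64_t i = ipiv_size - 1; i >= 0; --i)
        ipiv[i] = static_cast<std::int64_t>(ipiv32[i]);
}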
Okay, this would be an interesting and likely more efficient approach.
I think, however, it should be kept separate from this PR, as the non-batched operations use a kernel for casting between types, same as here, so this change would be out of scope of this batch PR. Additionally, the change would be made for performance reasons, so it would be good to measure whether there is any performance benefit to the expanding method before implementing it.
My thought is that it may make sense to create an issue to track this change, so a future PR can be made once performance measurements have confirmed this approach is better, and that PR can address both the batched and non-batched implementations. Would this approach work for you?
The batch case differs from the regular case in the size of the problems: the regular API is intended for larger cases, where these pre/post-call overheads are amortized by long kernel computation times. The batch case works on smaller problems, and the runtime is also fairly short (due to high parallelism allowing lots of small problems to be handled simultaneously), so any additional overhead is significant. Do you have any measurements of the corresponding CUDA API on NVIDIA hardware to compare with the performance achieved through these SYCL interfaces?
Would this approach work for you?
Would work fine.
Do you have any measurements of corresponding CUDA API on NV HW to compare it with the performance done through these SYCL interfaces?
I currently do not have any measurements comparing the performance of this backend to native cuSolver.
Issue to track future change: #230
sycl::event e = queue.submit([&](sycl::handler &cgh) {
    cgh.depends_on(done);
    cgh.parallel_for(sycl::range<1>{ ipiv_size }, [=](sycl::id<1> index) {
        d_ipiv[index] = static_cast<std::int64_t>(d_ipiv32[index]);
Why use static_cast here?
I have removed the static_cast.
for (int64_t i = 0; i < num_events; i++) {
    cgh.depends_on(casting_dependencies[i]);
}
cgh.host_task([=](sycl::interop_handle ih) {
This would not be required if ipiv were passed as an int32_t ** and converted in place later.
overflow_check(n, nrhs, lda, ldb, stride_a, stride_b, batch_size, scratchpad_size);

// cusolver function only supports nrhs = 1
cuSolver is referred to with different letter casing across the comments.
Done
// Enqueue free memory; don't return the event, as it is not necessary for the user to wait for ipiv32 being released
queue.submit([&](sycl::handler &cgh) {
    cgh.depends_on(done_casting);
    cgh.host_task([=](sycl::interop_handle ih) { sycl::free(ipiv32, queue); });
The host_task lambda here can capture the queue by reference, avoiding an object copy.
Done
// cusolver function only supports nrhs = 1
if (nrhs != 1)
    throw unimplemented("lapack", "potrs_batch", "cusolver potrs_batch only supports nrhs = 1");
If at some point the support is expanded, is there any way to bypass this check?
There is no way to check whether cuSolver supports the nrhs > 1 case. If support is expanded, this check will have to be modified or deleted depending on the circumstances at the time.
Could it be some single build-time constant or preprocessor macro, switching it in a single place? Something like static constexpr bool multiple_nrhs_support = false;?
Okay, I have wrapped the check with a preprocessor macro, so if the user defines POTRS_BATCHED_MULTIPLE_NRHS_SUPPORTED then the check will be removed.
scratch_size);
});
})
.wait();
Something happened to the code formatting in this function.
Fixed formatting
Thank you for the updates.
Looks good.
// Enqueue free memory; don't return the event, as it is not necessary for the user to wait for ipiv32 being released
queue.submit([&](sycl::handler &cgh) {
    cgh.depends_on(done_casting);
    cgh.host_task([&](sycl::interop_handle ih) { sycl::free(ipiv32, queue); });
This will lead to sporadic crashes. Passing the queue by reference makes sense, as the queue object is not going to be destroyed before the host_task lambda is finished, but the ipiv32 variable, which is automatic (stack-allocated), can be destroyed by the time of host_task execution and contain some other data, so freeing it will crash. Use [=, &queue].
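A sketch of the suggested mixed capture (assuming the same surrounding code as the hunk above): the ipiv32 pointer is copied by value, while only the long-lived queue is captured by reference.

// Capture everything by value except the queue, which outlives the host_task.
cgh.host_task([=, &queue](sycl::interop_handle ih) { sycl::free(ipiv32, queue); });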
I have reverted this change.
overflow_check(n, nrhs, lda, ldb, stride_a, stride_b, batch_size, scratchpad_size);

// cuSolver function only supports nrhs = 1
#ifndef POTRS_BATCHED_MULTIPLE_NRHS_SUPPORTED
There is a C++ way of doing this that does not involve a preprocessor macro: if constexpr (not cusolver_batched_potrs_supports_multiple_nrhs) { if (nrhs != 1) { throw } }.
Here it is also not clear from the code whether it is supported or not without an explicit definition.
The comment on line 346 got detached; better to remove it, as the code is quite self-documenting here.
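A minimal sketch of this suggestion (the constant name comes from the comment above; the surrounding context is assumed):

// File-scope compile-time constant; leaves no trace in the binary.
static constexpr bool cusolver_batched_potrs_supports_multiple_nrhs = false;

// Inside potrs_batch:
if constexpr (!cusolver_batched_potrs_supports_multiple_nrhs) {
    if (nrhs != 1)
        throw unimplemented("lapack", "potrs_batch",
                            "cusolver potrs_batch only supports nrhs = 1");
}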
I think the issue with this approach is: where is cusolver_batched_potrs_supports_multiple_nrhs going to reside? If it is to be set in a single place, it would have to be a global variable, which is undesirable, as each of these is a free function.
Personally, I think having the check done this way makes sense, as we are using the legacy API. It seems very unlikely that NVIDIA will come out with an update eliminating this constraint for an API they are probably going to deprecate at some point, so any support for nrhs > 1 would require a change to the code, and the check can simply be deleted at that time. However, I still do not have any issue with adding a way to bypass it; I am just noting that it is unlikely to be needed.
I think the issue with this approach is where is cusolver_batched_potrs_supports_multiple_nrhs going to reside? If this is to be set in a single place it would have to be a global variable, which is undesirable, as each of these are free functions.
This is not a variable but a constant, leaving no trace in the binary. It could be declared at the beginning of the source file.
It seems very unlikely that NVIDIA will come out with an update eliminating this constraint for an API they are probably going to deprecate at some point.
This is a good point, so we can omit the bypass on this expectation.
The documentation explicitly states "The routine will be removed in the next major release." Why are we using it here if it won't work in a year?
I have removed the macro.
The reason for using the legacy API is that the new API is not yet fleshed out; there would be a lot of missing functions. I would expect that when the legacy API is deprecated, they will also bring in more supported operations for the new API.
The reason to use the legacy API is that the new API is not yet fleshed out.
I skimmed over the spec, and all APIs that have deprecation notices also have a proposed new API to use. What is missing?
If you take a look at the number of supported LAPACK operations, the 64-bit API does not support nearly as many. For example, the 64-bit dense eigenvalue solver section only really contains two LAPACK operations, while the legacy API has many more.
Legacy API eigenvalue solvers: https://docs.nvidia.com/cuda/cusolver/index.html#cuSolverDN-eigensolver-reference
64-bit API eigenvalue solvers: https://docs.nvidia.com/cuda/cusolver/index.html#cuSolverDN-eigensolver-reference-64bit
Isn't this update only concerned with potrs and getrs, which are being deprecated and have a new API proposed?
// cuSolver function only supports nrhs = 1
if (nrhs != 1)
    throw unimplemented("lapack", "potrs_batch", "cusolver potrs_batch only supports nrhs = 1");
BTW, I don't see any deprecation notices for cusolverDnDpotrsBatched
The legacy API functions are not "deprecated" functions; they just follow an old interface, have fewer features, and will be made deprecated in the future.
auto ipiv32_acc = ipiv32.template get_access<sycl::access::mode::write>(cgh);
auto ipiv_acc = ipiv.template get_access<sycl::access::mode::read>(cgh);
cgh.parallel_for(sycl::range<1>{ ipiv_size }, [=](sycl::id<1> index) {
    ipiv32_acc[index] = static_cast<std::int32_t>(ipiv_acc[index]);
Do you think it makes sense to have overflow checks for the integer arrays, same as on line 114?
I think we can leave this without checks. While it would be nice to provide the user with more error information, if the user provided valid dimensions but then requests out-of-bound ipiv values, that is more of an issue on their end.
There is also the downside of having to check every ipiv value across the batches and then copy the results back to the host to decide whether or not to throw an exception. That means more memory being copied from device to host, and the host cannot simply queue up as much asynchronous work as possible and move on.
// Create new buffer with 32-bit ints then copy over results
std::uint64_t ipiv_size = stride_ipiv * batch_size;
sycl::buffer<int, 1> ipiv32(sycl::range<1>{ ipiv_size });
sycl::buffer<int> devInfo{ batch_size };
sycl::buffer<int> is equivalent to sycl::buffer<int, 1>.
Thanks for catching this, I have removed the dim value.
Thanks for the updates. Looks good!
@ericlars It looks like we have approvals for this. Could this be merged now, or are there any further discussions that need to be resolved?
Description
This PR extends potrf_batch to additional overloads. Additionally, it implements all getrs_batch overloads. This change fully implements both potrf_batch and getrs_batch for USM and buffers, for both group and strided batch operations. In neither case does cuSolver have a direct equivalent to the oneMKL function, so in both cases some manipulation has to occur to make things work.
potrf_batch implements the strided batch operation with the cuSolver group batch operation. This should be quite efficient, as it is just reformatting the input. getrs_batch is implemented using non-batched functions, so it just loops over the cuSolver function for each matrix in the batch (a sketch of this pattern follows below). The performance of this would not be significantly different from the user looping over the oneMKL function themselves. I believe this is okay, as it provides additional convenience and portability between backends for the user.
The motivation for this change is to add greater functionality to the cuSolver backend and to improve portability between cuSolver and other LAPACK backends. Additionally, this provides a framework for further implementations of missing batch functions which do not align well with cuSolver functions.
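As a rough illustration of the looping approach (a hypothetical sketch under assumed names, not the PR's literal code; cusolverDnDgetrs is the real non-batched cuSolver solve for double precision):

// Strided getrs_batch via repeated non-batched calls inside the interop task.
// handle, trans, a, b, ipiv32, devInfo, and the strides are assumed to be set
// up as in the surrounding hunks; every call shares the handle's stream, so
// the per-matrix solves execute in order on the device.
for (int64_t i = 0; i < batch_size; ++i) {
    cusolverStatus_t err =
        cusolverDnDgetrs(handle, trans, n, nrhs,
                         a + i * stride_a, lda,
                         ipiv32 + i * stride_ipiv,
                         b + i * stride_b, ldb,
                         devInfo + i); // error handling omitted in this sketch
}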
Edit: a follow-up commit implements the remaining batched functions, with the exception of getrf group batched, due to an issue with its test.
Test logs:
potrf_and_getrs_results.txt