[LAPACK][CUSOLVER] Add potrf and getrs batch functions to cuSolver #209
Conversation
Remaining batch functions are implemented using the same method as the potrf and getrs functions. The getrf group batched operation is commented out and set to unimplemented due to a test segfaulting; the error appears to be related to LAPACK rather than oneMKL, and further investigation will be done on it. Test results: some non-batched tests are failing locally, and this occurs with and without this patch.
Thanks for the PR @AidanBeltonS. Can you provide more details on the non-batch failures? Setting the environment variable CTEST_OUTPUT_ON_FAILURE=1 with ctest can provide more details. Unusual errors are sometimes due to linking against Netlib libraries compiled with 32-bit integers; can you verify your reference libraries were compiled for 64-bit integers? I did see some batch failures relating to illegal memory accesses and invalid parameters passed to CUDA; log attached here. I didn't see anything immediately wrong in the parameter passing, but I'll continue to look.
I have attached the results below with ctest output on failures. I have double-checked that I built the Netlib libraries for 64-bit integers and am linking with the correct ones.
Thanks for the log. I will look into the failing tests.
@ericlars I found the issue with
I have not been able to reproduce the errors you have found; however, I have looked further into the issue of the non-batched operations failing. The problem appears to be the addition of multiple CUDA streams per SYCL queue. This likely interferes with the cuSolver scope handler, which assumes there is one stream per device per thread. It may be that the batch test failures you are seeing are also caused by the multi-stream changes. Would you be able to re-run the tests on a commit before the addition of multiple streams per queue? The DPC++ commit just before multi-streams is d149ec39e7791a2d70858a7cf10261d6353b01be. I have attached the test logs below. Commit: dd418459868a976cd2eeae367fea6b92795ea611
@AidanBeltonS My logs were from a build of llvm from 3/19, and it looks like the multistream patch was committed on 5/17, but for sanity I'll try building from your suggested commits. Thanks for the sleuthing.
GEQRF_GROUP_LAUNCHER_SCRATCH(double, cusolverDnDgetrf_bufferSize)
GEQRF_GROUP_LAUNCHER_SCRATCH(std::complex<float>, cusolverDnCgetrf_bufferSize)
GEQRF_GROUP_LAUNCHER_SCRATCH(std::complex<double>, cusolverDnZgetrf_bufferSize)
getrf -> geqrf
Found the source of the geqrf failures.
Thank you for catching the mistake! I have fixed the scratch functions.
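Presumably the fix swaps each launcher over to the geqrf scratch query rather than the getrf one; a sketch of what the corrected lines would look like (the exact macro lines are an assumption, though cusolverDnDgeqrf_bufferSize and friends are the real cuSolver geqrf scratch queries):

GEQRF_GROUP_LAUNCHER_SCRATCH(double, cusolverDnDgeqrf_bufferSize)
GEQRF_GROUP_LAUNCHER_SCRATCH(std::complex<float>, cusolverDnCgeqrf_bufferSize)
GEQRF_GROUP_LAUNCHER_SCRATCH(std::complex<double>, cusolverDnZgeqrf_bufferSize)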
Is GetrsBatchStride still failing? I found a mistake where there was a call to free in the queue submit rather than in the host_task; this was fixed in a previous commit.
New failures confirmed as resulting from multi-streams, although it is really a bug in oneMKL, because interop streams are not synced. Proposed fix here: #215
I have updated this PR to use the new functionality introduced in #215
Looks good to me!
// Creates list of matrix/vector pointers from initial ptr and stride
// Note: user is responsible for deallocating memory
template <typename T>
T **create_ptr_list_from_stride(T *ptr, int64_t ptr_stride, int64_t batch_size) {
It is introduced, but never used?
@Alexander-Kleymenov I believe this function is used in cusolver_batch.cpp in the same directory for translating the strided APIs.
ok, I see
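For context, a minimal sketch of what the body of such a helper might look like (hypothetical; the actual body is not shown in this hunk, and the plain malloc allocation is an assumption):

#include <cstdlib>

// Builds an array of batch_size per-matrix pointers from a base pointer and
// a stride, so strided-batch arguments can be handed to pointer-array APIs.
// The caller owns the returned array and must free it.
template <typename T>
T **create_ptr_list_from_stride(T *ptr, int64_t ptr_stride, int64_t batch_size) {
    T **ptr_list = static_cast<T **>(std::malloc(batch_size * sizeof(T *)));
    for (int64_t i = 0; i < batch_size; ++i)
        ptr_list[i] = ptr + i * ptr_stride; // each entry offset by the stride
    return ptr_list;
}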
/* batched helpers */

// Creates list of matrix/vector pointers from initial ptr and stride
// Note: user is responsible for deallocating memory
Looks like none of the usages follow this?
Thanks for catching this. I have updated the code to free the memory in all of the places where this helper allocates it.
cusolverStatus_t err;

// Uses scratch so sync between each cuSolver call
for (int64_t i = 0; i < batch_size; ++i) {
I don't know how strict it is, but int64_t here is used without the std namespace, which is a bit different from everywhere else, and a 32-bit int type is used rather than int32_t.
I have added the std namespace.
Regarding "a 32-bit int type is used rather than int32_t": could you clarify what you are referring to? Which 32-bit int type is not used as int32_t, and in what way?
The comment regarding the legacy API makes the use of 32-bit int a special case versus the use of int64_t, so this special case could be explicitly highlighted with the explicit fixed-width type int32_t rather than the implementation-dependent int type. In the current build environment int equals int32_t; using int32_t would be purely for code readability (highlighting the special case) and consistency (using only fixed-width integer types). This is just for your consideration.
I agree it would be better to have this explicitly stated as int32_t rather than int. I think it should be a separate PR, as the non-batched implementation also uses int in this way, so it would be out of scope of this PR. I am happy to make a follow-up PR to change this for both implementations.
sycl::event done_casting = queue.submit([&](sycl::handler &cgh) {
    cgh.depends_on(done);
    cgh.parallel_for(sycl::range<1>{ ipiv_size }, [=](sycl::id<1> index) {
        ipiv[index] = static_cast<std::int64_t>(ipiv32[index]);
The original ipiv could be passed as an int32_t *, and then the resulting 32-bit values could be expanded to 64-bit in place.
Could you explain further how the 32-bit values can be expanded to 64-bit in place? Are you suggesting using the 64-bit memory as if it were 32-bit, then sequentially shifting each value to its correct place?
Yes, this is the idea. It would reduce the amount of code (especially in the array-of-pointers case) and the runtime overhead.
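A minimal sketch of the idea (hypothetical code, not from the PR): the buffer is sized for 64-bit entries but initially holds 32-bit pivots written by the legacy call, and the widening walks backward so no 64-bit store overwrites a 32-bit value that has not yet been read. A single-threaded form keeps the aliasing simple; a parallel kernel would need the same back-to-front ordering guarantee.

#include <cstdint>

// In-place 32->64-bit widening: ipiv points at memory large enough for
// ipiv_size int64_t values, whose front currently holds int32_t pivots.
void widen_pivots_in_place(std::int64_t *ipiv, std::int64_t ipiv_size) {
    std::int32_t *ipiv32 = reinterpret_cast<std::int32_t *>(ipiv);
    // Walk backward: the 64-bit store at index i only touches bytes at or
    // beyond 32-bit entries that have already been consumed.
    for (std::int64_t i = ipiv_size - 1; i >= 0; --i)
        ipiv[i] = static_cast<std::int64_t>(ipiv32[i]);
}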
Okay, this would be an interesting and likely more efficient approach.
I think, however, it should be kept separate from this PR, as the non-batched operations use a kernel for casting between types, same as here, so this change would be out of scope of this batch PR. Additionally, the change would be made for performance reasons, so it would be good to measure whether there is any performance benefit to the expanding method before implementing it.
My thought is that it may make sense to create an issue to track this change, so a future PR can be made once performance measurements have confirmed this approach is better, and that PR can address both the batched and non-batched implementations. Would this approach work for you?
The batch case differs from the regular case in the size of the problems: the regular API is intended for larger cases, where these pre/post-call overheads are amortized by long kernel computation times. The batch case works on smaller problems, and the runtime is also fairly short (due to high parallelism allowing lots of small problems to be handled simultaneously), so any additional overhead is significant. Do you have any measurements of the corresponding CUDA API on NVIDIA hardware to compare with the performance achieved through these SYCL interfaces?
Would this approach work for you?
Would work fine.
Do you have any measurements of corresponding CUDA API on NV HW to compare it with the performance done through these SYCL interfaces?
I currently do not have any measurements comparing the performance of this backend to native cuSolver.
Issue to track future change: #230
sycl::event e = queue.submit([&](sycl::handler &cgh) {
    cgh.depends_on(done);
    cgh.parallel_for(sycl::range<1>{ ipiv_size }, [=](sycl::id<1> index) {
        d_ipiv[index] = static_cast<std::int64_t>(d_ipiv32[index]);
Why use static_cast here?
I have removed the static_cast.
for (int64_t i = 0; i < num_events; i++) {
    cgh.depends_on(casting_dependencies[i]);
}
cgh.host_task([=](sycl::interop_handle ih) {
This would not be required if ipiv were passed as an int32_t ** and converted in place later.
overflow_check(n, nrhs, lda, ldb, stride_a, stride_b, batch_size, scratchpad_size);

// cusolver function only supports nrhs = 1
cuSolver is referred to with different letter casing across the comments.
Done
// Enqueue free memory; don't return the event, as it is not necessary for the user to wait for ipiv32 being released
queue.submit([&](sycl::handler &cgh) {
    cgh.depends_on(done_casting);
    cgh.host_task([=](sycl::interop_handle ih) { sycl::free(ipiv32, queue); });
The host_task lambda here can capture the queue by reference, avoiding an object copy.
Done
// cusolver function only supports nrhs = 1
if (nrhs != 1)
    throw unimplemented("lapack", "potrs_batch", "cusolver potrs_batch only supports nrhs = 1");
If at some point the support is expanded, is there any way to bypass this check?
There is no way to check whether cuSolver supports the nrhs > 1 case. If support is expanded, this check will have to be modified or deleted depending on the circumstances at the time.
Could it be some single build-time constant or preprocessor macro, switching it in a single place? Something like static constexpr bool multiple_nrhs_support = false;?
Okay, I have wrapped the check with a preprocessor macro, so if the user defines POTRS_BATCHED_MULTIPLE_NRHS_SUPPORTED then the check will be removed.
scratch_size);
});
})
.wait();
Something happened to the code formatting in this function.
Fixed formatting
Thank you for the updates.
Looks good.
// Enqueue free memory; don't return the event, as it is not necessary for the user to wait for ipiv32 being released
queue.submit([&](sycl::handler &cgh) {
    cgh.depends_on(done_casting);
    cgh.host_task([&](sycl::interop_handle ih) { sycl::free(ipiv32, queue); });
This will lead to sporadic crashes. Passing the queue by reference makes sense, as the queue object is not going to be destroyed before the host_task lambda is finished, but the ipiv32 variable, which is automatic (stack-allocated), can be destroyed by the time of host_task execution and contain some other data, so freeing it will crash. Use [=, &queue].
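A sketch of the suggested mixed capture (assuming the same surrounding code as the hunk above): the ipiv32 pointer is copied by value, while only the long-lived queue is captured by reference.

// Capture everything by value except the queue, which outlives the host_task.
cgh.host_task([=, &queue](sycl::interop_handle ih) { sycl::free(ipiv32, queue); });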
I have reverted this change.
overflow_check(n, nrhs, lda, ldb, stride_a, stride_b, batch_size, scratchpad_size);

// cuSolver function only supports nrhs = 1
#ifndef POTRS_BATCHED_MULTIPLE_NRHS_SUPPORTED
There is a C++ way of doing this that does not involve a preprocessor macro: if constexpr (not cusolver_batched_potrs_supports_multiple_nrhs) { if (nrhs != 1) { throw } }.
Here it is also not clear from the code whether it is supported or not without an explicit definition.
The comment on line 346 got detached; better to remove it, as the code is quite self-documenting here.
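A minimal sketch of this suggestion (the constant name comes from the comment above; the surrounding context is assumed):

// File-scope compile-time constant; leaves no trace in the binary.
static constexpr bool cusolver_batched_potrs_supports_multiple_nrhs = false;

// Inside potrs_batch:
if constexpr (!cusolver_batched_potrs_supports_multiple_nrhs) {
    if (nrhs != 1)
        throw unimplemented("lapack", "potrs_batch",
                            "cusolver potrs_batch only supports nrhs = 1");
}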
I think the issue with this approach is: where is cusolver_batched_potrs_supports_multiple_nrhs going to reside? If it is to be set in a single place, it would have to be a global variable, which is undesirable, as each of these is a free function.
Personally, I think having the check done this way makes sense, as we are using the legacy API. It seems very unlikely that NVIDIA will come out with an update eliminating this constraint for an API they are probably going to deprecate at some point, so any support for nrhs > 1 would require a change to the code, and the check can simply be deleted at that time. However, I still do not have any issue with adding a way to bypass it; I am just noting that it is unlikely to be needed.
I think the issue with this approach is where is cusolver_batched_potrs_supports_multiple_nrhs going to reside? If this is to be set in a single place it would have to be a global variable, which is undesirable, as each of these are free functions.
This is not a variable but a constant, leaving no trace in the binary. It could be declared at the beginning of the source file.
It seems very unlikely that NVIDIA will come out with an update eliminating this constraint for an API they are probably going to deprecate at some point.
This is a good point, so we can omit the bypass on this expectation.
The documentation explicitly states "The routine will be removed in the next major release." Why are we using it here if it won't work in a year?
I have removed the macro.
The reason for using the legacy API is that the new API is not yet fleshed out; there would be a lot of missing functions. I would expect that when the legacy API is deprecated, they will also bring in more supported operations for the new API.
The reason to use the legacy API is that the new API is not yet fleshed out.
I skimmed over the spec, and all APIs that have deprecation notices also have a proposed new API to use. What is missing?
If you take a look at the number of supported LAPACK operations, the 64-bit API does not support nearly as many. For example, the 64-bit dense eigenvalue solver section only really contains two LAPACK operations, while the legacy API has many more.
Legacy API eigenvalue solvers: https://docs.nvidia.com/cuda/cusolver/index.html#cuSolverDN-eigensolver-reference
64-bit API eigenvalue solvers: https://docs.nvidia.com/cuda/cusolver/index.html#cuSolverDN-eigensolver-reference-64bit
Isn't this update only concerned with potrs and getrs, which are being deprecated and have a new API proposed?
// cuSolver function only supports nrhs = 1
if (nrhs != 1)
    throw unimplemented("lapack", "potrs_batch", "cusolver potrs_batch only supports nrhs = 1");
BTW, I don't see any deprecation notices for cusolverDnDpotrsBatched
The legacy API functions are not "deprecated" functions; they just follow an old interface, have fewer features, and will be made deprecated in the future.
auto ipiv32_acc = ipiv32.template get_access<sycl::access::mode::write>(cgh);
auto ipiv_acc = ipiv.template get_access<sycl::access::mode::read>(cgh);
cgh.parallel_for(sycl::range<1>{ ipiv_size }, [=](sycl::id<1> index) {
    ipiv32_acc[index] = static_cast<std::int32_t>(ipiv_acc[index]);
Do you think it makes sense to have overflow checks for the integer arrays, same as on line 114?
I think we can leave this without checks. While it would be nice to provide the user with more error information, if the user provided valid dimensions but then requests out-of-bound ipiv values, that is more of an issue on their end.
There is also the downside of having to check every ipiv value across the batches and then copy the results back to the host to decide whether or not to throw an exception. That means more memory being copied from device to host, and the host cannot simply queue up as much asynchronous work as possible and move on.
// Create new buffer with 32-bit ints then copy over results
std::uint64_t ipiv_size = stride_ipiv * batch_size;
sycl::buffer<int, 1> ipiv32(sycl::range<1>{ ipiv_size });
sycl::buffer<int> devInfo{ batch_size };
sycl::buffer<int> is equivalent to sycl::buffer<int, 1>.
Thanks for catching this, I have removed the dim value.
Thanks for the updates. Looks good!
@ericlars It looks like we have approvals for this. Could this be merged now, or are there any further discussions that need to be resolved?
Description
This PR extends potrf_batch to additional overloads. Additionally, it implements all getrs_batch overloads. This change fully implements both potrf_batch and getrs_batch for USM and buffers, for both group and strided batch operations. In neither case does cuSolver have a direct equivalent to the oneMKL function, so in both cases some manipulation has to occur to make things work.
potrf_batch implements the strided batch operation with the cuSolver group batch operation. This should be quite efficient, as it is just reformatting the input. getrs_batch is implemented using non-batched functions, so it just loops over the cuSolver function for each matrix in the batch (a sketch of this pattern follows below). The performance of this would not be significantly different from the user looping over the oneMKL function themselves. I believe this is okay, as it provides additional convenience and portability between backends for the user.
The motivation for this change is to add greater functionality to the cuSolver backend and to improve portability between cuSolver and other LAPACK backends. Additionally, this provides a framework for further implementations of missing batch functions which do not align well with cuSolver functions.
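As a rough illustration of the looping approach (a hypothetical sketch under assumed names, not the PR's literal code; cusolverDnDgetrs is the real non-batched cuSolver solve for double precision):

// Strided getrs_batch via repeated non-batched calls inside the interop task.
// handle, trans, a, b, ipiv32, devInfo, and the strides are assumed to be set
// up as in the surrounding hunks; every call shares the handle's stream, so
// the per-matrix solves execute in order on the device.
for (int64_t i = 0; i < batch_size; ++i) {
    cusolverStatus_t err =
        cusolverDnDgetrs(handle, trans, n, nrhs,
                         a + i * stride_a, lda,
                         ipiv32 + i * stride_ipiv,
                         b + i * stride_b, ldb,
                         devInfo + i); // error handling omitted in this sketch
}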
Edit: a follow-up commit implements the remaining batched functions, with the exception of getrf group batched, due to an issue with its test.
Test logs:
potrf_and_getrs_results.txt