Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LAPACK][CUSOLVER] Add potrf and getrs batch functions to cuSolver #209

Merged
merged 11 commits into from
Oct 6, 2022

Conversation

AidanBeltonS
Copy link
Contributor

@AidanBeltonS AidanBeltonS commented Jun 17, 2022

Description

This PR extends potrf_batch to additional overloads. Additionally it implements all getrs_batch overloads. This change fully implements both potrf_batch and getrs_batch for USM and Buffer for both group and strided batch operations.
In neither case does cuSolver have a direct equivalent to the oneMKL function.

So in both cases some manipulation has to occur to make things work. potrf_batch implements the strided batch operation with the cuSolver group batch operation. This should be quite efficient, as it is just reformatting the input. getrs_batch is implemented using non-batched functions so this just loops over the cuSolver function for each matrix in the batch. The performance of this would not be significantly different than the user just looping over the oneMKL function themselves. This I believe this is okay, as it would provide additional convenience and portability between backends to the user.

The motivation for this change is to add greater functionality to the cuSolver backend and to improve portability between cuSolver and other lapack backends. Additionally this would provide a framework for further implementations of missing batch functions which do not align well with cuSolver functions.

Edit: Follow up commit implements remaining batched functions, with the exception of getrf group batched due to an issue with test.

Test logs:
potrf_and_getrs_results.txt

@AidanBeltonS
Copy link
Contributor Author

Remaining batch functions implemented using same method as potrf and getrs functions.

Getrf group batched operation commented out and set to unimplemented due to test segfaulting. Error appears to not be related to oneMKL but lapack. Further investigation will be done on it.

Test results: Some non-batched tests are failing locally, this occurs with and without this patch.
test_result.txt

@ericlars
Copy link
Contributor

Thanks for the PR @AidanBeltonS

Can you provide more details on the non batch failures? Setting the environment variable CTEST_OUTPUT_ON_FAILURE=1 with ctest can provide more details. Unusual errors are sometimes due to linking against Netlib libraries compiled with 32 bit integers, can you verify your reference libraries were compiled for 64bit integers?

I did see some batch failures relating to illegal memory accesses and invalid parameters passed to cuda. Log attached here. I didn't see anything immediately wrong in the parameter passing but I'll continue to look.

@AidanBeltonS
Copy link
Contributor Author

Can you provide more details on the non batch failures? Setting the environment variable CTEST_OUTPUT_ON_FAILURE=1 with ctest can provide more details. Unusual errors are sometimes due to linking against Netlib libraries compiled with 32 bit integers, can you verify your reference libraries were compiled for 64bit integers?

I have attached the results below with ctest output on failures. I have double checked I have built the netlib libraries for 64bits and am linking with the correct ones.

results.txt

I did see some batch failures relating to illegal memory accesses and invalid parameters passed to cuda. Log attached here. I didn't see anything immediately wrong in the parameter passing but I'll continue to look.

Thanks for the log. I will look into the failing tests.

@AidanBeltonS
Copy link
Contributor Author

@ericlars I found the issue with getrf_batch_group the problem was with my implementation, which was causing an odd failure in the reference lapack check. This has been resolved.

I did see some batch failures relating to illegal memory accesses and invalid parameters passed to cuda. Log attached here. I didn't see anything immediately wrong in the parameter passing but I'll continue to look.

I have not been able to reproduce the errors you have found however I have looked further into the issue of non-batched operations failing. The problem appears to be with the addition of multiple CUDA streams per SYCL queue. This likely messes with the cusolver scope handler which assumes there is one stream per device per thread. It may be that the batch test failures you are seeing are also caused by the multistream changes. Would you be able to re-run the tests on a commit before the addition of multiple streams per queue? The DPCPP commit just before multi-streams is: d149ec39e7791a2d70858a7cf10261d6353b01be

I have attached the test logs below.
Commit: d149ec39e7791a2d70858a7cf10261d6353b01be
results-no-multistream.txt

Commit: dd418459868a976cd2eeae367fea6b92795ea611
results-with-multistream.txt

@ericlars
Copy link
Contributor

@AidanBeltonS My logs were from a build of llvm from 3/19, and it looks like the multistream patch was committed on 5/17, but for sanity I'll try building from your suggested commits. Thanks for the sleuthing.

GEQRF_GROUP_LAUNCHER_SCRATCH(double, cusolverDnDgetrf_bufferSize)
GEQRF_GROUP_LAUNCHER_SCRATCH(std::complex<float>, cusolverDnCgetrf_bufferSize)
GEQRF_GROUP_LAUNCHER_SCRATCH(std::complex<double>, cusolverDnZgetrf_bufferSize)

Copy link
Contributor

@ericlars ericlars Jul 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getrf -> geqrf

Found the source for geqrf failures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for catching the mistake! I have fixed the scratch function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is GetrsBatchStride still failing? I found a mistake where there was a call to free in the queue submit and not the host_task, this was fixed in previous commit.

@JackAKirk
Copy link
Contributor

@AidanBeltonS My logs were from a build of llvm from 3/19, and it looks like the multistream patch was committed on 5/17, but for sanity I'll try building from your suggested commits. Thanks for the sleuthing.

New failures confirmed as resulting from multi-streams: although it is really a bug in oneMKL because interop streams are not synced. Proposed fix here: #215

@AerialMantis
Copy link

@ericlars now that #215 has been merged the issues resulting from the multiple CUDA stream implementation in DPC++ should be resolved, would you be able to review this again?

@AidanBeltonS
Copy link
Contributor Author

I have updated this PR to use the new functionality introduced in #215

@ericlars
Copy link
Contributor

Looks good to me!

log_llvm_cusolver_.txt

// Creates list of matrix/vector pointers from initial ptr and stride
// Note: user is responsible for deallocating memory
template <typename T>
T **create_ptr_list_from_stride(T *ptr, int64_t ptr_stride, int64_t batch_size) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is introduced, but never used?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Alexander-Kleymenov I believe this func is used in cusolver_batch.cpp in the same directory for translating the strided APIs.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I see

/* batched helpers */

// Creates list of matrix/vector pointers from initial ptr and stride
// Note: user is responsible for deallocating memory

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like none of the usages follow this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this. I have updated this to free all the instances where this allocates memory

cusolverStatus_t err;

// Uses scratch so sync between each cuSolver call
for (int64_t i = 0; i < batch_size; ++i) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how strict it is, but int64_t here is used without std namespace, a bit different vs everywhere else, and 32 bit int type is used not as int32_t.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the std namespace.

and 32 bit int type is used not as int32_t.

Could you clarify what you are referring to, which 32 bit int type is not used as a int32_t and in what way?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment regarding legacy API makes use of 32-bit int a special case vs use of int64_t, so this special case could be explicitly highlighted with explicit fixed width type int32_t, not implementation dependent int type. In the current building environment int equals to int32_t. Use of int32_t would be just for code readability (highlight special case) and consistency (use only fixed-width integer types). This is just for your consideration.

Copy link
Contributor Author

@AidanBeltonS AidanBeltonS Sep 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it would be better to have this as explicitly stated as int32_t rather than int. I think it should be a separate PR as the non-batched implementation also uses int in this way so it would be out of scope of this PR. I am happy to make a follow up PR to change this for both implementations.

sycl::event done_casting = queue.submit([&](sycl::handler &cgh) {
cgh.depends_on(done);
cgh.parallel_for(sycl::range<1>{ ipiv_size }, [=](sycl::id<1> index) {
ipiv[index] = static_cast<std::int64_t>(ipiv32[index]);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Original ipiv could be passed as *int32_t, and then resulting 32-bit values could be expanded to 64-bit (in-place).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain further how the 32-bit values can be expanded to 64-bit in place? Are you suggesting use the 64-bit memory as if it were 32-bit then sequentially shift each value to the correct place?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is the idea. It would reduce amount of code (especially in array-of-pointers case), and runtime overhead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this would be an interested and likely more efficient approach.

I think however it should be separated to this PR as the non-batched operations use a kernel for casting between types, same as here, so this change would be out of scope of this batch PR. Additionally, the change would be done for performance reasons so it would be good to measure if there is any performance benefit to the expanding method before implementing.

My thought is that it may make sense to create an issue to track this change so a future PR can be made once performance measurements have confirmed this approach is better and the PR can address both batched and non-batched implementations. Would this approach work for you?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Batch case is different from regular by the size of problems: regular API is intended for larger cases, and this pre/post call overheads amortized by long kernel computations time. Batch case works on smaller problems, and run-time is also pretty short (due to high parallelism allowing to handle lots of small problems simultaneously). So any additional overhead is significant. Do you have any measurements of corresponding CUDA API on NV HW to compare it with the performance done through these SYCL interfaces?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this approach work for you?

Would work fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any measurements of corresponding CUDA API on NV HW to compare it with the performance done through these SYCL interfaces?

I currently do not have any measurements to compare the performance of this backend to native cuSolver

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue to track future change: #230

sycl::event e = queue.submit([&](sycl::handler &cgh) {
cgh.depends_on(done);
cgh.parallel_for(sycl::range<1>{ ipiv_size }, [=](sycl::id<1> index) {
d_ipiv[index] = static_cast<std::int64_t>(d_ipiv32[index]);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why using static_cast here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed the static_cast

for (int64_t i = 0; i < num_events; i++) {
cgh.depends_on(casting_dependencies[i]);
}
cgh.host_task([=](sycl::interop_handle ih) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would not be required if ipiv was passed as **int32_t, and converted in-pace later,


overflow_check(n, nrhs, lda, ldb, stride_a, stride_b, batch_size, scratchpad_size);

// cusolver function only supports nrhs = 1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cuSolver is referred with different letter casing across the comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// Enqueue free memory, don't return event as not-neccessary for user to wait for ipiv32 being released
queue.submit([&](sycl::handler &cgh) {
cgh.depends_on(done_casting);
cgh.host_task([=](sycl::interop_handle ih) { sycl::free(ipiv32, queue); });

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

host_task lambda here can capture queue by reference, avoiding object copying.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


// cusolver function only supports nrhs = 1
if (nrhs != 1)
throw unimplemented("lapack", "potrs_batch", "cusolver potrs_batch only supports nrhs = 1");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if at some point the support will be expanded, is there any way to bypass this check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no way to check if cuSolver supports this nrhs > 1 case. If support is expanded this check will have to be modified or deleted depending on the circumstances at the time.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be some single build-time constant, or preprocessor macro, switching it in single place. Like static constexpr bool multiple_nrhs_support = false; ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I have wrapped the check with a macros preprocessor, so if the user defines POTRS_BATCHED_MULTIPLE_NRHS_SUPPORTED then the check will be removed.

scratch_size);
});
})
.wait();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something happened to code formatting in this function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed formatting

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the updates.

Copy link

@Alexander-Kleymenov Alexander-Kleymenov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

// Enqueue free memory, don't return event as not-neccessary for user to wait for ipiv32 being released
queue.submit([&](sycl::handler &cgh) {
cgh.depends_on(done_casting);
cgh.host_task([&](sycl::interop_handle ih) { sycl::free(ipiv32, queue); });

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will lead to sporadic crashes: passing queue by reference makes sense as queue object is not going to be destroyed before host_tasl lambda is finished, but ipiv32 variable, which is automatic stack-allocated, can be destroyed by the time of host_task execution, and contain some other data, so free it will crash. [=,&queue]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reverted this change

overflow_check(n, nrhs, lda, ldb, stride_a, stride_b, batch_size, scratchpad_size);

// cuSolver function only supports nrhs = 1
#ifndef POTRS_BATCHED_MULTIPLE_NRHS_SUPPORTED

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is C++ way of doing this, not involving preprocessor's macro (if constexpr (not cusolver_batched_potrs_supports_multiple_nrhs) { if (nrhs != 1) { throw } }).
Here it is also not clear from the code if it is supported, or not, without explicit definition.
Comment on line 346 got detached. Better to remove it as the code is quite self-documenting here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue with this approach is where is cusolver_batched_potrs_supports_multiple_nrhs going to reside? If this is to be set in a single place it would have to be a global variable, which is undesirable, as each of these are free functions.

Personally I think having this check done this way makes sense as we are using the legacy API. It seems very unlikely that NVIDIA will come out with an update eliminating this constraint for an API they are probably going to deprecate at some point. So any support to allow nrhs > 1 would require a change to the code and therefore the check can be simply deleted at that time. However, I still do not have any issues with making a way to by pass it. Just noting that it is unlikely to be needed.

Copy link

@Alexander-Kleymenov Alexander-Kleymenov Sep 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue with this approach is where is cusolver_batched_potrs_supports_multiple_nrhs going to reside? If this is to be set in a single place it would have to be a global variable, which is undesirable, as each of these are free functions.

This is not a variable, but a constant, leaving no trace in the binary. Could be declared at the beginning of the source.

It seems very unlikely that NVIDIA will come out with an update eliminating this constraint for an API they are probably going to deprecate at some point.

This is a good point. So we can omit on this expectation.
Documentation explicitly states The routine will be removed in the next major release. Why we are using it here if it won't work in a year?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed the macro.
The reason to use the legacy API is that the new API is not yet fleshed out. There would be a lot of missing functions, I would expect that when the legacy API is deprecated they will also bring in more supported operations for the new API.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason to use the legacy API is that the new API is not yet fleshed out.

I skimmed over the spec and all API, having deprecation notices, also have a proposed new API to use. What is missed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you take a look at the number of supported lapack operations, the 64-bit API does not support nearly as many operations. For example the 64-bit dense eigenvalue solver section only really contains two lapack operations. You can then see that the legacy API has many more.
Legacy API eigenvalue solvers: https://docs.nvidia.com/cuda/cusolver/index.html#cuSolverDN-eigensolver-reference
64-bit API eigenvalue solvers: https://docs.nvidia.com/cuda/cusolver/index.html#cuSolverDN-eigensolver-reference-64bit

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this update only concerned with potrs and getrs which are being deprecated and have new API proposed?


// cuSolver function only supports nrhs = 1
if (nrhs != 1)
throw unimplemented("lapack", "potrs_batch", "cusolver potrs_batch only supports nrhs = 1");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I don't see any deprecation notices for cusolverDnDpotrsBatched

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The legacy API functions are not "deprecated" functions. They just follow an old interface, have fewer features, and will be made deprecated in the future.

auto ipiv32_acc = ipiv32.template get_access<sycl::access::mode::write>(cgh);
auto ipiv_acc = ipiv.template get_access<sycl::access::mode::read>(cgh);
cgh.parallel_for(sycl::range<1>{ ipiv_size }, [=](sycl::id<1> index) {
ipiv32_acc[index] = static_cast<std::int32_t>(ipiv_acc[index]);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think if it makes sense to have overflow checks for integer arrays same as on line 114?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can leave this without checks. While it would be nice to provide the user more error information, if the user provided valid dimensions but then requests out of bound ipiv values then that is more of an issue on their end.
There is also downside of having to check every ipiv value across the batches then copy the results back to the host to decide to throw an exception or not. So this means more memory being copied from device to host and the host cannot just queue up as much asynchronous work as possible and move on.

// Create new buffer with 32-bit ints then copy over results
std::uint64_t ipiv_size = stride_ipiv * batch_size;
sycl::buffer<int, 1> ipiv32(sycl::range<1>{ ipiv_size });
sycl::buffer<int> devInfo{ batch_size };

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sycl::buffer<int> is equivalent to sycl::buffer<int, 1>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this, I have removed the dim value.

Copy link

@Alexander-Kleymenov Alexander-Kleymenov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates. Looks good!

@AerialMantis
Copy link

@ericlars it looks like we have approvals for this, could this be merged now, or is there any further discussions to which need to be resolved?

@ericlars ericlars merged commit c502dec into uxlfoundation:develop Oct 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants