CUDA hijack off: data/memory corruption (see below)
Frontier: works (this is HIP so hijack on/off is sort of irrelevant)
In the case with the CUDA hijack off, errors look like:
DID NOT CONVERGE!!!!
[2 - 7f0703e40000] 5.668820 {6}{gpu}: CUDA error reported on GPU 0: an illegal memory access was encountered (CUDA_ERROR_ILLEGAL_ADDRESS)
*** Caught a fatal signal (proc 2): SIGABRT(6)
Because it goes away with hijack (at least on Sapling), this smells like a synchronization issue.
What's notable to me here is that Frontier and Perlmutter (with hijack off) should be following the same code paths. HIP, as you may recall, has never really had a hijack, so users have always been required to query the task HIP stream for kernel launches. I have now gone and unified the code so that, as much as possible, we are running identical code in the CUDA case. It should be difficult or impossible for the HIP code to be obtaining the task stream but not doing so in the CUDA case. However, we still hit the issue above.
Since I suspected a synchronization issue, I went and commented out every instance of set_task_ctxsync_required(false) in the application. If I understand correctly, this should force a synchronization after every task. This is the primary difference between hijack and non-hijack modes, so it seems like the more likely culprit. To be really sure, I also applied the following diff to Realm:
diff --git a/runtime/realm/cuda/cuda_module.cc b/runtime/realm/cuda/cuda_module.cc
index f06e9cd21..628fe70e0 100644
--- a/runtime/realm/cuda/cuda_module.cc
+++ b/runtime/realm/cuda/cuda_module.cc
@@ -4218,6 +4218,7 @@ namespace Realm {
   void CudaModule::set_task_ctxsync_required(bool is_required)
   {
+    abort(); // Elliott: make sure we never hit this
     // if we're not in a gpu task, setting this will have no effect
     ThreadLocal::context_sync_required = (is_required ? 1 : 0);
   }
If I understand correctly, this ensures that we do not hit this code path, anywhere in the application. But after rebuilding I'm still hitting the error above.
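For readers less familiar with this flag, the reasoning above can be illustrated with a toy model. This is only a sketch of the behavior as described in this issue, not Realm's actual implementation: everything in the `toy` namespace is hypothetical, and it assumes that with the hijack off a context synchronization follows each GPU task unless the task opts out.

```cpp
#include <functional>

// Toy model only: every name in this namespace is hypothetical and stands in
// for runtime internals that are not shown in this issue.
namespace toy {
  // Stand-in for the per-thread flag written by set_task_ctxsync_required().
  thread_local bool context_sync_required = true;

  // Counts simulated context synchronizations.
  int sync_count = 0;

  // A task calls this to opt out of (or back into) the sync that follows it.
  void set_task_ctxsync_required(bool is_required) {
    context_sync_required = is_required;
  }

  // Stand-in for the task wrapper: reset the flag, run the task body, then
  // synchronize unless the task opted out.
  void run_gpu_task(const std::function<void()> &body) {
    context_sync_required = true;  // ASSUMED default with the hijack off
    body();
    if (context_sync_required)
      ++sync_count;  // a real runtime would synchronize the GPU context here
  }

  // Runs one task and returns how many synchronizations it triggered.
  int syncs_for_one_task(bool task_opts_out) {
    const int before = sync_count;
    run_gpu_task([task_opts_out] {
      if (task_opts_out)
        set_task_ctxsync_required(false);
    });
    return sync_count - before;
  }
}
```

In this model, a task that calls `set_task_ctxsync_required(false)` is followed by no synchronization, while the same task with that call commented out is followed by one. That is the effect the experiment above relies on, under the stated assumption about the default.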
At this point, I wonder if cuCtxRecordEvent (from #1730 (comment)) is somehow not having the behavior we expect? Again, I don't see what else could be different between the hijack and non-hijack builds. We are running the same application code and Legion version. In the case of Sapling, I can literally rebuild with one flag set.
Is there a way to shut off the cuCtxRecordEvent code path and just do a plain old cuCtxSynchronize? Unless someone else has another suggestion, this seems like the next thing to try.
It does not; you need to compile with at least CUDA 12.5, and the driver needs to be at least r550, I believe. I don't think cuCtxRecordEvent is the problem here. My guess is that there is a race somewhere that the hijack happens to make less likely. I would look at the address that causes the failure and trace back where it came from and why it is invalid.
This is a follow-on from #1682. I'm building S3D on a variety of machines, and the behavior I currently see is listed above.
@muraj for visibility.