Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data corruption with CUDA hijack off #1782

Open
elliottslaughter opened this issue Oct 25, 2024 · 3 comments
Open

Data corruption with CUDA hijack off #1782

elliottslaughter opened this issue Oct 25, 2024 · 3 comments
Assignees

Comments

@elliottslaughter
Copy link
Contributor

This is a follow-on from #1682. I'm building S3D on a variety of machines, and the behavior I currently see is:

In the case with the CUDA hijack off, errors look like:

DID NOT CONVERGE!!!!
[2 - 7f0703e40000]    5.668820 {6}{gpu}: CUDA error reported on GPU 0: an illegal memory access was encountered (CUDA_ERROR_ILLEGAL_ADDRESS)
*** Caught a fatal signal (proc 2): SIGABRT(6)

Because it goes away with hijack (at least on Sapling), this smells like a synchronization issue.

What's notable to me here is that Frontier and Perlmutter (with hijack off) should be following the same code paths. HIP, as you may recall, has never really had a hijack, so users have always been required to query the task HIP stream for kernel launches. I have now gone and unified the code so that, as much as possible, we are running identical code in the CUDA case. It should be difficult or impossible for the HIP code to be obtaining the task stream but not doing so in the CUDA case. However, we still hit the issue above.

Since I suspected a synchronization issue, I went and commented out every instance of set_task_ctxsync_required(false) in the application. If I understand correctly, this should force a synchronization after every task. This is the primary difference between hijack and non-hijack modes, so it seems like the more likely culprit. To be really sure, I also applied the following diff to Realm:

diff --git a/runtime/realm/cuda/cuda_module.cc b/runtime/realm/cuda/cuda_module.cc
index f06e9cd21..628fe70e0 100644
--- a/runtime/realm/cuda/cuda_module.cc
+++ b/runtime/realm/cuda/cuda_module.cc
@@ -4218,6 +4218,7 @@ namespace Realm {
 
     void CudaModule::set_task_ctxsync_required(bool is_required)
     {
+      abort(); // Elliott: make sure we never hit this
       // if we're not in a gpu task, setting this will have no effect
       ThreadLocal::context_sync_required = (is_required ? 1 : 0);
     }

If I understand correctly, this ensures that we do not hit this code path, anywhere in the application. But after rebuilding I'm still hitting the error above.

At this point, I wonder if cuCtxRecordEvent (from #1730 (comment)) is somehow not having the behavior we expect? Again, I don't see what else could be different between the hijack and non-hijack builds. We are running the same application code and Legion version. In the case of Sapling, I can literally rebuild with one flag set.

Is there a way to shut off the cuCtxRecordEvent code path and just do a plain old cuCtxSynchronize? Unless someone else has another suggestion, this seems like the next thing to try.

@muraj for visibility.

@elliottslaughter
Copy link
Contributor Author

elliottslaughter commented Oct 25, 2024

Here's the driver on the node in question. I guess it's too old to have cuCtxRecordEvent.

$ srun -n 1 -N 1 nvidia-smi
Fri Oct 25 11:28:03 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+

That was my last idea. Suggestions?

Edit: actually, I guess the driver is distinct from the CUDA version. Can someone check if this driver has cuCtxRecordEvent?

@muraj muraj self-assigned this Oct 26, 2024
@muraj
Copy link

muraj commented Oct 26, 2024

It does not, you need to compile with at least 12.5 and the driver needs to be at least r550 I believe. I don't think cuCtxRecordEvent is the problem here. My guess is there is some race happening somewhere that the hijack makes less likely somehow. I would try to look at the address that is causing the failure and backtrack where it came from and why it's invalid.

@lightsighter
Copy link
Contributor

@elliottslaughter Can you run with compute-sanitizer --lineinfo ... to see which line of which kernel is doing the illegal memory access?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants