Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock in ListenForPolicyViolations #61

Open
sanjams2 opened this issue Feb 26, 2024 · 1 comment
Open

Deadlock in ListenForPolicyViolations #61

sanjams2 opened this issue Feb 26, 2024 · 1 comment

Comments

@sanjams2
Copy link
Contributor

sanjams2 commented Feb 26, 2024

Hi Team,

We have uncovered that a deadlock can occur when calling the ListenForPolicyViolations function. This happens when there are immediate (or pending) policy violations to a previously registered policy that is being subscribed to.

Here are the steps where this happens in the underlying registerPolicy function called by ListenForPolicyViolations

  1. Policy is created in set by the call to setPolicy (code)
  2. Policy violation occurs for the group (e.g. a PCIe replay counter increment)
  3. Callback is registered for the policy created in step (1) by calling dcgmPolicyRegister (code)

At this point, the callback is invoked BEFORE the dcgmPolicyRegister call completes. The dcgmPolicyRegister function cannot complete until the callbacks complete due to locking in the underlying dcgm c++ library (not exactly sure which locks are causing it but there is extensive locking going on — ex1, ex2). The callback function is ViolationRegistration which performs blocking writes to the underlying go channels (note: these go channels are only buffered by 1 item so this blocks when there are >1 notifications to send). However, nothing is reading from the callback go channels because the processing of the callback channels is not setup till AFTER the dcgmPolicyRegister function returns (code). This leads to the deadlock — the notification callbacks cannot proceed because nothing is processing them, but the notification processor cannot be started because the callbacks cannot complete (and therefore allow dcgmPolicyRegister to complete).

A few solutions I can quickly think of:

  1. Increase the buffer size of the channels — this would provide temporary - but not complete - relief from the problem since the callback function can write to the channels even if nothing is processing yet. However, the channels can still fill up before processing starts so this only reduces the probability of the issue occurring.
  2. Start processing the callbacks channels before registering the callback function on the policy — this solution would likely require that the library user pass the violation channel into the ListenForPolicyViolations function. They would also need to begin asynchronously processing notifications out of the passed channel before the call. This should prevent the channel writes from blocking since something is already processing.
  3. Drop messages when channel is full — Instead of blocking until a write to the channel can occur, the notifications could simply be logged and dropped when the channel is full
  4. Address issue in the underlying c++ library — There are some comments in the code which suggest this issue of processing the violations before the callback registration can complete is a known issue (code). Therefore, it seems that it's possible to handle such cases in the library itself — possibly by dropping the notifications similar to option (3). Perhaps the synchronization can also be improved to avoid this as well.

It is likely that all of these could be used together in unison as well.

@nvvfedorov
Copy link
Collaborator

Thank you for reporting about the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants