We have found that a deadlock can occur when calling the ListenForPolicyViolations function. It happens when a previously registered policy already has immediate (or pending) violations at the moment it is subscribed to.
Here are the steps that lead to it in the underlying registerPolicy function, which is called by ListenForPolicyViolations:
1. The policy is created by the call to setPolicy (code).
2. A policy violation occurs for the group (e.g. a PCIe replay counter increment).
3. A callback is registered for the policy created in step (1) by calling dcgmPolicyRegister (code).
At this point, the callback is invoked BEFORE the dcgmPolicyRegister call completes. The dcgmPolicyRegister function cannot complete until the callbacks complete, due to locking in the underlying DCGM C++ library (I am not sure exactly which locks are responsible, but there is extensive locking going on; see ex1, ex2).

The callback function is ViolationRegistration, which performs blocking writes to the underlying Go channels. These channels have a buffer size of 1, so the write blocks as soon as there is more than one notification to send. However, nothing is reading from the callback channels yet, because processing of those channels is not set up until AFTER the dcgmPolicyRegister function returns (code).

This leads to the deadlock: the notification callbacks cannot proceed because nothing is processing them, but the notification processor cannot be started because the callbacks cannot complete (and therefore cannot allow dcgmPolicyRegister to complete).
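The blocking pattern described above can be reproduced without DCGM at all. The following is a minimal sketch under stated assumptions: `registerWithPending` is a hypothetical stand-in for dcgmPolicyRegister (it is not the real binding), and the callback is reduced to a blocking send on a channel with a configurable buffer. A timeout is used only so the demonstration terminates instead of hanging.

```go
package main

import (
	"fmt"
	"time"
)

// registerWithPending is a hypothetical stand-in for dcgmPolicyRegister.
// It invokes the callback once per pending violation and returns only
// after every callback invocation has returned.
func registerWithPending(pending int, callback func(int)) {
	for i := 0; i < pending; i++ {
		callback(i)
	}
}

// registrationCompletes reports whether the simulated registration returns
// within timeoutMs milliseconds when the callback performs blocking sends
// on a channel with the given buffer size and no reader is running yet.
func registrationCompletes(pending, buffer, timeoutMs int) bool {
	violations := make(chan int, buffer)
	done := make(chan struct{})

	go func() {
		// Blocking send: once the buffer is full, this never returns
		// because the reader would only be started after registration.
		registerWithPending(pending, func(v int) { violations <- v })
		close(done)
	}()

	select {
	case <-done:
		return true // registration returned; a reader could now be started
	case <-time.After(time.Duration(timeoutMs) * time.Millisecond):
		return false // callback is stuck on a full channel
	}
}

func main() {
	fmt.Println("1 pending, buffer 1:", registrationCompletes(1, 1, 100))
	fmt.Println("2 pending, buffer 1:", registrationCompletes(2, 1, 100))
}
```

With a buffer of 1, a single pending violation fits and registration returns, while a second pending violation leaves the callback blocked, which matches the condition described in this report.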
A few solutions I can quickly think of:
1. Increase the buffer size of the channels: this would provide partial, but not complete, relief, since the callback function could write to the channels even if nothing is processing them yet. However, the channels can still fill up before processing starts, so this only reduces the probability of the issue occurring.
2. Start processing the callback channels before registering the callback function on the policy: this would likely require the library user to pass the violation channel into the ListenForPolicyViolations function. They would also need to begin asynchronously processing notifications from the passed channel before the call. This should prevent the channel writes from blocking, since something is already processing them.
3. Drop messages when the channel is full: instead of blocking until a write to the channel can occur, the notifications could simply be logged and dropped when the channel is full.
4. Address the issue in the underlying C++ library: there are some comments in the code which suggest that processing violations before the callback registration can complete is a known issue (code). It therefore seems possible to handle such cases in the library itself, possibly by dropping the notifications as in option (3). The synchronization could perhaps also be improved to avoid the problem.
These options are likely complementary and could be used together.
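Options (2) and (3) can be sketched in plain Go, independent of the DCGM bindings. This is only an illustration under stated assumptions: `trySend`, `deliverPending`, and `deliverWithEarlyReader` are hypothetical names, and the loops stand in for the library invoking the callback once per pending violation before registration returns.

```go
package main

import "fmt"

// trySend illustrates option (3): a non-blocking send that reports whether
// the value was queued. When the channel is full the value is dropped
// instead of blocking the caller.
func trySend(ch chan<- int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false
	}
}

// deliverPending simulates callbacks firing for each pending violation
// while no reader exists yet. It returns how many notifications were dropped.
func deliverPending(ch chan<- int, pending int) (dropped int) {
	for i := 0; i < pending; i++ {
		if !trySend(ch, i) {
			dropped++ // a real callback would likely log the drop here
		}
	}
	return dropped
}

// deliverWithEarlyReader illustrates option (2): the reader goroutine is
// started before the simulated registration, so blocking sends on a
// channel with a buffer of 1 still make progress. It returns how many
// notifications the reader received.
func deliverWithEarlyReader(pending int) int {
	ch := make(chan int, 1)
	received := make(chan int)

	go func() {
		n := 0
		for range ch {
			n++
		}
		received <- n
	}()

	for i := 0; i < pending; i++ {
		ch <- i // blocking send, as in the current callback
	}
	close(ch)
	return <-received
}

func main() {
	violations := make(chan int, 1)
	dropped := deliverPending(violations, 3)
	fmt.Println("option 3: queued", len(violations), "dropped", dropped)
	fmt.Println("option 2: received", deliverWithEarlyReader(3))
}
```

The trade-off is visible in the two helpers: the non-blocking send never stalls registration but can lose notifications, while the early reader keeps every notification but requires the consumer to be running before the callback is registered.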