Hi all,
I'm currently using dcgm_fi_dev_gpu_util to monitor GPU utilization, but I'm running into an issue where it occasionally emits a data point that isn't between 0 and 100. The highest observed value was 4294967295 (the maximum a UINT32 can hold, which might be a hint), but most often the bad values fall in the range of 1k to 200k. This happens both when there is load on the GPUs and when the GPUs are sitting at 0% before and after the erroneous data point. Has anyone else encountered problems with this metric?
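To make the symptom concrete, here is a minimal sketch of a consumer-side sanity check for these samples. It assumes the values arrive as plain integers; the names `classify_gpu_util` and `drop_invalid` are hypothetical helpers, not part of DCGM, and treating 4294967295 as its own category is only a guess based on the values observed above, not documented DCGM behavior.

```python
UINT32_MAX = 2**32 - 1  # 4294967295, the largest value observed above


def classify_gpu_util(value):
    """Label one utilization sample as 'ok', 'uint32_max', or 'out_of_range'."""
    if value == UINT32_MAX:
        # Assumption: worth tracking separately, since it may hint at the cause.
        return "uint32_max"
    if 0 <= value <= 100:
        return "ok"
    return "out_of_range"


def drop_invalid(samples):
    """Keep only samples that are valid percentages (0 to 100 inclusive)."""
    return [v for v in samples if classify_gpu_util(v) == "ok"]


if __name__ == "__main__":
    samples = [0, 37, 4294967295, 150000, 100]
    print(drop_invalid(samples))  # prints [0, 37, 100]
```

This only hides the bad points downstream; it does not say anything about whether the metric or the collection code is producing them.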
I've seen it suggested elsewhere that there's a newer metric, DCGM_FI_PROF_GR_ENGINE_ACTIVE, which might replace it, but I don't know whether the root cause here is the metric itself or something in the collection code. Does anyone know whether collecting the 'prof' metric would incur a greater performance penalty than the 'dev' metric?
Thanks!