Data Corruption on dcgm_fi_dev_gpu_util Metric #75

Open
TortoiseHam opened this issue Nov 13, 2024 · 1 comment

Comments

@TortoiseHam

Hi all,

I'm currently using dcgm_fi_dev_gpu_util to monitor GPU utilization, but I'm running into an issue where it occasionally emits a data point that isn't between 0 and 100. The highest observed value was 4294967295 (the maximum value of a UINT32, which might be a hint), but most often the bad values are in the range of 1k to 200k. This happens both when there is load on the GPUs and when the GPUs are sitting at 0% before and after the erroneous data point. Has anyone else encountered problems with this metric?
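For reference, the range check implied above (a utilization sample should fall between 0 and 100) can be sketched in Go as follows. This is only an illustrative stopgap filter, not part of DCGM or its Go bindings: the names `validUtil` and `filterUtil` are hypothetical, and the sample values are made up to mirror the kinds of readings described here. It also shows that 4294967295 equals `math.MaxUint32`.

```go
package main

import (
	"fmt"
	"math"
)

// validUtil reports whether a sample is a plausible utilization percentage.
// Hypothetical helper; the value is unsigned, so only the upper bound is checked.
func validUtil(v uint64) bool {
	return v <= 100
}

// filterUtil returns the samples that pass validUtil, plus a count of the
// samples that were rejected as out of range.
func filterUtil(samples []uint64) (kept []uint64, rejected int) {
	for _, v := range samples {
		if validUtil(v) {
			kept = append(kept, v)
		} else {
			rejected++
		}
	}
	return kept, rejected
}

func main() {
	// Made-up samples: three plausible readings and two out-of-range ones.
	samples := []uint64{0, 37, 4294967295, 150000, 100}

	kept, rejected := filterUtil(samples)
	fmt.Println(kept, rejected) // prints "[0 37 100] 2"

	// The largest bad value reported in this issue is exactly the UINT32 maximum.
	fmt.Println(samples[2] == math.MaxUint32) // prints "true"
}
```

Counting the rejected samples rather than silently dropping them keeps the frequency of the bad readings visible, which may help when narrowing down where they originate.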

I've seen it suggested elsewhere that a newer metric, DCGM_FI_PROF_GR_ENGINE_ACTIVE, might replace it, but I don't know whether the root cause here is the metric itself or something in the collection code. Does anyone know whether collecting the 'prof' metric would incur a greater performance penalty than the 'dev' metric?

Thanks!

@TortoiseHam
Author

(Cross-posted to NVIDIA/DCGM#199, since I'm not sure whether this is a problem with the metric itself or with the Go wrapper being used to extract it.)
