This repository has been archived by the owner on Nov 2, 2021. It is now read-only.
We're running a deepops deployment with the current DCGM exporter docker image.
I noticed /var/log/nv-hostengine.log on most machines was full of messages like these:
ERROR [3468:4824] ReadNvSwitchStatusAllSwitches() returned No data is available [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:372] [DcgmNs::DcgmModuleNvSwitch::RunOnce]
I say "most machines" because this doesn't happen on DGX-2 or DGX-A100, i.e., those that have an NVSwitch 😉
This isn't a huge problem. It mostly just spams the log with ~2 messages/minute, so if you leave it running for weeks the log gets very big. It would be nice if the exporter could detect when no NVSwitch is present and turn collection off.
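To gauge how much of a given log this message accounts for, a minimal Python sketch along these lines could help. The message text is copied from the excerpt above; the function name `count_nvswitch_errors` is made up for illustration and is not part of DCGM or the exporter.

```python
from typing import Iterable, Tuple

# Message text copied from the nv-hostengine.log excerpt in this report.
NVSWITCH_MSG = "ReadNvSwitchStatusAllSwitches() returned No data is available"


def count_nvswitch_errors(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of matching lines, their total size in bytes).

    Hypothetical helper for illustration only.
    """
    count = 0
    size = 0
    for line in lines:
        if NVSWITCH_MSG in line:
            count += 1
            size += len(line.encode("utf-8"))
    return count, size
```

It accepts any iterable of lines, so an open file handle for /var/log/nv-hostengine.log can be passed in directly (reading that file may need elevated permissions).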
The DGX OS version doesn't seem to matter; we've seen it with 5.0.5 and earlier.
Here's our /etc/dcgm-exporter/default-counters.csv, which should be the default except that we turned DCGM_FI_DEV_GPU_UTIL back on.
# Format,,
# If line starts with a '#' it is considered a comment,,
# DCGM FIELD, Prometheus metric type, help message
# Clocks,,
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
# Temperature,,
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
# Power,,
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
# PCIE,,
DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total number of bytes received through PCIe RX (in KB) via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
# Utilization (the sample period varies depending on the product),,
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL , gauge, Decoder utilization (in %).
# Errors and violations,,
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
# DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
# DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
# DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
# DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
# DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
# DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
# Memory usage,,
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
# ECC,,
# DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
# DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
# Retired pages,,
# DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
# DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
# DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
# NVLink,,
# DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes
# VGPU License status,,
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
# Remapped rows,,
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed
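For anyone comparing their own counters file against this one, here is a rough Python sketch that lists the enabled fields, following the format described in the file's header comments (lines starting with '#' are comments; the columns are DCGM field, Prometheus metric type, help message). The names `CounterSpec` and `parse_counters` are illustrative, and this is not necessarily how dcgm-exporter itself parses the file.

```python
from typing import List, NamedTuple


class CounterSpec(NamedTuple):
    field: str        # DCGM field name, e.g. DCGM_FI_DEV_GPU_UTIL
    metric_type: str  # Prometheus metric type, e.g. gauge or counter
    help_text: str    # help message


def parse_counters(text: str) -> List[CounterSpec]:
    """Return the enabled (non-comment) entries of a counters CSV.

    Illustrative helper; assumes the three-column layout from the file header.
    """
    specs = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments (disabled counters are commented out).
        if not line or line.startswith("#"):
            continue
        # Split on the first two commas only, so help text may contain commas.
        parts = line.split(",", 2)
        if len(parts) != 3:
            continue  # not a well-formed entry
        field, metric_type, help_text = (p.strip() for p in parts)
        specs.append(CounterSpec(field, metric_type, help_text))
    return specs
```

Running `parse_counters` over two versions of the file and diffing the `field` values shows which counters were switched on or off.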
jfolz changed the title from "ReadNvSwitchStatusAllSwitches() returned No data is available" to "Log spam in nv-hostengine.log due to ReadNvSwitchStatusAllSwitches() returned No data is available" on Jun 9, 2021.