Log spam in nv-hostengine.log due to ReadNvSwitchStatusAllSwitches() returned No data is available #194

jfolz · 2021-06-09T07:40:33Z

We're running a DeepOps deployment with the current DCGM exporter Docker image.
I noticed that /var/log/nv-hostengine.log on most machines was full of messages like this:

ERROR [3468:4824] ReadNvSwitchStatusAllSwitches() returned No data is available [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:372] [DcgmNs::DcgmModuleNvSwitch::RunOnce]

I say "most machines" because this doesn't happen on DGX-2 or DGX A100, i.e., the systems that have an NVSwitch 😉
This isn't a huge problem. It mainly spams the log with ~2 messages/minute, so if you leave it running for weeks the log gets very big. It would be nice if the exporter could detect when no NVSwitch is present and turn NVSwitch collection off.
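For anyone who wants to check how much of their log this message accounts for, here is a rough Python sketch. It only does a substring match on the message text quoted above, so it makes no assumptions about the rest of the line layout. The helper name `count_nvswitch_spam` is made up for illustration and is not part of DCGM or the exporter.

```python
# Hypothetical helper, not part of DCGM or dcgm-exporter.
# The message text is copied from the log line quoted in this issue.
NEEDLE = "ReadNvSwitchStatusAllSwitches() returned No data is available"


def count_nvswitch_spam(lines):
    """Return (matching_lines, matching_bytes, total_bytes) for an iterable of log lines."""
    matching_lines = 0
    matching_bytes = 0
    total_bytes = 0
    for line in lines:
        # Measure the encoded size so the byte totals roughly track the file size.
        size = len(line.encode("utf-8", errors="replace"))
        total_bytes += size
        if NEEDLE in line:
            matching_lines += 1
            matching_bytes += size
    return matching_lines, matching_bytes, total_bytes
```

To run it against a real log, pass an open file object, e.g. `count_nvswitch_spam(open("/var/log/nv-hostengine.log", errors="replace"))`. The path is the one from this report and may differ on other setups.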

The DGX OS version doesn't seem to matter; it happens with 5.0.5 and earlier.

Here's our /etc/dcgm-exporter/default-counters.csv, which should match the default except that we turned DCGM_FI_DEV_GPU_UTIL back on:

# Format,,
# If line starts with a '#' it is considered a comment,,
# DCGM FIELD, Prometheus metric type, help message

# Clocks,,
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature,,
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

# Power,,
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE,,
DCGM_FI_DEV_PCIE_TX_THROUGHPUT,  counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
DCGM_FI_DEV_PCIE_RX_THROUGHPUT,  counter, Total number of bytes received through PCIe RX (in KB) via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

# Utilization (the sample period varies depending on the product),,
DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).

# Errors and violations,,
DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.
# DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
# DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
# DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
# DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
# DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
# DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).

# Memory usage,,
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

# ECC,,
# DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
# DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.

# Retired pages,,
# DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
# DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
# DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.

# NVLink,,
# DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL,   counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes

# VGPU License status,,
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status

# Remapped rows,,
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed
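To make it easier to see which fields are actually enabled in a file like the one above, here is a small Python sketch that follows the format stated in the file's header: lines starting with `#` are comments, and each remaining line is `DCGM FIELD, Prometheus metric type, help message`. The function name `parse_counters` is hypothetical, and the whitespace trimming is my assumption; the exporter's own parser may behave differently.

```python
# Hypothetical reader for a counters CSV; not the exporter's own parser.
def parse_counters(text):
    """Return a list of (field, metric_type, help_message) for enabled entries."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines (including commented-out fields).
        if not line or line.startswith("#"):
            continue
        # Split on the first two commas only, so help text may contain commas.
        parts = [part.strip() for part in line.split(",", 2)]
        if len(parts) != 3:
            raise ValueError(f"expected 3 comma-separated columns: {raw!r}")
        entries.append(tuple(parts))
    return entries
```

Calling `parse_counters(open("/etc/dcgm-exporter/default-counters.csv").read())` on the file above would list only the uncommented fields, which makes it quick to confirm that nothing NVSwitch-specific is being requested there.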

jfolz mentioned this issue Jun 9, 2021

nvidia-dcgm-exporter creates huge logs inside container #182

Open

jfolz changed the title ~~ReadNvSwitchStatusAllSwitches() returned No data is available~~ Log spam in nv-hostengine.log due to ReadNvSwitchStatusAllSwitches() returned No data is available Jun 9, 2021
