#> milkcheck -c /root/admin/atosfa/etc/milkcheck status -n 'aa1-[1022,2000,2009,2029,3028,4060],aa4-[1008,2018,2021,3022],aa2-[1035,3014,3020],aa3-[3007,3037,4012],aa5-2002' ; echo $?
bmcping - Check if BMC respond to ping [ OK ]
ib-HDR200-interswitch-links - Check that all interswitch links are at 4x HDR [ OK ]
ibswitches-computes - Check that compute switches are 202 [ OK ]
powerstatus - Check Power Status is on [ OK ]
kernel - Check if the host respond to ssh and check kernel version [ OK ]
image-version - Check image version for /IMAGE_NAME [ OK ]
nb-numa-nodes - Check if number of NUMA nodes is coherent [ OK ]
proc-cmdline - Test that /proc/cmdline [ OK ]
slurm-version - Check if SLURM version is fine [ OK ]
memsize-cn - Check if memory size is fine [ OK ]
biossettings-testing - Check BIOS TESTING settings [ OK ]
status ib-firmware.mlx5_0 ran in 0.48 s-BIOS-rome,lspci,memory-manuf,memory-rank,memory-speed,nb-cpus,selfull,user-id]
> aa1-[1022,2000,2009,2029,3028,4060],aa4-[1008,2018,2021,3022],aa2-[1035,3014,3020],aa3-[3007,3037,4012],aa5-2002 exited with 1
ib-firmware.mlx5_0 - Check IB firmware mlx5_0 [ ERROR ]
ib-firmware [DEP_ERROR]
lspci - Check lspci md5sum of the node [ OK ]
user-id - Check if the user naih is available [ OK ]
nb-cpus - Check if number of CPUs is coherent [ OK ]
memory-manuf - Check Manufacturer is Samsung for all DIMMs [ OK ]
memory-speed - Check Speed of all DIMMs [ OK ]
check-TS-compute-mount - Check if TS is mounted. If not mount it on gidc [ OK ]
memory-rank - Check Rank of all DIMMs [ OK ]
ib-state.mlx5_0 - Check IB state mlx5_0 is Active [ OK ]
ib-state [ OK ]
microcode - Check if the microcode of the CPU is correctly loaded [ OK ]
ib-rate-hdr.mlx5_0 - Check IB rate mlx5_0 is HDR200 [ OK ]
ib-rate-hdr [ OK ]
ib-temperature.mlx5_0 - Check IB HCA mlx5_0 temperature (requires mft RPM) [ OK ]
ib-temperature [ OK ]
bmc-ntp - Activate NTP on BMC [ OK ]
selfull - Check sel usage nearly full [ OK ]
compute-BIOS-rome - Check BIOS is Rome on compute node [ OK ]
compute-snmp-trap - Check if component as traps configured correctly [ OK ]
compute-firmware-testing - Check firmwares of compute node against TS TESTING [ OK ]
compute-backup-image-testing - Check that image backup is up2date [ OK ]
ib-params.PCI_WR_ORDERING.mlx5_0 - Check IB HCA param PCI_WR_ORDERING [ OK ]
ib-params.MAX_ACC_OUT_READ.mlx5_0 - Check IB HCA param MAX_ACC_OUT_READ [ OK ]
ib-params.ADVANCED_PCI_SETTINGS.mlx5_0 - Check IB HCA param ADVANCED_PCI_SETTINGS [ OK ]
ib-mode.mlx5_0 - Check IB mode [ OK ]
ib-mode [ OK ]
ib-lspci.mlx5_0 - Check lspci on IB card for MaxReadReq [ OK ]
ib-lspci [ OK ]
0
Hi, thanks for the report. I think it's a duplicate of issue #34.
Can you confirm by launching the command without any node arguments: milkcheck -c /root/admin/atosfa/etc/milkcheck status ; echo $?
I suspect some skipped services are triggering the behaviour described in #34.
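In the meantime, one possible workaround is to derive the result from the text output instead of the exit code. This is only a minimal sketch, assuming the output format shown in this report (the `[ ERROR ]` and `[DEP_ERROR]` markers); `has_failed_check` is a hypothetical helper name, not part of milkcheck:

```shell
# Hypothetical helper, not part of milkcheck: it only parses the text output.
# Reads milkcheck status output on stdin.
# Returns 1 if any line carries an "[ ERROR ]" or "[DEP_ERROR]" marker, 0 otherwise.
has_failed_check() {
    if grep -Eq '\[ *(DEP_)?ERROR *\]'; then
        return 1
    fi
    return 0
}
```

It could be used as `milkcheck -c /root/admin/atosfa/etc/milkcheck status -n '<nodes>' | has_failed_check ; echo $?`, where `<nodes>` stands for the node set. Note that this assumes the markers are printed unchanged when the output is piped.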
Dear milkcheck dev,
I have hit a case where the exit code of milkcheck is inconsistent with the fact that one test failed.
Here is an example:
The debug verbose output is attached: milkcheck.github.txt.
In some other tests the return code is nonzero, but in this specific case it is 0 even though ib-firmware.mlx5_0 reports ERROR. Any hint?
Best regards,
Fabien Archambault