Exit code incoherent if a test fails #59

Open
MarbolanGos opened this issue Jul 8, 2022 · 1 comment

Comments

@MarbolanGos

Dear milkcheck dev,

I have hit a case where the exit code of milkcheck is inconsistent with the fact that one test failed.

# milkcheck --version
milkcheck 1.2.2
# uname -r
3.10.0-1160.42.2.el7.x86_64

Here is an example:

#> milkcheck -c /root/admin/atosfa/etc/milkcheck status -n 'aa1-[1022,2000,2009,2029,3028,4060],aa4-[1008,2018,2021,3022],aa2-[1035,3014,3020],aa3-[3007,3037,4012],aa5-2002' ; echo $?
bmcping - Check if BMC respond to ping                                                                     [    OK   ]
ib-HDR200-interswitch-links - Check that all interswitch links are at 4x HDR                               [    OK   ]
ibswitches-computes - Check that compute switches are 202                                                  [    OK   ]
powerstatus - Check Power Status is on                                                                     [    OK   ]
kernel - Check if the host respond to ssh and check kernel version                                         [    OK   ]
image-version - Check image version for /IMAGE_NAME                                                        [    OK   ]
nb-numa-nodes - Check if number of NUMA nodes is coherent                                                  [    OK   ]
proc-cmdline - Test that /proc/cmdline                                                                     [    OK   ]
slurm-version - Check if SLURM version is fine                                                             [    OK   ]
memsize-cn - Check if memory size is fine                                                                  [    OK   ]
biossettings-testing - Check BIOS TESTING settings                                                         [    OK   ]
status ib-firmware.mlx5_0 ran in 0.48 s
 > aa1-[1022,2000,2009,2029,3028,4060],aa4-[1008,2018,2021,3022],aa2-[1035,3014,3020],aa3-[3007,3037,4012],aa5-2002 exited with 1
ib-firmware.mlx5_0 - Check IB firmware mlx5_0                                                              [  ERROR  ]
ib-firmware                                                                                                [DEP_ERROR]
lspci - Check lspci md5sum of the node                                                                     [    OK   ]
user-id - Check if the user naih is available                                                              [    OK   ]
nb-cpus - Check if number of CPUs is coherent                                                              [    OK   ]
memory-manuf - Check Manufacturer is Samsung for all DIMMs                                                 [    OK   ]
memory-speed - Check Speed of all DIMMs                                                                    [    OK   ]
check-TS-compute-mount - Check if TS is mounted. If not mount it on gidc                                   [    OK   ]
memory-rank - Check Rank of all DIMMs                                                                      [    OK   ]
ib-state.mlx5_0 - Check IB state mlx5_0 is Active                                                          [    OK   ]
ib-state                                                                                                   [    OK   ]
microcode - Check if the microcode of the CPU is correctly loaded                                          [    OK   ]
ib-rate-hdr.mlx5_0 - Check IB rate mlx5_0 is HDR200                                                        [    OK   ]
ib-rate-hdr                                                                                                [    OK   ]
ib-temperature.mlx5_0 - Check IB HCA mlx5_0 temperature (requires mft RPM)                                 [    OK   ]
ib-temperature                                                                                             [    OK   ]
bmc-ntp - Activate NTP on BMC                                                                              [    OK   ]
selfull - Check sel usage nearly full                                                                      [    OK   ]
compute-BIOS-rome - Check BIOS is Rome on compute node                                                     [    OK   ]
compute-snmp-trap - Check if component as traps configured correctly                                       [    OK   ]
compute-firmware-testing - Check firmwares of compute node against TS TESTING                              [    OK   ]
compute-backup-image-testing - Check that image backup is up2date                                          [    OK   ]
ib-params.PCI_WR_ORDERING.mlx5_0 - Check IB HCA param PCI_WR_ORDERING                                      [    OK   ]
ib-params.MAX_ACC_OUT_READ.mlx5_0 - Check IB HCA param MAX_ACC_OUT_READ                                    [    OK   ]
ib-params.ADVANCED_PCI_SETTINGS.mlx5_0 - Check IB HCA param ADVANCED_PCI_SETTINGS                          [    OK   ]
ib-mode.mlx5_0 - Check IB mode                                                                             [    OK   ]
ib-mode                                                                                                    [    OK   ]
ib-lspci.mlx5_0 - Check lspci on IB card for MaxReadReq                                                    [    OK   ]
ib-lspci                                                                                                   [    OK   ]
0

The verbose debug output is attached: milkcheck.github.txt.

In some other runs with a failing test the return code is non-zero, but in this specific case it is 0. Any hint?

Best regards,
Fabien Archambault

@cedeyn
Collaborator

cedeyn commented Jul 19, 2022

Hi, thanks for the report. I think it's a duplicate of issue #34.
Can you confirm by launching the command without any node arguments:
milkcheck -c /root/admin/atosfa/etc/milkcheck status ; echo $?

I think some of your services are skipped, which leads to the behavior described in #34.
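
Until the exit code can be relied on, one possible stopgap is to inspect the status column of the output as well. The following is only a minimal sketch and not part of milkcheck: `has_failed_status` and `run_and_check` are hypothetical helper names, and the pattern assumes the uncolored `[  ERROR  ]` and `[DEP_ERROR]` markers seen in the output above. Any other failure status would have to be added to the pattern.

```shell
#!/bin/sh
# Hypothetical helpers, not provided by milkcheck.

# Succeeds (returns 0) if the text on stdin contains at least one line
# whose trailing status column is ERROR or DEP_ERROR.
has_failed_status() {
    grep -Eq '\[[[:space:]]*(ERROR|DEP_ERROR)[[:space:]]*\][[:space:]]*$'
}

# Runs the given command, prints its combined stdout/stderr, and returns:
#   - the command's own exit code if it is non-zero,
#   - 1 if the command exited 0 but its output reports a failed status,
#   - 0 otherwise.
run_and_check() {
    output=$("$@" 2>&1)
    rc=$?
    printf '%s\n' "$output"
    if [ "$rc" -ne 0 ]; then
        return "$rc"
    fi
    if printf '%s\n' "$output" | has_failed_status; then
        return 1
    fi
    return 0
}

# Example call (commented out so that sourcing this file runs nothing):
# run_and_check milkcheck -c /root/admin/atosfa/etc/milkcheck status ; echo $?
```

This keeps the real exit code whenever it is already non-zero and only overrides a 0 when the printed statuses disagree with it, so it should stay harmless once the underlying issue is fixed.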
