Added logging of Slurm subprocess failures and testing for all parsing functions #43

Rovanion · 2021-03-02T10:57:52Z

Hi,
Over the last two weeks I've been working on improving the error messages and testing of the Prometheus Slurm Exporter. I found that if any of the subprocesses failed, the main Go process terminated, printing only "exit status 1" before doing so. This implied that the Go process itself had exited with status 1 when it was in fact the subprocess that had done so.

I think that I have gone over all parts of the project that execute a subprocess and parse the text output by that subprocess. I have refactored these to use a common function for starting the subprocesses and reporting errors should they fail. I have also refactored them to have a pure function for parsing the output of the subprocess, and have added unit tests for all of these pure functions. Finally, I have added system tests for the impure functions that execute sinfo, squeue and so on.

Overall, the result is that there are now unit tests for all parser functions and system tests for all the subprocess-launching functions. The subprocess-launching functions now report the stderr of the subprocess should it fail.

This PR also encompasses the two previous PRs for fixing the formatting and splitting the old tests into system and unit tests. It was too much work to rebase it on master and go through every single commit to undo the autoformatting.

Why did I do all this? Because it was impossible to figure out why the system tests failed without this work and I needed to know this in order to write system tests for Nix. It was also the case that the unit tests would never fail since they never asserted anything.

The system tests require a working Slurm cluster to be present while running, or at least a working Slurm controller. The unit tests operate within the confines of the Go code and are independent of the rest of the system. `make test` still does the same thing as before: it runs all available tests. The commands `make unittest` and `make systemtest` were added for when only the tests in one of these categories are of interest. These changes were motivated by my trying to package prometheus-slurm-exporter for Nix. Nix wants to run the available tests after a build is complete but is unable to provide services, such as a Slurm cluster, during the check phase of a build. Categorizing the tests in this way allows some tests to run during the build of the package rather than none. The categories themselves are declared at the top of the Go source files through a `// +build $xyz` build constraint header.
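To show why the unit-test category can run without any Slurm service, here is a minimal sketch of a pure parser together with a check on it. The function name `parsePendingByAccount` and the `account|state` line format are assumptions made for this example and are not taken from the exporter.

```go
package main

import (
	"fmt"
	"strings"
)

// A real test file would carry a build constraint above the package
// clause to place it in a category; the tag name is up to the project.

// parsePendingByAccount is a hypothetical pure parser. It takes text that
// a command has already produced and counts PENDING jobs per account.
// The "account|state" line format is an assumption made for this sketch.
func parsePendingByAccount(output string) map[string]int {
	counts := make(map[string]int)
	for _, line := range strings.Split(output, "\n") {
		fields := strings.Split(strings.TrimSpace(line), "|")
		if len(fields) != 2 {
			continue // skip blank or malformed lines
		}
		if fields[1] == "PENDING" {
			counts[fields[0]]++
		}
	}
	return counts
}

func main() {
	// The input is a plain string, so no Slurm cluster is needed.
	sample := "physics|PENDING\nphysics|RUNNING\nchem|PENDING\nphysics|PENDING\n"
	counts := parsePendingByAccount(sample)
	fmt.Println(counts["physics"], counts["chem"]) // prints "2 1"
}
```

With the standard Go tooling, tagged test files are selected with `go test -tags <name>`. Whether the Makefile targets use exactly that invocation is not shown here.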


Previously, when a subprocess failed to execute, we merely printed "exit status 1" (which could reasonably be understood as "the main process, prometheus-slurm-exporter, exited with status 1"). It looked like:

```
INFO[0000] Starting Server: :8080 source="main.go:48"
2021/02/22 13:17:50 exit status 1
```

We now instead print that a subprocess has failed to run, along with the error message given by that subprocess:

```
INFO[0000] Starting Server: :8080 source="main.go:48"
2021/02/22 13:16:31 main.AccountsData: The subprocess squeue failed with the following error:
2021/02/22 13:16:31 main.AccountsData: squeue: error: resolve_ctls_from_dns_srv: res_nsearch error: No error
squeue: error: fetch_config: DNS SRV lookup failed
squeue: error: _establish_config_source: failed to fetch config
squeue: fatal: Could not establish a configuration source
2021/02/22 13:16:31 main.AccountsData: The subprocess squeue terminated with: exit status 1
```
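One way to get a prefix like `main.AccountsData:` in front of a message is to look up the calling function at run time. The sketch below uses only the standard `runtime` package. The names `callerName`, `prefixed` and `accountsData` are hypothetical and this is not necessarily how the logging functions in this PR are written.

```go
package main

import (
	"fmt"
	"runtime"
)

// callerName returns the qualified name of a function on the call stack.
// skip=0 is callerName itself, skip=1 is its caller, and so on.
func callerName(skip int) string {
	pcs := make([]uintptr, 1)
	// +1 because runtime.Callers counts itself as frame 0.
	if runtime.Callers(skip+1, pcs) == 0 {
		return "unknown"
	}
	frame, _ := runtime.CallersFrames(pcs).Next()
	return frame.Function
}

// prefixed formats a message and prepends the name of the function that
// called prefixed (hypothetical helper for this sketch).
func prefixed(format string, args ...interface{}) string {
	// skip=2: 0 is callerName, 1 is prefixed, 2 is the caller of prefixed.
	return callerName(2) + ": " + fmt.Sprintf(format, args...)
}

// accountsData stands in for a data-gathering function that reports an error.
func accountsData() string {
	return prefixed("The subprocess %s terminated with: %s", "squeue", "exit status 1")
}

func main() {
	fmt.Println(accountsData())
}
```

Using `runtime.CallersFrames` rather than a raw program counter keeps the reported name correct even when the compiler inlines the helper functions.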


Rovanion · 2021-03-30T07:56:46Z

Is there anything that I can do to help this PR get merged @vpenso?

Rovanion · 2021-07-21T07:39:47Z

This branch is starting to bitrot. As the author I obviously think it brings with it significant improvements and would suggest I rebase it on master and you merge it in before it rots even more. The value of code deduplication and having actual tests for the codebase should outweigh the risk of introduced bugs.

mtds · 2021-07-30T07:43:31Z

@Rovanion: yes, if you want to rebase your PR on the current master branch then now is probably a good time, since it is unlikely that we will add to or modify the code base in August. Note that the next overhaul may come in the near future, when we integrate our new GPU nodes.

One piece of advice: if possible, remove all source-code formatting adjustments from your PR entirely,
since they tend to make reviewing external contributions quite 'painful': the reviewer has to disentangle them from the actual enhancement.

> The value of code deduplication and having actual tests for the codebase should outweigh the risk of introduced bugs.

Well, having more tests will help, but keep in mind that this exporter targets HPC clusters of different sizes and complexity,
and the sysadmins in charge usually take a very conservative approach to updating something that works.
And nobody likes to have a piece of their monitoring falling over or starting to report wrong stats.

Rovanion · 2021-08-04T20:46:58Z

I, erm, did not end up rebasing. Instead I merged in the changes from vpenso/master and manually reverted almost all the style changes I could find; some remain, as I couldn't find the cause of the diff. I hope it will be a lighter read now.

We're hoping to put this into production on our HPC cluster this fall after upgrading to Slurm 20.11. I think we both want the same thing here, reliable monitoring of our clusters. I just value testing and code reuse more than absence of code changes.

Is there anything more that I can help with to get these changes merged?

Rovanion added 28 commits February 12, 2021 11:12

Corrected function name in cpus_test.go

8420a24

Corrected formatting of the project

e7064fd

I found there were inconsistencies in tab characters within the files. So I ran `go fmt` and this is the result.

Add prometheus-slurm-exporter to .gitignore

e5d513b

It is a compiled binary, should not be tracked by git.

Added Logging library

66472ef

... with logging functions that print from where the call was made. This significantly improves the ability to figure out where in the code the error was caused.

Added two logging functions for use in utility functions

af6975f

Also added docstrings to all logging functions.

Extracted the unix command execution from accounts.go to a generic su…

403a39f

…bprocess.go

Refactored cpus to use subprocess.go

706748f

Refactored gpus.go to use subprocess.go and also added unit and syste…

8644bba

…m tests

Refactored nodes.go to use subprocess.go

2b46da1

Refactored partitions.go to use subprocess.go

95c6ffb

Added system and unit tests for partition data gathering

fe6be80

Refactored queue.go to use subprocess.go

2fd60b2

Added real unit tests for queue.go

f1b1d22

Refactored scheduler.go to use subprocess.go

260c772

Added actual unit tests for scheduler.go

e851819

Refactored sshare.go to use subprocess.go

00a4cbc

Added tests for sshare.go

13c7526

Not sure if the module behaves correctly though.

Refactored users.go to use subprocess.go

bb630bd

Added tests for users.go

d1b68e6

Clearify an argument name in accounts.go

fbda610

Added tests for accounts.go

3574e0e

Corrected function name in nodes_system_test.go

bcc30db

Corrected imports in system tests for partitions, users and sshare

ca0012f

Corrected function name in users_system_test.go

ec689e7

Implemented an actual unit tests for cpus.go's ParseCPUsMetrics

a755555

Wrote actual tests for nodes.go

dca9b63

The existing "test" which only printed the parsed data had been broken since august 2020 and commit 1277b79.

Rovanion mentioned this pull request Mar 18, 2021

Exporter dies when Slurm accounting not enabled #45

Closed

Rovanion added 2 commits August 4, 2021 09:40

Merge remote-tracking branch 'origin/master' into log-slurm-failure-o…

9a3570b

…utput-merge

Refactored node.go to use subprocess.go

9afe3af

Rovanion force-pushed the log-slurm-failure-output branch 6 times, most recently from bb768e7 to ccf39b6 Compare August 4, 2021 20:12

Revert e7064fd, undoing style fixes

611df5a

To decrease the size of the pull request

Rovanion force-pushed the log-slurm-failure-output branch from ccf39b6 to 611df5a Compare August 4, 2021 20:21
