Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added logging of Slurm subprocess failures and testing for all parsing functions #43

Open
wants to merge 31 commits into
base: master
Choose a base branch
from

Conversation

Rovanion
Copy link

@Rovanion Rovanion commented Mar 2, 2021

Hi,
Over the last two weeks I've been working on improving the error messages and testing of the Prometheus Slurm Exporter. I found that if any of the subprocesses failed the main go process terminated, only printing exit 1 before doing so; implying that the go process itself exited with status 1 when it was in fact the subprocess doing so.

I think that I have gone over all parts of the project that execute a subprocess and parses the text outputted by that subprocess. I have refactored these to use a common function/procedure for starting these subprocesses and reporting errors should they fail. I have also refactored them to have a pure function for parsing the output of the subprocess and have added unittests for all of these pure functions. I have also added system tests for the impure functions that execute sinfo, squeue and so on.

Over all the result is that there are now unit tests for all parser functions and system tests for all the subprocess launching functions. The subprocess launching functions now report the stderr of the subprocess should they fail.

This PR also encompasses the two previous PR's for fixing the formatting and splitting the old tests into system and unit tests. It was too much work rebasing it on master and going through every single commit to undo the autoformatting.

Why did I do all this? Because it was impossible to figure out why the system tests failed without this work and I needed to know this in order to write system tests for Nix. It was also the case that the unit tests would never fail since they never asserted anything.

The system tests require a working Slurm cluster to be present while running, or at least a working Slurm controller. The unit tests operate within the confines of the Go code and are independent of the rest of the system.

`make test` still does the same thing as before, it runs all available tests. The commands `make unittest` and `make systemtest` were added for when only running tests within these categories are of interest.

These changes were motivated by me trying to package prometheus-slurm-exporter for Nix. Nix wants to run the available tests after a build is complete but is unable to provide services, such as a Slurm cluster, during the check phase of a build. Categorizing the tests in this way allows for some tests to run during the build of the package rather than none.

The categories themselves are declared at the top of the Go source files through the `//build $xyz` header.
I found there were inconsistencies in tab characters within the files. So I ran `go fmt` and this is the result.
It is a compiled binary, should not be tracked by git.
... with logging functions that print from where the call was made. This significantly improves the ability to figure out where in the code the error was caused.
Instead of merely printing "exit status 1" (which could reasonably be
understood as "the main process, prometheus-slurm-exporter, exited
with status 1") when a subprocces fails to execute. It looks like:

```
INFO[0000] Starting Server: :8080                        source="main.go:48"
2021/02/22 13:17:50 exit status 1
```

We now instead print that a subprocess has failed to run along with
the error message given by that subprocess:

```
INFO[0000] Starting Server: :8080                        source="main.go:48"
2021/02/22 13:16:31 main.AccountsData: The subprocess squeue failed with the following error:
2021/02/22 13:16:31 main.AccountsData: squeue: error: resolve_ctls_from_dns_srv: res_nsearch error: No error
squeue: error: fetch_config: DNS SRV lookup failed
squeue: error: _establish_config_source: failed to fetch config
squeue: fatal: Could not establish a configuration source
2021/02/22 13:16:31 main.AccountsData: The subprocess squeue terminated with: exit status 1
```
Also added docstrings to all logging functions.
Not sure if the module behaves correctly though.
The existing "test" which only printed the parsed data had been broken since august 2020 and commit 1277b79.
@Rovanion
Copy link
Author

Is there anything that I can do to help this PR get merged @vpenso?

@Rovanion
Copy link
Author

Rovanion commented Jul 21, 2021

This branch is starting to bitrot. As the author I obviously think it brings with it significant improvements and would suggest I rebase it on master and you merge it in before it rots even more. The value of code deduplication and having actual tests for the codebase should outweigh the risk of introduced bugs.

@mtds
Copy link
Collaborator

mtds commented Jul 30, 2021

@Rovanion : yes, if you want to rebase your PR on the current master branch then now is probably a good time since in August it is unlikely that we will add/modify the code base. Note that the next overhaul may be in the near future when we will integrate our new GPU nodes.

One advice: if possible, try to remove entirely from your PR any kind of formatting adjustments of the source code,
since usually it tends to make the review of external contributions quite 'painful' to disentangle from the actual enhancement.

The value of code deduplication and having actual tests for the codebase should outweigh the risk of introduced bugs.

Well, having more tests will help but keep in mind that this exporter is targeting HPC clusters of different size and complexity,
and usually sysadmins in charge take a very conservative approaches in terms of updating something that it works.
And nobody like to have a piece of their monitoring falling on itself or starting to report wrong stats.

@Rovanion Rovanion force-pushed the log-slurm-failure-output branch 6 times, most recently from bb768e7 to ccf39b6 Compare August 4, 2021 20:12
To decrease the size of the pull request
@Rovanion
Copy link
Author

Rovanion commented Aug 4, 2021

I, erm, did not end up rebasing but merging in the changes from vpenso/master and manually reverted almost all style changes I could find; some remaining as I couldn't find the cause of the diff. I hope it will be a lighter read now.

We're hoping to put this into production on our HPC cluster this fall after upgrading to Slurm 20.11. I think we both want the same thing here, reliable monitoring of our clusters. I just value testing and code reuse more than absence of code changes.

Is there anything more that I can help with to get these changes merged?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants