Slurmmon is a system for gaining insight into Slurm and the jobs it runs. It's meant for cluster administrators looking to raise cluster utilization and measure the effects of configuration changes. Features include:
- trending all the scheduler performance diagnostics (`sdiag` output)
- measuring job turnaround time of probe jobs, as a bellwether of scheduling issues
- creating daily whitespace reports -- identifying specific users and jobs with low utilization of their allocations (the jobs that lead to the dreaded whitespace gap in plots of total resources vs. used resources)
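
To make the first feature concrete, here is a minimal sketch of turning `sdiag`-style text into numeric counters that could be trended over time. It is not Slurmmon's actual code: the function name `parse_counters` and the sample text are illustrative assumptions, and the sample values are made up. It only assumes that the diagnostics contain `label: integer` lines.

```python
import re

# Matches lines of the form "Some label:   123" (assumed shape of the counters).
_COUNTER_LINE = re.compile(r"^\s*([^:]+?)\s*:\s*(-?\d+)\s*$")


def parse_counters(text):
    """Return a dict mapping each 'label: integer' line to its integer value.

    Hypothetical helper for illustration; lines whose value is not a plain
    integer (timestamps, section headers) are skipped.
    """
    counters = {}
    for line in text.splitlines():
        match = _COUNTER_LINE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


# Made-up sample in the spirit of sdiag output, not captured from a real cluster.
SAMPLE = """\
Server thread count: 3
Agent queue size:    0
Jobs submitted: 120
Jobs started:   95
"""

counters = parse_counters(SAMPLE)
print(counters["Jobs submitted"], counters["Jobs started"])
```

A flat dictionary like this overwrites labels that repeat across sections of the output, so a real collector would likely need to track which section each line belongs to.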
Slurmmon is meant to run on a RHEL/CentOS/SL 6-based system. It currently uses Ganglia for data collection and Apache/mod_python for reporting. The components are:
- slurmmon-daemon -- the daemons that query Slurm and send data to Ganglia
- slurmmon-ganglia -- the Ganglia custom reports that use PHP to stack raw RRD data
- slurmmon-web -- a set of web pages that organize all the reports and relevant plots
- slurmmon-python -- a general Python interface to Slurm, using lazy evaluation
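
As an illustration of what lazy evaluation can mean for an interface like this, the sketch below defers an expensive lookup until an attribute is first read, then caches the result. The class `LazyJob`, the `loader` callback, and the field names are hypothetical and are not taken from slurmmon-python. In a real setting the loader would query Slurm; here it is a stand-in that returns fixed data.

```python
class LazyJob(object):
    """Hypothetical job wrapper that fetches its details only on first use."""

    def __init__(self, job_id, loader):
        self.job_id = job_id
        self._loader = loader   # callable: job_id -> dict of job fields
        self._data = None       # filled in on first attribute access
        self.load_count = 0     # how many times the loader has run

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, i.e. for job fields.
        if name.startswith("_"):
            raise AttributeError(name)
        if self._data is None:
            self._data = self._loader(self.job_id)
            self.load_count += 1
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name)


def fake_loader(job_id):
    # Stand-in for a real Slurm query; the values are made up.
    return {"user": "alice", "cpus": 8}


job = LazyJob(42, fake_loader)
before = job.load_count          # nothing has been fetched yet
user, cpus = job.user, job.cpus  # first access triggers a single load
after = job.load_count
```

The benefit of this pattern is that building many job objects stays cheap, and the cost of querying is paid only for the jobs whose details are actually inspected.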
See the `doc` directory for more information, specifically:
Here is a screenshot of the basic diagnostic report from the production cluster at FASRC: