Merge pull request #352 from ubccr/web105
Update website for 10.5
jpwhite4 authored Sep 11, 2023
2 parents 9044a8a + 579ab2b commit f6dd756
Showing 66 changed files with 2,912 additions and 31 deletions.
1 change: 0 additions & 1 deletion 10.0/supremm-overview.md
@@ -1,7 +1,6 @@
---
redirect_from:
- "/10.0/"
- ""
---

This documentation is intended to be used by system administrators to install and configure
122 changes: 122 additions & 0 deletions 10.5/analytics.md
@@ -0,0 +1,122 @@
---
title: Customizing Help Text for Efficiency Analytics
---

This document describes how to customize the help text displayed when viewing the analytics in the Efficiency tab and how to remove any analytics that are not pertinent to your center.
These instructions are intended for a system administrator of an Open XDMoD instance. Information about the use of the Efficiency tab can be found in the user manual, which is available from the "Help" menu at the top right of the XDMoD portal.

## Efficiency Analytics

The efficiency analytics are shown when you first navigate to the efficiency tab. Each analytic
has associated text explaining what the analytic is, what is considered inefficient for that analytic, and
how efficiency can be improved for that analytic.

As of the 10.0 release, there are six analytics. These are CPU Usage, GPU Usage, Memory Usage, Homogeneity, Wall Time Accuracy, and Short Jobs. When you first navigate to the efficiency tab, you will see cards each representing an analytic. Each card includes a brief description of the analytic as well as a scatter plot that allows you to visualize how users rank in each analytic (see Figure 1 below).

<figure>
<img src="{{ site.baseurl }}/assets/images/efficiency_tab.png" width="700" height="400" alt="Screenshot of the initial view when first navigating to the efficiency tab. The view shows six cards broken down into two categories - one for usage analytics which includes CPU Usage, GPU Usage, Memory Usage and Homogeneity and one for design analytics which includes Wall Time Accuracy and Short Jobs. Each card displays a short description of the analytic and a thumbnail view of the scatter plot related to that analytic. The scatter plot points represent a user's usage on the resource and efficiency of their jobs for the specific analytic." />
<figcaption>Figure 1. Example screenshot of initial view when navigating to the efficiency tab.</figcaption>
</figure>

Upon clicking on one of the analytic cards, a user will be shown a larger view of the scatter plot that they can interact with by filtering or drilling down. In this view, more text is presented to give more information about the specific analytic and how efficiency may be improved for that analytic. Figure 2 below shows this view for CPU Usage. In this image, you can see the help text at the top of the image above the scatter plot. The associated help text shown in this view can be customized as needed for different centers.

<figure>
<img src="{{ site.baseurl }}/assets/images/cpu_usage.png" width="700" height="400" alt="Screenshot of the scatter plot view for CPU Usage in the efficiency tab. Each analytic from the initial efficiency tab page can be viewed in more detail by clicking on the analytic card. This view shows the same scatter plot from the analytic card, but the scatter plot can be filtered or drilled down to learn more information about the users and their respective jobs represented by the scatter plot points. In addition to the scatter plot, there is a side bar that allows filtering and above the scatter plot is the help text explaining the analytic in more detail and giving more information on how to improve efficiency in regard to this analytic." />
<figcaption>Figure 2. Example of CPU Usage scatter plot view and corresponding help text.</figcaption>
</figure>

You may want to customize this text to provide more focused feedback for users at your center or provide links to help text as appropriate. To do so, please follow the directions provided below.

**These instructions only apply to Open XDMoD 10.0. For later versions please refer to the documentation for that release.**

To customize the text for an analytic, you need to edit the `/etc/xdmod/efficiency_analytics.json` configuration file. For example, to edit the help text for the CPU Usage scatter plot, you would edit the HTML under "documentation" in lines 28-37. To edit the help text for the CPU Usage histogram, you would edit the HTML under "histogramHelpText" in lines 46-55. The lines to edit for each are shown below.
```json
28 "documentation": [
29 "<div class='text'>",
30 "<h6> What is this analytic? </h6>",
31 "<p>The chart below shows the percentage of time that the CPU cores were idle compared to overall usage. Each point on the plot shows the data for the jobs for a particular user.</p>",
32 "<h6> Why is this analytic important? </h6>",
33 "<p>Making sure jobs use the right number of CPU cores helps ensure that the compute resources are used efficiently.</p>",
34 "<h6> How do I improve future jobs? </h6>",
35 "<p>For typical compute intensive jobs, the overall CPU usage should be &gt; 90 % (i.e. CPU core idle &lt; 10 %). Consider requesting fewer CPU cores for future jobs, or adjust the configuration settings of the software to make use of all the cores that have been requested.</p>",
36 "</div>"
37 ],
```

```json
46 "histogramHelpText": [
47 "<div class='text'>",
48 "<h6> What is this analytic? </h6>",
49 "<p>The chart below shows the percentage of time that the CPU cores were idle compared to overall usage.</p>",
50 "<h6> Why is this analytic important? </h6>",
51 "<p>Making sure jobs use the right number of CPU cores helps ensure that the compute resources are used efficiently.</p>",
52 "<h6> How do I improve future jobs? </h6>",
53 "<p>For typical compute intensive jobs, the overall CPU usage should be &gt; 90 % (i.e. CPU core idle &lt; 10 %). Consider requesting fewer CPU cores for future jobs, or adjust the configuration settings of the software to make use of all the cores that have been requested.</p>",
54 "</div>"
55 ]
```
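
A missing comma or quote in the edited HTML fragments will make the whole file unparseable, so it can be worth checking the file after each edit. The following is a minimal sketch in Python, assuming the configuration file is plain JSON without comments; the `check_json` helper and the sample strings are hypothetical and are not part of Open XDMoD.

```python
import json


def check_json(text):
    """Return None if text parses as JSON, else a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# Hypothetical samples: the second one is missing the comma between fragments.
good = '{"documentation": ["<div class=\'text\'>", "</div>"]}'
bad = '{"documentation": ["<div class=\'text\'>" "</div>"]}'

print(check_json(good))
print(check_json(bad))
```

To check the real file, read the contents of `/etc/xdmod/efficiency_analytics.json` and pass them to `check_json`; a `None` result only confirms that the file is syntactically valid JSON, not that the HTML inside it renders as intended.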

The process for editing the help text of other analytics is the same, but you need to change the "documentation" entry that corresponds to that specific analytic. Not all analytics include alternative help text for the histogram view. To include this text, add a histogramHelpText section under the histogram JSON for the analytic that you want to change. To exclude this text, remove the histogramHelpText section from the histogram JSON for that analytic.
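
Adding or removing the histogram help text can be sketched in Python as follows. This is an illustration only: the `analytic` entry is a hypothetical, trimmed-down stand-in for one entry of the configuration file, and the two helper functions are not part of Open XDMoD. It assumes only what the excerpts above show, namely that `histogramHelpText` is a list of HTML fragments nested under the `histogram` key.

```python
import json

# Hypothetical, trimmed-down entry; the real file has more keys per analytic.
analytic = {
    "analytic": "Short Jobs",
    "documentation": ["<div class='text'>", "<p>Scatter plot help.</p>", "</div>"],
    "histogram": {"title": "Short Jobs"},
}


def set_histogram_help(entry, fragments):
    """Add or replace the histogram help text for one analytic entry."""
    entry["histogram"]["histogramHelpText"] = list(fragments)


def remove_histogram_help(entry):
    """Drop the histogram help text if present; do nothing otherwise."""
    entry["histogram"].pop("histogramHelpText", None)


set_histogram_help(
    analytic, ["<div class='text'>", "<p>Histogram help.</p>", "</div>"]
)
print(json.dumps(analytic["histogram"], indent=4))

remove_histogram_help(analytic)
print("histogramHelpText" in analytic["histogram"])
```

The scatter plot text under "documentation" is left untouched in both cases, which matches the description above: the two help texts are configured independently.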

In addition to editing the help text, you can also remove an analytic if it is not pertinent to your center. To do this, you need to remove the analytic from the `/etc/xdmod/efficiency_analytics.json` configuration file. For example, to remove GPU Usage from showing in the XDMoD interface, you would remove lines 58-109 from this file. The lines to remove are shown below.

```json
58 {
59 "analytic": "GPU Usage",
60 "description": "How busy were the GPU cores for GPU jobs?",
61 "title": "GPU Usage",
62 "field": "avg_percent_gpu_active",
63 "statistics": [
64 "avg_percent_gpu_active",
65 "gpu_time"
66 ],
67 "statisticLabels": [
68 "Avg GPU % Active: Weighted by GPU Hour",
69 "GPU Hours: Total"
70 ],
71 "statisticDescription": [
72 "<ul><li><b>Avg GPU active: weighted by gpu-hour: </b> The average GPU usage % weighted by gpu hours, over all jobs that were executing.</li></ul><ul><li><b>GPU Hours: Total</b> The total GPU time in hours for all jobs that were executing during the time period. The GPU time is calculated as the number of allocated GPU devices multiplied by the wall time of the job.</li></ul>"
73 ],
74 "valueLabels": [
75 "%",
76 "GPU Hours"
77 ],
78 "reversed": true,
79 "realm": "SUPREMM",
80 "documentation": [
81 "<div class='text'>",
82 "<h6> What is this analytic? </h6>",
83 "<p>The chart below shows the percentage of time that the GPUs were busy compared to overall usage. Each point on the plot shows the GPU jobs for a particular user.</p>",
84 "<h6> Why is this analytic important? </h6>",
85 "<p>Making sure jobs use the right number of GPUs helps ensure that the compute resources are used efficiently.</p>",
86 "<h6> How do I improve future jobs? </h6>",
87 "<p>Try to ensure that the number of GPUs requested matches the number required. If a code is not using all GPUs adjust the configuration settings of the software to make use of all the requested GPUs or consider requesting fewer GPUs in future jobs. If you have jobs with 0% GPU usage, double check that the code is compiled correctly to make use of the GPUs and is not defaulting to CPU-only calculations.</p>",
88 "</div>"
89 ],
90 "histogram": {
91 "title": "GPU Usage",
92 "metric": "gpu_time",
93 "metricTitle": "GPU Hours Total",
94 "group_by" : "gpu_usage_bucketid",
95 "groupByTitle": "GPU Active Value",
96 "rotate": true,
97 "arrowImg": "right_arrow.png",
98 "histogramHelpText": [
99 "<div class='text'>",
100 "<h6> What is this analytic? </h6>",
101 "<p>The chart below shows the percentage of time that the GPUs were busy compared to overall usage.</p>",
102 "<h6> Why is this analytic important? </h6>",
103 "<p>Making sure jobs use the right number of GPUs helps ensure that the compute resources are used efficiently.</p>",
104 "<h6> How do I improve future jobs? </h6>",
105 "<p>Try to ensure that the number of GPUs requested matches the number required. If a code is not using all GPUs adjust the configuration settings of the software to make use of all the requested GPUs or consider requesting fewer GPUs in future jobs. If you have jobs with 0% GPU usage, double check that the code is compiled correctly to make use of the GPUs and is not defaulting to CPU-only calculations.</p>",
106 "</div>"
107 ]
108 }
109 },
```

Figure 3 below shows an example of what the updated efficiency tab interface would look like if you were to remove GPU Usage.

<figure>
<img src="{{ site.baseurl }}/assets/images/efficiency_tab_no_gpu.png" width="700" height="400" alt="Screenshot of the updated efficiency tab interface after customizing the interface to remove the GPU Usage analytic." />
<figcaption>Figure 3. Example of an updated efficiency tab interface after removing GPU Usage analytic.</figcaption>
</figure>
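
Line numbers shift once the file has been edited, so an alternative to deleting lines 58-109 by hand is to remove the entry by its "analytic" name. The following is a minimal sketch in Python, not an Open XDMoD tool. The `remove_analytic` helper is hypothetical, and so are the `groups` and `analytics` keys in the sample data: the nesting of the real file may differ, which is why the helper searches every list in the document rather than relying on a fixed layout.

```python
import json


def remove_analytic(node, name):
    """Return a copy of node without any object whose "analytic" equals name."""
    if isinstance(node, list):
        return [
            remove_analytic(item, name)
            for item in node
            if not (isinstance(item, dict) and item.get("analytic") == name)
        ]
    if isinstance(node, dict):
        return {key: remove_analytic(value, name) for key, value in node.items()}
    return node


# Hypothetical layout: the nesting of the real file may differ.
config = json.loads("""
{
    "groups": [
        {"analytics": [
            {"analytic": "CPU Usage", "realm": "SUPREMM"},
            {"analytic": "GPU Usage", "realm": "SUPREMM"}
        ]}
    ]
}
""")

trimmed = remove_analytic(config, "GPU Usage")
print(json.dumps(trimmed, indent=4))
```

If the result is written back with `json.dump`, the whitespace and line numbering of the file will change, so keep a backup copy of the original before replacing it.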
69 changes: 69 additions & 0 deletions 10.5/customization.md
@@ -0,0 +1,69 @@
---
title: Customization
---

This document describes some advanced customizations for the Job Performance module.

**The automated upgrade scripts do not have any support for preserving
customizations. Any changes made to the underlying Open XDMoD source code
will likely be overwritten when the software is upgraded.**

## Job Analytics

The Job analytics panel shows selected job performance metrics in color
coded plots across the top of the job tab in the Job Viewer. The value of
each metric in the panel is normalized so a value near 1 means a favourable
value and a value near 0 indicates an unfavourable value.

There are five default analytics: CPU Usage, CPU Balance,
Walltime Accuracy, Memory Efficiency and Homogeneity (see Figure 1
below). If the CPU usage metric is unavailable then the analytics toolbar is not displayed.
If any of the other metrics are unavailable then an error message is displayed.

<figure>
<img src="{{ site.baseurl }}/assets/images/analytics_with_five.png" alt="Screenshot of the analytics toolbar. The toolbar has five different boxes arranged in a line. Each box shows a performance metric name, metric value and a bar plot that also indicates the value." />
<figcaption>Figure 1. Example screenshot of the analytics toolbar.</figcaption>
</figure>

A common reason why an analytic is unavailable is that the underlying data was not collected
when the job was running. For example, the homogeneity analytic uses hardware counter data:
the L1D load count and the CPU clock tick count. If the hardware counter data was not configured
to be collected or the hardware does not support an L1D load counter then the homogeneity
metric will be unavailable. An example of the display in this case is shown in Figure 2.

<figure>
<img src="{{ site.baseurl }}/assets/images/analytics_unavailable.png" alt="Screenshot showing a performance metric from the analytics toolbar where the performance datum is unavailable. The metric display shows an exclamation mark icon with the text 'Metric Missing: Not Available On The Compute Nodes'" />
<figcaption>Figure 2. Example analytics metric display when the datum is unavailable.</figcaption>
</figure>

If an analytic will always be unavailable (for example, due to the absence of
hardware support), then the Open XDMoD instance can be customized to never show it.

**This customization will not be preserved if the Open XDMoD software is updated.**

**These instructions only apply to Open XDMoD {{ page.sw_version }}. For other
versions please refer to the documentation for that release.**

To remove an analytic, you need to edit `/usr/share/xdmod/classes/DataWarehouse/Query/SUPREMM/JobDataset.php`
and remove the code associated with the analytic. For example, to remove the homogeneity
analytic you would remove (or comment out) lines 330-346, i.e. the function call to `addFieldWithError` and the
update to the documentation object. The lines to remove are shown below.
```php
330 $this->addFieldWithError(
331 new FormulaField("(1.0 - (1.0 / (1.0 + 1000.0 * jf.catastrophe)))", "homogeneity"),
332 'catastrophe',
333 $joberrors,
334 'homogeneity_error'
335 );
336 $this->documentation['homogeneity'] = array(
337 'name'=> 'Homogeneity',
338 'units' => 'ratio',
339 'per' => 'job',
340 'visibility' => 'public',
341 'documentation' => 'A measure of how uniform the L1D load rate is over the lifetime of the job.
342 Jobs with a low homogeneity value (~0) should be investigated to check if there
343 has been a catastrophic failure during the job',
344 'batchExport' => true,
345 'dtype' => 'analysis'
346 );
```
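
Since the line numbers above are specific to one release, it may help to confirm where the homogeneity code sits in your copy of the file before deleting anything. The following is a minimal sketch in Python, not an Open XDMoD tool; the `lines_mentioning` helper is hypothetical and the sample is an excerpt of the PHP shown above, so its line numbers start at 1 rather than 330.

```python
def lines_mentioning(text, keyword):
    """Return the 1-based numbers of the lines in text that contain keyword."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if keyword in line
    ]


# Excerpt of the PHP shown above, used here only as sample input.
sample = """\
$this->addFieldWithError(
    new FormulaField("(1.0 - (1.0 / (1.0 + 1000.0 * jf.catastrophe)))", "homogeneity"),
    'catastrophe',
    $joberrors,
    'homogeneity_error'
);
$this->documentation['homogeneity'] = array(
"""

print(lines_mentioning(sample, "homogeneity"))  # [2, 5, 7]
```

Run against the contents of `JobDataset.php`, the reported line numbers should fall inside the range you intend to remove; if they do not, the file differs from the version these instructions were written for.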