DWR Monitoring, Alerting and Issue Resolution Strategy #48

gjscheer-ucd · 2019-04-24T19:31:49Z

DWR Monitoring, Alerting and Issue Resolution Strategy

Goals

Enable DWR to monitor, identify and rectify most, if not all, DWR GOES17 data issues.

Justification

Because GOES17 data is significantly larger and arrives more frequently than GOES15 data, Spatial CIMIS data processing carries a significantly higher probability of data corruption. For this reason, prematurely promoting the DWR GOES17 Spatial CIMIS processes to production / live status would unnecessarily put the team (DWR & UCD) on endless alert and could delay the delivery of ETo data to customers.

What is the strategy?

Prior to promoting the DWR GOES17 Spatial CIMIS processes to production / live status, DWR should be able to demonstrate that it can go 2 weeks without a major processing issue. During that period, DWR should also be able to address live data delivery issues in a timely manner, so as to avoid data loss and any interruption to its ETo delivery responsibilities.

To accomplish this, the following strategy should be considered:

Designate DWR personnel for the alert team
Create an alert email list populated with the alert team
DWR alert team members should set up Pushover
DWR alert team members should actively participate in the Spatial CIMIS Slack channel
Develop a monitoring system such as the UCD status page
- There should be basic monitoring of critical systems (ping, http, ssh, etc.)
- There should be monitoring of the quality of real-time data, as is done at UCD, since bad input data could produce erroneous results
Receive training on how to identify and resolve issues once an alert is sent out
- There is currently documentation at UCD on how to resolve every known issue.
- Work on better ways to handle existing issues is ongoing.
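
As a rough illustration of the "basic monitoring of critical systems" item, the sketch below checks whether TCP services such as ssh and http accept connections. It is only a minimal example under assumptions: the function names are hypothetical, the host name in the usage comment is borrowed from the status URL later in this issue, and a real setup would more likely rely on AppDynamics or a hosted monitor.

```python
import socket
from typing import Dict


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_services(host: str, ports: Dict[str, int]) -> Dict[str, bool]:
    """Map each service name to whether its port accepted a connection."""
    return {name: port_is_open(host, port) for name, port in ports.items()}


# Example (hypothetical host; only ssh and http are checked here):
# check_services("process-test", {"ssh": 22, "http": 80})
```

A result such as `{"ssh": True, "http": False}` could then be turned into an email or Pushover alert for the alert team.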

Specific resources to monitor

AppDynamics can monitor hosts and provide basic alert information such as general availability and CPU, RAM & disk usage. In addition to general availability alerts, the following specific services need monitoring with alerts.

CIMIS grb-box

Search the following file for the keywords "dsp-box down":
/home/cimis/logs/status
When the /grb filesystem reaches a certain % usage (80%?), send a warning alert.
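
The two grb-box checks could be sketched roughly as follows. This is an illustrative example, not the existing UCD tooling: the function names are hypothetical, "dsp-box down" is assumed to appear as a literal phrase in the status file, and the 80% threshold is the tentative value suggested above.

```python
import shutil
from typing import List, Optional


def find_keyword_lines(path: str, keywords: List[str]) -> List[str]:
    """Return every line of the file that contains at least one keyword."""
    hits = []
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if any(keyword in line for keyword in keywords):
                hits.append(line.rstrip("\n"))
    return hits


def usage_percent(mount_point: str) -> float:
    """Return used space as a percentage of the filesystem's total size."""
    usage = shutil.disk_usage(mount_point)
    return 100.0 * usage.used / usage.total


def filesystem_warning(mount_point: str, threshold: float = 80.0) -> Optional[str]:
    """Return a warning message if usage is at or above threshold, else None."""
    pct = usage_percent(mount_point)
    if pct >= threshold:
        return f"WARNING: {mount_point} is {pct:.1f}% full (threshold {threshold:.0f}%)"
    return None


# Example, using the path and mount point named in this issue:
# find_keyword_lines("/home/cimis/logs/status", ["dsp-box down"])
# filesystem_warning("/grb", threshold=80.0)
```

Note that this percentage can differ slightly from what `df` reports, because `df` typically excludes blocks reserved for root.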

CIMIS processor - test

Search the following file for the keywords "max = 0" and "no data":
http://process-test/status/band-2 (located in /var/www/status/band-2)
When the /apps filesystem reaches a certain % usage (80%?), send a warning alert.

CIMIS processor - prod

Search the following file for the keywords "max = 0" and "no data":
http://process-prod/status/band-2 (located in /var/www/status/band-2)
When the /apps filesystem reaches a certain % usage (80%?), send a warning alert.
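
For the test and prod processors, the keyword check on the band-2 status page might look something like the sketch below. The format of the status page is not described in this issue, so a plain substring match is an assumption, and the function names are hypothetical.

```python
from typing import List, Sequence
from urllib.request import urlopen

# Keywords named in this issue that indicate a processing problem.
KEYWORDS = ("max = 0", "no data")


def find_problems(status_text: str, keywords: Sequence[str] = KEYWORDS) -> List[str]:
    """Return the keywords that occur anywhere in the status text."""
    return [keyword for keyword in keywords if keyword in status_text]


def fetch_status(url: str, timeout: float = 10.0) -> str:
    """Download the status page and decode it as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Example, using the URLs given in this issue:
# for url in ("http://process-test/status/band-2",
#             "http://process-prod/status/band-2"):
#     problems = find_problems(fetch_status(url))
#     if problems:
#         print(f"ALERT {url}: found {problems}")
```

Keeping the fetch separate from the keyword scan means the same scan could also be run directly against the local file in /var/www/status/band-2.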

Requested Strategy

This strategy requires DWR firewall rules that allow the remote monitoring service to access ports 22, 80 and 443 on the following Spatial CIMIS servers:

| source IP | port(s) | protocol | destination IP | host |
| --- | --- | --- | --- | --- |
| see whitelist | 22,80,443 | TCP | see "Monitoring" in request log | dev |
| see whitelist | 22,80,443 | TCP | see "Monitoring" in request log | testing |
| see whitelist | 22,80,443 | TCP | see "Monitoring" in request log | prod |
| see whitelist | 22,80,443 | TCP | see "Monitoring" in request log | W Sac dsp-box |
| see whitelist | 22,80,443 | TCP | see "Monitoring" in request log | W Sac grb-box |

The public-facing IPs of all Spatial CIMIS servers must be known.

The remote monitoring service is Uptime Robot. The IPs to whitelist are listed here:
https://uptimerobot.com/locations.php
https://uptimerobot.com/inc/files/ips/IPv4.txt
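
When preparing or auditing the firewall rules, a small helper like the one below could check whether a given source address is covered by the published list. This is only a sketch: it assumes the list is plain text with one IP address or CIDR range per line, which should be verified against the actual file, and the function names are hypothetical.

```python
import ipaddress
from typing import List, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def parse_ip_list(text: str) -> List[Network]:
    """Parse one address or CIDR range per line, skipping blanks and # comments."""
    networks = []
    for raw_line in text.splitlines():
        entry = raw_line.strip()
        if not entry or entry.startswith("#"):
            continue
        # A bare address such as 192.0.2.10 becomes a single-host network.
        networks.append(ipaddress.ip_network(entry, strict=False))
    return networks


def is_whitelisted(source_ip: str, networks: List[Network]) -> bool:
    """Return True if source_ip falls inside any of the parsed networks."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in networks)


# Example with documentation-only addresses (not real Uptime Robot IPs):
# nets = parse_ip_list("192.0.2.10\n198.51.100.0/24\n")
# is_whitelisted("198.51.100.7", nets)
```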

gjscheer-ucd added the Documentation label Apr 24, 2019

gjscheer-ucd changed the title ~~DWR Alerting and High Availability Strategy~~ DWR Monitoring, Alerting and High Availability Strategy Apr 24, 2019

gjscheer-ucd changed the title ~~DWR Monitoring, Alerting and High Availability Strategy~~ DWR Monitoring, Alerting and Issue Resolution Strategy Apr 24, 2019

qjhart mentioned this issue Apr 25, 2019

Development Notes #44

Open

qjhart mentioned this issue Apr 29, 2019

DWR users monitor GOES satellite #8

Closed

2 tasks

gjscheer-ucd assigned gjscheer-ucd and ajdelmundo and unassigned gjscheer-ucd Jun 6, 2019

gjscheer-ucd assigned dafernan Sep 19, 2019
