This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

DWR Monitoring, Alerting and Issue Resolution Strategy #48

Open
gjscheer-ucd opened this issue Apr 24, 2019 · 0 comments

gjscheer-ucd commented Apr 24, 2019

DWR Monitoring, Alerting and Issue Resolution Strategy

Goals

Enable DWR to monitor, identify and rectify most, if not all, of the DWR GOES17 data issues.

Justification

Because GOES17 data is significantly larger and arrives more frequently than GOES15 data, data processing for Spatial CIMIS carries a significantly higher probability of data corruption. For this reason, prematurely promoting the DWR GOES17 Spatial CIMIS processes to production / live status would unnecessarily put the team (DWR & UCD) on constant alert and could delay the delivery of ETo data to customers.

What is the strategy?

Prior to promoting the DWR GOES17 Spatial CIMIS processes to production / live status, DWR should be able to demonstrate that it can go 2 weeks without a major processing issue. During that period DWR should also show that it can address live data delivery issues in a timely manner, so that data loss and interruptions to its ETo delivery responsibilities are avoided.

To accomplish this, the following strategy should be considered:

  • Designate DWR personnel for the alert team
  • Create an alert email list populated with the alert team
  • DWR alert team members should set up Pushover
  • DWR alert team members should actively participate in the Spatial CIMIS Slack channel
  • Develop a monitoring system such as the UCD status page
    • There should be basic monitoring of critical systems (ping, http, ssh, etc.)
    • There should be monitoring of the quality of real-time data, since bad input could produce erroneous output, as is done at UCD
  • Receive training on how to identify and resolve issues once an alert is sent out.
    • Currently there is documentation on how to resolve every known issue at UCD.
    • Work on solutions to better handle existing issues is ongoing.
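The basic checks named in the list above (ping, http, ssh) could be approximated with a plain TCP connection test. This is a minimal sketch only, not how the UCD status page works; the function names and the service-to-port map are illustrative assumptions. A TCP connect stands in for ping here because raw ICMP usually needs elevated privileges.

```python
import socket
from typing import Dict


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_services(host: str, ports: Dict[str, int]) -> Dict[str, bool]:
    """Map each service name to whether its port accepted a connection."""
    return {name: tcp_port_open(host, port) for name, port in ports.items()}
```

A call such as `check_services("some-host", {"ssh": 22, "http": 80, "https": 443})` would return one boolean per service, where `some-host` is a placeholder for a real server name. A successful connect only shows that the port is reachable, not that the service behind it is healthy.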

Specific resources to monitor

AppDynamics can monitor and provide basic host alert information such as general availability and CPU, RAM & disk usage. In addition to general availability alerts, the following specific services need monitoring with alerts.

CIMIS grb-box

  • search for the keywords dsp-box down in the following file:
    /home/cimis/logs/status
  • send a warning alert when filesystem /grb reaches a set usage threshold (80% ?)
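The two grb-box checks above could be sketched as follows. This assumes the keywords form the single phrase `dsp-box down` and uses the tentative 80% threshold from the list; the function names are hypothetical, and this is not an existing Spatial CIMIS script.

```python
import shutil
from pathlib import Path


def find_keyword_lines(text: str, keyword: str) -> list:
    """Return the lines of text that contain keyword (case-insensitive)."""
    needle = keyword.lower()
    return [line for line in text.splitlines() if needle in line.lower()]


def disk_usage_percent(path: str) -> float:
    """Return used space as a percentage of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def grb_box_alerts(status_file: str, mount: str, threshold: float = 80.0) -> list:
    """Collect alert messages for the status file and the given filesystem."""
    alerts = []
    text = Path(status_file).read_text(errors="replace")
    hits = find_keyword_lines(text, "dsp-box down")  # assumed phrase
    if hits:
        alerts.append("status file reports: " + hits[-1].strip())
    pct = disk_usage_percent(mount)
    if pct >= threshold:
        alerts.append("%s is at %.1f%% usage" % (mount, pct))
    return alerts
```

On the grb-box this might be called as `grb_box_alerts("/home/cimis/logs/status", "/grb")`. The percentage computed here can differ slightly from what `df` prints, because `df` accounts for reserved blocks differently.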

CIMIS processor - test

  • search for the keywords max = 0 and no data in the following file:
    http://process-test/status/band-2 (located in /var/www/status/band-2)
  • send a warning alert when filesystem /apps reaches a set usage threshold (80% ?)

CIMIS processor - prod

  • search for the keywords max = 0 and no data in the following file:
    http://process-prod/status/band-2 (located in /var/www/status/band-2)
  • send a warning alert when filesystem /apps reaches a set usage threshold (80% ?)
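For the test and prod processors, the keyword search on the band-2 status page could look roughly like this. It is a sketch under assumptions: the page is taken to be plain text served over HTTP, the two keywords are matched case-insensitively as substrings, and the function names are made up for illustration.

```python
import urllib.request

# Keywords from the list above that indicate a problem on the status page.
KEYWORDS = ("max = 0", "no data")


def matched_keywords(page_text: str, keywords=KEYWORDS) -> list:
    """Return the keywords that occur in page_text (case-insensitive)."""
    lowered = page_text.lower()
    return [k for k in keywords if k.lower() in lowered]


def fetch_status(url: str, timeout: float = 10.0) -> str:
    """Download the status page and decode it as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def check_band_status(url: str) -> list:
    """Fetch the status page at url and return any problem keywords found."""
    return matched_keywords(fetch_status(url))
```

Calling `check_band_status("http://process-test/status/band-2")` or the prod equivalent would return an empty list when neither keyword is present. A failed fetch raises an exception, which a real monitor would probably want to treat as an alert in its own right.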

Requested Strategy

This strategy requires DWR firewall rules that allow the remote monitoring service to access ports 22, 80 and 443 on the following Spatial CIMIS servers:

source IP | port(s) | protocol | destination IP | host
see whitelist | 22,80,443 | TCP | see "Monitoring" in request log | dev
see whitelist | 22,80,443 | TCP | see "Monitoring" in request log | testing
see whitelist | 22,80,443 | TCP | see "Monitoring" in request log | prod
see whitelist | 22,80,443 | TCP | see "Monitoring" in request log | W Sac dsp-box
see whitelist | 22,80,443 | TCP | see "Monitoring" in request log | W Sac grb-box

The public-facing IPs of all Spatial CIMIS servers must be known.

The remote monitoring service is Uptime Robot. The IPs to whitelist are listed here:
https://uptimerobot.com/locations.php
https://uptimerobot.com/inc/files/ips/IPv4.txt
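Once the whitelist has been downloaded, checking a source address against it could be sketched as below. This assumes the file contains one IPv4 address or CIDR range per line, which should be verified against the actual file; the function names are hypothetical.

```python
import ipaddress


def parse_whitelist(text: str) -> list:
    """Parse one address or CIDR range per line into network objects."""
    networks = []
    for raw in text.splitlines():
        entry = raw.strip()
        if not entry or entry.startswith("#"):
            continue  # skip blank lines and comment lines
        # strict=False accepts entries that have host bits set.
        networks.append(ipaddress.ip_network(entry, strict=False))
    return networks


def is_whitelisted(source_ip: str, networks: list) -> bool:
    """Return True if source_ip falls inside any whitelisted network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in networks)
```

A single address such as `198.51.100.7` is parsed as a /32 network, so bare addresses and ranges can be handled the same way. The addresses used here are from documentation ranges and are not Uptime Robot addresses.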

@gjscheer-ucd changed the title from "DWR Alerting and High Availability Strategy" to "DWR Monitoring, Alerting and High Availability Strategy" on Apr 24, 2019
@gjscheer-ucd changed the title from "DWR Monitoring, Alerting and High Availability Strategy" to "DWR Monitoring, Alerting and Issue Resolution Strategy" on Apr 24, 2019