Troubleshooting monitoring alerts
This alert indicates no new backup has been uploaded to Azure Blob Storage for a while.
Troubleshooting
- Check that the Jenkins job `*_Backup` has run and succeeded within the expected duration
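To confirm what has actually reached storage, recent blobs can be listed directly with the Azure CLI; a minimal sketch, assuming the placeholder account and container names below are replaced with the real ones:

```sh
# List backup blobs with their last-modified timestamps to spot a stale upload.
# <storage-account> and <backup-container> are placeholders.
az storage blob list \
  --account-name <storage-account> \
  --container-name <backup-container> \
  --query "[].{name:name, modified:properties.lastModified}" \
  --output table
```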
This alert indicates the size of the backup file uploaded to Azure Blob Storage is less than the expected size. This can happen if the backup script produces a dummy file due to a failure, as happened in the GitLab database incident.
Troubleshooting
- Check that the Jenkins job `*_Backup` has run and succeeded without errors or warnings that might produce a dummy backup file
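Blob sizes can be checked the same way; a sketch with placeholder names, where a dummy file shows up as an unusually small size:

```sh
# List backup blobs with their sizes in bytes.
az storage blob list \
  --account-name <storage-account> \
  --container-name <backup-container> \
  --query "[].{name:name, bytes:properties.contentLength}" \
  --output table
```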
This alert indicates one of the nodes in the Docker swarm is down. This can happen if the server running Docker crashes (likely) or loses connectivity to the swarm (less likely).
Troubleshooting
- SSH into the Docker swarm manager and check which node is down by executing `docker node ls`. Try to SSH into the node which is down and see if it can ping the swarm manager
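For reference, a typical check sequence from the swarm manager; the hostnames below are examples:

```sh
# On the swarm manager: list nodes; a failed node shows STATUS "Down".
docker node ls
# ID             HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
# abc123... *    swarm-m1   Ready    Active         Leader
# def456...      swarm-a1   Down     Active
# Try to reach the down node and check connectivity back to the manager:
ssh swarm-a1
ping -c 3 swarm-m1
```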
Action Items
- If the node is shut down, restart the server (from the Azure portal)
- Follow the instructions in https://github.com/project-sunbird/sunbird-devops/wiki/Docker-swarm-troubleshooting#2-docker-swarm-worker-node-is-down
This alert indicates logs are not flowing to the central log aggregation server (logs Elasticsearch).
Troubleshooting
- Search for logs in Kibana -> Discover -> Search `*`. On the left side panel, click on the `program` filter to check whether logs are not flowing for a particular service or for all services
Action Items
This alert indicates a container is using over 90% of its memory. If it reaches the memory limit configured in Docker swarm, the container will get restarted due to an OutOfMemory error.
Troubleshooting
- Open the `Container Details` dashboard in Grafana
- Select the time range during which the alert was generated
- Select the container of interest in all the graphs in this dashboard
- The "Memory Usage" and "Memory Usage %" graphs show memory usage over time (a command-line spot check follows this list)
Action Items
- If this is unexpected behaviour for this application, debug the application for memory leak issues
- If high memory usage is expected across all environments, increase the default memory limit for this service in the Ansible role `defaults/main.yml` in the public repo
- If high memory usage is expected in a specific environment, increase the memory limit for this service in the inventory `group_vars` of that environment (a temporary command-line workaround follows this list)
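As a stopgap while the Ansible change is rolled out, the limit can also be raised on the running service from the swarm manager; a sketch, with an example service name and size:

```sh
# Temporarily raise the memory limit for one service; the permanent fix
# belongs in the Ansible defaults or group_vars as described above.
docker service update --limit-memory 1536M <service-name>
```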
This alert indicates the exporter configured for scraping metrics is not reachable by Prometheus. This usually means the exporter is not running due to an issue.
Troubleshooting
- Check service info following the steps in Docker swarm management commands
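A quick way to confirm the exporter's state from the swarm manager; a sketch, with the service name as a placeholder:

```sh
# Is the exporter service running, and why did its tasks stop?
docker service ls --filter name=exporter
docker service ps <exporter-service-name> --no-trunc
```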
This alert indicates a service is down because the URL or TCP port of the service is not reachable.
Troubleshooting
- Open the `Availability` dashboard in Grafana
- Select the time range during which the alert was generated
- Select the URL of interest in all the graphs in this dashboard
- The "Availability" graph shows up (value 1) and down (value 0) status for the URL
- The "Probe Time" graph shows the time taken to access the URL
- The "Probe Status Code" graph shows the HTTP status code for the URL. Status code `0` is returned when a URL is unreachable or the connection timed out after 5 seconds (the configured timeout value for URLs)
- Search for logs of this service in Kibana -> Discover -> Search `program: "*<service-name>*"`. Check if there are any errors related to the service. Example search query: `program: "*learner-service*"`
- If the logs do not have enough details, check service info following the steps in Docker swarm management commands (see the curl sketch after this list for reproducing the probe by hand)
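To reproduce the probe manually, curl can report the status code and timing directly; a sketch, with a placeholder URL and the 5-second probe timeout mirrored:

```sh
# Prints the HTTP status code and total time; --max-time mirrors the 5s probe timeout.
curl -sS -o /dev/null -w "%{http_code} in %{time_total}s\n" \
  --max-time 5 https://<domain>/<service-endpoint>
```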
Action Items
- Follow the instructions in https://github.com/project-sunbird/sunbird-devops/wiki/Docker-swarm-troubleshooting#1-service-not-starting-or-gets-killed-often
This alert indicates the Docker swarm is not able to launch all replicas for this service.
Troubleshooting
- Check service info following the steps in Docker swarm management commands
- Search for logs of this service in Kibana -> Discover -> Search `program: "*<service-name>*"`. Check if there are any errors related to the service. Example search query: `program: "*learner-service*"` (a command-line sketch for inspecting replicas follows this list)
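The swarm manager can report task state and recent output for the affected service; a sketch, with the service name as a placeholder:

```sh
# Task-level view: the CURRENT STATE and ERROR columns show why replicas aren't starting.
docker service ps <service-name> --no-trunc
# Recent output from the service's tasks:
docker service logs --tail 100 <service-name>
```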
Action Items
- Follow the instructions in https://github.com/project-sunbird/sunbird-devops/wiki/Docker-swarm-troubleshooting#1-service-not-starting-or-gets-killed-often
This alert indicates the service status has been flapping between up and down for the specified time.
Troubleshooting
- Same as service_down
This alert indicates the Docker swarm is not able to launch all replicas for this service.
Troubleshooting
- Check service info following the steps in Docker swarm management commands
- Search for logs of this service in Kibana -> Discover -> Search `program: "*<service-name>*"`. Check if there are any errors related to the service. Example search query: `program: "*learner-service*"`
This alert indicates too many responses with 5xx HTTP status codes (server-side errors) were logged in the proxy (nginx) logs.
Troubleshooting
- Search for proxy logs in Kibana -> Discover -> Search `program: *proxy* AND (message: "HTTP/1.1 500" OR message: "HTTP/1.1 503")`. In the log `message` field, check the request paths for which these error logs are generated
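If Kibana is unavailable, the proxy logs can also be grepped for 5xx responses from the swarm manager; a sketch, with the proxy service name as a placeholder:

```sh
# Pull recent proxy logs and keep only lines with 5xx status codes.
docker service logs --tail 500 <proxy-service-name> 2>&1 | grep -E 'HTTP/1\.1" 5[0-9]{2}'
```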