Troubleshooting monitoring alerts
This alert indicates no new backup has been uploaded to Azure Blob Storage for a while.
Troubleshooting
- Check that the Jenkins job `*_Backup` has run and succeeded within the expected duration
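To confirm what has actually reached storage, recent blobs can be listed directly with the Azure CLI; a minimal sketch, assuming the placeholder account and container names below are replaced with the real ones:

```sh
# List backup blobs with their last-modified timestamps to spot a stale upload.
# <storage-account> and <backup-container> are placeholders.
az storage blob list \
  --account-name <storage-account> \
  --container-name <backup-container> \
  --query "[].{name:name, modified:properties.lastModified}" \
  --output table
```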
This alert indicates the size of the backup file uploaded to Azure Blob Storage is less than the expected size. This can happen if the backup script produces a dummy file due to a failure, as happened in the GitLab database incident.
Troubleshooting
- Check that the Jenkins job `*_Backup` has run and succeeded without errors or warnings that might produce a dummy backup file
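Blob sizes can be checked the same way; a sketch with placeholder names, where a dummy file shows up as an unusually small size:

```sh
# List backup blobs with their sizes in bytes.
az storage blob list \
  --account-name <storage-account> \
  --container-name <backup-container> \
  --query "[].{name:name, bytes:properties.contentLength}" \
  --output table
```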
This alert indicates one of the nodes in the Docker swarm is down. This can happen if the server running Docker crashes (likely) or loses connectivity to the swarm (less likely).
Troubleshooting
- SSH into the Docker swarm manager and check which node is down by executing `docker node ls`. Try to SSH into the node which is down and see if it can ping the swarm manager
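For reference, a typical check sequence from the swarm manager; the hostnames below are examples:

```sh
# On the swarm manager: list nodes; a failed node shows STATUS "Down".
docker node ls
# ID             HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
# abc123... *    swarm-m1   Ready    Active         Leader
# def456...      swarm-a1   Down     Active
# Try to reach the down node and check connectivity back to the manager:
ssh swarm-a1
ping -c 3 swarm-m1
```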
Action Items
- If the node is shut down, restart the server (from the Azure portal)
- Follow the instructions in https://github.com/project-sunbird/sunbird-devops/wiki/Docker-swarm-troubleshooting#2-docker-swarm-worker-node-is-down
This alert indicates logs are not flowing to the central log aggregation server (logs Elasticsearch).
Troubleshooting
- Search for logs in Kibana -> Discover -> Search `*`. On the left side panel, click on the `program` filter to check whether logs are not flowing for a particular service or for all services
Action Items
This alert indicates a container is using over 90% of its memory. If it reaches the memory limit configured in Docker swarm, the container will get restarted due to an OutOfMemory error.
Troubleshooting
- Open the `Container Details` dashboard in Grafana
- Select the time range during which the alert was generated
- Select the container of interest in all the graphs in this dashboard
- The "Memory Usage" and "Memory Usage %" graphs show memory usage over time (a command-line spot check follows this list)
Action Items
- If this is unexpected behaviour for this application, debug the application for memory leak issues
- If high memory usage is expected across all environments, increase the default memory limit for this service in the Ansible role `defaults/main.yml` in the public repo
- If high memory usage is expected in a specific environment, increase the memory limit for this service in the inventory `group_vars` of that environment (a temporary command-line workaround follows this list)
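As a stopgap while the Ansible change is rolled out, the limit can also be raised on the running service from the swarm manager; a sketch, with an example service name and size:

```sh
# Temporarily raise the memory limit for one service; the permanent fix
# belongs in the Ansible defaults or group_vars as described above.
docker service update --limit-memory 1536M <service-name>
```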
This alert indicates the exporter configured for scraping metrics is not reachable by Prometheus. This usually means the exporter is not running due to an issue.
Troubleshooting
- Check service info following the steps in Docker swarm management commands
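A quick way to confirm the exporter's state from the swarm manager; a sketch, with the service name as a placeholder:

```sh
# Is the exporter service running, and why did its tasks stop?
docker service ls --filter name=exporter
docker service ps <exporter-service-name> --no-trunc
```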
This alert indicates a service is down because the URL or TCP port of the service is not reachable.
Troubleshooting
- Open the `Availability` dashboard in Grafana
- Select the time range during which the alert was generated
- Select the URL of interest in all the graphs in this dashboard
- The "Availability" graph shows up (value 1) and down (value 0) status for the URL
- The "Probe Time" graph shows the time taken to access the URL
- The "Probe Status Code" graph shows the HTTP status code for the URL. Status code `0` is returned when a URL is unreachable or the connection timed out after 5 seconds (the configured timeout value for URLs)
- Search for logs of this service in Kibana -> Discover -> Search `program: "*<service-name>*"`. Check if there are any errors related to the service. Example search query: `program: "*learner-service*"`
- If the logs do not have enough details, check service info following the steps in Docker swarm management commands (see the curl sketch after this list for reproducing the probe by hand)
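To reproduce the probe manually, curl can report the status code and timing directly; a sketch, with a placeholder URL and the 5-second probe timeout mirrored:

```sh
# Prints the HTTP status code and total time; --max-time mirrors the 5s probe timeout.
curl -sS -o /dev/null -w "%{http_code} in %{time_total}s\n" \
  --max-time 5 https://<domain>/<service-endpoint>
```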
Action Items
- Follow the instructions in https://github.com/project-sunbird/sunbird-devops/wiki/Docker-swarm-troubleshooting#1-service-not-starting-or-gets-killed-often
This alert indicates the Docker swarm is not able to launch all replicas for this service.
Troubleshooting
- Check service info following the steps in Docker swarm management commands
- Search for logs of this service in Kibana -> Discover -> Search `program: "*<service-name>*"`. Check if there are any errors related to the service. Example search query: `program: "*learner-service*"` (a command-line sketch for inspecting replicas follows this list)
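The swarm manager can report task state and recent output for the affected service; a sketch, with the service name as a placeholder:

```sh
# Task-level view: the CURRENT STATE and ERROR columns show why replicas aren't starting.
docker service ps <service-name> --no-trunc
# Recent output from the service's tasks:
docker service logs --tail 100 <service-name>
```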
Action Items
- Follow the instructions in https://github.com/project-sunbird/sunbird-devops/wiki/Docker-swarm-troubleshooting#1-service-not-starting-or-gets-killed-often
This alert indicates the service status has been flapping between up and down for the specified time.
Troubleshooting
- Same as service_down
This alert indicates the Docker swarm is not able to launch all replicas for this service.
Troubleshooting
- Check service info following the steps in Docker swarm management commands
- Search for logs of this service in Kibana -> Discover -> Search `program: "*<service-name>*"`. Check if there are any errors related to the service. Example search query: `program: "*learner-service*"`
This alert indicates too many responses with 5xx HTTP status codes (server-side errors) were logged in the proxy (nginx) logs.
Troubleshooting
- Search for proxy logs in Kibana -> Discover -> Search `program: *proxy* AND (message: "HTTP/1.1 500" OR message: "HTTP/1.1 503")`. In the log `message` field, check the request paths for which these error logs are generated
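If Kibana is unavailable, the proxy logs can also be grepped for 5xx responses from the swarm manager; a sketch, with the proxy service name as a placeholder:

```sh
# Pull recent proxy logs and keep only lines with 5xx status codes.
docker service logs --tail 500 <proxy-service-name> 2>&1 | grep -E 'HTTP/1\.1" 5[0-9]{2}'
```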