Incident Reports

Overview

This page lists incidents impacting VRO partner teams. For incidents that disrupt the production environment, a further write-up is included: at minimum, a high-level narrative of how the incident started and how it was resolved, timestamps of significant events, and follow-up tasks. Where appropriate, troubleshooting tips and log snippets are also included.

Incidents of high severity may have a more detailed post-incident review (aka "incident postmortem") filed in Post‐incident reviews (private wiki).

Metrics on incidents can be viewed in PagerDuty or on the Metrics Page.

| Date | Severity | App(s) affected | Contributing factor(s) | GH Issue |
|------|----------|-----------------|------------------------|----------|
| 10-08-2024 | SEV 1 | BIE Kafka, VRO RabbitMQ pods in different envs | Non-prod RabbitMQ pods not starting properly due to syntax error | https://github.com/department-of-veterans-affairs/abd-vro/issues/3574 |
| 09-09-2024 | SEV 2 | EP Merge | Merge jobs not succeeding | https://github.com/department-of-veterans-affairs/abd-vro/issues/3436 |
| 07-16-2024 | SEV 4 | EP Merge | Dockerfile command incompatible with build environment | #3189 |
| 07-11-2024 | SEV 4 | EP Merge | BIP Swagger UI feature disabled beyond development cluster | #3178 |
| 07-01-2024 | SEV 1 | EP Merge | Disruption in RabbitMQ connection | |
| 06-24-2024 | SEV 3 | EP Merge | Expired Datadog credentials | |
| 05-10-2024 | SEV 1 | EP Merge | Disruption in RabbitMQ connection; expired SecRel signature for deployed version | |
| 04-22-2024 | SEV 1 | CC | Bug in Helm configuration; expired SecRel signature for deployed version | |
| 04-01-2024 | SEV 1 | EP Merge | Cluster maintenance, k8s pod eviction | |

Detailed Incident Reports

10-08-2024: VRO Team - RabbitMQ

Summary

On 10-08-2024 at 11:17 ET, Derek used the #benefits-vro-engineering channel to report that RabbitMQ was down, which was affecting the BIE Kafka app. VRO triage found logs on the non-prod RabbitMQ pods showing they weren't starting properly, but nothing specific pointed to the cause. Further investigation of ArgoCD events for the RabbitMQ pods in non-prod environments showed that the break occurred after a change to environment variables intended to capture the appropriate environments dynamically when setting up our RabbitMQ monitors in Datadog. The incident was resolved by fixing the environment variable syntax and redeploying RabbitMQ.

Stats

  • Severity: SEV 1. Core functionality of RabbitMQ was affected.
  • Time to acknowledge: 4 minutes
  • Time to resolve from notification: 35 minutes
  • Duration of outage: ~1 hr 10 minutes

Timeline

07-01-2024: EE Team - EP Merge

Summary

On 07-01-2024 18:36 ET, the EE team used the VRO support channel to report no CPU or memory utilization on the BGS dataservice provided by VRO (svc-bgs-api). EE had discovered this while troubleshooting failed jobs in EP Merge. VRO triage discovered logs on the svc-bgs-api pod showing connection failures to RabbitMQ. Further investigation showed that built-in failover processes to redeploy the pod were unsuccessful. The incident was resolved by deploying a newer build of svc-bgs-api.

The root cause is whatever led to the connection failures from svc-bgs-api to RabbitMQ; it has not been investigated. More immediately, if the failover process for svc-bgs-api had succeeded, the EP Merge app would in theory not have been affected. The failover process could not succeed because the associated image had expired from the container registry. We identify the expired image for the (at the time) current deployment as a major contributing factor in this incident.

Infrastructure monitoring shows instability of svc-bgs-api from 17:33 to 19:14 (coinciding with deployment of the newer build of svc-bgs-api). EP Merge dashboards captured two failed merge jobs in this period.

Stats

  • Severity: SEV 1. Core functionality of EP Merge was affected.
  • Time to acknowledge: 4 minutes
  • Time to resolve from notification: 38 minutes
  • Duration of outage: 1 hr 41 minutes

Timeline

07-01-2024

(all times ET)

18:36: the EE team reports an issue with svc-bgs-api. They note that the pod is reporting no CPU or memory utilization, and that an EP Merge feature is failing due to exhausting max retries to svc-bgs-api.

~18:40: VRO acknowledges the message with an emoji

18:44: VRO confirms seeing errors in the logs for svc-bgs-api, of this nature: Could not establish TCP connection to vro-rabbitmq:5672: Connection refused. VRO checks for instability in other services.
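For this class of error, a quick connectivity probe against the broker port can separate a network or DNS problem from an application-level failure. The following is a minimal sketch, not the commands used during the incident; it assumes Python with network access to the cluster service, and takes the host name and port from the error message above.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        print(f"connection to {host}:{port} failed: {exc}")
        return False

if __name__ == "__main__":
    # Host and port taken from the error seen in the svc-bgs-api logs.
    can_connect("vro-rabbitmq", 5672)
```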

18:52: VRO assesses that a redeployment of svc-bgs-api is unlikely to succeed, given the age of the underlying image (from 2 months ago); and that the best option would be to deploy svc-bgs-api with a more recent build.

19:00: VRO identifies a deployable build (31115b9) and initiates deployments, starting with lower environments

19:15: Deployment to prod finishes.

19:20: VRO verifies that infrastructure monitoring shows the CPU and memory utilization of svc-bgs-api are corrected, and that the EP Merge feature that had been affected shows evidence of successfully processing tasks dependent on svc-bgs-api.

What did not go well

  • lack of alerting on svc-bgs-api. Dashboards for the infrastructure did detect the incident; however, no alert was triggered (see the sketch after this list).
  • the status and SHA of a deployable build were not immediately known
  • process-wise: identification of the severity level was delayed, and the value of communicating on secondary channels was unclear
  • different build versions of svc-bgs-api were encountered on prod-test and prod, which reduces our scope for debugging behavior
  • delay in identifying a deployable build of svc-bgs-api. Notably, builds of the default branch for the past week did not have valid SecRel signatures; this was due to an issue with the SecRel pipeline, resolved 7/2.
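As a rough illustration of the first bullet, an alert on the observed symptom (svc-bgs-api reporting essentially no CPU) could be created through the Datadog monitors API. This is a hedged sketch only: the metric query, threshold, tags, notification handle, and the DD_API_KEY/DD_APP_KEY environment variable names are placeholders, not the team's actual monitor definition.

```python
import os
import requests

# Hypothetical monitor: alert when svc-bgs-api CPU flatlines for 10 minutes.
# Query, threshold, and notification handle are placeholders.
monitor = {
    "name": "svc-bgs-api CPU flatline (placeholder)",
    "type": "metric alert",
    "query": "avg(last_10m):avg:kubernetes.cpu.usage.total{kube_deployment:svc-bgs-api} < 1000000",
    "message": "svc-bgs-api appears idle or unresponsive. @slack-placeholder-channel",
    "options": {"thresholds": {"critical": 1000000}},
}

resp = requests.post(
    "https://api.datadoghq.com/api/v1/monitor",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        "Content-Type": "application/json",
    },
    json=monitor,
)
resp.raise_for_status()
print("created monitor id:", resp.json()["id"])
```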

What went well

  • a build of svc-bgs-api with a valid SecRel signature was available
  • EP Merge monitoring and alerting

Follow-up items

06-24-2024: EE Team - EP Merge

Summary

On 06-24-2024 at 15:18 ET, the EE team used the VRO support channel to report a stoppage in Datadog metrics submitted by the EP Merge app and asked whether an API key had changed. Joint triage between EE and VRO ensued, and VRO consulted LHDI support. The root cause was that the Datadog API key in use by the EP Merge app was no longer valid. The issue was resolved by updating the deployment environment and deploying the EP Merge app.

The scope of this incident was limited to the submission of Datadog metrics. Business functionality - for example, fetching claim and contention details, setting the temporary station of jurisdiction, canceling claims, and adding notes to claims - was not impacted, based on other logging in place on the EP Merge app.

Stats

  • Severity: SEV 3. Gap in metrics related to core functionality, although core functionality was not impacted and there was no noticeable performance degradation.
  • Time to acknowledge: 3 minutes
  • Time to resolve from notification: 4h 20 minutes

Timeline

06-24-2024

(all times ET)

15:18: the EE team reports a stoppage in Datadog metrics submitted by the EP Merge app and asks whether an API key had changed.

15:21: VRO acknowledges the message. Triage follows. EE notes that the EP Merge app appears to be completing tasks (despite not submitting corresponding metrics). VRO does not see issues in the app logs nor instability in the k8s environment (logs and stability assessed through Lens). VRO determines LHDI support is needed.

15:45: VRO reaches out to LHDI

15:50: LHDI recommends a re-deployment would resolve the issue. VRO assesses that the version currently in production likely has an expired security signature, due to its signature and deployment being >2 months ago; and that a deployment of the latest code would be more straightforward, rather than preparing the older version for re-signing.

16:04: VRO deploys the latest EP Merge app to sandbox and prod-test.

16:11: VRO requests EE’s consent for a deployment of the latest version of the code, rather than a re-deployment. EE is open to the deployment and requests time to complete regression tests. While holding for EE’s consent, VRO finds in the Vault audit trail that the API key had indeed recently changed, as EE had suspected, and that the timing corresponded to when the stoppage started. This finding increases VRO’s confidence that the root cause had been found and that a deployment would resolve the issue. VRO also determines that EP Merge is the only VRO partner app submitting Datadog metrics in production, so VRO does not need to deploy other partner apps as part of resolving this incident. (There are other instances of VRO partner apps submitting logs to Datadog; however, log submission uses a different mechanism that we observed is not disrupted by the API key change.)

16:34: VRO posts that the incident is a SEV 3.

19:00: EE consents to deploying to prod.

19:02: VRO initiates the deployment.

19:05: The deployment process finishes. VRO monitors the logs and DD dashboard, intending to wait for a DD metric to appear.

19:15: EE asks whether the API key was updated on the deployment. The VRO engineer queries the production environment and sees that the new API key was NOT in place, and realizes a step was missed in propagating the API key to the k8s namespace. They address the missed step and re-deploy EP Merge.

19:38: The deployment completes. VRO notes they will take a break and re-assess in a few hours.

21:29: VRO reports that a metric was submitted from the EP Merge app at 20:39.

06-25-2024

10:07: VRO reports that additional metrics have been submitted.

What did not go well

  • delay in stating the severity of the incident: this was done at 16:34, more than an hour after the incident was reported, and it was not communicated in the channel where the incident was actively being discussed
  • low confidence in the option to re-deploy the version running in production at the time (concern about an expired security signature), requiring time and resources to roll forward instead
  • delay in verifying that the latest code on the default branch was OK to deploy to production
  • mistakes in the deployment process: failure to propagate the API key before the initial corrective deployment attempt, and delayed detection of that failure (see the sketch after this list)
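The API-key propagation miss suggests a simple pre-deployment check: compare a fingerprint of the key stored in the namespace secret against the value expected from Vault before redeploying. Below is a minimal sketch using the Kubernetes Python client; the namespace, secret name, and key field are hypothetical placeholders (the real ones live in the VRO deployment configuration), and the Vault-side comparison is not shown.

```python
import base64
import hashlib

from kubernetes import client, config  # pip install kubernetes

# Hypothetical names; replace with the values from the VRO deployment config.
NAMESPACE = "vro-prod"
SECRET_NAME = "ep-merge-datadog"
SECRET_KEY = "DD_API_KEY"

def deployed_api_key_fingerprint() -> str:
    """Return a short fingerprint of the API key currently stored in the namespace secret."""
    config.load_kube_config()
    secret = client.CoreV1Api().read_namespaced_secret(SECRET_NAME, NAMESPACE)
    value = base64.b64decode(secret.data[SECRET_KEY])
    # Compare fingerprints rather than printing the key itself.
    return hashlib.sha256(value).hexdigest()[:12]

if __name__ == "__main__":
    print("deployed key fingerprint:", deployed_api_key_fingerprint())
```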

What went well

  • VRO engineer was equipped with appropriate resources (tools, permissions, and documentation)
  • the latest code was primed for deployment (it had completed SecRel signature signing)

05-10-2024: EP Merge - BGS Service down and issues with image signatures

On-call PoC: Teja

On 05-10-2024, the EE team identified an issue where the BGS service wasn't functional, causing the EP Merge application to experience downtime. This was reported at 11:04 AM ET, with BGS requests stopped until 11:10 AM. This occurred because the RabbitMQ service restarted, which caused the BGS service to error. The LHDI team has a policy of expiring images after a certain number of days; consequently, a process is needed to ensure the continuous deployment of images before their security signatures expire. Cory Sohrakoff mentioned that it was possible the external SOAP BGS service had also stopped working and wasn't responding properly to requests; VRO will need additional processes to monitor the BGS service better.

Action Taken:

  • The BGS service was redeployed using the latest SecRel-signed image

Ticket Creation:

  • A ticket was created to monitor the BGS service: VRO Issue #2980
  • A ticket was created to continuously deploy all services to prod even when there are no changes, so that services have unexpired images associated with them: VRO Issue #2976

Root Cause Analysis:

It was identified that the expired BGS image signature caused the pod to fail to restart, leading to the application downtime.

Proposed Solution: A process needs to be established to continuously deploy images before their security signatures expire.

BGS service was potentially down and VRO wasn't aware:

Proposed Solution: A better monitoring system needs to be in place to verify whether the BGS service is functional.

Image Expiration Update:

Previous conclusions about image expiration policies were incorrect. Image signatures never actually expire; rather, LHDI's signature requirements demand that containers have been signed by active keys. In other words, when a signing key expires and rotates out, there is a rotation period during which both new and old keys are accepted. Once the rotation period elapses, LHDI stops recognizing signatures from the expired key, and those keys are decommissioned. Thus, the outage described here requires a combination of pods becoming unresponsive and their attempted replacements having been signed by expired keys. With signing keys rotated every 6 months, we need to apply the following:

  1. VRO images need to be signed with the active key for LHDI to host them.
  2. When keys rotate, VRO needs to re-sign any images that we expect to host.
  3. Ensure that we've signed all active images with the updated keys as soon as they become available. This third step is the tricky part: pay attention to key rotation announcements from LHDI. These will be communicated ahead of time to give teams a chance to prepare during the preparation period, when both new and old keys are recognized (see the sketch after this list).
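One way to make the third step routine is a periodic verification pass over the currently deployed images, flagging any whose signatures no longer verify against the active key so they can be re-signed and redeployed. The sketch below is illustrative only: it assumes cosign-style signing (the actual SecRel tooling and key locations are not documented on this page), and the image list and key path are placeholders.

```python
import subprocess

# Hypothetical inputs: the real image list and active public key come from the
# SecRel/LHDI tooling, which is not documented on this page.
ACTIVE_PUBLIC_KEY = "cosign-active.pub"
DEPLOYED_IMAGES = [
    "ghcr.io/department-of-veterans-affairs/abd-vro/svc-bgs-api:placeholder-tag",
]

def needs_resigning(image: str) -> bool:
    """Return True if the image's signature does not verify against the active key."""
    result = subprocess.run(
        ["cosign", "verify", "--key", ACTIVE_PUBLIC_KEY, image],
        capture_output=True,
        text=True,
    )
    return result.returncode != 0

for image in DEPLOYED_IMAGES:
    if needs_resigning(image):
        print(f"re-sign and redeploy: {image}")
```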
04-22-2024: CC Team

On 04-22-2024 at 8:10 AM MT, the VRO team received a message in our support channel from a member of the CC team that the CC API service was down in the production environment. They had noticed that something wasn't right and were reviewing Datadog health check logs as a result. They found that the last entry indicated the application was shutting down, followed by silence in the logs.

After acknowledging their message, both the primary and secondary on-call engineers investigated the outage as top priority. During this initial troubleshooting, it was quickly identified that the CC service's Helm configuration was unexpectedly broken. A known issue (a typo in the Helm config) that was present in the develop branch of abd-vro had somehow been deployed to prod. Additionally, the deployed version referenced an image that had not been signed by SecRel and therefore would never work in the production environment. This combination of broken Helm config and incorrect image reference was preventing redeployment via the GitHub Actions workflows that normally perform the deployment.

To bring things back online, Erik applied the needed Helm config fix and found the most recently signed CC image. With both of these in hand, he ran the Helm upgrade locally. This approach let the VRO team apply the patched Helm config and specify a SecRel-passing CC image without waiting on GitHub Actions to complete, thereby recovering from the outage sooner.

Datadog HTTP status monitoring of the CC service shows that the outage started on Apr 19, 2024 at 11:52 AM and that the service returned to a fully operational state on Apr 22, 2024 at 11:34:43 AM.

Further root cause analysis will be performed in issue #2883. The VRO team also plans to develop an incident response plan, as described in #2570.

04-01-2024: EE Team - EP Merge

On 04-01-2024 at 3:38 PM PT, the VRO team received a message in our support channel from a member of the EE team that their EP Merge application was down in the production environment. They had been alerted by a Datadog health check they had set up.

We had received an alert in our alerts channel at 3:30 PM the same day that there were unavailable pods in prod. However, this alert alone does not necessarily indicate that an application will experience downtime. These alerts are most often triggered by cluster changes made by the LHDI team, which cause many new instances of pods to begin spinning up. Kubernetes waits for these new pods to report a ready status before evicting older versions of the pods. Once all of these pods report ready status, the alert is often resolved without any action from VRO engineers. So while we were alerted to a potential issue before our partner team engineer reported it, it was not clear that their application had begun to experience downtime.
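Distinguishing routine eviction churn from real downtime comes down to whether replacement pods actually reach a ready state. As a rough aid (not the team's actual tooling), a check over the namespace with the Kubernetes Python client could report pods whose Ready condition is not true; the namespace name below is a placeholder.

```python
from kubernetes import client, config  # pip install kubernetes

NAMESPACE = "vro-prod"  # placeholder; the real namespace comes from the LHDI setup

def not_ready_pods(namespace: str) -> list[str]:
    """Return names of pods in the namespace whose Ready condition is not True."""
    config.load_kube_config()
    pods = client.CoreV1Api().list_namespaced_pod(namespace)
    flagged = []
    for pod in pods.items:
        conditions = pod.status.conditions or []
        ready = any(c.type == "Ready" and c.status == "True" for c in conditions)
        if not ready:
            flagged.append(f"{pod.metadata.name} (phase={pod.status.phase})")
    return flagged

if __name__ == "__main__":
    for entry in not_ready_pods(NAMESPACE):
        print(entry)
```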

A discovery ticket was created (https://github.com/department-of-veterans-affairs/abd-vro/issues/2816), along with a follow-up ticket for execution. These tickets focus on how to ensure pod evictions are handled gracefully so that applications hosted on the VRO platform can experience greater stability.
