EC2 - Association-Notification-Events for AWS-GatherSoftwareInventory shows StuckAtInProgress/Failed status, but Console reports Success (which is correct) #584

Open
rgoltz opened this issue Sep 1, 2024 · 0 comments

Comments

@rgoltz
rgoltz commented Sep 1, 2024

Describe the Setup & the Bug

  • We run an association via State Manager that uses the AWS-managed document AWS-GatherSoftwareInventory. The association is scheduled with a rate expression of every 12 hours.
  • We use an EventBridge rule that listens for the pattern source => aws.ssm + detail-status => Failed. Each time an association state is Failed, we are notified via SNS/mail.
  • Looking through the UpdateInstanceAssociationStatus CloudTrail events, we found that the association was reported as failing on a variety of instances (Windows and Linux); for details see "Current Behavior". This triggers the EventBridge rule.
  • In the AWS Console, the association shows status "Success" in the execution history, which does not match the "Failed" status from the event. Furthermore, we don't see any errors from applying the document.
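
For readers who want to reproduce the setup, the rule described above could correspond to an event pattern roughly like the one below. This is only a sketch, not our exact rule: the `detail.status` key is an assumption about how "detail-status => Failed" maps onto the event payload, and `build_failed_pattern` is a hypothetical helper name.

```python
import json


def build_failed_pattern() -> dict:
    """Event pattern for SSM events whose detail status is Failed.

    Assumption: the status is carried under detail.status in the event.
    """
    return {
        "source": ["aws.ssm"],
        "detail": {"status": ["Failed"]},
    }


if __name__ == "__main__":
    # Print the JSON that would go into the rule's event pattern field.
    print(json.dumps(build_failed_pattern(), indent=2))
```

Because the pattern only looks at the source and the status, every Failed event from any association triggers the notification, including the StuckAtInProgress ones described under "Current Behavior".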

Current Behavior

  • Although the association state currently shows 'Success' in the Console on each instance, the events suggest that the association is failing intermittently.
  • Checking the UpdateInstanceAssociationStatus events (occurring multiple times per instance at the same point in time), we see the error details StuckAtInProgress / "Association stuck at InProgress for longer than 2 hours" in at least one failed event for AWS-GatherSoftwareInventory:
<snip>
    "eventTime": "2024-08-23T15:23:17Z",
    "eventSource": "ssm.amazonaws.com",
    "eventName": "UpdateInstanceAssociationStatus",
    "awsRegion": "eu-central-1",
    "sourceIPAddress": "1.2.3.4",
    "userAgent": "aws-sdk-go/1.51.20 (go1.21.11; windows; amd64) amazon-ssm-agent/",
    "requestParameters": {
        "associationId": "cdf20e5a-1234-abcd-4321-11223344rogo",
        "instanceId": "i-abcd1234abcd1234",
        "executionResult": {
            "executionDate": "Aug 23, 2024, 3:23:17 PM",
            "status": "Failed",
            "executionSummary": "Association stuck at InProgress for longer than 2 hours",
            "errorCode": "StuckAtInProgress"
        }
    },
    "responseElements": null,
    "requestID": "1...",
    "eventID": "2...",
    "readOnly": false,
    "resources": [
        {
            "accountId": "1234567890",
            "ARN": "arn:aws:ssm:eu-central-1:1234567890:association/cdf20e5a-1234-abcd-4321-11223344rogo"
        },
        {
            "accountId": "1234567890",
            "ARN": "arn:aws:ec2:eu-central-1:1234567890:instance/i-abcd1234abcd1234"
        }
    ],
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "eventCategory": "Management",
<snip>

Checking the local errors.log of SSM Agent at the same time of the event, we see:
(Windows - Server 2016 Datacenter Build 14393, SSM Agent Version: 3.3.551.0)

2024-08-23 17:23:17 ERROR [runScheduledAssociation @ processor.go.305] [ssm-agent-worker] [MessageService] [Association] Association stuck at InProgress for longer than 2 Hours

(Linux - Amazon Linux 2, SSM Agent Version: 3.3.380.0)

2024-08-22 04:57:28 ERROR [runScheduledAssociation @ processor.go.313] [ssm-agent-worker] [MessageService] [Association] Association stuck at InProgress for longer than 2 Hours
  • We reviewed this behavior with the SSM team (Ethan) and uploaded the full local logs and the full events to case 172380074600709.
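
To separate these events from other failures when going through exported CloudTrail records, a small check like the following might help. It is a sketch that relies only on the fields visible in the event excerpt above (`eventName`, `requestParameters.executionResult.status` and `errorCode`); the function name `is_stuck_in_progress` and the trimmed `sample` record are made up for illustration.

```python
from typing import Any, Dict


def is_stuck_in_progress(record: Dict[str, Any]) -> bool:
    """True if a CloudTrail record is an UpdateInstanceAssociationStatus
    call that reports Failed with errorCode StuckAtInProgress."""
    if record.get("eventName") != "UpdateInstanceAssociationStatus":
        return False
    # requestParameters can be null in CloudTrail records, so fall back to {}.
    params = record.get("requestParameters") or {}
    result = params.get("executionResult") or {}
    return (
        result.get("status") == "Failed"
        and result.get("errorCode") == "StuckAtInProgress"
    )


# Trimmed copy of the event excerpt shown above.
sample = {
    "eventName": "UpdateInstanceAssociationStatus",
    "requestParameters": {
        "associationId": "cdf20e5a-1234-abcd-4321-11223344rogo",
        "instanceId": "i-abcd1234abcd1234",
        "executionResult": {
            "status": "Failed",
            "executionSummary": "Association stuck at InProgress for longer than 2 hours",
            "errorCode": "StuckAtInProgress",
        },
    },
}

if __name__ == "__main__":
    print(is_stuck_in_progress(sample))
```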

Expected Behavior:

Workaround:

  • For now, whenever we receive a notification that the AWS-GatherSoftwareInventory association has failed, we first verify its status in the State Manager Console. If the association shows as Success, we have to ignore the notification. This is annoying, repetitive daily work.
  • Furthermore, we are trying to use anything-but matching in the rule to filter out the associationId of the AWS-GatherSoftwareInventory association. In effect, we have disabled event-based notification for this association completely for now.
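
The anything-but workaround can be sketched as follows. Several things here are assumptions rather than our exact configuration: the detail keys `status` and `association-id`, the helper names `build_pattern` and `matches`, and the association IDs in the usage lines. The `matches` function is a toy local matcher that covers only exact values and `anything-but` lists, so the pattern can be tried out without deploying a rule; it is not the EventBridge implementation, and it treats a missing field as a non-match.

```python
from typing import Any, Dict, List


def build_pattern(excluded_association_ids: List[str]) -> Dict[str, Any]:
    """Pattern for Failed SSM events, except those from the excluded associations."""
    return {
        "source": ["aws.ssm"],
        "detail": {
            "status": ["Failed"],
            # Assumed key name for the association ID in the event detail.
            "association-id": [{"anything-but": excluded_association_ids}],
        },
    }


def _value_matches(value: Any, candidates: List[Any]) -> bool:
    """A leaf matches if any candidate accepts the value."""
    for cand in candidates:
        if isinstance(cand, dict) and "anything-but" in cand:
            # Simplification: a missing field (None) never matches.
            if value is not None and value not in cand["anything-but"]:
                return True
        elif value == cand:
            return True
    return False


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Toy matcher: every key in the pattern must match the event."""
    for key, expected in pattern.items():
        if isinstance(expected, dict):
            sub = event.get(key)
            if not isinstance(sub, dict) or not matches(expected, sub):
                return False
        elif not _value_matches(event.get(key), expected):
            return False
    return True


if __name__ == "__main__":
    pattern = build_pattern(["inventory-association-id"])  # hypothetical ID
    inventory_event = {
        "source": "aws.ssm",
        "detail": {"status": "Failed", "association-id": "inventory-association-id"},
    }
    print(matches(pattern, inventory_event))
```

The downside of this design is the one noted above: real failures of the inventory association are suppressed together with the spurious ones.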