Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition exists on rapid delete/create of Iamroles, or when IAM Role already exists in AWS #83

Open
diranged opened this issue Jul 12, 2021 · 0 comments

Comments

@diranged
Copy link
Contributor

Is this a BUG REPORT or FEATURE REQUEST?:

This is a bug report, with a fix at #82.

What happened:

We began running into a situation where our ServiceAccount annotations would be set with eks.amazonaws.com/role-arn: "". After digging into the code, we discovered that the default case handler in the controller has this if statement that is problematic. I think its intent was to handle an intermittent failure in the READY handler, where the ServiceAccount was already setup properly... but what it fails to do is handle the situation where the IAMRole already exists.

Here is the rundown of events that can cause this failure:

  1. A new Iamrole is created.. and the controller reconciliation starts.
  2. The r.IAMClient.CreateRole function is called.
    1. The function sets roleAlreadyExists=true and logs the error.
    2. Because roleAlreadyExists, the code then skips populating the IAMRoleResponse object.
  3. The handler checks if resp.RoleARN is set. It is not because we got an empty response object.
  4. The handler tries to use the previously known good state in iamRole.Status.RoleARN ... but remember, this is a newly created Iamrole resource, so that is empty too.
  5. The ServiceAccount is populated with an empty string annotation.

What you expected to happen:

I expect the code to realize that the IAM Role in AWS is indeed the one we are trying to use (likely previously created but perhaps the reconciliation loop did not complete, OR a fast create/delete was called on the Iamrole resource by a tool like ArgoCD, or any other number of intermittent situations). Either way, i expect eventual consistency to work out the issue.

How to reproduce it (as minimally and precisely as possible):

Pre-create an IAM Role... then try creating an Iamrole resource.

Anything else we need to know?:

While digging into this, I also noticed that there is no reconciliation that happens for the ServiceAccount annotation. I have a second PR (prepped at Nextdoor#4) that significantly revamps the k8s/rbac.go package to be more testable, and implements regular reconciliation of the ServiceAccount annotation. We've been running the combination of these two in production now for 2 weeks and we've had great success.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant