You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We began running into a situation where our ServiceAccount annotations would be set with eks.amazonaws.com/role-arn: "". After digging into the code, we discovered that the default case handler in the controller has this if statement that is problematic. I think its intent was to handle an intermittent failure in the READY handler, where the ServiceAccount was already setup properly... but what it fails to do is handle the situation where the IAMRole already exists.
Here is the rundown of events that can cause this failure:
A new Iamrole is created.. and the controller reconciliation starts.
The handler checks if resp.RoleARN is set. It is not because we got an empty response object.
The handler tries to use the previously known good state in iamRole.Status.RoleARN ... but remember, this is a newly createdIamrole resource, so that is empty too.
I expect the code to realize that the IAM Role in AWS is indeed the one we are trying to use (likely previously created but perhaps the reconciliation loop did not complete, OR a fast create/delete was called on the Iamrole resource by a tool like ArgoCD, or any other number of intermittent situations). Either way, i expect eventual consistency to work out the issue.
How to reproduce it (as minimally and precisely as possible):
Pre-create an IAM Role... then try creating an Iamrole resource.
Anything else we need to know?:
While digging into this, I also noticed that there is no reconciliation that happens for the ServiceAccount annotation. I have a second PR (prepped at Nextdoor#4) that significantly revamps the k8s/rbac.go package to be more testable, and implements regular reconciliation of the ServiceAccount annotation. We've been running the combination of these two in production now for 2 weeks and we've had great success.
The text was updated successfully, but these errors were encountered:
Is this a BUG REPORT or FEATURE REQUEST?:
This is a bug report, with a fix at #82.
What happened:
We began running into a situation where our
ServiceAccount
annotations would be set witheks.amazonaws.com/role-arn: ""
. After digging into the code, we discovered that thedefault
case handler in the controller has thisif
statement that is problematic. I think its intent was to handle an intermittent failure in theREADY
handler, where theServiceAccount
was already setup properly... but what it fails to do is handle the situation where the IAMRole already exists.Here is the rundown of events that can cause this failure:
Iamrole
is created.. and the controller reconciliation starts.r.IAMClient.CreateRole
function is called.roleAlreadyExists=true
and logs the error.roleAlreadyExists
, the code then skips populating theIAMRoleResponse
object.resp.RoleARN
is set. It is not because we got an empty response object.iamRole.Status.RoleARN
... but remember, this is a newly createdIamrole
resource, so that is empty too.ServiceAccount
is populated with an empty string annotation.What you expected to happen:
I expect the code to realize that the IAM Role in AWS is indeed the one we are trying to use (likely previously created but perhaps the reconciliation loop did not complete, OR a fast create/delete was called on the Iamrole resource by a tool like ArgoCD, or any other number of intermittent situations). Either way, i expect eventual consistency to work out the issue.
How to reproduce it (as minimally and precisely as possible):
Pre-create an IAM Role... then try creating an Iamrole resource.
Anything else we need to know?:
While digging into this, I also noticed that there is no reconciliation that happens for the
ServiceAccount
annotation. I have a second PR (prepped at Nextdoor#4) that significantly revamps thek8s/rbac.go
package to be more testable, and implements regular reconciliation of theServiceAccount
annotation. We've been running the combination of these two in production now for 2 weeks and we've had great success.The text was updated successfully, but these errors were encountered: