Support IAM Roles for Service Accounts #947

Open · SapientGuardian opened this issue Sep 20, 2019 · 21 comments
@SapientGuardian commented Sep 20, 2019

Amazon recently released support for assigning IAM roles to Service Accounts (https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-service-accounts/).

To take advantage of this, the AWS SDK needs to be upgraded (the Go SDK must be at least version 1.23.13), and we need a way to indicate the IAM role to use, so it can be added as an annotation to the service account(s) used by the operator.

Assigning IAM roles through this method is easier and more secure than keeping long-lived IAM users around, which have fixed credentials that [ideally] need to be rotated periodically.

@chancez (Contributor) commented Sep 20, 2019

This seems reasonable. Sounds like we just need to upgrade the dependency then. I can take a look at this next week.

@SapientGuardian (Author) commented Oct 12, 2019

I started taking a look at this myself and ended up down the rabbit hole. Upgrading the AWS SDK reference for operator-metering itself was straightforward, and I started things up using the custom Docker image I built with that. It wasn't clear from the documentation exactly which pods need access to AWS resources, but hive-server seemed to be one of them, as its logs reported access denied when trying to access the S3 bucket.

Looking at the code for Hive, it appears to leverage Hadoop's s3a plugin to connect to S3 and doesn't itself have the AWS SDK as a dependency. So I looked at the source for s3a, and it appears to just grab the latest AWS SDK when it is built with the rest of Hadoop, which is great. Unfortunately, it seems operator-metering uses a fairly ancient fork of Hive (https://github.com/operator-framework/hive), 3804 commits behind mainline; its Hadoop dependency is years old, and by extension, so is its AWS SDK.

Does that all sound about right? I'm really eager to try this project, but because our corporate security policy doesn't allow IAM users (only assumed IAM roles and instance profiles) I'm not able to.

@chancez (Contributor) commented Oct 14, 2019

So there are a few things that will talk to AWS: the reporting-operator, Presto, and Hive (both the metastore and the server).

Hive has multiple versions; version 3 hasn't been "stable" that long, and most distributions of Hive are based on Hive 2.3.x, which is still maintained. We're not that far behind there either: we're at 2.3.3, the latest is 2.3.6, and we follow the releases if there are fixes critical to our project. Presto, until recently, hasn't really worked with Hive 3.x at all, so we've been stuck on Hive 2.3.x.

S3A comes from the core hadoop-common libraries, which come from the Hadoop installation, not from Hive. Similarly, most people are using Hadoop 2.x, but we're actually on Hadoop 3.1.1: https://github.com/operator-framework/hadoop.

I would expect all of these libraries to behave like the usual AWS credential chains, trying whichever credentials are available. My only thought is that we may be setting the AWS_* environment variables unconditionally, perhaps to empty values, which could be an issue if that's the case. I know that in our earlier days quite a few people had success with the kube2iam project, so these things work to some degree, but the service account stuff is new, so we may need to upgrade some more things.
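(To check the env var theory, something along these lines would show what the pod actually has set; I'm assuming a namespace named metering and the hive-server-0 pod name here:

  kubectl -n metering exec hive-server-0 -- env | grep '^AWS_'

If AWS_ACCESS_KEY_ID shows up empty, that would explain the chain stopping early.)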

@chancez (Contributor) commented Oct 14, 2019

For what it's worth, the version of the AWS java SDK is here: https://github.com/operator-framework/hadoop/blob/master/hadoop-project/pom.xml#L137

<aws-java-sdk.version>1.11.271</aws-java-sdk.version>

@SapientGuardian (Author):

Thanks for clarifying which services need to be targeted. For the Java v1 SDK, 1.11.623 is the minimum version that supports the new authentication mechanism. I started by making a simple Docker image like this:

FROM quay.io/openshift/origin-metering-hive:latest
RUN rm /opt/hive/lib/aws-java-sdk-bundle-*
COPY aws-java-sdk-bundle-1.11.651.jar /opt/hive/lib/

I was hoping I could get lucky and there wouldn't be any breaking changes. It didn't actually explode, but S3 still fails to authenticate. I'm currently rebuilding the Hive image with debug logging turned on for s3a (it didn't look like there was any easier way to do that, but please let me know if there is).

It's interesting that you say there has been success with kube2iam; we do actually use that, though we're trying to retire it in favor of the new service account mechanism. To get the project to deploy at all, I had to remove secretName as a required field from the s3 storage configuration. Once it deployed, I annotated the hive serviceaccount and deleted the hive-server-0 pod. I confirmed that when it came back up, the expected env var and volume mount were present, but I'll double-check that there aren't any empty vars throwing things off.
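For reference, the logging change I'm attempting is just raising the s3a logger to DEBUG. Assuming the image uses a classic log4j properties file, it's a single line:

  log4j.logger.org.apache.hadoop.fs.s3a=DEBUG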

@chancez (Contributor) commented Oct 14, 2019

The files in /opt/hive/lib/ are symlinks, though I don't think the symlinks are even necessary anymore. The actual files live in ${HADOOP_HOME}/share/hadoop/tools/lib/, among many others. Perhaps:

FROM quay.io/openshift/origin-metering-hive:latest
RUN rm ${HADOOP_HOME}/share/hadoop/tools/lib/aws-java-sdk-bundle-*
COPY aws-java-sdk-bundle-1.11.651.jar ${HADOOP_HOME}/share/hadoop/tools/lib/

@SapientGuardian (Author) commented Oct 14, 2019

OK, I've put the new jar there (as well as in /opt/hive/lib, just to be sure, since the symlink gets broken when I remove the file) and confirmed that it's on the classpath (according to the log printed at startup). I also confirmed that there are no empty or otherwise conflicting env vars that might throw off the AWS SDK. Unfortunately, it still isn't authenticating, and I can't think of a way to figure out which creds (if any) it is trying to use without the app itself logging that.

I just got the Hive image rebuilt with debug logging (if I did it right); we'll see how that goes. Edit: Poorly. My image seems to be defective: it doesn't manage to start the Hive server at all, it just logs

/hadoop-config/hdfs-site.xml doesnt exist, skipping symlink
: No such file or directory

I'll dig into that tomorrow.

@chancez (Contributor) commented Oct 14, 2019 via email

@SapientGuardian (Author):

I figured out what's going on here.

First, I rebuilt the ansible-operator image using a modified hive-configmap.yaml that enabled debug logging for s3a. This confirmed that s3a was using only BasicAWSCredentialsProvider, EnvironmentVariableCredentialsProvider, and InstanceProfileCredentialsProvider, not the new WebIdentityTokenCredentialsProvider.

Then I edited _hadoop_config_helpers.tpl to force the use of the new provider, by setting

  <property>
      <name>fs.s3a.aws.credentials.provider</name>
      <value>com.amazonaws.auth.WebIdentityTokenCredentialsProvider</value>
  </property>

which then caused it to throw the following stack trace:

Caused by: java.io.IOException: com.amazonaws.auth.WebIdentityTokenCredentialsProvider constructor exception.  A class specified in fs.s3a.aws.credentials.provider must provide a public constructor accepting Configuration, or a public factory method named getInstance that accepts no arguments, or a public default constructor.
        at org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProvider(S3AUtils.java:661) ~[hadoop-aws-3.1.1.jar:?]
        at org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:566) ~[hadoop-aws-3.1.1.jar:?]

And sure enough, WebIdentityTokenCredentialsProvider offers neither of those. Looking at how it's used in the AWS SDK (https://github.com/aws/aws-sdk-java/blob/53683b836dec55226f45060dd5af3f00259650c3/aws-java-sdk-core/src/main/java/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.java), you can see that it uses .create() rather than .getInstance().

So I guess that puts me back to editing the Hadoop fork to use the new provider from the new SDK.

@chancez (Contributor) commented Oct 15, 2019

Looking at it, the default provider chain should make it so that WebIdentityTokenCredentialsProvider is used when creds aren't specified in the env, Java system properties, or the AWS creds file. The fact that the web identity creds provider uses create() makes it unusable for passing to fs.s3a.aws.credentials.provider, but DefaultAWSCredentialsProviderChain does implement getInstance(), and it does include WebIdentityTokenCredentialsProvider as part of the chain.

So if you leave fs.s3a.aws.credentials.provider unspecified, or set it to com.amazonaws.auth.DefaultAWSCredentialsProviderChain, it should go through the list and eventually land on WebIdentityTokenCredentialsProvider. Perhaps the env vars are being used when set, even if empty, preventing it from falling through to the next provider.

Either way, WebIdentityTokenCredentialsProvider should probably also have a default constructor so you can specify it directly.
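To make that concrete, here's roughly the shape a provider needs for s3a's reflection to accept it. This is an illustrative sketch only, not something that exists in the SDK today:

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSCredentialsProvider;
import com.amazonaws.auth.WebIdentityTokenCredentialsProvider;

// Hypothetical wrapper: the public default constructor satisfies s3a's
// reflection requirements, and everything else delegates to the SDK's
// factory method.
public class WebIdentityTokenCredentialsProviderWrapper implements AWSCredentialsProvider {
    private final AWSCredentialsProvider delegate =
        WebIdentityTokenCredentialsProvider.create();

    public WebIdentityTokenCredentialsProviderWrapper() {}

    @Override
    public AWSCredentials getCredentials() {
        return delegate.getCredentials();
    }

    @Override
    public void refresh() {
        delegate.refresh();
    }
}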

@SapientGuardian (Author):

DefaultAWSCredentialsProviderChain does indeed fall back to WebIdentityTokenCredentialsProvider, but Hadoop/s3a doesn't use DefaultAWSCredentialsProviderChain. It uses its own list (https://github.com/operator-framework/hadoop/blob/master/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/AWSCredentialProviderList.java) and populates it here: https://github.com/operator-framework/hadoop/blob/master/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AUtils.java#L558

I'm working on rebuilding with credentials.add(WebIdentityTokenCredentialsProvider.create()); added in there.
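That is, roughly this, going by the provider list the debug log showed earlier (the surrounding lines are my approximation of the linked code, not an exact copy):

// In S3AUtils, where the default credential provider list is built:
credentials.add(new BasicAWSCredentialsProvider(accessKey, secretKey));
credentials.add(new EnvironmentVariableCredentialsProvider());
credentials.add(WebIdentityTokenCredentialsProvider.create()); // the added line
credentials.add(InstanceProfileCredentialsProvider.getInstance());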

@chancez (Contributor) commented Oct 15, 2019

Ah, I see. I can look at making a PR to add a default constructor to this type in the aws-java-sdk, since it seems reasonably simple.

@chancez (Contributor) commented Oct 15, 2019

aws/aws-sdk-java#2121. With this, if we update the AWS SDK to pick up the changes, I think WebIdentityTokenCredentialsProvider should just work when specified in fs.s3a.aws.credentials.provider.

@SapientGuardian (Author):

Cool, that ought to work. Though since we'd still need to rebuild Hadoop to pick up the new SDK, it probably makes sense to add it to the default AWSCredentialProviderList anyway (ideally that would happen in the upstream Hadoop repo and be backported, of course).

@chancez (Contributor) commented Oct 15, 2019

You shouldn't need to rebuild Hadoop; you should just be able to update the AWS SDK jar.

@SapientGuardian (Author):

Right - patching over the docker image like I've been doing.

@SapientGuardian (Author):

I managed to get everything working! Here's what I did:

  • Hadoop
    ○ Rebuilt with the latest SDK and added credentials.add(WebIdentityTokenCredentialsProvider.create()); to S3AUtils.java (should be avoidable once aws/aws-sdk-java#2121 is accepted and released, with configuration in metering-ansible-operator)
  • Hive
    ○ Rebuilt with the new Hadoop (likewise avoidable once aws/aws-sdk-java#2121 is accepted and released)
  • Metering-ansible-operator
    ○ Modified the templates for the hive, presto, and reporting-operator serviceAccounts to carry the IAM annotation (long term this should be configurable from the main repo, maybe in metering-custom.yaml)
    ○ Modified presto-common-config.yaml to set hive.s3.use-instance-credentials=false (see the snippet after this list)
  • Presto
    ○ Rebuilt the presto-hive plugin with the latest SDK and added WebIdentityTokenCredentialsProvider.create(); to PrestoS3ClientFactory.java and PrestoS3FileSystem.java (should be avoidable if https://github.com/prestosql/presto/pull/741/files gets pulled downstream)
    ○ Modified PrestoS3FileSystem.java to not require that either the IAM role or useInstanceCredentials be set.
    ○ Patched the Docker image to remove the old AWS SDK and presto-hive plugin, replacing them with the ones I built (docker build seems to just download the plugins from Maven rather than build the ones in the repo?)
  • Reporting-operator
    ○ Rebuilt the reporting operator using the latest SDK (dep ensure -update github.com/aws/aws-sdk-go)
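For reference, the presto change is just this property in the hive catalog config rendered by presto-common-config.yaml:

  hive.s3.use-instance-credentials=false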

With all that in place, I was able to generate the sample namespace-cpu-request report from the documentation.

Thanks for your help with this. I'm looking forward to "proper" support for the new creds so I don't have to run custom Docker images, but this is good enough for now.

@chancez (Contributor) commented Oct 16, 2019

○ Modified templates for hive, presto, and reporting-operator serviceAccounts to have the IAM annotation (long term should be configurable from the main repo, maybe in metering-custom.yaml)

We already support annotations on every component that I'm aware of.

○ Modified presto-common-config.yaml to set hive.s3.use-instance-credentials=false

We can probably add this as an option you can set in the MeteringConfig.

trinodb/trino#741

This was in Presto 316, and we'll be updating to a newer Presto pretty soon; we're on 311 now.

○ Rebuilt the reporting operator using the latest SDK (dep ensure -update github.com/aws/aws-sdk-go)

I can take a look at getting this done too.

Overall this seems reasonable. I think once the aws-java-sdk change is merged we'll need to bump the version in each of our pom.xml files to get the latest, but I could be wrong; I'm not sure if they're just constraints or pinned versions.
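For hadoop that would mean bumping the property linked earlier to whatever release includes the fix, e.g. (1.11.651 here is just the version used earlier in this thread):

  <aws-java-sdk.version>1.11.651</aws-java-sdk.version>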

@SapientGuardian (Author):

Could you show me an example of how to add an annotation at "deployment" time? The only avenue for customization I saw was metering-custom.yaml, and it wasn't obvious how I would put an annotation there that would get applied to a ServiceAccount.

I've had to do one more thing for the reporting operator, which I didn't realize until I tried to use the AWS billing support: I needed to add fsGroup: 3001 to its securityContext, because the projected token is by default only accessible to root, but the container runs as a different user (see kubernetes-sigs/external-dns#1185 for discussion).
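For reference, the change has this shape (a sketch of the pod spec; 3001 matches the group the reporting-operator container runs as in my cluster):

  spec:
    securityContext:
      fsGroup: 3001   # makes the projected token volume group-readable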

@chancez (Contributor) commented Oct 17, 2019

Hmm, we don't yet support annotations on serviceAccounts, mostly just on the pods, but we could add that. What do you need it for, and what would an example annotation be?

@SapientGuardian (Author):

The service account is bound to the IAM role via annotation, and it is from there that the token gets projected into the pod.

The full technical details are in the link in the first post of this issue, but in short: when a pod is created, it gets intercepted by a mutating webhook (https://github.com/aws/amazon-eks-pod-identity-webhook) that inspects the eks.amazonaws.com/role-arn annotation on the Service Account the pod is bound to. It does the security-token dance, mounts a token as a volume in the pod, and adds two environment variables.

Here's an example annotation applied to a ServiceAccount:

  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/eksctl-irptest-addon-iamsa-default-my-serviceaccount-Role1-UCGG6NDYZ3UE
