
[k8s] Allow per-context identity configuration + in-cluster auth fixes #4136

Open
wants to merge 2 commits into master
Conversation

romilbhardwaj
Collaborator

Closes #4131. Also updates how in-cluster auth/context detection works. Instead of using in-cluster, we now pass the context name so that the remote_identity specified in config.yaml can be parsed correctly.
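For illustration, the per-context identity configuration this PR enables could look roughly like the following in config.yaml (context and service account names here are placeholders; check the released schema for the exact field shape):

```yaml
kubernetes:
  allowed_contexts:
    - my-local-context
    - gke_my-project_us-central1-c_my-cluster
  # remote_identity as a dict maps each context to the service account
  # the controller pods should use in that context.
  remote_identity:
    my-local-context: my-service-account-1
    gke_my-project_us-central1-c_my-cluster: my-service-account-2
```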

Tested:

  • Manual tests with sky jobs launch on a sky local up and GKE cluster, with both clusters configured in allowed_contexts.

@romilbhardwaj romilbhardwaj added this to the v0.7 milestone Oct 22, 2024
Collaborator

@Michaelvll Michaelvll left a comment


Thanks for adding this @romilbhardwaj! I am a bit concerned about passing around the context name just to look up the remote identity. Just left a comment :)

Comment on lines +185 to +187
in_cluster_region = (
    kubernetes_utils.get_in_cluster_context_name())
regions.append(clouds.Region(in_cluster_region))
Collaborator


Do we need to update the comments above?

Collaborator Author


Thanks, updated!

Comment on lines +373 to +379
{% if k8s_env_vars is not none %}
env:
  {% for key, value in k8s_env_vars.items() %}
  - name: {{ key }}
    value: {{ value }}
  {% endfor %}
{% endif %}
Collaborator


Will this be overwritten by a user specifying env in their pod_config in config.yaml?

Collaborator Author


No, the dictionaries will be merged. E.g., with config.yaml like:

kubernetes:
  pod_config:
    spec:
      containers:
        - env:
            - name: MY_ENV_VAR
              value: "my_value"

The jobs controller pod then has:

          env:
          - name: SKYPILOT_IN_CLUSTER_CONTEXT_NAME
            value: gke_sky-dev-465_us-central1-c_gkeusc6
          - name: MY_ENV_VAR
            value: my_value
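The merge behavior described above can be sketched in plain Python. This is a simplified illustration, not SkyPilot's actual merge code (SkyPilot deep-merges the whole pod spec); it only shows the effect on the env list: SkyPilot's injected entry is kept and the user's entries are appended.

```python
def merge_env_lists(template_env, user_env):
    """Merge two Kubernetes-style env lists ([{'name': ..., 'value': ...}]).

    Entries from the template are kept; user entries are appended unless a
    variable of the same name is already present in the template.
    """
    merged = list(template_env)
    existing = {entry['name'] for entry in merged}
    for entry in user_env:
        if entry['name'] not in existing:
            merged.append(entry)
    return merged


# Mirrors the example above: SkyPilot's injected var plus the user's var.
template_env = [{'name': 'SKYPILOT_IN_CLUSTER_CONTEXT_NAME',
                 'value': 'gke_sky-dev-465_us-central1-c_gkeusc6'}]
user_env = [{'name': 'MY_ENV_VAR', 'value': 'my_value'}]
print(merge_env_lists(template_env, user_env))
```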

Comment on lines +381 to +388
if isinstance(remote_identity, dict):
    # If remote_identity is a dict, use the service account for the
    # current context
    k8s_service_account_name = remote_identity.get(context, None)
    if k8s_service_account_name is None:
        err_msg = (f'Context {context!r} not found in '
                   'remote identities from config.yaml')
        raise ValueError(err_msg)
Collaborator


Avoid the stacktrace with ux_utils?

Will this work with failover? E.g., if an invalid remote_identity is specified, will it cause SkyPilot to fail to fail over to other clouds?

Collaborator


If we enforce that any context used must exist in the dict, should we add this to the config precheck, e.g., in schema.py?

Collaborator Author


Ah, good catch. Added ux_utils to hide the stack trace.

Currently this doesn't work with failover, but maybe that's desirable? This is not a failure of the cloud, but rather a configuration error on the user's end, so it may be better to surface it instead of silently failing over. Let me know if you think otherwise; I can move this further downstream in the code so it becomes part of failover.

should we add this to precheck for config, e.g. in the schema.py?

That might be tricky, since schema.py only supports simple JSON schema validation. We can consider adding it to skypilot_config.py::_try_load_config(); note that might require dynamically importing the k8s module if this field is specified.
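If such a precheck were added, its core could be as small as the sketch below. This is a hypothetical helper, not existing SkyPilot code; obtaining the list of available contexts is the part that would need the dynamically imported k8s module mentioned above.

```python
def check_remote_identity_contexts(remote_identity, available_contexts):
    """Return remote_identity keys that do not match any known context.

    remote_identity may be a plain string (one identity for all contexts)
    or a dict mapping context name -> service account name. An empty
    result means the configuration passes this check.
    """
    if not isinstance(remote_identity, dict):
        # A plain string applies everywhere; nothing to cross-check.
        return []
    return sorted(set(remote_identity) - set(available_contexts))


# Example: 'ctx-b' is configured but not present in kubeconfig.
bad = check_remote_identity_contexts(
    {'ctx-a': 'sa-1', 'ctx-b': 'sa-2'}, ['ctx-a'])
print(bad)  # ['ctx-b']
```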

Comment on lines +427 to +432
# Set environment variables for the pod. Note that SkyPilot env vars
# are set separately when the task is run. These env vars are
# independent of the SkyPilot task to be run.
k8s_env_vars = {
    kubernetes_utils.IN_CLUSTER_CONTEXT_NAME_ENV_VAR: context
}
Collaborator


I am not sure about using this env var to pass around the context just for looking up the context in the remote_identity dict (correct me if I am wrong).

If this is mainly for looking up a remote identity for the jobs/serve controller, it might be better to just replace the key in remote_identity with in-cluster instead, like what we did here:

def replace_skypilot_config_path_in_file_mounts(
        cloud: 'clouds.Cloud', file_mounts: Optional[Dict[str, str]]):
    """Replaces the SkyPilot config path in file mounts with the real path."""
    # TODO(zhwu): This function can be moved to `backend_utils` once we have
    # more predefined file mounts that needs to be replaced after the cluster
    # is provisioned, e.g., we may need to decide which cloud to create a bucket
    # to be mounted to the cluster based on the cloud the cluster is actually
    # launched on (after failover).
    if file_mounts is None:
        return
    replaced = False
    for remote_path, local_path in list(file_mounts.items()):
        if local_path is None:
            del file_mounts[remote_path]
            continue
        if local_path.endswith(_LOCAL_SKYPILOT_CONFIG_PATH_SUFFIX):
            with tempfile.NamedTemporaryFile('w', delete=False) as f:
                user_config = common_utils.read_yaml(local_path)
                config = _setup_proxy_command_on_controller(cloud, user_config)
                common_utils.dump_yaml(f.name, dict(**config))
            file_mounts[remote_path] = f.name
            replaced = True
    if replaced:
        logger.debug(f'Replaced {_LOCAL_SKYPILOT_CONFIG_PATH_SUFFIX} '
                     f'with the real path in file mounts: {file_mounts}')

Collaborator Author


Other than looking up remote_identity, this env var is also required to maintain parity with the region<->context mapping that we have created for SkyPilot.

E.g., if a user defines a YAML like:

resources:
  cloud: kubernetes
  region: gke_sky-dev-465_us-central1-c_gkeusc6

This YAML does not work with sky jobs launch on current master:

(sky-3ff3-romilb, pid=1481) ValueError: Context gke_sky-dev-465_us-central1-c_gkeusc6 not found in kubeconfig. Kubernetes only supports context names as regions. Available contexts: ['in-cluster']

With this PR, we can have region map to the right context consistently, even if it may be running in-cluster.
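The fallback behavior can be sketched as follows. This is a simplified stand-in modeled on kubernetes_utils.get_in_cluster_context_name, not the actual implementation; the env var name matches the one set in this PR, while the 'in-cluster' default mirrors the behavior on current master.

```python
import os

# Env var injected into controller pods by the launching client (see the
# k8s_env_vars change in this PR).
IN_CLUSTER_CONTEXT_NAME_ENV_VAR = 'SKYPILOT_IN_CLUSTER_CONTEXT_NAME'


def in_cluster_context_name():
    """Return the context name to use when running inside a pod.

    If the launching client recorded the original context name in the pod's
    environment, use it, so region names resolve consistently in-cluster;
    otherwise fall back to the generic 'in-cluster' placeholder.
    """
    return os.environ.get(IN_CLUSTER_CONTEXT_NAME_ENV_VAR, 'in-cluster')
```

With the env var set, a task pinning `region: gke_sky-dev-465_us-central1-c_gkeusc6` resolves to the same context name inside the controller pod as it does on the client.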

Development

Successfully merging this pull request may close these issues.

[k8s] Remote identity support when multiple contexts are configured
2 participants