[k8s] Allow per-context identity configuration + in-cluster auth fixes #4136

Open
wants to merge 3 commits into base: master
4 changes: 2 additions & 2 deletions sky/authentication.py
@@ -380,8 +380,8 @@ def setup_kubernetes_authentication(config: Dict[str, Any]) -> Dict[str, Any]:
secret_field_name = clouds.Kubernetes().ssh_key_secret_field_name
context = config['provider'].get(
'context', kubernetes_utils.get_current_kube_config_context_name())
if context == kubernetes_utils.IN_CLUSTER_REGION:
# If the context is set to IN_CLUSTER_REGION, we are running in a pod
if context == kubernetes_utils.get_in_cluster_context_name():
# If the context is an in-cluster context name, we are running in a pod
# with in-cluster configuration. We need to set the context to None
# to use the mounted service account.
context = None
36 changes: 29 additions & 7 deletions sky/clouds/kubernetes.py
@@ -182,8 +182,9 @@ def regions_with_offering(cls, instance_type: Optional[str],
if context is None:
# If running in-cluster, we allow the region to be set to the
# singleton region since there is no context name available.
regions.append(clouds.Region(
kubernetes_utils.IN_CLUSTER_REGION))
in_cluster_region = (
kubernetes_utils.get_in_cluster_context_name())
regions.append(clouds.Region(in_cluster_region))
romilbhardwaj marked this conversation as resolved.
else:
regions.append(clouds.Region(context))

@@ -376,20 +377,32 @@ def make_deploy_resources_variables(
remote_identity = skypilot_config.get_nested(
('kubernetes', 'remote_identity'),
schemas.get_default_remote_identity('kubernetes'))
if (remote_identity ==

if isinstance(remote_identity, dict):
# If remote_identity is a dict, use the service account for the
# current context
k8s_service_account_name = remote_identity.get(context, None)
if k8s_service_account_name is None:
err_msg = (f'Context {context!r} not found in '
'remote identities from config.yaml')
raise ValueError(err_msg)
Collaborator:

Avoid the stacktrace with ux_utils?

Will this work with failover? E.g., if an invalid remote_identity is specified, will it cause SkyPilot to fail to fail over to other clouds?

Collaborator:

If we enforce that any context used must exist in the dict, should we add a precheck for the config, e.g. in schema.py?

Collaborator Author:

Ah, good catch; added ux_utils to hide the stack trace.

Currently this doesn't work with failover, but maybe that's desirable? This failure is not a failure of the cloud, but rather a configuration failure on the user's end, so it may be better to surface it instead of silently failing over. Let me know if you think otherwise; I can move this further downstream in the code so it becomes part of failover.

> should we add this to precheck for config, e.g. in the schema.py?

That might be tricky, since schema.py only supports simple JSON schema validation. We can consider adding it to skypilot_config.py::_try_load_config(); note that might require importing the k8s module dynamically if this field is specified.

else:
# If remote_identity is not a dict, use it as the service account name.
k8s_service_account_name = remote_identity

if (k8s_service_account_name ==
schemas.RemoteIdentityOptions.LOCAL_CREDENTIALS.value):
# SA name doesn't matter since automounting credentials is disabled
k8s_service_account_name = 'default'
k8s_automount_sa_token = 'false'
elif (remote_identity ==
elif (k8s_service_account_name ==
schemas.RemoteIdentityOptions.SERVICE_ACCOUNT.value):
# Use the default service account
k8s_service_account_name = (
kubernetes_utils.DEFAULT_SERVICE_ACCOUNT_NAME)
k8s_automount_sa_token = 'true'
else:
# User specified a custom service account
k8s_service_account_name = remote_identity
k8s_automount_sa_token = 'true'
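
The branching above can be sketched as a standalone helper. This is a minimal sketch, assuming the RemoteIdentityOptions enum values are the strings 'LOCAL_CREDENTIALS' and 'SERVICE_ACCOUNT' (mirroring schemas.py); resolve_service_account is a hypothetical name, and it returns a bool for the automount flag rather than the 'true'/'false' strings used in the template variables:

```python
from typing import Dict, Tuple, Union

# Assumed mirrors of schemas.RemoteIdentityOptions values (hypothetical).
LOCAL_CREDENTIALS = 'LOCAL_CREDENTIALS'
SERVICE_ACCOUNT = 'SERVICE_ACCOUNT'
DEFAULT_SERVICE_ACCOUNT_NAME = 'skypilot-service-account'


def resolve_service_account(
        remote_identity: Union[str, Dict[str, str]],
        context: str) -> Tuple[str, bool]:
    """Resolve (service_account_name, automount_sa_token) for a context."""
    if isinstance(remote_identity, dict):
        # Per-context form: look up the service account for this context.
        name = remote_identity.get(context)
        if name is None:
            raise ValueError(f'Context {context!r} not found in '
                             'remote identities from config.yaml')
    else:
        # Global form: one value applies to every context.
        name = remote_identity
    if name == LOCAL_CREDENTIALS:
        # SA name does not matter; credential automounting is disabled.
        return 'default', False
    if name == SERVICE_ACCOUNT:
        return DEFAULT_SERVICE_ACCOUNT_NAME, True
    # Anything else is treated as a custom service account name.
    return name, True
```

For example, `resolve_service_account({'gke-ctx': 'my-sa'}, 'gke-ctx')` yields `('my-sa', True)`, while an unknown context raises the ValueError surfaced to the user.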

fuse_device_required = bool(resources.requires_fuse)
@@ -410,6 +423,14 @@ def make_deploy_resources_variables(
('kubernetes', 'provision_timeout'),
10,
override_configs=resources.cluster_config_overrides)

# Set environment variables for the pod. Note that SkyPilot env vars
# are set separately when the task is run. These env vars are
# independent of the SkyPilot task to be run.
k8s_env_vars = {
kubernetes_utils.IN_CLUSTER_CONTEXT_NAME_ENV_VAR: context
}
Comment on lines +429 to +434

Collaborator:

I am not so sure about using this env var to pass the context around just for looking up the context in the remote_identity dict (correct me if I am wrong).

If this is mainly for looking up a remote identity for the jobs/serve controller, it might be better to just replace the key in remote_identity with in-cluster instead, like we did here:

def replace_skypilot_config_path_in_file_mounts(
        cloud: 'clouds.Cloud', file_mounts: Optional[Dict[str, str]]):
    """Replaces the SkyPilot config path in file mounts with the real path."""
    # TODO(zhwu): This function can be moved to `backend_utils` once we have
    # more predefined file mounts that needs to be replaced after the cluster
    # is provisioned, e.g., we may need to decide which cloud to create a bucket
    # to be mounted to the cluster based on the cloud the cluster is actually
    # launched on (after failover).
    if file_mounts is None:
        return
    replaced = False
    for remote_path, local_path in list(file_mounts.items()):
        if local_path is None:
            del file_mounts[remote_path]
            continue
        if local_path.endswith(_LOCAL_SKYPILOT_CONFIG_PATH_SUFFIX):
            with tempfile.NamedTemporaryFile('w', delete=False) as f:
                user_config = common_utils.read_yaml(local_path)
                config = _setup_proxy_command_on_controller(cloud, user_config)
                common_utils.dump_yaml(f.name, dict(**config))
            file_mounts[remote_path] = f.name
            replaced = True
    if replaced:
        logger.debug(f'Replaced {_LOCAL_SKYPILOT_CONFIG_PATH_SUFFIX} '
                     f'with the real path in file mounts: {file_mounts}')

Collaborator Author (romilbhardwaj, Oct 28, 2024):

Other than looking up remote_identity, this env var is also required to maintain parity with the region<->context mapping that we have created for SkyPilot.

E.g., if a user defines a YAML like:

resources:
  cloud: kubernetes
  region: gke_sky-dev-465_us-central1-c_gkeusc6

This YAML does not work with `sky jobs launch` on current master:

(sky-3ff3-romilb, pid=1481) ValueError: Context gke_sky-dev-465_us-central1-c_gkeusc6 not found in kubeconfig. Kubernetes only supports context names as regions. Available contexts: ['in-cluster']

With this PR, we can have region map to the right context consistently, even if it may be running in-cluster.

This mapping is also needed in #4188 to allow our API server to use the context name instead of in-cluster when running in a Kubernetes cluster.


deploy_vars = {
'instance_type': resources.instance_type,
'custom_resources': custom_resources,
@@ -431,6 +452,7 @@ def make_deploy_resources_variables(
'k8s_skypilot_system_namespace': _SKYPILOT_SYSTEM_NAMESPACE,
'k8s_spot_label_key': spot_label_key,
'k8s_spot_label_value': spot_label_value,
'k8s_env_vars': k8s_env_vars,
'image_id': image_id,
}

@@ -566,7 +588,7 @@ def validate_region_zone(self, region: Optional[str], zone: Optional[str]):
# TODO: Remove this after 0.9.0.
return region, zone

if region == kubernetes_utils.IN_CLUSTER_REGION:
if region == kubernetes_utils.get_in_cluster_context_name():
# If running in-cluster, keep the in-cluster context name as the
# region since there is no kubeconfig context name available.
return region, zone
@@ -575,7 +597,7 @@ def validate_region_zone(self, region: Optional[str], zone: Optional[str]):
if all_contexts == [None]:
# If [None] context is returned, use the singleton region since we
# are running in a pod with in-cluster auth.
all_contexts = [kubernetes_utils.IN_CLUSTER_REGION]
all_contexts = [kubernetes_utils.get_in_cluster_context_name()]
if region not in all_contexts:
raise ValueError(
f'Context {region} not found in kubeconfig. Kubernetes only '
24 changes: 20 additions & 4 deletions sky/provision/kubernetes/utils.py
@@ -36,7 +36,13 @@

# TODO(romilb): Move constants to constants.py
DEFAULT_NAMESPACE = 'default'
IN_CLUSTER_REGION = 'in-cluster'
DEFAULT_IN_CLUSTER_REGION = 'in-cluster'

# The name for the environment variable that stores the in-cluster context name
# for Kubernetes clusters. This is used to associate a name with the current
# context when running with in-cluster auth. If not set, the context name is
# set to DEFAULT_IN_CLUSTER_REGION.
IN_CLUSTER_CONTEXT_NAME_ENV_VAR = 'SKYPILOT_IN_CLUSTER_CONTEXT_NAME'

DEFAULT_SERVICE_ACCOUNT_NAME = 'skypilot-service-account'

@@ -1996,9 +2002,9 @@ def set_autodown_annotations(handle: 'backends.CloudVmRayResourceHandle',
def get_context_from_config(provider_config: Dict[str, Any]) -> Optional[str]:
context = provider_config.get('context',
get_current_kube_config_context_name())
if context == IN_CLUSTER_REGION:
# If the context (also used as the region) is set to IN_CLUSTER_REGION
# we need to use in-cluster auth.
if context == get_in_cluster_context_name():
# If the context (also used as the region) is the in-cluster context
# name, we need to use in-cluster auth by setting the context to None.
context = None
return context

@@ -2136,3 +2142,13 @@ def process_skypilot_pods(
num_pods = len(cluster.pods)
cluster.resources_str = f'{num_pods}x {cluster.resources}'
return list(clusters.values()), jobs_controllers, serve_controllers


def get_in_cluster_context_name() -> Optional[str]:
"""Returns the name of the in-cluster context from the environment.

If the environment variable is not set, returns the default in-cluster
context name.
"""
return (os.environ.get(IN_CLUSTER_CONTEXT_NAME_ENV_VAR) or
DEFAULT_IN_CLUSTER_REGION)
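
The fallback behavior of get_in_cluster_context_name can be exercised in isolation. The snippet below is a self-contained sketch mirroring the function and constants above; 'my-gke-context' is a placeholder context name:

```python
import os

IN_CLUSTER_CONTEXT_NAME_ENV_VAR = 'SKYPILOT_IN_CLUSTER_CONTEXT_NAME'
DEFAULT_IN_CLUSTER_REGION = 'in-cluster'


def get_in_cluster_context_name() -> str:
    # Fall back to the singleton region name when the env var is unset/empty.
    return (os.environ.get(IN_CLUSTER_CONTEXT_NAME_ENV_VAR) or
            DEFAULT_IN_CLUSTER_REGION)


# Without the env var, the default singleton region is returned.
os.environ.pop(IN_CLUSTER_CONTEXT_NAME_ENV_VAR, None)
assert get_in_cluster_context_name() == 'in-cluster'

# With the env var set (as done via k8s_env_vars in the pod spec), the
# original context name is recovered inside the pod.
os.environ[IN_CLUSTER_CONTEXT_NAME_ENV_VAR] = 'my-gke-context'
assert get_in_cluster_context_name() == 'my-gke-context'
```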
8 changes: 8 additions & 0 deletions sky/templates/kubernetes-ray.yml.j2
@@ -370,6 +370,14 @@ available_node_types:
}
trap : TERM INT; log_tail || sleep infinity & wait

{% if k8s_env_vars is not none %}
env:
{% for key, value in k8s_env_vars.items() %}
- name: {{ key }}
value: {{ value }}
{% endfor %}
{% endif %}
Comment on lines +373 to +379

Collaborator:

Will this be overwritten by a user specifying env in their pod_config in config.yaml?

Collaborator Author:

No, the dictionaries will be merged. E.g., with config.yaml like:

kubernetes:
  pod_config:
    spec:
      containers:
        - env:
            - name: MY_ENV_VAR
              value: "my_value"

Jobs controller pod has:

          env:
          - name: SKYPILOT_IN_CLUSTER_CONTEXT_NAME
            value: gke_sky-dev-465_us-central1-c_gkeusc6
          - name: MY_ENV_VAR
            value: my_value


ports:
- containerPort: 22 # Used for SSH
- containerPort: {{ray_port}} # Redis port
9 changes: 8 additions & 1 deletion sky/utils/schemas.py
@@ -668,7 +668,14 @@ def get_default_remote_identity(cloud: str) -> str:

_REMOTE_IDENTITY_SCHEMA_KUBERNETES = {
'remote_identity': {
'type': 'string'
'anyOf': [{
'type': 'string'
}, {
'type': 'object',
'additionalProperties': {
'type': 'string'
}
}]
},
}
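
The anyOf above accepts either the legacy global string or a per-context mapping. Below is a minimal hand-rolled check mirroring the schema; validate_remote_identity is a hypothetical helper for illustration, not part of the PR:

```python
from typing import Any


def validate_remote_identity(value: Any) -> bool:
    """Mirror of the anyOf schema: a plain string, or an object whose
    values (service account names keyed by context name) are all strings."""
    if isinstance(value, str):
        return True
    if isinstance(value, dict):
        return all(isinstance(k, str) and isinstance(v, str)
                   for k, v in value.items())
    return False


# Legacy global form still validates:
assert validate_remote_identity('SERVICE_ACCOUNT')
# New per-context form:
assert validate_remote_identity({
    'my-gke-context': 'my-sa',
    'in-cluster': 'skypilot-service-account',
})
# Anything else is rejected:
assert not validate_remote_identity(42)
```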
