[BUG][AWS] Serving a new controller can result in terminating previously served models. #4143

Open
JGSweets opened this issue Oct 22, 2024 · 2 comments

Comments

@JGSweets
Contributor

JGSweets commented Oct 22, 2024

When both of the following conditions hold, a new controller may terminate existing resources being served by another controller.

  • The user ID hashes share the same last 4 characters but differ otherwise.
  • The service names are identical.

This may result in a new model node's cluster_name being identical to that of an existing model node, depending on the version.
e.g.


Existing controller --
USER_ID_HASH=12345678
Controller node: sky-sky-serve-controller-12345678-5678-head
model node: sky-<SERVICE_NAME>-<VERSION>-5678-head


New Controller --
USER_ID_HASH=11115678
Controller node: sky-sky-serve-controller-11115678-5678-head
model node: sky-<SERVICE_NAME>-<VERSION>-5678-head


In this case, if the <VERSION> matches, the existing model node may get terminated.
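The collision above can be reproduced with a small sketch. The two helper functions are hypothetical: they only mirror the naming pattern shown in the example (full hash in the controller name, last 4 characters as suffix) and are not SkyPilot's actual naming code. The service name and version values are made up for illustration.

```python
def controller_node_name(user_id_hash: str) -> str:
    """Hypothetical helper mirroring the controller name pattern above."""
    return f'sky-sky-serve-controller-{user_id_hash}-{user_id_hash[-4:]}-head'


def model_node_name(service_name: str, version: int, user_id_hash: str) -> str:
    """Hypothetical helper: only the last 4 hash characters reach the name."""
    return f'sky-{service_name}-{version}-{user_id_hash[-4:]}-head'


existing_hash = '12345678'
new_hash = '11115678'

# The controller names differ because they embed the full hash.
existing_controller = controller_node_name(existing_hash)
new_controller = controller_node_name(new_hash)

# The model node names collide when service name and version match,
# because only the shared 4-character suffix is used.
existing_model = model_node_name('my-service', 1, existing_hash)
new_model = model_node_name('my-service', 1, new_hash)

print(existing_controller != new_controller)
print(existing_model == new_model)
```

Under these assumptions the controllers stay distinct while the model nodes end up with the same name, which is the situation the issue describes.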

I believe this results from the filter used in terminate matching only on the cluster name tag, with nothing that identifies the owning controller:

return [{
    'Name': f'tag:{constants.TAG_RAY_CLUSTER_NAME}',
    'Values': [cluster_name_on_cloud],
}]

I'm not sure whether this is widespread across other deployment platforms.

In AWS, this could be resolved by adding a new tag, in addition to the name, that identifies the correct controller/cluster association, and by including that tag in the filter as well.

Possible tags:

  • USER_ID_HASH (easy fix)
  • UUID for every controller / model associated with that controller (needs to be stored in the db)
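A minimal sketch of the USER_ID_HASH option, assuming the tag key names below. `TAG_USER_ID_HASH` and both helper functions are hypothetical, and the value of `TAG_RAY_CLUSTER_NAME` is an assumption rather than a quote from SkyPilot. The `matches` helper imitates EC2 filter semantics locally (all filters must match, any listed value within a filter is accepted) so the example runs without AWS access.

```python
TAG_RAY_CLUSTER_NAME = 'ray-cluster-name'    # assumed value of the constant
TAG_USER_ID_HASH = 'skypilot-user-id-hash'   # hypothetical new tag key


def cluster_filters(cluster_name_on_cloud, user_id_hash=None):
    """Build EC2-style tag filters; optionally scope them to one user hash."""
    filters = [{
        'Name': f'tag:{TAG_RAY_CLUSTER_NAME}',
        'Values': [cluster_name_on_cloud],
    }]
    if user_id_hash is not None:
        filters.append({
            'Name': f'tag:{TAG_USER_ID_HASH}',
            'Values': [user_id_hash],
        })
    return filters


def matches(instance_tags, filters):
    """Local stand-in for EC2 filtering: every filter must match a tag."""
    for tag_filter in filters:
        key = tag_filter['Name'].split(':', 1)[1]
        if instance_tags.get(key) not in tag_filter['Values']:
            return False
    return True


name = 'sky-my-service-1-5678-head'
instances = {
    'existing': {TAG_RAY_CLUSTER_NAME: name, TAG_USER_ID_HASH: '12345678'},
    'new': {TAG_RAY_CLUSTER_NAME: name, TAG_USER_ID_HASH: '11115678'},
}

# Name-only filter: both controllers' model nodes are selected.
by_name = [k for k, tags in instances.items()
           if matches(tags, cluster_filters(name))]

# Name plus user hash: only the new controller's node is selected.
by_name_and_hash = [k for k, tags in instances.items()
                    if matches(tags, cluster_filters(name, '11115678'))]

print(by_name)
print(by_name_and_hash)
```

The extra tag would also have to be written onto instances at launch time for the filter to have anything to match; existing instances without the tag would no longer be selected by the stricter filter.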
@cblmemo
Collaborator

cblmemo commented Oct 22, 2024

Hi @JGSweets! Thanks for reporting this error. Just to make sure: is this for multiple users (on multiple laptops) running SkyPilot in a shared AWS project?

@JGSweets
Contributor Author

JGSweets commented Oct 22, 2024

> Hi @JGSweets! Thanks for reporting this error. Just to make sure: is this for multiple users (on multiple laptops) running SkyPilot in a shared AWS project?

@cblmemo Actually, this case would be for a single compute resource where SKYPILOT_USER_ID is set via environment variables. I would be inclined to believe that multiple laptops would have a similar effect.
