Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GCS FT] GCS FT misconfiguration #2694

Open
1 of 2 tasks
kevin85421 opened this issue Dec 27, 2024 · 4 comments · May be fixed by #2726
Open
1 of 2 tasks

[GCS FT] GCS FT misconfiguration #2694

kevin85421 opened this issue Dec 27, 2024 · 4 comments · May be fixed by #2726
Assignees
Labels
1.3.0 bug Something isn't working gcs ft

Comments

@kevin85421
Copy link
Member

kevin85421 commented Dec 27, 2024

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

See this Slack thread for more details.

The user creates a RayService without setting the annotation ray.io/ft-enabled: "true", but they do set RAY_REDIS_ADDRESS and REDIS_PASSWORD. As a result, the RayCluster enables GCS FT and writes data to the external Redis, but KubeRay is unaware of this. KubeRay doesn't configure RAY_external_storage_namespace so that the RayCluster writes data to the key default in the Redis.

When the user triggers a zero-downtime upgrade, the new RayCluster also attempts to read metadata from the default key in Redis. Therefore, the new RayCluster will see some information from the old RayCluster.

Solution:

  • Explicitly disable GCS FT if KubeRay determines that the RayCluster has not enabled GCS FT.

Reproduction script

TODO

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@kevin85421 kevin85421 added bug Something isn't working triage gcs ft and removed triage labels Dec 27, 2024
@kevin85421
Copy link
Member Author

/assign @rueian

@rueian
Copy link
Contributor

rueian commented Dec 28, 2024

Ray now relies on the RAY_REDIS_ADDRESS environment variable to determine if it should be in the "redis" mode. Once the variable is set, no matter the value, it will try to parse and connect to the RAY_REDIS_ADDRESS.

To disable the GCS FT explicitly from KubeRay, we now have three options:

  1. Add a new CLI option --disable-gcs-ft or a new environment variable DISABLE_RAY_GCS_FT to Ray. Once Ray finds that either is set, it doesn't go into "redis" mode. Then, We add the option from KubeRay explicitly on the new Ray versions to disable GCS FT.
  2. Ignore empty RAY_REDIS_ADDRESS in Ray. Ray will always try to parse the env variable even if it is an empty string. We can check if it is an empty string before paring it and treat the case as equal to not setting it. Then, we can explicitly set RAY_REDIS_ADDRESS to an empty string on the new Ray versions from KubeRay to disable GCS FT.
  3. Add unset RAY_REDIS_ADDRESS to the ray start command generated by KubeRay.

The first two options require us to make changes to Ray. Among them, I prefer the second one as it only slightly extends the RAY_REDIS_ADDRESS environment variable without adding completely new things. I think the second option will be much easier to be accepted by Ray.

However, I prefer the third option overall since it doesn't depend on Ray and it can work with all the current Ray versions. It also seems to be the easiest to implement.

Please let me know what you think about these approaches @kevin85421. Thanks.

@rueian
Copy link
Contributor

rueian commented Jan 7, 2025

After discussing this offline with @kevin85421, we decided not to override and explicitly disable users' configuration but to raise an error to users instead. i.e. If users configure the GCS FT in a way KubeRay doesn't support, the RayCluster will be rejected by the KubeRay operator. This mechanism will be added after #2712.

@kevin85421
Copy link
Member Author

@rueian another validation is also required: the annotation explicitly disables GCS FT (i.e., ray.io/ft-enabled: "false") and GcsFaultToleranceOptions is not nil. This case is invalid.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1.3.0 bug Something isn't working gcs ft
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants