[Bug] Leader Election Lost: Kuberay pod restarts every 5mins! #2252
Comments
How frequently is it happening? I suggest removing the CPU limit.
Btw, since you mentioned running on GKE, consider using the official GKE add-on (https://cloud.google.com/kubernetes-engine/docs/add-on/ray-on-gke/how-to/enable-ray-on-gke). However, it's only available on v1.30+ at the moment (you mentioned running v1.28.10).
Thank you @andrewsykim for your quick reply. You're right, the limit is no good here. More logs:
@kevin85421 suggested configuring KubeRay without leader election. Does KubeRay support an env var, RAY_DISABLE_LEADER_ELECTION or something similar, to configure this via values.yaml?
The limit is set by default in the KubeRay chart; when removing the limit and keeping the request, I'm getting this error ^.
Created this MR to remove the default in a new version of KubeRay:
Does removing CPU limits resolve the leader election issue though?
@andrewsykim I can't remove the limits via Helm right now since they're set by default in the Helm chart, so I need to do it manually.
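In case it helps anyone else doing the manual workaround, here is a sketch of a strategic merge patch that keeps the requests but clears the limits. The Deployment/container name `kuberay-operator` and the request sizes are assumptions, so match them to your actual Deployment:

```yaml
# patch.yaml -- sketch only. In a strategic merge patch, setting a map field
# to null deletes it, so this clears resources.limits while keeping requests.
# Apply with: kubectl patch deployment kuberay-operator --patch-file patch.yaml
spec:
  template:
    spec:
      containers:
        - name: kuberay-operator      # container name assumed; check your Deployment
          resources:
            limits: null              # drop the CPU/memory limits entirely
            requests:
              cpu: 100m               # illustrative sizes, not a recommendation
              memory: 512Mi
```

Note that Helm will restore the chart defaults on the next upgrade. Alternatively, Helm treats an explicit `limits: null` in a values override as "delete this default key" when it deep-merges values, so the same effect should be achievable from values.yaml without patching.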
The issue persists even without CPU limits, or after downgrading to KubeRay version 1.1.0. Any help?
Hmmm, not sure then. Since you're running a single kuberay-operator anyway, you can follow Kai-Hsun's suggestion and just disable leader election.
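For reference, disabling it manually would look roughly like this on the operator Deployment. This is a sketch assuming the `--enable-leader-election` flag from the operator's main.go is present in your KubeRay version, so double-check it before relying on it:

```yaml
# Fragment of the kuberay-operator Deployment spec (sketch): pass the flag
# that turns leader election off, so the manager no longer has to renew a
# Lease against the API server.
spec:
  template:
    spec:
      containers:
        - name: kuberay-operator                # container name assumed from the chart
          image: quay.io/kuberay/operator:v1.1.1
          args:
            - --enable-leader-election=false    # flag name assumed; confirm in your version's main.go
```

The trade-off is the one raised further down in the thread: without the Lease there is nothing preventing two operator pods from reconciling at once, e.g. briefly during a rolling update.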
@andrewsykim should I increase replicas? This is prod.
Seems related: #601
Try setting
Can you also share the full pod YAML of the KubeRay operator?
@andrewsykim I can do it manually, but not via Helm, as we don't support this flag: https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/templates/deployment.yaml#L57 Can I create an MR for this? And here is the YAML:
Created this PR:
Contributing to this project seems like a big milestone for me, TIA! 🥇
Just curious, have you checked why the leader election timed out? Was KubeRay killed by the OOM killer, or was it a networking issue connecting to the API server? If you simply disable leader election, you run the risk of losing HA.
@Irvingwangjr you're right, it seems like a timeout connecting to the API server. @andrewsykim can you share more knowledge about that? In stg, after disabling leader election, I see no restarts. @Irvingwangjr anyway, I'm using 1 replica for KubeRay; should I increase it and enable leader election, or keep using 1 replica and disable leader election?
To me it seems that we hit the K8s control plane API server limit, as we see a bunch of timeouts.
GCP acknowledges that there is a limit of 3k requests per minute for the API server, so we get timeouts to the control plane API server. Credit: @dyurchanka
The 3K/minute quota is the default limit for the GKE API and NOT the Kubernetes API server. However, it's possible that your cluster is throttling API requests from the KubeRay operator. Usually you can figure out whether this is happening by looking at the apiserver logs. See https://cloud.google.com/kubernetes-engine/docs/how-to/view-logs#control_plane_logs
Also see the apiserver metrics that could help identify throttling from the API server: https://cloud.google.com/kubernetes-engine/docs/how-to/control-plane-metrics#api-server-metrics The ones containing
Closing; moving to a regional cluster solved the KubeRay restarts.
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
KubeRay keeps restarting because of leader election lost. I've raised it in Slack a couple of times but no luck, and we suspect it is causing us issues in some RayServices.
We use the KubeRay 1.1.1 Helm chart version.
The main issue is here:
{"level":"error","ts":"2024-07-17T13:33:36.127Z","logger":"setup","msg":"problem running manager","error":"leader election lost","stacktrace":"main.exitOnError\n\t/home/runner/work/kuberay/kuberay/ray-operator/main.go:245\nmain.main\n\t/home/runner/work/kuberay/kuberay/ray-operator/main.go:228\nruntime.main\n\t/opt/hostedtoolcache/go/1.20.14/x64/src/runtime/proc.go:250"}
kuberay-logs.txt
Values configured for the Helm chart:
replicas: 1
Willing to provide documentation or code if needed :)
Reproduction script
KubeRay 1.1.1, deployed on GKE
v1.28.10-gke.107500
.Values configured above (nothing special IMHO).
Anything else
No response
Are you willing to submit a PR?