[Bug] Leader Election Lost: Kuberay pod restarts every 5mins! #2252
Comments
How frequently is it happening? I suggest removing the CPU limit.
Btw, since you mentioned running on GKE, consider using the official GKE add-on (https://cloud.google.com/kubernetes-engine/docs/add-on/ray-on-gke/how-to/enable-ray-on-gke). However, it's only available on v1.30+ at the moment (you mentioned running v1.28.10).
Thank you @andrewsykim for your quick reply. You're right, the limit is no good here. More logs:
@kevin85421 suggested configuring KubeRay without leader election. Does KubeRay support an env var, RAY_DISABLE_LEADER_ELECTION or something similar, to configure this via values.yaml?
The limit is set by default in the KubeRay chart; when removing the limit and keeping the request, I'm getting this error ^.
Created this MR to remove the default in a new version of KubeRay:
Does removing CPU limits resolve the leader election issue though?
@andrewsykim I can't remove the limits via Helm right now since they're set by default in the Helm chart, so I need to do it manually.
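In case it helps anyone else doing the manual workaround, here is a sketch of a strategic merge patch that keeps the requests but clears the limits. The Deployment/container name `kuberay-operator` and the request sizes are assumptions, so match them to your actual Deployment:

```yaml
# patch.yaml -- sketch only. In a strategic merge patch, setting a map field
# to null deletes it, so this clears resources.limits while keeping requests.
# Apply with: kubectl patch deployment kuberay-operator --patch-file patch.yaml
spec:
  template:
    spec:
      containers:
        - name: kuberay-operator      # container name assumed; check your Deployment
          resources:
            limits: null              # drop the CPU/memory limits entirely
            requests:
              cpu: 100m               # illustrative sizes, not a recommendation
              memory: 512Mi
```

Note that Helm will restore the chart defaults on the next upgrade. Alternatively, Helm treats an explicit `limits: null` in a values override as "delete this default key" when it deep-merges values, so the same effect should be achievable from values.yaml without patching.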
The issue persists even without CPU limits, or after downgrading to KubeRay version 1.1.0. Any help?
Hmmm, not sure then. Since you're running a single kuberay-operator anyway, you can follow Kai-Hsun's suggestion and just disable leader election.
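For reference, disabling it manually would look roughly like this on the operator Deployment. This is a sketch assuming the `--enable-leader-election` flag from the operator's main.go is present in your KubeRay version, so double-check it before relying on it:

```yaml
# Fragment of the kuberay-operator Deployment spec (sketch): pass the flag
# that turns leader election off, so the manager no longer has to renew a
# Lease against the API server.
spec:
  template:
    spec:
      containers:
        - name: kuberay-operator                # container name assumed from the chart
          image: quay.io/kuberay/operator:v1.1.1
          args:
            - --enable-leader-election=false    # flag name assumed; confirm in your version's main.go
```

The trade-off is the one raised further down in the thread: without the Lease there is nothing preventing two operator pods from reconciling at once, e.g. briefly during a rolling update.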
@andrewsykim should I increase replicas? This is prod.
Seems related: #601
Try setting
Can you also share the full pod YAML of the KubeRay operator?
@andrewsykim I can do it manually, but not via Helm, as we don't support this flag: https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/templates/deployment.yaml#L57 Can I create an MR for this? And here is the YAML:
Created this PR:
Contributing to this project seems like a big milestone for me, TIA! 🥇
Just curious, have you checked why the leader election timed out? Was KubeRay killed by the OOM killer, or was it a networking issue connecting to the API server? If you simply disable leader election, you run the risk of losing HA.
@Irvingwangjr you're right, it seems like a timeout connecting to the API server. @andrewsykim can you share more knowledge about that? In stg, after disabling leader election, I see no restarts. @Irvingwangjr anyway, I'm using 1 replica for KubeRay; should I increase it and enable leader election, or keep using 1 replica and disable leader election?
To me it seems that we hit the K8s control plane API server limit, as we see a bunch of timeouts.
GCP acknowledges that there is a limit of 3k requests per minute for the API server, so we get timeouts to the control plane API server. Credit: @dyurchanka
The 3K/minute quota is the default limit for the GKE API and NOT the Kubernetes API server. However, it's possible that your cluster is throttling API requests from the KubeRay operator. Usually you can figure out whether this is happening by looking at the apiserver logs. See https://cloud.google.com/kubernetes-engine/docs/how-to/view-logs#control_plane_logs
Also see the apiserver metrics that could help identify throttling from the API server: https://cloud.google.com/kubernetes-engine/docs/how-to/control-plane-metrics#api-server-metrics The ones containing
Closing; moving to a regional cluster solved the KubeRay restarts.
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
KubeRay keeps restarting because of leader election lost. I've raised it in Slack a couple of times but no luck, and we suspect it is causing us issues in some RayServices.
We use the KubeRay 1.1.1 Helm chart version.
The main issue is here:
{"level":"error","ts":"2024-07-17T13:33:36.127Z","logger":"setup","msg":"problem running manager","error":"leader election lost","stacktrace":"main.exitOnError\n\t/home/runner/work/kuberay/kuberay/ray-operator/main.go:245\nmain.main\n\t/home/runner/work/kuberay/kuberay/ray-operator/main.go:228\nruntime.main\n\t/opt/hostedtoolcache/go/1.20.14/x64/src/runtime/proc.go:250"}
kuberay-logs.txt
Values configured for the Helm chart:
replicas: 1
Willing to provide documentation or code if needed :)
Reproduction script
KubeRay 1.1.1, deployed on GKE
v1.28.10-gke.107500
.Values configured above (nothing special IMHO).
Anything else
No response
Are you willing to submit a PR?