Use HTTP probes for Ray readiness and liveness probes #2360

Open
wants to merge 1 commit into
base: master

andrewsykim
Collaborator

@andrewsykim andrewsykim commented Sep 6, 2024

Why are these changes needed?

HTTP probes are considered lighter-weight than exec probes. However, exec probes have the advantage of running multiple health checks in a single probe. In KubeRay, we use exec probes to execute "wget" commands against multiple endpoints. Use of exec probes seems to be causing some issues, as shown in #2264 and in KubeRay scalability testing.

This PR explores using HTTP probes instead. It needs more consideration, because using HTTP probes means we can only health check 1 endpoint per probe. Marking WIP for now until that question is resolved.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@@ -271,7 +256,7 @@ func initLivenessAndReadinessProbe(rayContainer *corev1.Container, rayNodeType r
 SuccessThreshold: utils.DefaultLivenessProbeSuccessThreshold,
 FailureThreshold: utils.DefaultLivenessProbeFailureThreshold,
 }
-rayContainer.LivenessProbe.Exec = &corev1.ExecAction{Command: []string{"bash", "-c", strings.Join(commands, " && ")}}
+rayContainer.LivenessProbe.HTTPGet = &corev1.HTTPGetAction{Path: healthCheckPath, Port: healthCheckPort}
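To make the exec side of this hunk concrete, the sketch below shows how a chained probe command could be assembled from several endpoint checks joined with " && ", so the probe fails as soon as any check fails. The `wgetTemplate` string, the `endpoint` type, and the port numbers are assumptions for illustration and are not taken from this diff.

```go
package main

import (
	"fmt"
	"strings"
)

// wgetTemplate is an assumed command shape, not copied from the PR.
const wgetTemplate = "wget -T %d -q -O- http://localhost:%d/%s | grep success"

// endpoint is a hypothetical port/path pair for one health check.
type endpoint struct {
	Port int
	Path string
}

// execProbeCommand builds one wget check per endpoint and chains them with
// "&&", which is why a single exec probe can cover several endpoints.
func execProbeCommand(timeoutSec int, endpoints []endpoint) []string {
	commands := make([]string, 0, len(endpoints))
	for _, e := range endpoints {
		commands = append(commands, fmt.Sprintf(wgetTemplate, timeoutSec, e.Port, e.Path))
	}
	return []string{"bash", "-c", strings.Join(commands, " && ")}
}

func main() {
	// Port numbers here are assumptions for the example.
	cmd := execProbeCommand(2, []endpoint{
		{Port: 52365, Path: "api/local_raylet_healthz"},
		{Port: 8265, Path: "api/gcs_healthz"},
	})
	fmt.Println(cmd[2])
}
```

An HTTP probe has no equivalent of the "&&" chain, which is the limitation discussed in the review comment that follows.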
Collaborator Author

Using HTTP probes means we can only query 1 endpoint per probe now. For the head pod this would be /api/gcs_healthz and for the worker pod it would be /api/local_raylet_healthz. I'm not sure whether skipping the /api/local_raylet_healthz check in the head pod is problematic; it depends on whether /api/gcs_healthz incorporates raylet health in some way as well.
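A rough sketch of the selection described above, with one target per node type. The `nodeType` and `httpGetTarget` types are local stand-ins rather than the KubeRay or `corev1` types, and both port constants are assumptions for illustration only.

```go
package main

import "fmt"

// nodeType is a hypothetical local stand-in for the Ray node type.
type nodeType string

const (
	headNode   nodeType = "head"
	workerNode nodeType = "worker"
)

// httpGetTarget is a local stand-in for the path/port pair of an HTTP probe.
type httpGetTarget struct {
	Path string
	Port int
}

// Both ports are assumed values, not taken from the PR.
const (
	assumedDashboardPort = 8265
	assumedAgentPort     = 52365
)

// probeTargetFor returns the single endpoint an HTTP probe would query:
// the GCS health endpoint on the head pod, the local raylet health
// endpoint everywhere else.
func probeTargetFor(t nodeType) httpGetTarget {
	if t == headNode {
		return httpGetTarget{Path: "/api/gcs_healthz", Port: assumedDashboardPort}
	}
	return httpGetTarget{Path: "/api/local_raylet_healthz", Port: assumedAgentPort}
}

func main() {
	fmt.Printf("%+v\n", probeTargetFor(headNode))
	fmt.Printf("%+v\n", probeTargetFor(workerNode))
}
```

With this shape, the head pod never queries the raylet endpoint, which is exactly the open question in the comment.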

@YQ-Wang
Contributor

YQ-Wang commented Sep 10, 2024

We also face this issue when the workload is high.

@kevin85421
Member

@andrewsykim do we still need this PR after #2353 has been merged?

@andrewsykim
Collaborator Author

I think we should still consider using HTTP probes; they are significantly lighter weight. I haven't root-caused the high load I'm seeing from exec probes, and increasing the timeout did not fully resolve it.

@joshhvulcan

I have encountered some bizarre behavior with the exec probes that I think would be solved with http probes. The biggest issue may actually be a k8s bug though. The ray-head container had died, but the autoscaler container was still running, so the pod was kept alive by k8s. The probe was failing because the exec was failing, and k8s took no action because the probe failed to fail? (lol). Anyway, http probes likely would have done the needful here.

Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "03e693753f58930cd9bf004e047ff1cf7c26afd30ea916cbe0d291e130ea9d27": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown

@andrewsykim
Collaborator Author

andrewsykim commented Oct 31, 2024

I'm still in favor of this change and I would like to see it merged for v1.3. However, using http probes means we can only probe 1 HTTP endpoint per container. Specifically for the Head pod, it means probing only the dashboard endpoint and not the raylet agent. Are we okay with that change? @kevin85421 @joshhvulcan

@joshhvulcan

I think an http probe on the dashboard would be sufficient for the failures we have experienced.

@andrewsykim changed the title from "[WIP] Use HTTP probes for Ray readiness and liveness probes" to "Use HTTP probes for Ray readiness and liveness probes" on Oct 31, 2024
@andrewsykim
Collaborator Author

PR updated, PTAL

@metasyn
Contributor

metasyn commented Dec 10, 2024

A side benefit is that this does not force custom images to include wget

@andrewsykim
Collaborator Author

Consider consolidating health check endpoints in Ray Core

5 participants