Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RayJob][Feature] add light weight job submitter in kuberay image #2587

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

rueian
Copy link
Contributor

@rueian rueian commented Nov 27, 2024

Why are these changes needed?

Currently, noted in the issue #2537, when a user comes with a RayJob CR, KubeRay uses the same image as the RayCluster to start another container to submit the Ray Job. However, if the container runs on a node without the image preloaded, it takes a long time to download the image and start since the image is usually large.

This PR adds a light submitter (45MB) that mimics the ray job submit behavior (submit + tail logs) into the KubeRay image which is usually smaller than the image used in the RayCluster. Users can try it with the submitterPodTemplate in their RayJob CR.

Example RayJob CR yaml:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  rayClusterSpec:
    ...
  submitterPodTemplate:
    spec:
      restartPolicy: Never
      containers:
        - name: my-custom-rayjob-submitter-pod
          image: kuberay/submitter:nightly
          args: ["--runtime-env-json", '{"pip":["requests==2.26.0","pendulum==2.1.2"],"env_vars":{"counter_name":"test_counter"}}', "--", "python", "/home/ray/samples/sample_code.py"]

And, this submitter will not fail when the job has already been submitted thus will also solve #2154.

Related issue number

#2537

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
image

@rueian rueian force-pushed the light-weight-job-submitter branch 20 times, most recently from 6a0092f to 9483127 Compare November 30, 2024 17:38
@rueian rueian marked this pull request as ready for review December 1, 2024 04:40
@andrewsykim
Copy link
Collaborator

andrewsykim commented Dec 3, 2024

Are we planning to merge this PR and #2579?

I would suggest holding on this until we hear feedback about #2579 and whether it addresses concerns with RayJob

@rueian
Copy link
Contributor Author

rueian commented Dec 3, 2024

Just to note that although both PRs can solve the duplicate submission issue, this lightweight committer can further shorten startup duration by a smaller image.

@andrewsykim
Copy link
Collaborator

Makes sense, but I'm concerned about kuberay operator image becoming a dependency at the cluster / job level. If we think this is worth doing, we should probably create a new image

@rueian rueian force-pushed the light-weight-job-submitter branch 2 times, most recently from 2bdb064 to d274346 Compare December 10, 2024 22:46
@rueian rueian force-pushed the light-weight-job-submitter branch 2 times, most recently from 66302f3 to d274346 Compare December 10, 2024 23:47
@rueian rueian force-pushed the light-weight-job-submitter branch from d274346 to e3fb564 Compare December 10, 2024 23:54
@rueian
Copy link
Contributor Author

rueian commented Dec 10, 2024

Hi @kevin85421, I have used a new GitHub action job to build a dedicated image for the submitter but the job requires credentials which I believe are only available after merging the PR.

image

Do you have any suggested way to test the GitHub action job before merging the PR? Or probably we just merge it first?

@andrewsykim
Copy link
Collaborator

IMO I don't think we need this with #2579 merged. Or at least we can revisit after v1.3 based on user feedback

@kevin85421
Copy link
Member

IMO I don't think we need this with #2579 merged. Or at least we can revisit after v1.3 based on user feedback

The lightweight job submitter still has its own benefits (e.g., much faster image pulling), but I agree that we can revisit this based on the feedback from v1.3 to determine if the image pulling overhead of the K8s Job Submitter is problematic. If users always run the submitter on a K8s node that caches the Ray image, the lightweight submitter may not be necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants