[RayJob] implement deletion policy API #2643

Open · wants to merge 2 commits into master from rayjob-delete-policy

Conversation

andrewsykim (Collaborator) opened this pull request:

Why are these changes needed?

Implement RayJob DeletionPolicy API
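
For context, a rough sketch of what the new API surface could look like in the rayv1 package is below. Aside from DeleteNoneDeletionPolicy, which is referenced later in this review, the field and constant names here are assumptions rather than the final API:

package rayv1

// DeletionPolicy describes what the RayJob controller cleans up once the job
// reaches a terminal state. Sketch only; DeleteNoneDeletionPolicy is the one
// value confirmed elsewhere in this thread.
type DeletionPolicy string

const (
	DeleteClusterDeletionPolicy DeletionPolicy = "DeleteCluster" // delete the whole RayCluster
	DeleteWorkersDeletionPolicy DeletionPolicy = "DeleteWorkers" // delete only worker Pods
	DeleteSelfDeletionPolicy    DeletionPolicy = "DeleteSelf"    // delete the RayJob resource itself
	DeleteNoneDeletionPolicy    DeletionPolicy = "DeleteNone"    // delete nothing
)

type RayJobSpec struct {
	// ... existing RayJob fields ...

	// DeletionPolicy is optional and gated behind the RayJobDeletionPolicy feature flag.
	// +optional
	DeletionPolicy *DeletionPolicy `json:"deletionPolicy,omitempty"`
}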

Related issue number

#2615

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@andrewsykim force-pushed the rayjob-delete-policy branch 2 times, most recently from 08bdbfb to b2d43be on December 15, 2024 at 04:53
@andrewsykim changed the title from [WIP][RayJob] implement deletion policy API to [RayJob] implement deletion policy API on Dec 15, 2024
@MortalHappiness (Member) left a comment:

Could you resolve the conflicts? Also, should we add some tests for this feature?

ray-operator/controllers/ray/rayjob_controller.go (outdated, resolved)
@andrewsykim (Collaborator, Author):

Fixed conflicts, will add tests tomorrow

@andrewsykim force-pushed the rayjob-delete-policy branch 2 times, most recently from 30adbd6 to 33747ec on December 17, 2024 at 00:31
@andrewsykim (Collaborator, Author):

Added unit tests. Going to skip e2e tests for now since it's currently not trivial to enable feature gates in the e2e tests.

@MortalHappiness (Member) left a comment:

LGTM

rayJobInstance.Spec.DeletionPolicy != nil &&
*rayJobInstance.Spec.DeletionPolicy != rayv1.DeleteNoneDeletionPolicy &&
len(rayJobInstance.Spec.ClusterSelector) == 0 {
logger.Info(
Member:

Move

			logger.Info(
				"RayJob deployment status",
				"jobDeploymentStatus", rayJobInstance.Status.JobDeploymentStatus,
				"deletionPolicy", rayJobInstance.Spec.DeletionPolicy,
				"ttlSecondsAfterFinished", ttlSeconds,
				"Status.endTime", rayJobInstance.Status.EndTime,
				"Now", nowTime,
				"ShutdownTime", shutdownTime)
			if shutdownTime.After(nowTime) {
				delta := int32(time.Until(shutdownTime.Add(2 * time.Second)).Seconds())
				logger.Info("shutdownTime not reached, requeue this RayJob for n seconds", "seconds", delta)
				return ctrl.Result{RequeueAfter: time.Duration(delta) * time.Second}, nil
			}

to above the if features.Enabled(features.RayJobDeletionPolicy) && check, and remove the similar logic from L391 to L403.

@@ -617,6 +655,31 @@ func (r *RayJobReconciler) deleteClusterResources(ctx context.Context, rayJobIns
return isClusterDeleted, nil
}

func (r *RayJobReconciler) scaleWorkerReplicasToZero(ctx context.Context, rayJobInstance *rayv1.RayJob) (bool, error) {
Member:

do we need to return bool?

andrewsykim (Collaborator, Author):

no, this is an oversight

@@ -617,6 +655,31 @@ func (r *RayJobReconciler) deleteClusterResources(ctx context.Context, rayJobIns
return isClusterDeleted, nil
}

func (r *RayJobReconciler) scaleWorkerReplicasToZero(ctx context.Context, rayJobInstance *rayv1.RayJob) (bool, error) {
Member:

This function may not work when autoscaling is enabled. If autoscaling is enabled, Pod deletion is always determined by the Ray Autoscaler. KubeRay will not delete any Pods, even if the number of Pods exceeds the goal state.

andrewsykim (Collaborator, Author):

I see your point, but autoscaling with RayJob is pretty uncommon though right?

andrewsykim (Collaborator, Author):

One way to fix this is to also set max replicas to 0
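
For illustration, a minimal sketch of that idea as a standalone helper (not the scaleWorkerReplicasToZero method from the diff). It assumes the controller-runtime client and the KubeRay v1 worker group fields (Replicas, MinReplicas, MaxReplicas as *int32), and returns only an error, reflecting the earlier comment that the bool return value is unnecessary:

package rayjobutil // hypothetical package for this sketch

import (
	"context"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// scaleWorkerGroupsToZero sets replicas, minReplicas, and maxReplicas to zero
// for every worker group of the job's RayCluster, so that the Ray Autoscaler
// cannot scale the groups back up before cleanup.
func scaleWorkerGroupsToZero(ctx context.Context, c client.Client, rayJob *rayv1.RayJob) error {
	cluster := &rayv1.RayCluster{}
	key := client.ObjectKey{Namespace: rayJob.Namespace, Name: rayJob.Status.RayClusterName}
	if err := c.Get(ctx, key, cluster); err != nil {
		return err
	}
	zero := int32(0)
	for i := range cluster.Spec.WorkerGroupSpecs {
		cluster.Spec.WorkerGroupSpecs[i].Replicas = &zero
		cluster.Spec.WorkerGroupSpecs[i].MinReplicas = &zero
		cluster.Spec.WorkerGroupSpecs[i].MaxReplicas = &zero
	}
	return c.Update(ctx, cluster)
}

Setting maxReplicas alongside replicas and minReplicas is the part that addresses the Autoscaler concern raised above.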

Member:

"but autoscaling with RayJob is pretty uncommon though right?"

I checked with my colleagues, and this may be incorrect. Autoscaling is not very common for Ray Train. However, it is commonly used for Ray Data, Ray Tune, and RLlib.

Member:

Most Ray Data users use autoscaling.

andrewsykim (Collaborator, Author):

I don't mean the Ray API; I mean autoscaling is not common when using the RayJob custom resource. I am sure Ray Data with RayCluster + autoscaling is very common.

@@ -617,6 +655,31 @@ func (r *RayJobReconciler) deleteClusterResources(ctx context.Context, rayJobIns
return isClusterDeleted, nil
}

func (r *RayJobReconciler) scaleWorkerReplicasToZero(ctx context.Context, rayJobInstance *rayv1.RayJob) (bool, error) {
Member:

Typically, a K8s controller should only write to the CR status and treat the CR spec as read-only, but implementing this feature without writing to the CR spec is challenging for us.

Perhaps a compromise solution is to add a new field to the RayCluster CRD (e.g., suspendWorkers: bool), where the RayJob controller only sets this field to true, and the RayCluster is responsible for deleting all Ray worker Pods.

This way, the RayJob controller doesn't need to modify replicas and minReplicas, which can also be modified by the Ray Autoscaler or users. Allowing multiple stakeholders to modify a field is typically the root cause of KubeRay's instability issues.

@andrewsykim (Collaborator, Author) commented on Dec 17, 2024:

Controllers writing to spec is not necessarily bad, but I see what you mean. I think it would be wrong to write to RayJob spec from RayJob controller, but in this case we're writing to RayCluster spec from RayJob controller. I feel that updating replicas, minReplicas, and maxReplicas for an ephemeral RayCluster specifically is actually fine because we don't care about the RayCluster spec once the cluster is deleted.

Will think about this more and get back to you.

Member:

The main concern is that multiple personas can modify these fields, such as users, the Autoscaler, and the RayJob controller.

andrewsykim (Collaborator, Author):

"Perhaps a compromise solution is to add a new field to the RayCluster CRD (e.g., suspendWorkers: bool), where the RayJob controller only sets this field to true, and the RayCluster is responsible for deleting all Ray worker Pods."

@kevin85421 how about a suspend field per worker group in WorkerGroupSpec? This allows for granularity of suspension per worker group, and from RayJob we can just set suspend: true for all worker groups.
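
For illustration, a sketch of what such a field could look like on the RayCluster side; the Suspend name and its placement in WorkerGroupSpec are assumptions based on this discussion, not a settled design:

package rayv1

type WorkerGroupSpec struct {
	// ... existing fields (GroupName, Replicas, MinReplicas, MaxReplicas, Template, ...) ...

	// Suspend, when true, asks the RayCluster controller to delete all Pods in
	// this worker group. The RayJob controller would set it on every worker
	// group instead of rewriting replicas/minReplicas/maxReplicas, fields that
	// users and the Ray Autoscaler also modify.
	// +optional
	Suspend *bool `json:"suspend,omitempty"`
}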

andrewsykim (Collaborator, Author):

Here's the draft PR #2663

Let me know what you think; I will clean up the PR and add tests if the API looks good to you.

Member:

The API looks good to me.

nowTime := time.Now()
shutdownTime := rayJobInstance.Status.EndTime.Add(time.Duration(ttlSeconds) * time.Second)

if features.Enabled(features.RayJobDeletionPolicy) &&
Member:

Update validateRayJobSpec to ensure that the combination of ShutdownAfterJobFinishes: true and rayJobInstance.Spec.DeletionPolicy != rayv1.DeleteNoneDeletionPolicy is invalid.
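
A rough sketch of that validation, assuming validateRayJobSpec takes the RayJob and returns an error (the exact signature and error wording are assumptions):

package ray

import (
	"fmt"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

func validateRayJobSpec(rayJob *rayv1.RayJob) error {
	// ... existing validation ...

	// Per the comment above: reject ShutdownAfterJobFinishes together with any
	// deletion policy other than DeleteNone, since both control cluster cleanup.
	if rayJob.Spec.ShutdownAfterJobFinishes &&
		rayJob.Spec.DeletionPolicy != nil &&
		*rayJob.Spec.DeletionPolicy != rayv1.DeleteNoneDeletionPolicy {
		return fmt.Errorf("shutdownAfterJobFinishes cannot be combined with a deletionPolicy other than DeleteNone")
	}
	return nil
}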

@kevin85421 (Member) left a comment:

#2643 (comment)

Maybe we can split this issue into 3 PRs if the comment makes sense to you?

  1. Add a new field and feature flag in RayJob.
  2. Add a new field in RayCluster CRD to terminate all worker Pods.
  3. Implement the deletion policy API based on (2)

Signed-off-by: Andrew Sy Kim <[email protected]>