
[RayCluster] support suspending worker groups #2663

Open · wants to merge 1 commit into base: master
Conversation

@andrewsykim (Collaborator)

Why are these changes needed?

Support suspending worker groups in RayCluster

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
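
For context, the change adds a per-worker-group suspend flag. Below is a minimal sketch of how such a field might be declared in raycluster_types.go; the field name, comments, and package line are illustrative based on this conversation, not the merged API.

```go
package v1 // illustrative package name

// WorkerGroupSpec (abridged): only the fields relevant to this PR are shown.
type WorkerGroupSpec struct {
	// GroupName identifies the worker group.
	GroupName string `json:"groupName"`

	// Suspend, when true, tells the controller to delete all Pods belonging to
	// this worker group and to skip reconciling the group until it is unset.
	// Assumed shape: *bool, so that nil (default) and true are the expected values.
	Suspend *bool `json:"suspend,omitempty"`
}
```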

@andrewsykim (Collaborator Author)
(will clean up implementation and add tests once we agree on the API)

@@ -776,6 +776,22 @@ func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv
return err
}

// Delete all workers if worker group is suspended
if len(workerPods.Items) > 0 && worker.Suspend != nil && *worker.Suspend {
for _, workerPod := range workerPods.Items {
Member: Use DeleteAllOf instead?
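
For reference, the suggestion is to replace the per-Pod deletion loop with a single controller-runtime DeleteAllOf call, roughly along these lines. The helper name and label keys below are illustrative; KubeRay's real selectors come from its association options.

```go
package sketch // illustrative

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteWorkerGroupPods issues one DeleteAllOf request for the group's Pods
// instead of deleting them one at a time in a loop.
func deleteWorkerGroupPods(ctx context.Context, c client.Client, namespace, clusterName, groupName string) error {
	return c.DeleteAllOf(ctx, &corev1.Pod{},
		client.InNamespace(namespace),
		client.MatchingLabels{
			"ray.io/cluster": clusterName, // illustrative label keys
			"ray.io/group":   groupName,
		},
	)
}
```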

Collaborator Author: DeleteAllOf seems better; I didn't use it because of r.rayClusterScaleExpectation.ExpectScalePod. Do you know if it's okay to skip this call in this case?

Collaborator Author: Seems like we don't call this for spec.suspend, so it seems fine?

Collaborator Author: Updated to use r.deleteAllPods instead, which calls DeleteAllOf.

@andrewsykim (Collaborator Author): Added tests, PTAL.

@andrewsykim andrewsykim changed the title [WIP][RayCluster] support suspending worker groups [RayCluster] support suspending worker groups Dec 18, 2024
@@ -776,6 +776,18 @@ func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv
return err
}

// Delete all workers if worker group is suspended and skip reconcile
if worker.Suspend != nil && *worker.Suspend {
if _, err := r.deleteAllPods(ctx, common.RayClusterGroupPodsAssociationOptions(instance, worker.GroupName)); err != nil {
Collaborator Author: deleteAllPods checks existing pods first, so this doesn't always call DeleteAllOf.

On the added comment line "// Delete all workers if worker group is suspended and skip reconcile":
Member: A follow-up issue: should we call deleteAllPods only when there are no in-flight requests? If so, I will add a follow-up item in #2566.

Member: Update: the expectation has already been checked by the following logic:

if !r.rayClusterScaleExpectation.IsSatisfied(ctx, instance.Namespace, instance.Name, worker.GroupName) {

Member: Maybe another follow-up is to update expectations for deleteAllPods instead. Currently, we rely on deleteAllPods checking that the number of Pods is greater than 0 to avoid calling DeleteAllOf on every reconciliation.
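
A rough sketch of the guard being described, written with generic controller-runtime calls (this paraphrases the behavior discussed above and is not the repository's actual deleteAllPods helper): list the matching Pods first and only call DeleteAllOf when at least one exists, so an already-empty group does not issue a delete request on every reconciliation.

```go
package sketch // illustrative

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteAllPodsIfAny lists Pods matching the selector and only calls
// DeleteAllOf when the list is non-empty.
func deleteAllPodsIfAny(ctx context.Context, c client.Client, namespace string, selector map[string]string) (int, error) {
	var pods corev1.PodList
	if err := c.List(ctx, &pods, client.InNamespace(namespace), client.MatchingLabels(selector)); err != nil {
		return 0, err
	}
	if len(pods.Items) == 0 {
		return 0, nil // nothing to delete; skip the DeleteAllOf call
	}
	return len(pods.Items), c.DeleteAllOf(ctx, &corev1.Pod{},
		client.InNamespace(namespace), client.MatchingLabels(selector))
}
```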

Collaborator Author: Yeah, I wasn't sure what to do about expectations, but it doesn't seem like we do anything for expectations in the standard suspend case with .spec.suspend.

Member: Re: "but it doesn't seem like we do anything for expectations in the standard suspend case with .spec.suspend"

We need to address it in the future. You can check #2566 for more details, but I think it is fine not to handle expectations in this PR.

r.Recorder.Eventf(instance, corev1.EventTypeWarning, string(utils.FailedToDeleteWorkerPod),
"Failed deleting worker Pods for suspended group %s in RayCluster %s/%s, %v", worker.GroupName, instance.Namespace, instance.Name, err)
return errstd.Join(utils.ErrFailedDeleteWorkerPod, err)
}
@kevin85421 (Member), Dec 19, 2024:

Should we make the suspend operation atomic, like in #1798?

If this field is only used by the RayJob controller for now, it is fine for it not to be atomic at this moment. In that case, we should add a comment in raycluster_types.go stating that this field is not user-facing, and add a check in validateRayClusterSpec to ensure that workerGroup.suspend is never false (only nil and true are valid). This is not ideal, but it may avoid most misconfigurations.

Does DeleteAllOf guarantee to delete all or nothing? Is it possible that only some Pods are deleted while others fail to be deleted? If DeleteAllOf guarantees all-or-nothing, we don't need to worry about atomicity.
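
A minimal sketch of the validation kevin85421 suggests, assuming the existing WorkerGroupSpecs field plus the Suspend field from this PR; the function name is hypothetical and not the repository's actual validateRayClusterSpec.

```go
package sketch // illustrative

import (
	"fmt"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// validateWorkerGroupSuspend rejects an explicit `suspend: false` so that only
// nil and true are accepted while the field is not yet user-facing.
func validateWorkerGroupSuspend(spec *rayv1.RayClusterSpec) error {
	for _, group := range spec.WorkerGroupSpecs {
		if group.Suspend != nil && !*group.Suspend {
			return fmt.Errorf("worker group %q: suspend may only be unset or true", group.GroupName)
		}
	}
	return nil
}
```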

@andrewsykim (Collaborator Author), Dec 20, 2024:

I don't think DeleteAllOf is atomic. I'm not sure the approach we took for RayJob will work here; it would mean we need a condition per worker group to track whether the suspension has completed.
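
Purely to illustrate the alternative being weighed, a per-worker-group condition might look roughly like this; the condition type, reasons, and helper name are hypothetical and not part of this PR.

```go
package sketch // illustrative

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setWorkerGroupSuspendedCondition records, per worker group, whether all of
// its Pods have been deleted, similar to how RayJob tracks suspension (#1798).
func setWorkerGroupSuspendedCondition(conditions *[]metav1.Condition, groupName string, allPodsDeleted bool) {
	cond := metav1.Condition{
		Type:    fmt.Sprintf("WorkerGroupSuspended-%s", groupName), // hypothetical condition type
		Status:  metav1.ConditionFalse,
		Reason:  "PodDeletionInProgress",
		Message: fmt.Sprintf("deleting Pods for suspended worker group %s", groupName),
	}
	if allPodsDeleted {
		cond.Status = metav1.ConditionTrue
		cond.Reason = "AllPodsDeleted"
		cond.Message = fmt.Sprintf("all Pods for worker group %s are deleted", groupName)
	}
	meta.SetStatusCondition(conditions, cond)
}
```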

Member: Are we expecting this field to be used only by the RayJob controller and not be user-facing, at least for v1.3.0? If so, we can use the earlier suggestion:

"If this field is only used by the RayJob controller for now, it is fine for it not to be atomic at this moment. In that case, we should add a comment in raycluster_types.go stating that this field is not user-facing, and add a check in validateRayClusterSpec to ensure that workerGroup.suspend is never false (only nil and true are valid). This is not ideal, but it may avoid most misconfigurations."

Collaborator Author: I think the main use case will be with RayJob, but it should be valid for other use cases too. We shouldn't stop or prevent users from using this field if they want to.

@kevin85421 (Member): Would you mind rebasing with the master branch to include this fix: #2664?

Labels: None yet
Projects: None yet
3 participants