What would you like to be added:
Serving workloads differ from training workloads in that they can be trimmed: a Deployment can keep running at 70% or 50% of its Pods, whereas most AI training workloads need all of their Pods to run. We want to leverage this fact to optimize preemptions.
In particular, when a new high-priority workload comes in and we have multiple serving workloads, we want to distribute the preemptions across the serving workloads rather than preempting one of them completely (see the sketch below).
Note that this is also related to partial preemption for batch workloads: #975. We may consider a solution that solves both problems, but for now it seems reasonable to keep this dedicated issue, emphasizing that serving workloads are special in this regard.
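To make the idea concrete, here is a minimal sketch of how preemptions could be spread proportionally across trimmable serving workloads instead of fully evicting one of them. This is not Kueue's actual API; the `servingWorkload` type, field names, and `distributeTrim` function are hypothetical, and the real design would be settled in the design doc.

```go
// Hypothetical sketch (not Kueue's actual API): given the number of pods an
// incoming high-priority workload needs, spread the trimming proportionally
// across serving workloads instead of fully preempting a single one.
package main

import "fmt"

// servingWorkload is an illustrative stand-in for a trimmable serving
// workload, e.g. a Deployment admitted by Kueue.
type servingWorkload struct {
	Name     string
	Replicas int // currently running pods
	MinReady int // floor below which the workload should not be trimmed
}

// distributeTrim returns how many replicas to remove from each workload so
// that the removals sum to `needed`, spread roughly proportionally to each
// workload's trimmable capacity (Replicas - MinReady).
func distributeTrim(workloads []servingWorkload, needed int) map[string]int {
	trims := map[string]int{}
	trimmable := 0
	for _, w := range workloads {
		trimmable += w.Replicas - w.MinReady
	}
	if trimmable == 0 || needed <= 0 {
		return trims
	}
	if needed > trimmable {
		needed = trimmable // cannot free more than the trimmable capacity
	}
	remaining := needed
	for _, w := range workloads {
		// Proportional share, rounded down; leftovers are handled below.
		share := needed * (w.Replicas - w.MinReady) / trimmable
		trims[w.Name] = share
		remaining -= share
	}
	// Distribute the rounding remainder one pod at a time.
	for i := 0; remaining > 0; i = (i + 1) % len(workloads) {
		w := workloads[i]
		if w.Replicas-w.MinReady > trims[w.Name] {
			trims[w.Name]++
			remaining--
		}
	}
	return trims
}

func main() {
	serving := []servingWorkload{
		{Name: "inference-a", Replicas: 10, MinReady: 5},
		{Name: "inference-b", Replicas: 10, MinReady: 5},
	}
	// A high-priority training job needs room for 4 pods: trim 2 replicas
	// from each serving workload instead of preempting one of them entirely.
	fmt.Println(distributeTrim(serving, 4)) // map[inference-a:2 inference-b:2]
}
```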
Why is this needed:
To improve the experience of hosting a mix of training and inference workloads. When a high-priority workload arrives, we can make room for it by trimming multiple serving workloads rather than completely preempting one of them.
Completion requirements:
This enhancement requires the following artifacts:
Design doc
API change
Docs update
The artifacts should be linked in subsequent comments.