add kep first version

kubernetes-sigs · Sep 9, 2024 · 65feed7 · 65feed7
1 parent e6fb7b6
commit 65feed7
Showing 1 changed file with 223 additions and 1 deletion.
diff --git a/keps/74-support-argo-workflow/README.md b/keps/74-support-argo-workflow/README.md
@@ -28,7 +28,8 @@
 
 ### Goals
 
-
+- Support Argo Workflow in Kueue. Users only need to add the `kueue.x-k8s.io/queue-name` label to
+their workflows and submit them in a suspended state.
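+
+As a minimal sketch of such a submission (the queue name `user-queue`, the image, and the template are illustrative assumptions, not part of this proposal):
+
+```yaml
+apiVersion: argoproj.io/v1alpha1
+kind: Workflow
+metadata:
+  generateName: hello-
+  labels:
+    kueue.x-k8s.io/queue-name: user-queue   # assumed LocalQueue name
+spec:
+  suspend: true          # submitted suspended; expected to be resumed after admission
+  entrypoint: main
+  templates:
+  - name: main
+    container:
+      image: busybox
+      command: [echo, hello]
+```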
 
 ### Non-Goals
 
@@ -39,10 +40,231 @@
 
 ### User Stories
 
+#### Story 1
+
+As an ML engineer, I want to run some data processing before my training job starts. I will
+submit a workflow with two steps: the first is a data processing job, and the second
+is a PyTorchJob. The data processing job does not require GPUs, so it should not be
+blocked by the GPU quota.
 
+#### Story 2
+
+As an ML engineer, my workflow comprises multiple stages that require GPU resources, all
+of which have identical resource demands. I want to reuse the resources already allocated
+to previous nodes in my workflow to improve efficiency and resource utilization.
 
 ## Design Details
 
+### Workflow as a Unit
+
+Pods in one workflow can have different resources, node affinities, tolerations, etc., and
+parallelism can change during the workflow's execution. It is therefore difficult for the
+controller to determine how many resources a workflow needs from each flavor. In this case, users
+have to specify the resources for the workflow in the workflow's annotations. Users can declare the
+potential resource requirements of their workflows by setting the `kueue.k8s.io/max-resources`
+annotation, and they can configure tolerations for tainted nodes and node
+selectors using `kueue.k8s.io/toleration` and `kueue.k8s.io/node-selector`, respectively.
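+
+A hedged sketch of how these annotations might look on a workflow. The KEP does not define the value formats, so the encodings below (a comma-separated resource list, a JSON toleration list, a `key=value` selector) and the queue name are assumptions for illustration only:
+
+```yaml
+apiVersion: argoproj.io/v1alpha1
+kind: Workflow
+metadata:
+  generateName: train-
+  labels:
+    kueue.x-k8s.io/queue-name: user-queue   # assumed LocalQueue name
+  annotations:
+    # All value formats below are hypothetical; this KEP does not specify them.
+    kueue.k8s.io/max-resources: "cpu=8,memory=32Gi,nvidia.com/gpu=2"
+    kueue.k8s.io/toleration: '[{"key":"nvidia.com/gpu","operator":"Exists","effect":"NoSchedule"}]'
+    kueue.k8s.io/node-selector: "node-type=gpu"
+# spec omitted for brevity
+```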
+
+#### Drawback and Limitations
+
+- Different nodeSelectors and tolerations cannot be set for more than one kind of podSet
+in this way.
+
+#### Advantages
+
+- The architecture is simple and easy to implement.
+
+### Layer as a Unit
+
+A workflow's template definition can be a container invocation (leaf template) or a list
+of steps. We will create a workload for each parallel step that is composed of leaf templates.
+For a workflow that is composed of a single leaf template, we create one workload for it.
+
+#### Examples 
+
+In the following examples, we only discuss which workflow patterns should lead to the
+creation of workloads, without going into how these workloads are created or how
+responsibilities are divided between the workflow-controller and Kueue.
+
+##### Example 1 (ParallelSteps Contains Leaf Template Only)
+For a parallelStep with only leaf templates, we create a workload for the parallelStep.
+In the following example, we create workloads for `loop-example-depth-2(0:depth-1-1)` and
+`loop-example-depth-2(1:depth-1-2)`. The patterns for DAGs are similar, so we do not discuss them
+separately.
+
+```
+# kubectl create -f - << EOF
+apiVersion: argoproj.io/v1alpha1
+kind: Workflow
+metadata:
+  generateName: loops-
+  namespace: argo
+spec:
+  entrypoint: loop-example-depth-1
+  templates:
+  - name: loop-example-depth-2
+    steps:
+    - - name: print-message-loop
+        template: print-message
+        arguments:
+          parameters:
+          - name: message
+            value: "{{item}}"
+        withItems:              # invoke print-message once for each item in parallel
+        - hello world           # item 1
+        - goodbye world         # item 2
+  - name: loop-example-depth-1
+    steps:
+    - - name: loop-example-depth-2
+        template: loop-example-depth-2
+        withItems:
+        - depth-1-1
+        - depth-1-2
+  - name: print-message
+    inputs:
+      parameters:
+      - name: message
+    container:
+      image: busybox
+      command: [echo]
+      args: ["{{inputs.parameters.message}}"]
+EOF
+      
+# argo get loops-mlr6m
+...
+
+STEP                                            TEMPLATE              PODNAME                               DURATION  MESSAGE
+ ✔ loops-mlr6m                                  loop-example-depth-1
+ └─┬─✔ loop-example-depth-2(0:depth-1-1)        loop-example-depth-2
+   │ └─┬─✔ print-message-loop(0:hello world)    print-message         loops-mlr6m-print-message-2545579066  6s
+   │   └─✔ print-message-loop(1:goodbye world)  print-message         loops-mlr6m-print-message-323962978   5s
+   └─✔ loop-example-depth-2(1:depth-1-2)        loop-example-depth-2
+     └─┬─✔ print-message-loop(0:hello world)    print-message         loops-mlr6m-print-message-520674448   4s
+       └─✔ print-message-loop(1:goodbye world)  print-message         loops-mlr6m-print-message-2893948292  6s
+```
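+
+To make the grouping concrete, here is a sketch of the Workload that might be created for `loop-example-depth-2(0:depth-1-1)` above. The Workload name, queue name, podSet name, and resource requests are assumptions; the KEP does not specify how the Workload is populated:
+
+```yaml
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: Workload
+metadata:
+  name: loops-mlr6m-depth-2-0          # hypothetical name
+  namespace: argo
+spec:
+  queueName: user-queue                # assumed; would come from the workflow's queue-name label
+  podSets:
+  - name: print-message
+    count: 2                           # one pod per item: "hello world", "goodbye world"
+    template:
+      spec:
+        restartPolicy: Never
+        containers:
+        - name: main
+          image: busybox
+          command: [echo]
+          resources:
+            requests:
+              cpu: 100m                # assumed; the example workflow sets no requests
+```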
+
+##### Example 2 (ParallelSteps Contains Leaf Template and Step)
+
+For a step composed of a leaf template and another step, we create a workload for the
+leaf template. The workload for the other step is created separately.
+In the following example, we will create workloads for `loops-644ch` and `loop-example-depth-2-2`.
+
+```
+apiVersion: argoproj.io/v1alpha1
+kind: Workflow
+metadata:
+  generateName: loops-
+  namespace: argo
+spec:
+  entrypoint: loop-example-depth-1
+  templates:
+  - name: loop-example-depth-2
+    steps:
+    - - name: print-message-loop
+        template: print-message
+        arguments:
+          parameters:
+          - name: message
+            value: "{{item}}"
+        withItems:              # invoke print-message once for each item in parallel
+        - depth-2-1             # item 1
+        - depth-2-2             # item 2
+  - name: loop-example-depth-1
+    steps:
+    - - name: print-message
+        template: print-message
+        arguments:
+          parameters:
+          - name: message
+            value: "{{item}}"
+        withItems:
+        - depth-1-1
+        - depth-1-2
+      - name: loop-example-depth-2-2
+        template: loop-example-depth-2
+  - name: print-message
+    inputs:
+      parameters:
+      - name: message
+    container:
+      image: busybox
+      command: [echo]
+      args: ["{{inputs.parameters.message}}"]
+
+# argo get loops-644ch
+...
+STEP                                        TEMPLATE              PODNAME                               DURATION  MESSAGE
+ ✔ loops-644ch                              loop-example-depth-1
+ └─┬─✔ loop-example-depth-2-2               loop-example-depth-2
+   │ └─┬─✔ print-message-loop(0:depth-2-1)  print-message         loops-644ch-print-message-1796012204  4s
+   │   └─✔ print-message-loop(1:depth-2-2)  print-message         loops-644ch-print-message-1116167650  6s
+   ├─✔ print-message(0:depth-1-1)           print-message         loops-644ch-print-message-413467513   5s
+   └─✔ print-message(1:depth-1-2)           print-message         loops-644ch-print-message-3356863351  5s
+```
+
+##### Example 3 (Workflow with Single Container Template)
+
+We create a workload for the single container template. For example:
+```
+apiVersion: argoproj.io/v1alpha1
+kind: Workflow
+metadata:
+  generateName: hello-
+spec:
+  entrypoint: main
+  templates:
+    - name: main
+      plugin:
+        hello: { }
+
+# argo get hello-jtlcw
+...
+STEP            TEMPLATE  PODNAME  DURATION  MESSAGE
+ ◷ hello-jtlcw  main
+```
+
+#### How to suspend a workflow step by step
+
+We introduce three ways to manage the workflow. The responsibilities of the
+workflow-controller and the Kueue controller differ in each of them.
+
+1. Give users a CLI to modify workflows and add a specific suspend template for each step.
+When a workflow is suspended on this special suspend template, the job-controller in Kueue
+creates workloads for the next step. This approach requires no modification of the
+workflow-controller, so it is easy to iterate on and there is no need to coordinate Argo and
+Kueue versions. However, users can modify their workflows to skip waiting in Kueue, which may
+not be acceptable for some users.
+
+2. Add a new field to the workflow spec, such as suspendBySteps. If workflow.spec.suspendBySteps is
+true, the workflow-controller inserts a special suspend template for each stepGroup. The
+job-controller in Kueue watches the workflow and creates workloads for the next step. After the
+workloads are admitted, the suspend step is marked as finished.
+
+3. Add a new webhook to Kueue. When new pods are added to the cluster, the webhook checks whether
+the pods are managed by a workflow and whether the workflow has the `kueue.x-k8s.io/queue-name`
+label. If so, schedulingGates are added to the pods. The pods are then grouped by the
+job-controller in Kueue (the pods can be found in the status of the workflow), and a workload is
+created for each group. After the workloads are admitted, the schedulingGates are removed from
+the pods so that they can be scheduled.
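+
+The following sketch illustrates the objects these options might produce. The first document shows a workflow after a suspend step has been inserted before a stepGroup (options 1 and 2); the second shows a gated pod (option 3), reusing a pod name from Example 1. The step and template names, the `suspendBySteps` field, the queue name, and the gate name are assumptions, not settled API:
+
+```yaml
+apiVersion: argoproj.io/v1alpha1
+kind: Workflow
+metadata:
+  generateName: loops-
+  labels:
+    kueue.x-k8s.io/queue-name: user-queue   # assumed LocalQueue name
+spec:
+  # suspendBySteps: true   # field proposed in option 2; not an existing Argo field
+  entrypoint: main
+  templates:
+  - name: main
+    steps:
+    - - name: wait-for-kueue        # hypothetical inserted suspend step
+        template: kueue-suspend
+    - - name: print-message
+        template: print-message
+  - name: kueue-suspend
+    suspend: {}                     # Argo suspend template; resumed after admission
+  - name: print-message
+    container:
+      image: busybox
+      command: [echo, hello]
+---
+apiVersion: v1
+kind: Pod
+metadata:
+  name: loops-mlr6m-print-message-2545579066
+  namespace: argo
+spec:
+  schedulingGates:
+  - name: kueue.x-k8s.io/admission  # assumed gate name; removed once the workload is admitted
+  containers:
+  - name: main
+    image: busybox
+    command: [echo, hello]
+```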
+
+#### Drawback and Limitations
+
+
+
+#### Advantages
+
+
+
+### Plain Pod as a Unit
+
+
+
+#### Drawback and Limitations
+
+- Pods in the same stepGroup are queued by different workloads.
+- Gang scheduling for a stepGroup is not available.
+
+#### Advantages
+
 
 
 ## Additional Details