Add CDI support to peer pods #2126

inatatsu · 2024-10-23T07:47:28Z

How can we enable Dynamic Resource Allocation (DRA) based on Container Device Interface (CDI) for peer pods?

Kubernetes v1.26 introduced DRA, and the Kata agent has recently been gaining CDI support. In my understanding, when we want to use GPUs in a peer pod, we need to manually specify an instance profile with GPUs. The webhook simply converts nvidia.com/gpu device requests into an annotation, kata.peerpods.io.gpus, but that annotation does not seem to be used to select an instance profile.

Any suggestions?

bpradipt · 2024-10-23T08:00:38Z

@inatatsu commenting on the webhook aspect. There is some plumbing work pending, and I'm open to PRs :-)

Enable a GPU annotation in Kata Containers (similar to default_cpus, default_mem, machine_type, etc.) for the remote hypervisor so that it's available in CAA before the CreateVM request.
Enable selection logic in CAA. I have started something here: bpradipt@d38f3d9
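
The selection step could look roughly like the sketch below. This is not the code in bpradipt@d38f3d9; the `InstanceType` struct, the `selectInstanceType` helper, and the instance type names are all hypothetical, and the annotation key is passed in as a parameter using the spelling from this thread. It only illustrates the idea: read the GPU count from the annotation and pick the smallest instance type that satisfies it.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// InstanceType is a hypothetical description of a cloud instance profile.
type InstanceType struct {
	Name string
	GPUs int
}

// selectInstanceType reads the requested GPU count from the given annotation
// key and returns the instance type with the fewest GPUs that still satisfies
// the request. A missing annotation is treated as a request for zero GPUs.
func selectInstanceType(annotations map[string]string, key string, types []InstanceType) (InstanceType, error) {
	want := 0
	if raw, ok := annotations[key]; ok {
		n, err := strconv.Atoi(raw)
		if err != nil || n < 0 {
			return InstanceType{}, fmt.Errorf("invalid GPU count %q in annotation %q", raw, key)
		}
		want = n
	}

	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]InstanceType(nil), types...)
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].GPUs < sorted[j].GPUs })

	for _, t := range sorted {
		if t.GPUs >= want {
			return t, nil
		}
	}
	return InstanceType{}, fmt.Errorf("no instance type offers %d GPUs", want)
}

func main() {
	// Hypothetical instance types; real names depend on the cloud provider.
	types := []InstanceType{
		{Name: "gpu-large", GPUs: 4},
		{Name: "cpu-only", GPUs: 0},
		{Name: "gpu-small", GPUs: 1},
	}
	// Annotation key as written in this thread (assumed, not verified).
	const gpuKey = "kata.peerpods.io.gpus"
	annotations := map[string]string{gpuKey: "2"}

	chosen, err := selectInstanceType(annotations, gpuKey, types)
	if err != nil {
		fmt.Println("selection failed:", err)
		return
	}
	fmt.Println("selected instance type:", chosen.Name)
}
```

A plain GPU count like this cannot express MIG slices, vGPU, or a specific GPU architecture, so a real implementation would likely need a richer request format.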

inatatsu · 2024-10-23T08:11:45Z

@bpradipt Thank you for your responses. Can we extend the current webhook-based approach to support DRA and generate a CDI spec in a peer pod VM?

stevenhorsman · 2024-10-23T08:22:52Z

@zvonkok - you might also be interested and extremely helpful here?

yoheiueda · 2024-10-23T08:31:14Z

According to this comment, it looks like CDI needs to be enabled in both runtime and kata-agent.

kata-containers/kata-containers#9543 (comment)

CDI support on the runtime side has only been enabled in runtime-rs, not in the Go version of the kata-shim runtime.

kata-containers/kata-containers#10145

I don't think runtime-rs supports the remote hypervisor for peer pods. Do we need to enable CDI in the Go version of kata-shim runtime?

yoheiueda · 2024-10-23T08:32:49Z

In the Go version of the kata-shim runtime, the remote hypervisor just ignores devices for now. I think we also need to fix this when CDI support is enabled in the kata-shim runtime.

https://github.com/kata-containers/kata-containers/blob/ca416d883729c7888287a89de836d67bc0975528/src/runtime/virtcontainers/remote.go#L203-L215

func (rh *remoteHypervisor) AddDevice(ctx context.Context, devInfo interface{}, devType DeviceType) error {
	// TODO should we return notImplemented("AddDevice"), rather than nil and ignoring it?
	logrus.Printf("addDevice: deviceType=%v devInfo=%#v", devType, devInfo)
	return nil
}


func (rh *remoteHypervisor) HotplugAddDevice(ctx context.Context, devInfo interface{}, devType DeviceType) (interface{}, error) {
	return nil, notImplemented("HotplugAddDevice")
}


func (rh *remoteHypervisor) HotplugRemoveDevice(ctx context.Context, devInfo interface{}, devType DeviceType) (interface{}, error) {
	return nil, notImplemented("HotplugRemoveDevice")
}

yoheiueda · 2024-10-23T08:41:29Z

I think another possible workaround to support CDI in peer pods is for cloud-api-adaptor to manipulate Devices in CreateContainerRequest.

cloud-api-adaptor/src/cloud-api-adaptor/pkg/adaptor/proxy/service.go

Lines 77 to 82 in aab207c

    
if len(req.Devices) > 0 {
	logger.Print("    devices:")
	for _, d := range req.Devices {
		logger.Printf("        container_path:%s vm_path:%s type:%s", d.ContainerPath, d.VmPath, d.Type)
	}
}
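
A minimal sketch of what such a manipulation might look like, using local stand-in types rather than the real agent protocol types. `Device` and `CreateContainerRequest` below mirror only the fields visible in the logging snippet above; the `injectDevices` helper, the device paths, and the device type string are hypothetical placeholders, not values taken from cloud-api-adaptor or the kata-agent.

```go
package main

import "fmt"

// Device is a local stand-in that mirrors only the fields logged above.
type Device struct {
	ContainerPath string
	VmPath        string
	Type          string
}

// CreateContainerRequest is a local stand-in carrying only the Devices field.
type CreateContainerRequest struct {
	Devices []*Device
}

// injectDevices appends one Device entry per pod VM device path, skipping
// paths that are already present in the request. It returns the number of
// entries added. The same path is used inside the container and in the VM.
func injectDevices(req *CreateContainerRequest, devType string, vmPaths []string) int {
	seen := make(map[string]bool, len(req.Devices))
	for _, d := range req.Devices {
		if d != nil {
			seen[d.VmPath] = true
		}
	}

	added := 0
	for _, p := range vmPaths {
		if seen[p] {
			continue
		}
		req.Devices = append(req.Devices, &Device{ContainerPath: p, VmPath: p, Type: devType})
		seen[p] = true
		added++
	}
	return added
}

func main() {
	req := &CreateContainerRequest{}

	// Placeholder device paths and type; the values the agent actually
	// expects for a GPU in a pod VM are an open question in this thread.
	n := injectDevices(req, "example-gpu", []string{"/dev/nvidia0", "/dev/nvidiactl"})

	fmt.Println("devices added:", n)
	for _, d := range req.Devices {
		fmt.Printf("container_path:%s vm_path:%s type:%s\n", d.ContainerPath, d.VmPath, d.Type)
	}
}
```

Whether the agent would accept entries injected this way, and which device type it expects, would still need to be checked against the kata-agent implementation.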

bpradipt · 2024-10-23T09:28:58Z

> @bpradipt Thank you for your responses. Can we extend the current webhook-based approach to support DRA and generate a CDI spec in a peer pod VM?

I think @yoheiueda's proposal to do it in the CreateContainerRequest may be easier. We can just keep the webhook to handle removing resources from the spec that don't apply to peer-pods.
Also, I'm unclear on how DRA will impact peer-pods resource management. Can we do away with the webhook completely and rely on DRA for peer-pods resource management?

zvonkok · 2024-10-24T00:37:52Z

There are several parts to the story. I am still ramping up on peer-pods, so excuse my ignorance on some parts.

Enable CDI in the kata-agent, which is completely independent of whether peer-pods or a local VMM is used. This is enabled here: kata-containers/kata-containers#9584. @bpradipt This will eliminate the prestart-hook.

I do not understand the complete webhook thing in peer-pods, but let's try to keep it simple and stupid.

We've built DRA to request special features of a GPU, such as a GPU with 40G, a MIG slice, a vGPU, or a specific architecture. I am still unsure how exactly we're going to map this to peer-pods, since we do not know what the CSP pool is capable of.

Do we need some advertisement system (like NFD) for CSP-like infrastructure?

The peer pods add a new layer of complexity. I need to think of how to enable DRA and CDI.

zvonkok · 2024-10-24T00:48:39Z

@bpradipt We need to think about how to enable DRA properly. The logic you have is a good start but ignores MIG and vGPU.

Apokleos · 2024-10-24T01:26:18Z

> According to this comment, it looks like CDI needs to be enabled in both runtime and kata-agent.
>
> kata-containers/kata-containers#9543 (comment)
>
> CDI support on the runtime side has only been enabled in runtime-rs, not in the Go version of the kata-shim runtime.
>
> kata-containers/kata-containers#10145
>
> I don't think runtime-rs supports the remote hypervisor for peer pods. Do we need to enable CDI in the Go version of kata-shim runtime?

Yes, both the runtime and the kata-agent need to integrate with CDI. AFAIK, the kata runtime and runtime-rs both currently support CDI for GPU scenarios.
Also, the remote hypervisor in runtime-rs is under review; it is a Summer of Code project.

zvonkok · 2024-10-24T02:00:36Z

Hmm, since the mapping is one Pod per CSP VM, we need to make sure that DRA, in the case of peer-pods, only allows creation of GPUs that map to CSP instance types, or keeps the Pod pending until the CSP implements the proper instance type :)

zvonkok · 2024-10-24T02:06:01Z

All the management and configuration of devices is now pushed into DRA, whereas with device plugins you consume what the infrastructure offers. We have a conflict here with peer-pods.
In the bare-metal use case we can request a full-passthrough GPU (or a vGPU). DRA would bind a proper GPU to VFIO or MDEV and create the CDI spec with the vfio device. The CRI sends this to Kata, which then passes the GPU through. In the VM we use CDI to create the proper device nodes in the OCI spec to be mounted into the container.

In the case of peer-pods, DRA would just act as a proxy that passes the wanted type through to peer-pods, which would then choose the proper instance type and do the CSP magic.
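
For readers unfamiliar with CDI, the sketch below builds a minimal CDI spec that lists device nodes to be injected into a container. The JSON field names (`cdiVersion`, `kind`, `devices`, `containerEdits`, `deviceNodes`) follow the CDI specification, but the Go structs are local stand-ins rather than the official CDI library types, and the kind `nvidia.com/gpu` and the path `/dev/nvidia0` are only illustrative assumptions about what a pod VM might expose.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// Local stand-in types covering a small subset of the CDI spec format.
type DeviceNode struct {
	Path string `json:"path"`
}

type ContainerEdits struct {
	DeviceNodes []DeviceNode `json:"deviceNodes,omitempty"`
}

type CDIDevice struct {
	Name           string         `json:"name"`
	ContainerEdits ContainerEdits `json:"containerEdits"`
}

type CDISpec struct {
	Version string      `json:"cdiVersion"`
	Kind    string      `json:"kind"`
	Devices []CDIDevice `json:"devices"`
}

// buildSpec creates one CDI device per device node path, named "0", "1", ...
func buildSpec(kind string, nodePaths []string) CDISpec {
	spec := CDISpec{Version: "0.6.0", Kind: kind, Devices: make([]CDIDevice, 0, len(nodePaths))}
	for i, p := range nodePaths {
		spec.Devices = append(spec.Devices, CDIDevice{
			Name:           strconv.Itoa(i),
			ContainerEdits: ContainerEdits{DeviceNodes: []DeviceNode{{Path: p}}},
		})
	}
	return spec
}

// qualifiedName returns the fully qualified CDI device name, "<kind>=<name>".
func qualifiedName(kind, name string) string {
	return kind + "=" + name
}

// encodeSpec serializes the spec as compact JSON.
func encodeSpec(spec CDISpec) (string, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Illustrative kind and device node path (assumptions, not from the thread).
	spec := buildSpec("nvidia.com/gpu", []string{"/dev/nvidia0"})
	out, err := encodeSpec(spec)
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(out)
	fmt.Println("request as:", qualifiedName(spec.Kind, spec.Devices[0].Name))
}
```

In the peer-pods case, a spec like this would only be meaningful inside the pod VM, after the instance type has already determined which devices exist.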

yoheiueda · 2024-10-24T08:02:08Z

@zvonkok Thank you very much for the explanation of how CDI works with DRA.

> Also, kata-containers/kata-containers#10225 is under review; it is a Summer of Code project.

@Apokleos That sounds great! I have a basic question regarding runtime-rs. At some point in the future, will the Go version of the kata-shim runtime be deprecated and replaced with runtime-rs?

Apokleos · 2024-10-24T12:29:21Z

> @zvonkok Thank you very much for the explanation of how CDI works with DRA.
>
> > Also, kata-containers/kata-containers#10225 is under review; it is a Summer of Code project.
>
> @Apokleos That sounds great! I have a basic question regarding runtime-rs. At some point in the future, will the Go version of the kata-shim runtime be deprecated and replaced with runtime-rs?

Hah, yeah, good point. I think I should invite AC members (@stevenhorsman, @fupanli, @zvonkok, etc.) to help answer this question.

stevenhorsman · 2024-10-24T12:36:51Z

> @Apokleos That sounds great! I have a basic question regarding runtime-rs. At some point in the future, will the Go version of the kata-shim runtime be deprecated and replaced with runtime-rs?

The short answer here is yes. The more nuanced version is yes, but we are not sure about the timeframe. The current plan is for Kata Containers 4.0 to ship with runtime-rs as the default shim. The Go runtime won't be removed at that point; however, it might receive security fixes only, or best-effort feature support, with all new features targeted primarily at the Rust runtime first. In Kata Containers 5.0 I guess there is a reasonable chance that the Go runtime will be removed entirely, but that is unlikely to be decided for a long time.

4.0 is planned for some time in 2025, but there is still quite a bit of work required to close the gap, as listed in kata-containers/kata-containers#8702, including the remote hypervisor support that @Apokleos mentioned.

inatatsu · 2024-10-28T06:31:08Z

@bpradipt @stevenhorsman @yoheiueda @zvonkok @Apokleos Thank you very much for your helpful comments. Let me summarize the discussions and suggestions (and my understanding😃). Feel free to correct or add anything:

* A user can run a Pod that refers to a ResourceClaim as a peer pod. The user can pass structured parameters to define the allocated resource.
* The requested resource will actually be allocated when a peer pod VM is created.
* The resource request is reflected in a pod VM instance profile and in a CDI spec used by the kata agent inside the pod VM.
* The worker node must advertise the available ResourceSlice in advance (this is somewhat similar to what is currently done by the peerpod webhook using the kata.peerpods.io/vm extended resources).
* The container runtime on the worker node must also enable CDI.
* We need some custom component (a kubelet plugin?) to pass the resource allocation request through to the cloud provider and the pod VM, while creating a dummy CDI spec for the container runtime on the worker node.

bpradipt · 2024-10-29T08:03:07Z

@inatatsu thanks for summarising it.
A few inline questions for my understanding:

> @bpradipt @stevenhorsman @yoheiueda @zvonkok @Apokleos Thank you very much for your helpful comments. Let me summarize the discussions and suggestions (and my understanding😃). Feel free to correct or add anything:
>
> * A user can run a Pod that refers to a ResourceClaim as a peer pod. The user can pass structured parameters to define the allocated resource.
> * The requested resource will actually be allocated when a peer pod VM is created.
> * The resource request is reflected in a pod VM instance profile and in a CDI spec used by the kata agent inside the pod VM.
> * The worker node must advertise the available ResourceSlice in advance (this is somewhat similar to what is currently done by the peerpod webhook using the kata.peerpods.io/vm extended resources).

Is this about advertising external VMs as resources instead of the current per-node extended resources?

> * The container runtime on the worker node must also enable CDI.
> * We need some custom component (a kubelet plugin?) to pass the resource allocation request through to the cloud provider and the pod VM, while creating a dummy CDI spec for the container runtime on the worker node.

How is CDI useful in the peer-pods case? The availability of the GPU resource is taken care of by the cloud infrastructure provider, and all GPUs available in the VM get allocated to the pod, as there is a 1:1 mapping between VM and pod.

inatatsu · 2024-10-30T02:40:23Z

@bpradipt Thank you for your questions.

> Is this about advertising external VMs as resources instead of the current per-node extended resources?

While I did not imagine such a use case😅, it is interesting and may simplify VM management. ResourceSlice can also be per node.

> How is CDI useful in the peer-pods case? The availability of the GPU resource is taken care of by the cloud infrastructure provider, and all GPUs available in the VM get allocated to the pod, as there is a 1:1 mapping between VM and pod.

In my understanding, CDI allows flexible device mapping and is runtime-agnostic. But as you point out, peer pods primarily rely on selecting an appropriate instance profile (or flavor) to allocate resources, and CDI just provides a mapping between the resources and containers.

bpradipt · 2024-10-30T04:24:11Z

> @bpradipt Thank you for your questions.
>
> > Is this about advertising external VMs as resources instead of the current per-node extended resources?
>
> While I did not imagine such a use case😅, it is interesting and may simplify VM management. ResourceSlice can also be per node.
>
> > How is CDI useful in the peer-pods case? The availability of the GPU resource is taken care of by the cloud infrastructure provider, and all GPUs available in the VM get allocated to the pod, as there is a 1:1 mapping between VM and pod.
>
> In my understanding, CDI allows flexible device mapping and is runtime-agnostic. But as you point out, peer pods primarily rely on selecting an appropriate instance profile (or flavor) to allocate resources, and CDI just provides a mapping between the resources and containers.

So CDI will be helpful on the kata-agent side to assign the GPU (or other devices) to the container, with the added benefit of reusing the same building blocks (CDI). Is my understanding correct?

inatatsu · 2024-10-30T04:26:27Z

> So CDI will be helpful on the kata-agent side to assign the GPU (or other devices) to the container, with the added benefit of reusing the same building blocks (CDI). Is my understanding correct?

@bpradipt Yes. That's my current understanding.

inatatsu · 2024-11-05T05:59:54Z

> * The container runtime on the worker node must also enable CDI.

The Go runtime has merged PRs to enable CDI:

gpu: Adding CDI support for cold and hot-plug of VFIO devices kata-containers/kata-containers#7325
gpu: reintroduce pcie_root_port and add pcie_switch_port kata-containers/kata-containers#8861
runtime: Fix runtime/cdi panic with assignment to entry in nil map kata-containers/kata-containers#10276

@zvonkok Does this mean the Go runtime (except for the remote hypervisor) already supports CDI?
