
Add CDI support to peer pods #2126

Open
inatatsu opened this issue Oct 23, 2024 · 21 comments

Comments

@inatatsu

How can we enable Dynamic Resource Allocation (DRA) based on Container Device Interface (CDI) for peer pods?

K8s v1.26 introduced DRA, and the Kata agent has recently been enabling CDI. In my understanding, when we want to use GPUs in a peer pod, we need to manually specify an instance profile with GPUs. The webhook simply moves nvidia.com/gpu device requests into a kata.peerpods.io.gpus annotation, but that annotation does not seem to be used to select an instance profile.

Any suggestions?

@bpradipt
Member

@inatatsu commenting on the webhook aspect. There is some plumbing work pending, and I'm open to PRs :-)

  1. Enable a gpu annotation in Kata Containers (similar to default_cpus, default_mem, machine_type, etc.) for the remote hypervisor so that it's available in CAA before the CreateVM request
  2. Enable the selection logic in CAA. I have started something here - bpradipt@d38f3d9

@inatatsu
Author

@bpradipt Thank you for your responses. Can we extend the current webhook-based approach to support DRA and generate a CDI spec in a peer pod VM?

@stevenhorsman
Member

@zvonkok - you might also be interested and extremely helpful here?

@yoheiueda
Member

According to this comment, it looks like CDI needs to be enabled in both the runtime and the kata-agent.

kata-containers/kata-containers#9543 (comment)

CDI support on the runtime side has only been enabled in runtime-rs, not in the Go version of the kata-shim runtime.

kata-containers/kata-containers#10145

I don't think runtime-rs supports the remote hypervisor for peer pods. Do we need to enable CDI in the Go version of the kata-shim runtime?

@yoheiueda
Member

In the Go version of the kata-shim runtime, the remote hypervisor just ignores devices for now. I think we also need to fix this when CDI support is enabled in the kata-shim runtime.

https://github.com/kata-containers/kata-containers/blob/ca416d883729c7888287a89de836d67bc0975528/src/runtime/virtcontainers/remote.go#L203-L215

func (rh *remoteHypervisor) AddDevice(ctx context.Context, devInfo interface{}, devType DeviceType) error {
	// TODO should we return notImplemented("AddDevice"), rather than nil and ignoring it?
	logrus.Printf("addDevice: deviceType=%v devInfo=%#v", devType, devInfo)
	return nil
}


func (rh *remoteHypervisor) HotplugAddDevice(ctx context.Context, devInfo interface{}, devType DeviceType) (interface{}, error) {
	return nil, notImplemented("HotplugAddDevice")
}


func (rh *remoteHypervisor) HotplugRemoveDevice(ctx context.Context, devInfo interface{}, devType DeviceType) (interface{}, error) {
	return nil, notImplemented("HotplugRemoveDevice")
}

@yoheiueda
Member

I think another possible workaround to support CDI in peer pods is to manipulate Devices in CreateContainerRequest by cloud-api-adaptor.

if len(req.Devices) > 0 {
	logger.Print(" devices:")
	for _, d := range req.Devices {
		logger.Printf(" container_path:%s vm_path:%s type:%s", d.ContainerPath, d.VmPath, d.Type)
	}
}

@bpradipt
Member

> @bpradipt Thank you for your responses. Can we extend the current webhook-based approach to support DRA and generate a CDI spec in a peer pod VM?

I think @yoheiueda's proposal to do it in the CreateContainerRequest may be easier. We can keep the webhook just to remove, from the spec, resources that don't apply to peer pods.
Also, I'm unclear how DRA will impact peer-pods resource management. Can we do away with the webhook completely and rely on DRA for peer-pods resource management?

@zvonkok
Member

zvonkok commented Oct 24, 2024

There are several parts to the story. I am ramping up on peer pods, so excuse my ignorance on some parts.

Enable CDI in the kata-agent, which is completely independent of peer pods or a local VMM. This is enabled here: kata-containers/kata-containers#9584. @bpradipt This will eliminate the prestart hook.

I do not understand the complete webhook thing in peer pods, but let's try to keep it simple and stupid.

We've built DRA to request special features of a GPU, like "give me a GPU with 40G", a MIG slice, a vGPU, or a specific architecture. I am still unsure how we're going to map this exactly with peer pods, since we do not know what the CSP pool is capable of.

We need some advertisement system (NFD) for CSP-like infrastructure?

Peer pods add a new layer of complexity. I need to think about how to enable DRA and CDI.

@zvonkok
Member

zvonkok commented Oct 24, 2024

@bpradipt We need to think about how to enable DRA properly. The logic you have is a good start, but it ignores MIG and vGPU.

@Apokleos

> According to this comment, it looks like CDI needs to be enabled in both the runtime and the kata-agent.
>
> kata-containers/kata-containers#9543 (comment)
>
> CDI support on the runtime side has only been enabled in runtime-rs, not in the Go version of the kata-shim runtime.
>
> kata-containers/kata-containers#10145
>
> I don't think runtime-rs supports the remote hypervisor for peer pods. Do we need to enable CDI in the Go version of the kata-shim runtime?

Yes, both the runtime and the kata-agent need to integrate with CDI. AFAIK, the kata runtime and runtime-rs now both support CDI for GPU scenarios.
Another thing: the remote hypervisor in runtime-rs is also under review; it is a Summer of Code project.

@zvonkok
Member

zvonkok commented Oct 24, 2024

Hmm, since the mapping is one Pod per CSP VM, we need to make sure that DRA in the peer-pods case only allows creation of GPUs that map to CSP instance types, or leaves the Pod pending until the CSP implements the proper instance type :)

@zvonkok
Member

zvonkok commented Oct 24, 2024

All the management and configuration of devices is now pushed into DRA, whereas with device plugins you consume what the infrastructure offers. We have a conflict here with peer pods.
In the bare-metal use case we can request a full-passthrough GPU (vGPU), where DRA would bind a proper GPU to VFIO or MDEV and create the CDI spec with the vfio device; the CRI sends this to Kata, which then passes the GPU through, and in the VM we use CDI to create the proper device nodes in the OCI spec to be mounted into the container.

In the case of peer pods, DRA would just act as a proxy to pass the wanted type through to peer pods, which would then choose the proper instance type and do the CSP magic.

@yoheiueda
Member

@zvonkok Thank you very much for the explanation of how CDI works with DRA.

> And another thing, kata-containers/kata-containers#10225, is also under review; it is a Summer of Code project.

@Apokleos That sounds great! I have a basic question regarding runtime-rs: at some point in the future, will the Go version of the kata-shim runtime be deprecated and replaced with runtime-rs?

@Apokleos

> @zvonkok Thank you very much for the explanation of how CDI works with DRA.
>
>> And another thing, kata-containers/kata-containers#10225, is also under review; it is a Summer of Code project.
>
> @Apokleos That sounds great! I have a basic question regarding runtime-rs: at some point in the future, will the Go version of the kata-shim runtime be deprecated and replaced with runtime-rs?

Hah, yeah, good point. I think I should invite AC members @stevenhorsman @fupanli @zvonkok, etc., to help answer this question.

@stevenhorsman
Member

stevenhorsman commented Oct 24, 2024

> @Apokleos That sounds great! I have a basic question regarding runtime-rs: at some point in the future, will the Go version of the kata-shim runtime be deprecated and replaced with runtime-rs?

The short answer here is yes. The more nuanced version is yes, but we are not sure of the timeframe. The current plan is for Kata Containers 4.0 to ship with runtime-rs as the default shim. The go runtime won't be removed at that point, but it might receive security fixes only, or best-effort feature support, with all new features targeted primarily at the rust runtime first. In Kata Containers 5.0, I guess there is a reasonable chance that the go runtime will be removed entirely, but that is unlikely to be decided for a long time.

4.0 is planned for some time in 2025, but there is still quite a bit of work required to close the gap, as listed in kata-containers/kata-containers#8702, including the remote hypervisor support that @Apokleos mentioned.

@inatatsu
Author

inatatsu commented Oct 28, 2024

@bpradipt @stevenhorsman @yoheiueda @zvonkok @Apokleos Thank you very much for your helpful comments. Let me summarize the discussions and suggestions (and my understanding😃). Feel free to correct or add anything:

  • A user can run a Pod that references a ResourceClaim as a peer pod. The user can pass a structured parameter to define the allocated resource.
  • The requested resource will actually be allocated when a peer pod VM is created.
  • The resource request is reflected in the pod VM instance profile and in a CDI spec used by the kata-agent inside the pod VM.
  • The worker node must advertise the available ResourceSlice in advance (somewhat similar to what the peerpod webhook currently does using the kata.peerpods.io/vm extended resource).
  • The container runtime on the worker node must also enable CDI.
  • We need some custom component (a kubelet plugin?) to pass the resource allocation request through to the cloud provider and pod VM, while creating a dummy CDI spec for the container runtime on the worker node.

@bpradipt
Member

@inatatsu thanks for summarising it.
A few inline questions for my understanding:

> @bpradipt @stevenhorsman @yoheiueda @zvonkok @Apokleos Thank you very much for your helpful comments. Let me summarize the discussions and suggestions (and my understanding😃). Feel free to correct or add anything:
>
>   • A user can run a Pod that references a ResourceClaim as a peer pod. The user can pass a structured parameter to define the allocated resource.
>   • The requested resource will actually be allocated when a peer pod VM is created.
>   • The resource request is reflected in the pod VM instance profile and in a CDI spec used by the kata-agent inside the pod VM.
>   • The worker node must advertise the available ResourceSlice in advance (somewhat similar to what the peerpod webhook currently does using the kata.peerpods.io/vm extended resource).

Is this about advertising external VMs as resources instead of the current per-node extended resources?

>   • The container runtime on the worker node must also enable CDI.
>   • We need some custom component (a kubelet plugin?) to pass the resource allocation request through to the cloud provider and pod VM, while creating a dummy CDI spec for the container runtime on the worker node.

How is CDI useful for the peer-pods case? The availability of the GPU resource is taken care of by the cloud infra provider, and all GPUs available in the VM get allocated to the pod, since there is a 1-1 mapping between VM and pod.

@inatatsu
Author

inatatsu commented Oct 30, 2024

@bpradipt Thank you for your questions.

> Is this about advertising external VMs as resources instead of the current per-node extended resources?

While I did not imagine such a use case😅, it is interesting and may simplify VM management. A ResourceSlice can also be per node.

> How is CDI useful for the peer-pods case? The availability of the GPU resource is taken care of by the cloud infra provider, and all GPUs available in the VM get allocated to the pod, since there is a 1-1 mapping between VM and pod.

In my understanding, CDI allows flexible device mapping and is runtime-agnostic. But as you point out, peer pods primarily rely on selecting an appropriate instance profile (or flavor) to allocate resources, and CDI just provides a mapping between the resources and the containers.

@bpradipt
Member

> @bpradipt Thank you for your questions.
>
>> Is this about advertising external VMs as resources instead of the current per-node extended resources?
>
> While I did not imagine such a use case😅, it is interesting and may simplify VM management. A ResourceSlice can also be per node.
>
>> How is CDI useful for the peer-pods case? The availability of the GPU resource is taken care of by the cloud infra provider, and all GPUs available in the VM get allocated to the pod, since there is a 1-1 mapping between VM and pod.
>
> In my understanding, CDI allows flexible device mapping and is runtime-agnostic. But as you point out, peer pods primarily rely on selecting an appropriate instance profile (or flavor) to allocate resources, and CDI just provides a mapping between the resources and the containers.

So, CDI will be helpful on the kata-agent side to assign the GPU (or other devices) to the container, additionally using the same building blocks (CDI). Is my understanding correct?

@inatatsu
Author

inatatsu commented Oct 30, 2024

> So, CDI will be helpful on the kata-agent side to assign the GPU (or other devices) to the container, additionally using the same building blocks (CDI). Is my understanding correct?

@bpradipt Yes. That's my current understanding.

@inatatsu
Author

inatatsu commented Nov 5, 2024

> * The container runtime on the worker node must also enable CDI.

The go runtime merged PRs to enable CDI:

@zvonkok Does this mean the go runtime (except for the remote hypervisor) already supports CDI?
