[RayCluster] Adds aliyun.com/gpu-mem to known custom accelerators #2631

Open · wants to merge 1 commit into master

Conversation

@win5923 (Contributor) commented Dec 10, 2024

Why are these changes needed?

When using Aliyun's Kubernetes GPU sharing, the GPU resource key is aliyun.com/gpu-mem:

    workerGroupSpecs:
            resources:
              limits:
                aliyun.com/gpu-mem: "1"
                cpu: "1"
                memory: 2Gi
              requests:
                aliyun.com/gpu-mem: "1"
                cpu: "1"
                memory: 2Gi

The autoscaler will not work when a GPU resource is requested:

(autoscaler +3m13s) Error: No available node types can fulfill resource request {'GPU': 1.0, 'CPU': 1.0}. Add suitable node types to this cluster to resolve this issue.

This PR adds Aliyun's GPU share resource to the list of known custom accelerators in pod.go.
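
For context, a minimal sketch of the idea behind the change (this is not KubeRay's actual implementation; knownCustomAccelerators and rayResourceHints are hypothetical names): a recognized container resource name is translated into the Ray resource name the autoscaler reasons about, so a pending request like {'GPU': 1.0} can be matched to a node type. Per the review discussion below, aliyun.com/gpu-mem ends up mapped to GPU.

    // Hypothetical sketch, not KubeRay's actual code.
    package main

    import (
    	"fmt"

    	corev1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/api/resource"
    )

    // knownCustomAccelerators (hypothetical name) maps container resource names
    // to the Ray resource name they should be advertised as.
    var knownCustomAccelerators = map[corev1.ResourceName]string{
    	"aliyun.com/gpu-mem": "GPU",
    	"google.com/tpu":     "TPU",
    }

    // rayResourceHints (hypothetical helper) turns container limits into the
    // resource map the Ray autoscaler uses when deciding whether a node type
    // can satisfy a pending request.
    func rayResourceHints(limits corev1.ResourceList) map[string]int64 {
    	hints := map[string]int64{}
    	for name, qty := range limits {
    		if rayName, ok := knownCustomAccelerators[name]; ok {
    			hints[rayName] += qty.Value()
    		}
    	}
    	return hints
    }

    func main() {
    	limits := corev1.ResourceList{
    		"aliyun.com/gpu-mem": resource.MustParse("1"),
    		corev1.ResourceCPU:   resource.MustParse("1"),
    	}
    	// Without an entry for aliyun.com/gpu-mem this map stays empty, which is
    	// why the autoscaler reports that no node type can fulfill {'GPU': 1.0}.
    	fmt.Println(rayResourceHints(limits)) // map[GPU:1]
    }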

Related issue number

Closes #2484

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@win5923 win5923 force-pushed the aliyun/gpu branch 2 times, most recently from bde319f to 9fe74f0, December 10, 2024 15:37
@@ -38,11 +38,14 @@ const (
	NeuronCoreRayResourceName     = "neuron_cores"
	TPUContainerResourceName      = "google.com/tpu"
	TPURayResourceName            = "TPU"
	GPUShareContainerResourceName = "aliyun.com/gpu-mem"
	GPUShareResourceName          = "gpu_share"
Collaborator

Is this the actual resource name recognized by Ray? aliyun.com/gpu-mem -> gpu_share seems like an odd mapping

@win5923 (Contributor, Author) commented Dec 11, 2024

Apologies, I realized that the Ray resource name only supports GPU, neuron_cores, TPU, NPU, and HPU. I will map Aliyun's GPU share to GPU.

Ref: https://docs.ray.io/en/latest/ray-core/scheduling/accelerators.html

Collaborator

yes, I think we should only add custom accelerators here that are in the supported list you referenced

@win5923 win5923 changed the title from "[RayCluster] add GPUShare to known custom accelerators" to "[RayCluster] Adds aliyun's gpu share to known custom accelerators" Dec 10, 2024
@win5923 win5923 marked this pull request as draft December 11, 2024 13:37
@win5923 win5923 marked this pull request as ready for review December 12, 2024 13:18
@win5923 win5923 requested a review from andrewsykim December 12, 2024 13:19
@win5923 win5923 changed the title from "[RayCluster] Adds aliyun's gpu share to known custom accelerators" to "[RayCluster] Adds aliyun.com/gpu-mem to known custom accelerators" Dec 12, 2024
@andrewsykim (Collaborator) left a comment

@win5923 (Contributor, Author) commented Dec 17, 2024

I mapped aliyun.com/gpu-mem to num-gpus, because GPU share is still backed by a GPU.
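
For illustration, a sketch of what that mapping could look like (numGPUsFromLimits is a hypothetical helper, not KubeRay's API): the shared-GPU quantity is treated like a plain GPU count and passed through to the ray start parameters as num-gpus.

    package main

    import (
    	"fmt"

    	corev1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/api/resource"
    )

    // numGPUsFromLimits (hypothetical helper) reports how many GPUs a container
    // should advertise to Ray, treating Aliyun's shared-GPU resource the same
    // way as a regular nvidia.com/gpu request.
    func numGPUsFromLimits(limits corev1.ResourceList) int64 {
    	for _, name := range []corev1.ResourceName{"nvidia.com/gpu", "aliyun.com/gpu-mem"} {
    		if qty, ok := limits[name]; ok {
    			return qty.Value()
    		}
    	}
    	return 0
    }

    func main() {
    	limits := corev1.ResourceList{"aliyun.com/gpu-mem": resource.MustParse("1")}
    	// Would surface as --num-gpus=1 in the generated ray start parameters,
    	// so the autoscaler can satisfy {'GPU': 1.0}.
    	fmt.Printf("--num-gpus=%d\n", numGPUsFromLimits(limits))
    }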

@andrewsykim (Collaborator) commented Dec 17, 2024

Can you first open a PR to update the officially supported accelerators in https://docs.ray.io/en/latest/ray-core/scheduling/accelerators.html to include alyun.com/gpu?

@win5923 (Contributor, Author) commented Dec 20, 2024

I checked Aliyun's documentation, and I think they also use NVIDIA GPUs with the field nvidia.com/gpu, while Aliyun GPU Share uses a separate field, aliyun.com/gpu-mem.

Ref: https://www.alibabacloud.com/help/en/eci/user-guide/create-a-pod-by-specifying-the-gpu-specification-1

> Can you first open a PR to update the officially supported accelerators in https://docs.ray.io/en/latest/ray-core/scheduling/accelerators.html to include alyun.com/gpu?

So I think we should just focus on the string?

Development

Successfully merging this pull request may close these issues: Statistics of other types of gpu
2 participants