[RayCluster] Adds aliyun.com/gpu-mem to known custom accelerators #2631

Open · wants to merge 1 commit into master

Conversation

@win5923 (Contributor) commented Dec 10, 2024

Why are these changes needed?

When using Aliyun's Kubernetes GPU sharing, the GPU resource key is aliyun.com/gpu-mem:

    workerGroupSpecs:
            resources:
              limits:
                aliyun.com/gpu-mem: "1"
                cpu: "1"
                memory: 2Gi
              requests:
                aliyun.com/gpu-mem: "1"
                cpu: "1"
                memory: 2Gi

The autoscaler will not work when a GPU resource is requested:

(autoscaler +3m13s) Error: No available node types can fulfill resource request {'GPU': 1.0, 'CPU': 1.0}. Add suitable node types to this cluster to resolve this issue.

This PR adds Aliyun's GPU share resource to the list of known custom accelerators in pod.go.
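
For context, a minimal sketch of the idea behind the change (this is not KubeRay's actual implementation; knownCustomAccelerators and rayResourceHints are hypothetical names): a recognized container resource name is translated into the Ray resource name the autoscaler reasons about, so a pending request like {'GPU': 1.0} can be matched to a node type. Per the review discussion below, aliyun.com/gpu-mem ends up mapped to GPU.

    // Hypothetical sketch, not KubeRay's actual code.
    package main

    import (
    	"fmt"

    	corev1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/api/resource"
    )

    // knownCustomAccelerators (hypothetical name) maps container resource names
    // to the Ray resource name they should be advertised as.
    var knownCustomAccelerators = map[corev1.ResourceName]string{
    	"aliyun.com/gpu-mem": "GPU",
    	"google.com/tpu":     "TPU",
    }

    // rayResourceHints (hypothetical helper) turns container limits into the
    // resource map the Ray autoscaler uses when deciding whether a node type
    // can satisfy a pending request.
    func rayResourceHints(limits corev1.ResourceList) map[string]int64 {
    	hints := map[string]int64{}
    	for name, qty := range limits {
    		if rayName, ok := knownCustomAccelerators[name]; ok {
    			hints[rayName] += qty.Value()
    		}
    	}
    	return hints
    }

    func main() {
    	limits := corev1.ResourceList{
    		"aliyun.com/gpu-mem": resource.MustParse("1"),
    		corev1.ResourceCPU:   resource.MustParse("1"),
    	}
    	// Without an entry for aliyun.com/gpu-mem this map stays empty, which is
    	// why the autoscaler reports that no node type can fulfill {'GPU': 1.0}.
    	fmt.Println(rayResourceHints(limits)) // map[GPU:1]
    }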

Related issue number

Closes #2484

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@win5923 win5923 force-pushed the aliyun/gpu branch 2 times, most recently from bde319f to 9fe74f0, December 10, 2024 15:37
@@ -38,11 +38,14 @@ const (
	NeuronCoreRayResourceName     = "neuron_cores"
	TPUContainerResourceName      = "google.com/tpu"
	TPURayResourceName            = "TPU"
	GPUShareContainerResourceName = "aliyun.com/gpu-mem"
	GPUShareResourceName          = "gpu_share"
Collaborator

Is this the actual resource name recognized by Ray? aliyun.com/gpu-mem -> gpu_share seems like an odd mapping

@win5923 (Contributor, Author) commented Dec 11, 2024

Apologies, I realized that the Ray resource name only supports GPU, neuron_cores, TPU, NPU, and HPU. I will map Aliyun's GPU share to GPU.

Ref: https://docs.ray.io/en/latest/ray-core/scheduling/accelerators.html

Collaborator

yes, I think we should only add custom accelerators here that are in the supported list you referenced

@win5923 win5923 changed the title from "[RayCluster] add GPUShare to known custom accelerators" to "[RayCluster] Adds aliyun's gpu share to known custom accelerators" Dec 10, 2024
@win5923 win5923 marked this pull request as draft December 11, 2024 13:37
@win5923 win5923 marked this pull request as ready for review December 12, 2024 13:18
@win5923 win5923 requested a review from andrewsykim December 12, 2024 13:19
@win5923 win5923 changed the title from "[RayCluster] Adds aliyun's gpu share to known custom accelerators" to "[RayCluster] Adds aliyun.com/gpu-mem to known custom accelerators" Dec 12, 2024
@andrewsykim (Collaborator) left a comment

@win5923 (Contributor, Author) commented Dec 17, 2024

I mapped aliyun.com/gpu-mem to num-gpus, because GPU share is still backed by a GPU.
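
For illustration, a sketch of what that mapping could look like (numGPUsFromLimits is a hypothetical helper, not KubeRay's API): the shared-GPU quantity is treated like a plain GPU count and passed through to the ray start parameters as num-gpus.

    package main

    import (
    	"fmt"

    	corev1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/api/resource"
    )

    // numGPUsFromLimits (hypothetical helper) reports how many GPUs a container
    // should advertise to Ray, treating Aliyun's shared-GPU resource the same
    // way as a regular nvidia.com/gpu request.
    func numGPUsFromLimits(limits corev1.ResourceList) int64 {
    	for _, name := range []corev1.ResourceName{"nvidia.com/gpu", "aliyun.com/gpu-mem"} {
    		if qty, ok := limits[name]; ok {
    			return qty.Value()
    		}
    	}
    	return 0
    }

    func main() {
    	limits := corev1.ResourceList{"aliyun.com/gpu-mem": resource.MustParse("1")}
    	// Would surface as --num-gpus=1 in the generated ray start parameters,
    	// so the autoscaler can satisfy {'GPU': 1.0}.
    	fmt.Printf("--num-gpus=%d\n", numGPUsFromLimits(limits))
    }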

@andrewsykim (Collaborator) commented Dec 17, 2024

Can you first open a PR to update the officially supported accelerators in https://docs.ray.io/en/latest/ray-core/scheduling/accelerators.html to include alyun.com/gpu?

@win5923 (Contributor, Author) commented Dec 20, 2024

I checked Aliyun's documentation, and I think they also use NVIDIA GPUs with the field nvidia.com/gpu, while Aliyun GPU Share uses a separate field, aliyun.com/gpu-mem.

Ref: https://www.alibabacloud.com/help/en/eci/user-guide/create-a-pod-by-specifying-the-gpu-specification-1

> Can you first open a PR to update the officially supported accelerators in https://docs.ray.io/en/latest/ray-core/scheduling/accelerators.html to include alyun.com/gpu?

So I think we should just focus on the string?

Development

Successfully merging this pull request may close these issues: Statistics of other types of gpu
2 participants