
[FEA] Make selector choose appropriate CUDA 12.x versions based on dependencies #471

Open
vyasr opened this issue Feb 5, 2024 · 8 comments
Labels: ? - Needs Triage, feature request


vyasr commented Feb 5, 2024

Is your feature request related to a problem? Please describe.
Once RAPIDS adds support for CUDA 12.2, it will be possible to install conda packages of PyTorch alongside RAPIDS from conda. Currently this is not possible because PyTorch supports 12.1 and will likely bump straight to 12.3 for its next set of packages. Since the CUDA 12 lineup of RAPIDS packages is going to leverage CUDA Enhanced Compatibility (CEC) to support arbitrary CUDA minor versions, users will no longer need a specific minor version for RAPIDS, but dependencies like PyTorch will likely continue to require one.

Describe the solution you'd like
We should update the release selector to include a range of CUDA minor versions and have it automatically select supported ones based on the user's choice of packages to include in their environment.
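To make the idea concrete, here is a minimal sketch of the selection logic such a selector could use: intersect the CUDA version ranges supported by each chosen package and offer only the minor versions in the overlap. The package names and version ranges below are illustrative assumptions, not real compatibility data.

```python
# Hypothetical compatibility table: package -> inclusive (min, max) CUDA
# version it supports, as (major, minor) tuples. All entries here are
# assumed 12.x for simplicity; the real selector would need actual data.
SUPPORTED_CUDA = {
    "rapids": ((12, 0), (12, 5)),   # CEC: any 12.x minor in this range
    "pytorch": ((12, 1), (12, 1)),  # pinned to a single minor version
}

def compatible_cuda_versions(packages):
    """Return the CUDA (major, minor) versions every package supports."""
    lo = max(SUPPORTED_CUDA[p][0] for p in packages)
    hi = min(SUPPORTED_CUDA[p][1] for p in packages)
    if lo > hi:
        return []  # no common CUDA version; selector should warn the user
    return [(12, minor) for minor in range(lo[1], hi[1] + 1)]

print(compatible_cuda_versions(["rapids"]))             # all 12.x minors RAPIDS accepts
print(compatible_cuda_versions(["rapids", "pytorch"]))  # only the overlap
```

With data like this, the selector could default `cuda-version` to the newest entry of the intersection, or grey out incompatible package combinations.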

Additional context
For libraries like PyTorch, we will also need to consider what channel the package will be installed from. Officially supported PyTorch builds come from the pytorch channel, not conda-forge, so unless/until that changes we will need to ensure that our install command accounts for that correctly.

@vyasr vyasr added ? - Needs Triage Need team to review and classify feature request New feature or request labels Feb 5, 2024
@jakirkham (Member)

Possibly related ( #470 )


bdice commented Feb 5, 2024

#470 fixes the compatible major versions of CUDA for the TensorFlow GPU conda-forge package. It does not impact minor version compatibility.

What part of this is dependent on RAPIDS supporting CUDA 12.2?

I was able to solve this environment, and got a CUDA 12 build of pytorch from conda-forge (pytorch 2.1.2 cuda120_py310h327d3bc_301).

mamba create -n rapids-23.12 -c rapidsai -c conda-forge -c nvidia rapids=23.12 python=3.10 cuda-version=12.0 pytorch

I don't think we can offer official compatibility between RAPIDS / conda-forge and the pytorch channel, given that the pytorch package from the pytorch channel is built against nvidia channel CUDA packages. These channel conflicts are unavoidable. An example environment showing the mixture of nvidia and conda-forge packages can be generated by adding -c pytorch before -c conda-forge:

# Uses both nvidia and conda-forge CUDA Toolkit packages. Not supported.
mamba create -n rapids-23.12 -c rapidsai -c pytorch -c conda-forge -c nvidia rapids=23.12 python=3.10 cuda-version=12.0 pytorch

Last I tested it, this environment worked but we can't offer support for a configuration with CUDA from a mixed set of channels.

At some point in the future we are hoping to make the CUDA distributions on the nvidia and conda-forge channels compatible, but until that point, I don't see any action item here. The install selector works as desired with PyTorch CUDA 12 packages from conda-forge.


vyasr commented Feb 8, 2024

I agree that this isn't addressable until the nvidia and conda-forge CTK packages are aligned. We should still consider how the selector ought to work once that day comes. To @MatthiasKohl's point, the pytorch channel is the officially supported medium (by both NVIDIA and PyTorch) for installing the package, so IMHO once the two are aligned we would probably want to encourage installing PyTorch from the pytorch channel, unless and until the conda-forge package sees a level of support similar to what NVIDIA now provides for the CTK on conda-forge.

@MatthiasKohl

> The install selector works as desired with PyTorch CUDA 12 packages from conda-forge.

It might work as desired, but I don't think it should.
I checked today with Cliff and Piotr from DLFW, and both our DLFW teams and upstream PyTorch have found many incompatibility issues with the PyTorch build from conda-forge (e.g. libc version mismatches). The problem is that few people install only PyTorch; most rely on many other packages, all of which are either pip-wheel based or based on conda's main channel, and which use different base packages.
IMO, we should not encourage people to use this PyTorch build. If RAPIDS cannot be compatible with upstream PyTorch (from officially supported channels), then we should either work with DLFW to become compatible, or remove that option from the install selector.


vyasr commented Oct 22, 2024

Big relevant news here: pytorch/pytorch#138506

@MatthiasKohl

There has not been any substantial effort or progress toward becoming compatible with DLFWs since this was last discussed.
The fact that PyTorch is deprecating its conda channel means there will not be any officially supported conda package of PyTorch, just as with TensorFlow.
Thus, we should remove both the PyTorch and TensorFlow options from the install selector.


agm-eratosth commented Oct 31, 2024

> There has not been any substantial effort / progress to become compatible with DLFWs since this was last discussed. The fact that PyTorch is deprecating their conda channel means that there will not be any officially supported package of PyTorch on conda, just like for Tensorflow. Thus, we should remove both the PyTorch and Tensorflow options from the install selector.

RAPIDS is often used in conjunction with PyTorch and TensorFlow. Wouldn't it instead make sense to support the conda-forge feedstocks, since they are community driven and pull requests can be made against them? The compatibility changes discussed here could be made going forward, now that the conda-forge channel is how PyTorch will be distributed on conda.

@MatthiasKohl

> Rapidsai is often used in conjunction with PyTorch and Tensorflow for many users. Wouldn't it instead make sense to support the conda-forge feedstocks, since they are community driven and pull requests can be made on them? The changes being discussed here can be made for compatibility moving forward with rapids now that the conda-forge channel is the way pytorch will be distributed moving forward on conda.

This does make sense, but it definitely requires support from Cliff Woolley and his org, so I'd recommend reaching out to them to see what they can support. This will likely take a long time, even if we can eventually support conda-forge officially, so while that effort is ongoing I'd still recommend removing the selector option.
