UCX serialization errors after dumps_task removal from Distributed
#1246
Comments
This failure is seen consistently in DGX tests but is not observed in CI, because CI lacks UCX testing for transports other than TCP.
We can monkey patch past this error in dask-cuda, by patching
#1247 should fix this issue for RAPIDS 23.10 and allow us to pin Dask/Distributed 2023.9.2 as planned. The proper solution must land in Distributed via dask/distributed#8216. Once the Distributed fix is in and we unpin 2023.9.2 in
The following snippet currently fails in Dask-CUDA if we use protocol="ucx":
Reproducer and output
After bisecting, I found dask/distributed#8067 to be the source of this issue. The snippet used to complete fine before that change, and it still does if we replace protocol="ucx" with protocol="tcp", which may suggest there's something missing in the serialization protocol for UCX.
cc @rjzamora @madsbk who both had a look at dask/distributed#8067 and may have thoughts on what we're missing.