
Slurm Cluster CPU Affinity #1358

Closed
ilan-gold opened this issue Jul 19, 2024 · 10 comments

@ilan-gold

Hello All,

This is more of a "seeking advice" issue than a bug report, although who knows, so anyone with experience in this area is welcome to chime in! The TL;DR is that requesting large amounts of memory on a Slurm cluster causes the CPU affinity (and NUMA affinity) reported for the GPUs to be incorrect.

When running srun --pty -c 10 -p gpu_p --qos gpu_long --nice=0 --exclusive --gres=gpu:2 -t 06:00:00 bash, nvidia-smi topo -m gives the "correct" NUMA/CPU affinity (as needed by dask-cuda in the linked lines):

       GPU0    GPU1    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     48-63,176-191   3               N/A
GPU1    NV12     X      PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     48-63,176-191   3               N/A
NIC0    PXB     PXB      X      PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC1    PXB     PXB     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC2    SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS     SYS     SYS
NIC3    SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS
NIC5    SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS
NIC6    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS
NIC7    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS
NIC8    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX
NIC9    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9

But when we request more memory, it doesn't work. That is, running srun --pty -c 10 -p gpu_p --qos gpu_reservation --nice=0 --mem 200G --gres=gpu:2 --reservation=test_supergpu05 -t 06:00:00 bash followed by nvidia-smi topo -m gives:

        GPU0    GPU1    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS             3               N/A
GPU1    NV12     X      SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS             7               N/A
NIC0    PXB     SYS      X      PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC1    PXB     SYS     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC2    SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS     SYS     SYS
NIC3    SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS
NIC4    SYS     PXB     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS
NIC5    SYS     PXB     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS
NIC6    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS
NIC7    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS
NIC8    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX
NIC9    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9

As you can see, the CPU affinity column is now empty and the NUMA affinity is not correct either.

So the following:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1")
client = Client(cluster)

loses one (or both) of the workers under the higher-memory configuration because the CPU affinity is wrong, giving the following error:

Task exception was never retrieved
future: <Task finished name='Task-276' coro=<_wrap_awaitable() done, defined at /home/icb/ilan.gold/miniconda3/envs/rsc_ale/lib/python3.11/site-packages/distributed/deploy/spec.py:124> exception=RuntimeError('Worker failed to start.')>
Traceback (most recent call last):
  File "/home/icb/ilan.gold/miniconda3/envs/rsc_ale/lib/python3.11/site-packages/distributed/core.py", line 664, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
    ^^^^^^^^^^^^^^^^^
  File "/home/icb/ilan.gold/miniconda3/envs/rsc_ale/lib/python3.11/site-packages/distributed/utils.py", line 1956, in wait_for
    return await fut
  File "/home/icb/ilan.gold/miniconda3/envs/rsc_ale/lib/python3.11/site-packages/distributed/worker.py", line 1476, in start_unsafe
    raise plugins_exceptions[0]
    ^^^^^^^^^^^^^^^^^
  File "/home/icb/ilan.gold/miniconda3/envs/rsc_ale/lib/python3.11/site-packages/distributed/utils.py", line 832, in wrapper
    return await func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/home/icb/ilan.gold/miniconda3/envs/rsc_ale/lib/python3.11/site-packages/distributed/worker.py", line 1876, in plugin_add
    result = plugin.setup(worker=self)
    ^^^^^^^^^^^^^^^^^
  File "/home/icb/ilan.gold/miniconda3/envs/rsc_ale/lib/python3.11/site-packages/dask_cuda/plugins.py", line 14, in setup
    os.sched_setaffinity(0, self.cores)
    ^^^^^^^^^^^^^^^^^
OSError: [Errno 22] Invalid argument

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/icb/ilan.gold/miniconda3/envs/rsc_ale/lib/python3.11/site-packages/distributed/deploy/spec.py", line 125, in _wrap_awaitable
    return await aw
           ^^^^^^^^
  File "/home/icb/ilan.gold/miniconda3/envs/rsc_ale/lib/python3.11/site-packages/distributed/core.py", line 653, in start
    raise self.__startup_exc
  File "/home/icb/ilan.gold/miniconda3/envs/rsc_ale/lib/python3.11/site-packages/distributed/core.py", line 664, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/home/icb/ilan.gold/miniconda3/envs/rsc_ale/lib/python3.11/site-packages/distributed/utils.py", line 1956, in wait_for
    return await fut
           ^^^^^^^^^
  File "/home/icb/ilan.gold/miniconda3/envs/rsc_ale/lib/python3.11/site-packages/distributed/nanny.py", line 369, in start_unsafe
    response = await self.instantiate()
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/icb/ilan.gold/miniconda3/envs/rsc_ale/lib/python3.11/site-packages/distributed/nanny.py", line 452, in instantiate
    result = await self.process.start()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/icb/ilan.gold/miniconda3/envs/rsc_ale/lib/python3.11/site-packages/distributed/nanny.py", line 759, in start
    msg = await self._wait_until_connected(uid)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/icb/ilan.gold/miniconda3/envs/rsc_ale/lib/python3.11/site-packages/distributed/nanny.py", line 901, in _wait_until_connected
    raise msg["exception"]
  File "/home/icb/ilan.gold/miniconda3/envs/rsc_ale/lib/python3.11/site-packages/distributed/nanny.py", line 965, in run
    async with worker:
    ^^^^^^^^^^^^^^^^^
  File "/home/icb/ilan.gold/miniconda3/envs/rsc_ale/lib/python3.11/site-packages/distributed/core.py", line 678, in __aenter__
    await self
  File "/home/icb/ilan.gold/miniconda3/envs/rsc_ale/lib/python3.11/site-packages/distributed/core.py", line 672, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
    ^^^^^^^^^^^^^^^^^
RuntimeError: Worker failed to start.
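
For context, the failing call is dask-cuda's CPU affinity worker plugin, which (per the traceback above) simply passes the CPU list reported for the GPU to os.sched_setaffinity. When that list comes back empty, the kernel rejects the call with EINVAL. A minimal sketch of that failure mode, assuming an empty affinity list:

import os

# What the plugin effectively does (see dask_cuda/plugins.py in the traceback above):
# pin the worker process to the CPUs reported for its GPU.
cores = []  # presumably what pynvml returns under the high-memory allocation (see the script below)
try:
    os.sched_setaffinity(0, cores)  # an empty CPU set is rejected by the kernel
except OSError as e:
    print(e)  # [Errno 22] Invalid argument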

The section of the dask-cuda codebase that relies on this configuration can also be extracted into a self-contained script relying on pynvml. (pynvml queries the same NVML library that nvidia-smi uses under the hood, which is why I posted that output first.) In any case, the following script:

import os
import pynvml
import numpy as np
from multiprocessing import cpu_count
import math
 
def unpack_bitmask(x, mask_bits=64):
    """Unpack a list of integers containing bitmasks.
 
    Parameters
    ----------
    x: list of int
        A list of integers
    mask_bits: int
        An integer determining the bitwidth of `x`
 
    Examples
    --------
    >>> from dask_cuda.utils import unpack_bitmask
    >>> unpack_bitmask([1 + 2 + 8])
    [0, 1, 3]
    >>> unpack_bitmask([1 + 2 + 16])
    [0, 1, 4]
    >>> unpack_bitmask([1 + 2 + 16, 2 + 4])
    [0, 1, 4, 65, 66]
    >>> unpack_bitmask([1 + 2 + 16, 2 + 4], mask_bits=32)
    [0, 1, 4, 33, 34]
    """
    res = []
 
    for i, mask in enumerate(x):
        if not isinstance(mask, int):
            raise TypeError("All elements of the list `x` must be integers")
 
        cpu_offset = i * mask_bits
 
        bytestr = np.frombuffer(
            bytes(np.binary_repr(mask, width=mask_bits), "utf-8"), "u1"
        )
        mask = np.flip(bytestr - ord("0")).astype(bool)
        unpacked_mask = np.where(
            mask, np.arange(mask_bits) + cpu_offset, np.full(mask_bits, -1)
        )
 
        res += unpacked_mask[(unpacked_mask >= 0)].tolist()
 
    return res
 
pynvml.nvmlInit()

# CPU affinity reported for GPU 0 (this mirrors what dask_cuda does when pinning a worker)
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
affinity = pynvml.nvmlDeviceGetCpuAffinity(handle, math.ceil(cpu_count() / 64))
print(unpack_bitmask(affinity))

# CPU affinity reported for GPU 1
handle = pynvml.nvmlDeviceGetHandleByIndex(1)
affinity = pynvml.nvmlDeviceGetCpuAffinity(handle, math.ceil(cpu_count() / 64))
print(unpack_bitmask(affinity))

gives incorrect results on the higher-memory allocation, where both print statements produce empty lists, whereas with the first srun command above, the output matches the CPU affinity reported by nvidia-smi topo -m.
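
One sanity check that may help when escalating this to the cluster admins is to compare the CPU set the Slurm job itself is confined to (its cgroup cpuset) with the per-GPU affinity NVML reports; if the NVML set is empty, or the two sets don't overlap, the plugin's os.sched_setaffinity call has nothing valid to pin to. A sketch, reusing the unpack_bitmask helper from the script above:

import math
import os
from multiprocessing import cpu_count

import pynvml

# CPUs this Slurm job is actually allowed to run on (its cgroup cpuset)
job_cpus = sorted(os.sched_getaffinity(0))
print("job cpuset:", job_cpus)

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    affinity = pynvml.nvmlDeviceGetCpuAffinity(handle, math.ceil(cpu_count() / 64))
    gpu_cpus = unpack_bitmask(affinity)  # helper defined in the script above
    # If this overlap is empty, the worker plugin's os.sched_setaffinity() call fails with EINVAL.
    print(f"GPU{i} affinity:", gpu_cpus, "overlap with job:", sorted(set(gpu_cpus) & set(job_cpus)))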

Thanks for any advice!

@pentschev
Member

It's hard to say without additional information, and I would suggest asking your cluster's admin. To me it seems that the "more memory" situation you're describing, together with the addition of the --reservation argument, implies that you're getting machines from a different partition in your cluster, which may have a different HW/SW configuration as well. In that case it's possible that something is wrong with either the HW or SW configuration and NUMA nodes are being reported incorrectly.

Given that nvidia-smi is reporting incorrect NUMA nodes, I don't think this is something we can fix on the Dask-CUDA end. In that case, I'd again strongly advise asking the cluster admin, because this could have deeper roots than just incorrect NUMA node reporting, and may thus lead to other difficult-to-identify errors as well as considerable performance reductions.

If, knowing the above, you'd still like to try launching a Dask-CUDA cluster, you could try disabling affinity setting by commenting out the relevant lines in dask cuda worker or LocalCUDACluster. If you ultimately find that disabling affinity setting is a requirement for your cluster for whatever reason, we could accept a patch where one could explicitly disable that plugin, but the default would remain to have it enabled.
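
For concreteness, a hypothetical sketch of what such a local edit (or an eventual opt-out patch) might look like, based only on the plugin setup visible in the traceback above; the real class in dask_cuda/plugins.py is a distributed WorkerPlugin and differs in detail:

import os

# Hypothetical simplification of the affinity plugin seen in the traceback;
# the guard skips pinning when no usable CPUs are reported (avoiding the
# EINVAL crash) or when affinity setting has been disabled by leaving cores empty.
class CPUAffinity:
    def __init__(self, cores):
        self.cores = cores

    def setup(self, worker=None):
        if self.cores:
            os.sched_setaffinity(0, self.cores)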

@ilan-gold
Author

If, knowing the above, you'd still like to try launching a Dask-CUDA cluster, you could try disabling affinity setting by commenting out the relevant lines in dask cuda worker or LocalCUDACluster. If you ultimately find that disabling affinity setting is a requirement for your cluster for whatever reason, we could accept a patch where one could explicitly disable that plugin, but the default would remain to have it enabled.

Would a cluster still work without these lines? I was under the impression setting affinity was necessary.

@pentschev
Member

Would a cluster still work without these lines? I was under the impression setting affinity was necessary.

It will, but it may be slow. The primary purpose of setting CPU affinity in the context of Dask-CUDA is to ensure workers are running on the closest CPU(s) to each GPU, thus avoiding additional hops that will slow down the application.
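
Mechanically, "setting affinity" just means pinning each worker process to the CPUs local to its GPU, e.g. the 48-63,176-191 set reported for GPU0/GPU1 in the topology above. A minimal standard-library sketch of the idea (the CPU ranges are taken from the posted topology, not universal):

import os

# CPU ranges reported above for GPU0/GPU1 (NUMA node 3); on another node,
# or for other GPUs, these would differ.
gpu_local_cpus = set(range(48, 64)) | set(range(176, 192))

# Pin the current process (conceptually, a worker serving GPU0) to the
# GPU-local CPUs so its memory traffic stays within that NUMA node.
# This assumes the job's cgroup actually contains these CPUs.
os.sched_setaffinity(0, gpu_local_cpus)
print(sorted(os.sched_getaffinity(0)))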

@ilan-gold
Author

@pentschev What has your experience been with dask-cuda setting this affinity vs. not? What sort/magnitude of slowdown might you see if this is not set?

Would an example like the one from #1351, where there is a CPU-GPU transfer, reveal performance differences?

@ilan-gold
Author

And as a follow-up: if I were to see no difference, what would that mean, given that you have (presumably) seen an improvement from setting this affinity?

@ilan-gold
Author

ilan-gold commented Jul 22, 2024

In any case, I'm seeing a roughly 30% slowdown with 4 GPUs without the affinity set, so I would be curious what your experience is (sorry for the bad log messages haha):

[Screenshot of timing logs, 2024-07-22 15:13]

vs.

[Screenshot of timing logs, 2024-07-22 14:57]

@ilan-gold
Author

My topology:

        GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     48-63,176-191   3               N/A
GPU1    NV12     X      NV12    NV12    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     48-63,176-191   3               N/A
GPU2    NV12    NV12     X      NV12    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     16-31,144-159   1               N/A
GPU3    NV12    NV12    NV12     X      SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     16-31,144-159   1               N/A
NIC0    PXB     PXB     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC1    PXB     PXB     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC2    SYS     SYS     PXB     PXB     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS     SYS     SYS
NIC3    SYS     SYS     PXB     PXB     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS
NIC5    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS
NIC6    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS
NIC7    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS
NIC8    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX
NIC9    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X 

@pentschev
Member

This is the kind of problem for which there's no good rule of thumb, as there are simply too many variables involved. Everything will depend on the topology, the type of compute and memory access patterns, as well as system load, PCIe bandwidth, etc. The best approach is to do what you did and measure it. I'm not surprised by a 30% slowdown; it will most likely be noticeable in the majority of cases, in particular when there's more than one NUMA node involved, which is the case for you.

@ilan-gold
Author

Thanks @pentschev!

@wence-
Contributor

wence- commented Aug 1, 2024

I think the conclusion here is that there is no concrete bug to take action on. If so, @ilan-gold, please go ahead and close.
