UCX slower than TCP Socket #305

Open
luweizheng opened this issue Oct 20, 2024 · 3 comments

Comments

@luweizheng

I started a thread in ucx-py, and I have since replaced ucx-py with UCXX, which resolved the blocking issue. However, in terms of performance, UCX is slower than TCP sockets. Over the past few days I have done some analysis and profiling, and I found that UCXX's await endpoint.recv() consumes a lot of CPU time.

I maintain xorbits/xoscar, data science tools similar to Dask that can scale workloads like pandas to a cluster. xoscar is the underlying actor framework used for inter-process communication, serialization, and more; its role is similar to Dask's distributed package. All communication and resource management is handled by xoscar. Like Dask, xorbits has a supervisor for management and workers responsible for computation. When a compute node starts xorbits/xoscar, xorbits sends messages such as that node's heartbeat to the supervisor through xoscar's communication mechanism. These management messages are mostly small transfers and occur at short intervals (a few times per second). Big transfers include shuffling DataFrames. In summary, there are small transfers for management messages and big transfers for data shuffling.

In xoscar, different actors communicate with each other through the concept of Channels, and we have implemented a UCXChannel.
The UCX code is on this page.

I used py-spy to profile xorbits/xoscar/ucxx and found that the await endpoint.recv() line consumes a lot of CPU. The flame graph is as follows.

[flame graph image]

@pentschev mentioned in the previous thread that it's not surprising for TCP sockets to be faster than UCX. So what I want to confirm is whether there are good solutions for scenarios like mine, where small transfers and big transfers are mixed together, to reduce the CPU load of await endpoint.recv(). Or do I need to modify the current design to have small transfers go through TCP and dataframe shuffling go through UCX?

@pentschev
Member

First of all, let me just say that Python asyncio imposes horrible overhead: the runtime for a single task is over 10 us, while a regular Python call runs in the range of 100 ns or less (i.e., asyncio overhead > 100x!), so short-lived tasks (like small data transfers) are a terrible use of asyncio. Source: https://github.com/pentschev/python-overhead .
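For illustration only, here is a minimal, self-contained sketch (not the code from the python-overhead repository above) that compares the cost of a plain Python call with the cost of creating and awaiting an asyncio task; absolute numbers vary with machine and Python version, but the gap is typically around two orders of magnitude.

# Minimal sketch: plain Python call vs. asyncio task overhead.
import asyncio
import time

def plain():
    return 1

async def coro():
    return 1

async def run_tasks(n):
    # Create, schedule and await one asyncio task per iteration.
    for _ in range(n):
        await asyncio.create_task(coro())

N = 100_000

t0 = time.perf_counter()
for _ in range(N):
    plain()
plain_ns = (time.perf_counter() - t0) / N * 1e9

t0 = time.perf_counter()
asyncio.run(run_tasks(N))
task_ns = (time.perf_counter() - t0) / N * 1e9

print(f"plain call: ~{plain_ns:.0f} ns/op, asyncio task: ~{task_ns:.0f} ns/op")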

@pentschev mentioned in rapidsai/ucx-py#1072 that it's not surprising for TCP sockets to be faster than UCX.

To be clear, I was referring to synchronous Python sockets, not necessarily using asyncio, which I now see is the case in your socket implementation, so I do ultimately expect at least similar performance between the UCXX async backend and asyncio's StreamReader/StreamWriter. Second, I was also referring to creating/destroying endpoints all the time instead of reusing them, which is a bad pattern. I'm not sure exactly how UCXChannel is used in xoscar, but you most definitely need to ensure endpoints are preserved for future use; a sketch of what I mean is below.
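For illustration only, a rough sketch of the endpoint caching pattern, assuming a ucx-py/UCXX-style async API (ucxx.create_endpoint plus ep.send); the real UCXChannel code in xoscar will of course look different, and the helper names here are made up.

# Hedged sketch: keep one endpoint per remote address and reuse it for every
# message, instead of creating/destroying an endpoint per transfer.
import ucxx

_endpoints = {}  # (host, port) -> cached endpoint

async def get_endpoint(host, port):
    key = (host, port)
    ep = _endpoints.get(key)
    if ep is None:
        # Endpoint creation is comparatively expensive; do it once per peer.
        ep = await ucxx.create_endpoint(host, port)
        _endpoints[key] = ep
    return ep

async def send_message(host, port, buffer):
    # The per-message path only reuses the cached endpoint.
    ep = await get_endpoint(host, port)
    await ep.send(buffer)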

With that said, I remember you previously wrote the following numbers in rapidsai/ucx-py#1072 (comment):

To show the numbers, I run some TPC-H queries, a data analysis benchmark. In terms of Query 3, the UCX backend takes 53 seconds, while UNIX sockets take only 35 seconds.

Is that still true? Can you confirm if you're using InfiniBand interfaces in both cases?

Can you also try the latest branch-0.41 (you'll need #116 for what I'm about to ask) and check how that performs when you try the different progress modes blocking/polling/thread/thread-polling, specifying the mode as the value of the UCXPY_PROGRESS_MODE environment variable?
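For example, reusing the send_recv benchmark shown further below, the progress mode can be selected roughly like this (assuming the environment variable is picked up the same way as in ucx-py; adjust the flags to your setup):

$ UCXPY_PROGRESS_MODE=blocking python -m ucxx.benchmarks.send_recv --no-detailed-report --n-iter 10000 --n-bytes 1 --backend ucxx-async
$ UCXPY_PROGRESS_MODE=thread-polling python -m ucxx.benchmarks.send_recv --no-detailed-report --n-iter 10000 --n-bytes 1 --backend ucxx-async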

We're certainly not attempting to be way faster, nor necessarily even faster than Python's libraries in all cases, but we definitely do want to have similar performance provided we're using the same transports. What we do want with UCXX is to leverage transports that are not available with Python builtin libraries, such as CUDA IPC and GPUDirect RDMA with InfiniBand, as well as to provide a seamless interface to transfer Python arrays (i.e., objects that implement __array_interface__/__cuda_array_interface__). All of that imposes some extra cost, cost which Python builtins do not have since they operate only on objects implementing the buffer protocol.

In any case, we do want to improve performance where possible, but once again I do not necessarily expect we'll match or outperform Python socket implementations if all you use are sockets via UCX. For example, in #309 I've added support for Python socket and asyncio.StreamReader/asyncio.StreamWriter backends, and running it I see the following for the Python sync API on the system I have handy with a ConnectX-4:

socket
$ python -m ucxx.benchmarks.send_recv --no-detailed-report --n-iter 10000 --n-bytes 1 --backend socket
Server Running at 10.33.225.163:10000
Client connecting to server at 10.33.225.163:10000
Roundtrip benchmark
================================================================================
Iterations                | 10000
Bytes                     | 1 B
Number of buffers         | 1
Object type               | numpy
Reuse allocation          | False
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 95.29 kiB/s
Bandwidth (median)        | 97.36 kiB/s
Latency (average)         | 10248 ns
Latency (median)          | 10030 ns
ucxx-core/TCP
$ UCX_TLS=tcp python -m ucxx.benchmarks.send_recv --no-detailed-report --n-iter 10000 --n-bytes 1 --backend ucxx-core
Server Running at 10.33.225.163:53180
Client connecting to server at 10.33.225.163:53180
Roundtrip benchmark
================================================================================
Iterations                | 10000
Bytes                     | 1 B
Number of buffers         | 1
Object type               | numpy
Reuse allocation          | False
Transfer API              | TAG
Progress mode             | thread
Asyncio wait              | False
Delay progress            | False
UCX_TLS                   | tcp
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 48.76 kiB/s
Bandwidth (median)        | 50.98 kiB/s
Latency (average)         | 20028 ns
Latency (median)          | 19157 ns
ucxx-core/RC
$ UCX_TLS=rc python -m ucxx.benchmarks.send_recv --no-detailed-report --n-iter 10000 --n-bytes 1 --backend ucxx-core
Server Running at 10.33.225.163:55133
Client connecting to server at 10.33.225.163:55133
Roundtrip benchmark
================================================================================
Iterations                | 10000
Bytes                     | 1 B
Number of buffers         | 1
Object type               | numpy
Reuse allocation          | False
Transfer API              | TAG
Progress mode             | thread
Asyncio wait              | False
Delay progress            | False
UCX_TLS                   | rc
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 100.44 kiB/s
Bandwidth (median)        | 99.26 kiB/s
Latency (average)         | 9723 ns
Latency (median)          | 9838 ns

As you can see, we are just slightly faster than Python's socket library when using RC, but we're still slower using TCP. I don't know why that is the case yet, but I would say UCXX still does a good job at what it proposes to do (enabling transports that are not available in Python). I'm sure we can still improve things further, and we should do so regardless; in fact, I started doing some more profiling in #310 (still experimental) and found some candidates for improving performance, after which we get a ~30% performance boost in that case:

ucxx-core/TCP
$ UCX_TLS=tcp python -m ucxx.benchmarks.send_recv --no-detailed-report --n-iter 10000 --n-bytes 1 --backend ucxx-core
Server Running at 10.33.225.163:57593
Client connecting to server at 10.33.225.163:57593
Roundtrip benchmark
================================================================================
Iterations                | 10000
Bytes                     | 1 B
Number of buffers         | 1
Object type               | numpy
Reuse allocation          | False
Backend                   | ucxx-core
Transfer API              | TAG
Progress mode             | thread
Asyncio wait              | False
Delay progress            | False
UCX_TLS                   | tcp
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 73.89 kiB/s
Bandwidth (median)        | 69.80 kiB/s
Latency (average)         | 13215 ns
Latency (median)          | 13991 ns
ucxx-core/RC
$ UCX_TLS=rc python -m ucxx.benchmarks.send_recv --no-detailed-report --n-iter 10000 --n-bytes 1 --backend ucxx-core
Server Running at 10.33.225.163:60702
Client connecting to server at 10.33.225.163:60702
Roundtrip benchmark
================================================================================
Iterations                | 10000
Bytes                     | 1 B
Number of buffers         | 1
Object type               | numpy
Reuse allocation          | False
Backend                   | ucxx-core
Transfer API              | TAG
Progress mode             | thread
Asyncio wait              | False
Delay progress            | False
UCX_TLS                   | rc
UCX_NET_DEVICES           | all
================================================================================
Device(s)                 | CPU-only
Server CPU                | affinity not set
Client CPU                | affinity not set
================================================================================
Bandwidth (average)       | 137.87 kiB/s
Bandwidth (median)        | 126.03 kiB/s
Latency (average)         | 7083 ns
Latency (median)          | 7748 ns

Similarly, if you run benchmarks with --backend ucxx-async you'll see that UCXX on its own performs better or worse than --backend asyncio depending on the progress mode, and the progress modes may also have a different impact on the actual application. It would therefore be interesting to know what you find out about the different progress modes in xoscar/xorbits, and maybe we can work to improve performance for you as well.

Finally, still on the topic of asyncio, the problems I've mentioned previously are something I've also encountered in Dask, so part of the UCXX effort attempts to overcome some of those issues, like the use of too many asyncio tasks. One of those attempts is what I've called multi-buffer transfers (see also the send_multi/recv_multi API), which essentially wrap multiple sequential buffer transfers into a single Python async task (under the hood they're still multiple UCX transfers). This reduces the number of asyncio tasks that need to be created and waited upon; a rough sketch is below. It may or may not be useful for xoscar, depending on the transfer patterns that exist.
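For illustration only, a rough sketch of the idea, assuming send_multi/recv_multi accept and return a list of buffers (the exact signatures may differ; check the UCXX API):

# Hedged sketch: batch several sequential buffer transfers behind one awaited
# call so only one asyncio task is created, instead of one task per buffer.
async def send_frames(ep, frames):
    # Per-buffer path: one asyncio task created and awaited per frame.
    #   for frame in frames:
    #       await ep.send(frame)
    # Multi-buffer path: a single await wrapping the sequential UCX transfers.
    await ep.send_multi(frames)

async def recv_frames(ep):
    # The peer receives all buffers of the batch from a single await.
    return await ep.recv_multi()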

Please let us know if you happen to do some more experimenting based on the information above and how that goes so we can see whether UCXX can be made to perform better for you as well.

@luweizheng
Author

Hi @pentschev

Sorry for the late reply.

Xoscar Design

In xoscar, we reuse endpoints instead of creating them each time. Each process is an ActorPool with a UCXX endpoint, and processes communicate with each other through UCXX for data transmission.
In our design, if two processes are within the same worker, communicating through a UNIX socket is very cost-effective; across different workers, communication goes through UCXX or TCP sockets. However, we found that when using UCXX, even inter-process communication within the same worker goes through UCXX, which not only has a high cost but also spends a lot of CPU time in the recv method.

Refactoring the architecture is not easy, and I believe this might be the crux of the issue: all communications, including those between processes within the same worker, are conducted through UCXX. And UCXX is not good for small transfers?
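For illustration, this is the kind of locality/size-based routing I would otherwise have to add (all names below are hypothetical, not taken from the actual xoscar code):

# Hypothetical sketch: pick a channel type based on peer locality and
# (optionally) message size, rather than sending everything through UCXX.
def choose_channel_type(peer_worker, local_worker, message_size,
                        big_transfer_threshold=1 << 20):
    if peer_worker == local_worker:
        # Same worker: inter-process traffic over a UNIX socket is cheap.
        return "unixsocket"
    if message_size >= big_transfer_threshold:
        # Large cross-worker transfers (e.g. DataFrame shuffles) via UCXX.
        return "ucx"
    # Small cross-worker management messages via TCP sockets.
    return "tcp"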

Benchmarks for different progress modes

Given the architecture mentioned above, I did some tests.

I installed the UCXX nightly from conda, and the ucxx.benchmarks results are as expected.

blocking resulted in errors.

AssertionError
2024-11-18 21:39:14,830 asyncio      64845 ERROR    Exception in callback <bound method BlockingMode._fd_reader_callback of <ucxx._lib_async.continuous_ucx_progress.BlockingMode object at 0x1506bd673fb0>>
handle: <Handle BlockingMode._fd_reader_callback>
Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 61, in uvloop.loop.Handle._run
  File "~/envs/ucxx/lib/python3.12/site-packages/ucxx/_lib_async/continuous_ucx_progress.py", line 149, in _fd_reader_callback
    assert self.blocking_asyncio_task is None or self.blocking_asyncio_task.done()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

With polling and thread-polling, htop showed high CPU utilization and a high load average. DataFrame workloads are much slower than with TCP sockets.

@pentschev
Member

On different workers, communication is done through UCXX or TCP socket, but we found that when using UCXX, even inter-process communication within the same worker also uses UCXX, which not only has a high cost but also consumes a lot of CPU time for recv method.

Can you be more objective and provide numbers for what you refer to as "high cost"? I'm not really sure what that means.

Consuming a lot of CPU probably refers to polling/thread-polling, is that right? Those are indeed expected to poll indefinitely and will incur 100% CPU utilization, so nobody should normally use them in practice unless the application is very latency-sensitive and the tradeoffs are properly understood. I did suggest them, but as a debugging/extended benchmarking tool; you should prefer either blocking or thread instead.

And UCXX is not good for small transfers?

In C++, UCXX will do well for small transfers; I was merely stating facts about Python when used for small transfers (or any short-lived tasks in general), in particular async Python. I'm also saying that Python sockets/asyncio Stream{Reader,Writer} may provide better performance for small transfers than UCXX, as I've observed with our benchmarks. This may have to do with the more robust type compatibility we support, such as transferring arrays supporting __array_interface__/__cuda_array_interface__; those checks incur non-negligible cost, and some of them are what I attempted to remove in #310. There's definitely some room for improvement there, though. With that said, if all your transfers are small and do not leverage RDMA or other specialized HW capabilities, it's possible you'll be better off with Python sockets instead of UCXX. As I've mentioned previously, UCXX intends to leverage HW capabilities that are not available in asyncio or other system libraries, but we can't guarantee optimal performance for every use case. PRs to improve performance are encouraged and welcome too!

Given the architecture mentioned above, I did some tests.

I installed the UCXX nightly from conda, and the ucxx.benchmarks results are as expected.

blocking resulted in errors.

It is strange that you're hitting that assertion all the time; I have not seen it in years, or maybe ever. Can you provide a reproducer? Additionally, does it work if you comment it out, and what does performance look like then? If you have the chance, I'd encourage you to test blocking with #312 as well; it should be significantly better than without it (numbers are in the PR's description).

DataFrame workloads are much slower than TCP Socket.

So far I've been talking about UCXX in general terms because you haven't provided me with any numbers, only subjective descriptions such as "much slower". Can you provide observed numbers instead? If possible, can you provide a simple reproducer too?
