Add blocking progress mode to Python async #116
Conversation
Force-pushed from 37ea715 to 1c1ab87.
I think the core of this looks fine, but I am reminded that this epoll_wait issue with asyncio doesn't actually quite work correctly with the code we are using (see the discussion in rapidsai/ucx-py#888).
# - All asyncio tasks that isn't waiting on UCX must be executed
#   so that the asyncio's next state is epoll wait.
#   See <https://github.com/rapidsai/ucx-py/issues/413>
Although I think I understand the constraint, I am not sure what it means for the "next state" to be epoll_wait. Surely there can be arbitrary non-ucx tasks?
I'm gonna be honest and say my understanding here is also a bit fuzzy, and as you noted yourself this "doesn't work" (except when it does), with the original being adapted from https://stackoverflow.com/a/48491563. In my understanding, what epoll_wait refers to here is the socket state, which is there solely to provide a mechanism to prevent asyncio from running out of "ready" tasks, so epoll_wait will ensure that if nothing useful happens in the event loop, asyncio will still be woken up at some point to allow the loop to reevaluate.
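For what it's worth, the general trick being described is roughly the following. This is a minimal standalone sketch, not the UCXX code; the socketpair and callback names are made up. A dummy socket registered with the event loop gives epoll_wait a file descriptor to report on, so something outside the normal task flow can always wake the loop up:

import asyncio
import socket


async def wakeup_on_socket_sketch():
    loop = asyncio.get_running_loop()
    rsock, wsock = socket.socketpair()
    rsock.setblocking(False)
    wsock.setblocking(False)

    def on_readable():
        # Drain the wake-up byte; in a real progress loop this is where the
        # worker would be progressed.
        rsock.recv(1)
        print("loop woken up by epoll_wait reporting the socket")

    loop.add_reader(rsock.fileno(), on_readable)

    # Anything with a reference to wsock (e.g. another thread) can force the
    # loop out of epoll_wait by writing a byte.
    wsock.send(b"\0")
    await asyncio.sleep(0.1)  # give the callback a chance to run

    loop.remove_reader(rsock.fileno())
    rsock.close()
    wsock.close()


asyncio.run(wakeup_on_socket_sketch())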
if self.worker.arm():
    # At this point we know that asyncio's next state is
    # epoll wait.
Who does the epoll_wait call?
As per my #116 (comment), I believe this is related to the sockets.

Yes, I remember that. The purpose of this is not necessarily to be used long-term, but rather to have a fallback to the original UCX-Py behavior. This will allow us to test as closely as possible to what UCX-Py did in the past, and we may later deprecate/remove this if we are confident we have a better option (e.g., …).
Thanks for the explanations!
Force-pushed from 52e4d1d to 91ab7bf.
I had some small docstring comments, but I still think this looks good. Is there anything else to do here?
The blocking progress mode ensure the worker is progress whenever the UCX
worker reports an event on its epoll file descriptor. In certain
circumstances the epoll file descriptor may not
nit: Grammar suggestion:

Blocking progress mode ensures that the worker is progressed whenever the UCX worker reports an event on its epoll file descriptor.

nit: The second sentence appears to be incomplete.
Fixed that and a few more issues with the progress timeout in c800ca4.
event_loop_close = self.event_loop.close

def _event_loop_close(*args, **kwargs):
    if not self.event_loop.is_closed() and self.asyncio_task is not None:
        try:
            self.asyncio_task.cancel()
            self.event_loop.run_until_complete(self.asyncio_task)
        except asyncio.exceptions.CancelledError:
            pass
        finally:
            event_loop_close(*args, **kwargs)

self.event_loop.close = _event_loop_close
@wence- would you mind having one more look at this? It is a real solution for the years-long "coroutine was never awaited" / "Task was destroyed but it is pending!" warnings that we've attempted to resolve in many instances, including rapidsai/ucx-py#929, yet it is very intrusive. To me it doesn't look like it can be too harmful, but maybe you'll have some other thoughts or opinions.
Some questions:

- This changes the behaviour of EventLoop.close. Does it only do so for this instance?
- What happens if multiple of these ProgressTasks are stacked up? I guess each one overwrites the close method, but remembers the previous one, so we do unwind everything?
- This changes the behaviour of EventLoop.close. Does it only do so for this instance?

That's right, the change only applies to self.event_loop, not the whole class; here's an example:
import asyncio


async def run_patch():
    loop = asyncio.get_running_loop()
    print(f"run_patch: {loop}")

    loop_close = loop.close

    def _patch_close(*args, **kwargs):
        if not loop.is_closed():
            print("_patch_close")
            loop_close(*args, **kwargs)

    loop.close = _patch_close


async def run_orig():
    loop = asyncio.get_running_loop()
    print(f"run_orig: {loop}")


loop = asyncio.new_event_loop()
loop.run_until_complete(run_patch())
loop.close()

loop2 = asyncio.new_event_loop()
loop2.run_until_complete(run_orig())
loop2.close()
Which prints:
run_patch: <_UnixSelectorEventLoop running=True closed=False debug=False>
_patch_close
run_orig: <_UnixSelectorEventLoop running=True closed=False debug=False>
IOW, _patch_close only applies to loop, as expected.
- What happens if multiple of these ProgressTasks are stacked up? I guess each one overwrites the close method, but remembers the previous one, so we do unwind everything?
This is a very good catch. Indeed this may not work and can cause infinite recursion due to the local event_loop_close in its original form, but it does work when passing the original loop.close function to a partial, such as:
import asyncio
from functools import partial


async def run_patch():
    loop = asyncio.get_running_loop()
    print(f"run_patch: {loop}")

    loop_close = loop.close

    def _patch_close(loop_close, *args, **kwargs):
        if not loop.is_closed():
            print(f"_patch_close: {loop_close}")
            loop_close(*args, **kwargs)

    loop.close = partial(_patch_close, loop_close)

    loop_close = loop.close

    def _patch_close2(loop_close, *args, **kwargs):
        if not loop.is_closed():
            print(f"_patch_close2: {loop_close}")
            loop_close(*args, **kwargs)

    loop.close = partial(_patch_close2, loop_close)


loop = asyncio.new_event_loop()
loop.run_until_complete(run_patch())
loop.close()
If I'm not overlooking anything, the sample above is equivalent to having multiple ProgressTasks stacking up: each one rewrites loop.close with a wrapper that calls the loop.close that was previously set, so closing the loop unwinds through the whole chain down to the original close. This change is now reflected in c5c2ceb.
if not self.event_loop.is_closed() and self.asyncio_task is not None:
    try:
        self.asyncio_task.cancel()
        self.event_loop.run_until_complete(self.asyncio_task)
suggestion: should we set self.asyncio_task = None after running until complete?
OK, so the idea is that you don't have control over who is closing the event loop, so you instead hook into close and this task cancels itself during event loop closing/teardown?
suggestion: should we set self.asyncio_task = None after running until complete?

That's a good idea, done in 279cb4c.
OK, so the idea is that you don't have control over who is closing the event loop, so you instead hook into close and this task cancels itself during event loop closing/teardown?

Exactly, there doesn't seem to be another way, since we have no control over whether the user will close the loop before resetting UCXX.
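For reference, here is a stripped-down sketch of the pattern under discussion (the class name, ProgressTaskSketch, and its internals are hypothetical, not the actual UCXX implementation): a never-ending task whose owner patches loop.close via partial, so the task cancels itself whenever the user closes the loop.

import asyncio
from functools import partial


class ProgressTaskSketch:
    """Hypothetical sketch: a long-running task that hooks ``loop.close``
    so it can cancel itself during event loop teardown."""

    def __init__(self, event_loop):
        self.event_loop = event_loop
        self.asyncio_task = event_loop.create_task(self._progress_forever())

        original_close = event_loop.close

        def _event_loop_close(original_close, *args, **kwargs):
            # Cancel and await our own task before the loop goes away.
            if not self.event_loop.is_closed() and self.asyncio_task is not None:
                try:
                    self.asyncio_task.cancel()
                    self.event_loop.run_until_complete(self.asyncio_task)
                except asyncio.CancelledError:
                    pass
                finally:
                    self.asyncio_task = None
            original_close(*args, **kwargs)

        # partial() captures whatever close was previously set, so stacked
        # instances unwind one another down to the original close.
        event_loop.close = partial(_event_loop_close, original_close)

    async def _progress_forever(self):
        while True:
            # Stand-in for progressing the UCX worker.
            await asyncio.sleep(0)


loop = asyncio.new_event_loop()
progress = ProgressTaskSketch(loop)
loop.run_until_complete(asyncio.sleep(0.01))  # let the progress task run briefly
loop.close()  # patched close cancels the task, then closes the loop for real

Because the task is cancelled and awaited inside the patched close, no "Task was destroyed but it is pending!" warning is emitted at teardown, regardless of when the user closes the loop.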
Thanks Peter
/merge

Thanks all for the reviews here!
Implements the blocking progress mode (the UCX-Py default), which had not yet been implemented in UCXX.
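As a rough illustration of what "blocking progress mode" means here (a sketch only; worker.epoll_file_descriptor, worker.arm() and worker.progress() are assumed names standing in for the real UCX worker API): instead of busy-polling, the worker is progressed only when its epoll file descriptor reports an event, and the fd is handed to asyncio so that epoll_wait in the event loop's selector does the blocking.

import asyncio


async def blocking_progress_sketch(worker):
    # Sketch of blocking progress; `worker` is assumed to expose
    # `epoll_file_descriptor`, `arm()` and `progress()` (hypothetical names).
    loop = asyncio.get_running_loop()
    event = asyncio.Event()

    # Let the event loop's selector (epoll_wait) watch the worker's fd and
    # wake us when the worker has events to progress.
    loop.add_reader(worker.epoll_file_descriptor, event.set)
    try:
        while True:
            worker.progress()      # drain anything currently pending
            if worker.arm():       # armed: safe to block until the fd fires
                await event.wait()
                event.clear()
            else:
                # Could not arm (events still pending); yield to other tasks
                # and progress again.
                await asyncio.sleep(0)
    finally:
        loop.remove_reader(worker.epoll_file_descriptor)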