fix send race by handling logs in the main thread when async #1831

technillogue · 2024-07-26T04:13:58Z

If events can be written from an event loop and from a thread, some race conditions can occur. This PR tries to fix this by conditionally moving the stream redirector into the main thread when we switch to async, and registering the file descriptors with the main select loop instead of using epoll in a thread.
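As a rough illustration of registering a file descriptor with the event loop rather than polling it from a thread, here is a minimal sketch using `loop.add_reader`. It is not the code in this PR: the function name `read_redirected_output` and the plain `os.pipe()` standing in for a redirected stdout/stderr are assumptions for the example, and it assumes a Unix selector-based event loop.

```python
import asyncio
import os


async def read_redirected_output() -> list[bytes]:
    """Collect chunks written to a pipe using the event loop, with no reader thread."""
    loop = asyncio.get_running_loop()
    read_fd, write_fd = os.pipe()
    os.set_blocking(read_fd, False)

    chunks: list[bytes] = []
    done = asyncio.Event()

    def on_readable() -> None:
        # Called in the event loop's thread when the pipe has data or reaches EOF.
        data = os.read(read_fd, 4096)
        if data:
            chunks.append(data)
        else:
            # Empty read means the write end was closed.
            loop.remove_reader(read_fd)
            done.set()

    loop.add_reader(read_fd, on_readable)

    # Hypothetical stand-in for a predictor printing to a redirected stream.
    os.write(write_fd, b"log line\n")
    os.close(write_fd)

    await done.wait()
    os.close(read_fd)
    return chunks


if __name__ == "__main__":
    print(asyncio.run(read_redirected_output()))
```

Because the callback runs on the loop's own thread, anything it forwards to a StreamWriter is written from the same thread as the other coroutines.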

This solves a problem similar to #1758: currently, emitting a metric or yielding an output can in some cases result in an EAGAIN error, and we believe Done events are sometimes dropped. In async land, we shouldn't need a lock; multiple writers should be able to write to a StreamWriter safely. We also keep a sync Lock when the predictor is not async.
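For context on the EAGAIN error, the sketch below shows how a non-blocking socket raises it once the kernel buffer is full and nobody is reading. It uses a plain `socket.socketpair()` as an assumed stand-in for the worker's connection, and `fill_until_eagain` is a hypothetical name; it does not reproduce the race itself.

```python
import errno
import socket


def fill_until_eagain() -> BlockingIOError:
    """Send on a non-blocking socket until the kernel refuses more data."""
    writer, reader = socket.socketpair()
    writer.setblocking(False)
    try:
        while True:
            # Nothing reads from `reader`, so the buffer eventually fills up.
            writer.send(b"x" * 65536)
    except BlockingIOError as exc:
        # Python maps EAGAIN / EWOULDBLOCK to BlockingIOError.
        return exc
    finally:
        writer.close()
        reader.close()


if __name__ == "__main__":
    err = fill_until_eagain()
    print(err.errno in (errno.EAGAIN, errno.EWOULDBLOCK))
```

With `setblocking(True)` the same `send` call would wait for buffer space instead of raising.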

However, I'm still not sure if StreamWriter is thread-safe.

After some research, StreamWriter.write calls self._transport.write and when StreamWriter is wrapping a unix socket and using the default selector, the _transport is a _SelectorSocketTransport, which calls [self._buffer.extend](https://github.com/python/cpython/blob/3.11/Lib/asyncio/selector_events.py#L1077), and self._buffer is a [bytearray](https://github.com/python/cpython/blob/3.11/Lib/asyncio/selector_events.py#L761). However, on 3.12, [append is called](https://github.com/python/cpython/blob/3.12/Lib/asyncio/selector_events.py#L1091) instead of extend.

After more soul-searching, I think write should be basically thread-safe: the two critical calls, socket.send() and bytearray.append/extend, should each be represented as a single bytecode instruction and some native code which works on python objects and does not release the GIL. As far as I understand, python threads are switched "between bytecodes". Even though the individual calls are thread-safe, the overall write method is not quite thread-safe: a relevant thread switch could occur between the if not self._buffer, _sock.send, and _buffer.extend lines. However, in that case the race condition would result in data being sent out of order or incorrectly delayed/sent early, but I think it shouldn't be corrupted, and we don't really care about the ordering of log lines vs outputs.

I can see two solutions:

1. Ideally, we would probably add an alternate implementation of StreamWriter that uses the main event loop and wraps stderr/stdout in StreamReaders, start a separate task for each stream, and use await wrapped_stream.read() so that threads are not necessary. There's a tricky moment where we need to use the threaded StreamRedirector to capture logs while the predictor is being imported, since we don't know if we're going to be async or not.
2. Alternatively, we could try to use a queue or deque to communicate between threads, so instead of calling stream_write_hook, StreamRedirector would do queue.put, and the main event loop would do queue.get and not worry about thread safety for the pipe. The problem with this is that asyncio.Queue is not thread-safe and queue.Queue could block the event loop. You could use a busy wait or similar with get_nowait, but that has other downsides.
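For the queue option, one commonly used pattern (not something this PR implements) is to keep an asyncio.Queue but have the thread schedule its puts onto the loop with `loop.call_soon_threadsafe`, so the queue is only ever touched from the loop's thread. The sketch below assumes a hypothetical `redirector_thread` standing in for StreamRedirector's reader thread and a `None` sentinel to signal the end of the logs.

```python
import asyncio
import threading


async def main() -> list[str]:
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def redirector_thread() -> None:
        # Hypothetical stand-in for the thread that captures stdout/stderr.
        for line in ("first\n", "second\n"):
            # asyncio.Queue is not thread-safe, so ask the loop to run
            # put_nowait on its own thread instead of calling it here.
            loop.call_soon_threadsafe(queue.put_nowait, line)
        # Sentinel: tells the consumer there are no more lines.
        loop.call_soon_threadsafe(queue.put_nowait, None)

    thread = threading.Thread(target=redirector_thread)
    thread.start()

    received = []
    # Awaiting queue.get() yields to other tasks instead of blocking the loop.
    while (item := await queue.get()) is not None:
        received.append(item)

    thread.join()
    return received


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Callbacks scheduled this way run in the order they were submitted, so lines from a single thread arrive in order. Whether the extra hop is acceptable for log latency is a separate question.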

previously, AsyncConnection would only be used in _loop_async, and _events.send would always be used, which usually immediately makes a write(2) syscall. in contrast, AsyncConnection has a StreamWriter, which should be safe to call from different coroutines.

nickstenning · 2024-07-26T15:39:42Z

Without a failing test to demonstrate the problem you're trying to solve, I really don't know how to evaluate this. Based on my limited understanding, this could fix the problem, but it could also not. It could also introduce new bugs that are even worse than the one we're trying to fix. I just don't have any framework to evaluate the change.

If we can't reproduce the bug in a test, I do wonder whether we really understand what the bug we're chasing even is.

technillogue · 2024-07-26T21:23:03Z

thanks, that's very reasonable, I'll get on a test

mattt · 2024-07-31T12:39:33Z

python/cog/server/connection.py

+        # we don't want to see EAGAIN, we'd rather wait
+        # however, perhaps this is wrong and in some cases this could still block terribly
+        # sock.setblocking(False)
+        sock.setblocking(True)


I'm having trouble understanding this in the context of the commented out code. Are those concerns for setblocking(True)? It'd be nice for this comment to provide enough relevant context for someone to pick this up if we need to revisit this behavior.

technillogue added 8 commits July 26, 2024 00:01

use a lock to protect sync writes

1cd6c81

process_log_queue + a lot of debugging

593b3f6

don't start asyncconn or async queue processing during setup, only fo…

a59f334

…r loop

sketch

d9816c1

add a switch_to_async method to StreamRedirector and drop process_log…

2d6c0d8

…_queue

drain before starting new readers and don't strip newlines

affb023

maybe fix the race condition

90f3f07

technillogue requested a review from nickstenning July 26, 2024 04:13

drop debug

5417f10

technillogue force-pushed the syl/cleaner-review-fix-send-race-async branch from 1f08a8b to 5417f10 Compare July 26, 2024 04:16

mattt reviewed Jul 31, 2024

mattt mentioned this pull request Jul 31, 2024

Fix send race [debug] #1786

Closed

technillogue changed the title ~~fix send race [review]~~ fix send race Oct 22, 2024

technillogue changed the title ~~fix send race~~ fix send race by handling logs in the main thread when async Oct 22, 2024
