Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix double-fork hang on Windows/ARM64 #73

Merged
merged 2 commits into from
Nov 14, 2024

Conversation

dscho
Copy link
Member

@dscho dscho commented Sep 26, 2024

This is yet another new attempt to address the hangs we observed e.g. in msys2/msys2-autobuild#62, based on the new insight that code-cache locks causes the hangs because the associated threads are already gone.

The idea here is to suspend the wait_thread (calling GetThreadContext() to actually ensure that it is suspended before continuing) before terminating it.

@jeremyd2019
Copy link

jeremyd2019 commented Sep 26, 2024

I tried the detach (the patch in the comment above) first. and that blew up pretty spectacularly. I think I tried something like if (!Cancel...) thr->terminate_thread (); and that would eventually hang. I suspect the change proposed here (in 588f46a) would eventually crash (based on my debugging/testing/hacking on this issue). TLDR: CancelSynchronousIo seemed to help, but the (more rare) case where it returns FALSE is still a problem that I couldn't figure out.

@dscho
Copy link
Member Author

dscho commented Sep 26, 2024

In particular, I want to test whether we should imitate the pattern in flock.cc even more faithfully, via something like this: [...]

I should learn how to read: @jeremyd2019 already reported in https://inbox.sourceware.org/cygwin-developers/[email protected]/ that:

[...] I first tried

+	      if (CancelSynchronousIo (chld_procs[i].wait_thread->thread_handle ()))
+		chld_procs[i].wait_thread->detach ();
+	      else
+		chld_procs[i].wait_thread->terminate_thread ();

but that resulted in a (debuggable) hang in detach, because the
cygthread::stub was waiting for thread_sync, while cygthread::detach was
waiting for *this. That appears to be because this is an auto-releasing
cygthread. It kind of bothers me that there is no synchronization to be
sure the wait_thread is done shutting down before moving on in
proc_terminate, but I don't see an obvious way in the current structure.

@dscho
Copy link
Member Author

dscho commented Sep 26, 2024

@jeremyd2019 FWIW I just tried to reproduce the hang in a VS Code terminal, running update-via-pacman.ps1 in a Git for Windows SDK (which was pretty reliable for me), and I could not get it to hang... Maybe I am doing something subtly different than I used to. Or maybe it is the update to MSYS2 runtime v3.5.4? Here's hoping...

@jeremyd2019
Copy link

jeremyd2019 commented Sep 26, 2024

I think if you tried the double-fork reproducer I had (https://gist.github.com/jeremyd2019/3156721497096d0bba00ef19a507f619) over and over and over, eventually you would see a crash. For me, the QC710 seemed to have an issue sooner than the other devices I have (2023 Dev Kit & Raspberry Pi 4 running Windows 10 - necessarily i686 cygwin 3.3 on that one).

Particularly, the crash showed up in the exit status check https://gist.github.com/jeremyd2019/3156721497096d0bba00ef19a507f619#file-testfork-c-L56

@dscho
Copy link
Member Author

dscho commented Sep 26, 2024

@jeremyd2019 right you are; the testfork.exe example hung in something like 20-30 of the first 100 forks.

@dscho dscho force-pushed the try-to-fix-hangs-on-arm64 branch from 588f46a to 6469fc6 Compare November 12, 2024 21:21
@dscho
Copy link
Member Author

dscho commented Nov 12, 2024

@jeremyd2019 I hope that you don't mind that inspired by your report, I lifted your commit from msys2-3.5.4-suspendthread; I am really eager to test it, and if successful, want to integrate it early into Git for Windows' fork of the MSYS2 runtime.

@jeremyd2019
Copy link

The more testing the better! Personally, I was planning to wait until upstream had a chance to look at it before opening PRs for msys2-runtime.

@dscho
Copy link
Member Author

dscho commented Nov 13, 2024

Good news! I have been running the reproducer for almost an hour now, without reproducing the dead-lock. Will keep it running for a while, but this is exciting news!

@dscho
Copy link
Member Author

dscho commented Nov 13, 2024

@dennisameling would you have time to craft a git-sdk-arm64 PR that downloads this PR build's artifact, replaces msys-2.0.dll, disables the IgnorePkg line in /etc/pacman.conf, and then runs update-via-pacman.ps1 twice (this script was the not-so-minimal reproducer of the problem, originally, right)?

@dennisameling
Copy link

Thanks for looking into this! My agenda is jam-packed for the rest of the week, so earliest I can do is Saturday. Will block some time then and will post an update here when I have something 👍🏼

@dscho
Copy link
Member Author

dscho commented Nov 13, 2024

Thanks for looking into this! My agenda is jam-packed for the rest of the week, so earliest I can do is Saturday. Will block some time then and will post an update here when I have something 👍🏼

Oh, no worries, if you're even more jam-packed than I am, I'll try to get to it later today.

dscho added a commit to git-for-windows/git-sdk-arm64 that referenced this pull request Nov 13, 2024
…indows/ARM64 deadlocks

I updated git-for-windows/msys2-runtime#73 to
build the newest iteration of the dead-lock workaround, which worked
100% in my tests.

Let's verify that it works around the issues in `update-via-pacman.ps1`,
too.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to git-for-windows/git-sdk-arm64 that referenced this pull request Nov 13, 2024
…indows/ARM64 deadlocks

I updated git-for-windows/msys2-runtime#73 to
build the newest iteration of the dead-lock workaround, which worked
100% in my tests.

Let's verify that it works around the issues in `update-via-pacman.ps1`,
too.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to git-for-windows/git-sdk-arm64 that referenced this pull request Nov 13, 2024
…indows/ARM64 deadlocks

I updated git-for-windows/msys2-runtime#73 to
build the newest iteration of the dead-lock workaround, which worked
100% in my tests.

Let's verify that it works around the issues in `update-via-pacman.ps1`,
too.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to git-for-windows/git-sdk-arm64 that referenced this pull request Nov 13, 2024
…indows/ARM64 deadlocks

I updated git-for-windows/msys2-runtime#73 to
build the newest iteration of the dead-lock workaround, which worked
100% in my tests.

Let's verify that it works around the issues in `update-via-pacman.ps1`,
too.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to git-for-windows/git-sdk-arm64 that referenced this pull request Nov 13, 2024
…indows/ARM64 deadlocks

I updated git-for-windows/msys2-runtime#73 to
build the newest iteration of the dead-lock workaround, which worked
100% in my tests.

Let's verify that it works around the issues in `update-via-pacman.ps1`,
too.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to git-for-windows/git-sdk-arm64 that referenced this pull request Nov 13, 2024
…indows/ARM64 deadlocks

I updated git-for-windows/msys2-runtime#73 to
build the newest iteration of the dead-lock workaround, which worked
100% in my tests.

Let's verify that it works around the issues in `update-via-pacman.ps1`,
too.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to git-for-windows/git-sdk-arm64 that referenced this pull request Nov 13, 2024
…indows/ARM64 deadlocks

I updated git-for-windows/msys2-runtime#73 to
build the newest iteration of the dead-lock workaround, which worked
100% in my tests.

Let's verify that it works around the issues in `update-via-pacman.ps1`,
too.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to git-for-windows/git-sdk-arm64 that referenced this pull request Nov 13, 2024
…indows/ARM64 deadlocks

I updated git-for-windows/msys2-runtime#73 to
build the newest iteration of the dead-lock workaround, which worked
100% in my tests.

Let's verify that it works around the issues in `update-via-pacman.ps1`,
too.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to git-for-windows/git-sdk-arm64 that referenced this pull request Nov 13, 2024
…indows/ARM64 deadlocks

I updated git-for-windows/msys2-runtime#73 to
build the newest iteration of the dead-lock workaround, which worked
100% in my tests.

Let's verify that it works around the issues in `update-via-pacman.ps1`,
too.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho added a commit to git-for-windows/git-sdk-arm64 that referenced this pull request Nov 13, 2024
…indows/ARM64 deadlocks

I updated git-for-windows/msys2-runtime#73 to
build the newest iteration of the dead-lock workaround, which worked
100% in my tests.

Let's verify that it works around the issues in `update-via-pacman.ps1`,
too.

Signed-off-by: Johannes Schindelin <[email protected]>
@dscho
Copy link
Member Author

dscho commented Nov 13, 2024

I'll try to get to it later today.

Well, this was more tedious than I wanted, but I finally got through all the usual GitHub Actions debugging until I got the workflow definition right.

And it works as I hoped!

@dennisameling with this, I am willing to start that beta phase of Git for Windows/ARM64.

@jeremyd2019 This is another piece of evidence that your patch is the long-awaited work-around for our problems!

@dscho dscho marked this pull request as ready for review November 13, 2024 16:11
@dscho
Copy link
Member Author

dscho commented Nov 13, 2024

/open pr

The workflow run was started

@jeremyd2019
Copy link

Watch out for msys2/msys2-runtime#234 (comment), potential fixup msys2/msys2-runtime#236

This addresses an extremely difficult to debug deadlock when running
under emulation on ARM64.

A relatively easy way to trigger this bug is to call `fork()`, then within the
child process immediately call another `fork()` and then `exit()` the
intermediate process.

It would seem that there is a "code emulation" lock on the wait thread at
this stage, and if the thread is terminated too early, that lock still exists
albeit without a thread, and nothing moves forward.

It seems that a `SuspendThread()` combined with a `GetThreadContext()`
(to force the thread to _actually_ be suspended, for more details see
https://devblogs.microsoft.com/oldnewthing/20150205-00/?p=44743)
makes sure the thread is "booted" from emulation before it is suspended.

Hopefully this means it won't be holding any locks or otherwise leave
emulation in a bad state when the thread is terminated.

Also, attempt to use `CancelSynchonousIo()` (as seen in `flock.cc`) to avoid
the need for `TerminateThread()` altogether.  This doesn't always work,
however, so was not a complete fix for the deadlock issue.

Cherry picked from commit e09c64ef65 (cygthread: suspend thread before
terminating., 2024-11-11) from msys2/msys2-runtime.

Addresses: https://cygwin.com/pipermail/cygwin-developers/2024-May/012694.html
Signed-off-by: Jeremy Drake <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
Suppress error output if ReadFile on child wait pipe returns
ERROR_OPERATION_ABORTED due to addition of CancelSynchronousIo call.

Cherry picked from commit eafd9a22bc (fixup! cygthread: suspend thread
before terminating., 2024-11-13) from msys2/msys2-runtime.

Signed-off-by: Johannes Schindelin <[email protected]>
@dscho dscho closed this Nov 14, 2024
@dscho dscho force-pushed the try-to-fix-hangs-on-arm64 branch from 6469fc6 to 54a7252 Compare November 14, 2024 11:15
@dscho dscho reopened this Nov 14, 2024
@dscho
Copy link
Member Author

dscho commented Nov 14, 2024

/open pr

The workflow run was started

github-actions bot pushed a commit to git-for-windows/build-extra that referenced this pull request Nov 14, 2024
On Windows/ARM64, running the 64-bit version of Git for Windows could
infrequently cause deadlocked threads (see e.g. [this
report](msys2/msys2-autobuild#62) or [this
one](https://inbox.sourceware.org/cygwin-developers/[email protected]/)),
[which was
addressed](git-for-windows/msys2-runtime#73).

Signed-off-by: gitforwindowshelper[bot] <[email protected]>
@dscho dscho merged commit d937dbd into git-for-windows:main Nov 14, 2024
2 checks passed
@dscho dscho deleted the try-to-fix-hangs-on-arm64 branch November 14, 2024 13:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants