
--threads -t crash: Failed to PTRACE_GETREGS: No such process #93

Closed
fillest opened this issue Jul 29, 2017 · 10 comments · Fixed by #94

Comments

@fillest

fillest commented Jul 29, 2017

pyflame 1.5.0 (compiled from d1a76174e6b570c7e98af79a694f8769271f922a)
Python 2.7.12
$ uname -a
Linux test-VirtualBox 4.10.0-28-generic #32~16.04.2-Ubuntu SMP Thu Jul 20 10:19:48 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
(I had the same problem on Ubuntu 14, IIRC)
$ cat pyflametest.py
import time

def sleep300 ():
    time.sleep(.300)

def sleep700 ():
    time.sleep(.700)

def main ():
    t1 = time.time()
    while True:
        if time.time() - t1 >= 5:
            return
        sleep300()
        sleep700()

main()
works:
$ pyflame -t python pyflametest.py 
(idle) 3470
/usr/lib/python2.7/site.py:<module>:563;/usr/lib/python2.7/site.py:main:545;/usr/lib/python2.7/site.py:addusersitepackages:272;/usr/lib/python2.7/site.py:getusersitepackages:247;/usr/lib/python2.7/site.py:getuserbase:237;/usr/lib/python2.7/sysconfig.py:get_config_var:582;/usr/lib/python2.7/sysconfig.py:get_config_vars:509;/usr/lib/python2.7/re.py:<module>:105;/usr/lib/python2.7/sre_compile.py:<module>:15;/usr/lib/python2.7/sre_parse.py:<module>:706 1

crashes:
test@test-VirtualBox:~$ pyflame --threads -t python pyflametest.py
terminate called after throwing an instance of 'pyflame::PtraceException'
  what():  Failed to PTRACE_GETREGS: No such process
Aborted (core dumped)
$ make test
./runtests.sh
Running test suite against Python 2.7.12
.....s..............
19 passed, 1 skipped in 13.38 seconds
Running test suite against Python 3.5.2
Already using interpreter /usr/bin/python3
....................
20 passed in 14.47 seconds
@fillest
Author

fillest commented Jul 29, 2017

Attaching after the launch also works

$ python pyflametest.py & P=$!; pyflame --threads ${P} -s 3
[1] 10275
(idle) 1
pyflametest.py:<module>:17;pyflametest.py:main:15;pyflametest.py:sleep700:7 1344
pyflametest.py:<module>:17;pyflametest.py:main:14;pyflametest.py:sleep300:4 648
/usr/lib/python2.7/site.py:<module>:563;/usr/lib/python2.7/site.py:main:546;/usr/lib/python2.7/site.py:addsitepackages:328;/usr/lib/python2.7/site.py:addsitedir:190 1
  • But only if -s is shorter than the program's run time (e.g. -s 3 when the program runs for 5 seconds). If I set -s to something longer, I get the same crash:
$ python pyflametest.py & P=$!; pyflame --threads ${P} -s 10
[1] 10309
terminate called after throwing an instance of 'pyflame::PtraceException'
  what():  Failed to PTRACE_GETREGS: No such process
[1]+  Done                    python pyflametest.py
Aborted (core dumped)

@eklitzke
Contributor

Thanks for the report, I tried running your test case and I can reproduce the problem. Adding @jamespic who contributed the threading code and might have an idea.

The threading code is tricky because when --threads is used Pyflame will attempt to do a kind of remote code execution by actually calling Python functions, whereas when --threads is not used ptrace is only used to peek memory. It looks like the threading code does this because it needs to access a static variable interp_head (defined in Python/pystate.c), which is exposed via the function PyInterpreterState_Head(). I am generally not a fan of doing code execution with ptrace this way, since it's really hard to get right (I worked on another ptrace-based project where I was doing code execution, and ran into a ton of corner cases there too). The current code execution implementation is also platform-specific (it only works on x86-64), whereas the rest of Pyflame works on other architectures, which matters because I know of at least one person using Pyflame on ARM systems. On the other hand, I'm not sure there's a viable alternative, and in principle we should be able to get the code working, since debuggers like GDB can do code execution reliably.
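
For anyone unfamiliar with the technique, the general shape of that kind of ptrace-based remote call on x86-64 looks roughly like the sketch below. This is not Pyflame's actual code and all of the names are hypothetical; scratch_addr is assumed to be a writable and executable address already mapped into the tracee, and the tracee is assumed to be attached and stopped.

#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

/* Call func_addr in the traced process and return its %rax. Error handling
   is omitted to keep the sketch short. */
long remote_call(pid_t pid, uintptr_t func_addr, uintptr_t scratch_addr) {
    struct user_regs_struct saved, regs;
    ptrace(PTRACE_GETREGS, pid, 0, &saved);   /* save the tracee's registers */
    regs = saved;

    /* Plant an int3 at the scratch address; it doubles as the fake return
       address so the tracee traps back to us when the call finishes. */
    long orig = ptrace(PTRACE_PEEKDATA, pid, (void *)scratch_addr, 0);
    ptrace(PTRACE_POKEDATA, pid, (void *)scratch_addr,
           (void *)((orig & ~0xffL) | 0xcc));

    /* Build a call frame: skip the red zone, align the stack, push the fake
       return address, and point %rip at the function we want to run. */
    regs.rsp -= 128;
    regs.rsp &= ~0xfULL;
    regs.rsp -= sizeof(uintptr_t);
    ptrace(PTRACE_POKEDATA, pid, (void *)regs.rsp, (void *)scratch_addr);
    regs.rip = func_addr;
    ptrace(PTRACE_SETREGS, pid, 0, &regs);

    /* Run until the int3 fires, read the return value (%rax in the x86-64
       calling convention), then put everything back the way it was. */
    ptrace(PTRACE_CONT, pid, 0, 0);
    int status;
    waitpid(pid, &status, 0);
    ptrace(PTRACE_GETREGS, pid, 0, &regs);
    long result = regs.rax;

    ptrace(PTRACE_POKEDATA, pid, (void *)scratch_addr, (void *)orig);
    ptrace(PTRACE_SETREGS, pid, 0, &saved);
    return result;
}

Every one of those steps has corner cases (signals arriving mid-call, stack alignment, the child exiting), which is exactly why this kind of thing is hard to get right.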

I'll spend some time thinking about the issue and try to make some time to dig into this problem later this week. In the meantime, if you are able to get a core dump or stack trace of the Python process when it crashes, that would be helpful. The PTRACE_GETREGS error happens because Pyflame puts the Python process into a bad state and the Python process itself crashes. It looks like on my system I only get a core dump from Pyflame, not from the Python process it is profiling. I forget if that's a ptrace limitation or not (hopefully not!).

@eklitzke
Contributor

One way that this could be fixed would be to disassemble PyInterpreterState_Head and get the address of interp_head from that. This is probably even more evil/fragile, but I think it would be pretty simple. Here's the C code for the function:

PyInterpreterState *
PyInterpreterState_Head(void)
{
    return interp_head;
}

This is trivially compiled to a mov + ret:

(gdb) disas /r PyInterpreterState_Head
Dump of assembler code for function PyInterpreterState_Head:
   0x000000000007171b <+0>:	48 8b 05 ae ea 3c 00	mov    0x3ceaae(%rip),%rax        # 0x4401d0 <interp_head>
   0x0000000000071722 <+7>:	c3	retq   
End of assembler dump.

The math is pretty simple. The function starts with a mov that contains a relative address. If you take 0x7171b (the address of PyInterpreterState_Head) plus 0x3ceaae (the offset encoded in the mov) plus 7 (the width of the mov instruction), you get 0x4401d0, which is the correct address for the interp_head symbol. This is still going to be platform-specific, but it does eliminate the need to do code execution, since once the function is disassembled we can just read interp_head directly.
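
Spelled out as code, the resolution is just the following (values hard-coded from the disassembly above; the helper name is ours, not anything in Pyflame):

#include <stdint.h>
#include <stdio.h>

/* A %rip-relative operand is resolved against the address of the *next*
   instruction, i.e. the instruction's own address plus its length. */
uintptr_t resolve_rip_relative(uintptr_t insn_addr, size_t insn_len, int32_t disp) {
    return insn_addr + insn_len + (intptr_t)disp;
}

int main(void) {
    /* mov at 0x7171b, 7 bytes long, displacement 0x3ceaae */
    printf("%#lx\n", (unsigned long)resolve_rip_relative(0x7171b, 7, 0x3ceaae));
    /* prints 0x4401d0, the address of interp_head */
    return 0;
}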

If a future version of Python changed the implementation, the assumption that the function can be disassembled this way would break, but it looks like this code hasn't changed since at least 2.7, if not earlier.

You can also access this data via DWARF, which is more portable and more correct, but probably trickier to use. Using DWARF would also require having debug symbols available, which isn't that big a deal, but it's something to think about.

@jamespic
Contributor

I did consider using the disassembly method, but the project I'm using PyFlame on compiles its own Python executable with some weird compiler flags that make the disassembly for that function different (the assembly you posted looks like PIC, so you'd get different assembly on statically addressed builds).

One option I did ponder is that, since that function is pure, it could be copied into PyFlame's address space and executed there. At the time that seemed harder, but a couple of workarounds later, I'm not so sure.

I'll try to reproduce this issue when I get a chance. My best guess is that it's related to the cleanup code: when PyFlame exits, it cleans up the extra chunk of memory that it mmapped into the profiled process's address space, and maybe that's failing if the process has already terminated.

@eklitzke
Contributor

Here's a proof of concept of how you would do this with disassembly: c9165de

I'm using Capstone, but the code is simple enough that the x86 decoding logic could be hard-coded. Capstone is a bit more "portable" in the sense that it can also decode ARM etc., but this code is extremely implementation-dependent, so I'm not sure that's a real win. You could handle static/PIC cases this way, since Capstone knows how to decode all of the different ways of accessing memory, although it might end up becoming a bit of a mess.
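
For reference, the core of that decoding with Capstone's C API is only a few lines. This is a standalone sketch rather than the code in c9165de; the instruction bytes are hard-coded from the disassembly above, whereas in practice they would be read out of the traced process at the address of PyInterpreterState_Head.

#include <capstone/capstone.h>
#include <stdio.h>

int main(void) {
    /* 48 8b 05 ae ea 3c 00    mov 0x3ceaae(%rip),%rax */
    const uint8_t code[] = {0x48, 0x8b, 0x05, 0xae, 0xea, 0x3c, 0x00};
    uint64_t addr = 0x7171b;   /* where the function lives in the tracee */

    csh handle;
    cs_insn *insn;
    if (cs_open(CS_ARCH_X86, CS_MODE_64, &handle) != CS_ERR_OK)
        return 1;
    cs_option(handle, CS_OPT_DETAIL, CS_OPT_ON);   /* needed to get operands */

    size_t count = cs_disasm(handle, code, sizeof(code), addr, 1, &insn);
    if (count == 1) {
        cs_x86 *x86 = &insn[0].detail->x86;
        for (uint8_t i = 0; i < x86->op_count; i++) {
            cs_x86_op *op = &x86->operands[i];
            if (op->type == X86_OP_MEM && op->mem.base == X86_REG_RIP) {
                /* resolve the %rip-relative reference: next insn + disp */
                uint64_t target = addr + insn[0].size + op->mem.disp;
                printf("interp_head at %#llx\n", (unsigned long long)target);
            }
        }
        cs_free(insn, count);
    }
    cs_close(&handle);
    return 0;
}

(Compile with -lcapstone.) A statically addressed build would show up as a memory operand with no base register and an absolute displacement, which the same detail structures cover.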

This code seems to crash eventually as well, and it has some other problems, but it shows the general idea of how the approach would work. It would still be nice to know why the current calling code fails, though, because fixing that might end up being simpler than disassembly once all of the corner cases are handled.

@eklitzke
Contributor

This is actually way simpler than I thought -- the code path for cleaning up when --threads is enabled just doesn't handle the case where the child has exited. PR #94 has a fix (and adds a test case).
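
The shape of the problem and the fix is roughly the following (hypothetical helper, not the actual PR #94 diff): during --threads cleanup, ptrace calls against a child that has already exited fail with ESRCH, and the cleanup path has to treat that as "nothing left to clean up" instead of throwing.

#include <errno.h>
#include <stdbool.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Returns false if the tracee is already gone, in which case the caller
   should skip the rest of the cleanup (e.g. unmapping the injected page). */
bool detach_if_alive(pid_t pid) {
    if (ptrace(PTRACE_DETACH, pid, 0, 0) == -1) {
        if (errno == ESRCH)
            return false;   /* child already exited: nothing to detach from */
        /* other errors are still unexpected and worth reporting */
    }
    return true;
}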

@eklitzke
Contributor

eklitzke commented Aug 1, 2017

I've tagged v1.5.1 which fixes this.

@fillest
Author

fillest commented Aug 1, 2017

Thanks! It works much better now, but it still sometimes crashes very early:

$ pyflame --threads -t python pyflametest.py
Failed to waitpid(), unexpectedly got status: Stopped (signal)

and the Ubuntu crash reporter shows a SIGSEGV in python pyflametest.py.
I will try to reproduce from a cleaner state (e.g. VirtualBox behaves a bit strangely after suspend/resume), capture a core dump, and re-open if the problem remains.

akatrevorjay added a commit to akatrevorjay/pyflame that referenced this issue Aug 4, 2017
* origin/master: (42 commits)
  refactor the main probe loop (uber-archive#108)
  be less aggressive about probing for libpython in attach mode (uber-archive#107)
  add version string to -h output
  Switch to PTRACE_TRACEME to fix startup race conditions. (uber-archive#106)
  tag v1.5.2
  fix a bug with venv deactivation in runtests.sh (uber-archive#105)
  augment version string (uber-archive#103)
  remove pointless build note (uber-archive#101)
  Switch to -p to specify the process to trace, fixes uber-archive#99 (uber-archive#100)
  explain how --abi works
  relase v1.5.1, which fixes uber-archive#93
  convert the man page to pandoc, fixes uber-archive#97
  Fix a bug in PtraceDetach when threads are enabled (uber-archive#94)
  detect when test suite is run from within a virtualenv (uber-archive#95)
  remove an unneeded skipif and yapf the tests
  tag 1.5.0 release
  use skipif here as well
  Unicode support (uber-archive#92)
  Various autoconf improvements.
  move all Python.h checks to configure.ac
  ...
@eklitzke
Contributor

eklitzke commented Aug 8, 2017

@fillest Probably the same issue as #114. My (unconfirmed) theory on that one is that Pyflame is getting wait events from grandchild processes, which confuses it. I should have some time to look at this soon (next week or so). If you are able to reproduce this issue reliably, though, that would be great, just in case they are different.

@eklitzke
Contributor

eklitzke commented Aug 9, 2017

@fillest Would you mind testing against master (specifically, d1942a5 or later, also tagged as v1.5.5)? I fixed #114 in that change, and am interested to know if it fixes your issue or not.

Since you are getting a segfault, I don't think my change will fix your crash, but it would help to know whether you are still seeing crashes. And of course, anything you can do to repro an issue would be great.
