Skip to content
This repository has been archived by the owner on Nov 6, 2019. It is now read-only.

Intermittent and undetected Firefox crashes #592

Open
jugglinmike opened this issue Aug 24, 2018 · 17 comments
Open

Intermittent and undetected Firefox crashes #592

jugglinmike opened this issue Aug 24, 2018 · 17 comments
Assignees

Comments

@jugglinmike
Copy link
Collaborator

During the week of 2018-0819, Firefox has been sporadically reporting incomplete results. This project refuses to upload results to wpt.fyi when tests defined in the manifest are not available in the report, so this has led to missing reports for some revisions of WPT.

The logs seem to indicate a problem between the browser and the virtual X display:

pid:4949 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
 5:49.31 pid:4949 Unable to init server: Could not connect: Connection refused
 5:49.31 pid:4949 Error: cannot open display: :106
 7:49.26 WARNING Failed to start protocol connection

I'll include more context for that output and a link to the full log below.

After comparing all the recent failures on http://builds.wpt.fyi, I was unable to correlate the failures with the revision of WPT, the release channel of Firefox, the machine running the test, or the "chunk" of tests being executed.

My initial idea was that this may be a bug in Geckodriver. While an issue was recently filed about this particular error, the reporter was struggling with a misconfigured system. In our case, all of the machines continue to demonstrate proper configuration through successful collection attempts. The problem we're experiencing is sporadic, and it occurs without any change to system configuration.

It's possible that we're (once again) carrying some faulty state between builds. Maybe a defunct X server is interfering here. In that case, though, I'd be curious to learn why only Firefox is affected.

Extended excerpt showing the context of that output
2018-08-21 14:08:50,609 INFO validate-wpt-results wpt-run:stdout  5:11.55 INFO Setting up ssl
2018-08-21 14:08:50,977 INFO validate-wpt-results wpt-run:stdout  5:11.90 certutil
2018-08-21 14:08:51,114 INFO validate-wpt-results wpt-run:stdout  5:12.05 certutil
2018-08-21 14:08:51,164 INFO validate-wpt-results wpt-run:stdout  5:12.11 certutil
2018-08-21 14:08:51,165 INFO validate-wpt-results wpt-run:stdout Certificate Nickname                                         Trust Attributes
2018-08-21 14:08:51,165 INFO validate-wpt-results wpt-run:stdout                                                              SSL,S/MIME,JAR/XPI
2018-08-21 14:08:51,165 INFO validate-wpt-results wpt-run:stdout 
2018-08-21 14:08:51,165 INFO validate-wpt-results wpt-run:stdout web-platform-tests                                           CT,,
2018-08-21 14:08:51,165 INFO validate-wpt-results wpt-run:stdout 
2018-08-21 14:08:51,204 INFO validate-wpt-results wpt-run:stdout  5:12.15 INFO Application command: /home/selina/worker/Local_Chunked_Runner/build/firefox/firefox --marionette about:blank -profile /tmp/tmpcmbGWL.mozrunner
2018-08-21 14:08:51,266 INFO validate-wpt-results wpt-run:stdout  5:12.21 INFO Starting runner
2018-08-21 14:09:01,554 INFO validate-wpt-results wpt-run:stdout  5:22.48 pid:4599 Full command: /home/selina/worker/Local_Chunked_Runner/build/firefox/firefox --marionette about:blank -profile /tmp/tmpcmbGWL.mozrunner
2018-08-21 14:09:01,555 INFO validate-wpt-results wpt-run:stdout pid:4599 1534860541537	Marionette	INFO	Listening on port 2828
2018-08-21 14:09:03,591 INFO validate-wpt-results wpt-run:stdout  5:24.54 TEST_START: /css/CSS2/selectors/first-letter-punctuation-059.xht
2018-08-21 14:09:04,780 INFO validate-wpt-results wpt-run:stdout  5:25.72 pid:4599 1534860544780	Marionette	INFO	Testing http://web-platform.test:8000/css/CSS2/selectors/first-letter-punctuation-059.xht == http://web-platform.test:8000/css/CSS2/selectors/first-letter-punctuation-059-ref.xht
2018-08-21 14:09:05,617 INFO validate-wpt-results wpt-run:stdout  5:26.56 pid:4599 1534860545617	Marionette	INFO	Found 522 pixels different, maximum difference per channel 255
2018-08-21 14:09:05,945 INFO validate-wpt-results wpt-run:stdout  5:26.89 TEST_END: FAIL, expected PASS - Testing http://web-platform.test:8000/css/CSS2/selectors/first-letter-punctuation-059.xht == http://web-platform.test:8000/css/CSS2/selectors/first-letter-punctuation-059-ref.xht
2018-08-21 14:09:05,946 INFO validate-wpt-results wpt-run:stdout 
2018-08-21 14:09:07,219 INFO validate-wpt-results wpt-run:stdout  5:28.16 pid:4599 *** UTM:SVC TimerManager:registerTimer called after profile-before-change notification. Ignoring timer registration for id: telemetry_modules_ping
2018-08-21 14:09:08,634 INFO validate-wpt-results wpt-run:stdout  5:29.58 INFO Browser exited with return code 0
2018-08-21 14:09:08,635 INFO validate-wpt-results wpt-run:stdout  5:29.58 WARNING u'runner_teardown': ()
2018-08-21 14:09:08,720 INFO validate-wpt-results wpt-run:stdout  5:29.66 INFO Setting up ssl
2018-08-21 14:09:09,002 INFO validate-wpt-results wpt-run:stdout  5:29.93 certutil
2018-08-21 14:09:09,270 INFO validate-wpt-results wpt-run:stdout  5:30.21 certutil
2018-08-21 14:09:09,336 INFO validate-wpt-results wpt-run:stdout  5:30.28 certutil
2018-08-21 14:09:09,336 INFO validate-wpt-results wpt-run:stdout Certificate Nickname                                         Trust Attributes
2018-08-21 14:09:09,336 INFO validate-wpt-results wpt-run:stdout                                                              SSL,S/MIME,JAR/XPI
2018-08-21 14:09:09,337 INFO validate-wpt-results wpt-run:stdout 
2018-08-21 14:09:09,337 INFO validate-wpt-results wpt-run:stdout web-platform-tests                                           CT,,
2018-08-21 14:09:09,337 INFO validate-wpt-results wpt-run:stdout 
2018-08-21 14:09:09,383 INFO validate-wpt-results wpt-run:stdout  5:30.33 INFO Application command: /home/selina/worker/Local_Chunked_Runner/build/firefox/firefox --marionette about:blank -profile /tmp/tmpf1weuw.mozrunner
2018-08-21 14:09:09,466 INFO validate-wpt-results wpt-run:stdout  5:30.41 INFO Starting runner
2018-08-21 14:09:19,647 INFO validate-wpt-results wpt-run:stdout  5:40.59 pid:4761 Full command: /home/selina/worker/Local_Chunked_Runner/build/firefox/firefox --marionette about:blank -profile /tmp/tmpf1weuw.mozrunner
2018-08-21 14:09:19,647 INFO validate-wpt-results wpt-run:stdout pid:4761 1534860559646	Marionette	INFO	Listening on port 2828
2018-08-21 14:09:21,515 INFO validate-wpt-results wpt-run:stdout  5:42.46 TEST_START: /css/CSS2/selectors/first-letter-punctuation-060.xht
2018-08-21 14:09:23,069 INFO validate-wpt-results wpt-run:stdout  5:44.01 pid:4761 1534860563068	Marionette	INFO	Testing http://web-platform.test:8000/css/CSS2/selectors/first-letter-punctuation-060.xht == http://web-platform.test:8000/css/CSS2/selectors/first-letter-punctuation-060-ref.xht
2018-08-21 14:09:24,172 INFO validate-wpt-results wpt-run:stdout  5:45.12 pid:4761 1534860564154	Marionette	INFO	Found 526 pixels different, maximum difference per channel 255
2018-08-21 14:09:24,502 INFO validate-wpt-results wpt-run:stdout  5:45.44 TEST_END: FAIL, expected PASS - Testing http://web-platform.test:8000/css/CSS2/selectors/first-letter-punctuation-060.xht == http://web-platform.test:8000/css/CSS2/selectors/first-letter-punctuation-060-ref.xht
2018-08-21 14:09:24,502 INFO validate-wpt-results wpt-run:stdout 
2018-08-21 14:09:25,562 INFO validate-wpt-results wpt-run:stdout  5:46.51 pid:4761 [Parent 4761, Gecko_IOThread] WARNING: pipe error (57): Connection reset by peer: file /builds/worker/workspace/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 353
2018-08-21 14:09:25,867 INFO validate-wpt-results wpt-run:stdout  5:46.81 pid:4761 *** UTM:SVC TimerManager:registerTimer called after profile-before-change notification. Ignoring timer registration for id: telemetry_modules_ping
2018-08-21 14:09:27,298 INFO validate-wpt-results wpt-run:stdout  5:48.24 INFO Browser exited with return code 0
2018-08-21 14:09:27,298 INFO validate-wpt-results wpt-run:stdout  5:48.24 WARNING u'runner_teardown': ()
2018-08-21 14:09:27,397 INFO validate-wpt-results wpt-run:stdout  5:48.34 INFO Setting up ssl
2018-08-21 14:09:27,663 INFO validate-wpt-results wpt-run:stdout  5:48.59 certutil
2018-08-21 14:09:27,918 INFO validate-wpt-results wpt-run:stdout  5:48.86 certutil
2018-08-21 14:09:27,987 INFO validate-wpt-results wpt-run:stdout  5:48.91 certutil
2018-08-21 14:09:27,988 INFO validate-wpt-results wpt-run:stdout Certificate Nickname                                         Trust Attributes
2018-08-21 14:09:27,988 INFO validate-wpt-results wpt-run:stdout                                                              SSL,S/MIME,JAR/XPI
2018-08-21 14:09:27,988 INFO validate-wpt-results wpt-run:stdout 
2018-08-21 14:09:27,988 INFO validate-wpt-results wpt-run:stdout web-platform-tests                                           CT,,
2018-08-21 14:09:27,988 INFO validate-wpt-results wpt-run:stdout 
2018-08-21 14:09:28,014 INFO validate-wpt-results wpt-run:stdout  5:48.96 INFO Application command: /home/selina/worker/Local_Chunked_Runner/build/firefox/firefox --marionette about:blank -profile /tmp/tmpskWIHZ.mozrunner
2018-08-21 14:09:28,093 INFO validate-wpt-results wpt-run:stdout  5:49.02 INFO Starting runner
2018-08-21 14:09:28,361 INFO validate-wpt-results wpt-run:stdout  5:49.30 pid:4949 Full command: /home/selina/worker/Local_Chunked_Runner/build/firefox/firefox --marionette about:blank -profile /tmp/tmpskWIHZ.mozrunner
2018-08-21 14:09:28,361 INFO validate-wpt-results wpt-run:stdout pid:4949 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-08-21 14:09:28,361 INFO validate-wpt-results wpt-run:stdout  5:49.31 pid:4949 Unable to init server: Could not connect: Connection refused
2018-08-21 14:09:28,361 INFO validate-wpt-results wpt-run:stdout  5:49.31 pid:4949 Error: cannot open display: :106
2018-08-21 14:11:28,315 INFO validate-wpt-results wpt-run:stdout  7:49.26 WARNING Failed to start protocol connection
2018-08-21 14:11:28,317 INFO validate-wpt-results wpt-run:stdout  7:49.26 ERROR Traceback (most recent call last):
2018-08-21 14:11:28,317 INFO validate-wpt-results wpt-run:stdout   File "/home/selina/worker/Local_Chunked_Runner/build/tools/wptrunner/wptrunner/executors/protocol.py", line 47, in setup
2018-08-21 14:11:28,317 INFO validate-wpt-results wpt-run:stdout     self.connect()
2018-08-21 14:11:28,317 INFO validate-wpt-results wpt-run:stdout   File "/home/selina/worker/Local_Chunked_Runner/build/tools/wptrunner/wptrunner/executors/executormarionette.py", line 462, in connect
2018-08-21 14:11:28,317 INFO validate-wpt-results wpt-run:stdout     self.marionette.raise_for_port()
2018-08-21 14:11:28,317 INFO validate-wpt-results wpt-run:stdout   File "/home/selina/worker/Local_Chunked_Runner/build/_venv/lib/python2.7/site-packages/marionette_driver/marionette.py", line 711, in raise_for_port
2018-08-21 14:11:28,317 INFO validate-wpt-results wpt-run:stdout     self.host, self.port))
2018-08-21 14:11:28,317 INFO validate-wpt-results wpt-run:stdout timeout: Timed out waiting for connection on 127.0.0.1:2828!
2018-08-21 14:11:28,317 INFO validate-wpt-results wpt-run:stdout 
2018-08-21 14:11:28,318 INFO validate-wpt-results wpt-run:stdout  7:49.26 INFO Browser exited with return code 1
2018-08-21 14:11:28,335 INFO validate-wpt-results wpt-run:stdout  7:49.28 CRITICAL Traceback (most recent call last):
2018-08-21 14:11:28,335 INFO validate-wpt-results wpt-run:stdout   File "/home/selina/worker/Local_Chunked_Runner/build/tools/wptrunner/wptrunner/testrunner.py", line 134, in start_runner
2018-08-21 14:11:28,335 INFO validate-wpt-results wpt-run:stdout     runner.run()
2018-08-21 14:11:28,335 INFO validate-wpt-results wpt-run:stdout   File "/home/selina/worker/Local_Chunked_Runner/build/tools/wptrunner/wptrunner/testrunner.py", line 83, in run
2018-08-21 14:11:28,335 INFO validate-wpt-results wpt-run:stdout     self.setup()
2018-08-21 14:11:28,335 INFO validate-wpt-results wpt-run:stdout   File "/home/selina/worker/Local_Chunked_Runner/build/tools/wptrunner/wptrunner/testrunner.py", line 70, in setup
2018-08-21 14:11:28,335 INFO validate-wpt-results wpt-run:stdout     self.executor.setup(self)
2018-08-21 14:11:28,335 INFO validate-wpt-results wpt-run:stdout   File "/home/selina/worker/Local_Chunked_Runner/build/tools/wptrunner/wptrunner/executors/executormarionette.py", line 725, in setup
2018-08-21 14:11:28,336 INFO validate-wpt-results wpt-run:stdout     self.implementation.setup(**self.implementation_kwargs)
2018-08-21 14:11:28,336 INFO validate-wpt-results wpt-run:stdout   File "/home/selina/worker/Local_Chunked_Runner/build/tools/wptrunner/wptrunner/executors/executormarionette.py", line 815, in setup
2018-08-21 14:11:28,336 INFO validate-wpt-results wpt-run:stdout     self.executor.protocol.marionette.set_context(self.executor.protocol.marionette.CONTEXT_CHROME)
2018-08-21 14:11:28,336 INFO validate-wpt-results wpt-run:stdout   File "/home/selina/worker/Local_Chunked_Runner/build/_venv/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1434, in set_context
2018-08-21 14:11:28,336 INFO validate-wpt-results wpt-run:stdout     {"value": context})
2018-08-21 14:11:28,336 INFO validate-wpt-results wpt-run:stdout   File "/home/selina/worker/Local_Chunked_Runner/build/_venv/lib/python2.7/site-packages/marionette_driver/decorators.py", line 26, in _
2018-08-21 14:11:28,336 INFO validate-wpt-results wpt-run:stdout     return func(*args, **kwargs)
2018-08-21 14:11:28,336 INFO validate-wpt-results wpt-run:stdout   File "/home/selina/worker/Local_Chunked_Runner/build/_venv/lib/python2.7/site-packages/marionette_driver/marionette.py", line 730, in _send_message
2018-08-21 14:11:28,336 INFO validate-wpt-results wpt-run:stdout     raise errors.MarionetteException("Please start a session")
2018-08-21 14:11:28,336 INFO validate-wpt-results wpt-run:stdout MarionetteException: Please start a session
2018-08-21 14:11:28,336 INFO validate-wpt-results wpt-run:stdout 
2018-08-21 14:11:28,336 INFO validate-wpt-results wpt-run:stdout  7:49.28 WARNING Traceback (most recent call last):
2018-08-21 14:11:28,336 INFO validate-wpt-results wpt-run:stdout   File "/home/selina/worker/Local_Chunked_Runner/build/tools/wptrunner/wptrunner/executors/executormarionette.py", line 836, in teardown
2018-08-21 14:11:28,336 INFO validate-wpt-results wpt-run:stdout     self.executor.protocol.marionette._send_message("reftest:teardown", {})
2018-08-21 14:11:28,336 INFO validate-wpt-results wpt-run:stdout   File "/home/selina/worker/Local_Chunked_Runner/build/_venv/lib/python2.7/site-packages/marionette_driver/decorators.py", line 26, in _
2018-08-21 14:11:28,336 INFO validate-wpt-results wpt-run:stdout     return func(*args, **kwargs)
2018-08-21 14:11:28,336 INFO validate-wpt-results wpt-run:stdout   File "/home/selina/worker/Local_Chunked_Runner/build/_venv/lib/python2.7/site-packages/marionette_driver/marionette.py", line 730, in _send_message
2018-08-21 14:11:28,336 INFO validate-wpt-results wpt-run:stdout     raise errors.MarionetteException("Please start a session")
2018-08-21 14:11:28,336 INFO validate-wpt-results wpt-run:stdout MarionetteException: Please start a session
2018-08-21 14:11:28,337 INFO validate-wpt-results wpt-run:stdout 
2018-08-21 14:11:28,337 INFO validate-wpt-results wpt-run:stdout  7:49.28 WARNING Traceback (most recent call last):
2018-08-21 14:11:28,337 INFO validate-wpt-results wpt-run:stdout   File "/home/selina/worker/Local_Chunked_Runner/build/tools/wptrunner/wptrunner/executors/executormarionette.py", line 730, in teardown
2018-08-21 14:11:28,337 INFO validate-wpt-results wpt-run:stdout     handle = self.protocol.marionette.window_handles[0]
2018-08-21 14:11:28,337 INFO validate-wpt-results wpt-run:stdout   File "/home/selina/worker/Local_Chunked_Runner/build/_venv/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1385, in window_handles
2018-08-21 14:11:28,337 INFO validate-wpt-results wpt-run:stdout     return self._send_message("WebDriver:GetWindowHandles")
2018-08-21 14:11:28,337 INFO validate-wpt-results wpt-run:stdout   File "/home/selina/worker/Local_Chunked_Runner/build/_venv/lib/python2.7/site-packages/marionette_driver/decorators.py", line 26, in _
2018-08-21 14:11:28,337 INFO validate-wpt-results wpt-run:stdout     return func(*args, **kwargs)
2018-08-21 14:11:28,337 INFO validate-wpt-results wpt-run:stdout   File "/home/selina/worker/Local_Chunked_Runner/build/_venv/lib/python2.7/site-packages/marionette_driver/marionette.py", line 730, in _send_message
2018-08-21 14:11:28,337 INFO validate-wpt-results wpt-run:stdout     raise errors.MarionetteException("Please start a session")
2018-08-21 14:11:28,337 INFO validate-wpt-results wpt-run:stdout MarionetteException: Please start a session
2018-08-21 14:11:28,337 INFO validate-wpt-results wpt-run:stdout 
2018-08-21 14:11:28,337 INFO validate-wpt-results wpt-run:stdout  7:49.28 WARNING u'runner_teardown': ()
2018-08-21 14:11:28,337 INFO validate-wpt-results wpt-run:stdout  7:49.28 WARNING u'stop': ()

Complete log

@jugglinmike jugglinmike self-assigned this Aug 24, 2018
@jugglinmike
Copy link
Collaborator Author

By increasing the frequency of partial builds, this issue exacerbated a previously-existing problem, and that problem blocked all results collection. That problem and its resolution are described in gh-594.

@jugglinmike
Copy link
Collaborator Author

This problem continues to interfere with results collection for both Firefox and Firefox Nightly.

I've learned a little more from analyzing the logs a bit. Here's a filtered version of a collection attempt. This project implements "retry" logic, so this output describes three failed invocations of wpt run:

$ grep -E -e 'Failed to connect to Mir' \
  -e '[0-9]+ missing results' \
  -e 'Failed to start protocol' \
  -e 'TEST_START' \
  -e 'Running .* tests' \
  -o ../docs/2018-09-06-stdio-firefox-stable-12-of-20.txt | uniq -c
      1 Running reftest tests
     72 TEST_START
      1 Failed to connect to Mir
      1 Failed to start protocol
      1 Running wdspec tests
      3 TEST_START
      1 Running testharness tests
     44 TEST_START
      1 Failed to connect to Mir
    287 TEST_START
      1 Failed to connect to Mir
    646 TEST_START
      1 311 missing results
      1 Running reftest tests
    125 TEST_START
      1 Failed to connect to Mir
      1 Failed to start protocol
      1 Running wdspec tests
      3 TEST_START
      1 Running testharness tests
    229 TEST_START
      1 Failed to connect to Mir
    124 TEST_START
      1 Failed to connect to Mir
      3 TEST_START
      1 Failed to connect to Mir
    445 TEST_START
      1 Failed to connect to Mir
    176 TEST_START
      1 258 missing results
      1 Running reftest tests
    135 TEST_START
      1 Failed to connect to Mir
      1 Failed to start protocol
      1 Running wdspec tests
      3 TEST_START
      1 Running testharness tests
    341 TEST_START
      1 Failed to connect to Mir
     65 TEST_START
      1 Failed to connect to Mir
     46 TEST_START
      1 Failed to connect to Mir
    265 TEST_START
      1 Failed to connect to Mir
     22 TEST_START
      1 Failed to connect to Mir
    238 TEST_START
      1 248 missing results

This shows that the display error happens multiple times per attempt and effects all three test types. It also shows that the very first occurrence is what causes manifest-level incompleteness: the runner runs a variable number of reftests, encounters the error, fails to "start protocol", and moves on to wdspec tests, skipping all remaining reftests. The skipped reftests account for the "XXX missing results" which prevent upload.

I've been able to reproduce the error reported above by using kill to terminate the Xvfb process started by the xvfb-run utility. However, in these logs, some tests are executed following the "Mir" error, so I don't think the X server is actually going down. I've pushed up a commit to capture the output from xvfb. (This is a chage that we can't deploy without interrupting ongoing builds, so it will take some time before it's running in production.)

I'm pessimistic about that log surfacing any useful information, though. @andreastt / @whimboo you two know a bunch about Marionette and Geckodriver. Do you have any thoughts about what might be going wrong? Is there any other information that would be helpful?

@whimboo
Copy link

whimboo commented Sep 7, 2018

@jugglinmike please note the following line in testrunner.py:

https://searchfox.org/mozilla-central/rev/f2ac80ab7dbde5400a3400d463e07331194dec94/testing/web-platform/tests/tools/wptrunner/wptrunner/testrunner.py#573-575

We only check for crashes when those conditions are met. And I assume that this is not the case here, and as such no crashes are getting reported. So I think this is just https://bugzilla.mozilla.org/show_bug.cgi?id=1485259, right?

@jugglinmike
Copy link
Collaborator Author

Thanks, @whimboo. That bug seems to be focused on how wptrunner detects crashes. Fixing that issue might make recovery possible, which is definitely an improvement. (By the way, it'd be good to track that issue in WPT since it concerns code which is managed in that repository.)

Even with a fix to wptrunner, the crash would still be occurring. I'm interested in addressing the crash itself because restarting the browser contributes to the time required to run the tests. More importantly, the problem may also be effecting other users of Geckodriver/Marionette, and they may not have the ability to handle intermittent errors. We've only been experiencing the problem for the past few weeks, so this may just be a symptom of a more serious regression.

@jugglinmike
Copy link
Collaborator Author

Good news: @jgraham has submitted a patch which is expected to allow for crash recovery. Thanks, James!

@jugglinmike
Copy link
Collaborator Author

@whimboo here's another issue which may or many not be related: gh-602

@whimboo
Copy link

whimboo commented Sep 10, 2018

@jugglinmike are the crashes still persistent with the latest Nightly build of Firefox? We indeed had a very annoying crash over the last month as triggered by bug 1482029 which caused a lot of tests to fail. But that crash is fixed since Sep 4th.

So I would like to know if you can still reproduce it. If yes, lets hope that James' patch will give us more details.

@jugglinmike
Copy link
Collaborator Author

@jugglinmike are the crashes still persistent with the latest Nightly build of Firefox?

Unfortunately, yes.

@whimboo provided some more detail on crash recovery in gh-602. I'd like to respond in this thread to keep the two issues separate.

Maybe this is one of the shutdown hangs we experience here, and the
background thread monitor kills Firefox after the 60s max shutdown duration.

If you want to play around and you can reproduce on Travis with a PR maybe
try to check the toolkit.asyncshutdown.crash_timeout preference, and shorten
the time until Firefox gets killed. If that makes tests faster then we have
an indication.

We may be able to infer some of this from the timestamps in the logs we already have. The raw data is quite verbose, so I've filtered it. Here's the shell pipeline I used:

awk '/Mir:/,/Setting up ssl/;/Running/' | grep -E 'Mir:|Browser exited|Running'

In plain English, that selects the following lines:

  • Those that inclue "Mir"
  • Those that describe the browser exiting following the "Mir" reference (I'm limiting this because in the current configuration, the browser restarts after every failing test)
  • Those that describe the transition between WPT test types (one of reftest, testharness, or wdspec)
Filtered output from a recent build
2019-09-07 15:57:53,139 INFO validate-wpt-results wpt-run:stdout  1:52.67 INFO Running reftest tests
2018-09-07 16:06:04,074 INFO validate-wpt-results wpt-run:stdout pid:7724 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-07 16:08:04,129 INFO validate-wpt-results wpt-run:stdout 12:03.64 INFO Browser exited with return code 1
2018-09-07 16:08:04,196 INFO validate-wpt-results wpt-run:stdout 12:03.72 INFO Running wdspec tests
2018-09-07 16:08:04,196 INFO validate-wpt-results wpt-run:stdout 12:03.72 INFO Running wdspec tests
2018-09-07 16:08:04,214 INFO validate-wpt-results wpt-run:stdout 12:03.73 INFO Running testharness tests
2018-09-07 16:08:04,214 INFO validate-wpt-results wpt-run:stdout 12:03.73 INFO Running testharness tests
2018-09-07 18:29:40,474 INFO validate-wpt-results wpt-run:stdout pid:27166 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-07 18:30:49,396 INFO validate-wpt-results wpt-run:stdout 154:48.92 INFO Browser exited with return code 1
2018-09-07 18:53:37,828 INFO validate-wpt-results wpt-run:stdout pid:9448 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-07 18:54:46,694 INFO validate-wpt-results wpt-run:stdout 178:46.22 INFO Browser exited with return code 1
2018-09-07 19:33:34,556 INFO validate-wpt-results wpt-run:stdout  1:45.14 INFO Running reftest tests
2018-09-07 19:41:17,160 INFO validate-wpt-results wpt-run:stdout pid:3191 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-07 19:43:17,221 INFO validate-wpt-results wpt-run:stdout 11:27.80 INFO Browser exited with return code 1
2018-09-07 19:43:17,297 INFO validate-wpt-results wpt-run:stdout 11:27.88 INFO Running wdspec tests
2018-09-07 19:43:17,297 INFO validate-wpt-results wpt-run:stdout 11:27.88 INFO Running wdspec tests
2018-09-07 19:43:17,316 INFO validate-wpt-results wpt-run:stdout 11:27.90 INFO Running testharness tests
2018-09-07 19:43:17,316 INFO validate-wpt-results wpt-run:stdout 11:27.90 INFO Running testharness tests
2018-09-07 20:03:00,860 INFO validate-wpt-results wpt-run:stdout pid:13904 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-07 20:04:09,741 INFO validate-wpt-results wpt-run:stdout 32:20.32 INFO Browser exited with return code 1
2018-09-07 20:40:48,635 INFO validate-wpt-results wpt-run:stdout pid:3257 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-07 20:41:58,201 INFO validate-wpt-results wpt-run:stdout 70:08.78 INFO Browser exited with return code 1
2018-09-07 20:43:53,933 INFO validate-wpt-results wpt-run:stdout pid:4024 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-07 20:45:02,831 INFO validate-wpt-results wpt-run:stdout 73:13.41 INFO Browser exited with return code 1
2018-09-07 21:01:46,456 INFO validate-wpt-results wpt-run:stdout pid:14996 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-07 21:02:55,321 INFO validate-wpt-results wpt-run:stdout 91:05.90 INFO Browser exited with return code 1
2018-09-07 21:17:17,484 INFO validate-wpt-results wpt-run:stdout pid:24164 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-07 21:18:26,373 INFO validate-wpt-results wpt-run:stdout 106:36.95 INFO Browser exited with return code 1
2018-09-07 21:52:23,535 INFO validate-wpt-results wpt-run:stdout pid:12936 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-07 21:53:32,457 INFO validate-wpt-results wpt-run:stdout 141:43.04 INFO Browser exited with return code 1
2018-09-07 22:58:08,646 INFO validate-wpt-results wpt-run:stdout pid:20029 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-07 22:59:17,519 INFO validate-wpt-results wpt-run:stdout 207:28.10 INFO Browser exited with return code 1
2018-09-07 23:04:31,361 INFO validate-wpt-results wpt-run:stdout pid:22498 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-07 23:05:40,220 INFO validate-wpt-results wpt-run:stdout 213:50.80 INFO Browser exited with return code 1
2018-09-07 23:12:46,983 INFO validate-wpt-results wpt-run:stdout  1:39.55 INFO Running reftest tests
2018-09-07 23:28:14,804 INFO validate-wpt-results wpt-run:stdout 17:07.37 INFO Running wdspec tests
2018-09-07 23:28:14,826 INFO validate-wpt-results wpt-run:stdout 17:07.39 INFO Running testharness tests
2018-09-08 00:01:29,779 INFO validate-wpt-results wpt-run:stdout pid:21143 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-08 00:02:38,624 INFO validate-wpt-results wpt-run:stdout 51:31.19 INFO Browser exited with return code 1
2018-09-08 00:58:02,871 INFO validate-wpt-results wpt-run:stdout pid:23357 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-08 00:59:11,628 INFO validate-wpt-results wpt-run:stdout 108:04.18 INFO Browser exited with return code 1
2018-09-08 00:59:48,372 INFO validate-wpt-results wpt-run:stdout pid:23547 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-08 01:00:57,222 INFO validate-wpt-results wpt-run:stdout 109:49.79 INFO Browser exited with return code 1
2018-09-08 01:13:18,253 INFO validate-wpt-results wpt-run:stdout pid:31835 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-08 01:14:27,147 INFO validate-wpt-results wpt-run:stdout 123:19.71 INFO Browser exited with return code 1
2018-09-08 01:40:40,104 INFO validate-wpt-results wpt-run:stdout pid:16859 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-08 01:41:49,820 INFO validate-wpt-results wpt-run:stdout 150:42.39 INFO Browser exited with return code 1
2018-09-08 01:44:13,222 INFO validate-wpt-results wpt-run:stdout pid:17709 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-08 01:45:22,100 INFO validate-wpt-results wpt-run:stdout 154:14.66 INFO Browser exited with return code 1
2018-09-08 01:54:23,293 INFO validate-wpt-results wpt-run:stdout pid:23649 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-08 01:55:32,129 INFO validate-wpt-results wpt-run:stdout 164:24.69 INFO Browser exited with return code 1
2018-09-08 02:05:04,499 INFO validate-wpt-results wpt-run:stdout pid:29580 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-08 02:06:13,360 INFO validate-wpt-results wpt-run:stdout 175:05.92 INFO Browser exited with return code 1
2018-09-08 02:24:45,333 INFO validate-wpt-results wpt-run:stdout pid:9121 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-08 02:25:54,185 INFO validate-wpt-results wpt-run:stdout 194:46.75 INFO Browser exited with return code 1
2018-09-08 02:27:30,627 INFO validate-wpt-results wpt-run:stdout pid:10947 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-08 02:28:39,492 INFO validate-wpt-results wpt-run:stdout 197:32.06 INFO Browser exited with return code 1
2018-09-08 02:51:16,330 INFO validate-wpt-results wpt-run:stdout pid:23894 Failed to connect to Mir: Failed to connect to server socket: No such file or directory
2018-09-08 02:52:25,132 INFO validate-wpt-results wpt-run:stdout 221:17.70 INFO Browser exited with return code 1

Complete log

In two cases, there is a delay of 120 seconds between the "Mir" error and the browser being reported as exited. These are during the "reftests" (but not the same test), and they are the problematic crash. That is, they are the crashes that cause the wptrunner to skip the remaining tests and proceed to the next test type.

In all of the other cases of the "Mir" error, the delay is between 69 and 70 seconds.

@whimboo
Copy link

whimboo commented Sep 11, 2018

Thanks for the complete log! Here some more indicators:

  1. Lots of starts of Firefox take about 20s and more!! And we have lots of those. I assume we are using opt builds under which circumstance this is huge! Or which of those lines below indicate the real execution of the binary? Application command or Full command?
2018-09-08 01:49:26,254 INFO validate-wpt-results wpt-run:stdout 158:18.82 INFO Application command: /home/lara/worker/GNU_Linux_Chunked_Runner/build/firefox/firefox --marionette about:blank -profile /tmp/tmpMwdqJF.mozrunner
2018-09-08 01:49:26,335 INFO validate-wpt-results wpt-run:stdout 158:18.90 INFO Starting runner
2018-09-08 01:49:41,519 INFO validate-wpt-results wpt-run:stdout 158:34.08 pid:20534 Full command: /home/lara/worker/GNU_Linux_Chunked_Runner/build/firefox/firefox --marionette about:blank -profile /tmp/tmpMwdqJF.mozrunner
2018-09-08 01:49:41,519 INFO validate-wpt-results wpt-run:stdout pid:20534 1536371381498	Marionette	INFO	Listening on port 2829
  1. The cases with exit code 1 show failures like Error: cannot open display: :100, which indicate problems with the window manager as you already stated above. And as result Firefox can't be started. Have you thought to run all the tests in headless mode? This should even be way faster.

Beside those two conditions I cannot see something else which would indicate a browser crash.

@jugglinmike
Copy link
Collaborator Author

Lots of starts of Firefox take about 20s and more!! And we have lots of
those.

Most restarts documented in the complete log are expected. As I mentioned in my
previous comment, wptrunner restarts the browser following every test failure.

The cases with exit code 1 show failures like Error: cannot open display: :100, which indicate problems with the window manager as you already stated
above. And as result Firefox can't be started.

It sounds like we need more information to continue debugging this. Do you know
what else I could do to learn what is going wrong?

Have you thought to run all the tests in headless mode? This should even be
way faster.

Thanks for the tip! We're not using that feature due to my uncertainty about
its effect on the tests. I want to keep this discussion focused on the
regression, so I've opened a separate issue to talk about that: gh-603.

@whimboo
Copy link

whimboo commented Sep 13, 2018

The patch from @jgraham is waiting to get merged on web-platform-tests/wpt#12968. I'm interested to see how this will change the behavior what we are seeing here.

@whimboo
Copy link

whimboo commented Sep 13, 2018

Please note that also people using Selenium via a docker image also seem to have such a problem with Mir: SeleniumHQ/docker-selenium#785

@jugglinmike
Copy link
Collaborator Author

Thanks for the link--it's good to know I'm not alone! That's also the reason I don't think we can consider @jgraham's patch a resolution. I'd like to help resolve the underlying issue, but I could use guidance on where I should be looking next. @whimboo do you know who I can talk to about researching this further?

@whimboo
Copy link

whimboo commented Sep 13, 2018

Sorry, but I don't know. Maybe some Ubuntu forum could give an answer to this.

Is there a way to have more detailed log information about Mir failures? Maybe turn those on, and we could see what's wrong.

@jgraham
Copy link
Collaborator

jgraham commented Sep 14, 2018

This might be a fairly obvious question, but why are we running Mir? It's officially unsupported now and Ubuntu 17.10 switched to Wayland, and 18.04 LTS switched back to X11 (not saying the bug shouldn't be fixed, but we also shouldn't be running CI on what amounts to an exotic setup).

@jugglinmike
Copy link
Collaborator Author

Thank you for this; it wasn't obvious to me at all. I agree that the bug should be fixed, though I disagree with the classification of "exotic" (Ubuntu's LTS lasts for 5 years, after all). I was hoping that this issue could be a forcing function to get that corrected, but I'm reconsidering.

We don't have the expertise or the time to push for a proper fix, and the workaround we've been discussing didn't have the desired effect (wptrunner/Firefox continue to produce incomplete results). In light of our ongoing responsibility to provide timely results, I'm inclined to let this go and migrate to 18.04 LTS.

@jgraham
Copy link
Collaborator

jgraham commented Sep 14, 2018

Exotic was perhaps too strong, but it does seem more likely that we will get documentation and support for a product that's not EoL (modulo LTS). I can't see any Firefox bug related to this issue, so I suspect it's something weird in the Mir/Ubuntu setup.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants