Intermittent and undetected Firefox crashes #592
By increasing the frequency of partial builds, this issue exacerbated a previously existing problem that blocked all results collection. That problem and its resolution are described in gh-594.
This problem continues to interfere with results collection for both Firefox and Firefox Nightly. I've learned a little more from analyzing the logs. Here's a filtered version of a collection attempt. This project implements "retry" logic, so this output describes three failed invocations of the runner:
This shows that the display error happens multiple times per attempt and affects all three test types. It also shows that the very first occurrence is what causes manifest-level incompleteness: the runner runs a variable number of reftests, encounters the error, fails to "start protocol", and moves on to wdspec tests, skipping all remaining reftests. The skipped reftests account for the "XXX missing results" which prevent upload.

I've been able to reproduce the error reported above. I'm pessimistic about that log surfacing any useful information, though.

@andreastt / @whimboo you two know a bunch about Marionette and Geckodriver. Do you have any thoughts about what might be going wrong? Is there any other information that would be helpful?
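For context, the "retry" logic mentioned above amounts to re-invoking the runner a fixed number of times before giving up. A minimal sketch, assuming a subprocess-based runner (the names and attempt count here are illustrative, not the project's actual code):

```python
# Hypothetical sketch of the retry behavior: re-run the collection
# command until it succeeds or the attempt budget is exhausted.
import subprocess

MAX_ATTEMPTS = 3

def collect_with_retries(cmd):
    for attempt in range(1, MAX_ATTEMPTS + 1):
        if subprocess.run(cmd).returncode == 0:
            return True
        print("Attempt %d of %d failed" % (attempt, MAX_ATTEMPTS))
    return False
```

Three failed invocations, as in the filtered log above, would mean the whole attempt budget was spent without producing a complete result set.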
@jugglinmike please note the following line in testrunner.py: We only check for crashes when those conditions are met, and I assume that is not the case here, so no crashes are getting reported. I think this is just https://bugzilla.mozilla.org/show_bug.cgi?id=1485259, right?
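The line in question isn't preserved in this copy of the thread. As a hypothetical paraphrase of the gating being described (the status names and helper are illustrative, not the verbatim code from testrunner.py), the crash check only runs once the harness has already recorded a failing status:

```python
# Hypothetical paraphrase of the gating described above, not the
# verbatim line from testrunner.py.
CRASH_CHECK_STATUSES = {"CRASH", "EXTERNAL-TIMEOUT", "INTERNAL-ERROR"}

def maybe_check_for_crash(browser, test, status):
    # A browser that dies without producing one of these statuses
    # (e.g. during startup, before any result is recorded) is never
    # inspected for crash dumps, so the crash goes unreported.
    if status in CRASH_CHECK_STATUSES:
        browser.check_crash(test.id)
```

A crash during protocol startup, as described earlier in this thread, would never satisfy such a condition, which matches the "undetected" behavior in the issue title.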
Thanks, @whimboo. That bug seems to be focused on how wptrunner detects crashes. Fixing that issue might make recovery possible, which is definitely an improvement. (By the way, it'd be good to track that issue in WPT since it concerns code which is managed in that repository.)

Even with a fix to wptrunner, the crash would still be occurring. I'm interested in addressing the crash itself because restarting the browser contributes to the time required to run the tests. More importantly, the problem may also be affecting other users of Geckodriver/Marionette, and they may not have the ability to handle intermittent errors. We've only been experiencing the problem for the past few weeks, so this may just be a symptom of a more serious regression.
Good news: @jgraham has submitted a patch which is expected to allow for crash recovery. Thanks, James! |
@jugglinmike are the crashes still happening with the latest Nightly build of Firefox? We had a very annoying crash over the last month, triggered by bug 1482029, which caused a lot of tests to fail. But that crash has been fixed since Sep 4th, so I would like to know if you can still reproduce it. If so, let's hope that James' patch will give us more details.
Unfortunately, yes. @whimboo provided some more detail on crash recovery in gh-602. I'd like to respond in this thread to keep the two issues separate.
We may be able to infer some of this from the timestamps in the logs we already have. The raw data is quite verbose, so I've filtered it. Here's the shell pipeline I used:
In plain English, that selects the following lines:
Filtered output from a recent build
In two cases, there is a delay of 120 seconds between the "Mir" error and the browser being reported as exited. These occur during the reftests (but not on the same test), and they correspond to the problematic crash; that is, they are the crashes that cause wptrunner to skip the remaining tests and proceed to the next test type. In all of the other cases of the "Mir" error, the delay is between 69 and 70 seconds.
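Since the original shell pipeline isn't preserved here, this is a hypothetical Python equivalent of that kind of analysis. It assumes mozlog-style raw output (one JSON object per line with a millisecond `time` field and a free-form `message` field), which is a guess based on the discussion above:

```python
# Hypothetical sketch: pair each "Mir" display error with the next
# browser-exit message and report the delay between them. The log
# format assumed here is not confirmed by the thread.
import json
import sys

def mir_to_exit_delays(log_path):
    mir_time = None
    with open(log_path) as log:
        for line in log:
            entry = json.loads(line)
            message = str(entry.get("message", ""))
            if "Mir" in message:
                mir_time = entry["time"]
            elif mir_time is not None and "exited" in message:
                # Delay in seconds between the display error and the
                # browser exit (120 s for the problematic crashes).
                yield (entry["time"] - mir_time) / 1000.0
                mir_time = None

if __name__ == "__main__":
    for delay in mir_to_exit_delays(sys.argv[1]):
        print("%.1f s" % delay)
```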
Thanks for the complete log! Here are some more indicators:
Besides those two conditions, I cannot see anything else that would indicate a browser crash.
Most restarts documented in the complete log are expected. As I mentioned in my
It sounds like we need more information to continue debugging this. Do you know
Thanks for the tip! We're not using that feature due to my uncertainty about
The patch from @jgraham is waiting to be merged as web-platform-tests/wpt#12968. I'm interested to see how it will change the behavior we are seeing here.
Please note that people using Selenium via a Docker image also seem to have such a problem with Mir: SeleniumHQ/docker-selenium#785
Thanks for the link; it's good to know I'm not alone! That's also the reason I don't think we can consider @jgraham's patch a resolution. I'd like to help resolve the underlying issue, but I could use guidance on where I should be looking next. @whimboo do you know who I can talk to about researching this further?
Sorry, but I don't know. Maybe some Ubuntu forum could give an answer to this. Is there a way to get more detailed log information about Mir failures? Maybe turn that on, and we could see what's wrong.
This might be a fairly obvious question, but why are we running Mir? It's officially unsupported now: Ubuntu 17.10 switched to Wayland, and 18.04 LTS switched back to X11. (Not saying the bug shouldn't be fixed, but we also shouldn't be running CI on what amounts to an exotic setup.)
Thank you for this; it wasn't obvious to me at all. I agree that the bug should be fixed, though I disagree with the classification of "exotic" (Ubuntu's LTS releases are supported for 5 years, after all). I was hoping that this issue could be a forcing function to get that corrected, but I'm reconsidering. We don't have the expertise or the time to push for a proper fix, and the workaround we've been discussing didn't have the desired effect (wptrunner/Firefox continues to produce incomplete results). In light of our ongoing responsibility to provide timely results, I'm inclined to let this go and migrate to 18.04 LTS.
Exotic was perhaps too strong, but it does seem more likely that we will get documentation and support for a product that's not EoL (modulo LTS). I can't see any Firefox bug related to this issue, so I suspect it's something weird in the Mir/Ubuntu setup.
During the week of 2018-08-19, Firefox sporadically reported incomplete results. This project refuses to upload results to wpt.fyi when tests defined in the manifest are not available in the report, so this has led to missing reports for some revisions of WPT.
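A minimal sketch of that completeness gate, assuming the manifest and the report expose comparable test IDs (the names here are illustrative, not the project's actual code):

```python
# Hypothetical sketch of the upload gate described above: refuse to
# publish a report when any test in the manifest has no result.
def safe_to_upload(manifest_tests, report):
    """manifest_tests: iterable of test IDs from the WPT manifest.
    report: a parsed wptreport dict with a "results" list."""
    reported = {result["test"] for result in report["results"]}
    missing = set(manifest_tests) - reported
    if missing:
        print("%d missing results; refusing to upload" % len(missing))
    return not missing
```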
The logs seem to indicate a problem between the browser and the virtual X display:
I'll include more context for that output and a link to the full log below.
After comparing all the recent failures on http://builds.wpt.fyi, I was unable to correlate the failures with the revision of WPT, the release channel of Firefox, the machine running the test, or the "chunk" of tests being executed.
My initial idea was that this might be a bug in Geckodriver. While an issue was recently filed about this particular error, the reporter was struggling with a misconfigured system. In our case, all of the machines continue to demonstrate proper configuration through successful collection attempts. The problem we're experiencing is sporadic, and it occurs without any change to system configuration.
It's possible that we're (once again) carrying some faulty state between builds. Maybe a defunct X server is interfering here. In that case, though, I'd be curious to learn why only Firefox is affected.
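One way to test the faulty-state hypothesis would be to look for stale X display locks between builds. A diagnostic sketch, assuming the conventional `/tmp/.X<display>-lock` files that X servers (including Xvfb) leave behind; this is a suggested check, not something the project currently does:

```python
# Hypothetical diagnostic: list X display lock files whose owning
# process is gone, i.e. evidence of a defunct X server leaking state
# between builds. Assumes the conventional lock format: a PID padded
# with spaces in /tmp/.X<display>-lock.
import glob
import os

def stale_x_locks():
    for lock in glob.glob("/tmp/.X*-lock"):
        try:
            with open(lock) as f:
                pid = int(f.read().strip())
            os.kill(pid, 0)  # signal 0: check existence without killing
        except ProcessLookupError:
            yield lock  # no such process, so the lock is stale
        except (ValueError, PermissionError):
            pass  # unreadable PID, or the process exists under another user

for lock in stale_x_locks():
    print("stale X lock:", lock)
```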
Extended excerpt showing the context of that output
Complete log