System administrators have been left fuming at Google after the company pushed experimental changes out to stable versions of its Chrome browser, which triggered a “white screen of death” for thousands of business users.
The change was silently pushed out last week as part of a WebContents Occlusion feature, which suspends Chrome tabs when users move other application windows on top of them, in a bid to reduce the browser’s notoriously high resource use.
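For administrators who want to rule the behaviour out while troubleshooting, one way to do so is to launch Chrome with occlusion-based backgrounding switched off. The minimal sketch below assumes a default Windows install path for Chrome and uses Chromium’s --disable-backgrounding-occluded-windows command-line switch; paths and switch behaviour can differ between versions, so treat it as illustrative rather than official guidance.

    # Minimal sketch, assuming a default Windows install path for Chrome.
    # The --disable-backgrounding-occluded-windows switch is a Chromium
    # command-line flag that opts the session out of occlusion-based tab
    # backgrounding; behaviour may vary between Chrome versions.
    import subprocess

    CHROME_PATH = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"

    subprocess.Popen([
        CHROME_PATH,
        "--disable-backgrounding-occluded-windows",
    ])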
Google said on Friday that it had tested the feature on one percent of users with no negative impact – a comment that did little to assuage sysadmins’ frustration.
Google Chrome White Screen: What Happened?
Those running the browser on Windows Server “terminal server” deployments – common in enterprise networks – or accessing Chrome through virtual desktop environments such as Citrix saw Chrome tabs go blank and completely unresponsive across their networks, affecting thousands of users.
Because IT admins typically manage and control such updates themselves, the Google Chrome white screen left IT teams scrambling to identify what had gone wrong in their own systems as end users howled at the abrupt outage. (Many businesses do not allow users to quickly download an alternative browser as a stopgap.)
Users on the Chromium bug thread expressed huge frustration at the bug itself, at Google’s slow reaction, and at the fact that the change had been pushed out to stable Chrome versions without any warning or notification.
As one sysadmin put it in a Chromium bug thread: “In a medical environment with 4000 concurrent users, it becomes a clinical risk to the patients if a web application does not run as it was intended. We absolutely require the ability to disable these random tests. When we deploy a product in a RDS/Citrix environment, we expect it to remain working and unchanged until we update it. We do not use the “autoupdate” function in the product and manage deployment.”
“At my organization, nearly 100% of our users (~300) running in Citrix virtual desktops were directly affected for 2 full business days. Our main line of business application runs in Chrome and the result of the Occlusion flag being enabled, our staff was unable to effectively service customers,” another user wrote.
Another added: “Are we running beta/dev/canary versions of chromium where those experiments should take place? No, most of us are on stable/enterprise channel, and therefore shouldn’t have ours messed with at all.”
“The experiment / flag has been on in beta for ~5 months,” Google’s David Bienvenu said in a Chromium bug thread. “It was turned on for stable (e.g., m77, m78) via an experiment that was pushed to released Chrome Tuesday morning.
“Prior to that, it had been on for about one percent of M77 and M78 users for a month with no reports of issues, unfortunately.”
Another Google engineer added: “Once we received reports of the problem, we were able to revert it immediately. We sincerely apologize for the disruption this caused.”
The apology was not enough for many. On Saturday the thread was still drawing frustrated comments. As one user put it: “Given that we’re an enterprise and will need to be doing RCAs [Root Cause Analyses] for this giant debacle who on the chromium team is going to be writing up the RCA that they will make public for all of us so we can understand the details and your action plan to keep this from happening again?
“I know from experience this is part of your process. Will you commit to making this RCA public?”
Another added: “‘Oops’ and apologies unfortunately don’t unravel the mess and backlog this created for many of us. Like others state, figure out a way to eliminate Enterprise versions from being your test subjects. I can’t imagine how many technical professionals were impacted with their employers by your experiment. Some employers don’t tolerate what is perceived as incompetence. Many of us were scrambling to find root cause and validating all layers of the infrastructure when it was your team’s misstep all along.”
The code bug capped a torrid few weeks for Google, which has also faced a series of Google Cloud outages that left engineers manually fixing tasks around the clock for three days. Those outages were also triggered, in part, by inadequate testing prior to code rollout. (GCP said it has now implemented “continuous load testing as part of the deployment pipeline of the component which suffered the performance regression, so that such issues are identified before they reach production in future.”)
Read this: Codeanywhere Blames GCP Outage for Vanished Projects