Official blog explains why Firefox suddenly became unable to display pages and how Mozilla responded



On January 13, 2022, a problem occurred in which the Firefox web browser could not display web pages. Mozilla Hacks, Mozilla's technical blog covering Firefox and related technologies, explains why this happened.

Retrospective and Technical Details on the recent Firefox Outage - Mozilla Hacks - the Web developer blog
https://hacks.mozilla.org/2022/02/retrospective-and-technical-details-on-the-recent-firefox-outage/

The Firefox outage of January 13, 2022 is reported to have been caused by a bug related to HTTP/3, a protocol used for communication when browsing websites. Christian Holler, who wrote the blog post, gives a more detailed explanation of what triggered the bug and how Mozilla responded.

What was the reason Firefox suddenly stopped displaying pages? - GIGAZINE



In the blog post, Holler explains that Firefox relies on multiple servers and related infrastructure to handle internal services such as updates, telemetry, certificate management, and crash reporting. This infrastructure is hosted by cloud service providers such as Google Cloud Platform (GCP), where a load balancer distributes the load evenly across servers.

For services hosted on GCP, the load balancer advertises which HTTP protocols are supported. HTTP/3 support can be set to 'Enabled', 'Disabled', or 'Automatic', and Mozilla's services were left at the default, 'Automatic'.

Then, at 7:28 on January 13, 2022 (Coordinated Universal Time), GCP deployed, without notice, a change that made HTTP/3 the default. Because the setting was 'Automatic', Holler says, service infrastructure that had previously used HTTP/2 for connections switched to HTTP/3 automatically.
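The effect of the 'Automatic' setting can be shown with a small Python sketch. This is a toy model, not GCP's API: the `Http3Setting` enum, the `advertises_http3` function, and the `provider_default` flag are hypothetical names introduced only for illustration.

```python
from enum import Enum


class Http3Setting(Enum):
    # The three choices described in the article (names are illustrative).
    ENABLED = "enabled"
    DISABLED = "disabled"
    AUTOMATIC = "automatic"


def advertises_http3(setting: Http3Setting, provider_default: bool) -> bool:
    """Return whether the modeled load balancer advertises HTTP/3."""
    if setting is Http3Setting.AUTOMATIC:
        # 'Automatic' follows whatever the provider currently ships as default.
        return provider_default
    return setting is Http3Setting.ENABLED


# Before the provider change (default off) and after it (default on).
before = advertises_http3(Http3Setting.AUTOMATIC, provider_default=False)
after = advertises_http3(Http3Setting.AUTOMATIC, provider_default=True)

# An explicit setting is unaffected by the provider's default.
pinned = advertises_http3(Http3Setting.DISABLED, provider_default=True)
```

In this model, a service left on 'Automatic' changes behavior the moment the provider flips its default, while an explicitly pinned service does not.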

Because GCP had not announced the change of default protocol, Mozilla was unaware that Firefox's internal services were now using HTTP/3. Shortly after the change was deployed, however, Mozilla noticed a surge in Firefox crashes through crash reports and through reports from inside and outside Mozilla. The number of Firefox crash reports rose sharply around 7:30.



Mozilla's investigation found that clients were hanging on a network request to one of Firefox's internal services. The cause and scope of the problem were still unknown at this point, and further analysis showed that Mozilla had not shipped any update or configuration change that could have caused it.

Mozilla therefore suspected that the cloud service provider had made some kind of 'invisible' change that altered the behavior of the load balancer. Examining the logs revealed that the load balancer for the telemetry service, which had previously served HTTP/2 connections, was for some reason serving HTTP/3 connections. At 9:12, Mozilla explicitly disabled HTTP/3 on GCP, which resolved the problem for users, Holler said.

Investigating the root cause of the problem, Mozilla found that two components were involved: 'Necko', the network stack that handles HTTP/3 connections, and 'viaduct', an intermediate library through which Rust components that need direct network access call into Necko.

Inside Necko, an HTTP/3 upload request is checked for a 'Content-Length' header, and the header is added automatically if it is missing. The lower-level HTTP/3 code uses this header to determine the size of the request. However, when a request first passes through viaduct, each header name is converted to lowercase before it is handed to Necko, and this is where the problem arises.

Necko's check is case-insensitive, but the lower-level HTTP/3 code is case-sensitive. As a result, when a request that already carries a Content-Length header goes through viaduct, Necko's check finds the header but the HTTP/3 code does not. Because of this discrepancy, Necko considered the request complete while the actual request body remained unsent, and the code entered an infinite loop. Holler explains that this blocked all network communication, so web pages could not be displayed.
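The mismatch can be illustrated with a minimal Python sketch. It is not Firefox code: the functions `viaduct_pass`, `necko_ensure_content_length`, and `http3_body_size` are hypothetical stand-ins for the three layers described above, and headers are modeled as a plain list of name/value pairs.

```python
# Hypothetical model of the three layers; none of these names exist in Firefox.

def viaduct_pass(headers):
    """Lowercase every header name, as viaduct is described as doing."""
    return [(name.lower(), value) for name, value in headers]


def necko_ensure_content_length(headers, body):
    """Case-insensitive check: add Content-Length only if no variant exists."""
    if not any(name.lower() == "content-length" for name, _ in headers):
        headers = headers + [("Content-Length", str(len(body)))]
    return headers


def http3_body_size(headers):
    """Case-sensitive lookup: only the exact name 'Content-Length' matches."""
    for name, value in headers:
        if name == "Content-Length":
            return int(value)
    return None  # size unknown to the lower layer


body = b"telemetry payload"

# Request without the header: the check adds it and the lower layer finds it.
direct = necko_ensure_content_length([], body)
print(http3_body_size(direct))  # prints 17

# Request through viaduct with the header already set: the case-insensitive
# check is satisfied, but the case-sensitive lookup misses the lowercased name.
via = necko_ensure_content_length(
    viaduct_pass([("Content-Length", str(len(body)))]), body
)
print(http3_body_size(via))  # prints None
```

In the second case the upper layer believes the header is present while the lower layer cannot determine the size, which is the kind of disagreement the article describes as leading to the infinite loop.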



Holler says Mozilla learned several lessons from this incident.

・ Deepen cooperation with cloud service providers
Had GCP announced that HTTP/3 would become the default, the damage could have been mitigated, even if the risk of an incident could not have been eliminated entirely. Mozilla is therefore working with GCP to improve the situation.

・ Make the settings explicit
One of the causes of this problem was that Mozilla left the GCP load balancer's HTTP/3 setting at 'Automatic' rather than explicitly choosing 'Enabled' or 'Disabled'. Mozilla says it is reviewing all service configurations to prevent similar mistakes in the future.

・ Configuration and component testing
The combination of HTTP/3 and viaduct on Firefox desktop that caused this problem was not one that Mozilla had tested. Although it is not possible to test every combination of configurations and components, Mozilla recognizes the need to run more system tests across different HTTP versions.
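As a rough illustration of the third lesson, the Python sketch below enumerates a small test matrix of HTTP versions and request paths. The labels in `HTTP_VERSIONS` and `REQUEST_PATHS` and the `test_matrix` helper are assumptions made for this example and do not describe Mozilla's actual test setup.

```python
from itertools import product

# Illustrative labels only; the real set of configurations is much larger.
HTTP_VERSIONS = ["HTTP/1.1", "HTTP/2", "HTTP/3"]
REQUEST_PATHS = ["direct", "via viaduct"]


def test_matrix():
    """Return every (HTTP version, request path) pair to cover."""
    return list(product(HTTP_VERSIONS, REQUEST_PATHS))


for version, path in test_matrix():
    print(f"upload test: {version}, {path}")
```

Even this tiny matrix includes the HTTP/3 and viaduct pairing, and it also shows how quickly the number of cases grows as dimensions are added.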

At the end of the blog post, Holler says that learning as much as possible from this incident will help improve the quality of Mozilla's products, and thanks all the users who sent crash reports, collaborated on the bug tracker Bugzilla, and helped others work around the problem.

in Web Service, Posted by log1h_ik