Google Explains More About The Major Failures On YouTube And Google Cloud



On June 2, 2019, a large-scale failure occurred in Google's network, and in some areas, services provided by Google and various web services using Google Cloud might become unavailable or heavy. It has evolved. The Google Cloud engineers team explains on the official blog about the cause of such a large-scale failure.

An update on Sunday's service disruption | Google Cloud Blog

Google Cloud Status Dashboard

The failure that occurred on June 2, 2019 in Pacific Standard Time was less affected in Japan, but in some parts of the United States and Europe, Google provided services such as Google Cloud, YouTube, GSuite, Google such as Discord and iCloud It has had a major impact on the use of web services that use the Cloud.

Massive failure in Google Cloud, many services such as YouTube and Gmail are affected-GIGAZINE

In the official blog, Benjamin Solos, vice president of Google Cloud's surveillance team Google 24x7, said, “The configuration changes that were supposed to apply to servers in a particular region are mistakenly on servers in multiple adjacent regions. Is also the cause of being applied. ' Also, it is said that this case is also affected by a combination of management software misconfiguration and bugs.

Within the data center, Google's machines are separated into multiple logical clusters. Each of these clusters has its own management software, which enables recovery from failures, infrastructure changes, and automatic execution of data center maintenance events. When setting up maintenance as an event in Google's data center, it is often said to be a global maintenance, and it is rare to maintain only servers in a certain region.



When I set an event to stop the control plane of the network for maintenance on a server in a specific region this time, a maintenance event is started at 11:45 on June 2 and at the same time a bug in the management software It seems that the stop setting has been applied to the servers in the adjacent region as well. As a result, the configuration is overwritten in the servers of multiple adjacent regions, and more than half of the available network capacity is not used, resulting in network congestion.

Google's engineering team started recovering work two minutes after the failure occurred. According to the plan, recovery was expected to be completed in a few minutes, but due to network congestion, management software debugging became difficult, and software that automates maintenance events about one hour and 16 minutes after it is finally stopped About. After that, the engineer team re-enables the control plane and support infrastructure, rebuilds the schedule settings and redistributes it. Reconfiguration of the server was completed at 14:03, network capacity recovered at 15:19, and all services were resumed at 16:10.

by Google

As a result of this failure, YouTube recorded a 2.5% decrease in browsing numbers in one hour and Google Cloud storage recorded a 30% decrease in traffic. Although it was said that 'only a few users' affected, millions of users still could not send and receive e-mail, Sloss says.

in Software,   Web Service, Posted by log1i_yk