Google explains the cause of, and countermeasures for, the large-scale outage that hit Gmail, Google Drive, and other services

Google has published the details, cause, and countermeasures for the large-scale outage that affected Gmail, Google Drive, and other G Suite and Google Cloud Platform services between 8:55 pm local time on August 19, 2020 and 3:30 am on August 20, 2020.

Google Cloud Issue Summary
(PDF file)

◆ Details of failure
The impact on each service was as follows.

・Gmail
Some users were unable to use Gmail, experiencing delays in email delivery and errors when adding attachments to messages. About 0.73% of Gmail users received three or more errors during the outage, and about 27% of the affected Gmail users were G Suite users. The impact on Gmail was mitigated by 3:30 am on August 20, 2020, and all messages delayed by the outage have since been delivered.

・Google Drive
Some Google Drive users experienced errors and delays. About 1.5% of Google Drive users who were active within 24 hours of the outage experienced three or more errors.

・Google Docs
Some Google Docs users had problems with image-related operations such as uploading images, copying documents that contain images, and using templates that contain images.

・Google Sites
Some users had problems creating new websites, adding new pages to websites, or uploading images to websites. In addition, the error rate when creating a website from a template was almost 100% during the outage. The impact on Google Sites was mitigated by 3:00 am on August 20, 2020.

・Google Chat
Approximately 2% of Google Chat users who tried to send a message, and 16% of users who tried to forward a message to Gmail, encountered errors.

・Google Meet
During the outage, live streaming was completely down, and delivery to YouTube was delayed. The disruption to Google Meet lasted from 9:00 pm to 9:15 pm on August 19, 2020, and from 1:40 am to 2:10 am on August 20, 2020.

・Google Keep
HTTP 500 Internal Server Error responses were returned to some Google Keep users, and media operations were delayed.

・Google Voice
SMS messages containing attachments could not be delivered, and some voicemail, call recording, and SMS deliveries were delayed. The impact on Google Voice was mitigated by 3:20 am on August 20, 2020, and all voicemail and recordings were delivered, with a maximum delay of 5.5 hours.

・Google Jamboard
Some users encountered an error when trying to upload an image or copy a document that contained an image.

・G Suite Admin Console
Some users encountered errors when uploading CSV files in the G Suite admin console. The error rate during the outage ranged from 15% to 40%.

・Google App Engine
App Engine standard applications calling the Blobstore API saw increased error rates. Peak error rates were below 5% in many regions, but reached about 47% in us-west1 (US West Coast) and 13% in us-central1 (central US). App Engine standard applications calling the Images API saw error rates of up to 66%. Inbound HTTP requests served by static files or Blobstore objects also saw an elevated error rate, peaking at approximately 1%.

The impact on App Engine was mitigated by 3:25 am on August 20, 2020. During the outage, deploying an application that contains static files failed with the following message:

The following errors occurred while copying files to App Engine: File failed with: Failed to save static file.

・Cloud Logging
Writing log messages to Google Cloud Logging, including Google-generated logs such as App Engine request logs, activity logs, and audit logs, was delayed by up to 4 hours and 43 minutes. The log backlog was fully processed by 4 pm on August 20, 2020. During the outage, log write and read API calls succeeded, but reads returned incomplete results.

・Cloud Storage
Approximately 1% of API calls to Google Cloud Storage buckets in the US multi-region failed. The errors were fully resolved by 12:31 am on August 20, 2020.

◆Cause of failure
According to the Google Cloud team, many Google services share a common internal distributed storage system for binary large objects (BLOBs). The system consists of a front end that interfaces with Google's client services, a middle tier that handles metadata operations, and a storage back end that holds the BLOBs themselves. When a client sends a request to the front end, the metadata operations are forwarded to the metadata service, which in turn communicates with the storage service.

The Google Cloud team described the cause of the outage as follows: 'Increased traffic from Google services overloaded the metadata service, causing its tasks to malfunction and request latency to rise. The higher latency triggered excessive retries of operations, which led to resource exhaustion. The system automatically tried to start new metadata tasks, but because resources were exhausted, the new tasks could not be allocated enough resources. The problem was made worse by the system's strategy of cancelling and retrying failed requests, which had a multiplying effect on traffic.'
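The multiplying effect of retries described above can be illustrated with a little arithmetic. The following is a minimal sketch, not a model of Google's actual system: it assumes every failed attempt is retried immediately, and the function name, request rate, and retry limit are all hypothetical figures chosen for illustration.

```python
def offered_load(base_rps: float, failure_rate: float, max_retries: int) -> float:
    """Total requests per second reaching a backend when every failed
    attempt is retried immediately, up to max_retries times.

    All parameters are hypothetical; none come from Google's report.
    """
    # Attempt k is only made if the previous k attempts all failed,
    # so the total load is a geometric series in failure_rate.
    return base_rps * sum(failure_rate ** k for k in range(max_retries + 1))


if __name__ == "__main__":
    for failure_rate in (0.0, 0.5, 0.9, 1.0):
        load = offered_load(base_rps=1000, failure_rate=failure_rate, max_retries=3)
        print(f"failure rate {failure_rate:.0%}: {load:.0f} requests/s")
```

With these assumed numbers, a fully failing backend receives four times its normal traffic. In a real system the failure rate itself tends to rise with load, so the feedback loop can be worse than this fixed-rate sketch suggests.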

The Google Cloud team also explained why Google Cloud Storage (GCS) was less affected than other services: 'GCS is built on the same BLOB storage system as other services, but the GCS metadata layer is largely isolated from the Google-internal metadata service that failed. The migration to isolate GCS metadata is still in progress only in the US multi-region, and has been completed everywhere else. As a result, the impact on GCS users was limited to the US multi-region.'

◆ Recurrence prevention measures
Google has announced the following measures to prevent recurrence.

・Increase the computing resources allocated to the BLOB metadata service until the root cause is completely fixed.
・Investigate and improve the health checks run when a metadata service task starts, so that tasks do not stop prematurely before resources have been provisioned.
・Evaluate and improve the backoff and retry procedures used when metadata operations fail.
・Fix an issue in which a single error can cause cancellation requests to flood all replicas of a resource.
・Improve the auto-scaling alerts used by the BLOB storage system so that problems with task startup and resource allocation are detected early.
・Implement comprehensive rate limiting for requests to the BLOB storage service.
・Add instrumentation that enables effective debugging of BLOB operations.
・Improve the speed, efficiency, and automation of resource transfers between tasks.
・Improve the internal manual for rate limiting the BLOB storage service.
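Google's report does not say how its backoff and retry procedures will change. As a general illustration of the technique named in the list above, the following minimal sketch shows capped exponential backoff with full jitter, a common way to keep clients from retrying in lockstep against an overloaded service. `TransientError`, `backoff_delays`, `call_with_backoff`, and all default values are hypothetical names and numbers, not part of any Google API.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type standing in for a retryable failure."""


def backoff_delays(base=0.1, cap=10.0, max_retries=5, rng=random.random):
    """Yield one capped, fully jittered delay (in seconds) per retry."""
    for attempt in range(max_retries):
        # The ceiling doubles on every retry but never exceeds the cap.
        ceiling = min(cap, base * 2 ** attempt)
        # Full jitter: pick a random delay between 0 and the ceiling.
        yield rng() * ceiling


def call_with_backoff(operation, sleep=time.sleep, **backoff_kwargs):
    """Call operation(), retrying on TransientError with jittered backoff.

    Re-raises the last TransientError once the retries are used up.
    """
    delays = backoff_delays(**backoff_kwargs)
    while True:
        try:
            return operation()
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error
            sleep(delay)


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky():
        # Fails twice, then succeeds, to exercise the retry path.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise TransientError("metadata service busy")
        return "ok"

    print(call_with_backoff(flaky, base=0.01, cap=0.1))
```

Capping the number of retries bounds how much extra traffic one failing request can generate, and the random jitter spreads retries out over time instead of sending them in synchronized waves.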

The Google Cloud team commented, 'We are committed to improving our technology and operations quickly and continuously to prevent service interruptions. We apologize for any inconvenience caused to our customers.'

in Software, Web Service, Posted by darkhorse_log