AWS Officially Explains the Massive Failure That Affected Nintendo, Netflix, and More



On December 7, 2021 local time, a large-scale failure occurred in AWS (Amazon Web Services), the cloud computing service provided by Amazon, affecting the video streaming services Amazon Prime Video, Netflix, and Disney+, the cryptocurrency exchange Coinbase, the communication tool Slack, the stock trading app Robinhood, Nintendo's network services, and more. AWS has published an explanation on its official website of how the failure occurred and how its operators responded.

Summary of the AWS Service Event in the Northern Virginia (US-EAST-1) Region
https://aws.amazon.com/message/12721/



The failure occurred on December 7, 2021 local time (December 8, Japan time) in the Northern Virginia (US-EAST-1) region. It reportedly affected various external services that run on AWS and also caused problems with Amazon's own internal tools and warehouse operations.

A large-scale failure on AWS affects Amazon's parcel delivery - GIGAZINE



The AWS network is divided into a main AWS network, which hosts most AWS services and all customer applications, and an internal network, which hosts systems such as monitoring, internal DNS, and authentication services; the two networks communicate with each other to make up AWS.

At 7:30 AM PST on December 7, an automated activity to scale the capacity of an AWS service hosted on the main AWS network triggered unexpected behavior in a large number of clients on the internal network. This caused a surge in connection activity between the internal network and the main network, overwhelming the networking devices that link the two and delaying communication. AWS explains that the increased latency and errors prompted repeated connection attempts and retries, which kept the devices connecting the two networks in a state of persistent congestion and degraded performance.
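What AWS describes here is, in effect, a retry storm: once latency rose, clients retried aggressively in unison and kept the links between the two networks saturated. The standard client-side mitigation for this pattern is capped exponential backoff with jitter. The Python sketch below only illustrates that general technique; its function names and parameters are hypothetical and are not taken from AWS's internal code.

    import random
    import time

    def call_with_backoff(request, max_attempts=5, base_delay=0.1, max_delay=5.0):
        """Retry a callable with capped exponential backoff and full jitter.

        Illustrative only: `request` is any callable that raises on failure.
        Aggressive, un-jittered retries from many clients at once are what
        turn a brief latency spike into sustained congestion.
        """
        for attempt in range(max_attempts):
            try:
                return request()
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # give up after the final attempt
                # Cap the exponential delay, then pick a random point within it
                # ("full jitter") so many clients do not retry in lockstep.
                delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
                time.sleep(delay)

Without the jitter, every client that failed at the same moment would also retry at the same moment, reproducing exactly the kind of synchronized surge described above.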

AWS operators set out to resolve the issue, but the failure also knocked out the real-time monitoring data used by the internal operations team, forcing them to rely on logs to identify and fix the problem. Operators identified the internal DNS errors behind the congestion and resolved them at 9:28 AM, but this alone did not fully eliminate the impact on AWS.

Operators continued remediation work, such as identifying the top sources of traffic and moving them onto dedicated network devices, disabling some network-heavy services, and adding network capacity. However, real-time monitoring data was still unavailable, internal systems deployed on the internal network were also impaired, and the team had to proceed carefully so as not to disturb AWS services that were still operating normally, so resolving the series of problems took time.
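The first of those steps, finding the heaviest sources of traffic so they can be shifted onto dedicated devices, is a "top talkers" analysis. The sketch below shows the idea using flow-log-style records; the record format and function are hypothetical stand-ins, not the telemetry AWS actually used.

    from collections import Counter

    def top_talkers(flow_records, n=10):
        """Aggregate bytes by source address and return the heaviest senders.

        `flow_records` is assumed to be an iterable of (src_addr, dst_addr, bytes)
        tuples, e.g. parsed from VPC-flow-log-style data; this is an illustrative
        stand-in for whatever internal telemetry the operators relied on.
        """
        bytes_by_src = Counter()
        for src, _dst, num_bytes in flow_records:
            bytes_by_src[src] += num_bytes
        return bytes_by_src.most_common(n)

    # Example: the noisiest sources surface first.
    records = [("10.0.0.1", "10.1.0.9", 500), ("10.0.0.2", "10.1.0.9", 12000),
               ("10.0.0.1", "10.1.0.7", 800), ("10.0.0.3", "10.1.0.9", 300)]
    print(top_talkers(records, n=2))  # [('10.0.0.2', 12000), ('10.0.0.1', 1300)]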

In the end, operators dramatically reduced the network congestion by 13:34, and all network devices had recovered by 14:22. The failure affected many AWS services: login errors to the AWS Management Console were fully resolved at 14:22, the Amazon Secure Token Service (STS) used for authentication fully recovered at around 16:28, Amazon API Gateway, the API management service, was mostly recovered at 16:37, error rates in AWS Fargate, which runs containers, returned to normal at 17:00, and the contact center service Amazon Connect returned to normal operation at 16:41, according to AWS.



Following this major failure, AWS says it has disabled the automated scaling activity that triggered the event and will not resume it until all fixes have been deployed. In addition, the backoff behavior designed to help the system recover did not work as intended because of behavior that had not been observed before, and AWS will deploy a fix for this problem over the next two weeks. AWS also explained that it has put additional network configuration in place to protect the affected network devices even if a similar event occurs again.

In this failure, the impact on the real-time monitoring system delayed AWS's understanding of what was going on, and the network congestion also held up updates to the customer-facing Service Health Dashboard. Saying, 'We understand that these failures can be more impactful and frustrating when information about what is happening is not available,' AWS says it plans to release a new, improved version of the Service Health Dashboard in early 2022.

in Web Service, Security, Posted by log1h_ik