The AWS outage that threw the Internet into chaos was caused by a typing mistake


By Nicola

On March 1, 2017 (Japan time), a large-scale failure occurred in "S3", the cloud storage service offered by Amazon, causing major disruption across the Internet. AWS has published its report on the outage, which lasted about 4 hours, and it turns out that the cause was a typing mistake.

Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region
https://aws.amazon.com/jp/message/41926/


How a single typo brought the web to its knees
https://thenextweb.com/insider/2017/03/02/single-typo-brought-web-knees/

Amazon AWS S3 internet outage caused by engineer typing wrong command - Business Insider
http://www.businessinsider.com/amazon-aws-internet-outage-caused-by-engineer-typing-wrong-command-2017-3

According to the AWS report, the outage was caused by an engineer's typing mistake. The engineer intended to run a command that would remove a small number of servers from one of the S3 subsystems, but entered the command incorrectly, and more servers were stopped than planned. Entering this kind of command to stop servers was routine work done every day, but this time it caused major disruption across the Internet.

Details of the large-scale S3 failure can be found in the following article.

Failure of Amazon's AWS "S3" caused great disruption on the net, and its effects appear to linger even after recovery - GIGAZINE


The servers that were stopped included ones supporting two S3 subsystems. One is the index subsystem, which manages the metadata and location information for everything stored in S3. The other is a subsystem that manages the allocation of new storage, and it cannot operate correctly unless the index subsystem is working. Recovering from the failure required restarting these systems, but AWS had not fully restarted them in a long time, so it took a long time for the restart to complete and for service to be restored.

AWS has modified the tool that accepted the erroneous command so that it now removes capacity more slowly. It has also added safeguards that prevent servers from being removed when doing so would take a subsystem below its minimum required capacity. The report closes with an apology for the large-scale trouble and inconvenience caused to many users.
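The two safeguards described above can be illustrated with a minimal sketch. This is not AWS's actual tool, whose implementation is not public; the names `Subsystem`, `plan_removal`, `min_required`, and `max_per_step` are all hypothetical, and the numbers are made up for illustration. The sketch refuses any request that would drop a subsystem below its minimum capacity, and splits accepted requests into small batches so that capacity is removed slowly.

```python
from dataclasses import dataclass
from typing import List


class CapacityGuardError(Exception):
    """Raised when a removal request would violate the capacity floor."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical floor needed to keep serving requests


def plan_removal(subsystem: Subsystem, requested: int, max_per_step: int = 2) -> List[int]:
    """Return the batch sizes for removing `requested` servers, or raise if unsafe."""
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")

    # Safeguard 1: never let the subsystem fall below its minimum capacity.
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityGuardError(
            f"removing {requested} servers would leave {subsystem.name} with "
            f"{remaining}, below the minimum of {subsystem.min_required}"
        )

    # Safeguard 2: remove capacity slowly, a few servers per step.
    batches = []
    left = requested
    while left > 0:
        step = min(max_per_step, left)
        batches.append(step)
        left -= step
    return batches


if __name__ == "__main__":
    index = Subsystem(name="index", active_servers=100, min_required=80)
    print(plan_removal(index, 5))  # a small, intended removal is split into batches
    try:
        plan_removal(index, 50)  # a mistyped, much larger number is rejected
    except CapacityGuardError as err:
        print("rejected:", err)
```

With these made-up numbers, the request for 5 servers is accepted and spread over three steps, while the request for 50 is rejected before any server is touched.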

in Web Service, Posted by darkhorse_log