A Facebook executive explains why Facebook went down for 6 hours, in terms that even non-experts can understand



On October 5, 2021 (Japan time), a failure in Facebook's systems took down not only the company's social network but all Facebook services, including Instagram, WhatsApp, Messenger, and Oculus. Santosh Janardhan, Facebook's vice president of engineering and infrastructure, explains the cause of the failure in easy-to-understand terms, with explanations of the jargon.

More details about the October 4 outage --Facebook Engineering

https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/

The following article covers the initial situation and the recovery process of the October 5 failure, in which Facebook's services were down for about 6 hours.

Facebook, Instagram, Oculus, WhatsApp are down worldwide, what is the cause? --GIGAZINE



According to Janardhan, the failure was caused by the system that manages the capacity of the trunk network known as the 'backbone'. As the word (which also means 'spine') implies, the backbone is the network that underpins Facebook's services, and all data that users exchange with Facebook apps is processed in Facebook's data centers via this backbone network.

The failure on October 5th occurred during maintenance of this backbone network. When a Facebook engineer performing the maintenance issued a command to check the available capacity of the global backbone network, every connection in the backbone suddenly went down, and Facebook's data centers around the world were disconnected from the network. Facebook has a tool that audits commands affecting the entire network to prevent exactly this kind of mistake, but a bug kept the tool from working this time, so the command was not stopped.
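
The role of such an audit tool can be illustrated with a short sketch. This is not Facebook's tool, whose internals have not been published: the `Command` class, the `links_taken_down` field, and the 25% threshold are all assumptions made up for illustration. The idea is simply that a command is checked before it runs, and anything that would take down too much of the backbone is blocked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    """A maintenance command (hypothetical model, not Facebook's format)."""
    name: str
    links_taken_down: frozenset  # backbone links the command would disable


def audit(command, all_links, max_fraction=0.25):
    """Return True if the command may run, False if it must be blocked.

    The 25% threshold is an illustrative assumption.
    """
    if not all_links:
        return False
    affected = command.links_taken_down & all_links
    return len(affected) / len(all_links) <= max_fraction


def run(command, all_links, audit_fn=audit):
    """Execute the command only if the audit function approves it."""
    if not audit_fn(command, all_links):
        return "blocked"
    return "executed"


backbone = frozenset({"link-a", "link-b", "link-c", "link-d"})
drain_one = Command("drain-one-link", frozenset({"link-a"}))
drain_all = Command("assess-capacity", backbone)

print(run(drain_one, backbone))  # a small change passes the audit
print(run(drain_all, backbone))  # a network-wide change is blocked
# A buggy audit that approves everything lets the same command through.
print(run(drain_all, backbone, audit_fn=lambda cmd, links: True))
```

The last call shows why a bug in the audit step matters: the safeguard exists, but the dangerous command still runs.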



The loss of the backbone network that connects the large data centers also affected the small facilities that connect the data centers to users. One of the jobs of these small facilities is 'responding to DNS queries.' DNS is the so-called address book of the Internet, translating the simple addresses that users type into their browsers into the IP addresses of specific servers. The IP addresses of the DNS servers themselves are announced to the rest of the Internet through a protocol called 'Border Gateway Protocol (BGP)'.

At Facebook's small facilities, when a DNS server cannot communicate with the data centers, it stops the exchange of routing information (advertising) over BGP. This is done to keep the network operating reliably, because losing contact with the data centers indicates an unhealthy connection. In this failure, however, the entire backbone was down, so all of Facebook's DNS servers ended up in a state where they 'cannot be reached even though they are up'.
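
The chain described above can be modeled in a few lines. This is a toy simulation under stated assumptions, not a real DNS or BGP implementation: the `EdgeDnsServer` class, the `resolve` function, the host name, and the address `192.0.2.10` (taken from a range reserved for documentation) are all hypothetical.

```python
class EdgeDnsServer:
    """Toy model of a DNS server at a small facility (hypothetical)."""

    def __init__(self, name, records):
        self.name = name
        self.records = records   # host name -> IP address
        self.running = True      # the server process itself is up
        self.advertised = True   # its route is announced via BGP

    def health_check(self, backbone_up):
        # Stop advertising when the data centers cannot be reached.
        self.advertised = self.running and backbone_up

    def answer(self, hostname):
        return self.records.get(hostname)


def resolve(hostname, servers):
    """Ask the first server that the rest of the Internet can still reach."""
    for server in servers:
        if server.running and server.advertised:
            return server.answer(hostname)
    return None  # no reachable server: the name cannot be resolved


records = {"www.example.com": "192.0.2.10"}
servers = [EdgeDnsServer("edge-1", records), EdgeDnsServer("edge-2", records)]

before = resolve("www.example.com", servers)

# The backbone goes down: every server withdraws its advertisement.
for server in servers:
    server.health_check(backbone_up=False)

after = resolve("www.example.com", servers)
still_running = all(server.running for server in servers)

print(before, after, still_running)
```

In the model, every server is still running after the backbone fails, yet no name can be resolved, because no server is advertised any longer.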

As a result of this near-instant chain of failures, Facebook was hit by two problems at once: 'the networks of the large data centers were down, so the data centers could not be accessed in the usual way' and 'the DNS at the small facilities was down, which broke the in-house tools used for failure investigation and recovery'. This delayed the response. In addition, Facebook's data centers are designed with a high level of both physical and system security, so it took time for the engineers dispatched to the data centers to begin recovery work, which contributed to prolonging the outage.



On the other hand, Janardhan reflects that, thanks to the drills held in preparation for large-scale outages, the team was fortunately able to bring the backbone network back online as quickly as possible while preventing crashes caused by the surge of all systems recovering at once.
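
One common way to avoid such a surge is to restore load in steps rather than all at once. The sketch below only illustrates that general idea. Facebook's actual recovery procedure is not described in this article, and the function name `ramp_schedule` and the numbers used are assumptions.

```python
def ramp_schedule(target_load, max_step):
    """Return the load levels to step through on the way to target_load.

    Each step raises the load by at most max_step, so the restored
    systems never see the full load arrive in a single jump.
    """
    if target_load < 0 or max_step <= 0:
        raise ValueError("target_load must be >= 0 and max_step > 0")
    schedule = []
    load = 0
    while load < target_load:
        load = min(load + max_step, target_load)
        schedule.append(load)
    return schedule


# Restore 100% of traffic in increments of at most 30 points.
steps = ramp_schedule(100, 30)
print(steps)  # [30, 60, 90, 100]
```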

Janardhan commented on the lessons learned from this failure: 'We have extensively hardened our systems to prevent unauthorized access, but this outage was caused by our own mistake rather than by a malicious hack. It was thought-provoking that this hardening got in our way when we tried to recover from the failure. Our task from here on is to strengthen our testing, our drills, and our overall resilience to make sure that problems like this one do not happen again.'

in Web Service,   Security, Posted by log1l_ks