Oct 05, 2021 11:47:00

Facebook, Instagram, Oculus, and WhatsApp went down worldwide: what was the cause?

At around 0:40 (Japan time) on October 5, 2021, Facebook's systems failed and all of its services went down. As a result, not only Facebook but also Instagram, WhatsApp, Messenger, Oculus, and other Facebook-owned services suffered outages and were inaccessible until around 7:00 on the same day. Internet infrastructure company Cloudflare explains why Facebook went down globally.

Understanding How Facebook Disappeared from the Internet

https://blog.cloudflare.com/october-2021-facebook-outage/

Facebook is scrambling to fix massive outage - The Verge
https://www.theverge.com/2021/10/4/22709575/facebook-outage-instagram-whatsapp

BGP Explained: the protocol that may be behind Facebook's disappearance - The Verge
https://www.theverge.com/2021/10/4/22709260/what-is-bgp-border-gateway-protocol-explainer-internet-facebook-outage

Because of the system failure, Facebook and its related services such as WhatsApp and Instagram became inaccessible. DNS name resolution for these services began to fail, and some of the IP addresses of the infrastructure underpinning the services also became unreachable. 'It was as if someone had pulled the cables from their data centers all at once and disconnected them from the Internet,' Cloudflare said.

Cloudflare, which was affected by Facebook's failure, initially suspected a problem with its own DNS resolver 1.1.1.1, and at 1:51 (Japan time) on October 5 it opened an internal incident titled 'Facebook DNS lookup returning SERVFAIL'.
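
For readers unfamiliar with SERVFAIL: it is one of the response codes carried in the header of every DNS reply, and a resolver typically returns it when it cannot obtain an answer from the authoritative servers. The following is a minimal Python sketch, not Cloudflare's code, that reads this code from the 12-byte DNS header described in RFC 1035. The helper names `rcode_of` and `make_header` and the sample query ID are made up for illustration.

```python
import struct

# Response codes (RCODE) from RFC 1035, section 4.1.1.
RCODE_NAMES = {
    0: "NOERROR",
    1: "FORMERR",
    2: "SERVFAIL",
    3: "NXDOMAIN",
    4: "NOTIMP",
    5: "REFUSED",
}


def rcode_of(message: bytes) -> str:
    """Return the response code name stored in a DNS message header."""
    if len(message) < 12:
        raise ValueError("DNS message is shorter than its 12-byte header")
    # Header layout: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT (16 bits each).
    _id, flags, _qd, _an, _ns, _ar = struct.unpack("!6H", message[:12])
    code = flags & 0x000F  # RCODE is the low 4 bits of the flags word
    return RCODE_NAMES.get(code, f"RCODE{code}")


def make_header(query_id: int, rcode: int) -> bytes:
    """Build a hypothetical response header (QR=1, RD=1, RA=1) for testing."""
    flags = 0x8000 | 0x0100 | 0x0080 | (rcode & 0x000F)
    return struct.pack("!6H", query_id, flags, 1, 0, 0, 0)


if __name__ == "__main__":
    # A made-up reply of the kind a resolver sends when the lookup fails.
    print(rcode_of(make_header(0x1234, 2)))
```

A reply whose code reads SERVFAIL tells the client only that the resolver failed to get an answer, which is why Cloudflare first had to rule out a fault in 1.1.1.1 itself.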

The investigation found that the cause was BGP (Border Gateway Protocol). BGP is a mechanism for exchanging route information between ASes (autonomous systems), that is, between the networks that make up the Internet. Simply put, it plays a role like a car navigation system that shows the route when you enter a destination.

Each AS has its own AS number. Every AS must use BGP to announce its routes to the Internet; otherwise no one can discover it or connect to it. The AS number for Facebook, Instagram, and WhatsApp is AS32934, which can be looked up on the following page. Facebook has its own AS number and connects directly to the Internet without going through an Internet service provider.

PeeringDB
https://www.peeringdb.com/net/979
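
To make the idea of announcing routes concrete, here is a toy Python sketch of the table a router might build from BGP announcements and withdrawals. It is a simplification under stated assumptions, not how real BGP software is implemented: the `RouteTable` class is hypothetical, the prefix 192.0.2.0/24 is a reserved documentation range rather than a real Facebook prefix, and path selection is reduced to a longest-prefix match.

```python
import ipaddress


class RouteTable:
    """Toy view of the routes one router has learned, keyed by prefix."""

    def __init__(self):
        self.routes = {}  # ip_network -> origin AS number

    def announce(self, prefix: str, origin_as: int) -> None:
        """Record that origin_as says it can reach this prefix."""
        self.routes[ipaddress.ip_network(prefix)] = origin_as

    def withdraw(self, prefix: str) -> None:
        """Forget a prefix after its route has been withdrawn."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address: str):
        """Longest-prefix match: return the origin AS, or None if unreachable."""
        ip = ipaddress.ip_address(address)
        matches = [net for net in self.routes if ip in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


if __name__ == "__main__":
    table = RouteTable()
    table.announce("192.0.2.0/24", 32934)  # illustrative prefix, AS32934
    print(table.lookup("192.0.2.53"))      # the address is reachable
    table.withdraw("192.0.2.0/24")
    print(table.lookup("192.0.2.53"))      # no route left for the address
```

Once the only route covering an address is withdrawn, the lookup has nothing to return, which is the sense in which an AS that stops announcing cannot be reached by anyone.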

Cloudflare tracks all the BGP updates and announcements it sees across its global network. At around 0:40 on the 5th, when the outage was confirmed, it saw a peak of routing changes from Facebook.

A more detailed look at this peak shows repeated announcements and withdrawals of routes from Facebook, after which Facebook's DNS servers went offline at 1:50. Cloudflare engineers soon noticed that 1.1.1.1 could not resolve facebook.com, which is why they suspected a system failure. As a result, Facebook and its related services were effectively disconnected from the Internet.
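
A peak of this kind can be picked out of a series of per-minute update counts with a very simple rule. The Python sketch below flags any minute whose count exceeds a multiple of the median. It is only an illustration: the function name `find_spikes`, the factor of 5, and the sample numbers are all made up and are not Cloudflare's data or method.

```python
from statistics import median


def find_spikes(counts, factor=5.0):
    """Return indices of buckets whose count exceeds factor x the median."""
    baseline = max(median(counts), 1)  # avoid a zero threshold on quiet series
    threshold = factor * baseline
    return [i for i, count in enumerate(counts) if count > threshold]


if __name__ == "__main__":
    # Hypothetical BGP updates seen per minute from one AS.
    updates_per_minute = [3, 4, 2, 5, 3, 120, 95, 4, 3]
    print(find_spikes(updates_per_minute))
```

Using the median as the baseline keeps the threshold from being dragged upward by the spike itself, which a plain average would be.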

Cloudflare then noticed that Facebook had stopped announcing the routes to its DNS prefixes. At this point, at least Facebook's DNS servers were unavailable. As a result, Cloudflare's DNS resolver 1.1.1.1 could no longer respond to queries asking for the IP addresses of facebook.com or instagram.com.

So, @facebook's DNS is broken this morning ...

TL;DR: Google anycast DNS returns SERVFAIL for Facebook queries; querying https://t.co/0BDgaIHmlr directly times out. pic.twitter.com/3GHJ3mW0P0
— Jim Salter (@jrssnet) October 4, 2021

According to Cloudflare, Facebook and its related services are so large that an error can generate dozens of times more requests than usual, causing delays and timeouts. The number of requests for Facebook, WhatsApp, Messenger, and Instagram that 1.1.1.1 actually received rose sharply from around 15:40 UTC (0:40 Japan time) to nearly 30 times the usual level. Because DNS resolvers around the world could no longer resolve Facebook-related domains, Facebook-related DNS requests surged.
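
Cloudflare's post attributes part of this surge to apps and users retrying after an error. The arithmetic can be sketched in Python under a deliberately simple assumption: every failed lookup is retried immediately, up to a fixed limit. The function name `expected_queries` and the retry limit of 29 are hypothetical and chosen only to show how a total outage could multiply traffic; they are not measured values.

```python
def expected_queries(failure_rate: float, max_retries: int) -> float:
    """Expected lookups per user request when each failed lookup is retried.

    Attempt k is only made if the previous k attempts all failed, which
    happens with probability failure_rate ** k.
    """
    return sum(failure_rate ** attempt for attempt in range(max_retries + 1))


if __name__ == "__main__":
    # Normal operation: lookups succeed, so each request costs one query.
    print(expected_queries(0.0, 29))
    # Total outage: every lookup fails, so every allowed retry is used.
    print(expected_queries(1.0, 29))
```

In this model the multiplier depends only on how persistent the clients are, so a domain with a huge number of clients can flood resolvers even though no single client does anything unusual.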

Furthermore, in the aftermath of Facebook going down, DNS queries to other platforms such as Twitter, Signal, Telegram, and TikTok increased.

At 4:52 on the 5th, Facebook Chief Technology Officer Mike Schroepfer offered his 'sincere' apologies to everyone affected by the outage of Facebook-powered services, saying that the company was experiencing networking issues and that teams were working to debug and restore service as fast as possible. According to IT news site The Verge, Schroepfer wrote in an email to employees that the failure apparently affected the network backbone that interconnects all of the data centers.

*Sincere* apologies to everyone impacted by outages of Facebook powered services right now. We are experiencing networking issues and teams are working as fast as possible to debug and restore as fast as possible
— Mike Schroepfer (@schrep) October 4, 2021

BGP activity from Facebook's network was then confirmed to have resumed at around 6:00 on the 5th, and 1.1.1.1 was able to resolve 'facebook.com' again at around 6:20. Although it would take a little longer for all services to come back online, Facebook itself was confirmed to be restored at 6:28.

Regarding the outage, Facebook CEO Mark Zuckerberg said that Facebook, Instagram, WhatsApp, and Messenger were back online, apologized for the disruption, and acknowledged how much people rely on the company's services to stay connected.

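Security researcher Brian Krebs reported, citing a source involved in the recovery effort, that the recovery was delayed because the BGP update blocked remote users from reverting the change, while the people with physical access did not have network or logical access. However, it is not clear why the BGP update that caused the outage went wrong.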

From trusted source: Person on FB recovery effort said the outage was from a routine BGP update gone wrong. But the update blocked remote users from reverting changes, and people with physical access didn't have network / logical access. So blocked at both ends from reversing it.
— Briankrebs (@briankrebs) October 4, 2021

In addition, there are reports that when Facebook's systems went down, its internal systems went down as well, the security system for entering its buildings stopped working, and employees were locked out.

Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren't working to access doors.
— Sheera Frenkel (@sheeraf) October 4, 2021

・Postscript
Facebook's engineering team issued a statement saying that the outage was caused by configuration changes on the backbone routers that coordinate network traffic between its data centers.

Update about the October 4th outage - Facebook Engineering
https://engineering.fb.com/2021/10/04/networking-traffic/outage/

・Continued
Facebook executive explains why Facebook went down for 6 hours in terms that even non-experts can understand - GIGAZINE


Oct 05, 2021 11:47:00 in Web Service, Posted by log1i_yk