Cloudflare reports major outage at data center, failover works with little impact to service



On March 26, 2024, a major power outage struck one of Cloudflare's data centers, yet most services were not affected at all. Cloudflare explained on its official blog how it kept the impact of the outage so small.

Major data center power failure (again): Cloudflare Code Orange tested
https://blog.cloudflare.com/major-data-center-power-failure-again-cloudflare-code-orange-tested



On November 2, 2023, about five months before this outage, Cloudflare suffered a power outage at a data center in the Portland, Oregon region. The November outage led to a failure in the control plane, which controls data transmission routes, for the following 14 services, leaving them unavailable for at least six hours.

・API and Dashboard
・Zero Trust
・Magic Transit
・SSL
・SSL for SaaS
・Workers
・KV
・Waiting Room
・Load Balancing
・Zero Trust Gateway
・Access
・Pages
・Stream
・Images

Cloudflare took the massive outage seriously and declared a 'Code Orange.'

'Code Orange' is modeled on a practice at Google: when a threat to the survival of the business arises, Google declares a 'Code Yellow' or 'Code Red' depending on the severity of the threat and prioritizes efforts to resolve it. Cloudflare chose 'orange' to match the color of its logo.

After declaring a Code Orange, the team's top priority was to create a system that would ensure service would continue as smoothly as possible even if a major data center facility experienced another catastrophic failure.



About five months after the first outage, at 14:58 on March 26, 2024, the same data center experienced another outage. As before, the team was alerted immediately that connectivity to the data center had been lost. Unlike the previous outage, however, the Cloudflare team was able to quickly determine that the cause was a loss of power.

Cloudflare's control plane consists of hundreds of internal services, which are expected to keep operating from the remaining two facilities even if one of the three data centers in the Portland, Oregon area is lost. Since the previous outage, Cloudflare had spent months preparing and testing automatic failover to redundant facilities. As a result, most services were either not affected at all during this outage or, if affected, were restored within minutes.

The one exception was the analytics platform used to understand user traffic, which still relied on the affected data center and was not fully restored until late that night. Cloudflare had started building a new analytics platform immediately after the previous outage, but had not yet completed it because of its large scale. The platform is expected to be completed in the near future, at which point no service will depend on a single data center.
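
The difference between services that failed over and the one that did not can be illustrated with a minimal sketch. This is not Cloudflare's implementation. The facility names (`dc-a`, `dc-b`, `dc-c`), the `Facility` class, and the `pick_facility` function are all hypothetical and exist only to show why a service pinned to a single facility stays down while a redundant one keeps running.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Facility:
    name: str       # hypothetical facility label, not a real Cloudflare name
    healthy: bool   # False while the facility has no power


def pick_facility(facilities: List[Facility],
                  pinned_to: Optional[str] = None) -> Optional[str]:
    """Return the name of a facility that can serve a request, or None."""
    if pinned_to is not None:
        # A service tied to one facility has no alternative to fail over to.
        candidates = [f for f in facilities if f.name == pinned_to]
    else:
        # A redundant service may run from any of the facilities.
        candidates = facilities
    for facility in candidates:
        if facility.healthy:
            return facility.name
    return None


# One of three facilities has lost power (assumed scenario).
facilities = [
    Facility("dc-a", healthy=False),
    Facility("dc-b", healthy=True),
    Facility("dc-c", healthy=True),
]

print(pick_facility(facilities))                    # dc-b
print(pick_facility(facilities, pinned_to="dc-a"))  # None
```

In this sketch the redundant service is served from the next healthy facility, while the pinned service has nowhere to go until its facility comes back.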



In addition, the time needed to cold-start the data center after losing power fell from 72 hours in the previous outage to 10 hours this time. Cloudflare says it will shorten the cold start time further by continuing to improve the procedure.

The cause of the blackout was the simultaneous failure of four power distribution panels at the data center. Initial assessment of the incident indicated that the chain of failures was caused by a misconfiguration of the breakers.

Cloudflare said its 'efforts over the past four months have delivered the results we expected,' and concluded by saying it remains 'committed to completing the remaining work.'

in Software, Web Service, Hardware. Posted by log1d_ts