What caused Slack's massive outage in May 2020?



One of the worst things that can happen to a web service is an outage, yet even major internet companies such as Google and Cloudflare suffer them from time to time. Slack engineer Laura Nolan has explained what caused the Slack service outage of May 2020.

A Terrible, Horrible, No-Good, Very Bad Day at Slack
https://slack.engineering/a-terrible-horrible-no-good-very-bad-day-at-slack-dfe05b485f82

The incident began affecting Slack users at 16:45 Pacific time on May 12, 2020, but the chain of events actually started at 8:30 that morning. The traffic team, which monitors network traffic, was alerted that some API requests were failing, while the DBRE team, which works on database reliability, received alerts that the load on the database infrastructure had risen sharply. The increased database load was traced to a configuration change, which was rolled back immediately, resolving that problem. According to Nolan, the impact on users at this point was minimal.



Slack had been significantly increasing the number of cloud instances in its web application tier to cope with the surge in users working from home during the COVID-19 lockdowns. The web tier autoscales when its workers run low on processing capacity; in this case, workers were stuck waiting for slow database requests to complete, so measured utilization climbed, autoscaling kicked in, and the instance count grew by 75%. As a result of the 8:30 incident, Slack ended up running more web application hosts than at any point in its history.
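As a rough illustration of why blocked workers can trigger a scale-up, here is a minimal Python sketch of a generic utilization-driven (target-tracking) scaling rule. The function, thresholds, and numbers are assumptions for the example; the article does not describe Slack's actual autoscaling policy.

```python
# Hypothetical sketch of a utilization-driven (target-tracking) scaling rule;
# this is not Slack's real autoscaling policy.

def desired_instance_count(current: int, busy: int, total: int,
                           target_utilization: float = 0.7) -> int:
    """Scale the fleet so that worker utilization returns to the target."""
    utilization = busy / total
    return max(current, round(current * utilization / target_utilization))

# Workers blocked on slow database calls still count as "busy", so a database
# slowdown inflates utilization and triggers a large scale-up even though the
# extra instances cannot do useful work until the database recovers.
print(desired_instance_count(current=1000, busy=950, total=1000))  # -> 1357
```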

Roughly eight hours after the first incident had been dealt with, alerts fired indicating that Slack was returning more HTTP 503 errors than usual. An on-call engineer manually scaled up the web application tier, but the problem persisted. Nolan noticed that only part of the web application tier was overloaded, and by investigating the performance of both the web application and the load balancer tier, the team pinpointed the cause.

At Slack, multiple HAProxy instances sit behind an L4 load balancer and distribute requests to the web application tier. Consul and consul-template are used for service discovery: health checks on the web application tier produce the list of instances to which HAProxy can route requests.
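The article does not show Slack's consul-template setup, but the discovery step can be sketched against Consul's standard health API. The service name "webapp" and the Consul address below are assumptions for the example.

```python
# Illustrative only: list healthy web application instances from Consul's
# health endpoint, the same information consul-template would render for HAProxy.
import requests

CONSUL = "http://127.0.0.1:8500"  # assumed local Consul agent

def healthy_webapp_instances(service: str = "webapp") -> list:
    """Return (address, port) pairs for instances passing their health checks."""
    resp = requests.get(f"{CONSUL}/v1/health/service/{service}",
                        params={"passing": "true"}, timeout=5)
    resp.raise_for_status()
    return [(entry["Service"]["Address"] or entry["Node"]["Address"],
             entry["Service"]["Port"])
            for entry in resp.json()]

if __name__ == "__main__":
    for addr, port in healthy_webapp_instances():
        print(f"{addr}:{port}")
```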



The list of running instances is not written directly into the HAProxy configuration file. Instead, Slack's own program, server-state-management, reads the list produced by Consul and applies it to the HAProxy instances through the HAProxy Runtime API. Using the API means the configuration can be updated without restarting the HAProxy processes.
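server-state-management itself is not public, but the kind of update it performs can be sketched with the HAProxy Runtime API over the admin socket. The "set server ... addr/port" and "set server ... state ready/maint" commands are real Runtime API commands; the backend name, slot names, and socket path are assumptions.

```python
# Illustrative sketch of updating HAProxy at runtime over its stats/admin socket,
# without a reload. Backend and slot names here are hypothetical.
import socket

HAPROXY_SOCKET = "/var/run/haproxy.sock"  # assumed admin socket path

def runtime_command(cmd: str) -> str:
    """Send one Runtime API command to HAProxy and return its reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(HAPROXY_SOCKET)
        sock.sendall((cmd + "\n").encode())
        return sock.recv(65536).decode()

def fill_slot(backend: str, slot: str, address: str, port: int) -> None:
    """Point an existing slot at a live web app instance and enable it."""
    runtime_command(f"set server {backend}/{slot} addr {address} port {port}")
    runtime_command(f"set server {backend}/{slot} state ready")

def release_slot(backend: str, slot: str) -> None:
    """Take a slot out of rotation when its instance goes away."""
    runtime_command(f"set server {backend}/{slot} state maint")

# Example: map a newly discovered instance into slot "srv12" without a reload.
# fill_slot("webapp_backend", "srv12", "10.0.0.42", 8080)
```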



In Slack's HAProxy configuration, server templates define 'slots' that web application backends can occupy. With M HAProxy instances and N slots each, requests can be relayed to a total of M×N web application instances, but after the rapid scale-up during the 8:30 incident, more than that number of instances were running, Nolan says. Because HAProxy had run out of slots, some instances could not have requests relayed to them at all.
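As an illustration of what such slots look like, an HAProxy backend can pre-create them with the server-template directive, with unused slots starting in a disabled (maintenance) state until a real instance is mapped in via the Runtime API. The backend name, prefix, and slot count below are made up for the example; Slack's actual configuration is not shown in the article.

```
backend webapp
    # Pre-create N = 100 "slots"; they start disabled and are later pointed
    # at live instances through the Runtime API.
    server-template srv 100 0.0.0.0:80 check disabled
```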



server-state-management was written to find a slot for a new instance before releasing the slots still held by instances that were no longer running, so once HAProxy ran out of slots it could not allocate one for any new instance. In other words, server-state-management had a bug: when HAProxy slots were exhausted, the up-to-date list of available hosts was never reflected in HAProxy.
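A simplified Python model of this allocate-before-free ordering makes the failure mode clearer. The data structures and names are hypothetical; the real server-state-management code is not public.

```python
# Simplified model of the reported bug: new instances are assigned slots
# *before* slots held by terminated instances are released.
from typing import Dict, Optional, Set

def sync_slots(slots: Dict[str, Optional[str]], live_instances: Set[str]) -> None:
    """Map live instances into HAProxy slots; None marks a free slot."""
    known = {inst for inst in slots.values() if inst is not None}

    # BUG: new instances are assigned first...
    for inst in live_instances - known:
        free = next((s for s, v in slots.items() if v is None), None)
        if free is None:
            # No free slot: the update stops here, so the stale entries below
            # are never cleaned up and HAProxy keeps routing to hosts that
            # have already been terminated.
            return
        slots[free] = inst

    # ...and slots of terminated instances are only released afterwards.
    for slot, inst in slots.items():
        if inst is not None and inst not in live_instances:
            slots[slot] = None
```

Freeing the slots of terminated instances before assigning new ones would let the reconciliation complete even when the fleet temporarily exceeds the slot count.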

As a result, at 16:45 HAProxy began routing requests to instances that had already been shut down, working from stale host information, and users were unable to use Slack. According to Nolan, the outage struck at the end of the business day because it coincided with the scheduled scale-down of the web application tier as traffic fell.



The problem was eventually resolved by restarting the HAProxy instances. As for why no alert caught the underlying issue, Nolan points out that the monitoring system had been broken for a long time and had simply kept running; since there had been no need to change it, nobody noticed it was broken. Because the HAProxy deployment was relatively static, few engineers paid much attention to its monitoring and alerting.

At the time the article was written, Slack was in the middle of migrating from HAProxy to Envoy Proxy with xDS. Nolan writes that Slack strives to be highly available and reliable, that it fell short in this case, and that the lessons learned from the failure will be used to improve future systems and processes.

in Software, Web Service, Posted by darkhorse_log