Microsoft's cloud service 'Microsoft Azure' temporarily went down after a power outage, and Microsoft admits that only three staff were on site when the failure occurred



On August 30, 2023 local time, a power outage occurred at the data center of Microsoft's cloud service 'Microsoft Azure' in Sydney, Australia, causing the service to go down temporarily. A subsequent analysis by Microsoft revealed that only three staff members were on site at the time of the outage.

Azure status history | Microsoft Azure
https://azure.status.microsoft/en-us/status/history/



Microsoft blames outage on small staff, automation failures • The Register

https://www.theregister.com/2023/09/04/microsoft_australia_outage_incident_report/



Oracle Cloud, Netsuite, and Azure go down, hard, Down Under • The Register

https://www.theregister.com/2023/08/30/oracle_microsoft_cloud_australia_outage/

On August 30, 2023 local time, Microsoft Azure services became unavailable in parts of Australia because of an incident at a data center in Sydney. Microsoft informed affected customers that "a power outage in eastern Australia has caused cooling units located within several data centers to be taken offline."

Microsoft further explained that the cause of the failure was "a failure of the data center's chillers." It stated, "A prolonged loss of cooling could cause temperatures in the data center to rise and damage the hardware. To prevent this, we shut down some of the compute and storage units used for cloud services."

According to Microsoft's analysis, the affected data center had a total of seven cooling units: five were in operation when the power outage occurred, and two backup units were on standby. When a power outage occurs, Microsoft staff are expected to carry out emergency response procedures and activate the backup cooling units. During the outage on August 30, 2023, however, the corresponding cooling pump reportedly did not receive the run signal from the backup cooling unit, so the unit did not start properly.

According to Microsoft, the failure of the backup cooling units to start was unexpected. Microsoft reported, "There were two redundant cooling units on standby. One was brought back to normal operation through the emergency response, and the other started up once but stopped working again within minutes."



Because the restart failed, a data center that normally ran on five cooling units had to be cooled with just one. Microsoft says, "We had to reduce the heat load by shutting down some servers."

Microsoft's report reveals that an hour after the power outage occurred, a team of field technicians climbed onto the roof of the data center to inspect the cooling units, and that staff from the cooling unit manufacturer arrived on site 2 hours and 39 minutes later.

Microsoft also stated, "Despite the size of the data center, the number of personnel at night was insufficient to restart the cooling units promptly," acknowledging that only three people were on site when the power outage occurred.

Until appropriate measures are in place, Microsoft says it will temporarily increase the size of its night team from three to seven people. It will also add provisions to its emergency response procedures that prioritize which cooling units to restart, so that the units under the highest demand are restarted first.



Microsoft also revealed that extensive troubleshooting was needed to find out why cloud storage was taking so long to come back online. Because the servers had been powered down during the outage, diagnostic tools were unable to find the relevant data.

"As a result, the on-site data center team manually removed components one by one and investigated to identify the specific components that were preventing each node from restarting," Microsoft reported. It added that, following the investigation, some components had to be migrated to another server.

in Web Service, Posted by log1r_ut