Microsoft Azure Outage Eastern Australia

Microsoft has attributed the Azure outage that took out its Australia East cloud region last week in part to insufficient staff numbers on site. The company blamed the incident on a utility power outage that tripped the cooling units offline in one Sydney, Australia datacenter after an electrical storm. The failure contributed to an outage that saw Microsoft Azure, Microsoft 365, and Power Platform services in Sydney go offline for up to 46 hours. Although the datacenter is located in Sydney, the outage affected thousands of end users across the country who relied on Microsoft Azure or Microsoft 365 services.

The two data halls impacted by the outage had seven cooling units between them: five in operation and two on standby. After the power outage, Microsoft's staff executed Emergency Operational Procedures to bring the units back online, but the procedures failed because the corresponding pumps did not receive the run signal from the chillers. Microsoft is working with its suppliers to determine why the signal was not sent and what needs to be done to rectify the situation.
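Microsoft's account suggests the restart procedure depended on a pump run signal that never arrived. A minimal sketch of what such an interlock might look like, with entirely hypothetical signal names and logic (not Microsoft's actual control sequence):

```python
def restart_cooling_unit(chiller_powered: bool, pump_run_signal: bool) -> str:
    """Hypothetical restart interlock: the unit only comes online if the
    chiller is powered AND its pump receives the run signal. This mirrors
    the reported failure mode; names and checks are illustrative."""
    if not chiller_powered:
        return "restart failed: chiller has no power"
    if not pump_run_signal:
        return "restart failed: pump did not receive run signal"
    return "online"

# The reported scenario: utility power restored, but no run signal reaches the pump.
print(restart_cooling_unit(chiller_powered=True, pump_run_signal=False))
# restart failed: pump did not receive run signal
```

The point of an interlock like this is that each precondition produces a distinct failure message, so operators can see which dependency blocked the restart.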

The two standby cooling units attempted to restart automatically. One came back online, but the other tripped offline again within minutes of restarting. With just one cooling unit working, thermal loads had to be reduced by shutting down servers.
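Reducing thermal load to match the remaining cooling capacity is essentially a load-shedding problem. A greedy sketch of the idea, shutting down the largest heat producers first (the strategy and numbers are assumptions; the actual procedure is not public):

```python
def shed_load(server_loads_kw: list[float], cooling_capacity_kw: float) -> list[int]:
    """Return indices of servers to power off until the total heat
    output fits within the remaining cooling capacity. Greedy,
    largest-consumers-first; purely illustrative."""
    total = sum(server_loads_kw)
    to_shut = []
    # Visit servers from largest to smallest thermal load.
    for idx in sorted(range(len(server_loads_kw)),
                      key=lambda i: server_loads_kw[i], reverse=True):
        if total <= cooling_capacity_kw:
            break
        total -= server_loads_kw[idx]
        to_shut.append(idx)
    return to_shut

# 600 kW of load against 350 kW of cooling: shutting down the 300 kW
# server (index 0) is enough.
print(shed_load([300.0, 200.0, 100.0], 350.0))  # [0]
```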

Due to the size of the datacenter, there were insufficient staff on site to restart the cooling units in a timely manner, and Microsoft has temporarily increased the team size from three to seven. Microsoft also had trouble determining why its storage infrastructure did not come back online. Storage hardware damaged by the data hall temperatures required extensive troubleshooting, but Microsoft's diagnostic tools could not retrieve the relevant data because the storage servers themselves were down.

Because of the extended time without cooling, hardware was turned off to protect it from heat damage. This also brought down the temperature of the chillers, which enabled them to start working again; at that point, Azure began bringing its compute and storage units back online. In total, approximately half of Azure's Australia East Cosmos DB clusters went down or were heavily degraded. Microsoft's onsite datacenter team had to remove components manually and re-seat them one by one to identify which components were preventing each node from booting. Microsoft also admitted that its automation was incorrectly approving stale requests and marking some healthy nodes as unhealthy, which slowed storage recovery efforts.
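The stale-request problem Microsoft describes can be illustrated with a simple freshness guard: before acting on a "mark node unhealthy" request, check how old it is. The field names and threshold below are assumptions for illustration, not Azure's implementation:

```python
import time
from typing import Optional

STALE_AFTER_SECONDS = 300.0  # hypothetical freshness window

def should_approve(request: dict, now: Optional[float] = None) -> bool:
    """Reject automation requests whose originating telemetry is too old.
    A guard like this stops a stale 'mark node unhealthy' request from
    being acted on after the node has already recovered."""
    if now is None:
        now = time.time()
    return (now - request["issued_at"]) <= STALE_AFTER_SECONDS

# A 10-minute-old request is discarded; a 30-second-old one is approved.
print(should_approve({"action": "mark_unhealthy", "issued_at": 0.0}, now=600.0))    # False
print(should_approve({"action": "mark_unhealthy", "issued_at": 570.0}, now=600.0))  # True
```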

Some end users got back online early if they were part of the lucky group whose nodes were restarted first, but many waited up to 46 hours without access as their Microsoft Azure servers and Microsoft 365 services remained down. As a result, many offices closed early, with companies electing to send staff home after several hours without access.

Moving forward, Microsoft says it is evaluating ways to prioritize the load profiles of the various chiller subsets, so that restarts are performed for the chillers serving the highest loads first.
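The prioritization Microsoft describes amounts to ordering restarts by thermal load. A sketch of that ordering, with made-up chiller names and loads (not Microsoft's actual control logic):

```python
from dataclasses import dataclass

@dataclass
class Chiller:
    name: str
    load_kw: float  # thermal load served by this chiller subset

def restart_order(chillers: list[Chiller]) -> list[str]:
    """Return chiller names ordered so the highest thermal loads are
    restarted first; illustrative of load-profile prioritization."""
    return [c.name for c in sorted(chillers, key=lambda c: c.load_kw, reverse=True)]

fleet = [Chiller("CH-1", 420.0), Chiller("CH-2", 610.0), Chiller("CH-3", 180.0)]
print(restart_order(fleet))  # ['CH-2', 'CH-1', 'CH-3']
```

Restarting the highest-load chillers first minimizes the time the hottest data halls spend without cooling, which is the failure path that forced the server shutdowns in this incident.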