Microsoft’s post-incident report on the Azure outage that took out its Australia East cloud region last week attributes the failure in part to insufficient staff numbers on site. The company blamed the incident on a utility power outage that tripped the cooling units offline in one of its Sydney datacenters after an electrical storm.
The two data halls impacted by the outage had seven chillers – five in operation and two on standby. After the power outage, Microsoft’s staff executed Emergency Operational Procedures to bring the chillers back online. That failed because the corresponding pumps did not receive the run signal from the chillers; Microsoft is working with its suppliers to understand why. The two standby chillers attempted to restart automatically: one came back online, while the other restarted but tripped offline again within minutes. With just one chiller working, thermal loads had to be reduced by shutting down servers.
Due to the size of the datacenter, there were insufficient staff on site to restart the chillers in a timely manner, and Microsoft has temporarily increased the team size from three to seven. Microsoft also had trouble understanding why its storage infrastructure didn’t come back online. Storage hardware damaged by the data hall temperatures required extensive troubleshooting, but Microsoft’s diagnostic tools could not retrieve the relevant data because the storage servers themselves were down.
As a result, Microsoft’s onsite datacenter team had to remove components manually and re-seat them one by one to identify which components were preventing each node from booting. Microsoft also admitted that its automation was incorrectly approving stale requests and marking some healthy nodes as unhealthy, which slowed storage recovery efforts.
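Microsoft’s report does not describe the automation in detail, but the stale-request failure mode it admits to can be illustrated with a minimal freshness check: recovery requests older than some cutoff should be rejected rather than approved. The sketch below is purely hypothetical – the `is_actionable` guard and the five-minute cutoff are illustrative assumptions, not details from the report:

```python
import time

# Illustrative cutoff: reject automation requests older than 5 minutes.
# (Hypothetical value; Microsoft's actual threshold is not public.)
MAX_REQUEST_AGE_SECONDS = 300


def is_actionable(request_timestamp, now=None):
    """Return True only if a queued request is fresh enough to act on.

    A recovery system that skips this check risks approving stale
    requests issued under conditions that no longer hold - e.g.
    marking a now-healthy node as unhealthy.
    """
    if now is None:
        now = time.time()
    return (now - request_timestamp) <= MAX_REQUEST_AGE_SECONDS


if __name__ == "__main__":
    now = 1_000_000.0
    print(is_actionable(now - 60, now))   # fresh request, 1 minute old
    print(is_actionable(now - 600, now))  # stale request, 10 minutes old
```

The point of the guard is simply that automation acting during a recovery should re-validate its inputs rather than trust decisions queued before conditions changed.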