Microsoft’s Central US Azure Outage: What Went Wrong?

The News: Last week’s outage in Microsoft’s Central US region disrupted multiple Azure services, including Virtual Machines, Cosmos DB, and email, due to a failed backend configuration update. This incident, alongside the CrowdStrike debacle, highlights vulnerabilities in cloud infrastructure and the broad impact such outages can have on dependent services. You can read the preliminary post-incident review here.

Analyst Take: Last Friday was a particularly challenging day for Microsoft, not just because of its entanglement in the CrowdStrike debacle, but also due to a significant, unrelated outage in its Central US region that brought down Microsoft 365 services. The timing of these events compounded the impact, making it a day to forget for the tech giant. While CrowdStrike’s issues grabbed the headlines, Microsoft’s own woes were equally severe and, in different circumstances, would undoubtedly have dominated the front pages.

The CrowdStrike incident, which stemmed from a faulty update to its Falcon security software, overshadowed Microsoft’s massive Azure outage. This disruption lasted for over 14 hours, affecting critical infrastructure and numerous industries reliant on Microsoft’s cloud services. Such a significant outage, impacting multiple availability zones in a major region, is not a common occurrence and would typically be headline news. However, the simultaneous crisis faced by CrowdStrike drew attention away, perhaps providing Microsoft with a reprieve from the full brunt of public scrutiny.

What Happened?

In April this year, The Futurum Group published this report analyzing cloud availability over 12 months. Azure tends to have outages that are more expansive in nature, with many services or many regions affected at the same time, and this latest outage follows that pattern. The root cause goes back to the fundamentals of cloud architecture. Azure appears to have a concentration of risk due to excessive cross-dependencies between services, which results in frequent outages with a large blast radius. I am sure Microsoft is working hard to address this in the platform, but the fact that it has taken many years shows this is not an easy problem for Microsoft to solve. Customers considering Azure must discuss this aspect with Microsoft to make an informed decision.

According to my analysis of the recent Microsoft outage report, between 21:40 UTC on July 18, 2024, and 12:15 UTC on July 19, 2024, customers experienced significant issues with multiple Azure services in the Central US region. The disruption stemmed from a backend configuration update that broke the connection between compute and storage resources, critically affecting the availability of Virtual Machines (VMs). Consequently, several Azure services reliant on these VMs encountered failures in service management operations and faced connectivity or availability issues.

A wide array of services were impacted by this incident, including but not limited to:

  • App Service
  • Azure Active Directory (Microsoft Entra ID)
  • Azure Cosmos DB
  • Microsoft Sentinel
  • Azure Data Factory
  • Event Hubs
  • Service Bus
  • Log Analytics
  • SQL Database
  • SQL Managed Instance
  • Virtual Machines
  • Cognitive Services
  • Application Insights
  • Azure Resource Manager (ARM)
  • Azure NetApp Files
  • Azure Communication Services
  • Microsoft Defender
  • Azure Cache for Redis
  • Azure Database for PostgreSQL-Flexible Server
  • Azure Stream Analytics
  • Azure SignalR Service
  • App Configuration

These services experienced both failures in service management operations and connectivity or availability issues during the incident. What is interesting to me is that I know, from conversations with numerous clients, that Microsoft 365 email was down, yet it is not included in Microsoft’s list of affected services above. Futurum is starting to drill into the transparency of reporting from cloud providers, and this is exactly the type of issue we are focusing on.

What Went Wrong and Why?

The root cause of the outage lies in the infrastructure supporting Azure Storage. Virtual Machines with persistent disks use disks backed by Azure Storage. To enhance security, Storage Scale Units only accept disk I/O requests from known Azure Virtual Machine Hosts. The list of these network addresses, known as the ‘Allow List,’ is regularly updated to reflect changes as VM Hosts are added or removed. Typically, these updates occur at least once daily in large regions.

On July 18, 2024, during routine updates to the VM Host fleet, an update to the ‘Allow List’ was generated. However, due to backend infrastructure failures, the update lacked address range information for a significant number of VM Hosts. The workflow responsible for generating the list failed to detect the missing data and published an incomplete ‘Allow List’ to all Storage Scale Units in the region. As a result, Storage Servers rejected all VM disk requests from the VMs running on the VM Hosts with missing information. Notably, Storage Scale Units hosting Premium v2 and Ultra Disk offerings were unaffected by this issue.
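
To make the failure mode concrete, here is a minimal sketch of the kind of completeness check that, per Microsoft’s account, the generation workflow lacked. This is illustrative Python, not Microsoft’s implementation; the data shapes, function names, and thresholds are my own assumptions.

```python
# Hypothetical sketch: validate a generated 'Allow List' before publishing it.
# Not Microsoft's code; data shapes, names, and thresholds are assumptions.
from dataclasses import dataclass

@dataclass
class AllowListEntry:
    vm_host_id: str
    address_ranges: list[str]  # CIDR ranges for the VM Host

def validate_allow_list(entries: list[AllowListEntry],
                        expected_host_count: int,
                        max_missing_fraction: float = 0.01) -> None:
    """Reject the list if hosts are missing or entries lack address ranges."""
    incomplete = [e for e in entries if not e.address_ranges]
    missing = expected_host_count - len(entries)

    if incomplete:
        raise ValueError(f"{len(incomplete)} entries have no address ranges")
    if missing > expected_host_count * max_missing_fraction:
        raise ValueError(f"{missing} VM Hosts absent from generated list")

# Usage: publish only if validation passes; otherwise keep the last-known-good list.
entries = [AllowListEntry("host-001", ["10.0.0.0/24"])]
try:
    validate_allow_list(entries, expected_host_count=1)
    print("Allow List validated; safe to publish")
except ValueError as err:
    print(f"Blocking publish, keeping last-known-good list: {err}")
```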

The ‘Allow List’ updates are applied in batches across the Storage Scale Units within a region over a brief window (approximately one hour). Unfortunately, the deployment workflow did not include sufficient checks for drops in VM availability and continued to deploy through the region, leading to widespread impact.
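
The missing safeguard here was a health gate between batches. Below is a simplified sketch of what a gated rollout could look like; the batch size, availability threshold, and helper functions are hypothetical and not a description of Azure’s actual deployment tooling.

```python
# Hypothetical sketch of a batched rollout gated on fleet VM availability.
# Batch size, threshold, and helper functions are illustrative assumptions.

def vm_availability() -> float:
    """Stand-in for a real fleet-health metric (fraction of healthy VMs)."""
    return 0.99  # pretend the fleet is healthy in this simulation

def deploy_to(scale_unit: str) -> None:
    print(f"deployed allow list to {scale_unit}")

def gated_rollout(scale_units: list[str], batch_size: int = 2,
                  min_availability: float = 0.98) -> None:
    for i in range(0, len(scale_units), batch_size):
        batch = scale_units[i:i + batch_size]
        for unit in batch:
            deploy_to(unit)
        # Health gate: stop spreading the change if availability drops.
        if vm_availability() < min_availability:
            raise RuntimeError(f"availability drop after batch {batch}; halting rollout")

gated_rollout([f"storage-scale-unit-{n}" for n in range(6)])
```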

From the perspective of Azure SQL DB, the storage availability failures caused VMs across various control and data plane clusters to fail due to OS disk inaccessibility. This led to unhealthy clusters, resulting in failed service management operations and connectivity issues for Azure SQL DB and Azure SQL Managed Instance customers. Microsoft initiated failovers for databases with automatic failover policies within an hour of the incident. During the failover, the geo-secondary was elevated to the new primary, and the current primary was demoted to geo-secondary once the connection was re-established. After storage recovery in the Central US region, most databases resumed normal operations. However, some databases required additional mitigation to ensure gateways redirected traffic to the primary node. A few geo-failovers did not complete correctly, posing a risk of data inconsistency between primary and secondary nodes.
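
For application teams, the practical lesson is to connect through the failover group’s listener endpoint rather than a specific server name, so a geo-failover like the one described above redirects connections without a configuration change. The sketch below uses pyodbc; the failover group, database, and credential values are placeholders.

```python
# Hypothetical sketch: connect to Azure SQL through a failover group listener
# so geo-failovers redirect traffic without a connection-string change.
# Server, database, and credential values are placeholders.
import pyodbc

FAILOVER_GROUP = "my-failover-group"  # placeholder failover group name
CONN_STR = (
    "Driver={ODBC Driver 18 for SQL Server};"
    f"Server=tcp:{FAILOVER_GROUP}.database.windows.net,1433;"
    "Database=mydb;Uid=appuser;Pwd=<secret>;"
    "Encrypt=yes;Connection Timeout=30;"
)

def query_with_retry(sql: str, attempts: int = 3):
    """Retry transient failures; the listener resolves to the current primary."""
    for attempt in range(1, attempts + 1):
        try:
            with pyodbc.connect(CONN_STR) as conn:
                return conn.cursor().execute(sql).fetchall()
        except pyodbc.Error as err:
            print(f"Attempt {attempt} failed: {err}")
    raise RuntimeError("All connection attempts failed")
```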

For Azure Cosmos DB, users experienced failed service management operations and connectivity failures as both the control plane and data plane rely on Azure VMs (Scale Sets) using Azure Storage disks, which were inaccessible. The region-wide availability for Cosmos DB in the Central US region dropped to 82% at its lowest point, with about 50% of the footprint affected by the Azure VMs going down. The impact varied based on customer database account configurations:

  • Customer database accounts with multi-region writes (active-active) were not impacted and maintained availability by redirecting traffic to other regions.
  • Customer database accounts with multiple read regions and a single write region outside of the Central US region maintained availability for reads and writes by redirecting traffic to other regions.
  • Customer database accounts with multiple read regions and a single write region in the Central US region (active-passive) maintained read availability but experienced write availability issues until accounts were failed over to another region.
  • Customer database accounts with a single region (multi-zonal or single zone) in the Central US region were impacted if at least one partition resided on impacted nodes.
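
These configuration differences map directly to how the account and client are set up. The sketch below, using the azure-cosmos Python SDK, lists preferred read locations so a client can fall back to another region when Central US is impaired; the endpoint, key, database, and region names are placeholders, and write availability still depends on the account’s write-region configuration.

```python
# Hypothetical sketch with the azure-cosmos Python SDK: listing preferred
# regions so the client can fall back for reads if Central US is impaired.
# Endpoint, key, database, and container names are placeholders.
from azure.cosmos import CosmosClient

client = CosmosClient(
    url="https://<account>.documents.azure.com:443/",
    credential="<account-key>",
    preferred_locations=["Central US", "East US 2"],  # ordered fallback for reads
)

container = client.get_database_client("mydb").get_container_client("items")

# Reads follow the preferred-location order; writes go to the write region(s)
# configured on the account (single write region vs. multi-region writes).
item = container.read_item(item="item-1", partition_key="item-1")
print(item)
```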

What Is Microsoft Doing to Ensure This Doesn’t Happen Again?

To reduce the likelihood and impact of incidents like this one, Microsoft says it plans to implement several improvements across its storage, SQL, and Cosmos DB services. For storage, it says it will fix the ‘Allow List’ generation workflow to detect incomplete source information, improve alerting for rejected storage requests, reduce batch sizes, and add additional VM health checks during ‘Allow List’ deployments. Microsoft is also planning a zone-aware rollout for these deployments and ensuring that invalid ‘Allow List’ deployments revert to the last-known-good state. The SQL and Cosmos DB teams are working on adopting the Resilient Ephemeral OS disk improvement to enhance VM resilience to storage incidents. Additionally, SQL is improving the Service Fabric cluster location change notification mechanism and implementing a zone-redundant setup for the metadata store. Cosmos DB plans to address failover issues by adding automatic per-partition failover for active-passive accounts and improving fail-back workflows for specific customer configurations. These changes are scheduled to be completed progressively, with some extending into 2025.
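
Two of the storage commitments, zone-aware rollouts and reverting to a last-known-good list, are worth illustrating. The sketch below is a generic illustration of that pattern rather than Microsoft’s tooling; the zone names, data shapes, and validation rule are assumptions.

```python
# Hypothetical sketch: zone-by-zone rollout that falls back to the
# last-known-good allow list when a candidate fails validation.
# Zone names, data shapes, and the validation rule are assumptions.

last_known_good = {"hosts": 10_000}   # previously published, known-valid list
candidate = {"hosts": 6_500}          # newly generated list (suspiciously small)

def is_valid(new: dict, baseline: dict, max_shrink: float = 0.05) -> bool:
    """Reject candidates that shrink sharply versus the last-known-good list."""
    return new["hosts"] >= baseline["hosts"] * (1 - max_shrink)

for zone in ["zone-1", "zone-2", "zone-3"]:   # one availability zone at a time
    if is_valid(candidate, last_known_good):
        print(f"{zone}: publishing candidate with {candidate['hosts']} hosts")
    else:
        print(f"{zone}: candidate rejected; keeping last-known-good list")
        break  # halt the rollout rather than spread a bad list
```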

Kudos to Microsoft for being transparent about what it plans to do to make things better for clients. I will be tracking whether Microsoft provides updates on its progress in the months ahead.

Looking Ahead

In today’s increasingly interconnected world, the impact of such outages extends far beyond the immediate downtime. The ripple effects can disrupt business operations, affect customer experiences, and even impact the broader economy. The incident at Microsoft’s Central US region serves as a stark reminder of the vulnerabilities inherent in our reliance on cloud services and the importance of robust, resilient infrastructure.

Regulatory scrutiny is intensifying, particularly in regions like Europe where the Digital Operational Resilience Act (DORA) is set to enforce stricter standards. DORA aims to enhance the resilience of digital services by imposing requirements on financial entities to ensure they can withstand, respond to, and recover from all types of ICT-related disruptions and threats. This legislation underscores the growing recognition of the systemic risks posed by digital outages and the need for comprehensive regulatory frameworks to mitigate these risks.

For companies like Microsoft, the implications are clear. There must be a renewed focus on enhancing the resilience and reliability of cloud services. This includes investing in advanced monitoring and mitigation technologies, adopting best practices for configuration management, and ensuring that robust disaster recovery and business continuity plans are in place.

The outage also highlights the importance of transparency and communication. During such incidents, clear and timely communication with customers is critical to managing the situation effectively and maintaining trust. Microsoft’s detailed post-incident report and the steps they took to address the issue are commendable and set a standard for how such situations should be handled.

As the digital landscape continues to evolve, the stakes are higher than ever. Businesses and regulators alike must prioritize resilience and preparedness to navigate the complexities of an interconnected world. The lessons learned from Microsoft’s outage will undoubtedly inform future strategies and policies, ensuring that we are better equipped to handle the challenges ahead.

You can read more about cloud availability and our latest tracking of this space here.

Disclosure: The Futurum Group is a research and advisory firm that engages or has engaged in research, analysis, and advisory services with many technology companies, including those mentioned in this article. The author does not hold any equity positions with any company mentioned in this article.

Analysis and opinions expressed herein are specific to the analyst individually and data and other information that might have been provided for validation, not those of The Futurum Group as a whole.

Other Insights from The Futurum Group:

Q1 2024 Cloud Downtime Incident Report – The Futurum Group

2023 Cloud Downtime Incident Report

The CrowdStrike Outage – A Detailed Post-Mortem

Author Information

Regarded as a luminary at the intersection of technology and business transformation, Steven Dickens is the Vice President and Practice Leader for Hybrid Cloud, Infrastructure, and Operations at The Futurum Group. With a distinguished track record as a Forbes contributor and a ranking among the Top 10 Analysts by ARInsights, Steven's unique vantage point enables him to chart the nexus between emergent technologies and disruptive innovation, offering unparalleled insights for global enterprises.

Steven's expertise spans a broad spectrum of technologies that drive modern enterprises. Notable among these are open source, hybrid cloud, mission-critical infrastructure, cryptocurrencies, blockchain, and FinTech innovation. His work is foundational in aligning the strategic imperatives of C-suite executives with the practical needs of end users and technology practitioners, serving as a catalyst for optimizing the return on technology investments.

Over the years, Steven has been an integral part of industry behemoths including Broadcom, Hewlett Packard Enterprise (HPE), and IBM. His exceptional ability to pioneer multi-hundred-million-dollar products and to lead global sales teams with revenues in the same echelon has consistently demonstrated his capability for high-impact leadership.

Steven serves as a thought leader in various technology consortiums. He was a founding board member and former Chairperson of the Open Mainframe Project, under the aegis of the Linux Foundation. His role as a Board Advisor continues to shape the advocacy for open source implementations of mainframe technologies.
