The News: CrowdStrike caused global outages by pushing an update that triggered ‘blue screen of death’ crashes on Microsoft Windows machines running its software. For more details, read the CrowdStrike update on the outage here.
The CrowdStrike Outage – A Detailed Post-Mortem
Analyst Take: The global impact of the recent CrowdStrike IT outage has been profound, underscoring the critical dependency of various sectors on cybersecurity services. CrowdStrike, renowned for its robust cybersecurity solutions, triggered unprecedented downtime that reverberated across multiple industries worldwide, causing significant operational disruptions and highlighting vulnerabilities in digital infrastructure.
One of the most notable examples of the impact was on the UK’s National Health Service (NHS). The NHS, which relies heavily on CrowdStrike for cybersecurity protection, faced severe disruptions in its operations. The outage affected patient care systems, electronic health records, and appointment scheduling, leading to delays in medical services and increased patient wait times. The cost of such downtime is not merely financial but also human, as critical healthcare services were compromised, potentially endangering lives.
In the financial sector, major banks such as JPMorgan Chase, HSBC, and Deutsche Bank experienced significant slowdowns and security vulnerabilities. Financial institutions, which depend on real-time cybersecurity monitoring to protect against fraud and cyberattacks, were left exposed. This vulnerability led to temporary shutdowns of online banking services and stock trading interruptions, causing considerable economic ramifications and shaking investor confidence. The impact was felt particularly in branches, where frontline tellers had to deal with disgruntled customers.
The transportation sector was also significantly impacted. Airlines such as United Airlines and American Airlines reported disruptions in their booking systems and flight operations. The outage led to delays, cancellations, and inconvenience for thousands of passengers, further compounding the financial strain on an industry already grappling with post-pandemic recovery challenges.
The manufacturing industry also felt the impact, particularly in countries like Germany and Japan, where advanced manufacturing relies on secure IT environments to operate efficiently. Downtime in cybersecurity systems led to production halts, disrupted supply chains, and potential breaches of intellectual property, translating to substantial financial losses and operational inefficiencies.
The cost of the CrowdStrike outage extends beyond immediate financial implications. For instance, the downtime highlighted the fragility of global digital infrastructure and the cascading effects of a single point of failure in cybersecurity. Companies are now reevaluating their cybersecurity strategies, considering more robust redundancy plans, and seeking multi-layered security solutions to mitigate such risks in the future.
Overall, the CrowdStrike IT outage serves as a stark reminder of the intertwined nature of global digital economies and the critical role of cybersecurity in ensuring the smooth functioning of essential services. The incident underscores the need for enhanced resilience in cybersecurity frameworks to protect against future disruptions and safeguard societal well-being.
What Happened?
The widespread Windows outage was due to a flaw in CrowdStrike’s Falcon sensor update that caused Microsoft Windows devices to crash with the Blue Screen of Death (BSOD). Affected systems failed to boot, rendering them inoperable and causing widespread operational disruptions.
CrowdStrike pushed the flawed update to Windows systems worldwide, causing outages first in Australia and then across Europe, grounding air travel, taking U.K. broadcaster Sky News offline, compounding Microsoft 365 outages, and taking down Windows systems running CrowdStrike software across the globe. Linux and macOS systems were not impacted. The update from CrowdStrike was composed of “content” rather than software: configuration data consumed by the Falcon sensor executing on the host device, and it was this content update that caused Windows to crash.
CrowdStrike responded quickly by issuing a fixed update and actively working with customers around the globe to restore Windows systems. On July 19, 2024, at 5:45 a.m. ET, CrowdStrike CEO George Kurtz posted a notice on social media platform X acknowledging the outage, indicating the root cause had been identified and that a fixed update was available.
Remediation recommendations from CrowdStrike required rebooting affected Windows systems, which would then receive the fixed update. Virtualized and cloud-based servers, along with devices that could be remotely power cycled, could apply the update this way. Windows devices still crashing required administrators to boot into Safe Mode or the Windows Recovery Environment, navigate to a CrowdStrike directory, and manually delete the flawed update file. One reason the outage was so impactful was the prevalence of unmanaged and remote Windows devices, which required administrators to have physical access to the devices. Further remediation details are available here.
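For managed fleets, the manual removal step can be scripted. The sketch below, in Python for illustration, deletes files matching the publicly reported flawed channel-file pattern (`C-00000291*.sys`, typically under `C:\Windows\System32\drivers\CrowdStrike`). Both the pattern and the directory should be treated as assumptions to verify against CrowdStrike’s official remediation guidance before any use.

```python
import glob
import os

# Publicly reported filename pattern for the flawed channel file;
# an assumption to verify against official CrowdStrike guidance.
FLAWED_PATTERN = "C-00000291*.sys"

def remove_flawed_channel_files(crowdstrike_dir):
    """Delete files matching the flawed-update pattern in the given
    directory and return the paths removed. Intended to run from
    Safe Mode / WinRE, where the sensor driver is not loaded."""
    removed = []
    for path in glob.glob(os.path.join(crowdstrike_dir, FLAWED_PATTERN)):
        os.remove(path)
        removed.append(path)
    return removed
```

Returning the removed paths lets an operator log exactly what was deleted on each machine, which matters when the fix is audited later.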
However, amid the incident response and remediation efforts, it is crucial to recognize that the most significant flaw in this CrowdStrike outage was not merely the defective update but also the deployment process that allowed a service-impacting update to reach such a large set of customers. In the interconnected and complex web of clouds, software, and services, rolling out updates too quickly and broadly across a customer base can have catastrophic consequences. While the details of the update deployment process are not fully available, it is clear that CrowdStrike’s outage was due not only to the underlying software defect but also to a deployment process that did not quickly detect the impact of this flaw in customer environments.
The incident with CrowdStrike underscores the critical need for a robust deployment process. Such a process, leveraging strategies including staggered, A/B, canary, and phased deployments, involves releasing updates to a subset of target devices in multiple phases. By validating the update at each phase, organizations can ensure it doesn’t create service, compatibility, security, or other issues before proceeding with a larger or full rollout. This controlled and managed deployment is crucial to minimize the risk of widespread disruptions and ensure the smooth functioning of systems.
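As an illustration of the idea, here is a minimal phased-rollout sketch in Python. The `apply_update` and `health_check` callables are hypothetical hooks standing in for a real fleet-management system, and the phase fractions are arbitrary; the point is that the rollout halts the moment a wave turns unhealthy.

```python
def phased_rollout(devices, apply_update, health_check,
                   phases=(0.01, 0.10, 0.50, 1.0)):
    """Release an update in expanding waves, halting on failures.

    apply_update(device) pushes the update to one device;
    health_check(device) returns True if the device is still healthy.
    Both are placeholder hooks for illustration (assumptions).
    """
    done = 0
    for fraction in phases:
        # Grow the deployed set to the next phase's fraction of the fleet.
        target = max(done + 1, round(len(devices) * fraction))
        wave = devices[done:target]
        for device in wave:
            apply_update(device)
        # Validate the wave before widening the blast radius.
        unhealthy = [d for d in wave if not health_check(d)]
        if unhealthy:
            return {"status": "halted", "deployed": target,
                    "unhealthy": unhealthy}
        done = target
    return {"status": "complete", "deployed": done}
```

With a 1% canary phase, a defect like the one described here would have surfaced after touching a handful of machines rather than an entire customer base.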
Was This a Security Incident?
In short, this was not a security incident. It was an issue of a mismanaged patch update causing widespread havoc. However, it is important to understand that the resulting damage was so extensive because CrowdStrike Falcon requires privileged kernel access to conduct a number of its key functions, including visibility into and monitoring of systems, behavioral monitoring and analytics of applications and processes, and certain remedial actions. This low-level access expands the potential attack surface and, as we have witnessed, means that even a simple bug in a patch update can inflict serious damage that spreads like wildfire given the ever-growing incidence of automation and software-defined architectures.
As touched on, this incident is a stark reminder of the importance of effective patch management on the part of the software vendor. It also underscores that, unfortunately, customers still need to assume ownership over the reliability of their extended software supply chain. This is no small feat. Any given enterprise has dozens of security tools in place, and according to The Futurum Group’s Cybersecurity Decision Maker data, more than half of organizations plan to add a new cybersecurity vendor, and 45% plan to add a new cybersecurity product category, in 2024, largely in response to the quickly evolving threat landscape. For IT operations and security teams, it is also a stark reminder of the criticality of systems-level resiliency and of risk and incident remediation processes, not only in the aftermath of a cyberattack but also for human error and other disaster events. We anticipate many will be conducting thorough audits for resiliency and security, as well as penetration testing.
What Are Ops Teams Doing to Recover?
As outlined earlier, CrowdStrike issued a fix via the same auto-deploy feature used in the initial update. While helpful for systems that hadn’t experienced the BSOD, the update method can’t be used with machines in a down state. Operators must find a way to remove the CrowdStrike update outside of the company’s auto update feature.
There are two categories of recovery: virtual and physical.
Virtual: Each recovery method is dictated by the data protection process leveraged by the operations team. Organizations with robust system snapshot processes can roll back either the entire system or just the CrowdStrike update to a known-good state. This process is near-instant, with cloud-based or virtualized systems recovered in seconds to minutes.
For companies without a recent backup, the process is to manually roll back the update by creating a remote connection to the system and removing the CrowdStrike update. Larger organizations may choose to write an automation script that systematically connects to each impacted machine and removes the update.
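Such an automation script might be structured as in the Python sketch below, where `remediate` is a hypothetical callable wrapping the real remote-access step (WinRM, SSH, or a management-console API). Failures on unreachable machines are recorded for retry rather than aborting the whole run.

```python
def remediate_fleet(hosts, remediate):
    """Apply a remediation callable to each host, collecting results.

    `remediate(host)` is a placeholder for the real remote step
    (e.g., connect and delete the flawed update file); it should
    raise on failure. Returns (succeeded, failed) so unreachable
    machines can be queued for a later retry or a manual visit.
    """
    succeeded, failed = [], []
    for host in hosts:
        try:
            remediate(host)
            succeeded.append(host)
        except Exception as exc:
            # Down or unreachable hosts are expected during this
            # kind of incident; record them instead of stopping.
            failed.append((host, str(exc)))
    return succeeded, failed
```

The failed list is the important output here: machines stuck at the BSOD will never accept a remote connection, and that list becomes the worklist for hands-on recovery.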
Physical: The physical recovery process is similar. However, physical machines are further constrained by device location: operators may need physical access to the impacted machine to restore from backup or manually remove the file. Further complicating physical recovery is the use of disk encryption.
As part of Windows Security, Microsoft provides full disk encryption with its BitLocker feature, which is intended to secure physical systems from intruders. It enables customers to place Windows PCs in fairly public locations while limiting security exposure. However, it creates a challenge when a system needs manual intervention: in the case of rolling back the CrowdStrike update, someone must physically type in a long recovery key at boot.
BitLocker isn’t an issue for virtualized environments, as Microsoft does not support the feature on the disk hosting the operating system files; BitLocker is available for data volumes on virtualized systems.
Many organizations have enlisted end users to do this work, which has slowed recovery in many environments. As part of future remediation, companies may begin looking toward solutions that allow remote access to systems at the hardware layer, such as out-of-band management.
Architectural Concerns
It is best practice to place the operating system and application binaries on a separate logical disk from the application data. Some customers with recent operating system backups have discovered that the recovery point objective (RPO) for the operating system is within the desired service level, but the RPO for the application data isn’t. Depending on the backup or snapshot technology, this effectively makes the backup useless in the CrowdStrike scenario.
Operating system image engineers should set application data environment variables to point to a data volume rather than the system volume, which is currently the default for most Windows applications. When creating application deployment frameworks, developers should take this into account when selecting default installation paths.
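In application code, this guidance amounts to resolving the data directory from configuration rather than hard-coding a system-volume path. A minimal Python sketch follows; the environment variable name and default root are illustrative assumptions, not a Windows convention.

```python
import os

def app_data_dir(app_name, env_var="APP_DATA_ROOT",
                 default_root=r"D:\AppData"):
    """Resolve the application data directory from an environment
    variable, falling back to a data-volume default instead of the
    system volume. Both the variable name and the default root are
    illustrative placeholders for a real deployment convention."""
    root = os.environ.get(env_var, default_root)
    return os.path.join(root, app_name)
```

With data resolved this way, restoring the OS volume from a snapshot leaves the application data volume, and its separate RPO, untouched.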
Code, Chaos, Compliance, and Lessons from the CrowdStrike Outage
In the wake of the recent CrowdStrike outage that crippled critical infrastructure across healthcare and travel sectors, developers are left grappling with a complex situation. The incident exposed the potential for seemingly innocuous software updates to trigger widespread chaos. But beyond the technical glitch lies a deeper issue: the tension between security best practices and regulatory constraints. The European Commission’s long-standing agreement with Microsoft, which some believe limited Microsoft’s ability to implement stricter security measures, throws a spotlight on this challenge. This multifaceted situation offers valuable lessons for developers, urging them to prioritize robust development practices while navigating the ever-evolving regulatory landscape.
Lessons for Developers from the CrowdStrike Outage and the EU Agreement
The recent CrowdStrike outage that caused widespread disruption highlights a complex interplay between security, functionality, and regulations.
Here’s what developers can learn:
- Deep Code Review and Testing: The outage reportedly stemmed from a potentially buggy CrowdStrike update. This emphasizes the importance of rigorous code review practices and comprehensive testing procedures, especially when dealing with kernel-level interactions. Invest in static code analysis tools and unit testing frameworks to catch potential issues before deployment.
- Granular Permissions and Sandboxing: Kernel-level access grants immense power to security software. Explore the possibility of implementing granular permission models that restrict software to specific functionalities within the kernel. Additionally, consider sandboxing critical system components to mitigate the impact of potential bugs.
- Adherence to Industry Standards: Standardization efforts in security software development can help reduce compatibility issues and potential vulnerabilities. Actively participate in defining and adhering to industry standards for security software interactions with operating systems.
- Transparency and Communication: Clear communication with users and system administrators is crucial. Developers should provide detailed patch notes outlining changes and potential risks associated with updates. Additionally, have a well-defined rollback plan in place in case of unforeseen issues.
- The Balancing Act: The EU agreement highlights the delicate balance between security and functionality. Apple’s approach of restricting kernel access for improved security offers an alternative perspective. Developers should strive to find a balance that prioritizes security without compromising vital functionalities.
- Navigating Regulatory Landscape: Be aware of the regulatory landscape in your target markets. The EU’s competition concerns with Microsoft demonstrate the potential impact of regulations on software development. Stay informed about evolving regulations and adapt your development practices accordingly.
The CrowdStrike incident serves as a stark reminder of the importance of responsible software development across the entire software development lifecycle (SDLC). By incorporating these lessons and fostering open communication, developers can build more robust and secure software solutions.
Azure Outage
This outage was distinct from the Microsoft Azure outage in the Central US region that happened at the same time. The Azure outage, lasting from Thursday night through Friday, was unrelated to the CrowdStrike Falcon sensor issue and stemmed from a backend configuration change that disrupted multiple services, including Virtual Machines and Azure Storage. That incident caused widespread service disruptions for Azure customers, affecting critical infrastructure and business operations. In contrast, the CrowdStrike issue involved a defective Falcon sensor content update that crashed the Windows hosts running it. While both incidents occurred concurrently, they were separate events, each highlighting different aspects of vulnerability within cloud and cybersecurity infrastructures.
Looking Ahead
This outage had significant repercussions across industries reliant on CrowdStrike’s cybersecurity solutions globally. In the long term, it poses challenges for the company. While CrowdStrike has been proactive in addressing the flaw and implementing corrective measures, the incident has underscored potential weaknesses in its release processes. As a team, we will be tracking the long-tail implications of this outage and what remedial steps CrowdStrike takes, specifically in its CI/CD pipeline, to ensure this type of outage doesn’t happen again. Given the high-profile nature of the outage, we fully expect increased scrutiny from clients and stakeholders, potentially affecting future business prospects. TL;DR The next quarterly earnings call will be ‘fun’.
However, it’s worth noting that CrowdStrike’s stock, which initially plummeted following the outage, has already shown signs of recovery. Investors seem to have confidence in the company’s ability to rebound and maintain its position as a leader in the cybersecurity market.
From a regulatory perspective, European Union regulators are increasingly focusing on digital resilience, and the Digital Operational Resilience Act (DORA) exemplifies this trend. DORA aims to establish a comprehensive framework for digital operational resilience, ensuring that financial entities can withstand, respond to, and recover from all types of ICT-related disruptions and threats. This regulation will impose stringent requirements on cybersecurity providers like CrowdStrike to maintain robust and resilient systems.
The CrowdStrike outage will likely accelerate the enforcement of such regulations, particularly in the EU, where local regulators are keen on enhancing cybersecurity measures. Companies will need to demonstrate not only compliance with these regulations but also a proactive approach to identifying and mitigating vulnerabilities. As cybersecurity becomes more regulated, CrowdStrike and its peers will need to adapt to these evolving standards to maintain their market positions.
In conclusion, while the CrowdStrike outage has exposed certain vulnerabilities and posed immediate challenges, the company’s quick response and the market’s reaction suggest a potential for recovery. However, the increasing regulatory focus on cybersecurity, particularly in the EU, will necessitate ongoing vigilance and adaptation to new standards to ensure long-term resilience and trust in its services.
Disclosure: The Futurum Group is a research and advisory firm that engages or has engaged in research, analysis, and advisory services with many technology companies, including those mentioned in this article. The author does not hold any equity positions with any company mentioned in this article.
Analysis and opinions expressed herein are specific to the analyst individually and data and other information that might have been provided for validation, not those of The Futurum Group as a whole.
Other Insights from The Futurum Group:
CrowdStrike IT Outage: Critical Global Impact and Implications for Cybersecurity
Chronosphere Partners with CrowdStrike and Acquires Calyptia
Author Information
Regarded as a luminary at the intersection of technology and business transformation, Steven Dickens is the Vice President and Practice Leader for Hybrid Cloud, Infrastructure, and Operations at The Futurum Group. With a distinguished track record as a Forbes contributor and a ranking among the Top 10 Analysts by ARInsights, Steven's unique vantage point enables him to chart the nexus between emergent technologies and disruptive innovation, offering unparalleled insights for global enterprises.
Steven's expertise spans a broad spectrum of technologies that drive modern enterprises. Notable among these are open source, hybrid cloud, mission-critical infrastructure, cryptocurrencies, blockchain, and FinTech innovation. His work is foundational in aligning the strategic imperatives of C-suite executives with the practical needs of end users and technology practitioners, serving as a catalyst for optimizing the return on technology investments.
Over the years, Steven has been an integral part of industry behemoths including Broadcom, Hewlett Packard Enterprise (HPE), and IBM. His exceptional ability to pioneer multi-hundred-million-dollar products and to lead global sales teams with revenues in the same echelon has consistently demonstrated his capability for high-impact leadership.
Steven serves as a thought leader in various technology consortiums. He was a founding board member and former Chairperson of the Open Mainframe Project, under the aegis of the Linux Foundation. His role as a Board Advisor continues to shape the advocacy for open source implementations of mainframe technologies.
Mitch Ashley is VP and Practice Lead of DevOps and Application Development for The Futurum Group. Mitch has more than 30 years of experience as an entrepreneur, industry analyst, and product development and IT leader, with expertise in software engineering, cybersecurity, DevOps, DevSecOps, cloud, and AI. As an entrepreneur, CTO, CIO, and head of engineering, Mitch led the creation of award-winning cybersecurity products utilized in the private and public sectors, including the U.S. Department of Defense and all military branches. Mitch also led managed PKI services for broadband, Wi-Fi, IoT, energy management and 5G industries, product certification test labs, an online SaaS (93m transactions annually), and the development of video-on-demand and Internet cable services, and a national broadband network.
Mitch shares his experiences as an analyst, keynote and conference speaker, panelist, host, moderator, and expert interviewer discussing CIO/CTO leadership, product and software development, DevOps, DevSecOps, containerization, container orchestration, AI/ML/GenAI, platform engineering, SRE, and cybersecurity. He publishes his research on FuturumGroup.com and TechstrongResearch.com/resources. He hosts multiple award-winning video and podcast series, including DevOps Unbound, CISO Talk, and Techstrong Gang.
Keith Townsend is a technology management consultant with more than 20 years of related experience in designing, implementing, and managing data center technologies. His areas of expertise include virtualization, networking, and storage solutions for Fortune 500 organizations. He holds a BA in computing and an MS in information technology from DePaul University. He is the President of the CTO Advisor, part of The Futurum Group.
At The Futurum Group, Paul Nashawaty, Practice Leader and Lead Principal Analyst, specializes in application modernization across build, release and operations. With a wealth of expertise in digital transformation initiatives spanning front-end and back-end systems, he also possesses comprehensive knowledge of the underlying infrastructure ecosystem crucial for supporting modernization endeavors. With over 25 years of experience, Paul has a proven track record in implementing effective go-to-market strategies, including the identification of new market channels, the growth and cultivation of partner ecosystems, and the successful execution of strategic plans resulting in positive business outcomes for his clients.
Krista focuses on data security, protection, and management, with a particular interest in how these strategies play out in multi-cloud environments. She brings approximately 15 years of experience providing research and advisory services and creating thought leadership content. Her vantage point spans technology and vendor portfolio developments; customer buying behavior trends; and vendor ecosystems, go-to-market positioning, and business models. Her work has appeared in major publications including eWeek, TechTarget and The Register.
Prior to joining The Futurum Group, Krista led the data protection practice for Evaluator Group and the data center practice of analyst firm Technology Business Research. She also created articles, product analyses, and blogs on all things storage and data protection and management for analyst firm Storage Switzerland and led market intelligence initiatives for media company TechTarget.