On November 18, 2025, a significant outage rippled through Cloudflare’s global network beginning at 11:20 UTC. Users attempting to reach the many websites served through Cloudflare encountered error pages caused by failures inside the network itself. Unlike previous incidents caused by cyberattacks or malicious intrusions, this outage was traced to an internal misconfiguration: a change in permissions on one of Cloudflare’s ClickHouse database clusters. The result was a cascade of errors that brought core traffic routing to a standstill for several hours.
What Went Wrong: The Bot Management Feature File Trigger
The outage was rooted in a seemingly simple internal change: a permissions update in Cloudflare’s ClickHouse distributed database environment altered how system tables were queried, causing a “feature file” used by Cloudflare’s Bot Management system to grow unexpectedly large. This file contains the machine learning model features used to distinguish legitimate traffic from automated bots. The inflated file exceeded a size limit the software imposes for performance reasons, triggering failures across the network’s core proxy systems.
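To illustrate the failure mode, here is a minimal sketch in Rust, not Cloudflare’s actual code: it assumes a proxy module that preallocates room for a fixed number of bot-scoring features and must decide what to do when a feature file arrives with more entries than that. The names, the 200-feature cap, and the file format are all hypothetical, chosen only for this example.

```rust
// Hypothetical sketch: enforcing a hard cap on the number of features a
// configuration file may contain. How the failure is handled (reject and
// keep the old file vs. abort the process) determines whether an oversized
// file becomes a non-event or a network-wide outage.

const MAX_FEATURES: usize = 200; // illustrative limit for this sketch

#[derive(Debug)]
enum LoadError {
    TooManyFeatures { found: usize, limit: usize },
}

/// Parse a newline-delimited list of feature names, enforcing the cap.
fn load_feature_file(contents: &str) -> Result<Vec<String>, LoadError> {
    let features: Vec<String> = contents
        .lines()
        .filter(|l| !l.trim().is_empty())
        .map(|l| l.to_string())
        .collect();

    if features.len() > MAX_FEATURES {
        return Err(LoadError::TooManyFeatures {
            found: features.len(),
            limit: MAX_FEATURES,
        });
    }
    Ok(features)
}

fn main() {
    // Simulate a feature file that roughly doubled in size after the
    // upstream query started returning duplicate rows.
    let oversized: String = (0..400).map(|i| format!("feature_{i}\n")).collect();

    match load_feature_file(&oversized) {
        Ok(f) => println!("loaded {} features", f.len()),
        // Graceful path: refuse the new file and keep serving with the
        // last known-good configuration.
        Err(e) => eprintln!("refusing new feature file, keeping previous one: {e:?}"),
    }
    // By contrast, `load_feature_file(&oversized).unwrap()` would panic and
    // take the worker down -- the kind of cascading failure described above.
}
```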
Ripple Effects: Network and Service Disruptions
The failure of the Bot Management system caused widespread HTTP 5xx errors, the class of status codes that signals server-side failures. The disruption reached critical Cloudflare products that depend on the core proxy, including Workers KV, a key-value store used by stateful applications, and Access, which handles user authentication. While Cloudflare’s dashboard remained mostly operational, logins broke because the Turnstile CAPTCHA service protecting them was also impaired, affecting user sessions and administrative control.
Cloudflare’s engineers initially suspected a hyper-scale DDoS attack, given the surge in errors and the simultaneous outage of the company’s independently hosted, off-site status page. Investigation soon pointed conclusively to the oversized feature file produced after the permissions change. Diagnosis was complicated by the fluctuating state of the network, which recovered and then failed again as partial propagations of the file produced inconsistent behavior across the distributed systems.
Recovery and Lessons Learned
Cloudflare halted propagation of the malformed feature file and restored a known-good earlier version shortly after 14:30 UTC. Full traffic flow was restored by 17:06 UTC after a multi-hour recovery effort involving staged service restarts and load management. The incident response exposed a rare but critical failure mode in how permissions changes propagate through distributed systems, and highlighted opportunities for stronger fault tolerance and operational safeguards.
As a result, Cloudflare is committing to hardening configuration file ingestion protocols, enabling global “kill switch” capabilities for feature deployments, and tightening error handling to avoid cascading failures. This incident, acknowledged as Cloudflare’s most severe outage since 2019, serves as a sobering reminder of the fragility of modern internet infrastructure and the challenges of operating at global scale.
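The sketch below, again in Rust and again hypothetical rather than a description of Cloudflare’s implementation, shows how two of those safeguards could fit together: validating a configuration file against basic sanity limits before it is allowed to propagate, and consulting a global kill switch that can disable the feature pipeline entirely during an incident. All identifiers and limits are illustrative assumptions.

```rust
// Hypothetical sketch of hardened configuration ingestion with a kill switch.

use std::sync::atomic::{AtomicBool, Ordering};

// Global kill switch: operators flip this to stop consuming new feature files.
static BOT_FEATURES_ENABLED: AtomicBool = AtomicBool::new(true);

const MAX_FILE_BYTES: usize = 1 << 20; // 1 MiB, illustrative
const MAX_FEATURES: usize = 200;       // illustrative cap

/// Reject files that violate basic sanity limits before they can propagate.
fn validate_feature_file(contents: &str) -> Result<(), String> {
    if contents.len() > MAX_FILE_BYTES {
        return Err(format!("file too large: {} bytes", contents.len()));
    }
    let count = contents.lines().filter(|l| !l.trim().is_empty()).count();
    if count > MAX_FEATURES {
        return Err(format!("too many features: {count} > {MAX_FEATURES}"));
    }
    Ok(())
}

/// Ingest a new feature file only if the kill switch is off and validation
/// passes; otherwise keep serving the last known-good version.
fn ingest(contents: &str) -> bool {
    if !BOT_FEATURES_ENABLED.load(Ordering::Relaxed) {
        eprintln!("bot feature ingestion disabled by kill switch");
        return false;
    }
    match validate_feature_file(contents) {
        Ok(()) => {
            println!("feature file accepted for propagation");
            true
        }
        Err(reason) => {
            eprintln!("rejecting feature file, keeping known-good version: {reason}");
            false
        }
    }
}

fn main() {
    ingest("feature_a\nfeature_b\n"); // accepted
    BOT_FEATURES_ENABLED.store(false, Ordering::Relaxed);
    ingest("feature_a\nfeature_b\n"); // blocked by the kill switch
}
```

The design intent is that a bad file becomes a rejected deployment rather than a crashed proxy, and that operators can sever the pipeline globally while they investigate, rather than chasing partial propagations across the fleet.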











