
So yeah, on November 18 at 11:20 UTC, a big chunk of the Internet just… stopped loading. Spotify spun. ChatGPT gave up. Canva froze. Even Downdetector.com went down, which, honestly, felt sarcastic.
But here’s the thing: no one got hacked. No DDoS won. It was just a size problem.
It was just one config file. A little too big. Pushed a little too fast. And because of how Cloudflare’s network is built, that one file took down thousands of machines, one by one, across 330 cities, like dominoes wired to the same timer.
Let me walk you through it, the way I pieced it together while waiting for my npm packages to install during a build in PRODUCTION :(
Most people think of Cloudflare as a CDN or a firewall. Fair. But technically it’s hosting + DB + VPN + security all in one, and every HTTP request it handles goes through the same proxy pipeline.
FL is the gatekeeper. It decides: bot or human, cached or fresh, allowed or blocked. To do that fast, it loads feature files, tiny config blobs, from a system called Quicksilver.
Quicksilver is wild. It’s a globally replicated key-value store, built on LMDB, running a copy in every single PoP. When a config changes, it syncs to every city in seconds. That’s how Cloudflare stays consistent at scale.
But consistency has a cost: if a bad config goes in, it goes everywhere.
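To make that “everywhere at once” property concrete, here’s a toy TypeScript model. This is my sketch, not Cloudflare’s actual API; the class, key, and PoP names are made up.

```ts
// Toy model of a globally replicated KV store: one write fans out to a
// local replica in every PoP, and each proxy reads from its own copy.
// All names here are illustrative, not Cloudflare's real interfaces.

class ReplicatedKV {
  private replicas = new Map<string, Map<string, Uint8Array>>();

  constructor(pops: string[]) {
    for (const pop of pops) this.replicas.set(pop, new Map());
  }

  // A single write lands in every PoP's local copy within seconds.
  put(key: string, value: Uint8Array): void {
    for (const replica of this.replicas.values()) replica.set(key, value);
  }

  // Reads are served from the local replica: fast, but it also means a bad
  // value becomes visible everywhere at the same time.
  get(pop: string, key: string): Uint8Array | undefined {
    return this.replicas.get(pop)?.get(key);
  }
}

// One push, and every PoP's proxy sees the same bytes.
const kv = new ReplicatedKV(["lhr", "sin", "iad"]);
kv.put("bot-management/feature-file", new TextEncoder().encode('{"features": []}'));
console.log(kv.get("sin", "bot-management/feature-file")); // identical in every city
```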
Here’s how it unfolded, per Cloudflare’s report. A routine permissions update in a ClickHouse cluster, "just tightening access controls," caused a query to return duplicate rows, roughly doubling the size of the Bot Management feature file. FL enforces a hard memory limit on that file for safety on high-throughput edge nodes, and the oversized file blew past it, so FL crashed outright on startup: no fallback, no warning, just failure.
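In spirit, the failure mode looked something like this sketch: a hard cap that is a perfectly sensible guardrail on its own, paired with a load path that has no “keep the last good file” branch. The cap value and names here are my assumptions, not Cloudflare’s code.

```ts
// Illustrative sketch of the failure mode: a hard cap on config size that
// aborts startup instead of falling back to the last known-good file.

interface Feature { name: string; weight: number }

const MAX_FEATURES = 200; // hypothetical cap, a guardrail for pre-allocated memory

function parseFeatureFile(raw: string): Feature[] {
  const features: Feature[] = JSON.parse(raw);
  if (features.length > MAX_FEATURES) {
    // There is no "fall back to the previous file" branch here; the process just dies.
    throw new Error(`feature file too large: ${features.length} > ${MAX_FEATURES}`);
  }
  return features;
}

// Duplicate rows from the ClickHouse query roughly doubled the feature list,
// so a file that normally fit under the cap suddenly blew past it:
const normal = Array.from({ length: 120 }, (_, i) => ({ name: `f${i}`, weight: 1 }));
const duplicated = [...normal, ...normal]; // 240 entries instead of 120
parseFeatureFile(JSON.stringify(normal));     // fine
parseFeatureFile(JSON.stringify(duplicated)); // throws, and the proxy never comes up
```

The point isn’t the cap itself; it’s that the only reaction to exceeding it was to die.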
Now here’s the kicker: the file was regenerated every 5 minutes. And the ClickHouse update was rolling out gradually, so sometimes you got the good file, sometimes the bad one.
That’s why the outage pulsed. Services would pop back for 90 seconds, then vanish again. Engineers thought it was a DDoS at first, and honestly that’s what it looked like.
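Here’s a toy simulation of that flapping, assuming the generator’s query lands on a random ClickHouse node each run (my simplification, not how their pipeline is actually wired):

```ts
// Toy simulation of why the outage "pulsed": the feature file was rebuilt
// every 5 minutes, and whether it came out good or bad depended on whether
// the query hit an already-updated ClickHouse node. Illustrative only.

function regenerateFeatureFile(fractionOfNodesUpdated: number): "good" | "bad" {
  // Updated nodes return duplicate rows, which produce an oversized (bad) file.
  return Math.random() < fractionOfNodesUpdated ? "bad" : "good";
}

// Halfway through the rollout, roughly every other regeneration is poisoned,
// so the edge flips between "working" and "down" every few minutes.
for (let cycle = 0; cycle < 6; cycle++) {
  console.log(`cycle ${cycle}: file is ${regenerateFeatureFile(0.5)}`);
}
```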
Once all ClickHouse nodes were updated, every new file was bad. FL stayed down. Global routing collapsed.
By 14:30, they stopped the generator, shoved a known-good file into Quicksilver manually, and restarted FL PoP by PoP. Full recovery by 17:06.
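That restart is a classic staged rollout: bring one location back, confirm it’s healthy, move on. Something like this shape, though the functions here are entirely my sketch, not their tooling.

```ts
// Sketch of a PoP-by-PoP restart with a health gate between steps.
// Purely illustrative of the shape of the recovery, not Cloudflare's scripts.

async function restartPop(pop: string): Promise<void> {
  console.log(`restarting proxies in ${pop}...`); // stand-in for the real restart
}

async function isHealthy(pop: string): Promise<boolean> {
  return true; // stand-in for a health probe, e.g. synthetic requests through that PoP
}

async function rollingRestart(pops: string[]): Promise<void> {
  for (const pop of pops) {
    await restartPop(pop);
    if (!(await isHealthy(pop))) {
      throw new Error(`stopping rollout: ${pop} still unhealthy`);
    }
  }
}

rollingRestart(["lhr", "fra", "sin", "iad"]).catch(console.error);
```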
It wasn’t just websites returning 502s. The failure rippled: per the report, Turnstile stopped loading, Workers KV started throwing errors, and even logging into the Cloudflare dashboard became a struggle.
Affected apps? Almost every major one: Twitter, ChatGPT, Spotify, Canva, Perplexity, Uber, DoorDash, Claude, and yes, even Downdetector.
Their servers didn’t disappear. Only the door in front of them did.
The fix was surgical: stop the bad generator, hand-push a known-good file into Quicksilver, restart the proxies.
Long term? Per the post-mortem, they’re hardening how internally generated config files are validated before they ship to the edge, and adding more global kill switches so a feature like this can be turned off fast.
This wasn’t a “they messed up” moment. It was a “scale found the edge of the map” moment. And the fact they published a full post-mortem while still cleaning up? That’s respect.
I have an old PC at home using Cloudflare Tunnel to expose my lab. I built my final-year project on Workers. I use R2 for backups. I trust this stack, not blindly, but because it’s earned it.
When the outage hit, I didn’t rage. I ran dig +short <my-domain> and saw it still resolved. So I knew: DNS was up, the TLS handshake worked, the problem was deeper. FL.
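If you’d rather do that triage from code than from the terminal, a quick check like this (my own snippet, nothing official, and example.com is just a placeholder) tells you whether Cloudflare’s edge answered at all:

```ts
// Quick triage from code: if the request completes and the response carries a
// cf-ray header, DNS and TLS are fine and the failure is inside the proxy layer.

async function triage(url: string): Promise<void> {
  try {
    const res = await fetch(url, { method: "HEAD" });
    const ray = res.headers.get("cf-ray"); // Cloudflare sets this when its edge answered
    console.log(`status ${res.status}, cf-ray: ${ray ?? "none"}`);
    if (res.status >= 500 && ray) {
      console.log("edge reachable but erroring: the problem is past DNS and TLS");
    }
  } catch (err) {
    console.log("request never completed: suspect DNS, TCP, or TLS", err);
  }
}

triage("https://example.com/");
```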
And when the blog post dropped an hour later? I read it the way I’d read my university results, and the apology from Matthew Prince just broke me.
We are sorry for the impact to our customers and to the Internet in general. Given Cloudflare's importance in the Internet ecosystem any outage of any of our systems is unacceptable. That there was a period of time where our network was not able to route traffic is deeply painful to every member of our team. We know we let you down today. - Matthew Prince
Because here’s the truth: any system this fast, this distributed, this ambitious will eventually meet a corner case no test caught.
What matters is: do they own it? Do they explain it? Do they fix it and share how?
Cloudflare did all three.
So yeah, the Internet blinked. But it didn’t break. And I’m still shipping to "*.workers.dev" tonight.
-- mohammad