
So yeah, on November 18 at 11:20 UTC, a big chunk of the Internet just… stopped loading. Spotify spun. ChatGPT gave up. Canva froze. Even Downdetector.com went down, which, honestly, felt sarcastic.
But here’s the thing: no one got hacked. No DDoS won. It was just a size problem.
It was just one config file. A little too big. Pushed a little too fast. And because of how Cloudflare’s network is built, that one file took down thousands of machines, one by one, across 330 cities, like dominoes wired to the same timer.
Let me walk you through it, the way I pieced it together while waiting for my npm packages to install during a build in PRODUCTION :(
Most people think of Cloudflare as a CDN or a firewall. Fair. But technically it’s hosting + DB + VPN + security all in one, and every HTTP request it handles goes through the same proxy pipeline.
FL is the gatekeeper. It decides: bot or human, cached or fresh, allowed or blocked. To do that fast, it loads feature files, tiny config blobs, from a system called Quicksilver.
Quicksilver is wild. It’s a globally replicated key-value store, built on LMDB, running a copy in every single PoP. When a config changes, it syncs to every city in seconds. That’s how Cloudflare stays consistent at scale.
But consistency has a cost: if a bad config goes in, it goes everywhere.
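To make that “everywhere at once” property concrete, here’s a toy TypeScript model. This is my sketch, not Cloudflare’s actual API; the class, key, and PoP names are made up.

```ts
// Toy model of a globally replicated KV store: one write fans out to a
// local replica in every PoP, and each proxy reads from its own copy.
// All names here are illustrative, not Cloudflare's real interfaces.

class ReplicatedKV {
  private replicas = new Map<string, Map<string, Uint8Array>>();

  constructor(pops: string[]) {
    for (const pop of pops) this.replicas.set(pop, new Map());
  }

  // A single write lands in every PoP's local copy within seconds.
  put(key: string, value: Uint8Array): void {
    for (const replica of this.replicas.values()) replica.set(key, value);
  }

  // Reads are served from the local replica: fast, but it also means a bad
  // value becomes visible everywhere at the same time.
  get(pop: string, key: string): Uint8Array | undefined {
    return this.replicas.get(pop)?.get(key);
  }
}

// One push, and every PoP's proxy sees the same bytes.
const kv = new ReplicatedKV(["lhr", "sin", "iad"]);
kv.put("bot-management/feature-file", new TextEncoder().encode('{"features": []}'));
console.log(kv.get("sin", "bot-management/feature-file")); // identical in every city
```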
Here’s how it unfolded, per Cloudflare’s report. A routine permissions update in a ClickHouse cluster, "just tightening access controls," caused a query to return duplicate rows, roughly doubling the size of the Bot Management feature file. FL enforces a hard memory limit on that file for safety on high-throughput edge nodes, and the oversized file blew past it, so FL crashed outright on startup: no fallback, no warning, just failure.
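In spirit, the failure mode looked something like this sketch: a hard cap that is a perfectly sensible guardrail on its own, paired with a load path that has no “keep the last good file” branch. The cap value and names here are my assumptions, not Cloudflare’s code.

```ts
// Illustrative sketch of the failure mode: a hard cap on config size that
// aborts startup instead of falling back to the last known-good file.

interface Feature { name: string; weight: number }

const MAX_FEATURES = 200; // hypothetical cap, a guardrail for pre-allocated memory

function parseFeatureFile(raw: string): Feature[] {
  const features: Feature[] = JSON.parse(raw);
  if (features.length > MAX_FEATURES) {
    // There is no "fall back to the previous file" branch here; the process just dies.
    throw new Error(`feature file too large: ${features.length} > ${MAX_FEATURES}`);
  }
  return features;
}

// Duplicate rows from the ClickHouse query roughly doubled the feature list,
// so a file that normally fit under the cap suddenly blew past it:
const normal = Array.from({ length: 120 }, (_, i) => ({ name: `f${i}`, weight: 1 }));
const duplicated = [...normal, ...normal]; // 240 entries instead of 120
parseFeatureFile(JSON.stringify(normal));     // fine
parseFeatureFile(JSON.stringify(duplicated)); // throws, and the proxy never comes up
```

The point isn’t the cap itself; it’s that the only reaction to exceeding it was to die.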
Now here’s the kicker: the file was regenerated every 5 minutes. And the ClickHouse update was rolling out gradually, so sometimes you got the good file, sometimes the bad one.
That’s why the outage pulsed. Services would pop back for 90 seconds, then vanish again. Engineers thought it was a DDoS at first, and honestly that’s what it looked like.
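Here’s a toy simulation of that flapping, assuming the generator’s query lands on a random ClickHouse node each run (my simplification, not how their pipeline is actually wired):

```ts
// Toy simulation of why the outage "pulsed": the feature file was rebuilt
// every 5 minutes, and whether it came out good or bad depended on whether
// the query hit an already-updated ClickHouse node. Illustrative only.

function regenerateFeatureFile(fractionOfNodesUpdated: number): "good" | "bad" {
  // Updated nodes return duplicate rows, which produce an oversized (bad) file.
  return Math.random() < fractionOfNodesUpdated ? "bad" : "good";
}

// Halfway through the rollout, roughly every other regeneration is poisoned,
// so the edge flips between "working" and "down" every few minutes.
for (let cycle = 0; cycle < 6; cycle++) {
  console.log(`cycle ${cycle}: file is ${regenerateFeatureFile(0.5)}`);
}
```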
Once all ClickHouse nodes were updated, every new file was bad. FL stayed down. Global routing collapsed.
By 14:30, they stopped the generator, shoved a known-good file into Quicksilver manually, and restarted FL PoP by PoP. Full recovery by 17:06.
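That restart is a classic staged rollout: bring one location back, confirm it’s healthy, move on. Something like this shape, though the functions here are entirely my sketch, not their tooling.

```ts
// Sketch of a PoP-by-PoP restart with a health gate between steps.
// Purely illustrative of the shape of the recovery, not Cloudflare's scripts.

async function restartPop(pop: string): Promise<void> {
  console.log(`restarting proxies in ${pop}...`); // stand-in for the real restart
}

async function isHealthy(pop: string): Promise<boolean> {
  return true; // stand-in for a health probe, e.g. synthetic requests through that PoP
}

async function rollingRestart(pops: string[]): Promise<void> {
  for (const pop of pops) {
    await restartPop(pop);
    if (!(await isHealthy(pop))) {
      throw new Error(`stopping rollout: ${pop} still unhealthy`);
    }
  }
}

rollingRestart(["lhr", "fra", "sin", "iad"]).catch(console.error);
```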
It wasn’t just websites returning 502s. The failure rippled: per the report, Turnstile stopped loading, Workers KV started throwing errors, and even logging into the Cloudflare dashboard became a struggle.
Affected apps? Almost every major one: Twitter, ChatGPT, Spotify, Canva, Perplexity, Uber, DoorDash, Claude, and yes, even Downdetector.
Their servers didn’t disappear. Only the door in front of them did.
The fix was surgical: stop the bad generator, hand-push a known-good file into Quicksilver, restart the proxies.
Long term? Per the post-mortem, they’re hardening how internally generated config files are validated before they ship to the edge, and adding more global kill switches so a feature like this can be turned off fast.
This wasn’t a “they messed up” moment. It was a “scale found the edge of the map” moment. And the fact they published a full post-mortem while still cleaning up? That’s respect.
I have an old PC at home using Cloudflare Tunnel to expose my lab. I built my final-year project on Workers. I use R2 for backups. I trust this stack, not blindly, but because it’s earned it.
When the outage hit, I didn’t rage. I ran dig +short <my-domain> and saw it still resolved. So I knew: DNS was up, the TLS handshake worked, the problem was deeper. FL.
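If you’d rather do that triage from code than from the terminal, a quick check like this (my own snippet, nothing official, and example.com is just a placeholder) tells you whether Cloudflare’s edge answered at all:

```ts
// Quick triage from code: if the request completes and the response carries a
// cf-ray header, DNS and TLS are fine and the failure is inside the proxy layer.

async function triage(url: string): Promise<void> {
  try {
    const res = await fetch(url, { method: "HEAD" });
    const ray = res.headers.get("cf-ray"); // Cloudflare sets this when its edge answered
    console.log(`status ${res.status}, cf-ray: ${ray ?? "none"}`);
    if (res.status >= 500 && ray) {
      console.log("edge reachable but erroring: the problem is past DNS and TLS");
    }
  } catch (err) {
    console.log("request never completed: suspect DNS, TCP, or TLS", err);
  }
}

triage("https://example.com/");
```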
And when the blog post dropped an hour later? I read it the way I’d read my university results, and the apology from Matthew Prince just broke me.
We are sorry for the impact to our customers and to the Internet in general. Given Cloudflare's importance in the Internet ecosystem any outage of any of our systems is unacceptable. That there was a period of time where our network was not able to route traffic is deeply painful to every member of our team. We know we let you down today. - Matthew Prince
Because here’s the truth: any system this fast, this distributed, this ambitious will eventually meet a corner case no test caught.
What matters is: do they own it? Do they explain it? Do they fix it and share how?
Cloudflare did all three.
So yeah, the Internet blinked. But it didn’t break. And I’m still shipping to "*.workers.dev" tonight.
-- mohammad