Content Delivery at Netflix Scale

June 22, 2026

Cache misses at scale: breaking down Netflix Open Connect

I haven't worked on anything close to Netflix scale. The biggest thing I've shipped doesn't move in a month what they move in a second. But I read their engineering post on classifying cache misses the other day, and what stuck with me wasn't the size of it, it was how familiar it felt. It's the same handful of fundamentals you pick up early on (caching, locality, data modeling), just stretched until they wrap around the whole planet.

If the only place you've ever read "cache miss" is in a textbook about CPUs and stuff, this is that exact idea. The cache is just a big server living inside your internet provider now, and a miss costs real money and a worse picture on your TV.

What's a cache miss, really?

Your CPU doesn't go to RAM every time it needs a number. RAM is slow (relatively speaking), so there's a hierarchy in front of it: L1, L2, and L3 caches, tiny and stupidly fast, sitting right next to the core. The rule is simple: the closer the data is, the cheaper it is to get. When the thing you need is already in the nearby fast cache, that's a hit. When it isn't, and you have to go fetch it from somewhere further and slower, that's a miss.

That's it. That's the whole concept. Hot data close, cold data far, and you pay a tax every time you guessed wrong.
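
To make the hit/miss idea concrete, here's a minimal sketch of a small cache sitting in front of a slow store. Everything in it is my own illustration: the `TinyCache` name, the least-recently-used eviction policy, and the toy keys are assumptions, not how any particular CPU or CDN does it.

```python
from collections import OrderedDict


class TinyCache:
    """A small LRU cache in front of a slow backing store (illustrative only)."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self.backing_store = backing_store  # the "far" tier
        self.entries = OrderedDict()        # the "near" tier
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)     # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = self.backing_store[key]       # the slow trip
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return value


store = {"a": 1, "b": 2, "c": 3}
cache = TinyCache(capacity=2, backing_store=store)
for key in ["a", "b", "a", "c", "b"]:
    cache.get(key)
print(cache.hits, cache.misses)
```

Only the second read of "a" is a hit here. Every other read pays for the trip to the far tier, including the last read of "b", which had already been evicted to make room for "c".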

Now zoom out. Way out. Swap "L1 cache vs RAM" for "a server in your city vs a data center two countries away," and you've basically described a CDN. Netflix's CDN is called Open Connect, and a cache miss there is the same disappointment as the CPU one: you wanted the bytes from close by, and they came from far.

Enter Open Connect

Open Connect is Netflix's own CDN; they built it instead of renting one. The hardware is a custom box called an OCA (Open Connect Appliance), and the whole trick is where those boxes live. Netflix puts them at internet exchange points and, better yet, physically inside the ISPs you and I pay every month.

Think about what that buys them. When the OCA sits inside your ISP, your stream barely touches the public internet; it hops from a box down the road straight to your couch. Closer cache, cheaper bytes, better picture. The works.

There are really two separate systems:

  • the data plane, the OCAs. The fridges, or more like shelves, full of movies, doing the actual heavy lifting of pushing video.
  • the control plane, a set of services running up in AWS that decide which OCA you should talk to.

The thing serving the bytes and the thing deciding who serves the bytes are not the same system. That split shows up over and over in big systems: the brain and the muscle can be, and usually should be, different services.

What happens when you press play

Here's the flow when you hit play, more or less:

  1. a playback apps service in AWS figures out which files you actually need for what you asked to watch.
  2. a steering service looks at where you are and which OCAs have what, and builds you a ranked list. Netflix calls it the proximity rank: a rank-ordered list of OCAs, best (closest) first.
  3. a cache control service is quietly collecting reports from every OCA: how full they are, how healthy they are, which files they're holding, what network routes they advertise.
  4. you get back a list of URLs, and your client connects to the one at the top.

How does "closest" get decided? Not with a map and a ruler, but with BGP routes. ISPs advertise the IP ranges they own, and Steering matches your IP to the longest, most specific prefix it can find. That's closest in network terms, which is what actually matters, not closest in physical distance.
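
Longest-prefix matching is easy to sketch with Python's standard `ipaddress` module. The prefixes and OCA names below are made up for illustration (the addresses come from documentation ranges), and a real steering service works from live BGP advertisements rather than a hard-coded dict.

```python
import ipaddress

# Hypothetical advertised prefixes mapped to hypothetical OCA names.
ADVERTISED = {
    "203.0.113.0/24": "oca-isp-local",
    "203.0.0.0/16": "oca-ix-regional",
    "0.0.0.0/0": "oca-fallback",
}


def longest_prefix_match(client_ip, advertised):
    """Return (prefix, oca) for the most specific prefix containing client_ip."""
    ip = ipaddress.ip_address(client_ip)
    best = None
    for prefix, oca in advertised.items():
        network = ipaddress.ip_network(prefix)
        # Keep the match with the longest (most specific) prefix length.
        if ip in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, oca)
    if best is None:
        return None
    return str(best[0]), best[1]


print(longest_prefix_match("203.0.113.7", ADVERTISED))
```

An address inside the /24 also sits inside the /16 and the default route, but the /24 wins because it is the most specific advertisement.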

So when everything goes right, you connect to the OCA sitting at position 0, i.e., the best one.

So, when is it a miss?

A miss is when your bytes come from anywhere that isn't the best available OCA, independent of what state that OCA happens to be in. In their words: not served from position 0 in the proximity rank.

That last part is the bit I'd have gotten wrong. It's not "did you get served, yes or no." You got your movie either way; the stream worked. The miss is about efficiency: you got served, but from further down the list than you should've. The movie played, but the system still lost a little.

Two ways to miss: content vs health

A miss isn't just a miss. Why you missed matters, because the fix is completely different depending on the reason. So they sort every miss into a small set of buckets:

Content miss ("C"). The closest OCA simply didn't have the file. The bytes you wanted weren't on the box down the road, so you had to reach further to get them. This is a prediction problem: it means Netflix guessed wrong about what people in your area would want, or didn't copy it onto the local box in time. It tells them things like: is our popularity prediction any good? are we pre-positioning content fast enough? do these boxes need bigger disks?

Health miss ("H"). The closest OCA had the file; it just couldn't serve you. The box was saturated: CPU maxed, disk hammered, no headroom left. Each OCA watches its own bottleneck metrics and basically raises its hand to say "I'm full, send them elsewhere." This one's a capacity and load-balancing problem, not a prediction one. It tells them: do we need a second copy of this insanely popular thing? are we balancing load well across boxes that aren't all the same spec?

And then, no miss. Proximity rank zero. You got the best box. Nothing to fix.

Same symptom from the outside, your stream came from further than ideal, but a content miss and a health miss send you to two completely different teams with two completely different to-do lists. Separating them is the whole point.
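
The three buckets can be written down as a tiny function. This is a sketch under my own assumptions: `OcaState`, its two boolean fields, the rule that a missing file takes precedence over bad health, and the "unknown" fallback are all hypothetical, not Netflix's actual classification logic.

```python
from dataclasses import dataclass


@dataclass
class OcaState:
    """Hypothetical snapshot of the best-ranked OCA at decision time."""
    has_file: bool
    healthy: bool


def classify(served_rank, best_oca):
    """Label one play as 'no miss', 'C' (content), or 'H' (health)."""
    if served_rank == 0:
        return "no miss"   # served from the top of the proximity rank
    if not best_oca.has_file:
        return "C"         # the best box didn't hold the file
    if not best_oca.healthy:
        return "H"         # it held the file but had no headroom
    return "unknown"       # assumption: anything else needs a closer look


print(classify(2, OcaState(has_file=False, healthy=True)))
print(classify(1, OcaState(has_file=True, healthy=False)))
```

The point of keeping the label next to the rank is that the rank says how far you fell and the label says whose problem it is.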

Wait, it's a data problem

Here's where it clicked for me. To even know a miss happened, let alone label it C or H, you have to line up two stories about the same play:

  • What the control plane decided, the ranked list Steering handed you, plus the filtering logic behind it. (They call these the steering playback manifest logs.)
  • What actually happened, the OCA logs, once streaming starts: which files went out, how many bytes.

A miss only reveals itself when you join those two. The decision said "use box A." The reality said "you streamed from box C." Join them, compare, and there's your miss, and the proximity rank tells you how far down you fell.
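
Here's roughly what that join looks like on a handful of in-memory records. The field names (`play_id`, `ranked_ocas`, `oca`) are placeholders I made up; the real logs carry far more and arrive as streams, not lists.

```python
def find_misses(manifest_logs, oca_logs):
    """Join decisions with reality; return (play_id, serving_oca, rank) for misses."""
    # Index what the control plane decided, keyed by play.
    ranked = {m["play_id"]: m["ranked_ocas"] for m in manifest_logs}
    misses = []
    for log in oca_logs:
        ranking = ranked.get(log["play_id"])
        if ranking is None or log["oca"] not in ranking:
            continue  # nothing to compare against
        rank = ranking.index(log["oca"])  # 0-based position of the serving OCA
        if rank > 0:
            misses.append((log["play_id"], log["oca"], rank))
    return misses


manifest_logs = [
    {"play_id": "p1", "ranked_ocas": ["oca-a", "oca-b", "oca-c"]},
    {"play_id": "p2", "ranked_ocas": ["oca-a", "oca-b"]},
]
oca_logs = [
    {"play_id": "p1", "oca": "oca-c", "bytes_sent": 1000},
    {"play_id": "p2", "oca": "oca-a", "bytes_sent": 2000},
]
print(find_misses(manifest_logs, oca_logs))
```

Play p1 was told "use oca-a" and streamed from oca-c, so it shows up as a miss at rank 2. Play p2 got the top box and doesn't appear at all.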

Sounds easy, right? At this scale it sure isn't, and the way they handle it has a lot to do with databases:

Logs land in Kafka, per region. Every region they operate in drops its events into local Kafka, high throughput, low latency, doesn't fall over. Fine.

Then they pull everything into one region before joining. Their data is naturally sharded by geography, with events scattered across regions. And joining across shards is a bad idea; do it naively and you get duplicates and a mess. So they pay for one cross-region transfer up front, land everything in a single region, then join. Stated plainly: co-locate your data before you join it. Moving the data once is cheaper than joining it everywhere, forever.

They enrich with slow-changing dimension tables. The raw logs aren't much: IDs and numbers. The human-readable context (what country, what movie, what encode type) lives in dimension tables that barely change, and you stitch it on. Think of star schemas.
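
In miniature, enrichment is just a lookup against small reference tables. The table contents and field names here (`isp_id`, `file_id`, and so on) are invented for the example; I'm only assuming the general shape of a fact row plus dimension lookups.

```python
# Hypothetical dimension tables: small, rarely updated, keyed by id.
COUNTRY_DIM = {"isp-42": "BR"}
TITLE_DIM = {"file-9001": {"title": "Example Show S1E1", "encode": "hevc-1080p"}}


def enrich(raw, country_dim, title_dim):
    """Attach human-readable context to a raw log row without mutating it."""
    file_info = title_dim.get(raw["file_id"], {})
    return {
        **raw,
        "country": country_dim.get(raw["isp_id"], "unknown"),
        "title": file_info.get("title", "unknown"),
        "encode": file_info.get("encode", "unknown"),
    }


raw_row = {"play_id": "p1", "isp_id": "isp-42", "file_id": "file-9001", "bytes_sent": 1000}
print(enrich(raw_row, COUNTRY_DIM, TITLE_DIM))
```

Because the dimensions barely change, they can be loaded once and reused across a huge number of raw rows.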

Then comes the streaming window join, which merges the manifest logs with the OCA logs into one unified record per play: the files needed, the routable OCAs, the proximity rank (0-based, 0 is best), the decision label (C or H), bytes sent, hours streamed, plus the OCA's capacity and health. That unified row is the data model. Everything downstream reads from it.
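
A batch approximation of the window idea, to show the shape of the unified record. This is a sketch, not a streaming engine: a real one keeps state per key and expires it as time advances. The `UnifiedPlay` fields are a trimmed-down subset, and `ts`, `decision_label`, and the rest of the field names are my assumptions.

```python
from dataclasses import dataclass


@dataclass
class UnifiedPlay:
    """Trimmed-down, hypothetical version of the unified per-play record."""
    play_id: str
    oca: str
    proximity_rank: int
    label: str
    bytes_sent: int


def window_join(manifests, oca_events, window_seconds):
    """Pair each OCA event with its manifest if it arrives within the window."""
    by_play = {m["play_id"]: m for m in manifests}
    joined = []
    for event in oca_events:
        manifest = by_play.get(event["play_id"])
        if manifest is None or event["oca"] not in manifest["ranked_ocas"]:
            continue
        delay = event["ts"] - manifest["ts"]
        if not 0 <= delay <= window_seconds:
            continue  # outside the window, so this pair is never matched
        rank = manifest["ranked_ocas"].index(event["oca"])
        # Assumption: the manifest carries the reason the top box was skipped.
        label = manifest["decision_label"] if rank > 0 else "none"
        joined.append(
            UnifiedPlay(event["play_id"], event["oca"], rank, label, event["bytes_sent"])
        )
    return joined
```

The window is the trade-off knob: too short and late OCA logs never find their manifest, too long and the join has to hold a lot of state.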

And the headline metric that falls out:

content shed ratio = content shed traffic / total streamed traffic

i.e., of all the bytes we pushed, what fraction got "shed" to a worse location because the right one didn't have the content. One number, watchable in near-real-time, sliceable per OCA, per movie, per file, per encode, per country.
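
Computing the ratio from unified records, with an optional slice, might look like this. The record fields (`proximity_rank`, `label`, `bytes_sent`, `country`) and the sample numbers are invented, and I'm assuming "content shed traffic" means bytes served from rank > 0 with label C.

```python
from collections import defaultdict


def content_shed_ratio(records, key=None):
    """Return {group: content-shed bytes / total bytes}, grouped by `key` if given."""
    totals = defaultdict(int)
    shed = defaultdict(int)
    for r in records:
        group = r[key] if key else "all"
        totals[group] += r["bytes_sent"]
        if r["proximity_rank"] > 0 and r["label"] == "C":
            shed[group] += r["bytes_sent"]
    return {g: shed[g] / totals[g] for g in totals if totals[g] > 0}


records = [
    {"country": "BR", "proximity_rank": 0, "label": "none", "bytes_sent": 600},
    {"country": "BR", "proximity_rank": 2, "label": "C", "bytes_sent": 200},
    {"country": "BR", "proximity_rank": 1, "label": "H", "bytes_sent": 200},
    {"country": "US", "proximity_rank": 0, "label": "none", "bytes_sent": 500},
    {"country": "US", "proximity_rank": 1, "label": "C", "bytes_sent": 500},
]
print(content_shed_ratio(records))
print(content_shed_ratio(records, key="country"))
```

With these toy numbers, 700 of 2000 bytes were content-shed overall. Sliced by country, BR sheds 200 of 1000 and US sheds 500 of 1000. Health misses count toward total traffic but not toward the numerator.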

The quiet superpower: replaying the past

Because every play gets reduced to a clean, unified record, they can replay the whole thing offline: feed the same logs through the logic with different parameters and ask "what if?" What if this box had more disk? What if we'd pre-positioned that title? You answer it in a simulation, against real history, without touching a single byte of live traffic.
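
A toy version of that what-if, assuming the only thing that changes is which files sit on the nearest box. The `replay` function, the file ids, and the byte counts are all hypothetical, and the sketch ignores health misses entirely.

```python
def replay(history, local_files):
    """Fraction of bytes that would be content-shed if the nearest OCA held local_files."""
    total = sum(play["bytes_sent"] for play in history)
    shed = sum(play["bytes_sent"] for play in history if play["file_id"] not in local_files)
    return shed / total if total else 0.0


history = [
    {"file_id": "f1", "bytes_sent": 400},
    {"file_id": "f2", "bytes_sent": 100},
    {"file_id": "f3", "bytes_sent": 500},
]
baseline = replay(history, {"f1"})        # what the box actually held
what_if = replay(history, {"f1", "f3"})   # same history, f3 pre-positioned too
print(baseline, what_if)
```

Same history, one parameter changed, and the shed fraction drops from 600 of 1000 bytes to 100 of 1000. No live traffic was involved in finding that out.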

That's the dividend of modeling your data well.

Final Thoughts

I came in expecting planet-scale (no pun intended) magic and left with something more useful: it's the fundamentals, all the way down. A cache miss is a cache miss whether it's L1 and RAM or an ISP closet and a data center. Locality is locality. And the only reason any of it is actionable is a boring, well-modeled join between "what we decided" and "what actually happened."

The one genuinely new wrinkle, for me, was refusing to treat all misses the same: a content miss and a health miss look identical from the outside and have nothing to do with each other under the hood. So don't just log that something went wrong, log enough to know why, because the why is what tells you how to fix it.