Cache Stampede: The Thundering Herd at 3 AM

Youssef El Hejjioui · June 25, 2026 · 5 min read

You've been there. It's 3 AM, the pager screams, and your dashboard lights up like a poorly configured Christmas tree. You stare, bleary-eyed, at a latency graph that's gone vertical, and your database connection pool is maxed out. But traffic isn't up, not really. What the hell happened? Another cache entry just decided it was done with life, and your entire application decided to dogpile the poor backend trying to regenerate it. Welcome, friend, to the beautiful disaster we affectionately call a cache stampede.

It's not some exotic, theoretical problem. It's a recurring, deeply human one, born from the simplest of premises: a highly requested piece of data, cached for performance, expires. Then every concurrent request for that data misses the cache and slams the underlying data source at the same moment. That 'thundering herd' isn't just a catchy phrase; it's a very literal description of what happens to your database or upstream service. It's like everyone in a packed stadium suddenly deciding they're thirsty and all rushing the same water fountain.

The Anatomy of the Meltdown

Imagine a critical endpoint, say, '/api/v1/product/featured', which fetches a curated list of top products. This list is expensive to compute – maybe it hits a recommendation engine, joins a few tables, or fetches data from some external, notoriously slow microservice. So, you cache it. Smart move, right? You slap a 5-minute TTL on it in Redis, or Memcached, or even just an in-memory map.

Everything's humming along beautifully for days, weeks, maybe even months. Your API latency is stellar, your database is bored, and you're feeling pretty good about your architecture. The entry has quietly expired and been rebuilt thousands of times by now. Then, at exactly the wrong moment, that 5-minute TTL expires while hundreds, thousands, maybe even tens of thousands of concurrent user requests are arriving at your API, all asking for that same '/api/v1/product/featured' data. Each one finds the cache empty. Each one then, in its infinite wisdom, decides it should be the one to go regenerate the data.

What follows is predictable and painful: your backend service instances, now operating without the protective shield of the cache, all try to execute that expensive computation or query. Your database connection pool saturates. Queries start timing out. The upstream microservice you depend on, which was already on the edge, buckles under the sudden load. Latency spikes from milliseconds to seconds. Users see spinners, then error messages. And because you've probably got some retry logic in place, those timeouts lead to more requests, exacerbating the problem into a self-inflicted DDoS attack. It's a cascade, an avalanche, all because a tiny 'ttl' counter hit zero.
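To make the failure mode concrete, here is a minimal sketch of the naive cache-aside pattern described above. Everything in it is an assumption for illustration: the dict stands in for Redis or Memcached, and 'load_featured_products' is a hypothetical expensive call simulated with a sleep.

```python
import threading
import time

cache = {}            # stands in for Redis/Memcached/an in-memory map
backend_calls = 0     # how many times the expensive path actually ran
calls_lock = threading.Lock()


def load_featured_products():
    """Hypothetical expensive backend call (query, recommendation engine, ...)."""
    global backend_calls
    with calls_lock:
        backend_calls += 1
    time.sleep(0.1)   # simulate slow work
    return ["product-1", "product-2"]


def get_featured_naive():
    """Plain cache-aside: every caller that misses recomputes on its own."""
    value = cache.get("featured")
    if value is None:
        value = load_featured_products()
        cache["featured"] = value
    return value


def simulate_stampede(concurrent_requests=20):
    """Fire concurrent requests at an empty cache; return the backend call count."""
    global backend_calls
    cache.clear()
    backend_calls = 0
    threads = [threading.Thread(target=get_featured_naive)
               for _ in range(concurrent_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return backend_calls
```

Calling 'simulate_stampede(20)' will typically report close to 20 backend calls for what should have been one, because every thread sees the empty cache before the first one finishes writing.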

Fighting Back: Strategies for Survival

So, how do you prevent your carefully constructed performance layer from turning into a weapon against itself? There's no single silver bullet, just a series of pragmatic tradeoffs and defensive maneuvers.

1. The Brute Force: Locking Mechanisms

The most straightforward, almost instinctual, approach is to say, "Okay, only one request gets to regenerate the data." This means introducing a lock. If your cache is local to a single application instance, a simple in-process mutex or semaphore might suffice. The first request grabs the lock, regenerates the data, puts it back in the cache, and releases the lock. Subsequent requests for that key, while the lock is held, simply wait for the lock to be released, then check the cache again (which should now be populated).
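A minimal sketch of that in-process variant, assuming a single application instance and a thread-based server. The class name 'LockedCache' and its parameters are made up for this example; the important part is the second cache check after acquiring the lock.

```python
import threading
import time


class LockedCache:
    """In-process cache where one thread per key regenerates after a miss."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}              # key -> (value, expires_at)
        self._key_locks = {}            # key -> threading.Lock
        self._guard = threading.Lock()  # protects _key_locks

    def _fresh(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry
        return None

    def _lock_for(self, key):
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def get(self, key, loader):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]             # fast path: no locking on a hit
        with self._lock_for(key):       # everyone else waits here
            entry = self._fresh(key)    # re-check: a peer may have filled it
            if entry is not None:
                return entry[0]
            value = loader()
            self._entries[key] = (value, self._clock() + self._ttl)
            return value
```

Without the re-check inside the lock, each waiting thread would run the loader in turn once the lock was released, which serializes the stampede instead of preventing it.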

However, in a distributed system, this becomes a distributed lock. You'd use something like Redis's 'SET' command with the 'NX' and 'EX' options, or ZooKeeper. The first process to acquire the lock becomes the "hero" that regenerates the cache. Everyone else waits. This works, mostly. But distributed locks are not free. They add network latency, complexity, and the delightful potential for deadlocks if your lock-holding process crashes before releasing it. You've got to bake in lock timeouts and robust error handling. And if the process doing the regeneration is slow, all those waiting requests are still holding open connections, tying up resources. It solves the stampede, but might just replace it with a different kind of bottleneck.
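The shape of the distributed version can be sketched as follows. To keep it runnable without a server, it uses 'FakeRedis', a tiny in-memory stand-in written for this example that mimics only 'set' (with 'nx' and 'ex'), 'get', and 'delete'. The function 'get_with_lock', the 'lock:' key prefix, and all the timing defaults are assumptions, not a recommended configuration.

```python
import time
import uuid


class FakeRedis:
    """In-memory stand-in (not thread-safe) for the few calls used below."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}   # key -> (value, expires_at or None)

    def _live(self, key):
        item = self._data.get(key)
        if item is not None and item[1] is not None and item[1] <= self._clock():
            del self._data[key]          # lazily drop expired keys
            return None
        return item

    def set(self, key, value, nx=False, ex=None):
        if nx and self._live(key) is not None:
            return None                  # key already exists: NX fails
        self._data[key] = (value, None if ex is None else self._clock() + ex)
        return True

    def get(self, key):
        item = self._live(key)
        return None if item is None else item[0]

    def delete(self, key):
        return 1 if self._data.pop(key, None) is not None else 0


def get_with_lock(client, key, loader, ttl=300, lock_ttl=10,
                  wait_step=0.05, max_wait=5.0, sleep=time.sleep):
    """Cache-aside read where only the lock holder regenerates the value."""
    value = client.get(key)
    if value is not None:
        return value
    lock_key = "lock:" + key
    token = uuid.uuid4().hex             # identifies this particular holder
    waited = 0.0
    while True:
        # The lock's own expiry (ex=lock_ttl) covers a holder that crashes.
        if client.set(lock_key, token, nx=True, ex=lock_ttl):
            try:
                value = client.get(key)  # re-check after winning the lock
                if value is None:
                    value = loader()
                    client.set(key, value, ex=ttl)
                return value
            finally:
                # Release only our own lock. Against a real server this
                # check-and-delete should be a single atomic operation.
                if client.get(lock_key) == token:
                    client.delete(lock_key)
        if waited >= max_wait:
            raise TimeoutError("gave up waiting for " + lock_key)
        sleep(wait_step)
        waited += wait_step
        value = client.get(key)          # maybe the hero has finished
        if value is not None:
            return value
```

The lock expiry is what turns a crashed "hero" from a deadlock into a delay, and the bounded wait is what keeps slow regeneration from pinning every waiting request forever. With a real client that returns bytes, the token comparison would need adjusting.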

2. The Smart Play: Probabilistic Early Recomputation

This is often a much more elegant solution for frequently accessed data. Instead of letting the cache item expire hard, you introduce a 'soft' expiration. When a request comes in for a cached item, you check how much of its TTL has elapsed. If the item is, say, 90% through its lifespan (e.g., 4.5 minutes into a 5-minute TTL), a single, randomly selected request gets the responsibility of asynchronously refreshing the cache in the background. All other requests continue to be served the existing, slightly stale but still valid, cached data.

This prevents the "hard expiry cliff." The refresh happens gracefully, well before the item actually expires. You're effectively smoothing out the cache invalidation process. The key is 'randomly selected' and 'asynchronously'. You don't want every request at the 90% mark trying to refresh. A touch of randomness, a bit of 'jitter', helps distribute this proactive refresh work across different requests and times. It means a few requests might get slightly older data for a few seconds, but your backend never sees that thundering herd.
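One way to sketch the soft-expiry idea for a single cached value, under some loud assumptions: the class name 'EarlyRefreshCache', the 0.9 soft fraction, and the 10% refresh probability are illustrative choices, and the clock and random source are injectable only so the behavior can be checked deterministically.

```python
import random
import threading
import time


class EarlyRefreshCache:
    """Single-value cache that refreshes in the background before hard expiry."""

    def __init__(self, loader, ttl=300.0, soft_fraction=0.9,
                 refresh_probability=0.1, clock=time.monotonic,
                 rng=random.random):
        self._loader = loader
        self._ttl = ttl
        self._soft_fraction = soft_fraction
        self._p = refresh_probability
        self._clock = clock
        self._rng = rng
        self._value = None
        self._stored_at = None
        self._refreshing = False
        self._lock = threading.Lock()
        self.last_refresh = None         # most recent background thread, if any

    def _store(self, value):
        with self._lock:
            self._value = value
            self._stored_at = self._clock()
            self._refreshing = False

    def _refresh_in_background(self):
        def work():
            try:
                self._store(self._loader())
            except Exception:
                with self._lock:
                    self._refreshing = False   # let a later request retry
        thread = threading.Thread(target=work, daemon=True)
        thread.start()
        return thread

    def get(self):
        with self._lock:
            age = None if self._stored_at is None else self._clock() - self._stored_at
            hard_miss = age is None or age >= self._ttl
            in_soft_window = not hard_miss and age >= self._ttl * self._soft_fraction
            # Roll the dice only if nobody is refreshing already.
            start_refresh = (in_soft_window and not self._refreshing
                             and self._rng() < self._p)
            if start_refresh:
                self._refreshing = True
            value = self._value
        if hard_miss:
            # Not stampede-safe on its own: pair with a lock or coalescing.
            value = self._loader()
            self._store(value)
            return value
        if start_refresh:
            self.last_refresh = self._refresh_in_background()
        return value                     # possibly slightly stale, never blocked
```

The '_refreshing' flag is what keeps "randomly selected" from becoming "several at once": once one request wins the roll, the others keep serving the old value until the new one lands.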

3. Request Coalescing / Deduplication

Similar in spirit to local locking, but often implemented as a layer before the cache lookup. If multiple requests come in for the same expired cache key, and you've decided to regenerate it, you don't let every subsequent request also try to regenerate. Instead, the first request initiates the regeneration. Subsequent requests for that key are 'coalesced' – they are added to a waiting list (or receive a 'promise' or 'future') and will all be served the result of that single regeneration effort when it completes.

This is highly effective in-process for a single application instance. When combined with a distributed lock for the actual write to the shared cache, it forms a robust pattern. It ensures that even if you have hundreds of simultaneous identical requests hitting a single application server, only one expensive operation is performed locally, and only one write operation is coordinated across your distributed cache.
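The in-process half of this pattern can be sketched with a shared future per key. 'SingleFlight' and its 'do' method are names chosen for this example; the sketch assumes a threaded server and does not cover the distributed write.

```python
import threading
from concurrent.futures import Future


class SingleFlight:
    """Collapse concurrent calls for the same key into one loader call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}   # key -> Future shared by every waiter

    def do(self, key, loader):
        with self._lock:
            future = self._in_flight.get(key)
            leader = future is None
            if leader:
                future = Future()
                self._in_flight[key] = future
        if leader:
            try:
                future.set_result(loader())
            except BaseException as exc:
                future.set_exception(exc)   # followers see the same failure
            finally:
                with self._lock:
                    del self._in_flight[key]
        # Leader and followers alike read the shared result (or re-raise).
        return future.result()
```

Note that the entry is removed as soon as the load finishes, so this only deduplicates requests that overlap in time. It is meant to sit in front of a cache, not to replace one.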

4. Circuit Breakers and Rate Limiters: Damage Control

These aren't direct preventions against a cache stampede, but they're essential for preventing the stampede from taking down your entire service. A circuit breaker, placed around your expensive backend call, can detect when the backend is struggling (e.g., too many timeouts, high error rates) and 'trip', quickly failing subsequent calls rather than piling on more load. This saves the backend from complete collapse and allows it to recover.
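A bare-bones sketch of the trip-and-recover behavior described above, assuming consecutive failures as the trip condition. The names and thresholds are illustrative, and the class is deliberately not thread-safe to keep the state machine readable.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised instead of calling the backend while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after=30.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None   # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                # Fail fast: do not pile more load on a struggling backend.
                raise CircuitOpenError("backend call skipped: circuit is open")
            # Cool-down elapsed: let this one call through as a probe.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()   # trip, or re-trip after a failed probe
            raise
        self._failures = 0
        self._opened_at = None   # a success closes the circuit
        return result
```

Wrapping the expensive regeneration call in 'breaker.call(...)' means that once the backend starts failing, cache misses turn into fast errors (or a stale fallback, if you have one) rather than more queued queries.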

Rate limiters, applied at various layers (API gateway, service ingress), can prevent an overwhelming number of requests from even reaching your application, or certainly from reaching the expensive backend calls. If the system is struggling, shedding load is a pragmatic response. It might mean denying some users service temporarily, but it's better than everyone getting an error.
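As one possible shape for such a limiter, here is a small token-bucket sketch. The token bucket is a common choice rather than something the approach above prescribes, and the class name, rate, and capacity are assumptions for illustration. It is single-threaded as written.

```python
import time


class TokenBucket:
    """Allow short bursts up to `capacity`, then `rate_per_second` on average."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self._rate = rate_per_second
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)   # start full
        self._updated = clock()

    def allow(self):
        now = self._clock()
        elapsed = now - self._updated
        self._updated = now
        # Refill for the time that passed, never beyond capacity.
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False                     # caller should shed this request
```

Guarding only the regeneration path with 'allow()' (instead of the whole endpoint) keeps cache hits cheap while capping how many expensive rebuilds can start per second.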

5. Smart(er) Cache Invalidation

Sometimes, the best defense is a good offense. If your cache invalidation isn't solely reliant on TTLs, you can avoid the cliff altogether. Event-driven invalidation (e.g., publishing a message to a pub/sub system when source data changes) means your cache updates proactively rather than reactively on expiry. This is ideal, but also often much harder to implement correctly, especially with complex data dependencies. It shifts the complexity from TTL management to distributed event consistency. Choose your poison.
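A toy, in-process sketch of the event-driven idea, with everything invented for illustration: 'EventBus' stands in for a real pub/sub system, and the 'products.changed' topic is a hypothetical name. It ignores the hard parts (delivery guarantees, ordering, multiple instances) that the paragraph above warns about.

```python
from collections import defaultdict


class EventBus:
    """Minimal synchronous pub/sub, standing in for a real message system."""

    def __init__(self):
        self._subscribers = defaultdict(list)   # topic -> handlers

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in list(self._subscribers[topic]):
            handler(payload)


class FeaturedProductsCache:
    """Holds the featured list and rebuilds it when the source data changes."""

    def __init__(self, bus, loader):
        self._loader = loader
        self._value = loader()                   # warm once at startup
        bus.subscribe("products.changed", self._on_change)

    def _on_change(self, _payload):
        # Rebuild on change, so readers never hit an expiry cliff.
        self._value = self._loader()

    def get(self):
        return self._value
```

Here the writer pays for the rebuild once per change, and readers only ever read. Whether that trade is worth it depends on how often the source data changes compared to how often it is read.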

The Weary Conclusion

Cache stampedes are a fact of life in distributed systems. You won't avoid them by buying some 'enterprise' solution or adopting the latest framework. They stem from fundamental concurrency challenges and the performance-cost tradeoffs we constantly make. The key is understanding your workload, profiling your expensive operations, and then pragmatically applying one or more of these techniques.

There's no magic bullet. You pick the approach that best balances complexity, latency, and consistency for a given piece of data. Sometimes a simple distributed lock is 'good enough' for something that's rarely hit. For high-volume, critical data, you'll want probabilistic early recomputation or request coalescing. And always, always, have your circuit breakers ready. Because production, as we know, cares little for your carefully planned theoretical ideal, and everything eventually breaks in the most inconvenient way possible.

So next time your pager goes off at 3 AM and your metrics look like a bad abstract painting, you'll at least have a few ideas beyond just restarting everything and hoping for the best. Good luck out there. You'll need it.

Frequently Asked Questions

What is a cache stampede?

A cache stampede occurs when a popular cache entry expires and multiple concurrent requests simultaneously attempt to regenerate the same data. Instead of serving requests from the cache, all requests hit the backend service or database at once, potentially overwhelming it.

Why is a cache stampede also called the thundering herd problem?

The term "thundering herd" describes the sudden surge of concurrent requests that occur when a frequently accessed cache entry expires. Large numbers of application instances or users simultaneously attempt to retrieve or regenerate the same piece of data, creating a traffic spike.

What causes cache stampedes?

Cache stampedes are usually caused by cache entries with fixed expiration times, high request volumes, expensive backend operations, and the absence of coordination mechanisms such as locking or request deduplication.

How do you prevent a cache stampede?

Common techniques for preventing cache stampedes include distributed locking, request coalescing, probabilistic early recomputation, stale-while-revalidate strategies, and proactive cache warming.

What is request coalescing in caching?

Request coalescing is a technique where multiple concurrent requests for the same expired cache key are grouped together. Only one request regenerates the data while the remaining requests wait for the result, preventing duplicate backend work.

What is stale-while-revalidate caching?

Stale-while-revalidate is a caching strategy that allows applications to serve slightly stale cached data while refreshing the cache asynchronously in the background. This prevents sudden load spikes when cache entries expire.

Are distributed locks a good solution for cache stampedes?

Distributed locks can effectively prevent cache stampedes by ensuring that only one process regenerates expired data. However, distributed locks introduce additional complexity, latency, and failure scenarios that must be handled carefully.

What is probabilistic early recomputation?

Probabilistic early recomputation is a technique where cache entries are refreshed before their actual expiration time. A randomly selected request performs the refresh asynchronously, reducing the likelihood of a large number of simultaneous cache misses.

Can Redis prevent cache stampedes?

Redis can help mitigate cache stampedes through features such as distributed locks, key expiration policies, and Lua scripts. However, preventing cache stampedes typically requires additional application-level strategies.

Why are cache stampedes dangerous?

Cache stampedes can overload databases, saturate connection pools, increase application latency, trigger cascading failures, and potentially cause complete service outages.

What is the difference between a cache stampede and a cache avalanche?

A cache stampede occurs when a single popular cache entry expires and causes a surge of requests. A cache avalanche happens when many cache entries expire simultaneously, resulting in a much larger traffic spike.

How does jitter help prevent cache stampedes?

Jitter introduces randomness into cache expiration times so that cache entries do not expire simultaneously. This spreads cache refresh operations over time and reduces sudden spikes in backend traffic.

What are the best practices for avoiding cache stampedes in distributed systems?

Best practices include using distributed locks sparingly, implementing stale-while-revalidate strategies, adding TTL jitter, monitoring cache hit ratios, applying circuit breakers, and using request coalescing for high-traffic endpoints.

Which caching strategy is best for high-traffic applications?

High-traffic applications often combine multiple strategies, including stale-while-revalidate, request coalescing, TTL jitter, and circuit breakers, to achieve both high performance and resilience.
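Several of the answers above mention TTL jitter. As a rough sketch of what that can look like, assuming a ±10% spread (an arbitrary choice) and a hypothetical helper name:

```python
import random


def ttl_with_jitter(base_ttl, spread=0.1, rng=random.random):
    """Return base_ttl scaled by a random factor in [1 - spread, 1 + spread)."""
    factor = 1.0 - spread + 2.0 * spread * rng()
    return base_ttl * factor
```

With a 300-second base TTL, entries written at the same moment would then expire somewhere between roughly 270 and 330 seconds later instead of all at once.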

Youssef El Hejjioui

Studies and Development Engineer