Stale-While-Revalidate & Resilient Caching

Serving stale-but-usable content from the edge while a fresh copy is fetched in the background, and continuing to serve it when the origin is down — turning cache expiry from a latency cliff into a smooth, fault-tolerant curve.

  • Use stale-while-revalidate to return an expired-but-recent object instantly and refresh it asynchronously, so users never wait on a revalidation round-trip.
  • Use stale-if-error (and provider serve-stale features) to keep serving the last good response through origin 5xx errors, timeouts, and connection failures.
  • Tune the stale window independently from freshness max-age, and coalesce concurrent revalidations to avoid a thundering herd hitting your origin.
  • Combine async refresh with origin failover and negative (error) caching so a single failed origin degrades gracefully instead of cascading.

[Figure: Lifecycle of a cached object under stale-while-revalidate and stale-if-error, for Cache-Control: max-age=60, stale-while-revalidate=120, stale-if-error=600. A horizontal timeline marked t=0, t=60s, t=180s, and t=780s shows three windows: FRESH (serve HIT, no fetch), STALE-WHILE-REVALIDATE (serve stale now, refresh in background with an async revalidation to the origin), and STALE-IF-ERROR (serve stale only if the origin returns 5xx).]

The default cache contract is binary: an object is either fresh (served instantly as a HIT) or expired (the edge must block while it revalidates against your origin). That blocking revalidation is where tail latency lives — every time max-age lapses, the next request pays a full origin round-trip, and if the origin is slow or down, that request hangs or fails. The Cache-Control extensions defined in RFC 5861, stale-while-revalidate and stale-if-error, break that binary by adding a grace window after expiry during which the edge is permitted to serve the old copy. This is the foundation of resilient caching, and it pairs naturally with the freshness directives covered in Cache-Control & CDN TTL and the deletion controls in Cache Purging & Invalidation.

How stale serving works at the protocol level

A cached response carries two independent clocks once these directives are in play. The first is max-age (or s-maxage for shared caches), which defines the freshness lifetime. Until it elapses, the object is fresh and served without contacting the origin. The second clock is the stale window, expressed as an offset after expiry.

stale-while-revalidate

stale-while-revalidate=N tells the cache: for N seconds after the object becomes stale, you may return the stale object immediately to the client, but you must kick off an asynchronous revalidation against the origin. The client gets the old bytes with zero added latency; the cache quietly fetches the new version and stores it for the next request. Functionally the freshness boundary becomes soft — the perceived max-age for latency purposes is max-age + stale-while-revalidate, but only the first max-age seconds are truly authoritative.

Consider this header:

Cache-Control: public, max-age=60, stale-while-revalidate=120
  • 0–60s: fresh. Served as a HIT.
  • 60–180s: stale-but-revalidatable. The first request in this window is served the stale copy instantly and triggers a background fetch. Once that fetch completes, the object is fresh again for another 60s.
  • >180s: fully expired. The next request must block on a synchronous revalidation.
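
The three windows above can be worked through with a small classifier. This is a minimal sketch, not any CDN's implementation: the function names `parse_cache_control` and `classify` are hypothetical, only integer-valued directives are parsed, and `s-maxage` is ignored for brevity.

```python
from typing import Dict


def parse_cache_control(header: str) -> Dict[str, int]:
    """Extract integer-valued directives such as max-age=60 from a Cache-Control value."""
    directives: Dict[str, int] = {}
    for part in header.split(","):
        name, sep, value = part.strip().partition("=")
        value = value.strip('"')
        if sep and value.isdigit():
            directives[name.lower()] = int(value)
    return directives


def classify(age: int, header: str) -> str:
    """Return the window an object of the given age (in seconds) falls into."""
    d = parse_cache_control(header)
    max_age = d.get("max-age", 0)
    swr = d.get("stale-while-revalidate", 0)
    if age < max_age:
        return "fresh"
    if age < max_age + swr:
        return "stale-while-revalidate"
    return "expired"


HEADER = "public, max-age=60, stale-while-revalidate=120"
for age in (10, 60, 179, 180):
    print(age, classify(age, HEADER))
```

The boundaries are treated as half-open here (an object whose age equals max-age is already stale), which matches the usual freshness rule that a response is fresh only while its lifetime exceeds its age.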

stale-if-error

stale-if-error=N is the resilience half. For N seconds after expiry, if a revalidation attempt fails — the origin returns 500, 502, 503, 504, or the connection times out or is refused — the cache is allowed to serve the stale copy instead of propagating the error. This is what keeps your site up when the origin falls over. The two directives are orthogonal and frequently combined:

Cache-Control: public, max-age=300, stale-while-revalidate=60, stale-if-error=86400

Here you get a 60-second async-refresh grace for latency smoothing, plus a full day of error tolerance so an origin outage never reaches the client. The error window is deliberately long because outages are rare and serving day-old content beats serving a 502.
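
How the two directives combine can be sketched as a single decision function. This is an illustrative model under simplifying assumptions: `Policy` and `decide` are hypothetical names, both windows are measured from the moment of expiry, and `origin_ok` stands in for the outcome of a revalidation attempt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    max_age: int
    stale_while_revalidate: int = 0
    stale_if_error: int = 0


def decide(age: int, origin_ok: bool, policy: Policy) -> str:
    """Pick the cache's action for an object of the given age (seconds)."""
    staleness = age - policy.max_age
    if staleness < 0:
        return "serve-cached"  # still fresh, origin not contacted
    if staleness < policy.stale_while_revalidate:
        return "serve-stale-and-refresh"  # client does not wait on the origin
    if origin_ok:
        return "fetch-from-origin"  # blocking revalidation succeeds
    if staleness < policy.stale_if_error:
        return "serve-stale-on-error"  # origin failed, fall back to old copy
    return "return-error"  # nothing usable is left to serve


# Mirrors the header above: max-age=300, stale-while-revalidate=60, stale-if-error=86400
policy = Policy(max_age=300, stale_while_revalidate=60, stale_if_error=86400)
print(decide(320, origin_ok=True, policy=policy))
print(decide(400, origin_ok=False, policy=policy))
```

In this model an object 400 seconds old is past the async-refresh grace, so a healthy origin means a blocking fetch, while a failing origin still yields the stale copy for up to a day after expiry.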

The wire mechanics matter: when a CDN serves stale, well-behaved implementations annotate the response. Cloudflare reports the state in cf-cache-status: UPDATING means an expired object was served while the origin copy is being refreshed, STALE means an expired object was served because the origin could not be reached, and REVALIDATED means the stale object was confirmed with the origin through a conditional request. Fastly exposes state through X-Cache and surrogate headers. Always confirm via these debug headers rather than assuming, because misconfiguration usually shows up as never serving stale, the most common failure mode discussed below. For how these directives interact with the underlying record TTLs your resolvers cache, see Cache-Control & CDN TTL.

Provider-specific implementation

Cloudflare

Cloudflare honors stale-while-revalidate and stale-if-error from your origin Cache-Control header on the Edge cache, and layers two of its own features on top.

Origin-driven directives. Send the header from your origin and Cloudflare applies it. To set or override it at the edge without touching the origin, use a Cache Rule or a Worker:

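export default {
  async fetch(request, env, ctx) {
    // The Cache API only stores GET requests; pass everything else through.
    if (request.method !== "GET") return fetch(request);

    const cache = caches.default;
    let response = await cache.match(request);
    if (response) return response; // edge HIT

    response = await fetch(request);
    if (!response.ok) return response; // do not cache origin errors

    // Copy the response so its headers become mutable.
    response = new Response(response.body, response);
    response.headers.set(
      "Cache-Control",
      "public, max-age=60, stale-while-revalidate=120, stale-if-error=86400"
    );
    // Store without blocking the response.
    // Caveat: Cloudflare's Cache API documentation notes that cache.put and
    // cache.match do not act on the stale-* directives themselves; the header
    // still reaches downstream caches. Verify against the current docs.
    ctx.waitUntil(cache.put(request, response.clone()));
    return response;
  },
};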

Always Online is Cloudflare’s archival serve-stale: it crawls and stores a copy of your pages, and if your origin is completely unreachable it serves that archived version with a banner. It is coarser than stale-if-error (page-level, periodically crawled, not your live cache object) but requires no headers and covers full-origin death.

Serve Stale Content (configurable on Enterprise/Cache Reserve setups) lets you serve stale on errors and during revalidation as a zone-wide policy, decoupled from per-object headers. Enable it when you want a uniform resilience floor regardless of what your origin emits.

Amazon CloudFront

CloudFront’s support arrived later than Cloudflare’s and Fastly’s: it has honored stale-while-revalidate and stale-if-error from origin Cache-Control headers since 2023, and guidance written before then assumes it does not. Beyond those headers, you assemble resilience from three primitives.

Origin failover groups define a primary and secondary origin; on configured failure status codes (500, 502, 503, 504) or connection failures, CloudFront retries the secondary. This is your first line of defense and is configured per cache behavior:

resource "aws_cloudfront_distribution" "site" {
  origin_group {
    origin_id = "failover-group"
    failover_criteria {
      status_codes = [500, 502, 503, 504]
    }
    member { origin_id = "primary-origin" }
    member { origin_id = "secondary-origin" }
  }
  # ... primary and secondary origin blocks ...
}

Error caching min TTL controls how long CloudFront caches a 4xx/5xx response so repeated failures do not hammer the origin. Set a small but non-zero value (for example 10 seconds) on 5xx to absorb bursts:

custom_error_response {
  error_code            = 503
  error_caching_min_ttl = 10
  response_code         = 503
  response_page_path    = "/maintenance.html"
}

Custom error responses let you point a failing status code at a static maintenance page stored in S3, so users get a branded page instead of a raw gateway error. For serve-stale-on-error logic under your own control, pair CloudFront with a Lambda@Edge origin-response handler that detects a 5xx and returns a previously saved copy of the body. CloudFront Functions run only on viewer events, so they cannot inspect the origin response. The comparison of where this logic should live is covered in Cloudflare Workers vs AWS Lambda@Edge for Request Routing.

Fastly

Fastly has the richest native support and exposes everything in VCL. It honors stale-while-revalidate and stale-if-error from origin headers automatically, and also gives you explicit control over the stale state machine.

sub vcl_fetch {
  # These assignments override any stale-* values the origin sent.
  set beresp.stale_while_revalidate = 60s;
  set beresp.stale_if_error = 86400s;

  # On a 5xx from origin, prefer an existing stale object over the error.
  if (beresp.status >= 500 && beresp.status < 600) {
    if (stale.exists) {
      return(deliver_stale);
    }
    # No stale copy: pass the error through without caching it.
    set beresp.cacheable = false;
  }
  return(deliver);
}

sub vcl_deliver {
  # resp.stale_is_revalidating is set when a stale object is served while
  # a background revalidation is in progress.
  if (resp.stale_is_revalidating) {
    set resp.http.X-Cache = "STALE-WHILE-REVALIDATE";
  }
}

Fastly’s return(deliver_stale) and the stale.exists predicate give you per-request decisions, and its revalidation is request-coalesced by default so a single origin fetch services all waiting clients. This makes Fastly the cleanest platform for tight, header-precise stale behavior; the trade-off is that you are writing and deploying VCL.
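
The coalescing behavior described above can be illustrated outside VCL with a single-flight helper. This is a minimal sketch of the general technique, not Fastly's implementation; `SingleFlight`, `slow_origin_fetch`, and the `/api/widgets` key are hypothetical, and the 0.2 second sleep stands in for origin latency.

```python
import threading
import time
from concurrent.futures import Future
from typing import Callable, Dict, List


class SingleFlight:
    """Let concurrent callers for the same key share one in-flight fetch."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, Future] = {}

    def do(self, key: str, fetch: Callable[[], bytes]) -> bytes:
        with self._lock:
            future = self._inflight.get(key)
            leader = future is None
            if leader:
                future = Future()
                self._inflight[key] = future
        if not leader:
            return future.result()  # wait for the leader's fetch
        try:
            future.set_result(fetch())
        except Exception as exc:
            future.set_exception(exc)  # followers see the same failure
        finally:
            with self._lock:
                del self._inflight[key]
        return future.result()


calls = 0


def slow_origin_fetch() -> bytes:
    """Simulated origin: counts invocations and takes 0.2 s to answer."""
    global calls
    calls += 1
    time.sleep(0.2)
    return b"fresh body"


flight = SingleFlight()
barrier = threading.Barrier(8)
results: List[bytes] = []


def client() -> None:
    barrier.wait()  # release all eight clients at the same moment
    results.append(flight.do("/api/widgets", slow_origin_fetch))


threads = [threading.Thread(target=client) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"origin fetches: {calls}, responses delivered: {len(results)}")
```

The first caller for a key becomes the leader and performs the fetch; everyone arriving while it is in flight waits on the same future, so the origin sees one request regardless of how many clients are waiting.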

Platform comparison

Provider | Mechanism | Wire behavior | Failover / Notes
--- | --- | --- | ---
Cloudflare | Origin stale-while-revalidate/stale-if-error; Always Online; Serve Stale Content | Serves stale instantly, background refresh; cf-cache-status reflects state | Always Online covers full-origin death from crawled archive; Workers can set headers dynamically
CloudFront | Origin stale-while-revalidate/stale-if-error (honored since 2023); origin failover groups + error caching min TTL + custom error responses | Honors origin stale directives; failover retries secondary origin on 5xx | Custom serve-stale logic needs Lambda@Edge; error pages via S3
Fastly | Native VCL stale_while_revalidate, stale_if_error, deliver_stale | Honors origin headers; coalesced revalidation; per-request stale decisions | Most precise control; requires VCL deployment
Azure Front Door | Honors Cache-Control freshness; limited SWR | Background refresh not guaranteed; relies on origin health probes | Resilience mainly via origin group health probes + priority

Step-by-step: deploying resilient stale serving

This procedure assumes a Cloudflare or Fastly edge in front of an HTTP origin. Adapt the header source as noted.

  1. Decide your three windows. Pick max-age from how often the content legitimately changes, stale-while-revalidate from acceptable staleness during refresh (often 30–120s), and stale-if-error from how long you would rather serve old content than fail (often hours to a day).

  2. Emit the header from the origin. The cleanest source of truth is your application:

    # nginx origin
    add_header Cache-Control "public, max-age=60, stale-while-revalidate=120, stale-if-error=86400" always;

  3. Verify the edge stored the directive. Request the object twice and inspect headers:

    curl -sI https://www.example.com/api/widgets | grep -i -E '^(cache-control|cf-cache-status|age):'

    Expect cf-cache-status: HIT and your full Cache-Control on the second request.

  4. Force expiry and observe stale-while-revalidate. Wait past max-age but within the stale window, then time two back-to-back requests:

    sleep 65
    curl -s -o /dev/null -w "first:  %{time_total}s status:%{http_code}\n" https://www.example.com/api/widgets
    curl -s -o /dev/null -w "second: %{time_total}s status:%{http_code}\n" https://www.example.com/api/widgets

    The first should return fast (stale served) while the background refresh runs; the second should also be fast and, once that refresh has completed, fresh, with Age reset to near zero. If the first request is slow, stale-while-revalidate is not active.

  5. Test stale-if-error by faking an outage. Temporarily block the edge from reaching the origin (firewall rule, or point the origin at a sinkhole) and request a recently-cached, now-expired object. You should still get a 200 with a stale body and an increasing Age, not a 5xx.

  6. Wire negative caching / failover. On CloudFront, attach the origin failover group and set error_caching_min_ttl. On Cloudflare, enable Serve Stale Content or Always Online as your floor. This guarantees resilience even for objects whose origin headers were misconfigured.

TTL, caching, and propagation implications

Stale serving changes how you reason about TTLs. The effective time-to-fresh-content is max-age plus however long it takes the next request after expiry to trigger and complete a background revalidation — not max-age alone. On low-traffic paths an object can sit stale for the entire stale-while-revalidate window because no request arrives to trigger the refresh, then serve one stale response, then refresh. Plan content cadence around the busy case, and treat the stale window as latency insurance, not as part of your freshness budget.
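
The low-traffic effect described above can be seen in a toy single-POP simulation. This is a simplified model under stated assumptions: `simulate` is a hypothetical helper, the background refresh is treated as instantaneous, and stale-if-error is left out.

```python
from typing import List, Optional, Tuple


def simulate(request_times: List[int], max_age: int, swr: int) -> List[Tuple[int, str, int]]:
    """Return (time, outcome, age_of_served_content) for each request, in seconds."""
    events: List[Tuple[int, str, int]] = []
    fetched_at: Optional[int] = None
    for t in request_times:
        if fetched_at is None:
            fetched_at = t
            events.append((t, "MISS", 0))  # cold cache, blocking fetch
            continue
        age = t - fetched_at
        if age < max_age:
            events.append((t, "HIT", age))
        elif age < max_age + swr:
            events.append((t, "STALE", age))  # old copy served immediately
            fetched_at = t  # background refresh, modelled as instantaneous
        else:
            fetched_at = t
            events.append((t, "MISS", 0))  # window elapsed, blocking fetch
    return events


for event in simulate([0, 30, 170, 175, 400], max_age=60, swr=120):
    print(event)
```

With max-age=60 and stale-while-revalidate=120, the request at t=170 is the first to arrive after expiry, so it receives content that is 170 seconds old even though the nominal freshness lifetime is 60 seconds. The request at t=400 arrives after the grace window has closed and pays for a blocking fetch.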

Because each CDN POP maintains its own cache, the windows are evaluated per-POP. A revalidation triggered in Frankfurt does not refresh the copy in Singapore. This is the same per-edge independence that makes purges propagation-sensitive — when you ship new content and need it everywhere immediately, an explicit purge as described in Cache Purging & Invalidation is the only deterministic mechanism; stale directives are about graceful aging, not instant consistency.

One subtlety: a purge interacts with stale-if-error. If you purge an object and the origin is simultaneously down, there is no longer a stale copy to fall back on — the purge removed it. For maximum resilience during risky deploys, prefer soft-purge / revalidate-style invalidation (which marks objects stale rather than deleting them) where the provider supports it, so stale-if-error still has something to serve.
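
The difference between the two purge styles during an outage can be modelled in a few lines. This is an illustrative sketch, not a provider API: `EdgeCache`, `soft_purge`, `hard_purge`, and `origin_down` are hypothetical names, and a `ConnectionError` stands in for any origin failure.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Entry:
    body: bytes
    stale: bool = False


class EdgeCache:
    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def store(self, key: str, body: bytes) -> None:
        self._entries[key] = Entry(body)

    def hard_purge(self, key: str) -> None:
        self._entries.pop(key, None)  # the object is gone entirely

    def soft_purge(self, key: str) -> None:
        entry = self._entries.get(key)
        if entry is not None:
            entry.stale = True  # keep the bytes, force revalidation

    def get(self, key: str, fetch: Callable[[], bytes]) -> bytes:
        entry = self._entries.get(key)
        if entry is not None and not entry.stale:
            return entry.body
        try:
            body = fetch()
        except ConnectionError:
            if entry is not None:
                return entry.body  # stale-if-error style fallback
            raise  # nothing cached, the error reaches the client
        self._entries[key] = Entry(body)
        return body


def origin_down() -> bytes:
    raise ConnectionError("origin unreachable")


cache = EdgeCache()
cache.store("/pricing", b"v1")
cache.soft_purge("/pricing")
print(cache.get("/pricing", origin_down))  # the old copy is still available
```

After a soft purge the old bytes remain as a fallback; after `hard_purge` the same call would raise, because there is nothing left to serve.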

Troubleshooting and rollback

Symptom | Likely cause | Fix
--- | --- | ---
Edge never serves stale; every expiry blocks | CDN not honoring stale-while-revalidate; header stripped or overridden | Confirm header survives to edge (curl -I); on Fastly set beresp.stale_while_revalidate explicitly; on Cloudflare verify Cache Rules aren’t overriding
Stale served but never refreshes | Background fetch failing silently; ctx.waitUntil missing in Worker | Check origin logs for the async revalidation; ensure async store isn’t dropped
Origin overwhelmed at expiry (thundering herd) | Revalidation not coalesced; many POPs/requests refresh at once | Enable request coalescing (Fastly default; Cloudflare Tiered Cache); stagger max-age with small jitter
5xx reaching users despite stale-if-error | Stale copy already evicted, or window too short, or object never cached | Lengthen stale-if-error; verify object was cacheable; add provider serve-stale floor
Stale content served far too long | stale-while-revalidate window too wide on low-traffic path | Shorten window; add a heartbeat request or scheduled warmer to trigger refresh

Rollback protocol. If stale serving causes a correctness problem (users seeing outdated prices, leaked pre-release content):

  1. Immediately set stale-while-revalidate=0, stale-if-error=0 at the origin or via an edge Cache Rule to disable grace windows.
  2. Issue an explicit purge of the affected paths to evict the stale copies from every POP.
  3. Verify with curl -I that cf-cache-status/X-Cache shows fresh MISS then HIT cycles with no stale annotations.
  4. Re-enable conservative windows once the root cause is fixed.
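
Step 1 of the rollback amounts to rewriting the header so both grace windows are zero. A minimal sketch of that rewrite, assuming it runs wherever you already generate or transform the header; `disable_grace` is a hypothetical helper name.

```python
def disable_grace(cache_control: str) -> str:
    """Return the Cache-Control value with both stale windows forced to 0."""
    grace = ("stale-while-revalidate", "stale-if-error")
    kept = []
    for part in cache_control.split(","):
        directive = part.strip()
        name = directive.partition("=")[0].lower()
        if directive and name not in grace:
            kept.append(directive)  # leave freshness directives untouched
    kept += ["stale-while-revalidate=0", "stale-if-error=0"]
    return ", ".join(kept)


print(disable_grace("public, max-age=60, stale-while-revalidate=120, stale-if-error=86400"))
```

Setting the directives to 0 explicitly, rather than just dropping them, makes the intent visible in the response and guards against a provider-level default reintroducing a grace window.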

Edge cases and gotchas

  • Authenticated / per-user responses. Never apply stale-if-error to responses that vary by user without a correct Vary and cache key, or one user’s stale copy can leak to another. Pair this with a tight cache key — see the cache-key guidance under the parent topic.
  • must-revalidate and no-cache override grace. If the origin also sends must-revalidate or no-cache, many caches will refuse to serve stale; the directives conflict. Audit the full header.
  • Age header keeps climbing under stale-if-error. During a long outage Age can exceed max-age by hours. That is expected, but downstream caches and clients may treat very old objects as suspect — set a sane stale-if-error ceiling.
  • Background refresh hides origin errors. A failing async revalidation produces no client-visible error, so an origin can be broken for the full window before anyone notices. Alert on origin error rate independently of edge status codes.
  • POST and non-idempotent methods are never served stale. Only safe, cacheable responses participate.
  • Surrogate-Control vs Cache-Control. Some CDNs read stale directives from Surrogate-Control separately from the client-facing Cache-Control; you may need both to control edge behavior and downstream behavior independently.
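
The header audit suggested above can be partly automated. This is a rough sketch with a hypothetical `audit` function; beyond the must-revalidate and no-cache conflicts named in the list, it also flags proxy-revalidate, no-store, and private, which are additions based on standard HTTP caching semantics rather than anything specific to one CDN.

```python
from typing import List


def audit(cache_control: str) -> List[str]:
    """Return warnings for directives that undermine stale serving at a shared cache."""
    names = {
        part.strip().partition("=")[0].lower()
        for part in cache_control.split(",")
        if part.strip()
    }
    stale = sorted(names & {"stale-while-revalidate", "stale-if-error"})
    warnings: List[str] = []
    if not stale:
        return warnings  # no grace windows configured, nothing to conflict with
    for blocker in ("must-revalidate", "proxy-revalidate", "no-cache", "no-store"):
        if blocker in names:
            warnings.append(f"{blocker} conflicts with {', '.join(stale)}")
    if "private" in names:
        warnings.append("private: shared caches such as a CDN will not store this response")
    return warnings


for warning in audit("max-age=60, must-revalidate, stale-if-error=600"):
    print(warning)
```

Running a check like this in CI against the headers your application emits catches the conflict before it shows up as an edge that never serves stale.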

Frequently Asked Questions

What is the difference between stale-while-revalidate and stale-if-error? stale-while-revalidate serves the expired copy instantly while a background refresh runs, purely to remove revalidation latency on a healthy origin. stale-if-error only serves the stale copy when a revalidation actually fails with a 5xx, timeout, or connection error — it is a resilience fallback, not a latency optimization. They are independent and usually set together.

Why is my CDN never serving stale content even though I set the header? The most common causes are the header being stripped or overridden between origin and edge, a conflicting must-revalidate/no-cache directive, or a CDN, plan, or configuration that does not implement asynchronous revalidation. Confirm with curl -I that the full Cache-Control reaches the edge, check for Cache Rules overriding it, and on Fastly set beresp.stale_while_revalidate explicitly in VCL rather than relying on the origin header alone.

How do I stop a thundering herd when many objects expire at once? Enable request coalescing so concurrent requests for the same object share one origin fetch (default on Fastly, available via Tiered Cache on Cloudflare), and add small random jitter to your max-age values so objects do not all expire on the same second. stale-while-revalidate itself helps because only one request per POP triggers the background refresh while the rest are served stale.
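
The jitter suggestion can be sketched as follows. The helper name `jittered_max_age` and the 10% spread are illustrative assumptions, not recommended values; pick a spread that your freshness requirements tolerate.

```python
import random
from typing import Optional


def jittered_max_age(base: int, spread: float = 0.1, rng: Optional[random.Random] = None) -> int:
    """Return base seconds shifted by a random offset within +/- spread * base."""
    rng = rng or random.Random()
    delta = int(base * spread)
    return base + rng.randint(-delta, delta)


# Each response gets a slightly different lifetime, so a batch of objects
# cached in the same second does not expire in the same second.
header = f"public, max-age={jittered_max_age(300)}, stale-while-revalidate=60"
print(header)
```

With a base of 300 seconds and a 10% spread, lifetimes land between 270 and 330 seconds, which spreads revalidations for a batch of objects across roughly a minute.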

Does serving stale content hurt SEO or correctness? For most content the brief staleness is invisible and the availability gain outweighs it. The risk is correctness-sensitive data — pricing, inventory, access-controlled content — where a stale copy can mislead users or leak. Scope stale directives narrowly, keep stale-if-error windows reasonable, and pair them with an explicit purge on deploy so you can force consistency when it matters.
