Rate Limiting API Requests at the Edge

After working through this guide you will be able to throttle abusive API traffic at the edge gateway before it reaches your origin, using either provider-native rules or a custom Worker backed by a Durable Object token bucket. You will return correct 429 responses with Retry-After and RateLimit-* headers, key limits on API keys instead of shared IPs, and exempt your own internal traffic from the counter.

Rate limiting at the edge protects the origin from credential-stuffing, scraping, and accidental client loops while the request is still hundreds of milliseconds away from your backend. The hard part is not the check itself — it is keeping an accurate distributed counter across hundreds of points of presence without serializing every request through one hot object. This guide builds a per-key counter that stays consistent and fast.

Key objectives:

  • Choose a stable rate-limit key (API key, token sub, or fallback IP) instead of an easily rotated identifier.
  • Implement a sliding-window or token-bucket counter inside a Durable Object so all PoPs agree on the count.
  • Return 429 Too Many Requests with Retry-After and standards-track RateLimit-* headers.
  • Exempt internal and health-check traffic, and avoid counter hotspots that throttle your own platform.

Figure: Token bucket rate limiting flow at the edge. An API request reaches an edge Worker, which derives a key, asks a Durable Object token bucket whether a token is available, and either forwards to origin (tokens > 0: allow and decrement) or returns 429 Too Many Requests with Retry-After (tokens == 0). Internal traffic skips the bucket. One Durable Object instance per key gives a serialized, consistent count.

Native rules versus a Worker token bucket

Before writing code, decide whether you even need a custom counter. Cloudflare’s native Rate Limiting Rules are configured in the dashboard or via Terraform and require zero compute code. They match on path, method, and a characteristic such as IP or a header, then count requests in a fixed period. They are the right tool for coarse abuse protection — for example, “no more than 100 POST /login per IP per minute.” They share the same evaluation surface as your WAF rules that block common attacks, so layering both is normal.

A Worker plus Durable Object is the right tool when limits must be per API key with different tiers, when you need a true token bucket that allows short bursts, or when the same Worker that does JWT validation at the edge must also enforce the caller’s plan quota. The table below summarizes the trade-off.

Capability       | Native Rate Limiting Rules | Worker + Durable Object
Setup effort     | Dashboard / Terraform only | Code + deploy + binding
Key granularity  | IP, header, cookie         | Any value: API key, token sub, tenant
Algorithm        | Fixed window               | Token bucket or sliding window
Per-tier limits  | Limited                    | Arbitrary, read from KV/D1
Burst handling   | Hard cutoff                | Smooth refill
Cost model       | Per request, no compute    | Worker + DO invocations
Custom 429 body  | Templated                  | Fully programmable
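
The "Per-tier limits" row can be sketched as a lookup from plan name to bucket parameters. This is a minimal sketch: the tier names, the numbers, and the limitsForTier helper are all hypothetical, and in a real deployment the mapping might be read from KV or D1 as the table suggests.

```typescript
// Hypothetical plan tiers; names and numbers are illustrative only.
type Tier = "free" | "pro" | "enterprise";

interface BucketParams {
  cap: number;  // burst ceiling, in tokens
  rate: number; // sustained refill, in tokens per second
}

const TIER_LIMITS: Record<Tier, BucketParams> = {
  free: { cap: 60, rate: 1 },
  pro: { cap: 600, rate: 10 },
  enterprise: { cap: 6000, rate: 100 },
};

// Resolve bucket parameters for a caller; unknown or missing plans get the lowest tier.
function limitsForTier(tier: string | null | undefined): BucketParams {
  if (tier && Object.prototype.hasOwnProperty.call(TIER_LIMITS, tier)) {
    return TIER_LIMITS[tier as Tier];
  }
  return TIER_LIMITS.free;
}
```

The resolved cap and rate could then be passed to the bucket in place of a single global default.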

Prerequisites and environment setup

You need Node 18+, Wrangler 3.60 or newer, and a Cloudflare account with Workers Paid (Durable Objects require it). Confirm your toolchain:

node --version          # v18.x or later
npx wrangler --version  # 3.60.0 or later
npx wrangler whoami     # confirms account + auth

Create the project and a wrangler.jsonc that declares the Durable Object binding and migration. The new_sqlite_classes migration creates the class on the SQLite storage backend, which Cloudflare recommends for new Durable Object classes.

{
  "name": "edge-rate-limiter",
  "main": "src/index.ts",
  "compatibility_date": "2024-09-23",
  "durable_objects": {
    "bindings": [
      { "name": "RATE_LIMITER", "class_name": "TokenBucket" }
    ]
  },
  "migrations": [
    { "tag": "v1", "new_sqlite_classes": ["TokenBucket"] }
  ],
  "vars": {
    "DEFAULT_LIMIT": "60",
    "REFILL_PER_SEC": "1"
  }
}

Store the shared secret that internal services present so they can bypass the limiter:

npx wrangler secret put INTERNAL_BYPASS_TOKEN

Step-by-step procedure

Step 1 — Choose and derive a stable key

The key decides whose budget a request spends. Prefer an authenticated identifier: the API key from an Authorization or X-API-Key header, or the sub claim of a verified JWT. Fall back to client IP only for unauthenticated routes, because IPs are shared behind NAT and trivially rotated. Hash the key so raw secrets never become a Durable Object name.

async function deriveKey(request: Request): Promise<string> {
  const apiKey = request.headers.get("x-api-key");
  if (apiKey) return "k:" + (await sha256(apiKey));
  // fall back to the connecting IP for anonymous traffic
  const ip = request.headers.get("cf-connecting-ip") ?? "0.0.0.0";
  return "ip:" + ip;
}

async function sha256(input: string): Promise<string> {
  const data = new TextEncoder().encode(input);
  const digest = await crypto.subtle.digest("SHA-256", data);
  return [...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}

The side effect that matters: every distinct key maps to exactly one Durable Object instance via idFromName, so all requests for that key serialize through one consistent counter regardless of which PoP they land on.
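
The deriveKey function above only handles the API key and IP cases. The priority order described in this step, including the token sub, can be sketched as a pure function. This is an illustration under assumptions: the Caller shape, the "sub:" prefix, and the "unknown" fallback are hypothetical, and verifiedSub is assumed to come from a JWT whose signature has already been checked elsewhere.

```typescript
// Inputs already extracted from the request; every field is optional.
interface Caller {
  apiKeyHash?: string;  // hex SHA-256 of the API key, never the raw key
  verifiedSub?: string; // sub claim from a JWT that was verified upstream (assumption)
  ip?: string;          // connecting IP, used only for anonymous traffic
}

// Pick the most stable identifier available, in priority order.
function rateLimitKey(caller: Caller): string {
  if (caller.apiKeyHash) return "k:" + caller.apiKeyHash;
  if (caller.verifiedSub) return "sub:" + caller.verifiedSub;
  return "ip:" + (caller.ip ?? "unknown");
}
```

Prefixing each key with its source keeps an API key hash from ever colliding with an IP or a subject that happens to share the same string.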

Step 2 — Implement the token bucket Durable Object

The bucket stores a token count and the timestamp of the last refill. On each call it lazily adds tokens for the elapsed time, caps at capacity, and tries to take one. Lazy refill means no alarms or background timers are needed.

export class TokenBucket {
  private state: DurableObjectState;
  private capacity = 60;        // burst ceiling
  private refillPerSec = 1;     // sustained rate

  constructor(state: DurableObjectState) {
    this.state = state;
  }

  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);
    this.capacity = Number(url.searchParams.get("cap") ?? this.capacity);
    this.refillPerSec = Number(url.searchParams.get("rate") ?? this.refillPerSec);

    const now = Date.now() / 1000;
    let tokens = (await this.state.storage.get<number>("tokens")) ?? this.capacity;
    let last = (await this.state.storage.get<number>("last")) ?? now;

    // lazily refill based on elapsed wall-clock time
    tokens = Math.min(this.capacity, tokens + (now - last) * this.refillPerSec);
    last = now;

    let allowed = false;
    if (tokens >= 1) {
      tokens -= 1;
      allowed = true;
    }
    await this.state.storage.put({ tokens, last });

    const remaining = Math.floor(tokens);
    const retryAfter = allowed ? 0 : Math.ceil((1 - tokens) / this.refillPerSec);
    return Response.json({ allowed, remaining, retryAfter, limit: this.capacity });
  }
}

Expected behavior: a fresh key starts full at capacity. Sustained callers settle at refillPerSec requests per second; idle callers slowly refill back to the burst ceiling.
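
To check the arithmetic without deploying, the refill-then-take step can be pulled out into a pure function. The sketch below mirrors the logic of the Durable Object above but takes the clock and state as arguments. The take helper and its types are hypothetical names, and clamping negative elapsed time to zero is an added assumption that the original code does not make.

```typescript
interface BucketState {
  tokens: number; // may be fractional between refills
  last: number;   // time of the last update, in seconds
}

interface TakeResult {
  allowed: boolean;
  remaining: number;
  retryAfter: number;
  state: BucketState;
}

// Pure version of the lazy refill and take: no storage and no clock access.
function take(state: BucketState, now: number, capacity: number, refillPerSec: number): TakeResult {
  const elapsed = Math.max(0, now - state.last); // assumption: ignore a clock that moved backwards
  let tokens = Math.min(capacity, state.tokens + elapsed * refillPerSec);
  const allowed = tokens >= 1;
  if (allowed) tokens -= 1;
  const retryAfter = allowed ? 0 : Math.ceil((1 - tokens) / refillPerSec);
  return { allowed, remaining: Math.floor(tokens), retryAfter, state: { tokens, last: now } };
}

// Worked example: capacity 3, refill 1 token/s, four calls at t=0, then one at t=2.
let s: BucketState = { tokens: 3, last: 0 };
const results: boolean[] = [];
for (let i = 0; i < 4; i++) {
  const r = take(s, 0, 3, 1);
  results.push(r.allowed);
  s = r.state;
}
const later = take(s, 2, 3, 1);
// results is [true, true, true, false]; later.allowed is true with 1 token remaining
```

The fourth call at t=0 is rejected because the bucket is empty, and two seconds later two tokens have refilled, so the next call succeeds and leaves one.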

Step 3 — Wire the Worker, exempt internal traffic, and return 429

The Worker derives the key, short-circuits internal traffic, calls the bucket, and attaches headers. Exemption happens before the bucket call so your own services never consume budget or create hotspots.

interface Env {
  RATE_LIMITER: DurableObjectNamespace;
  INTERNAL_BYPASS_TOKEN: string;
  DEFAULT_LIMIT: string;
  REFILL_PER_SEC: string;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // 1. exempt internal + health-check traffic
    const url = new URL(request.url);
    const bypass = request.headers.get("x-internal-token") === env.INTERNAL_BYPASS_TOKEN;
    if (bypass || url.pathname === "/health") {
      return fetch(request);
    }

    // 2. derive key and route to its Durable Object
    const key = await deriveKey(request);
    const id = env.RATE_LIMITER.idFromName(key);
    const stub = env.RATE_LIMITER.get(id);
    const cap = env.DEFAULT_LIMIT;
    const rate = env.REFILL_PER_SEC;
    const res = await stub.fetch(
      `https://do/take?cap=${cap}&rate=${rate}`
    );
    const { allowed, remaining, retryAfter, limit } =
      (await res.json()) as { allowed: boolean; remaining: number; retryAfter: number; limit: number };

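    // w is the window in seconds: how long a drained bucket takes to refill completely
    const windowSec = Math.ceil(limit / Number(rate));
    const headers = new Headers({
      "RateLimit-Limit": String(limit),
      "RateLimit-Remaining": String(remaining),
      "RateLimit-Policy": `${limit};w=${windowSec}`,
    });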

    // 3. reject when the bucket is empty
    if (!allowed) {
      headers.set("Retry-After", String(retryAfter));
      return new Response(
        JSON.stringify({ error: "rate_limited", retry_after: retryAfter }),
        { status: 429, headers: { ...Object.fromEntries(headers), "content-type": "application/json" } }
      );
    }

    // 4. forward to origin, propagating limit headers to the client
    const originResponse = await fetch(request);
    const merged = new Headers(originResponse.headers);
    headers.forEach((v, k) => merged.set(k, v));
    return new Response(originResponse.body, { status: originResponse.status, headers: merged });
  },
};

Deploy it:

npx wrangler deploy

Returning 429 with Retry-After is the contract well-behaved clients respect. Many SDKs and retry middlewares back off automatically when they see it, which sheds load without manual intervention.
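
For clients that do not back off on their own, honoring the header is a small amount of code. The sketch below shows one way a client might turn a Retry-After value into a delay. The retryDelayMs helper and its one-second fallback are hypothetical. HTTP allows the header to carry either a number of seconds or an HTTP-date, so both forms are handled even though the Worker in this guide only sends seconds.

```typescript
// Convert a Retry-After header value into a delay in milliseconds.
// Accepts delta-seconds ("30") or an HTTP-date; falls back when the value is missing or invalid.
function retryDelayMs(headerValue: string | null, nowMs: number, fallbackMs = 1000): number {
  if (headerValue === null) return fallbackMs;
  const trimmed = headerValue.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return fallbackMs;
  return Math.max(0, dateMs - nowMs); // a date in the past means retry immediately
}
```

A caller would pass response.headers.get("retry-after") and Date.now(), then wait that long before retrying.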

Verification

Drive the endpoint past its limit with a curl loop and watch the status flip to 429. With DEFAULT_LIMIT=60 and REFILL_PER_SEC=1, a tight burst exhausts the bucket after roughly 60 requests.

for i in $(seq 1 70); do
  curl -s -o /dev/null \
    -H "x-api-key: test-key-123" \
    -w "%{http_code} " \
    https://api.example.com/v1/items
done; echo

Expected output — a run of 200 codes, then 429 once the bucket empties:

200 200 200 ... 200 429 429 429 429

Inspect the headers on a single request to confirm the budget is reported:

curl -sI -H "x-api-key: test-key-123" https://api.example.com/v1/items \
  | grep -iE 'ratelimit|retry-after'
ratelimit-limit: 60
ratelimit-remaining: 41
ratelimit-policy: 60;w=60

Confirm internal traffic bypasses the counter — these should never return 429 no matter how fast you loop:

curl -sI -H "x-internal-token: $INTERNAL_BYPASS_TOKEN" \
  https://api.example.com/v1/items | head -1

Tail live logs while testing to watch which keys hit the limit:

npx wrangler tail --format pretty

Troubleshooting

Counter hotspots and a single overloaded object

If one Durable Object serves a disproportionate share of traffic, for example when every anonymous request collapses onto ip:0.0.0.0 because cf-connecting-ip is missing, that object serializes the entire flood and adds latency. Diagnose by logging the derived key distribution in wrangler tail. Fix it by sharding high-volume anonymous keys: append a small shard suffix (key + ":" + shard, where shard is a per-request value such as a random integer or a request-attribute hash taken modulo N) and divide the limit by N, trading a little accuracy for throughput. Hashing the key itself would not help, because every request for that key would land on the same shard. Authenticated keys rarely need sharding because they spread naturally.
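
The sharding fix can be sketched as two small helpers. This is an illustration, not a definitive implementation: shardedKey and perShardLimit are hypothetical names, and the shard is picked with a random number per request, which is an assumption about how requests should be spread.

```typescript
// Spread one hot key over `shards` Durable Object names by picking a shard per request.
// `rand` is injectable so the choice can be made deterministic in tests.
function shardedKey(key: string, shards: number, rand: () => number = Math.random): string {
  const n = Math.max(1, Math.floor(shards));
  const index = Math.min(n - 1, Math.floor(rand() * n));
  return `${key}:${index}`;
}

// Each shard enforces an equal slice of the total budget, never less than one token.
function perShardLimit(totalLimit: number, shards: number): number {
  const n = Math.max(1, Math.floor(shards));
  return Math.max(1, Math.floor(totalLimit / n));
}
```

With a limit of 60 and 4 shards, each object enforces 15. A caller whose requests happen to land unevenly can be throttled slightly early, which is the accuracy cost mentioned above.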

Window edges and clock skew

A token bucket using Date.now() inside the object is safe because all requests for a key hit the same object and therefore the same clock, so there is no cross-machine skew between requests. The classic failure belongs to fixed-window counters: a caller can send a full quota at 00:59 and again at 01:00, doubling the effective rate at the boundary. The token bucket here avoids that because refill is continuous, not stepped. If you must use fixed windows, add a second overlapping window or switch to a sliding-window log to smooth the edge.
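
The boundary effect is easy to reproduce in a small simulation. The sketch below counts how many requests each algorithm admits when a caller sends 60 requests at t=59s and 60 more at t=60s against a limit of 60 per minute. Both helper functions are hypothetical and exist only for this comparison.

```typescript
// Fixed window: the counter resets whenever floor(t / windowSec) changes.
function fixedWindowAllowed(times: number[], limit: number, windowSec: number): number {
  const counts = new Map<number, number>();
  let allowed = 0;
  for (const t of times) {
    const w = Math.floor(t / windowSec);
    const used = counts.get(w) ?? 0;
    if (used < limit) {
      counts.set(w, used + 1);
      allowed++;
    }
  }
  return allowed;
}

// Token bucket: starts full and refills continuously at refillPerSec.
function tokenBucketAllowed(times: number[], capacity: number, refillPerSec: number): number {
  let tokens = capacity;
  let last = times[0] ?? 0;
  let allowed = 0;
  for (const t of times) {
    tokens = Math.min(capacity, tokens + (t - last) * refillPerSec);
    last = t;
    if (tokens >= 1) {
      tokens -= 1;
      allowed++;
    }
  }
  return allowed;
}

const burst: number[] = [...Array(60).fill(59), ...Array(60).fill(60)];
const fixedCount = fixedWindowAllowed(burst, 60, 60);  // 120: both bursts pass
const bucketCount = tokenBucketAllowed(burst, 60, 1);  // 61: one second of refill adds one token
```

The fixed window admits the full quota on each side of the boundary, while the token bucket admits the initial burst plus only what refilled in the intervening second.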

Distributed accuracy versus availability

Routing every request through one object guarantees an accurate count but couples availability to that object. If the Durable Object is briefly unreachable, decide your failure mode explicitly: fail open (allow the request, log it) protects availability at the cost of letting abuse through, while fail closed returns 429 and protects the origin. For most public APIs, fail open on the limiter but keep your coarse native rate limiting and WAF layer as a backstop so a limiter outage never removes all protection.
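
One way to make the failure mode explicit is to wrap the limiter call so that errors and slow responses resolve to a stated policy instead of an unhandled exception. The sketch below assumes the limiter check is exposed as a function returning a promise. checkWithFailureMode, the Decision shape, and the timeout are hypothetical and not part of any Cloudflare API.

```typescript
type Decision = { allowed: boolean; degraded: boolean };

// Run a limiter check; on error or timeout, apply the chosen failure mode.
async function checkWithFailureMode(
  check: () => Promise<boolean>,
  mode: "open" | "closed",
  timeoutMs: number,
): Promise<Decision> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("limiter timeout")), timeoutMs);
  });
  try {
    const allowed = await Promise.race([check(), timeout]);
    return { allowed, degraded: false };
  } catch {
    // Limiter unreachable or too slow: "open" allows the request, "closed" rejects it.
    return { allowed: mode === "open", degraded: true };
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

In the Worker, check would wrap the stub.fetch call and return its allowed field, and the degraded flag gives you something to log so a limiter outage is visible rather than silent.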

Limits not enforced after deploy

If requests never get throttled, the most common cause is a missing or misnamed binding. Confirm RATE_LIMITER and class_name match in wrangler.jsonc, and that the migrations block ran — wrangler deploy prints the applied migration tag. A second cause is exemption logic catching real traffic: verify the INTERNAL_BYPASS_TOKEN secret is not leaking to clients and that /health is not a prefix match swallowing /health-data.
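
The exemption checks described above can be isolated into a helper that is easy to unit test. This is a sketch under assumptions: isExempt and EXEMPT_PATHS are hypothetical names. Unlike the inline comparison in the Worker, it also refuses to exempt anyone when the secret is unset or empty, an extra guard that the original code does not include.

```typescript
// Paths that skip the limiter; matched exactly so "/health-data" is still counted.
const EXEMPT_PATHS = new Set<string>(["/health"]);

function isExempt(pathname: string, presented: string | null, secret: string | undefined): boolean {
  if (EXEMPT_PATHS.has(pathname)) return true;
  // Require a configured, non-empty secret so a missing binding never exempts traffic.
  if (!secret || presented === null) return false;
  return presented === secret;
}
```

In the Worker this would be called with url.pathname, the x-internal-token header value, and env.INTERNAL_BYPASS_TOKEN before the bucket lookup.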

Frequently Asked Questions

Should I rate limit on IP address or API key? Prefer the API key or authenticated token sub for any route behind auth, because IPs are shared behind carrier NAT and corporate proxies, so a single bad actor can punish thousands of legitimate users on the same address. Use IP only as a fallback for genuinely anonymous endpoints.

When are native Rate Limiting Rules enough? When your limits are coarse and keyed on IP, path, or a simple header, native rules need no code and evaluate alongside your WAF. Reach for a Worker plus Durable Object only when you need per-key tiers, smooth bursts, or to combine limiting with auth in the same Worker.

Why use a Durable Object instead of KV for the counter? KV is eventually consistent and caches reads, so concurrent increments across PoPs lose updates and undercount. A Durable Object gives a single-threaded, strongly consistent instance per key, which is exactly what an accurate counter requires.

What headers should a 429 response include? Always include Retry-After with the seconds until the next token, plus RateLimit-Limit, RateLimit-Remaining, and RateLimit-Policy so well-behaved clients can self-throttle without guessing. Many SDKs back off automatically when they see Retry-After.

How do I let internal services skip the limit? Check a shared secret header before the bucket call and short-circuit to the origin when it matches, so internal and health-check traffic never consumes budget or creates a hotspot. Rotate that secret like any other credential and keep it out of client-facing config.
