Configuring Edge Health Checks and Automatic Failover

This guide shows you how to wire active health monitoring to an edge load balancer so that traffic shifts off a dead origin automatically, then shifts back once it recovers — with sensitivity tuned so a single slow probe never triggers a needless failover. You will build the same control plane twice: once on a Cloudflare Load Balancer (monitor + pools + LB), and once on AWS Route 53 (health check + primary/secondary record set), then prove the failover works by faulting the primary and watching traffic move.

After working through it you will be able to:

Define an HTTP/HTTPS health monitor with the right path, interval, timeout, and expected status codes.
Build primary and fallback origin pools and attach them to a load balancer with deterministic failover order.
Tune consecutive_up/consecutive_down thresholds so origins do not flap between healthy and unhealthy.
Verify a failover live with curl and dig, and diagnose the three failure modes that cause false positives.

Prerequisites and environment setup

You need a load balancer that can hold at least two origin pools. On Cloudflare that means a zone with Load Balancing enabled (a paid add-on) and two origins reachable on public IPs or hostnames. On AWS you need a hosted zone in Route 53 and two origins (ALBs, EC2 instances, or any reachable endpoint). This builds directly on a multi-origin layout — if you have not split traffic across regions yet, start with weighted load balancing across multi-region origins and add health checks on top.

Tool versions used below:

terraform -version   # >= 1.6
# Cloudflare provider 4.x; AWS provider 5.x
dig -v               # 9.18+
curl --version       # 8.x

Export credentials before running Terraform:

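export CLOUDFLARE_API_TOKEN="<token-with-LB-edit>"
export AWS_PROFILE="ops"
export TF_VAR_account_id="<cloudflare-account-id>"
export TF_VAR_zone_id="<cloudflare-zone-id>"
export TF_VAR_hosted_zone_id="<route53-hosted-zone-id>"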

Both origins must expose a cheap, dependency-free health endpoint. Do not point the monitor at / — a homepage that queries a database will report unhealthy whenever the database hiccups, cascading a soft failure into a full failover. Use a dedicated /healthz that returns 200 only when the instance can actually serve traffic, and keep its body small (the monitor can match on body text but you pay for it in probe latency).
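As a minimal sketch of such an endpoint, the handler below answers /healthz from local state only. The is_ready hook, the 503 "not ready" response, and port 8080 are illustrative assumptions, not part of any particular framework; adapt them to however your application tracks readiness.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_response(path: str, ready: bool) -> tuple[int, bytes]:
    """Map a request path and local readiness to (status, body)."""
    if path != "/healthz":
        return 404, b"not found"
    if ready:
        return 200, b"ok"
    return 503, b"not ready"


def is_ready() -> bool:
    """Hypothetical readiness hook: keep it in-process and dependency-free."""
    return True


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = health_response(self.path, is_ready())
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # silence per-probe access logging on stderr


def serve(port: int = 8080) -> None:
    """Blocks forever; call this from your own entry point."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

The body is two bytes and the handler touches no database or cache, so probe latency stays independent of application load.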

Step 1: Define the HTTP health monitor

The monitor is the probe definition: which path to hit, how often, how long to wait, and which response counts as healthy. Start conservative on interval and timeout, then tighten once you trust the signal.

resource "cloudflare_load_balancer_monitor" "https" {
  account_id     = var.account_id
  type           = "https"
  method         = "GET"
  path           = "/healthz"
  expected_codes = "200"
  port           = 443

  interval        = 15   # seconds between probes
  timeout         = 5    # fail the probe after 5s
  retries         = 2    # extra attempts before counting a failure
  consecutive_up   = 2   # probes to mark a pool healthy
  consecutive_down = 3   # probes to mark a pool unhealthy

  header {
    header = "Host"
    values = ["app.example.com"]
  }
  description = "app https healthz"
}

Expected side effect: nothing routes yet. The monitor is inert until attached to a pool. The Host header matters — Cloudflare probes the pool’s origin IP directly, so without an explicit Host the origin may serve a default vhost and return a 404 that you misread as an outage.
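To make the interaction of timeout, retries, and expected_codes concrete, here is a simplified model of how a single probe might be scored. It assumes that only timeouts are retried and that a wrong status code fails the probe immediately; treat that as an assumption of this sketch and confirm the exact retry semantics against your provider's documentation.

```python
def probe_passes(attempts, timeout=5.0, expected_status=200, retries=2):
    """Score one probe from its attempts.

    attempts: list of (status_code_or_None, elapsed_seconds) in the order tried.
    Only the first 1 + retries attempts are considered.
    """
    for status, elapsed in attempts[: 1 + retries]:
        timed_out = status is None or elapsed > timeout
        if timed_out:
            continue  # assumed behavior: only timeouts are retried
        return status == expected_status
    return False  # every allowed attempt timed out
```

With the defaults above, a probe that times out once and then returns 200 in 0.2s still passes, while a fast 404 fails on the first attempt.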

Step 2: Create primary and fallback pools

A pool groups one or more origins and inherits a monitor. Health is evaluated per origin; a pool is healthy while at least minimum_origins of its origins pass. Define the primary in your main region and the fallback elsewhere.

resource "cloudflare_load_balancer_pool" "primary" {
  account_id      = var.account_id
  name            = "primary-iad"
  monitor         = cloudflare_load_balancer_monitor.https.id
  minimum_origins = 1
  notification_email = "<alert-email-address>"

  origins {
    name    = "iad-1"
    address = "198.51.100.10"
    enabled = true
  }
}

resource "cloudflare_load_balancer_pool" "fallback" {
  account_id      = var.account_id
  name            = "fallback-sfo"
  monitor         = cloudflare_load_balancer_monitor.https.id
  minimum_origins = 1

  origins {
    name    = "sfo-1"
    address = "203.0.113.20"
    enabled = true
  }
}

Expected output after apply: both pools show healthy = true in the dashboard within a couple of probe intervals. If a pool reports unhealthy immediately, jump to the allowlisting fix in troubleshooting — the probe is almost certainly being blocked at the origin firewall.
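The minimum_origins rule from this step can be sketched as a one-line check. The function name and the dict-of-booleans input are hypothetical; this only illustrates the rule as described here, not the provider's implementation.

```python
def pool_is_healthy(origin_health: dict[str, bool], minimum_origins: int = 1) -> bool:
    """A pool is healthy while at least minimum_origins of its origins pass."""
    passing = sum(1 for ok in origin_health.values() if ok)
    return passing >= minimum_origins
```

With a single origin per pool, as in the configuration above, minimum_origins = 1 means the pool's health simply mirrors that one origin.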

Step 3: Attach pools to the load balancer with failover order

The load balancer ties a hostname to an ordered list of pools. With default_pool_ids listed primary-first and a fallback_pool_id set, Cloudflare serves the first healthy pool in order and only uses the fallback when every default pool is down.

resource "cloudflare_load_balancer" "app" {
  zone_id          = var.zone_id
  name             = "app.example.com"
  default_pool_ids = [
    cloudflare_load_balancer_pool.primary.id,
    cloudflare_load_balancer_pool.fallback.id,
  ]
  fallback_pool_id = cloudflare_load_balancer_pool.fallback.id
  proxied          = true
  steering_policy  = "off"   # strict priority order, not geo or random
  session_affinity = "none"
}

Setting steering_policy = "off" gives you deterministic priority failover: pool order is the failover order. If you later want latency- or geography-aware steering instead of strict priority, layer that on with geo-routing edge functions — but keep priority steering while you are validating failover, because it is the easiest to reason about.
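Strict priority failover is easy to reason about because it reduces to "first healthy pool wins, otherwise the fallback". A minimal sketch of that selection rule, with hypothetical pool names standing in for the Terraform resource IDs:

```python
def select_pool(default_pool_ids, fallback_pool_id, pool_health):
    """Return the first healthy pool in priority order, else the fallback.

    pool_health maps pool id -> bool; unknown pools are treated as unhealthy.
    """
    for pool_id in default_pool_ids:
        if pool_health.get(pool_id, False):
            return pool_id
    return fallback_pool_id  # pool of last resort, used regardless of its health
```

For example, select_pool(["primary-iad", "fallback-sfo"], "fallback-sfo", {"primary-iad": False, "fallback-sfo": True}) returns "fallback-sfo", and the primary takes over again as soon as its health flips back to True.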

Apply it:

terraform apply
# cloudflare_load_balancer.app: Creation complete after 3s

Step 4: The Route 53 equivalent

Route 53 models the same idea with a health check plus paired failover records. The health check is the monitor; the PRIMARY/SECONDARY failover routing policy is the pool order.

resource "aws_route53_health_check" "primary" {
  fqdn              = "iad.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/healthz"
  request_interval  = 10    # 10 or 30 only
  failure_threshold = 3     # consecutive failures before unhealthy
}

resource "aws_route53_record" "primary" {
  zone_id = var.hosted_zone_id
  name    = "app.example.com"
  type    = "A"
  set_identifier  = "primary-iad"
  health_check_id = aws_route53_health_check.primary.id
  failover_routing_policy { type = "PRIMARY" }
  alias {
    name                   = aws_lb.iad.dns_name
    zone_id                = aws_lb.iad.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "secondary" {
  zone_id = var.hosted_zone_id
  name    = "app.example.com"
  type    = "A"
  set_identifier = "secondary-sfo"
  failover_routing_policy { type = "SECONDARY" }
  alias {
    name                   = aws_lb.sfo.dns_name
    zone_id                = aws_lb.sfo.zone_id
    evaluate_target_health = true
  }
}

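The key difference: Route 53 failover is DNS-based, so a resolver that cached the primary answer keeps using it until the TTL expires. Cloudflare’s proxied load balancer makes the decision at the edge on every request, so failover takes effect as soon as the pool is marked unhealthy. With Route 53 you trade speed for simplicity, so keep the cached lifetime short: alias records like the ones above take their TTL from the target (60 seconds for an ELB alias) and cannot be set by hand, while non-alias failover records should use a low TTL (30–60s) so clients re-resolve quickly after a failover.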
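A rough way to compare the two designs is to add the detection window to the DNS cache lifetime. The helper below is a back-of-the-envelope estimate under simplifying assumptions: it ignores probe timeouts, retries, and health-status propagation, and it assumes a 60-second TTL for the Route 53 case.

```python
def approx_failover_seconds(interval: int, failures_required: int, dns_ttl: int = 0) -> int:
    """Detection window (failures_required probes) plus any DNS cache lifetime."""
    return interval * failures_required + dns_ttl


# Cloudflare proxied LB from Step 1: 15s interval, consecutive_down = 3, no DNS wait.
cloudflare = approx_failover_seconds(interval=15, failures_required=3)

# Route 53 from Step 4: 10s interval, failure_threshold = 3, assumed 60s TTL.
route53 = approx_failover_seconds(interval=10, failures_required=3, dns_ttl=60)

print(cloudflare, route53)  # 45 90
```

Even with a faster probe interval, the Route 53 path ends up slower here because the cached answer has to expire before clients see the secondary.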

Step 5: Tune sensitivity to avoid flapping

Flapping is the failure mode where an origin oscillates healthy/unhealthy, dragging traffic back and forth and spamming alerts. The cure is asymmetric thresholds: be slow to declare recovery, quick enough to declare death.

Setting	Aggressive	Balanced	Conservative
interval	5s	15s	60s
timeout	2s	5s	10s
consecutive_down	2	3	5
consecutive_up	3	4	5
Detection time	~10s	~45s	~5min

Make consecutive_up larger than consecutive_down. With the balanced settings a dead origin leaves rotation in roughly 45 seconds (3 failed probes at a 15s interval), but a flapping one must pass several clean probes in a row before it is trusted again. The same low-then-high asymmetry that governs DNS cutovers applies here; if you are coordinating this with a record change, review TTL strategy so the resolver-side cache does not outlast your failover window.
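The effect of asymmetric thresholds is easiest to see in a small state machine. This is an illustrative model of consecutive_up/consecutive_down hysteresis, not the load balancer's actual implementation; the HealthTracker class is hypothetical.

```python
class HealthTracker:
    """Flip state only after enough consecutive probes contradict it."""

    def __init__(self, consecutive_up=4, consecutive_down=3, healthy=True):
        self.consecutive_up = consecutive_up
        self.consecutive_down = consecutive_down
        self.healthy = healthy
        self._streak = 0  # run length of probes that disagree with current state

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if probe_ok == self.healthy:
            self._streak = 0  # an agreeing probe resets the counter
            return self.healthy
        self._streak += 1
        needed = self.consecutive_up if probe_ok else self.consecutive_down
        if self._streak >= needed:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy
```

A single failed probe between good ones never changes the state, and after going unhealthy the origin needs four clean probes in a row to return, so one lucky probe during an outage cannot pull traffic back.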

Verification

First confirm both pools are healthy and traffic lands on the primary. Add a header at each origin so you can see which one answers:

curl -sI https://app.example.com/ | grep -i x-served-by
# x-served-by: iad-1

Now fault the primary. The cleanest fault is to make /healthz fail without taking the whole box down, so the monitor reacts but you keep a shell:

# On the primary origin:
sudo mv /var/www/healthz /var/www/healthz.disabled   # force /healthz -> 404 (assumes a static file)

Watch the pool flip and traffic move. Within consecutive_down * interval seconds the served-by header changes:

while true; do
  curl -sI https://app.example.com/ | grep -i x-served-by
  sleep 5
done
# x-served-by: iad-1
# x-served-by: iad-1
# x-served-by: sfo-1   <-- failover happened

On Route 53, confirm the resolved answer changed (allow for record TTL):

dig +short app.example.com @1.1.1.1
# 203.0.113.20   (was 198.51.100.10)

Restore /healthz, wait consecutive_up * interval, and verify traffic returns to iad-1. If it returns immediately on the first good probe, your consecutive_up is too low and you are at risk of flapping.
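If you would rather measure the failover than eyeball it, the sketch below polls the x-served-by header and reports how long the switch took. It assumes each origin sets that header as described above and that the first sample is taken at the moment you break /healthz; both function names are hypothetical.

```python
import urllib.request
from typing import Optional


def served_by(url: str, timeout: float = 5.0) -> Optional[str]:
    """HEAD the URL and return its x-served-by header, if present.

    urlopen raises on 4xx/5xx responses, which can occur during the detection window.
    """
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.headers.get("x-served-by")


def failover_delay(samples, primary="iad-1"):
    """samples: list of (timestamp_seconds, origin_name) in time order.

    Returns seconds from the first sample to the first answer that did not
    come from the primary, or None if no switch was observed.
    """
    if not samples:
        return None
    start = samples[0][0]
    for ts, origin in samples:
        if origin != primary:
            return ts - start
    return None
```

Collect samples by calling served_by in a loop with time.monotonic() timestamps, then compare the result of failover_delay with consecutive_down * interval. A much larger value suggests caching or a steering setting is delaying the switch.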

Troubleshooting

The pool flaps between healthy and unhealthy

Diagnosis: check the pool health-event log; rapid alternating events confirm flapping. The cause is almost always a timeout shorter than the origin’s tail latency, so slow-but-alive probes intermittently fail. Run a manual timing probe:

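# --resolve pins the hostname to the origin IP so SNI, Host, and certificate checks still match
curl -o /dev/null -s -w 'connect:%{time_connect} ttfb:%{time_starttransfer}\n' \
  --resolve app.example.com:443:198.51.100.10 https://app.example.com/healthz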
# connect:0.04 ttfb:6.20   <-- exceeds a 5s timeout

Fix: raise timeout above the observed p99 TTFB, increase retries, and raise consecutive_up. Make /healthz genuinely cheap so its latency is decoupled from application load.
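To pick a timeout from measurements rather than a single probe, collect a batch of TTFB samples and take the 99th percentile. The sketch below uses the nearest-rank method; the 1.5x headroom factor is an arbitrary assumption for illustration, not a recommended constant.

```python
import math


def percentile(samples, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggested_timeout(ttfb_seconds, headroom: float = 1.5) -> int:
    """Whole-second timeout sitting above the observed p99 TTFB."""
    return math.ceil(percentile(ttfb_seconds, 99) * headroom)
```

If the suggested value lands well above a few seconds, that points at an expensive /healthz handler, and fixing the endpoint is a better remedy than stretching the timeout.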

A healthy origin is marked critical (allowlisting)

Diagnosis: the origin serves probes fine from your laptop but the pool reports unhealthy. The origin firewall or WAF is dropping the probe source. Confirm from the origin’s access log that probe requests are absent or blocked.

Fix: allowlist the monitor’s source ranges. Cloudflare publishes its health-check egress IPs; Route 53 publishes per-region health-checker ranges. Permit those CIDRs to reach the health path. If your WAF runs at this same edge, scope the bypass to /healthz only — see WAF and rate limiting at the edge for writing a narrow allow rule rather than disabling protection.

False-positive timeouts during deploys

Diagnosis: every deploy triggers a brief failover even though the origin recovers in seconds. The monitor catches the restart window. Check whether failover events line up with deploy timestamps.

Fix: either drain the origin gracefully (set enabled = false on that origin before the restart, then re-enable) or raise consecutive_down so a short blip does not cross the threshold. Graceful drain is preferable because it removes the origin cleanly instead of letting users hit 502s during the detection window.

Both pools report unhealthy at once

Diagnosis: if the fallback also fails, look for a shared dependency — a common database, a single expected status code that both origins stopped returning, or a Host header mismatch affecting both. Probe each origin directly:

curl -so /dev/null -w '%{http_code}\n' --resolve app.example.com:443:203.0.113.20 https://app.example.com/healthz
# 000   <-- no HTTP response at all: the origin is down, not a monitor problem

Fix: correct the shared root cause; widen expected_codes only if the origin legitimately returns something other than 200 (for example "200,204"). Never widen it to 2xx blindly — that masks real 204-on-empty-body bugs.
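The difference between an explicit list and a class pattern is easy to check with a small matcher. This is an illustrative parser for strings such as "200,204" or "2xx", written for this guide; it is not the provider's own matching code.

```python
def code_matches(expected_codes: str, status: int) -> bool:
    """True if status matches any comma-separated entry (exact code or Nxx class)."""
    for pattern in expected_codes.split(","):
        pattern = pattern.strip().lower()
        if len(pattern) == 3 and pattern.endswith("xx") and pattern[0].isdigit():
            if status // 100 == int(pattern[0]):
                return True
        elif pattern.isdigit() and int(pattern) == status:
            return True
    return False
```

With "200" a stray 204 counts as a failure and surfaces the bug, while "2xx" would quietly accept it.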

Frequently Asked Questions

How fast does failover actually happen? On a proxied Cloudflare load balancer it is near-instant once the pool is marked critical, because routing is decided per request at the edge. Detection itself takes consecutive_down * interval — about 45 seconds with the balanced settings above. Route 53 adds the record TTL on top, since clients keep the cached primary answer until it expires.

Should the health path require authentication? No. Keep /healthz unauthenticated but unguessable or internal-only, and make it return 200 purely on local readiness. Auth on the probe path just adds a dependency that can fail independently and cause false failovers.

Why not point the monitor at the homepage? A homepage exercises databases, caches, and downstream services. When any of those degrade, the monitor reports the origin as dead and fails over even though the box is fine. A dedicated lightweight endpoint keeps the health signal about the instance, not the entire dependency graph.

Can I run health checks without a paid load balancer? Route 53 health checks are billed per check but do not require a separate load balancer product; paired failover records are enough for active-passive. Cloudflare’s health monitors require the Load Balancing add-on. For a simple active-active setup you can also weight origins manually; see the related guide on multi-region weighting.

How do I avoid alert fatigue from transient failovers? Tune consecutive_down upward and add retries so single slow probes are absorbed, route pool-health notifications to a low-noise channel, and gate alerts on sustained state changes rather than every probe event. Asymmetric up/down thresholds do most of the work.
