Best TTL Values for High-Traffic SaaS Platforms

Optimizing Time-To-Live for high-traffic SaaS architectures requires balancing DNS query load, cache hit ratios, and incident-response agility. This guide gives you production-grade TTL baselines per record type, a method for synchronizing DNS TTL with CDN cache-control headers, and the exact diagnostic workflows needed to roll out a low-TTL failover window without triggering a cache stampede on your authoritative servers. The objective: pick TTL values you can defend in a postmortem, and verify them with dig and curl rather than guesswork. For foundational context on cache expiration mechanics, see DNS Fundamentals & Advanced Record Configuration.

Key operational takeaways:

  • Baseline TTL recommendations for A/AAAA, CNAME, NS, and SOA records in multi-region SaaS
  • Synchronizing DNS TTL with CDN Cache-Control headers to prevent origin overload
  • A staged pre-incident TTL-reduction procedure that respects how resolvers actually cache
  • Diagnostic commands for verifying resolver caching behavior and propagation status

[Figure: TTL selection decision flow for SaaS record types. A decision diagram mapping DNS record types and change frequency to recommended TTL bands, from 60 seconds for active failover up to one week for delegation records. Bands shown: origin A/AAAA (may fail over), 300s baseline; CDN CNAME (vendor-managed IPs), 3600–86400s; NS/SOA (stable delegation), 86400–604800s; active incident (staged 48h prior), 60s floor. Resolver floor: many resolvers clamp TTLs under 30s, so do not set below 60s and verify with dig.]

Prerequisites & Environment Setup

You need shell access with dig (from bind-utils or dnsutils), curl, and the CLI for your DNS provider. The examples below use the AWS CLI for Route 53 and curl against the Cloudflare API. Export credentials before you start:

# Confirm tooling is present
dig -v && curl --version | head -1 && aws --version

# Cloudflare API token with DNS:Edit on the target zone
export CF_TOKEN="cf_your_token_here"
export ZONE_ID="your_zone_id"

# Route 53 hosted zone id for the SaaS apex
export HZ_ID="Z1EXAMPLE"

Pick a low-traffic test record (for example ttl-test.your-saas.com) so you can rehearse the failover procedure without touching production endpoints. Confirm your authoritative nameservers respond before continuing.

Production TTL Baselines for SaaS Infrastructure

The single most useful rule: TTL should track how often the record legitimately changes, plus how fast you need a change to take effect. The table below is the baseline most high-traffic SaaS platforms converge on.

Record                   | Recommended TTL | Rationale
A / AAAA (origin)        | 300s            | Rapid failover without flooding authoritative servers
CDN CNAME                | 3600s–86400s    | Vendor rotates IPs behind the name; long TTL cuts query volume
MX                       | 3600s–86400s    | Mail routing rarely changes; stability matters more than agility
NS / SOA                 | 86400s–604800s  | Delegation must be stable; churn destabilizes the chain
Active failover (staged) | 60s             | Temporary, only during an incident window

A/AAAA records (origin IPs) should sit at 300s (5 minutes). This allows rapid failover without overwhelming authoritative servers during traffic spikes. If you front origins with an edge load balancer, see Configuring Edge Health Checks and Automatic Failover for how health checks and DNS TTL interact.

CDN CNAME records sit at 3600s to 86400s. Extended TTLs align with edge cache lifecycles and drastically reduce recursive query volume across global resolvers. The CDN rotates IPs behind the CNAME target, but the target name itself rarely changes, so a long TTL on the CNAME is safe.

NS and SOA records stay at 86400s to 604800s. These must remain stable. Frequent changes increase unnecessary zone-transfer volume and destabilize delegation across parent registries. For dynamic TTL adjustment strategies across the whole zone, see Mastering TTL Strategies.
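The baselines above can be captured in a small lookup so that automation and reviews use the same numbers. This is a minimal sketch, not a standard tool: the function name `baseline_ttl` and the role labels are hypothetical, and for banded records it returns the lower bound of the band as a conservative starting point.

```shell
# Hypothetical helper: map a record role to its baseline TTL in seconds.
# Role names are illustrative labels, not DNS record types.
baseline_ttl() {
  case "$1" in
    origin-a|origin-aaaa) echo 300 ;;    # rapid failover
    cdn-cname)            echo 3600 ;;   # lower bound of the 3600s-86400s band
    mx)                   echo 3600 ;;   # lower bound of the 3600s-86400s band
    ns|soa)               echo 86400 ;;  # lower bound of the 86400s-604800s band
    failover)             echo 60 ;;     # temporary, incident window only
    *) echo "unknown role: $1" >&2; return 1 ;;
  esac
}

baseline_ttl origin-a   # prints 300
baseline_ttl failover   # prints 60
```

Returning a non-zero status for unknown roles lets a calling script stop instead of silently applying a default.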

# Check current TTL from a specific resolver
dig @8.8.8.8 api.your-saas.com +noall +answer
# Expected: api.your-saas.com.  300  IN  A  203.0.113.10
# The second field (300) is the TTL. From a recursive resolver it is the time left in cache and counts down between queries.

dig @1.1.1.1 api.your-saas.com +noall +stats | grep 'Query time'
# Expected: ;; Query time: 12 msec
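To use the served TTL in a script rather than reading it by eye, the answer section can be parsed. The sketch below assumes the standard `NAME TTL CLASS TYPE RDATA` layout of `dig +noall +answer`; the function name `min_answer_ttl` and the sample values are illustrative. It reports the smallest TTL in the answer, since that is the record that expires first in a CNAME chain or multi-record answer.

```shell
# Read `dig +noall +answer` output on stdin and print the smallest TTL seen.
# Exits non-zero when no answer line is present.
min_answer_ttl() {
  awk 'NF >= 5 && $2 ~ /^[0-9]+$/ { if (min == "" || $2 + 0 < min) min = $2 + 0 }
       END { if (min == "") exit 1; print min }'
}

# Sample answer text (illustrative values) so the sketch runs without network access.
printf '%s\n' \
  'api.your-saas.com. 217 IN A 203.0.113.10' \
  'api.your-saas.com. 217 IN A 203.0.113.11' | min_answer_ttl   # prints 217

# Live usage: dig @8.8.8.8 api.your-saas.com +noall +answer | min_answer_ttl
```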

Step-by-Step: Synchronizing DNS TTL with CDN Cache-Control

DNS TTL and CDN Cache-Control directives are independent layers. DNS TTL controls how long resolvers cache the IP address; Cache-Control controls how long edge nodes cache the HTTP response. Mismatched values do not directly cause routing failures, but a high DNS TTL on a record whose IPs rotate leaves resolvers holding stale addresses until the cache expires. A full treatment of the response-caching side lives in Cache-Control & CDN TTL; the steps here cover only the DNS-facing alignment.

Step 1 — Read the current DNS TTL and the edge response headers together.

# DNS side
dig @1.1.1.1 cdn.your-saas.com +noall +answer
# Expected: cdn.your-saas.com.  3600  IN  CNAME  d1abc.cloudfront.net.

# HTTP side
curl -I -s https://api.your-saas.com/health | grep -iE 'cache-control|age'
# Expected:
# cache-control: public, s-maxage=60, max-age=0
# age: 17

Step 2 — Use s-maxage for shared edge caching, not DNS TTL. Control how long edge nodes hold a response with Cache-Control: s-maxage; it has no effect on DNS resolution. Keep the DNS CNAME TTL long, because the CDN target name stays stable even while the cached response expires every minute.
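To compare the two layers in a script, the edge lifetime can be pulled out of the header value. This is a sketch under stated assumptions: `s_maxage` is a hypothetical helper name, and it expects only the header value (the text after `cache-control:`), not the full header line.

```shell
# Print the s-maxage value (seconds) from a Cache-Control header value.
# Exits non-zero when the directive is absent.
s_maxage() {
  printf '%s\n' "$1" | awk -F',' '{
    for (i = 1; i <= NF; i++) {
      d = tolower($i); gsub(/[ \t]/, "", d)       # normalise case and whitespace
      if (d ~ /^s-maxage=[0-9]+$/) { sub(/^s-maxage=/, "", d); print d; found = 1; exit }
    }
  } END { exit (found ? 0 : 1) }'
}

s_maxage 'public, s-maxage=60, max-age=0'   # prints 60
```

With the values from Step 1, the edge holds a response for 60 seconds while resolvers hold the CNAME for 3600 seconds; neither number needs to change for the other to work.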

Step 3 — Match origin A-record TTL to the CDN’s documented IP rotation window. Rotation cadence varies by provider, so confirm it in the provider’s documentation; this guide assumes a 300s–900s window. A 300s TTL stays at or below that window, so resolvers never route to a decommissioned IP for long.

Step 4 — Watch cache-hit ratio during any TTL transition. A sudden drop in CDN hit ratio during a TTL change indicates resolver anomalies or premature evictions rather than a real traffic shift. Correlate it against authoritative query volume before reacting.
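A rough way to make "a sudden drop" measurable is to compute the hit ratio for the window before and after the TTL change and compare them. The sketch below assumes you can obtain hit and miss counts from your CDN's analytics or logs; the helper names and the 10-point threshold are hypothetical choices, not provider defaults.

```shell
# Cache-hit ratio as a percentage with one decimal: hits / (hits + misses) * 100.
hit_ratio() {  # $1 = hits, $2 = misses
  awk -v h="$1" -v m="$2" 'BEGIN {
    total = h + m
    if (total == 0) exit 1            # no traffic in the window: ratio undefined
    printf "%.1f\n", 100 * h / total
  }'
}

# Succeed when the ratio fell by more than the threshold (percentage points).
ratio_dropped() {  # $1 = ratio before, $2 = ratio after, $3 = threshold
  awk -v b="$1" -v a="$2" -v t="$3" 'BEGIN { exit ((b - a > t) ? 0 : 1) }'
}

before=$(hit_ratio 950 50)    # 95.0
after=$(hit_ratio 700 300)    # 70.0
ratio_dropped "$before" "$after" 10 && echo "hit ratio dropped: check authoritative query volume"
```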

Step-by-Step: Staged Pre-Incident TTL Reduction

The most important fact about TTL changes: lowering a TTL only affects future queries. Records already cached at the old, higher TTL keep counting down on the old value. So you must lower TTL ahead of any change you want to propagate quickly.
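The arithmetic behind that rule can be sketched as follows. A resolver that cached the record just before the TTL was lowered still holds it for the old TTL minus the time elapsed since; a resolver that cached it afterwards holds it for at most the new TTL. The function name is hypothetical, and the model assumes resolvers honor published TTLs.

```shell
# Worst-case seconds until every cache has dropped the old answer if the record
# changes right now. $1 = old TTL, $2 = seconds since the TTL was lowered, $3 = new TTL.
worst_case_propagation() {
  _remaining_old=$(( $1 - $2 ))
  if [ "$_remaining_old" -gt "$3" ]; then
    echo "$_remaining_old"   # some caches are still counting down on the old TTL
  else
    echo "$3"                # every cache has re-fetched at the new TTL
  fi
}

worst_case_propagation 86400 0 60        # prints 86400: lowered and changed together
worst_case_propagation 3600 1800 300     # prints 1800: changed half an old TTL after lowering
worst_case_propagation 86400 172800 60   # prints 60: lowered 48 hours ahead
```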

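Step 1 — Lower the TTL at least 48 hours before a maintenance window. The hard minimum lead time is the record’s old TTL, since that is how long the slowest cache can keep the old value; 48 hours covers any record with a TTL up to 86400s with a full day to spare. Set the record to 300s (or your staged 60s) well in advance so global resolvers adopt the new value before the window opens.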

# RECORD_ID is the Cloudflare id of the DNS record being changed; export it alongside ZONE_ID
curl -s -X PATCH "https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/dns_records/${RECORD_ID}" \
  -H "Authorization: Bearer ${CF_TOKEN}" \
  -H "Content-Type: application/json" \
  --data '{"ttl": 300}'
# Expected: {"success":true, ... "ttl":300 ...}

Step 2 — Confirm the lowered TTL is being served everywhere. Poll several public resolvers and watch the TTL ceiling fall to your new value.

for r in 1.1.1.1 8.8.8.8 9.9.9.9 208.67.222.222; do
  echo -n "$r -> "; dig @$r api.your-saas.com +noall +answer | awk '{print $2}'
done
# Expected: each resolver eventually reports 300 (or lower) as the max value.
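The same check can act as a gate in a runbook script, so the change does not proceed while any resolver still serves a higher TTL. This is a minimal sketch: `all_at_or_below` is a hypothetical helper that reads "resolver ttl" pairs, and the sample pairs are illustrative.

```shell
# Read "resolver ttl" pairs on stdin; succeed only if every TTL is <= the target.
# A resolver with no TTL, or empty input, counts as a failure.
all_at_or_below() {  # $1 = target TTL in seconds
  awk -v max="$1" '
    $2 !~ /^[0-9]+$/ { print "no TTL from " $1; bad = 1; next }
    $2 + 0 > max     { print $1 " still serving TTL " $2; bad = 1 }
    END { exit ((bad || NR == 0) ? 1 : 0) }'
}

# Sample data so the sketch runs without network access.
printf '%s\n' '1.1.1.1 300' '8.8.8.8 187' | all_at_or_below 300 && echo "safe to proceed"

# Live usage, feeding it from the resolver loop:
# for r in 1.1.1.1 8.8.8.8 9.9.9.9 208.67.222.222; do
#   printf '%s %s\n' "$r" "$(dig @"$r" api.your-saas.com +noall +answer | awk 'NR == 1 { print $2 }')"
# done | all_at_or_below 300
```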

Step 3 — Execute the change via API, never the console UI. During a P1 incident, console latency is unacceptable. Apply the failover record programmatically.

aws route53 change-resource-record-sets \
  --hosted-zone-id "${HZ_ID}" \
  --change-batch file://failover.json
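# Expected: "Status": "PENDING" in the response; poll aws route53 get-change with the returned change Id until it reports "INSYNC" (typically within ~60s)

Contents of failover.json:

{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.your-saas.com",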
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [
          { "Value": "203.0.113.20" }
        ]
      }
    }
  ]
}

Note: this is a simple A record with a TTL; it does not use Route 53 failover routing. Failover routing requires record sets that carry SetIdentifier and Failover fields, usually paired with a HealthCheckId, and is configured separately.
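Hand-editing JSON during an incident invites typos, so it can help to generate the change batch from arguments and prepare the rollback file at the same time. The sketch below is one possible approach, not an AWS-provided tool: `make_change_batch` and `rollback.json` are hypothetical names, and the rollback address is the earlier example IP for this record.

```shell
# Print a Route 53 change batch that UPSERTs a single simple A record.
# $1 = record name, $2 = IPv4 address, $3 = TTL in seconds.
make_change_batch() {
  printf '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"%s","Type":"A","TTL":%d,"ResourceRecords":[{"Value":"%s"}]}}]}\n' \
    "$1" "$3" "$2"
}

# Print the failover batch; redirect to a file before passing it to the AWS CLI.
make_change_batch api.your-saas.com 203.0.113.20 60
# make_change_batch api.your-saas.com 203.0.113.20 60 > failover.json
# make_change_batch api.your-saas.com 203.0.113.10 60 > rollback.json
```

Generating both files before the window opens means the rollback is a single command away if error rates climb.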

Step 4 — Plan the rollback before you flip. If the new IP raises error rates, revert the record via API to the previous stable IP immediately, keep the low TTL for ~15 minutes to flush resolver caches, then restore the 300s baseline.

Step 5 — Mind DNSSEC signature windows. If the zone is signed, RRSIG records carry their own validity periods. Ensure RRSIG validity is longer than your DNS TTL so signatures do not expire mid-rollout and trigger validation failures.
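One way to check this margin is to convert the RRSIG expiration field into seconds remaining and compare it with the record TTL. The sketch below assumes the expiration is in the `YYYYMMDDHHMMSS` UTC form that dig prints; `rrsig_remaining` is a hypothetical helper, and the date conversion uses a standard days-from-civil calculation in awk so it does not depend on a particular `date` implementation.

```shell
# Print seconds between "now" and an RRSIG expiration time.
# $1 = expiration as YYYYMMDDHHMMSS (UTC), $2 = current time in epoch seconds.
rrsig_remaining() {
  awk -v e="$1" -v now="$2" 'BEGIN {
    y = substr(e, 1, 4) + 0; m = substr(e, 5, 2) + 0; d = substr(e, 7, 2) + 0
    hh = substr(e, 9, 2) + 0; mm = substr(e, 11, 2) + 0; ss = substr(e, 13, 2) + 0
    if (m <= 2) y -= 1                        # treat Jan/Feb as months of the prior year
    era = int(y / 400); yoe = y - era * 400
    mp = (m > 2) ? m - 3 : m + 9              # March-based month index
    doy = int((153 * mp + 2) / 5) + d - 1
    doe = yoe * 365 + int(yoe / 4) - int(yoe / 100) + doy
    days = era * 146097 + doe - 719468        # days since 1970-01-01
    printf "%d\n", days * 86400 + hh * 3600 + mm * 60 + ss - now
  }'
}

# Sample: signature expires 2024-01-01 00:00:00 UTC, checked two hours earlier.
rrsig_remaining 20240101000000 1704060000   # prints 7200

# Live usage (expiration is the 9th field of an RRSIG answer line):
# exp=$(dig +dnssec api.your-saas.com +noall +answer | awk '$4 == "RRSIG" { print $9; exit }')
# [ "$(rrsig_remaining "$exp" "$(date +%s)")" -gt 300 ] || echo "RRSIG expires within one TTL"
```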

Verification

After any TTL or record change, verify resolution and propagation explicitly rather than assuming.

# Fast single-shot check against an edge resolver
dig @1.1.1.1 api.your-saas.com +time=2 +tries=1 +short
# Expected: 203.0.113.20

# Differentiate authoritative vs recursive caching layers
dig api.your-saas.com +trace | tail -5
# Expected: final answer line served from your authoritative NS

# Flush a local stub cache when re-testing from one host
sudo resolvectl flush-caches   # on older systemd releases: sudo systemd-resolve --flush-caches

# Confirm the SOA negative-cache (minimum) TTL is sane
dig @ns1.provider.com your-saas.com SOA +noall +answer
# Expected last field ~300 for the SOA minimum on a SaaS zone

For a deeper propagation methodology across global vantage points, see Debugging DNS Propagation Delays Across Global Resolvers.

Troubleshooting

Setting TTL below 30 seconds

  • Symptom: Unpredictable failover timing; sudden authoritative-server load spikes.
  • Diagnosis: Many recursive resolvers (ISP, corporate firewalls) clamp sub-30s values to a configured minimum, often 30s or 60s. Your intended 10s never takes effect.
  • Fix: Use 60s as the absolute production floor. Validate with dig +noall +answer across 1.1.1.1, 8.8.8.8, and 208.67.222.222 and confirm the served TTL.
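The effect of clamping on failover timing can be sketched with a single comparison: a clamping resolver applies whichever is larger, your published TTL or its own floor. The function name and the floor values below are illustrative; actual floors depend on each resolver's configuration.

```shell
# TTL a clamping resolver actually applies: the larger of the published TTL and its floor.
effective_ttl() {  # $1 = published TTL, $2 = resolver minimum-TTL floor
  if [ "$1" -lt "$2" ]; then echo "$2"; else echo "$1"; fi
}

effective_ttl 10 30   # prints 30: the intended 10s never takes effect
effective_ttl 10 60   # prints 60
effective_ttl 60 30   # prints 60: a 60s TTL survives both floors unchanged
```

Publishing 60s gives the same effective value on resolvers with either floor, which is what makes failover timing predictable.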

CDN CNAME pointing to a provider with dynamic IP rotation

  • Symptom: Intermittent 502/504 errors during routine CDN maintenance windows.
  • Diagnosis: A high DNS TTL causes resolvers to route to decommissioned IPs until the cache expires.
  • Fix: Match the origin A-record TTL to the provider’s IP rotation window (typically 300s–900s). Run a dig polling script to alert on TTL drift.

SOA negative-caching TTL mismatch during zone updates

  • Symptom: Prolonged NXDOMAIN responses after a subdomain is temporarily removed.
  • Diagnosis: Resolvers cache NXDOMAIN responses for the lesser of the SOA record’s own TTL and its MINIMUM field (RFC 2308), so a large value keeps failed lookups cached longer than intended.
  • Fix: Set the SOA minimum TTL to 300s for SaaS zones. Verify with dig @ns1.provider.com your-saas.com SOA +noall +answer.
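The negative-cache lifetime can be read straight from the SOA answer line. This sketch assumes the usual dig layout, where field 2 is the SOA record's TTL and the last field is MINIMUM, and applies the RFC 2308 rule of taking the lesser of the two; the function name and the sample SOA values are illustrative.

```shell
# Read a `dig SOA +noall +answer` line on stdin and print the negative-cache TTL:
# the lesser of the SOA record TTL (field 2) and its MINIMUM (last field).
negative_cache_ttl() {
  awk '$4 == "SOA" { t = $2 + 0; m = $NF + 0; print (t < m ? t : m); found = 1; exit }
       END { exit (found ? 0 : 1) }'
}

# Sample SOA line (illustrative serial and timers) so the sketch runs offline.
printf '%s\n' \
  'your-saas.com. 3600 IN SOA ns1.provider.com. hostmaster.your-saas.com. 2024010101 7200 900 1209600 300' |
  negative_cache_ttl   # prints 300

# Live usage: dig @ns1.provider.com your-saas.com SOA +noall +answer | negative_cache_ttl
```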

TTL change “not propagating” within the expected window

  • Symptom: Some resolvers still serve the old IP long after the change.
  • Diagnosis: You lowered the TTL and changed the record at the same time. Records cached under the old high TTL keep their original countdown.
  • Fix: Always lower TTL first, wait at least the old TTL duration, then change the record. Confirm the new low TTL is served everywhere before the change window.

Stale answers from a single corporate forwarder

  • Symptom: One office reports the old endpoint while public resolvers are correct.
  • Diagnosis: An on-prem forwarder enforces a minimum cache TTL or ignores low TTLs.
  • Fix: Flush the forwarder cache, and treat its floor as a hard constraint when planning failover timing for that segment.

Frequently Asked Questions

What is the optimal TTL for a SaaS API endpoint behind a global CDN?

300 seconds. It balances rapid IP rotation during edge-node maintenance against acceptable query volume for high-traffic APIs, preventing both stale routing and authoritative-server overload.

Can I set TTL to 0 to force immediate DNS updates?

RFC 2181 section 8 permits a TTL of 0, but resolver behavior varies: many resolvers clamp it to a minimum of their own, often in the 1–30 second range, and some mishandle it. A zero TTL also asks resolvers not to cache the record at all, which can suit specific health-check records but not production traffic. Use 60s as the practical floor.

How do I verify if my TTL changes have propagated globally?

Run dig @resolver_ip domain.com +noall +answer across several geographic resolvers such as 1.1.1.1, 8.8.8.8, and 9.9.9.9. The TTL countdown in each answer section tells you how much longer that cache will keep serving the old value, which gives you an upper bound on when propagation completes. Large public resolvers run many independent caches, so repeat the query a few times and plan around the highest remaining TTL you see.

Why lower TTL 48 hours before a change instead of right before?

Because lowering a TTL only affects future lookups. Records cached at the old, higher value continue their original countdown. Lowering early lets every resolver adopt the new short TTL so your eventual change propagates within minutes.
