Observability Playbook for Hybrid Outage Scenarios: From Cloudflare to AWS
Practical, step-by-step playbook to diagnose and survive multi‑provider outages spanning Cloudflare and AWS using synthetics, tracing, and failover tests.
When CDN, Cloud, and App Endpoints Fail Together: What Do You Do First?
Outage response across multiple providers is one of the most stressful, highest-impact scenarios engineering teams face. In 2026, with global traffic routed through CDNs like Cloudflare and core services hosted on providers like AWS, cross-provider incidents are no longer rare — they're an operational reality. This playbook gives you a step‑by‑step, battle-tested approach to diagnose and survive simultaneous failures of CDN, cloud provider, and app endpoints using synthetic monitoring, distributed tracing, spike detection, and failover testing.
Executive Summary — Critical Actions First
- Run independent synthetics immediately from multiple vantage points (bypass CDN and test origin directly).
- Triangulate with distributed tracing and logs to identify provider vs application scope.
- Detect spikes and anomalies with robust, time‑aware algorithms (EWMA, percentile baselines) and prioritize alerts by impact (SLO/RTO/RPO).
- Execute pre‑tested failover plans (DNS, load balancer, regional DB failover) only after quick health checks and blast radius assessment.
- Capture evidence (traces, screenshots, pcap where possible) and update the incident runbook in real time.
The 2026 Context: Why Cross‑Provider Outages Demand a New Playbook
Late 2025 and early 2026 saw several high‑profile incidents where outages at CDN and cloud layers coincided with application degradations, magnifying impact across regions. The shift toward edge architectures, multi‑cloud strategies, and universal adoption of OpenTelemetry means teams must correlate signals across more layers than ever before. Observability in 2026 emphasizes:
- Provider‑agnostic telemetry pipelines (OpenTelemetry is the de facto standard).
- Third‑party and in‑house synthetics that test both edge and origin.
- Chaos and failover testing baked into CI/CD to validate recovery procedures.
- Improved eBPF‑based host observability for network path and socket level insights.
Play 1 — Run Immediate, Independent Synthetics
When a multi‑provider incident hits, your first step is to gather quick, independent evidence of which layer is failing. Synthetics provide deterministic checks from controlled locations.
What to run in the first 5 minutes
- Global HTTP check against the public URL from 3 distinct providers/regions (e.g., third‑party probes + internal agents).
- Origin bypass check using curl --resolve or TLS SNI override to hit the origin IP directly.
- DNS resolution checks (dig + trace) to validate authoritative DNS and CDN edge responses.
- TCP/TLS handshake checks to confirm connectivity to the edge and to origin.
Quick examples (copy/paste)
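# Edge check through the CDN (public edge URL)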
curl -I https://www.example.com --connect-timeout 5
# Bypass CDN by resolving host to origin IP
curl -I https://www.example.com --resolve 'www.example.com:443:203.0.113.10' --connect-timeout 5
# Check authoritative DNS
dig +short www.example.com @ns1.exampledns.com
# TLS verify and SNI
openssl s_client -connect 203.0.113.10:443 -servername www.example.com -brief
Run these from multiple vantage points. If edge checks fail while origin bypass succeeds, the problem is likely in the CDN or edge control plane. If both fail, suspect provider or origin app problems.
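If you want these first checks scripted rather than typed by hand, here is a minimal Python sketch that runs an edge check and an origin-bypass check (the same idea as curl --resolve) and prints both results; the hostname and origin IP are the placeholder values used above.

import socket
import ssl

import requests

HOSTNAME = "www.example.com"   # placeholder: your public hostname
ORIGIN_IP = "203.0.113.10"     # placeholder: your origin IP

def edge_check(timeout=5):
    # Normal request through the CDN edge
    return requests.head(f"https://{HOSTNAME}", timeout=timeout).status_code

def origin_bypass_check(timeout=5):
    # Same idea as curl --resolve: connect to the origin IP but present the
    # real hostname for SNI and certificate verification
    ctx = ssl.create_default_context()
    with socket.create_connection((ORIGIN_IP, 443), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=HOSTNAME) as tls:
            tls.sendall(f"HEAD / HTTP/1.1\r\nHost: {HOSTNAME}\r\nConnection: close\r\n\r\n".encode())
            return tls.recv(4096).split(b"\r\n", 1)[0].decode(errors="replace")

for name, check in (("edge", edge_check), ("origin bypass", origin_bypass_check)):
    try:
        print(f"{name}: {check()}")
    except Exception as exc:
        print(f"{name}: FAILED ({exc})")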
Play 2 — Use Distributed Tracing to Pinpoint the Failure Boundary
Distributed tracing is essential in 2026. Make sure your traces carry provider boundary metadata (edge POP id, origin region, AZ, and service mesh hop). If you don't already tag spans with this info, prioritize that after the incident.
Trace‑guided triage workflow
- Search for recent traces containing errors or increased latency across the last 15 minutes.
- Filter traces by entry point (edge vs direct origin vs API gateway) to see where latency/error increases first appear.
- Inspect span attributes for network errors (DNS_LOOKUP_FAILURE, TLS_HANDSHAKE_TIMEOUT, ECONNREFUSED) and for provider tags (cloud.provider, cloud.region).
- Correlate trace IDs with logs and synthetic timestamps to build a timeline.
Example OpenTelemetry attributes to include in spans:
- http.server_name or net.sock.peer.addr
- cloud.provider, cloud.region, edge.pop
- provider.event_id if you ingest provider incident IDs
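If you use the Python OpenTelemetry API, stamping these attributes on a span can look like the sketch below; cloud.provider and cloud.region are standard semantic conventions, while edge.pop is a custom key, and the tracer name, CF-Ray value, and peer address are placeholders.

from opentelemetry import trace

tracer = trace.get_tracer("edge-triage-example")  # placeholder tracer name

def tag_provider_boundaries(span, cf_ray_header, peer_addr):
    # Standard OpenTelemetry semantic-convention attributes
    span.set_attribute("cloud.provider", "aws")
    span.set_attribute("cloud.region", "us-east-1")
    span.set_attribute("net.sock.peer.addr", peer_addr)
    # Custom key for provider-boundary triage (not an official convention);
    # the CF-Ray suffix identifies the Cloudflare POP that served the request
    pop = cf_ray_header.rsplit("-", 1)[-1] if cf_ray_header else "unknown"
    span.set_attribute("edge.pop", pop)

with tracer.start_as_current_span("GET /checkout") as span:
    tag_provider_boundaries(span, cf_ray_header="8a1b2c3d4e5f0001-EWR", peer_addr="203.0.113.10")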
Play 3 — Spike Detection and Rate‑based Anomaly Thresholds
Not all spikes are equal. Simple threshold alerts will either overwhelm you or miss early signs. Use time‑aware detection tuned for your traffic profiles.
Recommended algorithms and parameters
- EWMA (Exponentially Weighted Moving Average) for latency and error rate smoothing — responsive to recent changes.
- Robust Z‑score on requests per second (RPS) and on 95th/99th percentile latency to flag outliers.
- Percentile baselining (rolling 7‑day window with day‑of‑week segmentation) to account for traffic patterns.
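Below is a minimal sketch of EWMA smoothing plus a robust (median/MAD) Z-score over a p95 latency series; the alpha value and the sample numbers are illustrative, not tuned recommendations.

import statistics

def ewma(series, alpha=0.3):
    # Exponentially weighted moving average; higher alpha reacts faster
    smoothed = series[0]
    for x in series[1:]:
        smoothed = alpha * x + (1 - alpha) * smoothed
    return smoothed

def robust_zscore(value, history):
    # Median/MAD based score, resistant to earlier spikes polluting the baseline
    med = statistics.median(history)
    mad = statistics.median(abs(x - med) for x in history) or 1e-9
    return 0.6745 * (value - med) / mad

p95_latency_ms = [120, 118, 125, 122, 119, 121, 410]   # illustrative per-minute p95 values
baseline, current = p95_latency_ms[:-1], p95_latency_ms[-1]
print("EWMA baseline:", round(ewma(baseline), 1))
print("robust z-score of latest point:", round(robust_zscore(current, baseline), 1))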
Practical thresholds
- Page immediately if 5xx errors exceed 2% of requests for 1 minute and are at least 5x the baseline rate.
- Page if global RPS drops more than 30% while origin latency remains normal; this pattern usually points to a CDN or DNS issue.
- Alert if 95th percentile latency increases by 50% or more for longer than 3 minutes.
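These rules translate directly into an alert evaluator; the sketch below assumes you already compute per-minute aggregates, and every field name is made up for illustration.

def page_decisions(window):
    # window: per-minute aggregates; every field name here is illustrative
    decisions = []
    if window["error_rate_5xx"] > 0.02 and window["error_rate_5xx"] >= 5 * window["baseline_error_rate_5xx"]:
        decisions.append("PAGE: 5xx above 2% and at least 5x baseline")
    if window["rps_drop_pct"] > 30 and window["origin_p95_ms"] <= 1.5 * window["baseline_p95_ms"]:
        decisions.append("PAGE: global RPS drop with normal origin latency (suspect CDN or DNS)")
    if window["p95_ms"] >= 1.5 * window["baseline_p95_ms"] and window["p95_elevated_minutes"] > 3:
        decisions.append("ALERT: p95 latency up 50% for more than 3 minutes")
    return decisions

example = {
    "error_rate_5xx": 0.035, "baseline_error_rate_5xx": 0.004,
    "rps_drop_pct": 42, "origin_p95_ms": 125, "baseline_p95_ms": 120,
    "p95_ms": 125, "p95_elevated_minutes": 0,
}
print(page_decisions(example))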
Play 4 — Triangulate Evidence: CDN vs AWS vs Your App
Use the combined signals from synthetics, traces, and provider dashboards to determine scope. Below are fast checks for common failure modes.
CDN (Cloudflare) suspected
- Global edge failures on synthetics while origin bypass (direct to IP) succeeds.
- Distributed traces show that the request fails before reaching origin span.
- The Cloudflare status page or API reports an incident, and spans carrying edge POP tags are the ones showing errors.
AWS region/provider suspected
- Multiple AWS services show degraded status (EC2, ELB, RDS) and CloudWatch metrics spike or vanish.
- Traces show errors originating from AWS-managed components (ELB timeouts, NAT gateway failures).
- Route53 or VPC route changes coinciding with the incident.
Application suspected
- Origin bypass checks fail and logs show application exceptions (OOM, DB timeouts).
- Distributed traces reach origin but show service-level errors/exceptions.
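As a rough illustration, the triangulation above can be captured in a small decision helper; the boolean inputs map to the checks in the three lists, and the output is only a starting hypothesis, not a verdict.

def suspect_scope(edge_synthetics_ok, origin_bypass_ok, traces_reach_origin,
                  provider_status_incident, app_exceptions_in_logs):
    # Simplified mapping of the CDN / AWS / application checks above
    if not edge_synthetics_ok and origin_bypass_ok and not traces_reach_origin:
        return "CDN/edge suspected"
    if not origin_bypass_ok and provider_status_incident and not app_exceptions_in_logs:
        return "Cloud provider (region or managed service) suspected"
    if traces_reach_origin and app_exceptions_in_logs:
        return "Application suspected"
    return "Mixed or inconclusive: keep triangulating"

print(suspect_scope(edge_synthetics_ok=False, origin_bypass_ok=True,
                    traces_reach_origin=False, provider_status_incident=False,
                    app_exceptions_in_logs=False))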
Play 5 — Run Safe Failover (and Know When Not to)
Failover is powerful but dangerous. Establish a small “blast radius” and follow a checklist before failing over.
Failover checklist
- Confirm origin and read replicas health via direct checks.
- Verify data replication lag against your RPO constraints; if lag exceeds your RPO, do not fail over writes (see the lag check sketch after this checklist).
- Assess RTO impact and customer SLA obligations; pick a strategy (read‑only vs active‑active vs DNS failover).
- Execute the automated, saved Route53 or Cloudflare load balancer failover plan using pre‑approved runbook steps.
- Keep DNS TTLs low for critical endpoints (30–60s) during prepped failover windows, but weigh the trade-offs of reduced DNS caching in normal operation.
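For the replication-lag gate, here is a minimal sketch using boto3 and the AWS/RDS ReplicaLag CloudWatch metric; the replica identifier and RPO value are placeholders, and you should adapt it to however you actually measure lag.

import datetime
import boto3

RPO_SECONDS = 60                    # placeholder: your write RPO
REPLICA_ID = "app-db-replica-1"     # placeholder: RDS read replica identifier

def replica_lag_within_rpo():
    cloudwatch = boto3.client("cloudwatch")
    now = datetime.datetime.now(datetime.timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="ReplicaLag",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": REPLICA_ID}],
        StartTime=now - datetime.timedelta(minutes=5),
        EndTime=now,
        Period=60,
        Statistics=["Maximum"],
    )
    datapoints = resp.get("Datapoints", [])
    if not datapoints:
        return False  # no data: treat as unsafe to fail over writes
    return max(p["Maximum"] for p in datapoints) <= RPO_SECONDS

if not replica_lag_within_rpo():
    print("Replication lag exceeds RPO (or is unknown): do NOT fail over writes")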
Example: Dual‑region failover steps (Route53):
# Promote secondary by swapping Route53 health checks and routing policies
aws route53 change-resource-record-sets --hosted-zone-id Z123456789 --change-batch file://swap-to-secondary.json
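If the swap is scripted in Python rather than the CLI, a boto3 sketch might look like the following; the hosted zone ID matches the example above, while the record name and secondary endpoint are placeholders, and a simple UPSERT is shown in place of whatever your saved change batch actually does (for example, flipping failover routing policies or health check associations).

import boto3

HOSTED_ZONE_ID = "Z123456789"      # same zone as the CLI example
RECORD_NAME = "www.example.com."
SECONDARY_ENDPOINT = "app-secondary.eu-west-1.elb.amazonaws.com"  # placeholder

route53 = boto3.client("route53")
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Incident failover: point traffic at the secondary region",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": RECORD_NAME,
                "Type": "CNAME",
                "TTL": 60,  # keep TTL low during prepped failover windows
                "ResourceRecords": [{"Value": SECONDARY_ENDPOINT}],
            },
        }],
    },
)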
When not to failover
- The provider control plane is degraded globally; failing over may make DNS or routing convergence worse.
- Data replication lag violates your RPO for writes, risking data loss.
- Failover would rely on the same failed provider or network path.
Incident Runbook — Detailed Play‑by‑Play
Below is a concise runbook you can paste into your incident management tool. Keep it versioned and rehearsed.
Initial detection (0–5 minutes)
- Run global synthetics (edge + origin bypass).
- Check provider status pages (Cloudflare, AWS). Save incident IDs.
- Create an incident channel; assign roles: incident commander (IC), comms, engineering lead, SRE lead.
- Capture telemetry timestamps, sample trace IDs, and a screenshot of failing checks.
Triage (5–20 minutes)
- Aggregate traces and logs; identify earliest failed hop or provider tag.
- Run network path diagnosis (traceroute/mtr) from multiple locations.
- Assess data path and stateful dependencies (DB, queues). Check replication lag and backup health.
Mitigation (20–60 minutes)
- Execute safe failover if indicated (DNS swap, LB redirect, region promotion).
- Throttle non‑essential traffic (API rate limits, disable heavy background jobs).
- Communicate status and ETA to stakeholders; update public status page with honest RTO/RPO expectations.
Recovery & Follow‑up
- Confirm recovery with synthetics and full traffic tests.
- Run post‑mortem within 72 hours with remediation and action owners.
- Update your runbook, tests, and CI/CD failover automation based on findings, and fold in automation best practices surfaced by your tool-stack audits.
Operational Metrics: Aligning SLO, RPO, and RTO for Cross‑Provider Incidents
Design SLOs that explicitly account for multi‑provider failure modes. Typical mappings:
- SLO: 99.95% availability for core API — drives alerting sensitivity and failover thresholds.
- RPO: Determine maximum acceptable data loss for writes (e.g., 1 minute) and ensure replication supports it.
- RTO: Target recovery times for critical paths (e.g., 5 minutes for read traffic failover, 30 minutes for write failover).
Instrument metrics that directly inform RPO/RTO decisions: DB replication lag, queue length, consumer lag, and global error rate by edge vs origin.
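One way to expose these decision-driving metrics is a small exporter; the sketch below uses the prometheus_client library, and the metric names and static values are made up purely to show the shape.

import time
from prometheus_client import Gauge, start_http_server

# Metric names are illustrative; align them with your own conventions
replication_lag = Gauge("db_replication_lag_seconds", "Primary to replica lag in seconds")
consumer_lag = Gauge("queue_consumer_lag_messages", "Messages behind the head of the queue", ["queue"])
error_rate = Gauge("http_error_rate_ratio", "5xx ratio by entry point", ["entry_point"])

def collect_once():
    # Replace these constants with real probes of your DB, broker, and access logs
    replication_lag.set(2.4)
    consumer_lag.labels(queue="orders").set(1250)
    error_rate.labels(entry_point="edge").set(0.031)
    error_rate.labels(entry_point="origin").set(0.004)

if __name__ == "__main__":
    start_http_server(9105)   # scrape target for Prometheus
    while True:
        collect_once()
        time.sleep(15)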
Failover Testing: How to Build Confidence Without Causing Customer Pain
Failover testing must be automated and safe. Two proven approaches in 2026:
- Game days with canary traffic: route a small percentage of real traffic to the failover path under production safeguards.
- Full chaos‑style rehearsals in staging that mimic provider outages using network partitioning and DNS manipulation.
Automated failover test checklist
- Preflight: Verify metrics and alerts are active and that rollback playbooks are implemented.
- Execute: Run automation that flips Route53/Cloudflare LB to secondary endpoints; monitor application and DB health.
- Validate: Run client‑facing synthetics and trace a sample request end‑to‑end.
- Rollback: Revert to primary; ensure consistency and data convergence.
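Compressed into code, the preflight/execute/validate/rollback loop can be orchestrated by a thin wrapper like the sketch below; every callable passed in is a hypothetical hook into your own automation, such as the Route53 swap and synthetic checks sketched earlier.

import time

def run_failover_drill(flip_to_secondary, flip_to_primary, synthetic_ok,
                       replica_lag_ok, convergence_wait_s=90):
    # Preflight: refuse to start if synthetics or replication are already unhealthy
    if not (synthetic_ok() and replica_lag_ok()):
        return "preflight failed: aborting drill"
    flip_to_secondary()
    time.sleep(convergence_wait_s)          # allow DNS/LB convergence (illustrative)
    healthy = all(synthetic_ok() for _ in range(3))
    flip_to_primary()                       # always roll back at the end of a drill
    return "drill passed" if healthy else "drill failed: fix before relying on failover"

print(run_failover_drill(
    flip_to_secondary=lambda: print("flip to secondary"),
    flip_to_primary=lambda: print("flip back to primary"),
    synthetic_ok=lambda: True,
    replica_lag_ok=lambda: True,
    convergence_wait_s=0,                   # keep the demo fast
))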
Evidence & Post‑Mortem Best Practices
Collect and preserve artifacts for faster root cause analysis and compliance:
- Trace samples and span graphs around the incident window.
- Synthetic run logs and screenshots with timestamps and vantage points.
- Provider incident IDs, change logs, and any API responses.
- Configuration snapshots (DNS/Route53, LB policies) before and after changes.
"The right telemetry saved us — not by avoiding an outage, but by ensuring we failed over safely and recovered within our RTO." — SRE team lead, global SaaS firm
Advanced Strategies and 2026 Trends to Adopt Now
Adopt these techniques that became mainstream by 2025–2026:
- Provider‑agnostic control plane for failover decisions that stores runbooks as code and supports transactional rollbacks.
- Edge-aware tracing: include POP and edge control plane identifiers in spans to trace failures at the CDN level.
- Increase use of eBPF observability on hosts and sidecars to detect socket‑level anomalies that span-based tools miss.
- Run synthetic checks from both ISP‑diverse public probes and private agents inside cloud VPCs to detect provider route blackholes. For low-cost private probes, see guides on Raspberry Pi cluster builds.
Checklist: What to Implement This Week
- Ensure synthetics include origin bypass, DNS, and TCP/TLS checks from multiple providers.
- Tag spans with cloud.provider/region and edge.pop in your OpenTelemetry pipeline.
- Automate failover playbooks and store them in version control; rehearse monthly with game days.
- Set up robust spike detection using EWMA and percentile baselining for critical metrics.
- Document RTO/RPO for each critical path and map runbook steps to those objectives.
Final Takeaways — What Separates Teams That Survive from Those That Don't
Teams that recover quickly and with low customer impact do three things consistently:
- They capture high‑fidelity, provider‑agnostic telemetry ahead of incidents.
- They rehearse failover and keep automated, reversible playbooks ready, informed by regular tool-stack audits.
- They tie monitoring and runbooks to clear SLOs, RTOs, and RPOs so decisions during an incident are fast and aligned to business risk.
Call to Action
If you don’t have a multi‑provider observability pipeline or playbooks versioned as code, start now. Download our free incident runbook template and synthetic test scripts, or schedule a 30‑minute workshop with our SRE consultants to rehearse a Cloudflare + AWS failover game day. Get proven tooling and runbooks that help you reduce RTO and enforce RPO in real cross‑provider outages.