AWS outage 2025: what CTOs should do differently
Read Time 6 mins | Written by: Jimmy Jacobson
On October 20, 2025, at 3:11 AM Eastern Time, a DNS resolution failure in AWS's US-EAST-1 region brought large portions of the internet to a standstill.
My sister couldn't log into her healthcare portal. Teams across the country were locked out of critical business tools. HubSpot, Asana, and some of our own systems at Codingscape were inaccessible. For over seven hours, Snapchat went dark, Fortnite players couldn't log in, Ring doorbells stopped recording, and trading platforms like Robinhood left investors helpless. UK banks lost service. Even Amazon's own Alexa fell silent.
Less than a month later, it happened again – this time at Cloudflare. On November 18, a single configuration change took down X, ChatGPT, Spotify, Discord, PayPal, and thousands of other services. Even Downdetector went dark.
Two major outages in 30 days. Two different providers. The same underlying problem: we've built our digital infrastructure on a handful of platforms, and when they fail, everything fails with them.
The technical details matter, but the strategic implications matter more. This wasn't just an outage – it was a stress test of our centralized digital infrastructure, and most organizations failed.
Here's my retro on what happened with AWS, what it actually cost, and what you can do to prepare for the next one – because there will be a next one.
Why one AWS outage broke the internet
AWS runs a huge portion of the internet. Most modern business applications – from your project management tools to your CRM – live on AWS infrastructure. When AWS goes down, millions of services go down with it.
The irony? Even AWS runs on AWS. Amazon's own monitoring and management tools rely on the same infrastructure. It's turtles all the way down. (Ask me about that saying if you don't know it.)
US-EAST-1 caused the problem. This is AWS's original cloud region, and when it has issues – like it did on Monday, and three years ago before that – the ripple effects hit hard.
The real cost of AWS downtime: $75 million per hour
The outage cost US companies approximately $75 million per hour.
Estimates of the total financial impact run far higher: some industry analysts put it in the hundreds of billions of dollars, because millions of workers couldn't do their jobs and business operations stopped or stalled across industries, from airlines to factories.
Over 1,000 companies were directly affected. For context, a similar incident involving CrowdStrike in 2024 caused $5.4 billion in losses for Fortune 500 companies alone.
But direct revenue loss is only part of the equation. There were hidden costs:
- Reputational transfer: End users do not blame AWS. They blame you. Your customers don't care about your cloud provider’s DNS issues; they care that your service was unavailable.
- The insurance gap: Many cyber insurance policies do not trigger unless an outage lasts eight hours or more. This one lasted roughly seven. Most companies absorbed the losses entirely, hitting Q4 P&L directly. For example, of the $5.4 billion the 2024 CrowdStrike outage cost Fortune 500 companies, only an estimated $1.08 billion was covered by insurance.
- Opportunity cost: Every engineer currently fixing data inconsistencies caused by the outage is an engineer not shipping your roadmap.
What actually went wrong with AWS
The root cause was deceptively simple: a DNS resolution failure affecting DynamoDB API endpoints in US-EAST-1. Because DynamoDB is a "foundational service" that dozens of other AWS services depend on, the failure cascaded rapidly. Within minutes, 113 AWS services were affected.
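To make that failure mode concrete, here's a minimal sketch of what it looks like from a dependent service's side – my assumption of a typical setup, not AWS's or anyone else's actual code. With short timeouts, capped retries, and a last-known-good cache, a regional endpoint failure degrades a feature instead of hanging every request thread.

```python
# Minimal sketch: a read path that degrades gracefully when a foundational
# dependency (here, DynamoDB) can't be reached. Table and key names are hypothetical.
import boto3
from botocore.config import Config
from botocore.exceptions import ConnectTimeoutError, EndpointConnectionError, ReadTimeoutError

# Short timeouts and capped retries keep a regional failure from stalling
# every request while it waits on an endpoint that won't resolve.
dynamodb = boto3.client(
    "dynamodb",
    region_name="us-east-1",
    config=Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 1}),
)

_fallback_cache = {}  # last-known-good values, refreshed on every successful read

def get_feature_flags(tenant_id: str) -> dict:
    try:
        resp = dynamodb.get_item(
            TableName="feature_flags",            # hypothetical table
            Key={"tenant_id": {"S": tenant_id}},
        )
        item = resp.get("Item", {})
        _fallback_cache[tenant_id] = item
        return item
    except (EndpointConnectionError, ConnectTimeoutError, ReadTimeoutError):
        # Regional DNS/endpoint failure: serve stale data instead of erroring out.
        return _fallback_cache.get(tenant_id, {})
```

None of this prevents the outage, but it's the difference between a degraded feature and a frozen product.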
What's more concerning is how long it took AWS to diagnose the problem. It took 75 minutes to go from "things are breaking" to "we've narrowed it down to a single service endpoint." For a company that holds up most of the internet, that's too long, and it was preventable.
For those 75 minutes, visitors to the AWS status page were met with an "all is well!" default response – despite thousands of services burning down globally.
The talent problem no one's talking about
Here's the uncomfortable question: did Amazon cause this problem by replacing engineers with automation and losing the institutional knowledge needed to fix it?
Amazon’s retro on the US-EAST-1 outage points to an automation failure: “The incident was triggered by a latent defect within the service’s automated DNS management system that caused endpoint resolution failures for DynamoDB.” They didn't elaborate, but it sounds like they automated a crucial process that once had an engineer watching it.
Over 27,000 Amazon employees have been affected by layoffs between 2022 and 2025, with internal reports suggesting regretted attrition rates between 69% and 81% across employment levels. When your most experienced engineers leave, they take decades of institutional knowledge about system failure modes with them.
You can hire smart people to explain how DNS works at a deep technical level, but you can't hire the person who remembers what to do when DNS goes down. That tribal knowledge often doesn't exist in documentation – it exists in the minds of engineers who've been through multiple outages.
This suggests it wasn't just a novel technical problem. It was a people problem manifesting as a technical failure.
Why your multi-AZ setup didn't save you
Many CTOs assumed their multi-AZ deployments would protect them from regional failures. They were wrong.
Single-region dependency is a single point of failure: if you rely solely on one AWS region like US-EAST-1, a regional outage takes your services down with it. And when foundational services fail at the regional level, spreading across availability zones within that region provides zero protection.
Platform services failures take down entire application ecosystems. Outages in identity, networking, and event systems can ripple across every application that depends on those services.
What CTOs should do differently
Calculate your real cost of failure
Don't rely on generic industry estimates. Calculate revenue per minute, customer churn risk, contractual penalties, and reputational damage specific to your business. The question is not "How much does it cost to run?" but "How much does it cost to fail?"
This number becomes the business justification for every resilience investment.
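If it helps, here's the back-of-the-envelope version I'd start with. Every number below is a placeholder to swap for your own finance and customer-success figures.

```python
# Back-of-the-envelope downtime cost model. All inputs are placeholders --
# replace them with figures specific to your business.

def cost_of_failure(
    annual_revenue: float,
    revenue_online_fraction: float,   # share of revenue that stops when you're down
    outage_hours: float,
    engineers_diverted: int,
    loaded_eng_cost_per_hour: float,
    contractual_penalties: float,     # SLA credits and similar obligations
    churn_revenue_at_risk: float,     # annual revenue from customers likely to leave
) -> float:
    # Simplification: revenue is assumed to flow evenly around the clock.
    revenue_per_hour = annual_revenue * revenue_online_fraction / (365 * 24)
    lost_revenue = revenue_per_hour * outage_hours
    recovery_labor = engineers_diverted * loaded_eng_cost_per_hour * outage_hours
    return lost_revenue + recovery_labor + contractual_penalties + churn_revenue_at_risk

# Example: a $50M ARR business with 80% of revenue flowing through online systems,
# down for 7 hours, with 20 engineers pulled off the roadmap to clean up.
print(f"${cost_of_failure(50_000_000, 0.8, 7, 20, 150, 25_000, 100_000):,.0f}")
```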
Pick the right disaster recovery strategy
There are four main approaches, each with different cost and complexity tradeoffs:
- Backup and restore mitigates data loss, with backups replicated to other AWS Regions. It's the cheapest option but provides an RTO measured in hours.
- Pilot light keeps critical data replicated and data stores in sync with the active Region, but services remain idle until failover. This offers 10-30 minute recovery times at moderate cost.
- Warm standby maintains live data while keeping infrastructure running at reduced capacity, requiring scale-up before failover to meet production needs. This provides single-digit minute recovery.
- Multi-site active/active has each Region hosting a highly available workload stack serving production traffic, with data replicated live between regions. This offers near-zero RTO but at the highest cost.
If your definition of disaster goes beyond the loss of a physical data center to the loss of an entire Region, or if you're subject to regulatory requirements, consider pilot light, warm standby, or multi-site active/active.
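Whichever tier you pick, the part you can't improvise on the day is data replication. Here's a sketch of what keeping a critical DynamoDB table replicated to a standby region looks like with global tables; the table and region names are my assumptions, and the same idea applies to whatever datastore you actually run (read replicas, cross-region S3 replication, and so on).

```python
# Sketch of the data-replication half of a pilot light setup: add a replica of a
# critical DynamoDB table in a standby region so the data is already there when
# you fail over. Table and region names are hypothetical.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Assumes the table uses the current (2019.11.21) version of global tables
# with streams enabled.
dynamodb.update_table(
    TableName="orders",  # hypothetical table
    ReplicaUpdates=[{"Create": {"RegionName": "us-west-2"}}],
)

# Replica creation is asynchronous -- confirm it exists and is ACTIVE before
# you count on it in a failover plan.
replicas = dynamodb.describe_table(TableName="orders")["Table"].get("Replicas", [])
print([(r["RegionName"], r.get("ReplicaStatus")) for r in replicas])
```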
Monitor from outside your cloud
Build observability that doesn't depend on AWS. The AWS status page lagged reality by 75 minutes during this incident. Deploy synthetic monitoring from outside AWS and set up alerting through non-AWS channels.
Transparency matters. When issues arise, you need to communicate with customers immediately – and you can't do that if you're relying on the same infrastructure that's down.
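You don't need a vendor to get started. Here's a minimal synthetic probe you could run from anywhere that isn't AWS (a small VPS, a cron job on another provider); the endpoint and webhook URLs are placeholders for your own.

```python
# Minimal synthetic check, meant to run OUTSIDE AWS. URLs are placeholders --
# point them at your own endpoints and at an alerting channel that doesn't
# depend on the cloud you're monitoring.
import time
import requests

ENDPOINTS = [
    "https://app.example.com/healthz",   # hypothetical public health endpoints
    "https://api.example.com/healthz",
]
ALERT_WEBHOOK = "https://hooks.example.com/outage-alerts"  # non-AWS channel

def probe(url: str, timeout: float = 5.0) -> tuple[bool, str]:
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=timeout)
        latency_ms = (time.monotonic() - start) * 1000
        return resp.status_code == 200, f"{url} -> {resp.status_code} in {latency_ms:.0f} ms"
    except requests.RequestException as exc:
        return False, f"{url} -> {type(exc).__name__}"

if __name__ == "__main__":
    failures = [detail for ok, detail in map(probe, ENDPOINTS) if not ok]
    if failures:
        # Alert through infrastructure you don't host on the cloud being monitored.
        requests.post(
            ALERT_WEBHOOK,
            json={"text": "Synthetic check failed", "details": failures},
            timeout=5,
        )
```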
Design failovers that actually work
When choosing AWS resources for disaster recovery, remember that data planes have higher availability design goals than control planes. Use only data plane operations as part of your failover for maximum resiliency, and design failover procedures that route traffic rather than launch instances or modify infrastructure.
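One way to honor that rule is to pre-provision DNS failover so the switch itself happens in the data plane. Here's a sketch using Route 53 failover routing with a health check; the zone ID, record names, and IP addresses are placeholders.

```python
# Sketch: pre-provision Route 53 failover records so the actual failover is a
# data plane event (health check evaluation + DNS resolution), with no API
# calls required during an incident. Zone ID, names, and IPs are placeholders.
import boto3

route53 = boto3.client("route53")

# Health check against the primary region's public endpoint.
health_check_id = route53.create_health_check(
    CallerReference="api-primary-check-001",     # must be unique per request
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "api-use1.example.com",  # hypothetical
        "ResourcePath": "/healthz",
        "Port": 443,
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)["HealthCheck"]["Id"]

# Primary/secondary failover pair, created ahead of time.
route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",               # hypothetical hosted zone
    ChangeBatch={"Changes": [
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "api.example.com", "Type": "A",
            "SetIdentifier": "primary-us-east-1", "Failover": "PRIMARY",
            "TTL": 60, "ResourceRecords": [{"Value": "203.0.113.10"}],
            "HealthCheckId": health_check_id,
        }},
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "api.example.com", "Type": "A",
            "SetIdentifier": "secondary-us-west-2", "Failover": "SECONDARY",
            "TTL": 60, "ResourceRecords": [{"Value": "198.51.100.20"}],
        }},
    ]},
)
```

The calls above are control plane operations, which is exactly why you run them ahead of time: once the records and health check exist, failover depends only on Route 53's globally distributed data plane.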
Test relentlessly
Simulate regional outages and switch traffic to alternate deployments, monitoring failover performance and updating plans based on findings. Use AWS Resilience Hub to validate whether you'll meet RTO and RPO targets. Run drills during business hours to uncover real operational challenges.
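Even a crude drill beats no drill. Here's a sketch of a game-day timer that measures how long your application takes to answer healthily again after you trigger a failover exercise, then compares that against your RTO target; the endpoint and target are assumptions.

```python
# Crude game-day timer: after you trigger a failover exercise, measure how long
# the application takes to answer healthily again, then compare against your
# RTO target. Endpoint and target values are placeholders.
import time
import requests

ENDPOINT = "https://api.example.com/healthz"   # hypothetical endpoint
RTO_TARGET_SECONDS = 30 * 60                   # e.g. a 30-minute pilot light target

def measure_recovery(poll_interval: float = 10.0) -> float:
    start = time.monotonic()
    while True:
        try:
            if requests.get(ENDPOINT, timeout=5).status_code == 200:
                return time.monotonic() - start
        except requests.RequestException:
            pass  # still failing over
        time.sleep(poll_interval)

if __name__ == "__main__":
    elapsed = measure_recovery()
    verdict = "PASS" if elapsed <= RTO_TARGET_SECONDS else "FAIL"
    print(f"Recovered in {elapsed/60:.1f} min against a {RTO_TARGET_SECONDS/60:.0f} min target: {verdict}")
```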
Go multi-cloud where it matters
For high-risk workloads, spreading across different providers reduces systemic risk. Start with authentication, identity services, and payment processing. Don't try to go multi-cloud everywhere – focus on services where the cost of failure exceeds the cost of complexity.
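The usual pattern is to put the critical capability behind an interface with an ordered list of independent providers. Here's a sketch for something like token verification; the URLs are stand-ins for whatever vendors or regions you actually use.

```python
# Sketch of "multi-cloud where it matters": put one critical capability behind
# an ordered list of independent providers and fall through on failure.
# URLs are placeholders for whatever identity/auth/payment vendors you use.
import requests

PROVIDERS = [
    "https://auth.primary.example.com/verify",    # hosted on cloud provider A
    "https://auth.fallback.example.net/verify",   # independent infrastructure
]

def verify_token(token: str, timeout: float = 3.0) -> bool:
    last_error = None
    for url in PROVIDERS:
        try:
            resp = requests.post(url, json={"token": token}, timeout=timeout)
            if resp.status_code == 200:
                return bool(resp.json().get("valid", False))
            last_error = RuntimeError(f"{url} returned {resp.status_code}")
        except requests.RequestException as exc:   # unreachable provider: degrade
            last_error = exc
    raise RuntimeError("all auth providers failed") from last_error
```

The complexity cost is real, which is why this belongs only on the handful of services where an hour of downtime costs more than a year of maintaining the abstraction.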
Keep your senior engineers close
Technology alone won't save you. Create runbooks for regional failure scenarios, document historical incidents, and invest in senior engineers who understand system failure modes. Build redundancy in expertise, not just infrastructure.
AWS outages are inevitable – will you be ready?
Resilience used to be an engineering goal; now it's a market differentiator. Clients, partners, and investors all ask one question: "Can you stay online when others can't?"
The October 2025 AWS outage wasn't an anomaly. It was a predictable consequence of hyper-concentrated digital infrastructure combined with organizational hollowing-out at scale. The companies that came through with minimal impact weren't lucky – they had invested in resilience architecture, practiced their failover procedures, and built teams that could respond effectively when things went sideways.
Here's what I keep coming back to: the most critical question isn't whether another major outage will happen. It's whether your organization will be ready when it does. That answer determines whether you're building a resilient business or just hoping for the best.
I know which one I'd bet on.