AWS Disaster Recovery: Deep Dive into Service Behaviors
Disaster Recovery (DR) on AWS is commonly categorized into four strategies: Backup & Restore, Pilot Light, Warm Standby, and Active-Active. This guide details how individual AWS services behave under each strategy, with architectural examples and a deep dive into encryption handling across regions.
1. Disaster Recovery Techniques: Service Matrix
The table below summarizes how core services are utilized across the four DR strategies.
| Service | Backup & Restore | Pilot Light | Warm Standby | Active-Active |
|---|---|---|---|---|
| Route 53 | Manual DNS update to new endpoint. | DNS Failover with Health Checks. | DNS Failover with Health Checks. | Latency/Weighted routing. |
| Global Accelerator | New endpoint added to Accelerator. | Failover via Health Checks. | Traffic Dial adjusted (e.g., 90/10). | Active in both (Anycast IP). |
| ALB/ELB | Provisioned only during recovery. | Pre-provisioned in DR region (idle). | Always running, handling min traffic. | Always running, handling live traffic. |
| EC2 / ASG | Rebuild from AMIs. ASG=0. | ASG=0. Launch from copied AMIs. | ASG=Running (Smallest instance). | ASG=Running (Full scale). |
| RDS | Restore from Snapshots. | Cross-Region Read Replica. | Cross-Region Read Replica. | Global Database (Aurora). |
| EBS | Restore from Snapshots. | Snapshots replicated & updated. | Snapshots replicated & updated. | Native replication (via app). |
| S3 | Periodic manual copies. | Cross-Region Replication (CRR). | Cross-Region Replication (CRR). | Bi-directional replication. |
| SQS | New queue in DR region. | New queue, logic handles failover. | Pre-provisioned, idle. | Regional queues, app synchronizes. |
2. Deep Dive: Data Replication Mechanisms
Understanding how data moves between regions is critical for setting RPO (Recovery Point Objective).
A. Native Replication (Service-Specific)
Native replication refers to features built directly into the AWS service to handle data movement automatically.
- Aurora Global Database: Uses a dedicated replication infrastructure in the storage layer. It typically provides an RPO of about 1 second and an RTO of under 1 minute.
- DynamoDB Global Tables: Provides multi-active (multi-master), multi-region replication. Every write in one region is automatically propagated to all other regions, usually within a second.
- RDS Cross-Region Read Replicas: Uses the database engine's native asynchronous replication (e.g., MySQL binary logs or PostgreSQL streaming replication) to keep a secondary copy in another region. Because replication is asynchronous, RPO equals the replica lag at the moment of failure.
B. Cross-Region Replication (CRR) (Storage Layer)
Primarily associated with Amazon S3.
- How it works: When an object is uploaded to a source bucket, S3 asynchronously replicates it to a destination bucket in a different region.
- Requirements: Versioning must be enabled on both buckets.
- Use Case: Used for Pilot Light and Warm Standby to ensure configuration files, installation packages, and backups are ready in the DR region.
- S3 RTC (Replication Time Control): An optional feature with an SLA-backed commitment that 99.9% of objects are replicated within 15 minutes.
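To make the CRR and RTC settings above concrete, here is a minimal sketch that builds the replication configuration for the boto3 S3 call `put_bucket_replication`. The rule ID, the function names, and the injected `s3_client` are illustrative assumptions; the IAM role and both versioned buckets are assumed to exist already.

```python
from typing import Any, Dict


def build_crr_config(role_arn: str, dest_bucket_arn: str,
                     enable_rtc: bool = True) -> Dict[str, Any]:
    """Build a ReplicationConfiguration dict for put_bucket_replication."""
    destination: Dict[str, Any] = {"Bucket": dest_bucket_arn}
    if enable_rtc:
        # RTC uses a 15-minute window and requires replication metrics.
        destination["ReplicationTime"] = {
            "Status": "Enabled", "Time": {"Minutes": 15}}
        destination["Metrics"] = {
            "Status": "Enabled", "EventThreshold": {"Minutes": 15}}
    return {
        "Role": role_arn,
        "Rules": [{
            "ID": "dr-crr-rule",          # hypothetical rule name
            "Priority": 1,
            "Status": "Enabled",
            "Filter": {},                 # empty filter: replicate all objects
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": destination,
        }],
    }


def apply_crr(s3_client: Any, source_bucket: str,
              config: Dict[str, Any]) -> None:
    """Attach the configuration to the (versioned) source bucket.

    s3_client is assumed to be boto3.client("s3") for the source region.
    """
    s3_client.put_bucket_replication(
        Bucket=source_bucket, ReplicationConfiguration=config)
```

Keeping the configuration builder separate from the API call makes the rule easy to review and unit test before it touches a real bucket.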
C. Bi-directional Replication (Active-Active)
Data is synchronized in both directions, allowing users to write to either region.
- Mechanism: S3 supports two-way replication. DynamoDB Global Tables are natively bi-directional.
- Conflict Resolution: Typically follows "Last Writer Wins," but application logic must be careful with concurrent updates to the same record in different regions.
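The "Last Writer Wins" rule can be illustrated with a small, service-independent sketch. The `VersionedItem` type and the region-name tie-break are assumptions made for illustration only; they do not describe how any AWS service resolves ties internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedItem:
    key: str
    value: str
    updated_at_ms: int  # write timestamp in milliseconds
    region: str         # region that accepted the write


def resolve_last_writer_wins(a: VersionedItem, b: VersionedItem) -> VersionedItem:
    """Keep the version with the newest timestamp and discard the other.

    Ties are broken by region name so the result is deterministic
    (an illustrative assumption).
    """
    if a.key != b.key:
        raise ValueError("can only resolve versions of the same key")
    return max(a, b, key=lambda item: (item.updated_at_ms, item.region))


# Two regions update the same record within the same second.
east = VersionedItem("cart#42", "2 items", 1_700_000_000_100, "us-east-1")
west = VersionedItem("cart#42", "3 items", 1_700_000_000_250, "us-west-2")
winner = resolve_last_writer_wins(east, west)
```

Here the us-east-1 write is silently dropped, which is why concurrent updates to the same record need care: if both writes matter, the application has to merge them itself or route all writes for a given key to one region.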
3. The "Message Gap": How SQS Data Handles a Disaster
A common question in DR is: "What happens to the messages currently sitting in my SQS queue if the region fails?"
The Hard Truth: SQS is a Regional Service
Standard and FIFO SQS queues are Regional. They do not natively replicate messages to another region.
- In a Disaster: If Region A (Primary) goes completely offline, any messages currently in the "In-Flight" or "Available" state in that region are trapped. They cannot be accessed until Region A comes back online.
How to handle this:
- Pilot Light / Warm Standby:
- During failover, your application in the DR region starts using a fresh, empty SQS queue.
- Result: You might experience a temporary "gap" where the messages from the primary region haven't been processed. Once the primary region recovers, you can run a "cleanup" consumer to process those old messages.
- Active-Active with Zero Data Loss:
- App-Level Dual-Write: The producer application writes the message to SQS queues in both Region A and Region B.
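- Idempotency: The consumer must be idempotent, deduplicating on a unique ID that the producer embeds in the message itself. The SQS-assigned MessageId cannot be used for this, because each queue assigns its own MessageId to its copy of the message. That way, if the consumer receives the same message from both regions, it only processes it once.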
- External Sync (RabbitMQ/Kafka):
- If using RabbitMQ on EC2, you can use the Shovel or Federation plugin to continuously "move" messages from the Primary to the Standby region, ensuring the Standby queue is never empty.
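The dual-write and idempotency pattern described above can be sketched as follows. The function names, the `event_id` field, and the in-memory `seen` set are illustrative assumptions; the SQS clients are assumed to be `boto3.client("sqs", region_name=...)` objects passed in by the caller, and a production consumer would keep the deduplication state in a shared durable store rather than in memory.

```python
import json
import uuid
from typing import Any, Callable, Dict, Iterable, Set


def dual_write(sqs_clients: Dict[str, Any], queue_urls: Dict[str, str],
               payload: dict) -> str:
    """Send one logical message to the queue in every region.

    Returns the application-level event_id shared by all copies.
    """
    event_id = str(uuid.uuid4())
    body = json.dumps({"event_id": event_id, "payload": payload})
    for region, client in sqs_clients.items():
        # Each queue assigns its own MessageId, so the dedup key
        # travels inside the body instead.
        client.send_message(QueueUrl=queue_urls[region], MessageBody=body)
    return event_id


def process_once(raw_bodies: Iterable[str], seen: Set[str],
                 handler: Callable[[dict], None]) -> int:
    """Run handler once per event_id; return how many messages were handled."""
    processed = 0
    for raw in raw_bodies:
        message = json.loads(raw)
        if message["event_id"] in seen:
            continue  # duplicate copy from the other region
        handler(message["payload"])
        seen.add(message["event_id"])  # mark only after successful handling
        processed += 1
    return processed
```

Note that `dual_write` raises if either region rejects the send, so the producer still needs a policy for partial failure, such as retrying or accepting a single-region write during an outage.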
4. Detailed Strategy Workflows & Diagrams
A. Backup and Restore
Infrastructure is non-existent in the DR region until a disaster occurs.
graph LR
subgraph "Primary Region"
App1[App Server] --> DB1[(Primary DB)]
DB1 --> S3_1[S3 Backup]
end
S3_1 -.->|Daily Copy| S3_2[S3 DR Region]
subgraph "DR Region (Disaster Occurs)"
S3_2 --> DB2[(Restore DB)]
DB2 --> App2[Spin up App]
end
- RDS: Snapshots are taken daily and copied to the DR region. Restore takes time proportional to DB size.
- EBS: Snapshots are copied. Data loss = time since last snapshot.
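The statement "data loss = time since last snapshot" can be turned into a small worked calculation. This sketch assumes snapshots are taken at a fixed interval and that a snapshot is only usable in the DR region once its cross-region copy has finished; the function name and the example figures are illustrative.

```python
from datetime import timedelta


def worst_case_rpo(snapshot_interval: timedelta,
                   copy_duration: timedelta) -> timedelta:
    """Worst-case data loss for Backup & Restore.

    The worst moment for a disaster is just before the newest snapshot
    finishes copying: the DR region then only holds the previous snapshot,
    which is one full interval plus one copy duration old.
    """
    return snapshot_interval + copy_duration


# Daily snapshots whose cross-region copy takes about two hours.
rpo = worst_case_rpo(timedelta(hours=24), timedelta(hours=2))
```

With these example figures the worst case is 26 hours of lost writes, noticeably more than the 24 hours the snapshot schedule alone suggests.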
B. Pilot Light
Data is live, but application servers are "off".
graph LR
subgraph "Primary Region"
R53[Route 53] --> ALB1[ALB]
ALB1 --> ASG1[EC2 ASG]
ASG1 --> RDS1[(Master RDS)]
end
RDS1 -->|Async Replication| RDS2[(RDS Read Replica)]
subgraph "DR Region"
RDS2
ASG2[ASG Capacity 0]
end
R53 -.->|Failover| ASG2
- RDS: Promote Read Replica to Master.
- EC2: Update the ASG DesiredCapacity from 0 to N.
- ALB: Pre-provisioned to avoid "warm-up" delays.
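The Pilot Light failover steps above can be sketched as a short runbook function. The function and parameter names are illustrative assumptions; `rds_client` and `asg_client` are assumed to be `boto3.client("rds")` and `boto3.client("autoscaling")` objects created for the DR region.

```python
from typing import Any


def pilot_light_failover(rds_client: Any, asg_client: Any,
                         replica_id: str, asg_name: str,
                         desired_capacity: int, max_size: int) -> None:
    """Promote the DR read replica, then scale the DR ASG up from zero."""
    if desired_capacity > max_size:
        raise ValueError("desired_capacity must not exceed max_size")

    # Step 1: promotion detaches the replica from its source and makes it
    # writable. The API call returns before the promotion has completed.
    rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)

    # Step 2: raise MaxSize together with DesiredCapacity, since a pilot
    # light group may be capped at 0 and Auto Scaling rejects a desired
    # capacity above MaxSize.
    asg_client.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MaxSize=max_size,
        DesiredCapacity=desired_capacity,
    )
```

A real runbook would also wait for the promoted instance to become available and point the application at the new writer endpoint before shifting traffic; those steps are omitted here.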
C. Warm Standby
A "shadow" environment is always running at minimal scale.
graph LR
subgraph "Primary Region"
R53[Route 53] --> ALB1[ALB]
ALB1 --> ASG1["EC2 (10 Instances)"]
ASG1 --> RDS1[(Master RDS)]
end
RDS1 -->|Async Replication| RDS2[(RDS Read Replica)]
subgraph "DR Region"
ALB2[ALB] --> ASG2["EC2 (1 Instance)"]
ASG2 --> RDS2
end
R53 -.->|Health Check Fail| ALB2
- EC2: Smallest instance type (e.g., t3.micro) keeps the app alive.
- RTO: Seconds to Minutes (Scale up via Auto Scaling).
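The Route 53 failover shown in the diagram can be expressed as a pair of failover alias records. This sketch builds the change batch for the boto3 Route 53 call `change_resource_record_sets`; the function names, the `SetIdentifier` values, and every DNS name or zone ID passed in are hypothetical.

```python
from typing import Any, Dict


def failover_record(name: str, role: str, alb_dns_name: str,
                    alb_zone_id: str) -> Dict[str, Any]:
    """Build one UPSERT change for a failover alias record pointing at an ALB."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-alb",
            "Failover": role,
            "AliasTarget": {
                "HostedZoneId": alb_zone_id,   # the ALB's own hosted zone ID
                "DNSName": alb_dns_name,
                # Let Route 53 use the ALB's target health for failover.
                "EvaluateTargetHealth": True,
            },
        },
    }


def build_failover_batch(name: str, primary: Dict[str, str],
                         secondary: Dict[str, str]) -> Dict[str, Any]:
    """Combine the PRIMARY and SECONDARY records into one ChangeBatch."""
    return {
        "Comment": "Warm standby failover records",
        "Changes": [
            failover_record(name, "PRIMARY", primary["dns"], primary["zone"]),
            failover_record(name, "SECONDARY", secondary["dns"], secondary["zone"]),
        ],
    }


def apply_batch(route53_client: Any, hosted_zone_id: str,
                batch: Dict[str, Any]) -> None:
    """route53_client is assumed to be boto3.client("route53")."""
    route53_client.change_resource_record_sets(
        HostedZoneId=hosted_zone_id, ChangeBatch=batch)
```

With both records in place, Route 53 answers with the primary ALB while it is healthy and switches to the DR ALB when it is not, subject to DNS caching on the client side.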
D. Active-Active (Multi-Site)
Traffic is served from both regions simultaneously.
graph TD
User((User)) --> R53[Route 53 Latency Routing]
R53 --> RegionA[Region A - Active]
R53 --> RegionB[Region B - Active]
subgraph "Data Sync"
DBA[(Aurora Global DB)] <--> DBB[(Aurora Global DB)]
S3A[S3 CRR] <--> S3B[S3 CRR]
end
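- DynamoDB: Use Global Tables for multi-active (multi-master) writes.
- Aurora: Use Global Databases for typically < 1s cross-region replication.
- Route 53: Latency-based routing directs each user to the region with the lowest network latency, which is usually, but not always, the geographically closest one.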
5. Deep Dive: Encryption & Key Management (KMS)
Can we copy a KMS encryption key to a different region?
The short answer is: NO for standard keys, YES for Multi-Region Keys.
1. Standard (Single-Region) Keys
- Behavior: Standard KMS keys are physically locked to the region where they were created. They cannot be exported or copied.
- DR Handling:
- To move an encrypted resource (EBS Snapshot, RDS Snapshot, or S3 Object), you must re-encrypt the data during the copy process.
- Example (EBS Copy):
aws ec2 copy-snapshot --region us-west-2 --source-region us-east-1 --source-snapshot-id snap-123 --encrypted --kms-key-id <Key_in_West_2>
- The command is sent to the destination region (--region us-west-2), which is where the new snapshot is created. The copy process decrypts the data in the source region using the source key and re-encrypts it in the destination region using the destination key.
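The same cross-region copy with re-encryption can be sketched in Python around the boto3 EC2 call `copy_snapshot`. The function name is an illustrative assumption, and `dest_ec2_client` is assumed to be `boto3.client("ec2", region_name=<destination region>)`, because the copy request is issued in the destination region.

```python
from typing import Any


def copy_snapshot_reencrypted(dest_ec2_client: Any, source_region: str,
                              source_snapshot_id: str,
                              dest_kms_key_id: str) -> str:
    """Copy an EBS snapshot into the client's region under a local KMS key.

    Returns the ID of the new snapshot in the destination region.
    """
    response = dest_ec2_client.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=source_snapshot_id,
        Encrypted=True,
        # Must be a key that lives in the destination region.
        KmsKeyId=dest_kms_key_id,
        Description=f"DR copy of {source_snapshot_id} from {source_region}",
    )
    return response["SnapshotId"]
```

The caller needs permission to use both keys: decrypt on the source key and encrypt on the destination key.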
2. Multi-Region Keys
- Behavior: These allow you to create a "Primary" key in one region and "Replica" keys in other regions. These replicas share the same Key ID and Key Material.
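- DR Benefit: Ciphertext produced under the key in us-east-1 can be decrypted with the replica key in us-west-2, with no re-encryption and no cross-region KMS call, because both keys share the same key material. Note that EBS volumes cannot be attached across regions, and many integrated services (including EBS snapshot copies) still treat Multi-Region keys like single-region keys and re-encrypt data when it moves between regions, so the benefit is largest for client-side or application-level encryption.
- Workflow: Create the Primary key in us-east-1 -> Replicate to us-west-2. Both keys share the same key ID, their ARNs differ only in the region component, and each can decrypt ciphertext produced by the other.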
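The create-then-replicate workflow can be sketched with the boto3 KMS calls `create_key` and `replicate_key`. The function name and the returned dictionary layout are illustrative assumptions; `primary_kms_client` is assumed to be `boto3.client("kms")` for the primary key's region, which is where the replication request is made.

```python
from typing import Any, Dict


def create_multi_region_key(primary_kms_client: Any, replica_region: str,
                            description: str = "DR multi-Region key") -> Dict[str, str]:
    """Create a multi-Region primary key and replicate it to one other region."""
    created = primary_kms_client.create_key(
        Description=description,
        MultiRegion=True,  # must be chosen at creation; cannot be changed later
    )
    key_id = created["KeyMetadata"]["KeyId"]

    # The replica gets the same key ID and key material in replica_region.
    replica = primary_kms_client.replicate_key(
        KeyId=key_id, ReplicaRegion=replica_region)

    return {
        "key_id": key_id,
        "primary_arn": created["KeyMetadata"]["Arn"],
        "replica_arn": replica["ReplicaKeyMetadata"]["Arn"],
    }
```

The replica has its own key policy, so access in the DR region has to be granted there separately.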
6. Service-Specific DR Details (The "How-To")
🏘️ AWS Global Accelerator
- Global Service: Unlike ALB (Regional), Global Accelerator is a global network service.
- Anycast IP: Provides two static anycast IP addresses that act as a single entry point for your application globally.
- DR Advantage:
- Speed: Failover happens within seconds because it doesn't rely on DNS TTL (Time-to-Live).
- Traffic Dials: You can use "Traffic Dials" to control the percentage of traffic directed to a specific region (ideal for Warm Standby testing).
- Endpoint Weights: Fine-grained control over how traffic is split among the endpoints within a single endpoint group (region).
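Adjusting a traffic dial can be sketched with the boto3 Global Accelerator call `update_endpoint_group`. The function name is an illustrative assumption; `ga_client` is assumed to be `boto3.client("globalaccelerator", region_name="us-west-2")`, since the Global Accelerator API is served from that region regardless of where the endpoints live.

```python
from typing import Any


def set_traffic_dial(ga_client: Any, endpoint_group_arn: str,
                     percentage: float) -> None:
    """Set the share of traffic an endpoint group (one region) accepts.

    The dial applies to traffic that would otherwise be routed to this
    group: 0 drains the region, 100 lets it take its full share.
    """
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    ga_client.update_endpoint_group(
        EndpointGroupArn=endpoint_group_arn,
        TrafficDialPercentage=percentage,
    )
```

For a Warm Standby test, one might dial the DR endpoint group up to a small percentage, observe behavior, and return it to its previous value afterwards.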
🛡️ Elastic Block Store (EBS)
- Backup/Restore: Manual snapshots.
- Pilot/Warm: Use Data Lifecycle Manager (DLM) to automate snapshot creation and cross-region copy.
- Re-encryption: Essential if using Single-Region KMS. DLM can handle this automatically.
🗄️ Relational Database Service (RDS)
- Snapshot Copying: Like EBS, RDS snapshots can be copied cross-region and re-encrypted.
- Read Replicas: The best for Pilot Light/Warm Standby.
- Encrypted DB? The Read Replica MUST also be encrypted.
- If the source instance is encrypted with KMS, you must specify a KMS key from the destination region when creating the cross-region replica, because KMS keys are regional.
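Creating an encrypted cross-region read replica can be sketched with the boto3 RDS call `create_db_instance_read_replica`. The function name is an illustrative assumption; `dest_rds_client` is assumed to be `boto3.client("rds", region_name=<DR region>)`, and the source instance is referenced by its full ARN because it lives in another region.

```python
from typing import Any


def create_encrypted_cross_region_replica(dest_rds_client: Any,
                                          source_db_arn: str,
                                          replica_id: str,
                                          source_region: str,
                                          dest_kms_key_id: str) -> str:
    """Create a read replica in the client's region from an encrypted source.

    Returns the identifier of the new replica instance.
    """
    response = dest_rds_client.create_db_instance_read_replica(
        DBInstanceIdentifier=replica_id,
        SourceDBInstanceIdentifier=source_db_arn,
        SourceRegion=source_region,
        # Key from the destination region; the source key is not usable here.
        KmsKeyId=dest_kms_key_id,
    )
    return response["DBInstance"]["DBInstanceIdentifier"]
```

Instance class, subnet group, and security groups are left at their defaults in this sketch and would normally be set explicitly for the DR region.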
📨 Messaging: SQS & RabbitMQ
- SQS: Being a regional service, if a region goes down, the messages in that queue are inaccessible (not deleted) until the region recovers.
- DR Tip: Apps should perform "Dual-Writes" to two regions or use an abstraction layer if zero message loss is required.
- RabbitMQ: Use the Shovel plugin to continuously move or copy messages from a primary cluster to a standby cluster in another region.
🌐 CloudFront
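- Origin Failover: Configure CloudFront with an Origin Group containing a primary and a secondary origin. If the primary ALB (Region A) returns one of the failover status codes you configure (for example 500, 502, 503, or 504), or the connection fails or times out, CloudFront automatically retries the request against the secondary ALB (Region B). Origin failover applies to GET, HEAD, and OPTIONS requests, so write traffic needs a separate failover path. This is a powerful, low-latency DR mechanism for the frontend.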
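The Origin Group described above corresponds to the `OriginGroups` section of a CloudFront distribution configuration. The sketch below builds that section as a plain dictionary; the function name, group ID, and origin IDs are hypothetical, and both origins are assumed to be defined elsewhere in the same distribution config.

```python
from typing import Any, Dict, Sequence


def build_origin_groups(group_id: str, primary_origin_id: str,
                        secondary_origin_id: str,
                        status_codes: Sequence[int] = (500, 502, 503, 504),
                        ) -> Dict[str, Any]:
    """Build the OriginGroups section of a CloudFront DistributionConfig."""
    codes = list(status_codes)
    return {
        "Quantity": 1,
        "Items": [{
            "Id": group_id,
            # Responses with these status codes trigger a retry on the secondary.
            "FailoverCriteria": {
                "StatusCodes": {"Quantity": len(codes), "Items": codes},
            },
            # Order matters: the first member is the primary origin.
            "Members": {
                "Quantity": 2,
                "Items": [
                    {"OriginId": primary_origin_id},
                    {"OriginId": secondary_origin_id},
                ],
            },
        }],
    }


origin_groups = build_origin_groups("alb-failover", "alb-region-a", "alb-region-b")
```

A cache behavior then uses the group ID as its target origin in place of a single origin ID, which is what routes requests through the failover logic.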