AWS Design Patterns

Ten battle-tested AWS architecture patterns for common system design challenges. Each pattern includes the problem, the AWS solution, and the rationale for why it works.

1. High-Traffic E-Commerce Platform

Problem

An online retailer expects a 10x traffic surge during Black Friday. The current single-server setup crashes at 5K concurrent users. They need 99.99% uptime, sub-200ms response times, and the ability to handle 100K+ concurrent users.

Solution

Services: Application Load Balancer (ALB) + Auto Scaling Group (ASG) + RDS Multi-AZ + ElastiCache + CloudFront + S3 + Route53

Route53 ──► CloudFront ──► ALB ──► Auto Scaling Group (EC2)
                              ├──► ElastiCache (Redis)
                              └──► RDS Multi-AZ (Primary/Standby)

Why This Solution

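ALB distributes incoming traffic across an Auto Scaling Group that adds or removes EC2 instances based on metrics such as average CPU utilization or ALB request count per target. Memory-based scaling is possible, but only after publishing memory as a custom CloudWatch metric, since EC2 does not report it by default. RDS Multi-AZ keeps a synchronous standby replica in a different Availability Zone and fails over to it automatically, which provides database high availability. ElastiCache (Redis) serves read-heavy product catalog queries from memory, which can remove most of the read load from the database when the cache hit rate is high. CloudFront caches static assets (images, CSS, JS) at 600+ edge locations worldwide, cutting latency for distant users. Route53 provides DNS and health-checked routing for the site; latency-based routing to the closest healthy region applies only if the stack is later deployed in more than one region. The architecture is cost-effective because the Auto Scaling Group adds compute only during spikes and releases it afterwards. The load balancer, database, cache, and CDN are managed services; the EC2 instances in the Auto Scaling Group still need OS patching, typically by rolling out updated AMIs.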
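The catalog caching described above is usually implemented as a cache-aside read. The following is a minimal sketch, assuming an in-memory `TTLCache` class as a stand-in for Redis; `TTLCache`, `get_product`, and the `product:` key prefix are hypothetical names chosen for illustration, not part of any AWS or Redis API.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for a Redis cache (hypothetical; not a Redis client)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: dict) -> None:
        self._items[key] = (self._clock() + self._ttl, value)


def get_product(product_id: str, cache: TTLCache,
                load_from_db: Callable[[str], dict]) -> dict:
    """Cache-aside read: try the cache, fall back to the database, then populate."""
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    product = load_from_db(product_id)  # only reached on a cache miss
    cache.set(key, product)
    return product
```

With this shape, the database is queried once per product per TTL period, no matter how many requests arrive in between. A short TTL bounds how stale a price or stock level can be.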

2. Serverless Image Processing Pipeline

Problem

A mobile app lets users upload photos which need resizing, format conversion, and NSFW detection. Processing 10K images per day currently takes hours on a single EC2 instance and requires constant monitoring.

Solution

Services: S3 (trigger) + Lambda + SQS + DynamoDB + Rekognition + S3 (output)

S3 Upload ──► Lambda (resize) ──► SQS ──► Lambda (NSFW check) ──► S3 Output
                  │                        │
                  └──► DynamoDB (state)    └──► Rekognition (moderation)

Why This Solution

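Event-driven architecture using S3 event notifications to trigger Lambda functions eliminates idle server time. SQS decouples the resize and NSFW processing stages, providing resilience: if the consuming Lambda fails, the message becomes visible again after its visibility timeout and is retried, and a dead-letter queue can capture messages that keep failing. DynamoDB tracks per-image processing state, which enables idempotent processing. This matters because both S3 event notifications and standard SQS queues deliver at least once, so the same image can arrive twice. Amazon Rekognition provides pre-trained content moderation APIs with no ML expertise required. The entire pipeline is serverless: no EC2 instances to manage, no cluster to size, and it scales from zero to thousands of concurrent invocations automatically, subject to account concurrency limits. Pricing is pay-per-request, making it ideal for spiky or unpredictable workloads. Writing results to a different bucket or prefix than the one that triggers the first Lambda avoids an accidental invocation loop.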
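The idempotent state tracking mentioned above can be sketched as follows. This is an illustration under stated assumptions: `StateTable` is an in-memory stand-in for a DynamoDB table, its `claim` method imitates a conditional write, and `handle_s3_event` and `resize` are hypothetical names. Only the event shape (`Records[].s3.bucket.name`, `Records[].s3.object.key`) follows the S3 notification format.

```python
from typing import Callable, Dict, List
from urllib.parse import unquote_plus


class StateTable:
    """In-memory stand-in for a DynamoDB table keyed by image ID (hypothetical)."""

    def __init__(self) -> None:
        self._items: Dict[str, str] = {}

    def claim(self, image_id: str) -> bool:
        """Record the image as PROCESSING only if it has no state yet.

        Imitates a conditional write: returns False when the item already exists.
        """
        if image_id in self._items:
            return False
        self._items[image_id] = "PROCESSING"
        return True

    def mark_done(self, image_id: str) -> None:
        self._items[image_id] = "DONE"


def handle_s3_event(event: dict, table: StateTable,
                    resize: Callable[[str, str], None]) -> List[str]:
    """Process each uploaded object once, skipping duplicate deliveries."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        image_id = f"{bucket}/{key}"
        if not table.claim(image_id):
            continue  # already seen: duplicate event, do nothing
        resize(bucket, key)
        table.mark_done(image_id)
        processed.append(image_id)
    return processed
```

A production version would also need a way to reclaim items stuck in PROCESSING after a crash, for example a timestamp on the claim; that is left out here to keep the sketch short.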

3. Real-Time Analytics Dashboard

Problem

A SaaS platform needs to process 50K events per second from user interactions and surface real-time dashboards to customers. The current batch processing approach using nightly ETL jobs is too slow — data is always 24 hours stale.

Solution

Services: Kinesis Data Streams + Lambda + Kinesis Analytics + S3 + QuickSight + ElastiCache

User Events ──► Kinesis Data Streams ──► Lambda (enrichment)
                                │
                    ┌───────────┴───────────┐
                    ▼                       ▼
            Kinesis Analytics         S3 (raw archive)
                    │                       │
                    ▼                       ▼
            ElastiCache (hot)        QuickSight (SPICE)

Why This Solution

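Kinesis Data Streams captures streaming data durably with a default retention of 24 hours (extendable up to 365 days), allowing multiple consumers to process the same stream. Lambda functions enrich and transform records in near real time, usually within about a second of arrival. Kinesis Data Analytics computes windowed aggregations on the stream (e.g., "users in last 5 minutes"); its Apache Flink offering is now named Amazon Managed Service for Apache Flink, and new applications should target it rather than the legacy SQL runtime. ElastiCache holds the resulting hot metrics and serves them with sub-millisecond reads, so the live dashboard reads from the cache rather than from the stream. QuickSight with SPICE (Super-fast, Parallel, In-memory Calculation Engine) serves interactive dashboards over the S3 archive; SPICE datasets are refreshed on a schedule, so this path suits historical analysis rather than live metrics. The hot path delivers event-to-dashboard latency on the order of seconds, compared to the 24 hours of staleness of the nightly batch ETL.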
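To make the windowed aggregation concrete, here is a minimal sketch of a distinct-user count over 5-minute tumbling windows. It is plain Python, not a streaming-engine API: the function name, the `(epoch_seconds, user_id)` event shape, and the choice of tumbling rather than sliding windows are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

WINDOW_SECONDS = 300  # 5-minute tumbling windows


def active_users_per_window(events: Iterable[Tuple[int, str]]) -> Dict[int, int]:
    """Count distinct users in each 5-minute window.

    `events` yields (epoch_seconds, user_id) pairs; the result maps each
    window's start time to the number of distinct users seen in it.
    """
    users: Dict[int, Set[str]] = defaultdict(set)
    for timestamp, user_id in events:
        # Align the timestamp down to the start of its window.
        window_start = timestamp - (timestamp % WINDOW_SECONDS)
        users[window_start].add(user_id)
    return {start: len(ids) for start, ids in sorted(users.items())}
```

A streaming engine performs the same grouping incrementally and emits each window's count once the window closes; that result is the value that would be written to the hot cache.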

4. Multi-Region Disaster Recovery

Problem

A financial services company requires a Recovery Point Objective (RPO) of 5 minutes and a Recovery Time Objective (RTO) of 15 minutes across AWS regions. A single-region deployment risks total service outage during a regional failure.

Solution

Services: Route53 (failover) + Aurora Global Database + S3 CRR + DynamoDB Global Tables + CloudFront + Global Accelerator

Route53 (active-passive failover)
    │
    ├── Primary Region (us-east-1)
    │   ├── Aurora Global (writer)
    │   ├── DynamoDB Global Tables (replica)
    │   └── S3 (source)
    │
    └── Secondary Region (eu-west-1)
        ├── Aurora Global (reader, promote on failover)
        ├── DynamoDB Global Tables (multi-master)
        └── S3 CRR (destination)

Why This Solution

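Aurora Global Database replicates data across regions at the storage layer with a typical lag of under 1 second, so replication places little load on the primary database instances. DynamoDB Global Tables provide active-active multi-region replication with last-writer-wins conflict resolution. S3 Cross-Region Replication (CRR) asynchronously copies assets to the secondary region; S3 Replication Time Control adds a 15-minute replication SLA, so S3 objects should not hold the only copy of data that falls under the 5-minute RPO. Route53 health checks monitor the primary region's endpoints and automatically answer DNS queries with the secondary region on failure; a low record TTL (for example, 60 seconds) limits how long clients keep resolving to the failed region. Global Accelerator provides static anycast IP addresses that route traffic to the closest healthy endpoint without waiting for DNS caches to expire. Traffic failover is automatic, but promoting the secondary Aurora cluster to writer during an unplanned regional outage is an explicit operation that an operator or automation must trigger. Scripting and rehearsing that step is what keeps end-to-end recovery inside the 15-minute RTO.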
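Whether the design meets the 15-minute RTO depends on the sum of several delays, which can be budgeted explicitly. The sketch below is a rough worst-case estimate, not a measurement: the class and field names are hypothetical, and the database promotion and application warm-up times are assumed values that would have to come from failover drills.

```python
from dataclasses import dataclass


@dataclass
class FailoverBudget:
    health_check_interval_s: int  # seconds between Route53 health checks
    failure_threshold: int        # consecutive failures before "unhealthy"
    dns_ttl_s: int                # TTL on the failover record
    db_promotion_s: int           # assumed time to promote the Aurora secondary
    app_warmup_s: int             # assumed time for the standby app tier to take load

    def detection_s(self) -> int:
        """Time for health checks to declare the primary unhealthy."""
        return self.health_check_interval_s * self.failure_threshold

    def total_s(self) -> int:
        # Worst case: steps run one after another and clients hold DNS for a full TTL.
        return (self.detection_s() + self.dns_ttl_s
                + self.db_promotion_s + self.app_warmup_s)

    def meets_rto(self, rto_s: int) -> bool:
        return self.total_s() <= rto_s


if __name__ == "__main__":
    budget = FailoverBudget(health_check_interval_s=30, failure_threshold=3,
                            dns_ttl_s=60, db_promotion_s=300, app_warmup_s=120)
    print(budget.total_s(), budget.meets_rto(15 * 60))
```

With these assumed inputs the budget is 90 + 60 + 300 + 120 = 570 seconds, which fits inside the 900-second RTO. The margin shrinks quickly if promotion is manual and someone has to be paged first.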

5. Microservices with API Gateway

Problem

A legacy monolith needs to be decomposed into microservices. The new architecture must handle authentication, rate limiting, API versioning, and service discovery without adding operational complexity.

Solution

Services: API Gateway + ECS Fargate + Cloud Map + ALB + RDS + ElastiCache + Cognito

Cognito ──► API Gateway ──► ALB ──► ECS Fargate (service A)
  (auth)      (rate limit,    │         ECS Fargate (service B)
               versioning)    │         ECS Fargate (service C)
                              │
                              ├──► Cloud Map (service discovery)
                              ├──► ElastiCache
                              └──► RDS (Aurora)

Why This Solution

API Gateway handles cross-cutting concerns before traffic reaches compute: authentication via Cognito authorizers, rate limiting with usage plans and throttling settings, API versioning via stages and stage variables, and request validation. It reaches the private ALB through a VPC link. ECS Fargate runs containers without provisioning or managing EC2 instances; each microservice scales independently through Service Auto Scaling policies driven by its own CloudWatch metrics. Cloud Map provides service discovery with health checks so services find each other via DNS or HTTP APIs. For data, each service either owns its own Aurora database (database-per-service) or its own schema within a shared cluster (schema-per-service); RDS Proxy pools connections in either case. Each microservice can be deployed, scaled, and updated independently, enabling team autonomy.
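API Gateway's throttling is documented as a token bucket with a steady-state rate and a burst size. The following is a minimal sketch of that algorithm for intuition only; the `TokenBucket` class is hypothetical and is not how the service is configured or implemented internally.

```python
class TokenBucket:
    """Token bucket: `rate` tokens refill per second, up to `burst` tokens."""

    def __init__(self, rate: float, burst: int, now: float = 0.0) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start full, so an initial burst is allowed
        self.updated = now

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (seconds) is admitted."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # a gateway would answer HTTP 429 Too Many Requests here
```

The burst size sets how many requests can arrive at once, while the rate sets the sustained throughput. A client that exceeds both is rejected until tokens refill.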

6. Event-Driven Data Lake

Problem

A company collects data from 20+ sources including CRM, ERP, social media, and IoT sensors. Current data silos make cross-source analytics impossible. They need a unified, queryable platform without managing Hadoop clusters.

Solution

Services: S3 (data lake) + Glue + EventBridge + Lambda + Athena + Lake Formation + QuickSight

Sources (20+) ──► EventBridge ──► Lambda (transform) ──► S3 (raw zone)
                                     │
                                     ▼
                               Glue (crawl & catalog)
                                     │
                                     ▼
                          Athena (serverless SQL queries)
                                     │
                                     ▼
                          QuickSight (dashboards)

Why This Solution

Amazon EventBridge receives events from AWS services, SaaS partners, and custom applications, and applies filtering and routing rules. Lambda functions transform the data and land it in S3, ideally in a columnar format (Parquet/ORC). AWS Glue crawlers discover schema changes and update the Data Catalog with table definitions. Athena provides serverless SQL querying: there are no clusters to provision, and you pay only for data scanned, which columnar formats and partitioning reduce. Lake Formation centralizes column-level and row-level security policies across analytics services. This removes the operational burden of managing an Apache Hadoop or Spark cluster while providing a unified view across all data sources. Where ACID transactions are needed, they come from an open table format such as Apache Iceberg, which Athena can query and write.
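Because Athena bills by data scanned, the S3 key layout chosen by the transform step matters. Below is a minimal sketch of a Hive-style partitioned key builder, a layout that Glue crawlers can register as partitions. The `raw/` prefix, the partition names, and the function name are assumptions for illustration; the source does not specify a layout.

```python
from datetime import datetime, timezone


def raw_zone_key(source: str, event_time: datetime, event_id: str) -> str:
    """Build a Hive-style partitioned S3 key for one transformed record batch.

    `event_time` should be timezone-aware; it is normalized to UTC so that
    all sources agree on which day partition a record belongs to.
    """
    t = event_time.astimezone(timezone.utc)
    return (
        f"raw/source={source}/year={t:%Y}/month={t:%m}/day={t:%d}/"
        f"{event_id}.parquet"
    )


if __name__ == "__main__":
    when = datetime(2024, 5, 17, 23, 30, tzinfo=timezone.utc)
    print(raw_zone_key("crm", when, "evt-001"))
```

A query that filters on `source` and `day` then reads only the matching prefixes instead of the whole raw zone.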

7. CI/CD Pipeline for Containers

Problem

Developers push code 50 times per day. Manual deployments take 2 hours, frequently introduce errors, and lack a reliable rollback mechanism. The team needs automated testing, building, and deployment.

Solution

Services: CodeCommit + CodeBuild + ECR + CodePipeline + ECS Fargate + CodeDeploy (blue/green)

CodeCommit ──► CodeBuild (test & build) ──► ECR (image registry)
     │                                            │
     └────────────── CodePipeline ─────────────────┘
                            │
                            ▼
                    CodeDeploy (blue/green)
                            │
                            ▼
                    ECS Fargate (production)

Why This Solution

Fully managed CI/CD eliminates the need to maintain a Jenkins server or build infrastructure. CodeBuild provisions build compute on demand, so builds do not queue behind a fixed pool of agents. CodePipeline orchestrates the full release workflow, supports manual approval gates, and can also take source from GitHub, Bitbucket, and S3. Blue/green deployments via CodeDeploy start a replacement task set (green) alongside the existing one (blue), shift traffic all at once or gradually (canary or linear), and automatically roll back if the deployment fails or a configured CloudWatch alarm fires. ECR stores container images and can scan them for vulnerabilities. Pipeline executions and API calls are recorded, giving an audit trail of who deployed what and when. Automating these steps can cut deployment time from 2 hours to minutes.
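The gradual traffic shift with automatic rollback can be sketched as a small simulation. This is an illustration of the control flow only, not the CodeDeploy API: `linear_shift` and the `is_healthy` callback are hypothetical, with the callback standing in for whatever health check or alarm the deployment is configured to watch.

```python
from typing import Callable, List, Tuple


def linear_shift(step_percent: int,
                 is_healthy: Callable[[int], bool]) -> Tuple[str, List[int]]:
    """Shift traffic to green in equal steps; roll back on the first failed check.

    `is_healthy(green_percent)` is evaluated after each step. Returns the
    outcome and the sequence of green-traffic percentages that were applied.
    """
    if not 0 < step_percent <= 100:
        raise ValueError("step_percent must be in 1..100")
    history: List[int] = []
    green = 0
    while green < 100:
        green = min(100, green + step_percent)
        history.append(green)
        if not is_healthy(green):
            history.append(0)  # all traffic back to blue
            return "ROLLED_BACK", history
    return "SUCCEEDED", history
```

For example, `linear_shift(25, lambda p: p < 75)` applies 25% and 50%, fails its check at 75%, and returns traffic to blue. Smaller steps expose fewer users to a bad release at the cost of a longer deployment.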

8. HIPAA-Compliant Healthcare Platform

Problem

A telemedicine startup needs to handle Protected Health Information (PHI) with strict compliance requirements including HIPAA and SOC 2. Data must be encrypted at rest and in transit with full audit trails for every access.

Solution

Services: VPC + PrivateLink + KMS + CloudTrail + Config + Shield Advanced + Macie + S3 (SSE)

VPC (isolated network)
 ├── PrivateLink (no internet for PHI)
 ├── KMS (CMK for envelope encryption)
 ├── CloudTrail (all API calls logged)
 ├── Config (compliance rules)
 ├── Shield Advanced (DDoS protection)
 ├── Macie (PII discovery in S3)
 └── S3 (SSE-S3/SSE-KMS for at-rest encryption)

Why This Solution

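AWS PrivateLink lets workloads in the VPC reach AWS services and internal endpoints over private connections, so PHI moving between services never traverses the public internet. KMS customer managed keys (formerly called Customer Master Keys, CMKs) provide envelope encryption with optional automatic key rotation, and every key use is logged in CloudTrail. CloudTrail records API calls with caller identity, timestamp, source IP, and request parameters, which is essential for HIPAA audit requirements; object-level S3 access is captured only when data events are enabled, and request payloads themselves are not logged. AWS Config evaluates resources against compliance rules (e.g., "S3 buckets must have encryption enabled") and can remediate violations automatically through configured remediation actions. Macie uses machine learning and pattern matching to discover and classify sensitive data such as PII in S3 buckets. Shield Advanced provides DDoS protection for the public-facing endpoints. The listed services are HIPAA-eligible under AWS's Business Associate Addendum (BAA); eligibility covers the services, while compliance still depends on how the workload is configured and operated. Because service-to-service traffic stays on private endpoints, no VPN or bastion host is needed for that path. Patient-facing traffic still arrives over the internet and must be protected with TLS.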
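For the in-transit half of the encryption requirement, a common control is a bucket policy that denies any request not made over TLS, using the `aws:SecureTransport` condition key. The sketch below only builds the policy document as a Python dict; the function name and the bucket name `example-phi-bucket` are hypothetical, and attaching the policy is left to whatever tooling the team already uses.

```python
import json


def tls_only_bucket_policy(bucket_name: str) -> dict:
    """Bucket policy that denies any request not made over TLS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both bucket-level and object-level operations.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


if __name__ == "__main__":
    # "example-phi-bucket" is a placeholder name for illustration.
    print(json.dumps(tls_only_bucket_policy("example-phi-bucket"), indent=2))
```

An explicit Deny overrides any Allow elsewhere, so this statement holds even if a later IAM policy is too permissive.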

9. Cost-Optimized Batch Processing

Problem

A genomics research lab runs massive batch compute jobs requiring 10,000 cores overnight. Using on-demand EC2 instances costs $10,000 per night, consuming the majority of the research budget.

Solution

Services: AWS Batch + Spot Instances + S3 + Lifecycle Policies + CloudWatch

CloudWatch Events ──► AWS Batch (job queue)
                            │
                            ▼
                    Spot Fleet (EC2 Spot)
                            │
                            ▼
                    S3 (input/output)
                            │
                            ▼
                    S3 Lifecycle (Glacier after 30 days)

Why This Solution

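AWS Batch dynamically provisions compute resources that match job requirements (vCPU, memory, GPU). Spot Instances cost up to 90% less than on-demand. When a Spot Instance is reclaimed, AWS Batch can retry the interrupted job automatically on new capacity if the job has a retry strategy. Checkpointing is the job's own responsibility, so long-running jobs should write progress to S3 periodically and resume from the last checkpoint. An Amazon EventBridge (formerly CloudWatch Events) schedule submits the batch jobs nightly. S3 Lifecycle Policies automatically transition data to an S3 Glacier storage class after 30 days; Glacier Deep Archive storage is priced roughly 95% below S3 Standard, in exchange for retrieval times measured in hours. For tightly coupled MPI jobs, AWS Batch multi-node parallel jobs can use Elastic Fabric Adapter (EFA), though such jobs are better suited to on-demand capacity because one interrupted node stalls the whole job. At an effective Spot discount of about 80%, the same throughput costs roughly $2,000 instead of $10,000, saving about $8,000 per night.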
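The savings figure follows from simple arithmetic, and it is worth including the cost of work repeated after Spot interruptions. The sketch below is a rough estimate under stated assumptions: the function name and the `rework_fraction` parameter are hypothetical, and the 80% discount is the figure used in this pattern, not a quoted price.

```python
def nightly_spot_cost(on_demand_cost: float, spot_discount: float,
                      rework_fraction: float = 0.0) -> float:
    """Estimate the nightly Spot cost in dollars.

    spot_discount: fraction saved versus on-demand (0.8 means 80% cheaper).
    rework_fraction: extra compute spent redoing work lost to interruptions
    (0.05 means 5% of the work runs twice).
    """
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    if rework_fraction < 0.0:
        raise ValueError("rework_fraction must be non-negative")
    cost = on_demand_cost * (1.0 - spot_discount) * (1.0 + rework_fraction)
    return round(cost, 2)


if __name__ == "__main__":
    base = nightly_spot_cost(10_000, 0.8)
    with_rework = nightly_spot_cost(10_000, 0.8, rework_fraction=0.05)
    print(base, 10_000 - base)
    print(with_rework, 10_000 - with_rework)
```

With no rework the nightly cost is $2,000 and the saving is $8,000. If 5% of the work has to be repeated, the cost rises to $2,100, which is why frequent checkpoints are worth the effort.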

10. Global Content Delivery Platform

Problem

A video streaming startup needs to deliver 4K content to users across 200 countries with under 1 second of buffering time. Origin servers are located exclusively in US-East, causing high latency for international users.

Solution

Services: CloudFront + S3 + Lambda@Edge + WAF + Shield + MediaTailor + CMAF

S3 (origin) ──► CloudFront (600+ POPs)
                    │
        ┌───────────┼───────────┐
        ▼           ▼           ▼
  Lambda@Edge   WAF & Shield  MediaTailor
  (customize    (block bad    (ad insertion
   content)      traffic)      at edge)

Why This Solution

CloudFront's 600+ Points of Presence worldwide cache content at the edge, dramatically reducing latency for users far from the origin. Lambda@Edge runs lightweight Node.js or Python functions at CloudFront's edge to customize content (rewriting URLs, injecting headers, A/B testing) without a round trip to the origin region. AWS WAF blocks malicious traffic at the edge using managed rule sets (covering OWASP Top 10 risks such as SQL injection and XSS) before it reaches the origin. Shield Advanced adds enhanced DDoS detection and mitigation, access to the Shield Response Team, and cost protection in the form of service credits for scaling charges caused by an attack. MediaTailor performs server-side ad insertion, stitching personalized ads into each viewer's manifest, with the manifests and segments delivered through CloudFront, so no client-side ad logic is needed. CMAF (Common Media Application Format), a packaging format rather than an AWS service, supports chunked transfer of partial segments, which lowers startup and live latency; actual time to first byte depends on cache hit rate and the viewer's network.
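As an example of URL rewriting at the edge, here is a minimal sketch of a Lambda@Edge viewer-request handler in Python. The event shape (`Records[0].cf.request` with a `uri` field) follows the CloudFront event structure; the `/watch/<video_id>` URL scheme and the `/videos/<video_id>/master.m3u8` manifest path are hypothetical and chosen only for illustration.

```python
def handler(event, context):
    """Viewer-request handler: rewrite a friendly URL to a manifest path."""
    request = event["Records"][0]["cf"]["request"]
    uri = request["uri"]

    # Hypothetical URL scheme: /watch/<video_id> -> /videos/<video_id>/master.m3u8
    prefix = "/watch/"
    if uri.startswith(prefix):
        video_id = uri[len(prefix):].strip("/")
        # Rewrite only simple IDs; leave nested or empty paths untouched.
        if video_id and "/" not in video_id:
            request["uri"] = f"/videos/{video_id}/master.m3u8"

    # Returning the request lets CloudFront continue processing with the new URI.
    return request
```

Because the rewrite happens before the cache lookup, every viewer asking for the same video maps to the same cached manifest object.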
