DevOps Design Patterns

Ten battle-tested DevOps system design patterns covering deployment strategies, observability, security, compliance, cost governance, and disaster recovery for production Kubernetes and cloud-native environments.

1. GitOps with ArgoCD and Kustomize

Problem

A team of 20 developers deploys to 5 Kubernetes clusters (dev, staging, and prod in 3 regions). Deployments are inconsistent across environments, and manual kubectl commands cause 3–4 production incidents per month. There is no audit trail, and rollbacks are performed by hunting through shell history.

Solution

Stack: GitHub + ArgoCD + Kustomize + EKS/GKE + Sealed Secrets + OPA/Gatekeeper

All Kubernetes manifests live in a Git repository. Kustomize defines a base set of resources and environment-specific overlays (dev, staging, and one prod overlay per region: us-east-1, eu-west-1, ap-southeast-1). ArgoCD is installed on each cluster and continuously syncs the cluster state to match the desired state defined in Git. Sealed Secrets encrypts Kubernetes Secrets so they can be stored safely in the repo. OPA/Gatekeeper enforces admission policies, rejecting non-compliant resources before they are applied.

├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── kustomization.yaml
├── overlays/
│   ├── dev/
│   ├── staging/
│   └── prod/
│       ├── us-east-1/
│       ├── eu-west-1/
│       └── ap-southeast-1/
└── cluster-bootstrap/
    └── argocd-appsets.yaml

Why This Solution

GitOps makes Git the single source of truth for all cluster state. ArgoCD continuously reconciles the live cluster with the repository and, when automated sync with self-heal is enabled, reverts manual drift on its own. Kustomize provides environment-specific overlays without the complexity of template engines like Helm. Sealed Secrets allows teams to commit encrypted secrets alongside their manifests without exposing plaintext credentials. OPA/Gatekeeper enforces policies such as "no latest tag", "containers must define resource limits", and "deployments must have PodDisruptionBudgets". Rollback is as simple as running git revert: ArgoCD syncs the cluster back to the previous state, and the Git history doubles as the audit trail.
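The three example policies can be illustrated with a small check over a Deployment manifest. This is only a sketch in Python: Gatekeeper itself evaluates policies written in Rego, and the function name, the pdb_selectors argument, and the example image are assumptions made for illustration.

```python
"""Toy admission check mirroring the three example policies (not Gatekeeper itself)."""


def violations(deployment: dict, pdb_selectors: list) -> list:
    """Return human-readable policy violations for one Deployment manifest.

    pdb_selectors is a hypothetical input: the matchLabels of every
    PodDisruptionBudget that exists in the namespace.
    """
    found = []
    template = deployment["spec"]["template"]
    for container in template["spec"]["containers"]:
        # Look for a tag only in the last path segment, so a registry
        # port such as "host:5000/app" is not mistaken for a tag.
        last_segment = container["image"].rsplit("/", 1)[-1]
        tag = last_segment.split(":", 1)[1] if ":" in last_segment else "latest"
        if tag == "latest":
            found.append(f"{container['name']}: image must not use the latest tag")
        if "limits" not in container.get("resources", {}):
            found.append(f"{container['name']}: resource limits are required")
    labels = template["metadata"].get("labels", {})
    # A PDB covers the pods if every label in its selector matches.
    covered = any(
        bool(sel) and all(labels.get(k) == v for k, v in sel.items())
        for sel in pdb_selectors
    )
    if not covered:
        found.append("no PodDisruptionBudget selects this Deployment's pods")
    return found


example = {
    "spec": {"template": {
        "metadata": {"labels": {"app": "web"}},
        "spec": {"containers": [
            {"name": "web", "image": "registry.example.com/web:latest"},
        ]},
    }}
}
for problem in violations(example, pdb_selectors=[]):
    print(problem)
```

The example manifest breaks all three rules, so three violations are printed. In a real cluster the same logic would live in Gatekeeper constraint templates and run at admission time.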

2. Blue-Green Deployment with Zero Downtime

Problem

Deploying a new version of the application causes 5–10 minutes of downtime. Failed deployments require complex rollback scripts that sometimes fail themselves, prolonging the outage and eroding customer trust.

Solution

Stack: ALB/Nginx Ingress + CodeDeploy/Spinnaker + Auto Scaling + Route53 + CloudWatch

Two identical environments (blue and green) run side by side. The load balancer or ingress controller routes all production traffic to the blue environment. When a new version is ready, it is deployed to the green environment. Health checks validate the green deployment. Once healthy, the load balancer switches traffic from blue to green atomically. If the deployment fails, the load balancer continues routing to blue with zero impact.

User ──► Route53 ──► ALB
                      ├──► Blue  (current, 100% traffic)
                      └──► Green (new, 0% traffic → 100% after health check)

Why This Solution

Blue-green deployment avoids deployment downtime because the live environment is never touched during the rollout. Switching at the load balancer takes effect almost immediately; switching via DNS is slower, because resolvers and clients cache records until the TTL expires. Rollback is equally fast: switch traffic back to the previous environment. ALB and Nginx Ingress both support weighted routing, which enables canary testing: send 5% of traffic to green before full cutover. CloudWatch alarms monitor error rates during rollout and can trigger an automatic rollback if thresholds are breached. The pattern requires no application-level changes and works for any HTTP service, provided shared state such as the database schema stays compatible with both versions.
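The cutover logic can be sketched as a small state change on the router weights. This is a simplified model, not CodeDeploy or Spinnaker behavior: the Router class, the cut_over function, and the 1% error threshold are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Router:
    """Stand-in for a load balancer rule; weights are percentages summing to 100."""
    blue: int = 100
    green: int = 0

    def set_green(self, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.green, self.blue = percent, 100 - percent


def cut_over(router, green_healthy, error_rate,
             max_error_rate=0.01, canary_percent=5):
    """Shift traffic to green in two steps; fall back to blue on a failed check.

    green_healthy: zero-argument callable, True when green passes health checks.
    error_rate: zero-argument callable returning green's error rate (0..1).
    """
    if not green_healthy():
        return "stayed-on-blue"        # blue was never touched
    router.set_green(canary_percent)   # small canary slice first
    if error_rate() > max_error_rate:
        router.set_green(0)            # rollback is only a weight change
        return "rolled-back"
    router.set_green(100)
    return "switched-to-green"


router = Router()
print(cut_over(router, green_healthy=lambda: True, error_rate=lambda: 0.002))
print(router)
```

Because rollback only changes weights, the blue environment has to stay running until the new version has proven itself.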

3. Centralized Logging and Observability Stack

Problem

The team runs 200 microservices across 50 nodes. When an incident occurs, logs are scattered across individual servers. Engineers SSH into each box and grep through files. Troubleshooting a single customer issue takes 2+ hours.

Solution

Stack: Fluentd/Filebeat + Elasticsearch/OpenSearch + Kibana/Grafana + S3/GCS (cold storage) + CloudWatch/Cloud Logging (formerly Stackdriver)

Each node runs Fluentd (or Filebeat) as a DaemonSet, collecting container stdout/stderr and system logs with minimal resource overhead. Logs are forwarded to an OpenSearch cluster for indexing and full-text search. Grafana provides unified dashboards combining logs, metrics (Prometheus), and traces (Jaeger). Logs older than 30 days are moved out of the hot cluster into S3, where lifecycle policies transition them to S3 Glacier for compliance retention.

Pod ──► Fluentd (DaemonSet) ──► OpenSearch ──► Kibana/Grafana
                                    │
                                    └──► S3 (cold storage after 30 days)

Why This Solution

Fluentd collects logs from every node with a consistent, low-overhead agent (typically under 100 MB RAM per node). A well-sized OpenSearch cluster can return queries over terabytes of logs in under a second, enabling teams to search across all 200 microservices from a single interface. Grafana correlates logs with metrics and traces in one pane of glass, dramatically reducing mean time to resolution (MTTR). S3 lifecycle policies reduce storage costs for historical logs by up to 90% while maintaining compliance. Structured logging (JSON format) enables powerful field-level filtering. Correlation IDs propagated via HTTP headers allow complete request tracing across service boundaries.
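Structured logging with a propagated correlation ID might look like the following sketch, which uses only Python's standard logging module. The header name X-Correlation-ID, the field names, and the service name "checkout" are assumptions; any consistent convention works.

```python
import json
import logging
import uuid

CORRELATION_HEADER = "X-Correlation-ID"  # assumed header name


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        })


def correlation_id_from(headers: dict) -> str:
    """Reuse the caller's ID if present; otherwise start a new one."""
    return headers.get(CORRELATION_HEADER) or uuid.uuid4().hex


def outbound_headers(correlation_id: str) -> dict:
    """Headers to attach to downstream calls so the ID crosses services."""
    return {CORRELATION_HEADER: correlation_id}


# StreamHandler writes to stderr, which the node-level agent collects.
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

cid = correlation_id_from({CORRELATION_HEADER: "req-123"})
log.info("payment authorized", extra={"correlation_id": cid})
```

With every service emitting the same correlation_id field, a single filter on that field in OpenSearch returns the whole request path.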

4. Container Security Scanning Pipeline

Problem

A security audit found 40 critical CVEs in production containers. Developers build and push Docker images directly to the registry without any security checks. Vulnerable base images proliferate across the organization.

Solution

Stack: ECR/GCR + Trivy/Clair + GitHub Actions + Lambda + SNS + OPA

Trivy is integrated into the CI pipeline (GitHub Actions). On every pull request, the image is built and scanned for CVEs, secrets, and misconfigurations. OPA policies define acceptable risk levels — builds with critical or high CVEs are blocked. When a scan passes, the image is pushed to the registry. ECR/GCR native scanning continuously monitors stored images for newly discovered CVEs. SNS notifies the security team when violations are detected.

PR ──► GitHub Actions
          ├── Build image
          ├── Trivy scan (block on CRITICAL/HIGH)
          ├── OPA policy check
          └── Push to ECR
      ECR ──► Continuous scan ──► SNS alert (if new CVE found)

Why This Solution

Trivy scans images for CVEs, secrets, and misconfigurations in seconds and needs no separate database server; it downloads its vulnerability database on demand. Integration into CI means every PR is scanned before an image reaches the registry, so vulnerable images are blocked at the gate. OPA policies codify the organization's risk appetite: here that means zero CRITICAL or HIGH CVEs, while a more lenient organization could encode a rule such as "allow up to 5 HIGH CVEs but zero CRITICAL". ECR scanning provides ongoing protection: with continuous (enhanced) scanning enabled, stored images are re-evaluated when a new CVE is disclosed, and findings are routed to SNS notifications. This layered approach (prevention + detection + notification) ensures containers are safe at build time and that stored images stay under watch as new vulnerabilities emerge.
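The gate itself reduces to counting findings by severity and comparing them with the allowed limits. The sketch below assumes a report shaped like Trivy's JSON output (a top-level "Results" list whose entries carry a "Vulnerabilities" list with a "Severity" field); the function names and the sample report are hypothetical.

```python
BLOCKING = ("CRITICAL", "HIGH")


def severity_counts(report: dict) -> dict:
    """Count findings per severity across all scan targets."""
    counts = {}
    for result in report.get("Results") or []:
        # "Vulnerabilities" may be missing or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            sev = vuln.get("Severity", "UNKNOWN")
            counts[sev] = counts.get(sev, 0) + 1
    return counts


def gate(report: dict, max_allowed=None) -> bool:
    """Return True if the image may be pushed to the registry."""
    if max_allowed is None:
        max_allowed = {sev: 0 for sev in BLOCKING}  # block on CRITICAL/HIGH
    counts = severity_counts(report)
    return all(counts.get(sev, 0) <= limit for sev, limit in max_allowed.items())


# Made-up report for illustration; the IDs are placeholders, not real CVEs.
sample = {
    "Results": [
        {"Target": "app (alpine)", "Vulnerabilities": [
            {"VulnerabilityID": "CVE-0000-0001", "Severity": "HIGH"},
            {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"},
        ]},
        {"Target": "requirements.txt", "Vulnerabilities": None},
    ]
}
print(severity_counts(sample))
print("push allowed:", gate(sample))
```

Passing max_allowed={"CRITICAL": 0, "HIGH": 5} expresses the more lenient rule. In practice the same strict gate is often achieved with Trivy's own --severity and --exit-code options, leaving OPA for organization-wide rules.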

5. Secrets Management and Rotation

Problem

Database passwords, API keys, and TLS certificates are hardcoded in application config files committed to Git. A former employee still has access to production secrets. There is no audit trail for who accessed which secret.

Solution

Stack: HashiCorp Vault / AWS Secrets Manager + KMS + Lambda + IAM + Consul

Secrets are stored centrally in AWS Secrets Manager. IAM roles grant workloads access based on identity — each application gets a narrowly scoped policy. Lambda functions run on a schedule to rotate RDS credentials automatically. KMS encrypts secrets at rest with automatic key rotation. For advanced use cases, HashiCorp Vault provides dynamic secrets (short-lived, auto-expiring database credentials) and a detailed audit log.

App ──► IAM (assume role) ──► AWS Secrets Manager
                                    │
                                    ├── Retrieve DB password
                                    ├── Auto-rotate (Lambda, every 30 days)
                                    └── Audit log → CloudTrail

Why This Solution

Secrets Manager centralizes secret storage and eliminates the "secrets in Git" problem. Automatic rotation via Lambda ensures credentials are refreshed regularly without manual intervention, so a leaked secret is valid only for a limited window (at most 30 days with the rotation schedule shown above). IAM-based access means no shared root credentials; every workload has a unique identity with fine-grained permissions, and a departing employee's access is revoked by removing their IAM identity. KMS provides envelope encryption with automatic key rotation, adding a second layer of defense. Vault's dynamic secrets are even stronger: each pod gets a unique, short-lived database credential that expires when its lease ends. The audit trail (CloudTrail for Secrets Manager, the audit log for Vault) supports SOC 2 and PCI-DSS compliance requirements.
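The two time-based guarantees, scheduled rotation and lease expiry, can be expressed as simple date checks. This is a sketch of the reasoning only, not the Secrets Manager or Vault API; the function names and the one-hour lease are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

ROTATION_INTERVAL = timedelta(days=30)  # matches the 30-day schedule in the diagram


def rotation_due(last_rotated: datetime, now: datetime) -> bool:
    """True once a static secret has reached the rotation interval."""
    return now - last_rotated >= ROTATION_INTERVAL


def lease_valid(issued_at: datetime, ttl: timedelta, now: datetime) -> bool:
    """A dynamic credential is usable only inside its lease window."""
    return issued_at <= now < issued_at + ttl


now = datetime(2024, 3, 1, tzinfo=timezone.utc)
print(rotation_due(datetime(2024, 1, 15, tzinfo=timezone.utc), now))
# A credential leaked two hours after issue is useless if the lease was one hour.
issued = now - timedelta(hours=2)
print(lease_valid(issued, ttl=timedelta(hours=1), now=now))
```

The shorter the interval or lease, the smaller the window in which a leaked credential is useful, which is why dynamic secrets are the stronger option.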

6. Infrastructure Compliance as Code

Problem

Cloud costs increased 300% because developers created untagged resources that could not be attributed to any team. The security team found 50 S3 buckets publicly accessible. There was no mechanism to enforce policies before resources were created.

Solution

Stack: Terraform/CloudFormation + OPA/CloudFormation Guard + AWS Config + Lambda + SNS

All infrastructure is defined as code in Terraform and reviewed through pull requests. OPA policies (written in Rego) or CloudFormation Guard rules enforce tagging requirements, encryption settings, and public access restrictions before resources are provisioned. AWS Config continuously evaluates existing resources against compliance rules. Non-compliant resources trigger Lambda remediation functions (e.g., auto-apply encryption, remove public access). SNS notifies resource owners.

PR ──► Terraform plan
          ├── OPA policy check (tagging, encryption, public access)
          ├── Human review
          └── Terraform apply
      AWS Config ──► Non-compliant resource
                      ├── Lambda auto-remediation
                      └── SNS notification to owner

Why This Solution

Infrastructure as Code ensures all resources are defined, version-controlled, and peer-reviewed before creation. OPA/Guard policies provide preventive controls: they block non-compliant infrastructure at the CI stage, before any resource is created. AWS Config provides detective controls, continuously scanning the existing account for drift. Lambda provides corrective controls, automatically fixing common violations such as unencrypted EBS volumes or publicly accessible S3 buckets. This three-layer model (prevent + detect + correct) ensures compliance without slowing development. Tag enforcement does not cut spend by itself, but accurate cost allocation and resource ownership tracking let teams find and remove waste, which typically saves 30–50% in cloud costs.
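A preventive tagging check can be sketched against the JSON form of a Terraform plan. The sketch is in Python rather than Rego, and it assumes the plan follows the "resource_changes" layout produced by terraform show -json; the required tag set and the sample plan are hypothetical.

```python
REQUIRED_TAGS = {"team", "project", "environment"}  # hypothetical tag policy


def untagged_resources(plan: dict) -> dict:
    """Map resource address to the sorted list of required tags it lacks."""
    problems = {}
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        # Only resources being created or updated need checking.
        if "create" not in actions and "update" not in actions:
            continue
        after = rc["change"].get("after") or {}
        tags = after.get("tags") or {}
        missing = REQUIRED_TAGS - set(tags)
        if missing:
            problems[rc["address"]] = sorted(missing)
    return problems


# Made-up plan fragment for illustration.
plan = {"resource_changes": [
    {"address": "aws_s3_bucket.logs",
     "change": {"actions": ["create"],
                "after": {"tags": {"team": "platform"}}}},
    {"address": "aws_instance.web",
     "change": {"actions": ["create"],
                "after": {"tags": {"team": "web", "project": "shop",
                                   "environment": "prod"}}}},
    {"address": "aws_instance.old",
     "change": {"actions": ["delete"], "after": None}},
]}
for address, missing in untagged_resources(plan).items():
    print(f"{address}: missing tags {missing}")
```

A CI job would fail the pull request whenever the returned mapping is non-empty. The sketch treats every resource type as taggable, which is a simplification; a real policy would skip types that do not support tags.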

7. Canary Deployments with Service Mesh

Problem

Full blue-green deployments are too risky for critical services. The team needs to validate new versions on real production traffic before full rollout, but manual traffic splitting with load balancers is error-prone and does not provide automated rollback.

Solution

Stack: Istio/Linkerd + Flagger + Prometheus + Grafana + Kubernetes

Istio (or Linkerd) is deployed as a service mesh, providing fine-grained traffic control at layer 7. Flagger automates the canary analysis process: it shifts 1% of traffic to the new version, monitors Prometheus metrics (error rate, latency, success rate), and gradually increases the traffic weight while the metrics remain healthy. If any metric breaches its threshold (e.g., error rate increases by 1%), Flagger halts the rollout and shifts all traffic back to the stable version within seconds of the failed check. No application code changes are required.

Service v1 (stable) ──► 99% traffic
Service v2 (canary) ──► 1% traffic → 10% → 50% → 100%
         │
         └── Flagger checks Prometheus metrics
                ├── Healthy → increase weight
                └── Unhealthy → rollback to v1

Why This Solution

A service mesh (Istio) provides fine-grained traffic routing at layer 7 without modifying application code. Flagger automates the entire canary lifecycle (traffic shifting, metric analysis, and rollback), eliminating manual error. Prometheus collects real-time metrics (request rate, error rate, p50/p95/p99 latency) that Flagger evaluates every 60 seconds. Grafana dashboards visualize the canary progress in real time. If error rate increases by even 1%, Flagger triggers an automatic rollback within seconds of the failed check, preventing a full outage. This approach gives teams confidence to deploy frequently while minimizing blast radius. It is particularly valuable for high-risk changes such as library upgrades; database schema migrations also benefit, provided the schema stays compatible with both versions while they run side by side.
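The promotion loop can be sketched as a walk through the traffic weights with a metric check at each step. This is a simplified model rather than Flagger's implementation: real analysis waits an interval between steps and tolerates a configurable number of failed checks. The run_canary function and its error_rate_at callback, which stands in for a Prometheus query, are hypothetical.

```python
def run_canary(error_rate_at, steps=(1, 10, 50, 100), max_error_rate=0.01):
    """Walk the canary through each traffic weight, rolling back on a bad metric.

    error_rate_at: callable taking the canary weight (percent) and returning
    the observed error rate (0..1).
    Returns (final_weight, history), where history lists the weights attempted.
    """
    history = []
    final = 0
    for weight in steps:
        history.append(weight)
        if error_rate_at(weight) > max_error_rate:
            return 0, history   # rollback: all traffic returns to v1
        final = weight
    return final, history


# A healthy release is promoted through every step.
print(run_canary(lambda weight: 0.001))
# A release that degrades under load is caught at 50% and rolled back.
print(run_canary(lambda weight: 0.05 if weight >= 50 else 0.001))
```

The second run stops at the 50% step, so half of the traffic never saw the faulty version.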

8. Self-Healing Infrastructure

Problem

PagerDuty alerts the on-call engineer at 3 AM for a failed EC2 instance. Manual replacement takes 30 minutes. This happens 2–3 times per month, causing sleep disruption and engineer burnout.

Solution

Stack: CloudWatch + Lambda + Auto Scaling + Route53 health checks + EventBridge + Systems Manager

Auto Scaling groups replace unhealthy EC2 instances automatically. Route53 health checks detect endpoint failures within about 15 seconds, depending on the configured request interval and failure threshold. EventBridge listens for specific AWS events (e.g., status check failures, disk full alarms) and triggers Lambda functions for remediation. Systems Manager Automation runs predefined runbooks for complex issues. The on-call engineer is only paged for issues the automation cannot resolve.

EC2 instance fails
  ├── Auto Scaling ──► Launch replacement (2 min)
  ├── Route53 health check ──► Remove from DNS (15 sec)
  ├── EventBridge ──► Lambda
  │                    ├── Disk full? → clean temp files
  │                    ├── Process stuck? → restart service
  │                    └── Memory leak? → scale up
  └── If all automated remediation fails → PagerDuty

Why This Solution

Auto Scaling is the first line of defense: it automatically replaces failed instances without human intervention. Route53 health checks ensure traffic is only routed to healthy endpoints. EventBridge captures AWS events in real time and triggers Lambda functions for targeted remediation: clearing disk space, restarting stuck processes, or scaling resources. Systems Manager Automation provides more sophisticated runbooks for multi-step recovery procedures. The key principle is that the human is the last resort: automation handles the common failure modes, and the engineer is paged only for novel or complex issues. This reduces on-call fatigue, cuts mean time to recovery (MTTR) from 30 minutes to about 2 minutes, and increases system availability.
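The remediation flow amounts to a dispatch table with paging as the fallback. The sketch below models it in plain Python; the event type names, the handler functions, and the page callback are hypothetical and do not reflect the actual EventBridge event schema.

```python
def remediate(event: dict, handlers: dict, page) -> str:
    """Try the automated fix for this event type; page a human if it fails.

    handlers: maps an event type to a callable returning True on success.
    page: callable invoked with the event when automation cannot resolve it.
    """
    handler = handlers.get(event["type"])
    if handler is not None:
        try:
            if handler(event):
                return "remediated"
        except Exception:
            pass  # a crashing handler counts as a failed remediation
    page(event)
    return "paged"


# Placeholder handlers; real ones would call Systems Manager or the EC2 API.
handlers = {
    "disk_full": lambda event: True,        # e.g. clean temp files
    "process_stuck": lambda event: True,    # e.g. restart the service
    "memory_leak": lambda event: False,     # e.g. scale-up attempt failed
}
paged = []
for kind in ("disk_full", "memory_leak", "kernel_panic"):
    outcome = remediate({"type": kind, "instance": "i-example"}, handlers, paged.append)
    print(kind, "->", outcome)
```

Only the failed remediation and the unknown event type reach the on-call engineer; the disk-full event is resolved without a page.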

9. Cloud Cost Governance and Optimization

Problem

The monthly cloud bill is $150K and growing 15% month-over-month. No team knows their individual spend. Approximately 40% of resources are idle after business hours and on weekends.

Solution

Stack: AWS Budgets/Cost Explorer + Lambda + SNS + Organizations + Auto Scaling (scheduled) + Spot Instances

AWS Budgets sends alerts when spend exceeds configurable thresholds. Cost Explorer with tag-based grouping shows spend per team, project, and environment. Lambda functions auto-stop non-production resources at 7 PM and restart them at 8 AM on weekdays, so they stay off overnight and all weekend. Organizations consolidates billing across accounts for volume discounts. Spot Instances handle stateless, fault-tolerant workloads at a discount of around 70%. Savings Plans cover baseline capacity at a 30–50% discount.

├── Cost Governance
│   ├── AWS Budgets → SNS alert at 80%, 100%, 120% of forecast
│   ├── Cost Explorer (tag-based) → per-team dashboards
│   └── Organizations → consolidated billing, volume discounts
└── Optimization
    ├── Lambda (stop at 19:00, start at 08:00) → 60% savings
    ├── Spot Instances → stateless workloads, 70% cheaper
    └── Savings Plans → baseline capacity, 30–50% discount

Why This Solution

AWS Budgets provide proactive cost alerts: teams are notified before they overspend, not after. Cost Explorer with mandatory tagging enables accurate chargeback and shows which teams or projects are driving costs. The Lambda stop/start scheduler reduces non-production compute costs by up to 60%, since dev and staging environments would otherwise sit idle for roughly two-thirds of the week (113 of 168 hours). Spot Instances are ideal for batch jobs, CI workers, and stateless microservices; they are interruptible but typically around 70% cheaper. Savings Plans provide predictable discounts for committed usage. Together, these strategies typically reduce cloud bills by 40–60% while maintaining performance and availability.
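The effect of the stop/start schedule can be checked with a few lines of arithmetic. The sketch assumes the schedule from the diagram (start at 08:00, stop at 19:00, weekdays only) and that compute is billed only while running; the function name is illustrative.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def weekly_running_hours(start_hour=8, stop_hour=19, weekdays=5):
    """Hours per week that scheduled non-production resources stay up."""
    return (stop_hour - start_hour) * weekdays


running = weekly_running_hours()               # 11 hours x 5 days = 55
stopped = HOURS_PER_WEEK - running             # 113
idle_fraction = stopped / HOURS_PER_WEEK
print(f"running {running} h/week, stopped {stopped} h/week ({idle_fraction:.0%})")
```

This prints "running 55 h/week, stopped 113 h/week (67%)". The 67% figure is a ceiling on compute savings; realized savings are lower because storage attached to stopped instances is still billed.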

10. Disaster Recovery Automation

Problem

The DR plan is a 50-page document last updated 2 years ago. Estimated RTO is 8 hours and RPO is 24 hours. The annual DR test takes 3 days, requires a dedicated team, and always fails due to missing steps or outdated documentation.

Solution

Stack: Terraform + Route53 + RDS Cross-Region Replication + S3 CRR + Lambda + Step Functions + CloudEndure

Terraform defines the entire infrastructure as code, enabling the full environment to be recreated in a secondary region. RDS Cross-Region Replication provides automated, asynchronous database replication; as long as replica lag stays low, the Recovery Point Objective (RPO) is under 1 minute. S3 Cross-Region Replication (CRR) copies objects to the secondary region asynchronously. AWS Step Functions orchestrates the failover playbook with manual approval gates at key decision points. CloudEndure (since succeeded by AWS Elastic Disaster Recovery) provides continuous block-level replication for non-RDS workloads. The DR test is fully automated (run the Step Function, verify, fail back) and completes in 2 hours instead of 3 days.

Primary Region (us-east-1)
├── RDS (primary) ──► Cross-Region Replica ──► us-west-2
├── S3 ──► CRR ──► us-west-2
└── EC2 (CloudEndure) ──► Continuous replication ──► us-west-2

Step Functions DR Playbook:
  1. Health check failed in primary region
  2. [Manual approval] Confirm failover
  3. Promote RDS replica to primary
  4. Point Route53 to secondary region
  5. Validate application health
  6. [Manual approval] Confirm failback when primary is restored

Why This Solution

Infrastructure as Code (Terraform) eliminates the 50-page document — the entire DR environment can be recreated from version-controlled code. RDS Cross-Region Replication reduces RPO from 24 hours to under 1 minute with automated, asynchronous replication. Step Functions codify the DR playbook into an executable, auditable workflow with approval gates for human judgment at critical decision points. CloudEndure provides continuous block-level replication for EC2 instances, ensuring non-RDS workloads are also protected. Automated DR testing means the plan is validated monthly (not annually), completing in 2 hours instead of 3 days. When a real disaster strikes, the team runs a tested, automated playbook with confidence, achieving an RTO of 15–30 minutes.
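The playbook structure, ordered steps with approval gates, can be modeled in a few lines. This sketch is plain Python rather than a Step Functions state machine definition; the Step class, the run_playbook function, and the placeholder actions are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], bool]      # returns True on success
    needs_approval: bool = False


def run_playbook(steps, approve):
    """Run steps in order; stop at the first rejected approval or failed action.

    approve: callable taking a step name and returning True to proceed.
    Returns (status, log), where log records the outcome of each step reached.
    """
    log = []
    for step in steps:
        if step.needs_approval and not approve(step.name):
            log.append((step.name, "rejected"))
            return "halted", log
        if not step.action():
            log.append((step.name, "failed"))
            return "failed", log
        log.append((step.name, "ok"))
    return "completed", log


# Placeholder actions; real ones would call the RDS and Route53 APIs.
failover = [
    Step("confirm failover", lambda: True, needs_approval=True),
    Step("promote RDS replica", lambda: True),
    Step("point Route53 to secondary region", lambda: True),
    Step("validate application health", lambda: True),
]
status, log = run_playbook(failover, approve=lambda name: True)
print(status)
for name, outcome in log:
    print(f"  {name}: {outcome}")
```

Because every run produces the same ordered log, a monthly test and a real failover exercise identical code paths, which is what keeps the playbook from going stale.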
