GCP System Design Patterns
A collection of proven Google Cloud Platform architecture patterns for solving common system design challenges at scale.
1. Scalable Web Application on GKE
Problem
A SaaS platform needs to handle traffic spikes from 1K to 100K users in minutes. Manual scaling causes downtime. Need zero-downtime deployments and multi-region HA.
Solution
Core services: Google Cloud Load Balancer + GKE + Cloud SQL (HA) + Memorystore + Cloud CDN + Cloud Armor + Cloud NAT
gcloud container clusters create saas-cluster \
  --region=us-central1 \
  --node-locations=us-central1-a,us-central1-b,us-central1-c \
  --enable-autoscaling --min-nodes=3 --max-nodes=100 \
  --addons=HorizontalPodAutoscaling,HttpLoadBalancing
Why This Solution
GCLB provides a global anycast IP with built-in autoscaling across backends. GKE with Horizontal Pod Autoscaler scales pods based on CPU or custom metrics. Cloud SQL HA provides regional persistence with automatic failover. Memorystore (Redis) offloads session state from the database for sub-millisecond reads. Cloud CDN caches static assets at 150+ global edge locations, reducing origin load. Cloud Armor provides WAF and DDoS protection at the edge before traffic reaches your cluster. Cloud NAT enables outbound connectivity from private nodes. This architecture is cost-effective because the GKE cluster autoscaler dynamically adds or removes nodes based on aggregate pod resource requests, so you pay only for what you use.
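The pod-scaling behavior described above can be sketched in a few lines. This is a minimal illustration of the ratio rule documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × currentMetric / targetMetric), with a default tolerance of 0.1). The function name, the bounds, and the sample numbers are assumptions made for the example, not values from this pattern.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 100, tolerance: float = 0.1) -> int:
    """Replica count from the HPA ratio rule, clamped to the configured bounds."""
    ratio = current_metric / target_metric
    # Within the tolerance band around the target, the replica count is left alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# 10 pods averaging 90% CPU against a 60% target scale out to 15 pods.
print(desired_replicas(10, current_metric=90, target_metric=60))
```

The cluster autoscaler then adds nodes only if the new pods' resource requests no longer fit on the existing nodes, which is why accurate requests matter for cost.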
2. Serverless Data Warehouse for Real-Time Analytics
Problem
A retail company needs to analyze 10TB+ of daily sales data across 200 stores. Traditional data warehouse takes 6+ hours for queries and costs too much.
Solution
Core services: BigQuery + Dataflow + Pub/Sub + Cloud Storage + Looker + Cloud Composer
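# Streaming pipeline with Dataflow
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# topic, table, and SCHEMA are defined elsewhere in the job
options = PipelineOptions(streaming=True)  # Pub/Sub sources need streaming mode
with beam.Pipeline(options=options) as dataflow_pipeline:
    sales_events = (dataflow_pipeline
        | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(topic=topic)
        | 'ParseJSON' >> beam.Map(json.loads)  # message payload bytes -> dict
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
            table, schema=SCHEMA,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))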
Why This Solution
BigQuery's serverless architecture auto-scales to petabytes with zero infrastructure management. Dataflow (Apache Beam) processes streaming data with exactly-once semantics and auto-scaling workers. Pub/Sub ingests 10M+ messages per second with configurable retention. BigQuery's columnar storage, clustering, and partitioning make analytical queries fast even on terabytes of data. Slot-based (capacity) pricing provides predictable costs, and partitioning data by day lets you control query costs by scanning only relevant partitions. Slot capacity can be committed for 1- or 3-year terms at a discount of up to 40%. Cloud Composer orchestrates nightly ETL and model refresh workflows.
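To make the partition-pruning argument concrete, here is a rough sketch of how much data a date-filtered query touches on a day-partitioned table. The table layout (30 daily partitions of 10 TB each), the dates, and the helper name are hypothetical; the snippet models the arithmetic only and does not call BigQuery.

```python
from datetime import date, timedelta

TB = 10**12


def bytes_scanned(partition_bytes: dict, start: date, end: date) -> int:
    """Sum the sizes of daily partitions whose date falls in [start, end]."""
    return sum(size for day, size in partition_bytes.items()
               if start <= day <= end)


# Hypothetical table: one 10 TB partition per day for 30 days.
first_day = date(2024, 1, 1)
partitions = {first_day + timedelta(days=i): 10 * TB for i in range(30)}

full_scan = sum(partitions.values())
last_week = bytes_scanned(partitions, date(2024, 1, 24), date(2024, 1, 30))
print(f"full scan: {full_scan // TB} TB, last 7 days: {last_week // TB} TB")
```

A query that filters on the partitioning column reads 70 TB here instead of 300 TB; a query without that filter scans every partition regardless of how the results are later narrowed.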
3. Event-Driven Microservices with Cloud Run
Problem
A fintech startup needs to deploy 15 microservices without managing Kubernetes. Need rapid scaling to zero when idle and instant scale-up on demand.
Solution
Core services: Cloud Run + Cloud Tasks + Firestore + Pub/Sub + Cloud Monitoring + Cloud Build
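# service.yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: payment-service
spec:
  template:
    metadata:
      annotations:
        # Minimum instances are set with this annotation, not an env var
        autoscaling.knative.dev/minScale: "2"
    spec:
      # Maximum concurrent requests handled by one instance
      containerConcurrency: 80
      containers:
      - image: gcr.io/project/payment:latest
        resources:
          limits:
            cpu: "1"
            memory: 512Mi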
Why This Solution
Cloud Run is a fully managed serverless container platform that scales to zero when idle, eliminating idle compute costs. Cold-start latency depends on the container image and runtime, so latency-sensitive services can set min-instances to keep warm instances ready. Cloud Tasks provides reliable async processing with configurable retry logic and rate limiting. Firestore offers real-time data sync across clients with ACID transactions and serverless scaling. The Knative-based API means the service definition is portable across Cloud Run, Anthos, and other Knative environments, avoiding vendor lock-in. Cloud Build automates CI/CD, and Cloud Monitoring provides end-to-end observability.
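The "configurable retry logic" in Cloud Tasks is governed by a queue's minimum backoff, maximum backoff, and maximum doublings. The sketch below reproduces the schedule as the queue retry documentation describes it: the interval doubles `max_doublings` times, then grows linearly, and is capped at the maximum backoff. The function name and sample values are assumptions for illustration; this is a model of the schedule, not a client-library call.

```python
def retry_intervals(min_backoff: float, max_backoff: float,
                    max_doublings: int, attempts: int) -> list:
    """Seconds to wait before each retry, for the first `attempts` retries."""
    intervals = []
    interval = min_backoff
    # After the doubling phase, the interval grows by this fixed step.
    linear_step = min_backoff * (2 ** max_doublings)
    for attempt in range(attempts):
        intervals.append(min(interval, max_backoff))
        if attempt < max_doublings:
            interval *= 2
        else:
            interval += linear_step
    return intervals


# min 10s, max 300s, 3 doublings
print(retry_intervals(10, 300, 3, attempts=8))
```

For a payment service, a short minimum backoff with few doublings retries transient failures quickly, while the cap keeps a long outage from being hammered with requests.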
4. IoT Data Processing Pipeline
Problem
A smart factory has 50K IoT sensors generating 1M readings per second. Need real-time alerts, historical analysis, and ML-based predictive maintenance.
Solution
Core services: IoT Core + Pub/Sub + Dataflow + Bigtable + BigQuery + Vertex AI + Looker
# Bigtable schema for time-series (HBase shell)
create 'sensor_events',
  {NAME => 'cf', VERSIONS => 1,
   TTL => 2592000} # 30 days; Bigtable compresses data automatically
Why This Solution
IoT Core authenticates, connects, and manages devices at global scale with built-in device registry and MQTT/HTTP bridge; Google retired IoT Core in August 2023, so new deployments need a partner or self-managed MQTT broker feeding Pub/Sub in its place. Pub/Sub provides reliable, high-throughput message ingestion with configurable retention and exactly-once delivery to subscribers. Dataflow processes streaming data with sub-second latency and auto-scaling workers. Bigtable's wide-column storage model handles high-write, low-latency time-series data with consistent sub-10ms access. BigQuery enables complex analytical queries on archived cold data. Vertex AI trains predictive maintenance models on historical sensor patterns, and Looker provides business-facing dashboards.
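Bigtable's time-series performance depends heavily on row key design, which the schema alone does not show. One common approach, sketched here under assumptions, is to lead the key with the sensor ID (so writes spread across sensors instead of piling onto the newest timestamp) and append a reversed timestamp (so the latest reading for a sensor sorts first). The key layout, the `MAX_TS_MILLIS` bound, and the sensor names are hypothetical.

```python
# Hypothetical upper bound, larger than any epoch-milliseconds value in use.
MAX_TS_MILLIS = 10**13


def row_key(sensor_id: str, ts_millis: int) -> str:
    """Build '<sensor_id>#<reversed timestamp>' so newer readings sort first."""
    reversed_ts = MAX_TS_MILLIS - ts_millis
    # Zero-pad so lexicographic order matches numeric order.
    return f"{sensor_id}#{reversed_ts:014d}"


older = row_key("sensor-0042", 1_700_000_000_000)
newer = row_key("sensor-0042", 1_700_000_060_000)
print(sorted([older, newer])[0] == newer)  # newest reading comes first in a scan
```

With this layout, a prefix scan on `sensor-0042#` limited to one row returns the most recent reading, which suits real-time alerting.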
5. Media Transcoding and Delivery Pipeline
Problem
A video platform receives 500 user-uploaded videos per hour. Need to transcode to multiple formats and resolutions, add DRM, and deliver globally.
Solution
Core services: Cloud Storage + Transcoder API + Pub/Sub + Cloud Functions + Cloud CDN + Media CDN
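# Transcoder API job config (stream keys and frameRate values are illustrative)
{
  "inputUri": "gs://uploads/video.mp4",
  "outputUri": "gs://output/video/",
  "config": {
    "elementaryStreams": [
      {"key": "video-720", "videoStream": {"h264": {"heightPixels": 720, "bitrateBps": 2500000, "frameRate": 30}}},
      {"key": "video-1080", "videoStream": {"h264": {"heightPixels": 1080, "bitrateBps": 5000000, "frameRate": 30}}},
      {"key": "audio", "audioStream": {"codec": "aac", "bitrateBps": 128000}}
    ],
    "muxStreams": [
      {"key": "hls-720", "container": "ts", "elementaryStreams": ["video-720", "audio"],
       "segmentSettings": {"segmentDuration": "6s"}},
      {"key": "hls-1080", "container": "ts", "elementaryStreams": ["video-1080", "audio"],
       "segmentSettings": {"segmentDuration": "6s"}}
    ]
  }
}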
Why This Solution
The Transcoder API is fully serverless with per-minute pricing, supporting all major codecs (H.264, H.265, VP9, AV1) and formats (MP4, HLS, DASH). No encoding infrastructure to manage. Pub/Sub notifies completion asynchronously, and Cloud Functions triggers post-processing workflows such as thumbnail generation and metadata extraction. Cloud CDN caches HLS/DASH manifests and segments at edge locations. Media CDN provides advanced streaming features including adaptive bitrate packaging, tokenized authentication, and multi-CDN failover for global reach.
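For readers unfamiliar with what the CDN actually caches, the sketch below assembles a minimal HLS master playlist for the two renditions in this pattern (720p at 2.5 Mbps and 1080p at 5 Mbps, each with 128 kbps audio). The Transcoder API can generate manifests itself; this snippet only illustrates the manifest's shape. The 16:9 widths and the rendition URIs are assumptions.

```python
def master_playlist(renditions: list) -> str:
    """Render an HLS master playlist listing one variant stream per rendition."""
    lines = ["#EXTM3U"]
    for r in renditions:
        # BANDWIDTH covers the whole variant: video plus audio bits per second.
        bandwidth = r["video_bps"] + r["audio_bps"]
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},"
            f"RESOLUTION={r['width']}x{r['height']}")
        lines.append(r["uri"])
    return "\n".join(lines) + "\n"


# Hypothetical rendition list; widths assume a 16:9 source.
renditions = [
    {"width": 1280, "height": 720, "video_bps": 2_500_000,
     "audio_bps": 128_000, "uri": "720p/index.m3u8"},
    {"width": 1920, "height": 1080, "video_bps": 5_000_000,
     "audio_bps": 128_000, "uri": "1080p/index.m3u8"},
]
playlist = master_playlist(renditions)
print(playlist)
```

The player picks a variant based on measured throughput, so the small manifest is requested often and benefits from edge caching as much as the segments do.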
6. Multi-Region Active-Active Database
Problem
A gaming company needs less than 10ms read/write latency globally. Single-region database causes 300ms latency for users in other continents.
Solution
Core services: Cloud Spanner + Global CLB + Memorystore + Cloud Armor
-- Spanner schema (assumes a parent table Players keyed by PlayerId)
CREATE TABLE PlayerData (
  PlayerId INT64 NOT NULL,
  Region STRING(16) NOT NULL,
  Score INT64 NOT NULL,
  LastLogin TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true),
) PRIMARY KEY (PlayerId, Region),
  INTERLEAVE IN PARENT Players ON DELETE CASCADE;
Why This Solution
Cloud Spanner provides external consistency across regions, combining the benefits of relational databases with horizontal scalability. Reads can be served from a nearby replica, while each write must be acknowledged by a quorum of replicas, so in a multi-region configuration write latency is bounded by inter-region round trips rather than by a fixed 10ms figure. Automatic sharding and rebalancing eliminate manual DBA work. The Global CLB routes users to the nearest regional application backend, which reads from a nearby Spanner replica. Memorystore caches session state and leaderboard data for sub-millisecond reads. Spanner's TrueTime-based timestamps ensure globally ordered transactions without application-level compromise. No manual sharding or eventual-consistency handling needed.
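A simplified model helps set latency expectations for a multi-region database. The sketch below assumes a leader-based quorum protocol in which a write commits once a majority of voting replicas, leader included, has acknowledged it. It ignores processing time and the client-to-leader hop, and the round-trip times are made-up numbers, so treat it as an intuition aid rather than a description of Spanner internals.

```python
def quorum_commit_latency_ms(rtt_from_leader_ms: list) -> float:
    """Time until a majority of voters (leader included) has acknowledged a write."""
    # The leader counts as an acknowledgment at 0 ms.
    acks = sorted([0.0] + list(rtt_from_leader_ms))
    majority = len(acks) // 2 + 1
    return acks[majority - 1]


# Hypothetical round trips from the leader to the other voting replicas.
single_region = quorum_commit_latency_ms([1.0, 1.5])           # three zones
multi_region = quorum_commit_latency_ms([2.0, 15.0, 35.0, 90.0])  # five voters
print(single_region, multi_region)
```

In this model the multi-region write waits for the second-fastest remote voter, so placing the leader near the majority of writers matters more than the location of the slowest replica.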
7. Machine Learning Platform (MLOps)
Problem
Data scientists train models in notebooks but have no standardized way to productionize them. Deployments take weeks and models drift undetected.
Solution
Core services: Vertex AI + Vertex AI Pipelines + Model Registry + Feature Store + Cloud Build + Cloud Monitoring
# Vertex AI Pipeline component
from kfp.dsl import component  # on kfp 1.8, import from kfp.v2.dsl instead

@component(
    base_image="python:3.9",
    packages_to_install=["google-cloud-aiplatform"],
    output_component_file="deploy_component.yaml"
)
def deploy_model(project: str, model_id: str, endpoint_id: str):
    # Imported inside the function because the body runs in its own container
    from google.cloud import aiplatform
    aiplatform.init(project=project)
    model = aiplatform.Model(model_id)
    endpoint = aiplatform.Endpoint(endpoint_id)
    # Route all traffic to the newly deployed model
    endpoint.deploy(model, traffic_percentage=100)
Why This Solution
Vertex AI Pipelines runs Kubeflow Pipelines definitions and provides reproducible, versioned ML workflows. Feature Store serves consistent, pre-computed features in both training and serving with low-latency access. Model Registry manages model versions and rollback. Model Monitoring compares production prediction inputs against the training baseline and detects skew and drift automatically. Prediction endpoints autoscale between a configured minimum and maximum replica count with no server management. This provides a complete end-to-end MLOps platform without stitching together separate tools.
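Drift detection boils down to comparing the distribution of a feature in production with its distribution at training time. The sketch below uses the Population Stability Index (PSI), a generic statistic chosen here for illustration; it is not presented as the method the managed service uses. The bin count, the epsilon floor, and the 0.25 "significant drift" rule of thumb are assumptions.

```python
import math


def psi(expected: list, actual: list, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline sample and a new sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid a zero bin width for constant data

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[max(0, min(bins - 1, idx))] += 1  # clamp out-of-range values
        # Floor each fraction at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training = [i / 100 for i in range(1000)]   # baseline feature values
shifted = [v + 5 for v in training]         # production values after a shift
print(psi(training, training), psi(training, shifted) > 0.25)
```

A pipeline step could compute this per feature on a schedule and trigger retraining or an alert when the score crosses the chosen threshold.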
8. Hybrid Cloud with Anthos
Problem
A bank has 60% workloads on-premises due to regulatory requirements. Need consistent Kubernetes experience across on-prem and cloud with centralized management.
Solution
Core services: Anthos + GKE on-prem + Config Sync + Cloud Interconnect + Cloud Run for Anthos + Service Mesh
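# Config Sync policy
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  sourceFormat: unstructured
  git:
    repo: https://source.developers.google.com/p/project/r/repo-name
    branch: main
    dir: /
    # Authenticate to the repository as a Google service account
    auth: gcpserviceaccount
    gcpServiceAccountEmail: config-sync@project.iam.gserviceaccount.com  # hypothetical account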
Why This Solution
Anthos provides a consistent Kubernetes experience across on-premises data centers and GCP, with unified management through the GCP Console. Config Sync enables GitOps-driven policy enforcement, ensuring all clusters converge to the desired state declared in Git. Cloud Interconnect provides dedicated, low-latency connectivity between on-prem and GCP with SLA-backed availability. Anthos Service Mesh (Istio-based) provides unified observability, traffic management, and mTLS security across clusters. Migrate for Anthos automates VM-to-container migration with minimal refactoring.
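The phrase "converge to the desired state declared in Git" describes a reconcile loop: compare what the repository declares with what the cluster holds, then create, update, or delete until they match. The sketch below shows only the comparison step on plain dictionaries. The object names and specs are hypothetical, and it assumes every object in `actual` is managed by the sync, which a real reconciler would check before pruning.

```python
def plan_sync(desired: dict, actual: dict) -> dict:
    """Work out which objects to create, update, or delete to match `desired`."""
    return {
        "create": sorted(k for k in desired if k not in actual),
        "update": sorted(k for k in desired
                         if k in actual and desired[k] != actual[k]),
        "delete": sorted(k for k in actual if k not in desired),
    }


# Hypothetical state: keys are "kind/name", values are the object specs.
in_git = {
    "Namespace/payments": {"labels": {"env": "prod"}},
    "NetworkPolicy/deny-all": {"policyTypes": ["Ingress"]},
}
on_cluster = {
    "Namespace/payments": {"labels": {"env": "dev"}},   # drifted by hand
    "ConfigMap/debug": {"data": {"verbose": "true"}},   # not declared in Git
}
plan = plan_sync(in_git, on_cluster)
print(plan)
```

Running the same comparison against every registered cluster, on-prem or in GCP, is what keeps manual changes from persisting.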
9. Cost-Optimized Data Processing with Preemptibles
Problem
A research institute runs 100,000 batch ML training jobs per month. On-demand GPU costs exceed $200K per month.
Solution
Core services: GKE + Preemptible VMs + Cloud Storage + Cloud Composer (Airflow) + GCS Lifecycle + Sustained Use Discounts
# Node pool with preemptible VMs
gcloud container node-pools create preemptible-pool \
  --cluster=research-cluster \
  --region=us-central1 \
  --preemptible \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --node-taints=preemptible=true:NoSchedule
Why This Solution
Preemptible VMs are 60-91% cheaper than on-demand instances and ideal for fault-tolerant batch workloads. Cloud Composer (Airflow) orchestrates job queuing with retry logic, automatically requeuing jobs interrupted by preemption. State checkpoints are saved to Cloud Storage every 5 minutes so preempted jobs resume from the latest checkpoint instead of starting over. Sustained Use Discounts apply automatically to on-demand VMs that run for more than 25% of the month; they do not apply to preemptible VMs. Committed Use Discounts lower the price of on-demand baseline capacity further for predictable workloads.
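The checkpoint-and-resume behavior can be sketched as follows. This is a minimal simulation under assumptions: a local file stands in for the Cloud Storage object, checkpoints are taken every N steps rather than every 5 minutes, and `Preempted`, `run_job`, and `demo` are hypothetical names introduced for the example.

```python
import json
import os
import tempfile


class Preempted(Exception):
    """Raised by the simulation to stand in for a VM preemption."""


def run_job(checkpoint_path, total_steps, checkpoint_every, preempt_at=None):
    """Run to `total_steps`, resuming from a checkpoint; returns the start step."""
    step = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            step = json.load(f)["step"]
    start = step
    while step < total_steps:
        if preempt_at is not None and step == preempt_at:
            raise Preempted(f"preempted at step {step}")
        step += 1  # one unit of training work
        if step % checkpoint_every == 0:
            # Write to a temp file and rename, so a preemption during the
            # write cannot leave a truncated checkpoint behind.
            tmp = checkpoint_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump({"step": step}, f)
            os.replace(tmp, checkpoint_path)
    return start


def demo():
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "ckpt.json")
        try:
            run_job(path, total_steps=20, checkpoint_every=5, preempt_at=12)
        except Preempted:
            pass  # the orchestrator would requeue the job here
        return run_job(path, total_steps=20, checkpoint_every=5)


print(demo())  # step the requeued job resumed from
```

The checkpoint interval is a trade-off: shorter intervals lose less work per preemption but spend more time and storage operations on saving state.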
10. Security and Compliance Monitoring
Problem
A healthcare company is audited quarterly for HIPAA/HITRUST. Manual evidence collection takes 2 weeks per audit and risks finding violations too late.
Solution
Core services: Security Command Center + Chronicle + Cloud Audit Logs + Forseti + Assured Workloads + CMEK
# Forseti policy library constraint
constraints:
- name: constraints/iam.allowedPolicyMemberDomains
  resource:
  - "//cloudresourcemanager.googleapis.com/projects/123"
  - "//cloudresourcemanager.googleapis.com/organizations/456"
  spec:
    allowed_domains:
    - "yourcompany.com"
    - "gserviceaccount.com"
Why This Solution
Security Command Center provides continuous vulnerability scanning, threat detection, and compliance reporting against standards like CIS, SOC, and PCI. Chronicle (Google SecOps) delivers enterprise SIEM with 1-year log retention and advanced threat correlation. Forseti enforces GCP organization policies automatically, detecting violations of IAM, firewall, and resource constraints in real time. Assured Workloads provides FedRAMP, HIPAA, and HITRUST compliance controls with key separation. CMEK (Customer-Managed Encryption Keys) ensures you control encryption keys through Cloud KMS. Automated evidence collection reduces audit preparation from weeks to hours.
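As an illustration of what a domain-restriction check evaluates, the sketch below scans IAM policy bindings for members outside an allowed set of domains, mirroring the two domains in the constraint above. It is a hand-rolled stand-in for the kind of rule a scanner applies, not code from any of the listed services; the function name and the sample bindings are hypothetical.

```python
def find_violations(bindings: list, allowed_domains: set) -> list:
    """Return (role, member) pairs whose member is outside the allowed domains."""
    violations = []
    for binding in bindings:
        for member in binding["members"]:
            kind, _, identity = member.partition(":")
            if kind == "domain":
                domain = identity.lower()
            elif kind in ("user", "group", "serviceAccount"):
                domain = identity.rpartition("@")[2].lower()
            else:
                # Public principals such as allUsers never match a domain.
                violations.append((binding["role"], member))
                continue
            allowed = any(domain == d or domain.endswith("." + d)
                          for d in allowed_domains)
            if not allowed:
                violations.append((binding["role"], member))
    return violations


# Hypothetical policy in the shape of an IAM policy's "bindings" list.
policy_bindings = [
    {"role": "roles/viewer",
     "members": ["user:alice@yourcompany.com", "user:bob@gmail.com"]},
    {"role": "roles/storage.objectViewer",
     "members": ["serviceAccount:etl@project.iam.gserviceaccount.com", "allUsers"]},
]
found = find_violations(policy_bindings, {"yourcompany.com", "gserviceaccount.com"})
print(found)
```

Storing each run's output alongside a timestamp is one way such checks turn into audit evidence without manual collection.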