Knowledge Base v2

System Architecture
Deep Dives

End-to-end architectural breakdowns of how the world's most scalable systems are built — covering design choices, trade-offs, AWS services, AI/ML pipelines, and real-world scenarios.

7 System Architectures
7 AI/ML Guides
8 Project Scenarios
10 Troubleshooting Guides
System Architectures

How the World's Largest Systems Scale

Deep dives into production architectures at companies serving hundreds of millions of users — covering CDN design, database choices, microservices patterns, and failure modes.

Netflix

240M+ subscribers · 800+ microservices
Advanced

How Netflix serves 500M+ streaming hours daily across 190+ countries using a self-built CDN (Open Connect), Chaos Engineering, and per-title video encoding.

Open Connect CDN · Chaos Engineering · EVCache · Spinnaker CI/CD
99.99% Availability
<2s First Frame
1000+ Deploys/day
Client Apps ──► Open Connect (ISP PoPs)
                     │
              AWS (Control Plane)
              ┌──────┬──────┬──────┐
           Users  Reco  A/B  Analytics
           Svc    Eng   Test  Pipeline

Hotstar

10M+ concurrent · 1B+ views
Advanced

India's largest streaming platform, handling 10M+ concurrent viewers during IPL matches with ABR streaming, edge computing, and a WebAssembly-based player.

ABR Streaming · Edge Computing · WebAssembly · Multi-CDN
10M+ Concurrent
40ms Latency
Multi-CDN Strategy
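
As a rough illustration of the ABR (adaptive bitrate) idea named in this card, the sketch below picks the highest rendition that fits a measured throughput. The bitrate ladder, the `Rendition` type, and the 0.8 safety factor are assumptions made for the example and do not describe Hotstar's actual player logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    name: str
    bitrate_kbps: int


# Hypothetical bitrate ladder, ordered from lowest to highest quality.
LADDER = [
    Rendition("240p", 400),
    Rendition("480p", 1200),
    Rendition("720p", 2800),
    Rendition("1080p", 5000),
]


def pick_rendition(throughput_kbps: float, safety_factor: float = 0.8) -> Rendition:
    """Return the highest rendition that fits within a margin of the measured throughput."""
    budget = throughput_kbps * safety_factor
    chosen = LADDER[0]  # always fall back to the lowest rung
    for rendition in LADDER:
        if rendition.bitrate_kbps <= budget:
            chosen = rendition
    return chosen
```

A real player would usually also weigh buffer level and recent switch history; this sketch only shows the throughput-based step.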

Amazon

World's largest e-commerce
Advanced

The SOA→microservices journey that birthed AWS. Two-pizza teams, DynamoDB, and the distributed systems lessons that shaped modern software engineering.

Two-Pizza Teams · DynamoDB · SOA→Microservices · Eventual Consistency
375M+ Products
99.9% Uptime
$500B+ Annual GMV

Uber

Global ride-hailing at scale
Advanced

Real-time dispatch engine, H3 geospatial indexing, and domain-oriented microservices powering millions of rides per day across 70+ countries.

H3 Geo Indexing · Dispatch Engine · Domain Microservices · ETA Prediction ML
70+ Countries
<1s Match Time
10M+ Rides/day

WhatsApp

2B+ users · Billions msg/day
Advanced

How an initial team of about 50 engineers built the platform that now serves 2B+ users, using Erlang's concurrency model, a customized XMPP protocol, and E2E encryption at unprecedented scale.

Erlang BEAM · E2E Encryption · XMPP Protocol · Connection Mgmt
2B+ Users
100B+ Messages/day
50 Engineers (initial)

YouTube

500+ hours uploaded/min
Advanced

Video transcoding pipeline, Bigtable for metadata, Vitess for MySQL scaling, and a recommendation engine serving billions of views daily.

Transcoding Pipeline · Vitess · Bigtable · Recommendation ML
2B+ Users
500 hrs Uploaded/min
1B hrs Watched/day

NSE (Stock Exchange)

Microsecond latency · HFT
Advanced

India's National Stock Exchange — order matching engine processing millions of orders per second with microsecond latency and co-location for HFT firms.

Order Matching Engine · Co-location · Disaster Recovery · FPGA Hardware
<10μs Latency
10M+ Orders/sec
99.999% Uptime
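
To make the "order matching engine" in this card more concrete, here is a toy limit order book using price-time priority, a common matching rule. It is a teaching sketch only: the `OrderBook` class, integer tick prices, and heap-based layout are assumptions for the example and say nothing about how NSE's engine is implemented.

```python
import heapq
from itertools import count


class OrderBook:
    """Toy limit order book with price-time priority (illustrative only)."""

    def __init__(self):
        self._bids = []      # heap of (-price, seq, qty): highest bid first
        self._asks = []      # heap of (price, seq, qty): lowest ask first
        self._seq = count()  # arrival counter, breaks ties at equal price

    def submit(self, side, price, qty):
        """Match an incoming limit order; return fills as (price, qty) pairs."""
        is_buy = side == "buy"
        own, opposite = (self._bids, self._asks) if is_buy else (self._asks, self._bids)
        fills = []
        while qty > 0 and opposite:
            key, seq, resting_qty = opposite[0]
            resting_price = key if is_buy else -key
            crosses = resting_price <= price if is_buy else resting_price >= price
            if not crosses:
                break
            traded = min(qty, resting_qty)
            fills.append((resting_price, traded))
            qty -= traded
            if traded == resting_qty:
                heapq.heappop(opposite)
            else:
                # A partially filled resting order keeps its place in the queue.
                heapq.heapreplace(opposite, (key, seq, resting_qty - traded))
        if qty > 0:
            # Any unfilled remainder rests on the book.
            heapq.heappush(own, (-price if is_buy else price, next(self._seq), qty))
        return fills


book = OrderBook()
book.submit("sell", 101, 5)
book.submit("sell", 100, 5)
book.submit("sell", 100, 3)
print(book.submit("buy", 101, 9))  # [(100, 5), (100, 3), (101, 1)]
```

The buy order fills the two orders at 100 in arrival order before touching the 101 level, which is the price-time rule in miniature.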
AI / ML

Artificial Intelligence & Machine Learning

From beginner fundamentals to advanced LLM architectures — covering every layer of modern AI systems with practical examples and AWS integrations.

Beginner
🎯

AI/ML Fundamentals

AI vs ML vs Deep Learning · Supervised/Unsupervised/RL · Bias-variance tradeoff · Evaluation metrics · ML workflow

Training vs Inference · Overfitting · Precision/Recall · Cross-validation
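
For readers new to the evaluation metrics listed here, a minimal sketch of precision and recall for binary labels follows. The function name and the sample labels are made up for illustration.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when nothing was predicted or nothing is positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: 2 true positives, 1 false positive, 2 false negatives.
precision, recall = precision_recall([1, 1, 1, 1, 0, 0], [1, 1, 0, 0, 1, 0])
```

Precision asks "of the items flagged positive, how many were right?" while recall asks "of the real positives, how many were found?".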
Beginner–Intermediate
🔢

ML Algorithms

Linear/Logistic Regression · Decision Trees · Random Forests · XGBoost · K-Means · SVM · Dimensionality reduction

Gradient Descent · Ensemble Methods · Feature Engineering · Hyperparameters
Intermediate
🧠

Deep Learning

Neural networks · CNNs for vision · RNNs/LSTMs for sequences · Transformers · Multi-head attention mechanism

Backpropagation · Attention · BatchNorm · Dropout
Advanced
💬

Large Language Models

GPT architecture · LLaMA · Pre-training & fine-tuning · RLHF · RAG systems · Prompt engineering · Quantization

Transformer Architecture · RAG · Fine-tuning · RLHF
Intermediate–Advanced
⚙️

ML Pipelines & MLOps

Training pipelines · Feature stores · Model registry · Inference serving · A/B testing · Monitoring & drift detection

Feature Store · Model Registry · Data Drift · Canary Deployment
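
One way to put a number on drift detection is the Population Stability Index (PSI), sketched below. PSI is chosen here purely for illustration; the guide does not say which drift metric it covers, and the bin count and the small floor for empty bins are assumptions of this example.

```python
import math


def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a new sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor empty bins so the logarithm below is always defined.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = list(range(100))
shifted = [v + 30 for v in baseline]  # simulated drift in the input feature
drift_score = psi(baseline, shifted)
```

A frequently quoted rule of thumb treats a PSI above roughly 0.2 as a significant shift, but thresholds should be tuned per feature.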
Intermediate
☁️

AI on AWS

SageMaker · Bedrock · Rekognition · Comprehend · Textract · EC2 GPU instances · Custom model hosting

SageMaker Pipelines · Bedrock APIs · Model Endpoints · GPU Instances
Advanced
🏛️

AI Systems Architecture

ChatGPT infrastructure · Midjourney architecture · Stable Diffusion pipeline · GitHub Copilot — how they're actually built

Inference at Scale · Diffusion Models · Vector Databases · Multi-modal Systems

ML Terminology Quick Reference

Hyperparameter: Set before training (learning rate, depth); not learned by the model
Loss Function: Measures how wrong the model is; gradient descent minimizes this
Epoch: One full pass through the entire training dataset
Regularization: L1/L2 penalties and dropout; reduce overfitting by constraining model complexity
Feature Engineering: Creating better input variables from raw data to improve predictions
AUC-ROC: Area under the ROC curve; higher = better discrimination between classes
Data Drift: Input distribution changes over time → model degrades in production
RLHF: Reinforcement Learning from Human Feedback; how ChatGPT is aligned
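
Several of the terms above (hyperparameter, loss function, epoch) show up together in even the smallest training loop. The sketch below fits a line with full-batch gradient descent on mean squared error; the data, learning rate, and epoch count are arbitrary values chosen for the example.

```python
def mse(w, b, xs, ys):
    """Loss function: mean squared error of the line y = w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_line(xs, ys, learning_rate=0.05, epochs=1000):
    """Fit y = w*x + b by gradient descent.

    learning_rate and epochs are hyperparameters: set before training, not learned.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):  # each epoch is one full pass over the dataset
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w  # step against the gradient to reduce the loss
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
w, b = fit_line(xs, ys)
```

With these settings the learned `w` and `b` approach 2 and 1, and `mse(w, b, xs, ys)` shrinks toward zero.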
AWS Architecture

AWS Well-Architected Framework

The 6 pillars of cloud excellence — from compute and storage selection to security, observability, and the end-to-end design process for production systems.

01
🏛️

Operational Excellence

Run operations as code · Small reversible changes · Game days · Post-mortems

CloudFormation · CloudTrail · Systems Manager · Config
02
🔐

Security

IAM least privilege · KMS encryption · WAF + Shield · GuardDuty threat detection

IAM · KMS · WAF · Shield · GuardDuty · Macie · Inspector
03
🔄

Reliability

Auto-recovery · Multi-AZ · Chaos testing · RDS failover · S3 11-nines durability

Route53 · ELB · Auto Scaling · RDS Multi-AZ · DynamoDB
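
Availability targets such as the "nines" quoted throughout this page are easier to reason about as a downtime budget. The sketch below converts a percentage into minutes per year and estimates the effect of redundant copies (as in a Multi-AZ setup), assuming independent failures, which real deployments only approximate. Both function names are made up for the example.

```python
def downtime_minutes_per_year(availability_percent):
    """Minutes of allowed downtime per (365-day) year for a given availability."""
    minutes_per_year = 365 * 24 * 60
    return minutes_per_year * (1 - availability_percent / 100)


def parallel_availability(single_availability, copies):
    """Availability of N redundant copies, assuming failures are independent."""
    return 1 - (1 - single_availability) ** copies


budget_four_nines = downtime_minutes_per_year(99.99)  # roughly 52.6 minutes
two_az = parallel_availability(0.99, 2)               # roughly 0.9999
```

The independence assumption is the weak point: shared dependencies such as a common control plane or a bad deploy can take out every copy at once.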
04

Performance Efficiency

Managed services · CloudFront CDN · ElastiCache · Lambda serverless · Global Accelerator

Lambda · Fargate · DAX · ElastiCache · CloudFront
05
💰

Cost Optimization

Spot Instances · Savings Plans · S3 Intelligent-Tiering · Right-sizing · Reserved instances

Cost Explorer · Budgets · Savings Plans · Compute Optimizer
06
🌿

Sustainability

AWS Graviton (ARM) · Right-size · Auto-scaling to demand · Lifecycle policies

Graviton3 · Auto Scaling · S3 Lifecycle · Compute Optimizer

AWS Service Selection Guide

Use Case | Primary Service | When to Use | Alternative
Simple API / functions | Lambda | Stateless, event-driven, <15 min per invocation | Fargate
Containerized microservices | ECS / EKS | Long-running containers, orchestration | Elastic Beanstalk
Key-value / document store | DynamoDB | High-scale NoSQL, <10ms p99 reads | ElastiCache
Relational database | Aurora | MySQL/PostgreSQL compatible, high write throughput | RDS
Session / cache | ElastiCache | Sub-millisecond reads, ephemeral data | DAX (DynamoDB)
Object storage | S3 | Unstructured data, backups, static sites | EFS for NFS
Full-text search | OpenSearch | Log analytics, search, dashboards | Kendra (ML search)
Message queue | SQS | Decoupled async processing | Kinesis (streaming)
Real-time streaming | Kinesis | High-throughput event streams | MSK (managed Kafka)
Global CDN | CloudFront | Static assets, API caching, DDoS protection | Global Accelerator
DNS + routing | Route53 | Domain mgmt, health checks, failover | CloudFront origins
Private networking | VPC | Isolated networking, private subnets | VPC Peering / TGW

Cost Optimization Quick Wins

Compute: Use Spot for non-critical batch (60–90% savings)
Compute: Savings Plans for baseline load (30–72% savings)
Compute: Right-size with Compute Optimizer (20–40% savings)
Storage: S3 Intelligent-Tiering (40–60% savings)
Database: RDS Reserved Instances (30–60% savings)
Networking: CloudFront + inter-AZ awareness (20–50% savings)
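
To see what these percentage ranges mean for a bill, the sketch below applies a savings range to a monthly spend. The $10,000 figure is a made-up example, and the ranges are taken from the list above as rough bounds; actual savings depend on the workload and commitment terms.

```python
def savings_range(monthly_spend, low_pct, high_pct):
    """Return (min_saving, max_saving) in the same currency as monthly_spend."""
    return monthly_spend * low_pct / 100, monthly_spend * high_pct / 100


def remaining_cost_range(monthly_spend, low_pct, high_pct):
    """Return (best_case_cost, worst_case_cost) after applying the savings range."""
    low_saving, high_saving = savings_range(monthly_spend, low_pct, high_pct)
    return monthly_spend - high_saving, monthly_spend - low_saving


# Hypothetical: $10,000/month of baseline compute moved to Savings Plans (30-72%).
print(savings_range(10_000, 30, 72))         # (3000.0, 7200.0)
print(remaining_cost_range(10_000, 30, 72))  # (2800.0, 7000.0)
```

Note that the percentages do not simply add up across rows: each applies only to the slice of spend it targets.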
Real-World Scenarios

End-to-End Project Scenarios

Full system designs with requirements, architecture diagrams, data flows, code, cost estimates, CI/CD, monitoring, security, and disaster recovery plans.

01
Retail / Marketplace

E-Commerce Platform

Microservices checkout · Flash sale handling · PCI compliance · Elasticsearch search · Cost estimation

ECS · DynamoDB · ElastiCache · SQS · Stripe
02
OTT / Media

Video Streaming

Live ingest pipeline · Transcoding at scale · Multi-CDN strategy · DRM encryption · Per-title encoding

MediaConvert · CloudFront · S3 · Elemental
03
Fintech

Payment Processing

Idempotent payment flows · Real-time fraud detection · PCI DSS compliance · Reconciliation pipelines

Lambda · SageMaker · Aurora · KMS · WAF
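
The "idempotent payment flows" item in this scenario can be sketched with an idempotency key: a retry carrying the same key returns the stored result instead of charging again. The `PaymentService` class, its in-memory dictionary, and the field names are hypothetical; a production system would need a durable store and an atomic check-and-insert.

```python
import uuid


class PaymentService:
    """Toy payment service that deduplicates requests by idempotency key."""

    def __init__(self):
        self._results = {}     # idempotency_key -> stored response
        self.charges_made = 0  # counts how many real charges were executed

    def charge(self, idempotency_key, account, amount_cents):
        # A repeated key means a client retry: replay the original response.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {
            "payment_id": str(uuid.uuid4()),
            "account": account,
            "amount_cents": amount_cents,
            "status": "captured",
        }
        self._results[idempotency_key] = result
        return result


service = PaymentService()
first = service.charge("order-123", "acct-1", 500)
retry = service.charge("order-123", "acct-1", 500)  # e.g. after a network timeout
```

Here `retry` is the same response as `first`, and `service.charges_made` stays at 1, so the customer is charged once.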
04
Communication

Real-Time Chat

WebSocket connections · Message routing at scale · E2E encryption · Group chat · Offline delivery

API Gateway WS · ElastiCache · DynamoDB · SQS
05
Food Tech

Food Delivery

H3 geo-indexing · Auto-assignment engine · ML ETA prediction · Surge pricing · Restaurant onboarding

H3 · SageMaker · Kinesis · DynamoDB
06
HealthTech

Telemedicine

SFU video calls · Appointment locking · E-prescriptions · NDHM compliance · HIPAA security

Chime SDK · RDS · KMS · Macie · SES
07
BI / Analytics

SaaS Analytics Platform

Event ingestion · Kafka/Flink streaming · Druid/Pinot OLAP · Multi-tenancy · Sub-second query SLA

MSK · Kinesis · Redshift · Athena · EMR
08
Mobility

Ride-Sharing Platform

Matching engine · H3 dispatch · Dynamic surge pricing · GPS ingestion pipeline · City sharding

Kinesis · ElastiCache · SageMaker · Aurora

Each Scenario Includes

📋 Requirements — Functional & non-functional
📐 Architecture Diagram — ASCII + conceptual
🔄 Data Flow — End-to-end request path
📈 Scaling Strategy — Horizontal & geo-distribution
💰 Cost Estimation — AWS pricing breakdown
🚀 CI/CD Pipeline — Deployment strategy
👁️ Monitoring — Metrics, logs, alerts
🔐 Security — Auth, encryption, compliance
🔁 Disaster Recovery — RTO/RPO targets
💻 Commented Code — Core service implementations
Read Code

Learn to Read Unfamiliar Codebases

Real, runnable example files with extensive inline comments — explaining what each line does, why it's written that way, and how the parts fit together.

Universal Reading Approach

1. IDENTIFY: README, docs, entry points
2. SKIM: Folder → files → functions
3. TRACE: Follow one feature end-to-end
4. ISOLATE: Zoom into one module
5. CONNECT: Map relationships & data flow

Java

HelloWorld.java · Student.java · Calculator.java · FileReaderExample.java · SimpleHttpServer.java
Classes · Objects · Methods · Exceptions · HTTP Servers · try-with-resources
🐍

Python

hello.py · data_analysis.py · file_processor.py · simple_api.py
Comprehensions · Lambdas · Context Managers · File I/O · FastAPI

JavaScript

basics.js · async-examples.js
Arrow Functions · Destructuring · Promises · async/await · Closures

TypeScript

basics.ts · generics.ts
Interfaces · Unions · Discriminated Unions · Generics · Utility Types

Go

hello.go · http-server.go · concurrency.go
Goroutines · Channels · WaitGroup · Interfaces · Structs

SQL

queries.sql
JOINs · GROUP BY · CTEs · Window Functions · Subqueries · Views

Terraform

main.tf · variables.tf · terraform.tfvars
Resources · Data Sources · Variables · Outputs · Dependencies

Shell / Bash

file-backup.sh
Shebang · Variables · Functions · Conditionals · Exit Codes · Error Handling

YAML / K8s

docker-compose.yml · kubernetes-deployment.yml
Services · Volumes · Deployments · HPAs · Readiness Probes

Spring Boot

DemoApplication.java · HelloController.java · UserService.java · application.yml
@RestController · @Service · @Transactional · Dependency Injection
Troubleshooting

DevOps Troubleshooting Reference

Root causes, diagnostic commands, and battle-tested solutions for the most common issues across Docker, Kubernetes, AWS, Linux, databases, and more.

🧠 Diagnostic Mindset — Before Jumping to Solutions

1. WHAT changed? Deploy, config, code, infra
2. WHEN did it start? Time correlation with events
3. WHO is affected? All users? One service? One region?
4. WHAT does broken look like? Error code, latency, crash
5. CAN we reproduce? Dev environment, curl command
6. IS there a rollback? Last known good state
💡 Always check logs first. Then metrics. Then code. The issue is usually NOT what you think.

Quick Reference: Most Common Issues by Symptom

Symptom | Most Likely Cause | First Command
Pod keeps crashing | OOMKilled / Resource limits too low | kubectl describe pod <name>
Build fails on CI | Dependency cache stale | Clear CI cache → retry
Terraform stuck | State lock not released | terraform force-unlock <id>
Slow website / high latency | DB connection pool exhausted | SHOW PROCESSLIST
Disk full on server | Container logs / Docker overlay2 | du -sh /var/log/*
Can't deploy to AWS | IAM permissions / missing role | Check CloudTrail → IAM policy
503 errors on endpoints | Health check failing on target | kubectl get endpoints
SSL / TLS error | Certificate expired or chain broken | openssl s_client -connect host:443
CI tests flaky | Test depends on external service | Mock external calls · Add retry
Can't SSH to server | Security group / firewall blocking 22 | nc -zv host 22
Replication lag growing | Long-running write transactions | SHOW REPLICA STATUS (SHOW SLAVE STATUS before MySQL 8.0.22)
OOM on Lambda | Memory limit too low / memory leak | Increase memory → add X-Ray tracing
🐳

Docker

Build failures · Runtime errors · Layer caching · Volume permissions · Network bridges · Registry auth

Kubernetes

Pod crashes · OOMKilled · Scheduling failures · CrashLoopBackOff · Service mesh · RBAC denials

☁️

AWS

Service quotas · IAM permission boundaries · VPC routing · Cost spikes · Throttling · Region failovers

🐧

Linux & Servers

Disk full · OOM killer · Process hangs · SSH issues · Zombie processes · systemd failures

🗄️

Databases

Connection pool exhausted · Slow queries · Replication lag · Deadlocks · Backup failures · Index bloat

🌐

Networking

DNS resolution failures · TLS/SSL cert errors · Connectivity timeouts · Latency spikes · Firewall rules

🔄

CI/CD Pipelines

Build failures · Secrets not injected · Cache invalidation · Self-hosted runner issues · Flaky tests

🌿

Git

Merge conflicts · Force push recovery · Lost commits · Submodule issues · Rebase gone wrong

📊

Monitoring & Logging

Missing metrics · Log ingestion drops · Alert fatigue · Dashboard gaps · Tracing blind spots

🏗️

Terraform / IaC

State locking · Provider errors · Configuration drift · Apply failures · Module versioning conflicts