AI/ML System Design Patterns
Ten battle-tested architecture patterns for building production AI/ML systems on AWS. Each pattern covers the problem, solution architecture, and rationale for key design decisions.
1. RAG-Powered Chatbot for Enterprise Knowledge Base
Problem
A large enterprise has 50K+ internal documents (PDFs, wikis, Confluence). Employees spend 2+ hours/day searching for information. A standard LLM can't answer questions about internal data.
Solution
Services: Amazon Bedrock (Claude) + OpenSearch Serverless + LangChain + Lambda + S3 + API Gateway + DynamoDB
User --[API Gateway]--> Lambda --[query]--> OpenSearch (vector store)
|--[context+query]--> Bedrock (Claude)
|--[answer]--> DynamoDB (chat history)
|--[response]--> User
Why This Solution
RAG (Retrieval-Augmented Generation) grounds LLM responses in enterprise data. Documents are chunked, embedded, and indexed in an OpenSearch Serverless vector store. When a user asks a question, the system retrieves the top-K most relevant chunks and injects them into the LLM prompt as context. This reduces hallucination, because the model answers from retrieved passages rather than from memory alone, and it keeps data private, since documents are supplied at query time rather than used to train the model. OpenSearch Serverless auto-scales with query volume, so there is no cluster or GPU infrastructure to manage. LangChain handles orchestration, document chunking, and prompt templating.
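The retrieve-then-inject step can be sketched in plain Python. This is a minimal illustration, not the production path: an in-memory list stands in for the OpenSearch vector index, the 3-dimensional vectors are made-up stand-ins for real embeddings, and the names `top_k_chunks`, `build_prompt`, and `CHUNKS` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_chunks(query_vec, chunks, k=2):
    """Return the k chunk texts most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return [c["text"] for c in ranked[:k]]

def build_prompt(question, context_chunks):
    """Place the retrieved chunks ahead of the question in the prompt."""
    context = "\n\n".join(f"[{i + 1}] {text}" for i, text in enumerate(context_chunks))
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Toy data (assumption): hypothetical chunks with made-up embeddings.
CHUNKS = [
    {"text": "VPN access requires a hardware token.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Expense reports are due on the 5th.", "vector": [0.0, 0.2, 0.9]},
    {"text": "Tokens are issued by the IT help desk.", "vector": [0.8, 0.3, 0.1]},
]

selected = top_k_chunks([1.0, 0.2, 0.0], CHUNKS, k=2)
prompt = build_prompt("How do I get VPN access?", selected)
```

In the architecture above, the ranking would be done by the vector store and `prompt` would be sent to the model through Bedrock; only the shape of the flow is shown here.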
2. Real-Time Fraud Detection System
Problem
A payment processor handles 10M transactions/day. Current rule-based system catches only 60% of fraud with a 5% false positive rate. Need real-time scoring in <100ms.
Solution
Services: Kinesis + SageMaker + Lambda + DynamoDB + QuickSight + S3 + EventBridge + SNS
Transaction --[Kinesis]--> Lambda --[feature vector]--> SageMaker endpoint
|--[score]--> DynamoDB (history)
|--[alert]--> EventBridge --> SNS
Why This Solution
SageMaker real-time endpoints serve the model; the target here is <50ms inference latency, which leaves headroom under the <100ms end-to-end budget. Kinesis streams transactions in real time. Lambda pre-processes each transaction and calls the endpoint. DynamoDB stores transaction history and feature values. The model (XGBoost or deep learning) detects patterns that rules can't catch. With an auto scaling policy configured, SageMaker scales the endpoint with traffic. The model is retrained weekly on newly labeled data.
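The Lambda step (build a feature vector, get a score, act on it) might look like the sketch below. The feature set, the feature order, the thresholds, and the function names are all assumptions for illustration. The `invoke` argument is any callable that takes a CSV row and returns a fraud probability; in a deployed function it would typically wrap the SageMaker runtime `invoke_endpoint` call with `ContentType="text/csv"`.

```python
def to_csv_payload(txn, avg_amount_30d):
    """Serialize one transaction as a CSV feature row (feature order is assumed)."""
    features = [
        txn["amount"],
        round(txn["amount"] / avg_amount_30d, 4) if avg_amount_30d else 0.0,
        int(txn["country"] != txn["card_country"]),  # 1 if cross-border
        txn["hour"],
    ]
    return ",".join(str(f) for f in features)

def decide(score, block_at=0.9, review_at=0.6):
    """Map a fraud probability to an action (thresholds are hypothetical)."""
    if score >= block_at:
        return "block"
    if score >= review_at:
        return "review"
    return "allow"

def score_transaction(txn, avg_amount_30d, invoke):
    """Score one transaction using an injected model-invocation callable."""
    payload = to_csv_payload(txn, avg_amount_30d)
    score = float(invoke(payload))
    return {"txn_id": txn["id"], "score": score, "action": decide(score)}
```

Injecting `invoke` keeps the scoring logic testable without a live endpoint, and the two thresholds make the trade-off between caught fraud and false positives an explicit, tunable setting.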
3. Document Processing Pipeline (OCR + NLP)
Problem
An insurance company receives 50K claims/day as PDFs, images, and scanned documents. Manual data entry takes 3 days per claim and has a 15% error rate.
Solution
Services: Textract + Comprehend + Step Functions + DynamoDB + S3 + Lambda + Bedrock
Upload --[S3 trigger]--> Textract (OCR) --> Comprehend (entities)
|--> Bedrock (ambiguity resolution)
|--> DynamoDB (structured output)
Why This Solution
Textract extracts text, forms, and tables from documents with >95% accuracy. Comprehend performs entity extraction (names, dates, policy numbers). Step Functions orchestrates the multi-step workflow with error handling and retries. Bedrock (Claude) handles complex reasoning for ambiguous cases that NLP can't resolve. The pipeline processes each claim in <2 minutes end-to-end vs. 3 days manually.
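One way to decide which claims count as "ambiguous" is to route on extraction confidence. The sketch below assumes the OCR and entity steps have already been flattened into a dict of `{"value", "confidence"}` entries on a 0 to 100 scale; the field names, the threshold, and the step names are hypothetical.

```python
# Hypothetical set of fields a claim must contain before it can be stored.
REQUIRED_FIELDS = ("policy_number", "claimant_name", "incident_date")

def route_claim(fields, min_confidence=90.0):
    """Split extracted fields into accepted values and ones needing LLM review."""
    accepted, ambiguous = {}, []
    for name in REQUIRED_FIELDS:
        field = fields.get(name)
        if field is None or field["confidence"] < min_confidence:
            ambiguous.append(name)  # missing or low-confidence extraction
        else:
            accepted[name] = field["value"]
    next_step = "bedrock_resolution" if ambiguous else "store_in_dynamodb"
    return {"accepted": accepted, "ambiguous": ambiguous, "next_step": next_step}
```

A Step Functions Choice state could branch on `next_step`, so that only claims with missing or low-confidence fields incur the cost of an LLM call.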
4. Personalized Product Recommendation Engine
Problem
An e-commerce platform with 10M products wants to increase conversion rate. Current "popular items" recommendations lead to a 2% CTR. Need personalized real-time recommendations.
Solution
Services: SageMaker + Personalize + ElastiCache + DynamoDB + API Gateway + CloudFront + Lambda
User --[CloudFront]--> API Gateway --> Lambda --> Personalize (real-time)
|--> ElastiCache (hot cache)
|--> DynamoDB (user profile)
Why This Solution
Amazon Personalize provides real-time and batch recommendations without requiring ML expertise; it is based on the same technology used at Amazon.com. SageMaker trains custom models for cold-start scenarios (new users, new items). ElastiCache caches popular recommendations so that repeat requests are served with sub-millisecond reads instead of a fresh Personalize call. Personalize supports multiple recommendation types: related items, frequently bought together, and personalized ranking.
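The caching layer follows a cache-aside pattern, sketched below under simplifying assumptions: a dict stands in for ElastiCache, `fetch` is any callable that returns recommendations for a user (in practice a wrapper around the Personalize runtime), and the class name and TTL are hypothetical.

```python
import time

class RecommendationCache:
    """Cache-aside wrapper: serve from cache while fresh, otherwise refetch."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch        # callable: user_id -> list of item ids
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._store = {}           # user_id -> (expires_at, items)

    def get(self, user_id):
        now = self._clock()
        entry = self._store.get(user_id)
        if entry is not None and entry[0] > now:
            return entry[1]        # cache hit
        items = self._fetch(user_id)  # cache miss or expired entry
        self._store[user_id] = (now + self._ttl, items)
        return items
```

The TTL bounds how stale a recommendation can be; a shorter TTL keeps results fresher at the cost of more calls to the recommender.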
5. Video Content Moderation Pipeline
Problem
A UGC platform gets 100K video uploads/day. 5% contain inappropriate content. Manual moderation doesn't scale and exposes reviewers to harmful content.
Solution
Services: Rekognition + Step Functions + Lambda + DynamoDB + S3 + SNS + SQS + A2I (Human-in-Loop)
Upload --[S3 trigger]--> Rekognition (auto-moderation)
95% clear --> DynamoDB (approved)
5% borderline --> A2I (human review)
Why This Solution
Rekognition Video detects unsafe content, celebrities, and text in videos. It handles the roughly 95% of uploads that are clear-cut automatically; the remaining 5% (borderline cases) go to Augmented AI (A2I) for human review. DynamoDB tracks moderation status. Step Functions orchestrates the workflow, which makes it easy to change steps as policies evolve.
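The split between automatic decisions and human review can be expressed as a confidence-band rule. The sketch below assumes the input is the `ModerationLabels` list returned by Rekognition's video content moderation results, where each entry carries a `ModerationLabel` with a `Confidence` between 0 and 100; the two thresholds and the outcome names are hypothetical policy settings.

```python
def route_video(moderation_labels, reject_at=95.0, review_at=60.0):
    """Route a video based on its highest-confidence moderation label."""
    top = max(
        (entry["ModerationLabel"]["Confidence"] for entry in moderation_labels),
        default=0.0,  # no labels detected at all
    )
    if top >= reject_at:
        return "rejected"
    if top >= review_at:
        return "human_review"  # borderline: hand off to A2I
    return "approved"
```

Moving either threshold changes how much traffic reaches human reviewers, so the two values work well as configuration rather than hard-coded constants.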
6. Real-Time Voice/Speech Analytics
Problem
A contact center receives 50K calls/day. Need to analyze sentiment, detect escalation keywords, and generate transcript summaries in real-time to help agents.
Solution
Services: Amazon Connect + Transcribe + Comprehend + Lambda + DynamoDB + Bedrock + QuickSight
Call audio --[Connect]--> Transcribe (speech-to-text)
|--> Comprehend (real-time sentiment)
|--> Bedrock (summaries & suggested responses)
|--> QuickSight (manager dashboards)
Why This Solution
Transcribe provides real-time speech-to-text with speaker diarization. Comprehend scores the sentiment of transcript segments as they arrive. Bedrock generates call summaries and suggested responses. Lambda triggers actions based on detected keywords (e.g., "I want to cancel" → notify retention team). Managers get real-time dashboards in QuickSight.
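The keyword-triggered actions can be sketched as a small phrase-to-action table applied to each transcript segment. Only the "I want to cancel" rule comes from the text above; the second rule, the action names, and the function name are hypothetical.

```python
# Phrase -> action table. The second entry is an illustrative assumption.
ESCALATION_RULES = {
    "i want to cancel": "notify_retention_team",
    "speak to a manager": "notify_supervisor",
}

def detect_actions(transcript_segment):
    """Return the actions whose trigger phrase appears in the segment."""
    # Lowercase and collapse whitespace so matching ignores case and spacing.
    text = " ".join(transcript_segment.lower().split())
    return [action for phrase, action in ESCALATION_RULES.items() if phrase in text]
```

Exact phrase matching is deliberately simple; it misses paraphrases such as "close my account", which is where the sentiment and LLM steps complement it.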
7. Predictive Maintenance for Industrial Equipment
Problem
A manufacturing plant has 10K machines. Unexpected failures cause $1M/hour in downtime. Current maintenance is schedule-based, replacing parts too early or too late.
Solution
Services: IoT Core + SageMaker + Timestream + Grafana + Lambda + S3 + EventBridge
Sensors --[IoT Core]--> Timestream (time-series store)
|--> SageMaker (anomaly detection)
|--> Grafana (real-time dashboards)
|--> EventBridge (alerts)
Why This Solution
IoT Core ingests sensor data. Timestream stores the time-series data and provides built-in analytics functions. SageMaker trains anomaly detection models on historical sensor data; its built-in Random Cut Forest algorithm is purpose-built for anomaly detection and is unsupervised, so recorded failures can be used to validate the model and tune alert thresholds rather than as training labels. Grafana provides real-time dashboards. The system predicts failures 48–72 hours in advance with >90% accuracy.
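To make the idea of anomaly scoring concrete, the sketch below flags readings that deviate sharply from a trailing window. This is a rolling z-score, a much simpler stand-in chosen for illustration; it is not Random Cut Forest and not the SageMaker API. The window size and threshold are assumptions.

```python
from collections import deque
from statistics import mean, pstdev

def rolling_anomalies(readings, window=5, threshold=3.0):
    """Return indices of readings more than `threshold` standard deviations
    away from the mean of the preceding `window` readings."""
    history = deque(maxlen=window)
    flagged = []
    for i, value in enumerate(readings):
        if len(history) == window:  # wait until the window is full
            mu, sigma = mean(history), pstdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                flagged.append(i)
        history.append(value)
    return flagged

# Hypothetical temperature readings with one spike at index 6.
temps = [70.0, 70.5, 69.8, 70.2, 70.1, 70.3, 85.0, 70.0]
spikes = rolling_anomalies(temps)
```

A point detector like this only catches abrupt spikes. Predicting failures days ahead, as described above, requires a model that picks up gradual drift across several sensors.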
8. MLOps Continuous Training Pipeline
Problem
A recommendation model's accuracy degrades 5% per week as user behavior shifts. Manual retraining takes 3 days and requires engineer intervention each time.
Solution
Services: SageMaker Pipelines + Lambda + CodeCommit + Step Functions + Model Registry + CloudWatch
Schedule/Drift detection --[Step Functions]--> SageMaker Pipelines
|--> Model Registry (approval gate)
|--> Canary deployment
|--> CloudWatch (monitoring)
Why This Solution
SageMaker Pipelines automates the entire ML workflow: data validation, training, evaluation, and deployment. Step Functions triggers a pipeline run on a schedule or when data drift is detected. Model Registry tracks model versions and enforces approval gates before deployment. Canary deployments roll back automatically if CloudWatch metrics degrade. Automated retraining keeps accuracy stable without an engineer running each retraining cycle by hand.
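The three decision points in this pipeline (when to retrain, whether to approve a candidate, whether to keep a canary) reduce to simple comparisons. The sketch below shows them as pure functions; the metric names, tolerances, and function names are assumptions, and in practice the inputs would come from the evaluation step and CloudWatch.

```python
def should_retrain(baseline_accuracy, current_accuracy, max_drop=0.02):
    """Trigger retraining once accuracy falls more than max_drop below baseline."""
    return (baseline_accuracy - current_accuracy) > max_drop

def gate_candidate(candidate_accuracy, production_accuracy, min_gain=0.0):
    """Approval gate: the candidate must beat production by more than min_gain."""
    if candidate_accuracy - production_accuracy > min_gain:
        return "Approved"
    return "Rejected"

def canary_verdict(canary_error_rate, baseline_error_rate, tolerance=0.01):
    """Roll back when the canary's error rate exceeds baseline plus tolerance."""
    if canary_error_rate > baseline_error_rate + tolerance:
        return "rollback"
    return "promote"
```

Keeping these rules as small functions makes the promotion policy reviewable and unit-testable separately from the pipeline definition.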
9. Intelligent Document Search and Discovery
Problem
A law firm has 5M legal documents. Keyword search misses relevant results 40% of the time. Associates spend hours manually finding related cases.
Solution
Services: OpenSearch (k-NN) + Bedrock + Lambda + S3 + Textract + Neptune
Document --[Textract]--> Bedrock (embeddings) --> OpenSearch k-NN
--[Textract]--> Bedrock (entities) --> Neptune (knowledge graph)
Query --[OpenSearch k-NN]--> (semantic results)
--[Neptune]--> (graph traversal results)
Why This Solution
k-NN in OpenSearch enables semantic search using embeddings. Bedrock generates embeddings and summarizes results. Textract extracts text from scanned documents. Neptune builds a knowledge graph of document relationships, citations, and entities. Combined vector + graph search provides much better recall than keyword search alone.
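The text does not say how the vector and graph result lists are merged. One common option is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the method this pattern prescribes. The document IDs are made up, and `k=60` is the conventional default for the RRF constant.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists: each document scores the sum of 1 / (k + rank)
    over every list in which it appears; higher totals rank first."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by document id for stability.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical ranked results from each backend.
vector_hits = ["case-12", "case-7", "case-3"]   # semantic k-NN results
graph_hits = ["case-7", "case-44"]              # graph traversal results
merged = reciprocal_rank_fusion([vector_hits, graph_hits])
```

RRF uses only rank positions, so it needs no calibration between k-NN similarity scores and graph-derived scores, which are not directly comparable.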
10. Generative AI Content Creation Pipeline
Problem
A marketing agency needs to produce 1000 personalized email campaigns/day. Manual copywriting takes 30 min/email. Need consistent brand voice across all content.
Solution
Services: Bedrock + Step Functions + S3 + DynamoDB + Lambda + API Gateway + Comprehend + SNS + SES
Trigger --[API Gateway]--> Step Functions
|--> Lambda: fetch customer segments
|--> Bedrock: generate personalized copy
|--> Comprehend: quality & sentiment check
|--> SNS: human approval (high-value sends)
|--> SES: send campaign
Why This Solution
Bedrock (Claude) generates personalized email copy for each customer segment from prompt templates. Step Functions orchestrates the pipeline: fetch customer data → generate content → check quality with Comprehend sentiment → request human approval via SNS for high-value sends → send through SES. DynamoDB stores generation metadata. S3 stores templates and assets. The pipeline generates 1000 emails in <5 minutes with a consistent brand voice.
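The prompt-template and quality-gate steps can be sketched as follows. The template wording, the default tone, and the rule that defines a "high-value" send by audience size are all assumptions. The sentiment labels match the values Comprehend returns (POSITIVE, NEGATIVE, NEUTRAL, MIXED).

```python
from string import Template

# Hypothetical brand-voice template; a real one would live in S3.
EMAIL_PROMPT = Template(
    "Write a marketing email in a $tone voice for the '$segment' segment. "
    "Highlight: $offer. Keep it under $max_words words."
)

def render_prompt(segment, offer, tone="friendly, concise", max_words=120):
    """Fill the shared template so every campaign uses the same brand voice."""
    return EMAIL_PROMPT.substitute(
        segment=segment, offer=offer, tone=tone, max_words=max_words
    )

def next_step(sentiment, audience_size, high_value_threshold=10000):
    """Choose the next pipeline state after the sentiment check."""
    if sentiment == "NEGATIVE":
        return "regenerate"
    if audience_size >= high_value_threshold:
        return "human_approval"  # high-value send (assumed definition)
    return "send"
```

For example, `render_prompt("lapsed subscribers", "20% off renewal")` produces the prompt text sent to the model, and `next_step` maps naturally onto a Step Functions Choice state.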