Retrieval-Augmented Generation (RAG): Framework and Best Practices
Retrieval-Augmented Generation (RAG) is a technique that gives Large Language Models (LLMs) access to external, domain-specific data without the need for expensive fine-tuning.
1. The RAG Workflow
graph TD
A[User Query] --> B[Embedding Model]
B --> C[Vector Database]
C -->|Semantic Search| D[Top-K Documents]
D --> E[LLM Prompt]
A --> E
E --> F[Final Answer]
- Ingestion: Documents are split into Chunks, converted into Embeddings (vectors), and stored in a Vector DB.
- Retrieval: When a user asks a question, the query is converted into a vector. The system finds the most similar chunks in the Vector DB.
- Augmentation: The retrieved chunks are added to the LLM prompt as "context".
- Generation: The LLM is prompted to answer the question from the provided context instead of relying only on what it learned during training.
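The four stages above can be sketched end to end in a few lines. This is a minimal illustration under loud assumptions, not a production pipeline: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, the list `index` stands in for a Vector DB, the sample sentences are invented, and the final LLM call is left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Return the k chunks whose vectors are most similar to the query vector."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Augmentation: place the retrieved chunks in the prompt as context."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Ingestion: chunk (here, one invented sentence per chunk), embed, store.
chunks = [
    "Items can be returned within 30 days of purchase.",
    "Shipping is free for orders over 50 dollars.",
    "Our support team is available on weekdays.",
]
index = [(c, embed(c)) for c in chunks]

# Retrieval + augmentation; the resulting prompt would then be sent to an LLM.
query = "Can items be returned after purchase?"
top_chunks = retrieve(query, index, k=1)
prompt = build_prompt(query, top_chunks)
print(prompt)
```

A real system would swap `embed` for a learned embedding model, which matches on meaning instead of shared words, and replace the linear scan in `retrieve` with the Vector DB's approximate nearest-neighbor search.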
2. Key Components
📦 Vector Databases
- Pinecone: Managed, high-performance.
- Milvus / Weaviate: Open-source, scalable.
- Chroma: Lightweight, ideal for local development.
✂️ Chunking Strategies
- Fixed-size: Simple, but can cut off sentences mid-way.
- Recursive Character: Splits on paragraphs, then sentences, then words to keep context intact.
- Semantic Chunking: Uses embedding similarity between neighboring sentences (or, in some variants, an LLM) to find logical breaks in the text.
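The difference between the first two strategies is easier to see in code. The sketch below is a simplified illustration, not the implementation used by any particular library: the function names, the separator list, and the sample text are assumptions chosen for the example.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut every `size` characters, regardless of sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def recursive_chunks(text: str, size: int, separators=("\n\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones when a piece is too long."""
    if len(text) <= size:
        return [text] if text else []
    if not separators:
        return fixed_size_chunks(text, size)  # no separators left: fall back to hard cuts
    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    # Re-attach the separator so that joining the chunks reproduces the text.
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
    chunks, current = [], ""
    for piece in pieces:
        if len(current) + len(piece) <= size:
            current += piece  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= size:
            current = piece
        else:
            chunks.extend(recursive_chunks(piece, size, finer))
    if current:
        chunks.append(current)
    return chunks


text = "RAG retrieves context. It then augments the prompt.\n\nThe LLM generates the answer."
fixed = fixed_size_chunks(text, 40)
recursive = recursive_chunks(text, 40)
print(fixed)
print(recursive)
```

With a 40-character limit, the fixed-size splitter cuts the second sentence in half, while the recursive splitter falls back from paragraph breaks to sentence breaks and keeps each sentence whole.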
🧠 Embedding Models
- OpenAI (text-embedding-3-small): Widely used hosted model, accessed through an API.
- Hugging Face (BGE-M3): Open-source, multilingual alternative that can be self-hosted.
3. Challenges and Solutions
| Challenge | Solution |
|---|---|
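| Lost in the Middle | LLMs tend to overlook information buried in the middle of a long context. Re-rankers (like Cohere Rerank) reorder the retrieved chunks so the most relevant ones come first. |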
| Hallucinations | Use "Strict Prompting" (e.g., "Answer only using the provided context"). |
| Stale Data | Implement a refresh pipeline to update vectors whenever source docs change. |
| Bad Retrieval | Use Hybrid Search (Semantic + Keyword/BM25). |
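Hybrid Search needs some way to merge the semantic ranking with the keyword ranking. One common option, shown here as a hedged sketch and not as the only approach, is Reciprocal Rank Fusion (RRF). The document IDs and the two result lists are hypothetical; in practice they would come from the Vector DB and a BM25 index.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results for one query from two retrievers.
semantic_hits = ["doc_a", "doc_b", "doc_c"]  # vector similarity order
keyword_hits = ["doc_c", "doc_a", "doc_d"]   # BM25 order

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)
```

RRF only looks at rank positions, so the similarity scores and BM25 scores never have to be put on a common scale. Documents that both retrievers rank highly rise to the top.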
4. Evaluation: The RAGAS Framework
How do you know if your RAG is good? Use these metrics:
- Faithfulness: Is the answer derived only from the context?
- Answer Relevance: Does the answer actually address the user's question?
- Context Precision: Were the retrieved chunks actually relevant?
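To make Context Precision concrete, here is a simplified sketch of a rank-aware precision score. It is an approximation for illustration, not the RAGAS implementation: the function name is hypothetical, and the relevance labels are supplied by hand, whereas an evaluation framework would typically obtain them from an LLM judge or from ground-truth annotations.

```python
def context_precision(relevance: list[bool]) -> float:
    """Rank-weighted precision over retrieved chunks, given in retrieval order.

    Averages precision@k over the positions k that hold a relevant chunk, so
    relevant chunks ranked near the top raise the score.
    """
    hits = 0
    weighted = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / k  # precision@k at a relevant position
    return weighted / hits if hits else 0.0


# Same two relevant chunks, different positions in the top 3.
relevant_first = context_precision([True, False, True])
relevant_last = context_precision([False, True, True])
print(relevant_first, relevant_last)
```

Both retrievals contain two relevant chunks out of three, but the one that ranks a relevant chunk first scores higher. That sensitivity to ordering is what separates this kind of metric from a plain hit ratio.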
5. Implementation Example (Python)
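# Import paths assume a recent LangChain release with the split packages installed
# (langchain-community, langchain-openai, langchain-text-splitters, chromadb, pypdf);
# older releases exposed these under langchain.vectorstores / langchain.embeddings.
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
# 1. Load and Chunk
docs = PyPDFLoader("my_data.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
chunks = splitter.split_documents(docs)
# 2. Create Vector DB (embeds every chunk; requires OPENAI_API_KEY to be set)
db = Chroma.from_documents(chunks, OpenAIEmbeddings())
# 3. Retrieve
query = "What is the return policy?"
results = db.similarity_search(query, k=3)
# Join the chunk text; formatting the Document list directly would insert its repr
context = "\n\n".join(doc.page_content for doc in results)
# 4. Generate (Prompt)
prompt = f"Context: {context}\n\nQuestion: {query}"
llm = ChatOpenAI(model="gpt-4o-mini")  # model name is an assumption; any chat model works
response = llm.invoke(prompt)
print(response.content)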