Retrieval-Augmented Generation (RAG): Framework and Best Practices

Retrieval-Augmented Generation (RAG) is a technique that gives Large Language Models (LLMs) access to external, domain-specific data without the need for expensive fine-tuning.

1. The RAG Workflow

graph TD
    A[User Query] --> B[Embedding Model]
    B --> C[Vector Database]
    C -->|Semantic Search| D[Top-K Documents]
    D --> E[LLM Prompt]
    A --> E
    E --> F[Final Answer]

Ingestion: Documents are split into Chunks, converted into Embeddings (vectors), and stored in a Vector DB.
Retrieval: When a user asks a question, the query is converted into a vector. The system finds the most similar chunks in the Vector DB.
Augmentation: The retrieved chunks are added to the LLM prompt as "context".
Generation: The LLM is instructed to answer the question based only on the provided context.
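
The four steps above can be sketched end to end without any external services. This is a minimal illustration under stated assumptions, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the list `index` stands in for a vector database, and the sample chunks and query are made up for the example. The generation step is omitted because it depends on the LLM client in use.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Retrieval: return the k chunks most similar to the query vector."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Augmentation: place the retrieved chunks in the prompt as context."""
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        "Answer only using the provided context.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Ingestion: each "chunk" here is one made-up sentence, stored with its vector.
chunks = [
    "Items can be returned within 30 days of purchase.",
    "Shipping is free for orders over 50 dollars.",
    "Our support team is available on weekdays.",
]
index = [(c, embed(c)) for c in chunks]

query = "Can items be returned after purchase?"
top = retrieve(query, index, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

With a real embedding model, the ranking would reflect meaning and not only shared words; the overall shape of the pipeline stays the same.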

2. Key Components

📦 Vector Databases

Pinecone: Managed, high-performance.
Milvus / Weaviate: Open-source, scalable.
Chroma: Lightweight, ideal for local development.

✂️ Chunking Strategies

Fixed-size: Simple, but can cut off sentences mid-way.
Recursive Character: Splits on paragraphs, then sentences, then words to keep context intact.
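Semantic Chunking: Uses an embedding model (or an LLM) to find logical breaks in the text, for example by starting a new chunk where adjacent sentences become dissimilar.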
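
To make the difference between the first two strategies concrete, here is a simplified sketch in plain Python. The function names, the separator list, and the sample text are assumptions made for illustration; library splitters (such as LangChain's) add features like chunk overlap that are left out here.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut every `size` characters, regardless of sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def recursive_chunks(text: str, size: int, separators=("\n\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces.

    Separators that fall on a chunk boundary are dropped from the output.
    """
    if len(text) <= size:
        return [text] if text.strip() else []
    if not separators:
        return fixed_size_chunks(text, size)  # nothing left to split on
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= size:
            current = piece
        else:
            chunks.extend(recursive_chunks(piece, size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = (
    "RAG retrieves context. It then generates an answer.\n\n"
    "Chunking matters. Small chunks lose context."
)
fixed = fixed_size_chunks(sample, 40)
recursive = recursive_chunks(sample, 40)
print(fixed)      # the first chunk ends mid-sentence
print(recursive)  # every chunk is a whole sentence
```

Both functions respect the same 40-character limit, but only the recursive version keeps sentences whole in this sample.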

🧠 Embedding Models

OpenAI (text-embedding-3-small): Hosted API model; a common default choice.
BAAI (BGE-M3, available on Hugging Face): Open-weight, multilingual alternative that can run locally.

3. Challenges and Solutions

Challenge	Solution
Lost in the Middle	LLMs tend to overlook content in the middle of a long context. Re-rankers (such as Cohere Rerank) reorder the retrieved chunks so the most relevant ones come first.
Hallucinations	Use "Strict Prompting" (e.g., "Answer only using the provided context").
Stale Data	Implement a refresh pipeline to update vectors whenever source docs change.
Bad Retrieval	Use Hybrid Search (Semantic + Keyword/BM25).
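
As an illustration of the Hybrid Search row, one common way to merge a semantic ranking with a keyword ranking is Reciprocal Rank Fusion (RRF). The text above does not prescribe a fusion method, so RRF is an assumption here, as are the function name and the document IDs, which are hypothetical. The sketch below only combines two already-computed rankings; it does not run either search.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical outputs of two retrievers for the same query.
semantic_hits = ["doc_a", "doc_b", "doc_c"]  # from vector similarity
keyword_hits = ["doc_b", "doc_d", "doc_a"]   # from BM25 / keyword search

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever is still kept as a candidate. RRF uses only rank positions, so the two retrievers' raw scores do not need to be on the same scale.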

4. Evaluation: The RAGAS Framework

How do you know if your RAG is good? Use these metrics:
Faithfulness: Is the answer derived only from the context?
Answer Relevance: Does the answer actually address the user's question?
Context Precision: Were the retrieved chunks actually relevant?
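
To show what two of these metrics measure numerically, here is a simplified sketch. It assumes the relevance and support judgments are already available as hand-supplied booleans; an evaluation framework would typically obtain them from an LLM judge, and its exact formulas may differ from this version. The function names and inputs are illustrative.

```python
def context_precision(relevance: list[bool]) -> float:
    """Rank-aware precision over retrieved chunks, given in retrieval order.

    relevance[i] is True if the chunk at rank i+1 was judged relevant.
    Averages precision@k over the ranks k that hold a relevant chunk,
    so relevant chunks near the top score higher.
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of the answer's claims that are supported by the retrieved context."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# Same two relevant chunks, ranked differently.
print(round(context_precision([True, False, True]), 3))  # 0.833
print(round(context_precision([False, True, True]), 3))  # 0.583

# Three of four claims in the answer are backed by the context.
print(faithfulness([True, True, True, False]))  # 0.75
```

The first two calls show why ranking matters: the same set of relevant chunks scores lower when an irrelevant chunk is ranked first.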

5. Implementation Example (Python)

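# Assumes: pip install langchain-chroma langchain-community langchain-openai langchain-text-splitters pypdf
# and OPENAI_API_KEY set in the environment. Import paths vary across LangChain versions.
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Load and Chunk
docs = PyPDFLoader("my_data.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
chunks = splitter.split_documents(docs)

# 2. Create Vector DB (embeds every chunk and stores the vectors)
db = Chroma.from_documents(chunks, OpenAIEmbeddings(model="text-embedding-3-small"))

# 3. Retrieve the 3 chunks most similar to the query
query = "What is the return policy?"
results = db.similarity_search(query, k=3)
context = "\n\n".join(doc.page_content for doc in results)  # text only, not Document objects

# 4. Generate (Prompt)
prompt = (
    "Answer only using the provided context.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}"
)
llm = ChatOpenAI(model="gpt-4o-mini")  # model name is an assumption; any chat model works
response = llm.invoke(prompt)
print(response.content)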