Phase 5: Generative AI (GenAI) & LLMs
Generative AI refers to models that do not just classify or predict from existing data, but generate new text, images, or audio. Large Language Models (LLMs) such as GPT-4, Llama 3, and Claude are the dominant class of GenAI models for text.
1. The Transformer Backbone & Prompt Engineering
Nearly every modern LLM is built on the Transformer architecture. Instead of processing text strictly one word at a time, as earlier recurrent models did, a Transformer processes all input tokens in parallel and computes "attention" weights between every pair of tokens, so that each token's representation reflects its surrounding context.
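To make the attention idea concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplification under stated assumptions: real Transformers first apply learned projection matrices to produce queries, keys, and values, and run many attention heads in parallel. The toy vectors and the function names `softmax` and `self_attention` are illustrative only.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a short sequence of token vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token's query against every token's key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # The output is a weighted average of all value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-dimensional token vectors (assumed values), used as
# queries, keys, and values alike. No learned projections are applied.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` says how much one token attends to every other token. Here the first token attends more to the third token (which shares a component with it) than to the second.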
Prompt Engineering
Since we interact with LLMs via text rather than code, knowing how to structure that text is crucial.
- Role Prompting: "You are a senior Python engineer..."
- Few-Shot Prompting: Providing 2 or 3 examples of the desired output format before asking the question.
- Chain-of-Thought (CoT): Asking the model to "think step by step." This prompts the model to generate intermediate reasoning tokens, which often improves performance on math and logic tasks.
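The three techniques above can be combined in a single chat-style message list. The sketch below assumes the common `role`/`content` message format. The helper `build_messages` and the example strings are hypothetical, not part of any library.

```python
def build_messages(role, examples, question, chain_of_thought=True):
    """Assemble a chat message list from a role, worked examples, and a question."""
    system = role
    if chain_of_thought:
        # Chain-of-Thought: ask for intermediate reasoning before the answer.
        system += " Think step by step before giving the final answer."
    messages = [{"role": "system", "content": system}]
    # Few-Shot: each example becomes a user/assistant pair showing the format.
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": question})
    return messages

messages = build_messages(
    role="You are a senior Python engineer.",  # Role Prompting
    examples=[
        ("Reverse the string 'abc'.", "'abc'[::-1]"),
        ("Uppercase the string 'abc'.", "'abc'.upper()"),
    ],
    question="Strip whitespace from the string ' abc '.",
)
```

Supplying examples as prior user/assistant turns is one option. Another is to write them directly into the system prompt.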
Example 1: Advanced Prompting with the OpenAI API
import os
from openai import OpenAI

# Initialize client (requires OPENAI_API_KEY in environment variables)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def analyze_customer_feedback(review_text: str):
    """
    Uses Few-Shot and Chain-of-Thought prompting to extract structured data.
    """
    system_prompt = """
    You are an expert customer success analyst.
    Extract the sentiment, the core product mentioned, and actionable advice.
    Think step-by-step before providing the final JSON output.

    Example 1:
    Input: "The new battery life on the ProMax phone is awful, it died in 4 hours."
    Thought: The user is angry about the battery life of the ProMax. Actionable advice would be to investigate the battery drain issue.
    JSON: {"sentiment": "Negative", "product": "ProMax phone", "actionReq": "Investigate battery drain"}
    """

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": review_text}
        ],
        temperature=0.2  # Low temperature for more deterministic, factual output
    )
    return response.choices[0].message.content

# Example Usage
# review = "I absolutely love the noise cancellation on the Studio headphones, but the hinge feels flimsy."
# print(analyze_customer_feedback(review))
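Because the prompt asks for a "Thought:" line followed by a "JSON:" line, the returned string is not pure JSON and needs a small parsing step before downstream code can use it. The sketch below is one possible approach. `extract_json` is a hypothetical helper, and `sample_reply` is an invented reply used only for illustration, not real model output.

```python
import json

def extract_json(reply: str):
    """Return the last JSON object found in the reply, or None if there is none."""
    decoder = json.JSONDecoder()
    result = None
    position = reply.find("{")
    while position != -1:
        try:
            # raw_decode parses one JSON value starting at `position`
            # and reports the index where that value ends.
            candidate, end = decoder.raw_decode(reply, position)
            if isinstance(candidate, dict):
                result = candidate
            position = reply.find("{", end)
        except json.JSONDecodeError:
            # Not valid JSON at this brace; try the next one.
            position = reply.find("{", position + 1)
    return result

# Invented reply in the format the system prompt requests (assumption).
sample_reply = (
    "Thought: The user likes the noise cancellation but dislikes the hinge.\n"
    'JSON: {"sentiment": "Mixed", "product": "Studio headphones", '
    '"actionReq": "Review hinge durability"}'
)
parsed = extract_json(sample_reply)
```

Returning `None` instead of raising lets the caller decide whether to retry the request or log the raw reply.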
2. RAG (Retrieval-Augmented Generation)
LLMs have two major flaws: they hallucinate (confidently make things up), and their knowledge is frozen at their training cutoff date. RAG mitigates both by grounding answers in documents retrieved at query time.
The RAG Architecture:
1. Ingest: Split proprietary company documents (PDFs, Confluence pages) into chunks, convert each chunk into a vector (an embedding), and store the vectors in a vector database (like ChromaDB or Pinecone).
2. Retrieve: When a user asks a question, convert the question into a vector and search the database for the most similar document chunks.
3. Generate: Pass the user's question plus the retrieved document chunks to the LLM with an instruction such as "Answer this question strictly using the provided documents."
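The three steps above can be sketched without any external services. This toy version replaces a real embedding model with a bag-of-words count vector, and a vector database with a plain list, so it only illustrates the flow. The functions `embed`, `retrieve`, and `build_prompt` and the sample policy sentences are all hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, chunks, k=2):
    """Rank stored chunks by similarity to the question and keep the top k."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question, retrieved):
    """Combine the retrieved chunks and the question into one grounded prompt."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved)
    return (
        "Answer this question strictly using the provided documents.\n"
        f"Documents:\n{context}\n"
        f"Question: {question}"
    )

# 1. Ingest: invented sample chunks standing in for company documents.
chunks = [
    "Employees receive 20 vacation days per year.",
    "The office is closed on public holidays.",
    "Expense reports are due by the fifth of each month.",
]
# 2. Retrieve, then 3. build the prompt that would be sent to the LLM.
question = "How many vacation days do employees receive?"
top_chunks = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top_chunks)
```

Real embeddings capture meaning rather than exact word overlap, so they can match a question to a chunk even when the two share few words.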
Example 2: Enterprise RAG using LangChain and ChromaDB
import os
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.prompts import PromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

def setup_corporate_rag(document_path: str):
    """
    Builds a vector database from a document and creates a QA chain.
    """
    # 1. Load Document
    loader = TextLoader(document_path)
    docs = loader.load()

    # 2. Split Document (LLMs have context windows, we can't feed an entire book at once)
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
    splits = text_splitter.split_documents(docs)

    # 3. Create Vector Embeddings and store in ChromaDB
    vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

    # 4. Create Retriever (fetches top 3 most relevant chunks)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

    # 5. Setup LLM and Strict Prompt
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    template = """
    Use the following pieces of context to answer the question at the end.
    If you don't know the answer based on the context, say "I don't know".
    DO NOT make up an answer.
    Context: {context}
    Question: {input}
    Answer:"""
    prompt = PromptTemplate.from_template(template)

    # 6. Combine Retrieval and Generation
    combine_docs_chain = create_stuff_documents_chain(llm, prompt)
    rag_chain = create_retrieval_chain(retriever, combine_docs_chain)
    return rag_chain

# Example Usage:
# Open file 'hr_policy.txt' and write some dummy policies first
# chain = setup_corporate_rag("hr_policy.txt")
# response = chain.invoke({"input": "How many vacation days do I get?"})
# print(response["answer"])
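The `chunk_size` and `chunk_overlap` settings in step 2 are easier to see on a tiny input. The sketch below is a simplified fixed-window splitter, not LangChain's algorithm: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line breaks, which this version ignores. The function `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that overlap their neighbours."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
# chunks == ["abcdefghij", "hijklmnopq", "opqrstuvwx", "vwxyz"]
```

The last three characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still seen whole in at least one chunk when the overlap is large enough.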
3. Local Models & Open Source Ecosystem
You don't always want to send sensitive company data to a hosted API such as OpenAI's. The open-source ecosystem (led by Meta's Llama series and Mistral) lets you run capable LLMs entirely on your own hardware.
Example 3: Running a Local LLM via HuggingFace
This requires a machine with a decent GPU: an 8B-parameter model in bfloat16 needs about 16 GB of VRAM for the weights alone.
import torch
from transformers import pipeline

def load_local_generator():
    """
    Loads a smaller open-source model locally using HuggingFace.
    """
    # Using Llama-3-8B-Instruct (Assuming you have accepted the license on HF)
    # or Mistral-7B-Instruct
    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

    # The pipeline abstracts away tokenizer and device placement
    # device_map="auto" automatically pushes layers to GPU if available
    # torch_dtype=torch.bfloat16 reduces memory footprint by 50% vs float32
    generator = pipeline(
        "text-generation",
        model=model_id,
        device_map="auto",
        torch_dtype=torch.bfloat16
    )
    return generator

# Example Usage
# local_llm = load_local_generator()
#
# messages = [
#     {"role": "system", "content": "You are a helpful coding assistant."},
#     {"role": "user", "content": "Write a python function to compute the Fibonacci sequence."}
# ]
#
# # Generation parameters:
# # max_new_tokens limits the response length
# # do_sample=True allows for creative generation rather than strict greedy decoding
# output = local_llm(messages, max_new_tokens=256, do_sample=True, temperature=0.7)
#
# print(output[0]['generated_text'][-1]['content'])
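The memory figures in this example follow from simple arithmetic: parameter count times bytes per parameter. The sketch below assumes a rounded count of 8 billion parameters and covers the weights only. Activations, the attention cache, and framework overhead add to the total. The helper `weight_memory_gb` is hypothetical.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAMETER = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}

def weight_memory_gb(num_parameters, dtype):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * BYTES_PER_PARAMETER[dtype] / 1e9

# Rounded parameter count for an 8B model (assumption for illustration).
full = weight_memory_gb(8_000_000_000, "float32")   # 32.0
half = weight_memory_gb(8_000_000_000, "bfloat16")  # 16.0
```

Halving the bytes per parameter halves the weight memory, which is why bfloat16 is a common default for local inference.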