Phase 4: Natural Language Processing (NLP)

Natural Language Processing (NLP) focuses on enabling computers to understand, interpret, and generate human language. It is the bridge between human communication and computer understanding.

1. Text Preprocessing (The Old School NLP)

Before the deep learning era, NLP involved heavy manual manipulation of text to turn string data into mathematical representations that models could understand.

Key Concepts:
- Tokenization: Breaking text down into smaller pieces (words, subwords, or characters).
- Stopword Removal: Deleting extremely common words ("the", "is", "in") that carry little semantic value.
- Stemming / Lemmatization: Reducing words to a base form. Stemming strips suffixes with simple rules (e.g., "running" and "runs" become "run"), while lemmatization uses vocabulary and grammar to handle irregular forms as well (e.g., "ran" becomes "run").
- TF-IDF (Term Frequency-Inverse Document Frequency): A statistical measure of how important a word is to a document within a collection. A word that appears frequently in one document but rarely across the other documents receives a high score.
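
To make these steps concrete, here is a minimal pure-Python sketch of tokenization, stopword removal, and TF-IDF scoring. The stopword list, the function names, and the plain tf * log(N / df) formula are assumptions chosen for illustration; libraries such as scikit-learn use a smoothed and normalized variant, so their numbers will differ. Stemming and lemmatization are left out to keep the sketch short.

```python
import math
import re
from collections import Counter

# Hypothetical toy stopword list; real libraries ship far longer ones.
STOPWORDS = {"the", "is", "in", "a", "on", "and", "this"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def tf_idf(docs):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [remove_stopwords(tokenize(d)) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    "The camera is incredible",
    "The battery is terrible",
    "The camera and the battery",
]
scores = tf_idf(docs)
for doc, doc_scores in zip(docs, scores):
    print(doc, "->", doc_scores)
```

In the first document, "incredible" scores higher than "camera" because "camera" also appears in the third document, which is the behavior the TF-IDF definition above describes.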

Example 1: Text Classification with TF-IDF Features using scikit-learn

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# 1. Sample Dataset
documents = [
    "I love using this new smartphone, the camera is incredible!",
    "The battery life on this device is terrible, I hate it.",
    "This laptop is okay, gets the job done but slow.",
    "An absolutely amazing experience, highly recommended."
]
labels = ["Positive", "Negative", "Neutral", "Positive"]

# 2. Create an NLP pipeline
# TfidfVectorizer handles tokenization, lowercase conversion, and TF-IDF scoring implicitly
model = make_pipeline(
    TfidfVectorizer(stop_words='english', max_features=1000), # Keep at most the 1,000 most frequent terms
    MultinomialNB()  # Naive Bayes is a fast, strong baseline for simple text classification
)

# 3. Train
model.fit(documents, labels)

# 4. Predict
new_texts = ["The screen quality is amazing I love it", "Worst purchase ever, terrible."]
predictions = model.predict(new_texts)
print(f"Predictions: {list(zip(new_texts, predictions))}")

2. Word Embeddings (The Deep Learning Shift)

TF-IDF treats words as independent symbols. It doesn't know "King" and "Queen" are related. Word Embeddings solve this by representing words as dense mathematical vectors in high-dimensional space. Words with similar meanings are grouped closely together in this space.

Key Concepts:
- Word2Vec & GloVe: Early, highly successful static embedding models. They map each word to a single fixed-size vector (commonly 50 to 300 dimensions).
- The Context Problem: Static embeddings are limited because words can have multiple meanings. The word "Bank" in "Bank of America" and in "River Bank" gets exactly the same vector in GloVe. This limitation motivated Contextual Embeddings (Transformers).
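
The idea of "closeness" between word vectors is usually measured with cosine similarity. The sketch below uses hand-picked 3-dimensional vectors (hypothetical values, not real Word2Vec or GloVe output) to show both the similarity measure and the context problem: a static lookup returns the same vector for "bank" regardless of the surrounding words.

```python
import math

# Hypothetical 3-dimensional vectors chosen by hand for illustration;
# real Word2Vec or GloVe vectors are learned from large corpora.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
    "bank":  [0.5, 0.5, 0.5],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def embed(sentence):
    """Static lookup: each known word maps to one vector, whatever its context."""
    return [embeddings[w] for w in sentence.lower().split() if w in embeddings]

# Related words point in similar directions; unrelated words do not.
print("king vs queen:", round(cosine_similarity(embeddings["king"], embeddings["queen"]), 2))
print("king vs apple:", round(cosine_similarity(embeddings["king"], embeddings["apple"]), 2))

# The context problem: "bank" gets the identical vector in both phrases.
financial = embed("bank of america")
river = embed("river bank")
print("Same vector for both senses of 'bank':", financial[0] == river[0])
```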

Example 2: Using Pre-Trained GloVe Embeddings

# Requires the gensim package; the GloVe vectors are downloaded on first use.
import gensim.downloader as api

# Load 50-dimensional GloVe vectors (the download can take a minute)
model = api.load("glove-wiki-gigaword-50")

# Find the word closest to "king" - "man" + "woman"
# This classic analogy typically yields "queen"
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print("King - Man + Woman =", result[0][0])

3. The Transformer Era (Modern NLP)

In 2017, researchers at Google published "Attention Is All You Need", introducing the Transformer architecture. Transformers have since largely replaced RNNs and LSTMs in most NLP tasks.

Key Concepts:
- Self-Attention Mechanism: Instead of processing words sequentially (left to right), transformers look at the entire sentence at once. For each word, the model computes a score against every other word in the sentence, normalizes these scores into weights, and uses the weights to decide how much "Attention" each word contributes when building context.
- BERT (Bidirectional Encoder Representations from Transformers): The model that changed NLP. It reads text bidirectionally, using context from both the left and the right of each word. You typically take a pre-trained BERT and "Fine-Tune" it on your specific task.
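
The attention weighting described above can be sketched in a few lines of plain Python. This is a simplified illustration, not a full Transformer layer: it assumes the queries, keys, and values are all the raw token vectors themselves, whereas real models apply learned projection matrices and multiple attention heads. The token vectors are hypothetical values.

```python
import math

def softmax(values):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this token against every token in the sentence, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The new representation is a weighted average of all token vectors.
        context = [
            sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional vectors for a three-token sentence.
tokens = ["river", "bank", "loan"]
vectors = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]

outputs, weights = self_attention(vectors)
for token, row in zip(tokens, weights):
    print(token, "attends with weights", [round(w, 2) for w in row])
```

Each row of weights sums to 1, and each output vector blends information from the whole sentence, which is how the same word can end up with different representations in different contexts.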

Example 3: Named Entity Recognition (NER) with Hugging Face Transformers

Modern NLP relies heavily on the Hugging Face transformers library, which gives access to large pre-trained models in a few lines of code.

from transformers import pipeline

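# Load a pre-trained NER pipeline.
# It automatically downloads a default model and tokenizer and configures the architecture.
# aggregation_strategy="simple" merges sub-word tokens into whole entities
# (it replaces the deprecated grouped_entities=True).
ner_pipeline = pipeline("ner", aggregation_strategy="simple")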

text = "Apple Inc. announced today that CEO Tim Cook will be visiting Paris, France next Monday to open a new store."

# Process the text
entities = ner_pipeline(text)

# Print each entity with its label and confidence score
print("Discovered Entities:")
for entity in entities:
    print(f"- {entity['word']} ({entity['entity_group']}): Confidence {entity['score']:.2f}")

# Expected entities (exact spans and scores depend on the model):
# Apple Inc -> ORG (organization)
# Tim Cook -> PER (person)
# Paris, France -> LOC (location); may be returned as two separate entities

Example 4: Fine-Tuning a Transformer for Text Classification

This is a common approach for more complex text classification tasks (e.g., sorting medical documents).

import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments

# 1. Load Tokenizer & Model
model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
# Asking the model to output 2 classes (Binary classification)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# 2. Setup Data (Assume 'texts' and 'labels' are Python lists)
texts = ["This product is fantastic.", "This broke within five minutes, terrible.", "Highly recommend it!"]
labels = [1, 0, 1]

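# Tokenize text. Without return_tensors, the tokenizer returns plain Python lists,
# which SimpleDataset below converts to tensors one example at a time.
encodings = tokenizer(texts, truncation=True, padding=True, max_length=128)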

# Convert to PyTorch Dataset (Simplistic wrapper dataset for example purposes)
class SimpleDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

dataset = SimpleDataset(encodings, labels)

# 3. Training Arguments (Configuring hyperparameters)
training_args = TrainingArguments(
    output_dir='./results',          # Checkpoints saved here
    num_train_epochs=3,              # Train for 3 epochs
    per_device_train_batch_size=8,   # Batch size (adjust based on GPU memory)
    weight_decay=0.01,               # Regularization
    logging_dir='./logs',
)

# 4. Initialize Trainer and Start
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

# trainer.train() # Uncomment to run the fine-tuning loop!