The Transformer Architecture

Before 2017, NLP was dominated by Recurrent Neural Networks (RNNs), including LSTMs. These models process text sequentially (word-by-word). That bottlenecks training speed (the model cannot process word #5 until it has finished word #4), and it makes long-range context in lengthy paragraphs hard to retain.

In 2017, researchers at Google published the paper "Attention Is All You Need", introducing the Transformer. The architecture became the foundation of modern Large Language Models (LLMs).

1. The Core Innovation: Self-Attention

Transformers do not read a sequence one word at a time. They process all positions of the input in parallel.

To understand context, they use a mechanism called Self-Attention. For every word in a sentence, the model calculates how much "attention" (a numerical weight) it should pay to every other word in that sentence.

Example Sentence: "The bank of the river." vs "I deposited money in the bank." When processing the word "bank", the attention mechanism also looks at "river" or "deposited" and adjusts the numerical representation of "bank" based on its surroundings.
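The "bank" example can be sketched in a few lines of plain Python. This is a minimal illustration, not a full Transformer layer: it assumes scaled dot-product attention as described in the original paper, skips the learned query, key and value projections, and uses made-up 2-dimensional embeddings (`EMBEDDINGS` and `contextual_bank` are hypothetical names introduced here).

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = vectors.

    Assumption: a real Transformer first multiplies each vector by learned
    query, key and value matrices; this sketch skips that step.
    """
    dim = len(vectors[0])
    weights, outputs = [], []
    for query in vectors:
        # Similarity of this word to every word, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        row = softmax(scores)
        weights.append(row)
        # New representation: weighted average of all word vectors.
        outputs.append([sum(w * v[i] for w, v in zip(row, vectors))
                        for i in range(dim)])
    return weights, outputs

# Hypothetical embeddings; the numbers are made up for illustration.
EMBEDDINGS = {"river": [1.0, 0.0], "deposited": [0.0, 1.0], "bank": [0.5, 0.5]}

def contextual_bank(sentence):
    """Return the attention output for "bank" inside the given sentence."""
    vectors = [EMBEDDINGS[word] for word in sentence]
    _, outputs = self_attention(vectors)
    return outputs[sentence.index("bank")]

bank_near_river = contextual_bank(["river", "bank"])
bank_near_money = contextual_bank(["deposited", "bank"])
print("bank near river:    ", [round(x, 3) for x in bank_near_river])
print("bank near deposited:", [round(x, 3) for x in bank_near_money])
```

The same input vector for "bank" comes out different in the two sentences: its output is pulled toward "river" in one and toward "deposited" in the other, which is the context-dependent adjustment described above.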

2. Encoder-Decoder Structure

The original Transformer was designed for language translation (for example, English to French).
- The Encoder: Reads the whole English sentence at once, calculates all the attention weights, and produces a context-aware matrix of numbers (one vector per word).
- The Decoder: Takes that context matrix and generates the French translation word-by-word, continually paying "attention" back to the Encoder's matrix so that each new word stays faithful to the source sentence.
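The Decoder's look-back at the Encoder's matrix can be sketched as follows. This is a simplified illustration under stated assumptions: one decoder position, scaled dot-product scoring, no learned projections, and hypothetical hand-picked vectors (`encoder_outputs`, `decoder_state` and `cross_attention` are names invented for this sketch).

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_state, encoder_outputs):
    """One decoder position attends over every encoder output vector."""
    dim = len(decoder_state)
    # Score each source word against the current decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) / math.sqrt(dim)
              for enc in encoder_outputs]
    weights = softmax(scores)
    # Context vector: weighted average of the encoder outputs.
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_outputs))
               for i in range(dim)]
    return weights, context

# Hypothetical encoder outputs for a three-word source sentence.
source = ["the", "black", "cat"]
encoder_outputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical decoder state while it is producing the translation of "cat".
decoder_state = [0.1, 0.2, 3.0]

weights, context = cross_attention(decoder_state, encoder_outputs)
focus = source[weights.index(max(weights))]
print("decoder focuses on:", focus)
print("weights:", [round(w, 3) for w in weights])
```

In a real model this step is repeated for every generated word, so the Decoder can shift its focus to a different part of the source sentence at each step.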

3. The Great Divergence (BERT vs GPT)

After the original paper, the AI industry split the Transformer in half:
- Encoder-Only Models (BERT by Google): These drop the Decoder. BERT is strong at understanding text and is used for search engines, sentiment analysis, and document classification. It reads the whole sentence at once, so every word sees context on both its left and its right.
- Decoder-Only Models (GPT by OpenAI): These drop the Encoder. GPT is strong at generating text. It is an auto-regressive model: it looks at all previous words, assigns a probability to every possible next word, and emits one of them (in the simplest case the most probable) before repeating the process.
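The difference between the two families comes down to which positions each word may attend to, plus how a decoder-only model picks its next word. The sketch below illustrates both ideas with hypothetical helper names (`attention_mask`, `next_token_probs`) and made-up scores; greedy selection is only one of several decoding strategies used in practice.

```python
import math

def attention_mask(length, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(length)]
            for i in range(length)]

def show(mask):
    """Render a mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join(" ".join("x" if ok else "." for ok in row) for row in mask)

encoder_style = attention_mask(4, causal=False)  # BERT-like: both directions
decoder_style = attention_mask(4, causal=True)   # GPT-like: previous words only

print("Encoder-only (bidirectional):")
print(show(encoder_style))
print("Decoder-only (causal):")
print(show(decoder_style))

def next_token_probs(logits):
    """Softmax over a dict of raw scores, one score per candidate word."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores a decoder-only model might assign to three candidates.
logits = {"bank": 2.0, "river": 0.5, "money": 1.0}
probs = next_token_probs(logits)
greedy = max(probs, key=probs.get)  # greedy decoding: take the top candidate
print("next word (greedy):", greedy)
```

In the causal mask, row i only has visible entries up to column i, which is what lets a decoder-only model be trained to predict each word from the words before it.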

How to execute the examples:

Go to the Examples/ folder and run the script:

    python Transformer_HuggingFace.py