Word Embeddings
While TF-IDF is effective for basic document classification, it fails to capture semantic meaning. TF-IDF treats every word as its own independent dimension, so to a TF-IDF algorithm the words "King" and "Queen" have no mathematical relationship at all.
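The limitation can be illustrated with a minimal sketch. The four-word vocabulary and the weights below are made up for illustration; they stand in for whatever TF-IDF scores a real corpus would produce. Because each word occupies its own axis, the cosine similarity between two different words is zero regardless of the weights.

```python
import math

# Hypothetical toy vocabulary: each word owns exactly one axis.
vocab = ["king", "queen", "dog", "puppy"]

def term_vector(word, weight=1.0):
    """Vector for a single term: its weight on its own axis, 0 everywhere else."""
    return [weight if w == word else 0.0 for w in vocab]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# The weights are arbitrary stand-ins for TF-IDF scores (assumed values).
king = term_vector("king", weight=2.3)
queen = term_vector("queen", weight=1.7)

king_queen_similarity = cosine(king, queen)
print(king_queen_similarity)  # 0.0: the two axes never overlap
```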
1. Dense Vectors (Word2Vec & GloVe)
Instead of assigning a single number to a word based on how often it appears, word embeddings represent each word as a dense $N$-dimensional vector (commonly between 50 and 300 dimensions).
- The Geometry of Meaning: If you train a neural network to predict missing words in millions of Wikipedia articles, the network eventually learns that words appearing in similar contexts (e.g., "I drive a ___" $\rightarrow$ Car, Truck, Vehicle) should receive similar vector representations.
- The Result: Words such as "Dog" and "Puppy" end up close to each other in the embedding space.
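"Close to each other" is usually measured with cosine similarity. The sketch below uses hand-picked 4-dimensional vectors purely for illustration; they are assumptions, not values from a trained Word2Vec or GloVe model.

```python
import math

# Hypothetical hand-picked vectors; a trained model would supply real ones.
embeddings = {
    "dog":   [0.90, 0.80, 0.10, 0.00],
    "puppy": [0.85, 0.75, 0.20, 0.05],
    "car":   [0.10, 0.00, 0.90, 0.80],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_dog_puppy = cosine(embeddings["dog"], embeddings["puppy"])
sim_dog_car = cosine(embeddings["dog"], embeddings["car"])

print(f"dog vs puppy: {sim_dog_puppy:.3f}")  # close to 1
print(f"dog vs car:   {sim_dog_car:.3f}")    # much lower
```

With these toy values, "dog" and "puppy" point in almost the same direction, while "dog" and "car" do not.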
2. Vector Arithmetic: King - Man + Woman = Queen
Because meaning is mapped to spatial coordinates, you can perform arithmetic on word vectors.
If you take the vector for [King], subtract the vector for [Man], and add the vector for [Woman], the resulting vector lands very close to [Queen]. In practice, [Queen] is typically the nearest word vector to the result once the three input words are excluded from the search.
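The analogy can be sketched with toy vectors. The 3-dimensional values below are invented so that the axes loosely mean (royalty, male, female); real embeddings are learned and their axes are not interpretable this way. The helper name `analogy` is hypothetical.

```python
import math

# Hypothetical vectors; axes loosely mean (royalty, male, female).
vectors = {
    "king":   [1.0, 1.0, 0.0],
    "queen":  [1.0, 0.0, 1.0],
    "man":    [0.0, 1.0, 0.0],
    "woman":  [0.0, 0.0, 1.0],
    "prince": [0.8, 1.0, 0.0],
    "apple":  [0.1, 0.1, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

nearest = analogy("king", "man", "woman")
print(nearest)  # queen
```

Excluding the input words from the candidates matters: without that step, the nearest vector to the result is often one of the inputs themselves.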
These dense vectors are commonly used as the input layer for deep learning NLP models such as LSTMs.
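As a rough sketch of what "input layer" means here: each token in a sentence is replaced by its vector, and the resulting sequence of vectors is what a model such as an LSTM would consume. The table, its values, and the `<unk>` fallback convention are assumptions made for illustration.

```python
# Hypothetical 3-dimensional pretrained vectors; real ones are much wider.
embedding_table = {
    "<unk>": [0.0, 0.0, 0.0],  # fallback for words missing from the table
    "the":   [0.1, 0.3, 0.2],
    "dog":   [0.9, 0.8, 0.1],
    "barks": [0.4, 0.7, 0.6],
}

def embed(sentence):
    """Lowercase and split the sentence, then look up one vector per token."""
    tokens = sentence.lower().split()
    return [embedding_table.get(tok, embedding_table["<unk>"]) for tok in tokens]

# One vector per token: this sequence is the input a recurrent model would read.
sequence = embed("The dog barks loudly")
print(len(sequence), len(sequence[0]))  # 4 tokens, 3 dimensions each
```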
How to execute the examples:
Go to the Examples/ folder and run the script:
python NLP_Word2Vec_GloVe.py