
Text Preprocessing

Machine learning models do not operate on raw text. A neural network works strictly on arrays of numbers, so before any NLP task (sentiment analysis, translation), the raw text must be converted into a numerical representation.

1. Tokenization & Cleaning

The first step is breaking a long text into smaller pieces (tokens) and normalizing them.

- Word Tokenization: Splitting "The cat sat" into ["The", "cat", "sat"].
- Sub-word Tokenization: Modern LLMs use this (e.g., Byte-Pair Encoding). A word such as "unhappiness" might be split into pieces like ["un", "happi", "ness"]; the actual splits are learned from the training corpus. This lets the model represent words it has never seen as combinations of known pieces, which often line up with prefixes, roots, and suffixes.
- Stop Words: Removing extremely common, low-information words like "the", "a", "is" to shrink the vocabulary and reduce noise.
- Stemming/Lemmatization: Reducing variations of a word to a common base form (e.g., running, runs $\rightarrow$ run). Stemming strips suffixes with simple rules, while lemmatization uses a dictionary and grammar to also handle irregular forms such as ran $\rightarrow$ run. This prevents the model from treating the variations as entirely different concepts.
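The cleaning steps above can be sketched in plain Python. This is a minimal illustration, not the contents of the script in Examples/: the stop-word list, the suffix rules, and the sub-word vocabulary are all hand-picked assumptions, and the sub-word splitter uses greedy longest-match over a fixed vocabulary rather than a trained Byte-Pair Encoding model.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "is", "on"}

# Hypothetical hand-picked sub-word vocabulary; a real one is learned from data.
SUBWORD_VOCAB = {"un", "happi", "ness"}


def word_tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip a few common suffixes. A toy rule set, not a real stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def greedy_subword_split(word, vocab):
    """Greedy longest-match split; unmatched characters become single pieces."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here, so emit one character.
            pieces.append(word[i])
            i += 1
    return pieces


tokens = word_tokenize("The cat is sitting on the mats")
filtered = remove_stop_words(tokens)
stems = [naive_stem(t) for t in filtered]
subwords = greedy_subword_split("unhappiness", SUBWORD_VOCAB)

print(tokens)    # ['the', 'cat', 'is', 'sitting', 'on', 'the', 'mats']
print(filtered)  # ['cat', 'sitting', 'mats']
print(stems)     # ['cat', 'sitt', 'mat']
print(subwords)  # ['un', 'happi', 'ness']
```

Note that the toy stemmer turns "sitting" into "sitt", which is not a real word. Rule-based stemmers behave this way in general; lemmatization is the option to reach for when the output must be a dictionary form.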

2. Bag of Words (BoW)

A simple numerical representation. It creates a matrix where every row is a document and every column is a word in the vocabulary (the set of unique words across all documents, not the entire English dictionary). In the binary version, a cell holds 1 if the word occurs in the document and 0 if it doesn't; a common variant stores the number of occurrences instead.

- Flaw: It creates large sparse matrices (mostly zeros) and discards word order entirely. "The dog chased the cat" and "The cat chased the dog" have exactly the same BoW representation!
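To make the word-order flaw concrete, here is a small sketch of a binary Bag of Words built with the standard library only. The helper names `build_vocabulary` and `bag_of_words` and the three sample documents are illustrative assumptions, not part of any library.

```python
def build_vocabulary(documents):
    """Sorted list of the unique lowercase words across all documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def bag_of_words(document, vocabulary):
    """Binary vector: 1 if the vocabulary word occurs in the document, else 0."""
    words = set(document.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


docs = ["The dog chased the cat", "The cat chased the dog", "A bird sang"]
vocab = build_vocabulary(docs)
matrix = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)  # ['a', 'bird', 'cat', 'chased', 'dog', 'sang', 'the']
for row in matrix:
    print(row)

# The first two sentences mean different things but get identical vectors.
print(matrix[0] == matrix[1])  # True
```

Even in this toy corpus the third row is mostly disjoint from the first two. With a realistic vocabulary of tens of thousands of words, nearly every cell would be zero.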

3. TF-IDF (Term Frequency - Inverse Document Frequency)

An improvement over BoW. Instead of only recording whether a word occurs (0 or 1), TF-IDF assigns each word a weight that is the product of two factors.

- Term Frequency (TF): How often does this word appear in this specific document?
- Inverse Document Frequency (IDF): In how many documents of the collection does this word appear? (If it appears in almost every document, it's a common word like "and", so its weight is pushed toward zero.)

TF-IDF allows standard machine learning algorithms (like Naive Bayes or Logistic Regression) to perform text classification effectively without massive computing power.
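The weighting can be sketched from scratch as follows. This assumes one common textbook definition, TF = (count of the term in the document) / (document length) and IDF = ln(N / number of documents containing the term); libraries such as scikit-learn use smoothed variants by default, so their numbers will differ. The function name `tf_idf` and the sample documents are illustrative.

```python
import math


def tf_idf(documents):
    """Return one {term: weight} dict per document using unsmoothed TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = {}
    for tokens in tokenized:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1

    weights = []
    for tokens in tokenized:
        doc_weights = {}
        for term in set(tokens):
            tf = tokens.count(term) / len(tokens)  # share of this document
            idf = math.log(n_docs / df[term])      # 0 when term is everywhere
            doc_weights[term] = tf * idf
        weights.append(doc_weights)
    return weights


docs = ["the cat sat", "the dog barked", "the cat and the dog played"]
weights = tf_idf(docs)

# Print the terms of the first document from highest to lowest weight.
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "the" occurs in every document and so gets a weight of exactly zero, "cat" occurs in two of three documents and gets a small weight, and "sat" occurs only here and gets the largest weight.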

How to execute the examples:

Go to the Examples/ folder and run the script: python NLP_Tokenizer_TFIDF.py