NLTK: Part-of-Speech Tagging and Named Entity Recognition
Beyond basic tokenization, NLTK provides powerful tools to analyze the grammatical structure and identify semantic entities within text. Part-of-Speech (POS) Tagging and Named Entity Recognition (NER) are two such fundamental tasks that enrich text understanding.
1. Part-of-Speech (POS) Tagging
POS tagging is the process of assigning a "part of speech" tag (e.g., noun, verb, adjective, adverb) to each word in a given text. This helps in understanding the grammatical role of words in a sentence.
Common tags include:
* NN: Noun, singular or mass
* NNS: Noun, plural
* VB: Verb, base form
* VBG: Verb, gerund or present participle
* JJ: Adjective
* RB: Adverb
* DT: Determiner
* PRP: Personal pronoun
Many more fine-grained tags exist. NLTK's default English tagger uses the Penn Treebank tagset.
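Because the Penn Treebank noun tags all start with NN and the main verb tags with VB, a fine-grained tag can be collapsed into a coarse category by its prefix. The sketch below is illustrative only: the `COARSE_PREFIXES` table and the `coarse_tag` helper are hypothetical names, not part of NLTK, and the table covers only the tag families listed above.

```python
# Hypothetical helper (not part of NLTK): collapse Penn Treebank tags
# into coarse categories by matching on the tag's prefix.
COARSE_PREFIXES = [
    ("NN", "noun"),        # NN, NNS, NNP, NNPS
    ("VB", "verb"),        # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "adjective"),   # JJ, JJR, JJS
    ("RB", "adverb"),      # RB, RBR, RBS
    ("DT", "determiner"),
    ("PRP", "pronoun"),    # PRP, PRP$
]


def coarse_tag(tag):
    """Return a coarse category for a Penn Treebank tag, or 'other'."""
    for prefix, category in COARSE_PREFIXES:
        if tag.startswith(prefix):
            return category
    return "other"


for tag in ["NN", "NNS", "VBG", "JJ", "IN"]:
    print(tag, "->", coarse_tag(tag))
```

Tags outside the table, such as IN (preposition), fall through to "other".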
import nltk
from nltk.tokenize import word_tokenize
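# Download necessary NLTK data if not already present.
# Note: recent NLTK releases (3.9 and later) may instead ask for
# 'punkt_tab' and 'averaged_perceptron_tagger_eng'; if a LookupError
# names one of those, download it the same way.
try:
    nltk.data.find('taggers/averaged_perceptron_tagger')
except LookupError:
    nltk.download('averaged_perceptron_tagger')
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')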
text = "NLTK is a powerful natural language processing library in Python."
words = word_tokenize(text)
# Perform POS tagging
pos_tags = nltk.pos_tag(words)
print("POS Tags:", pos_tags)
# Example with another sentence
text_2 = "The quick brown fox jumps over the lazy dog."
words_2 = word_tokenize(text_2)
pos_tags_2 = nltk.pos_tag(words_2)
print("\nPOS Tags for 'The quick brown fox jumps over the lazy dog.':\n", pos_tags_2)
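Since `nltk.pos_tag` returns a list of `(word, tag)` tuples, downstream processing is ordinary list handling. The sketch below counts the tags and pulls out the nouns. The `tagged` list is hand-written for illustration so that the snippet runs without any NLTK data; actual tagger output for this sentence may differ.

```python
from collections import Counter

# Hand-written (word, tag) pairs in the same shape nltk.pos_tag returns.
# These tags are illustrative; a real tagger may label some words differently.
tagged = [
    ("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
    ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
    ("dog", "NN"), (".", "."),
]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common(3))

# Keep only the words whose tag starts with NN (any noun tag).
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)
```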
2. Named Entity Recognition (NER)
NER is the task of identifying named entities in text and classifying them into predefined categories such as person names, organizations, locations, time expressions, quantities, monetary values, and percentages. NLTK's NER capability builds upon POS tagging and chunking.
a. Chunking
Before full NER, it helps to look at chunking (also called shallow parsing), which groups POS-tagged words into syntactic chunks such as noun phrases and verb phrases. Chunking is what allows a multi-word entity such as "New York City" to be treated as a single unit.
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
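# Download necessary NLTK data if not already present.
# The chunker model and word list are not needed for the regular-expression
# chunker below; they are used by nltk.ne_chunk in the next example.
# Note: recent NLTK releases (3.9 and later) may instead ask for suffixed
# resource names such as 'punkt_tab', 'averaged_perceptron_tagger_eng', or
# 'maxent_ne_chunker_tab'; if a LookupError names one, download it the same way.
try:
    nltk.data.find('taggers/averaged_perceptron_tagger')
except LookupError:
    nltk.download('averaged_perceptron_tagger')
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
try:
    nltk.data.find('chunkers/maxent_ne_chunker')
except LookupError:
    nltk.download('maxent_ne_chunker')
try:
    nltk.data.find('corpora/words')  # Often required for maxent_ne_chunker
except LookupError:
    nltk.download('words')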
text = "Barack Obama visited New York City to meet with Apple executives."
words = word_tokenize(text)
pos_tags = pos_tag(words)
print("POS Tags:\n", pos_tags)
# Define a cascaded chunk grammar: noun phrases (NP), prepositional phrases (PP),
# verb phrases (VP), and simple clauses (CLAUSE). The rules are applied in order.
grammar = r"""
  NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}               # Chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>*}  # Chunk verbs and their arguments
  CLAUSE: {<NP><VP>}           # Chunk NP, VP
"""
cp = nltk.RegexpParser(grammar)
result = cp.parse(pos_tags)
print("\nChunked Sentence:\n", result)
# result.draw()  # Uncomment to open the parse tree in a window (requires Tkinter and a display)
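To see what the `NP` rule above does, it can help to write the same idea by hand. The following is a simplified pure-Python sketch, not NLTK's implementation: the hypothetical `chunk_noun_phrases` function groups maximal runs of tokens tagged DT, JJ, or NN* and leaves every other token as it is. The input tags are hand-written assumptions rather than real tagger output.

```python
def is_np_tag(tag):
    """True for tags matched by the pattern <DT|JJ|NN.*>."""
    return tag in ("DT", "JJ") or tag.startswith("NN")


def chunk_noun_phrases(tagged):
    """Group maximal runs of DT/JJ/NN* tokens into ('NP', [tokens]) chunks.

    Tokens outside a run are passed through unchanged as (word, tag) tuples.
    """
    result = []
    current = []  # tokens of the noun phrase being built
    for word, tag in tagged:
        if is_np_tag(tag):
            current.append((word, tag))
        else:
            if current:
                result.append(("NP", current))
                current = []
            result.append((word, tag))
    if current:  # flush a noun phrase that ends the sentence
        result.append(("NP", current))
    return result


# Hand-written tags for illustration; real tagger output may differ.
tagged = [
    ("Barack", "NNP"), ("Obama", "NNP"), ("visited", "VBD"),
    ("New", "NNP"), ("York", "NNP"), ("City", "NNP"), (".", "."),
]
for item in chunk_noun_phrases(tagged):
    print(item)
```

This covers only the single `NP` rule; the cascaded PP, VP, and CLAUSE rules build on chunks produced by earlier rules, which `nltk.RegexpParser` handles for you.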
b. Named Entity Recognition (using nltk.ne_chunk)
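nltk.ne_chunk takes POS-tagged tokens and uses a pre-trained classifier to group them into named-entity chunks. It returns a tree in which each entity is a subtree labeled with its type and all other tokens remain plain (word, tag) leaves.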
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
# (Assume necessary NLTK data for pos_tag and ne_chunk is downloaded from above)
text = "Google's CEO Sundar Pichai visited London last week. He met with officials from the UK government."
words = word_tokenize(text)
pos_tags = pos_tag(words)
# Perform Named Entity Recognition
# binary=True for simpler (is it a NE or not), False for specific categories (PERSON, ORGANIZATION, GPE, etc.)
named_entities = nltk.ne_chunk(pos_tags, binary=False)
print("Named Entities (Full Tree):\n", named_entities)
# named_entities.draw()  # Uncomment to open the parse tree in a window (requires Tkinter and a display)
# Extracting specific named entities
print("\nExtracted Named Entities:")
for chunk in named_entities:
    # Entity chunks are labeled subtrees; ordinary tokens are plain (word, tag) tuples
    if hasattr(chunk, 'label'):
        entity_name = ' '.join(c[0] for c in chunk)
        entity_type = chunk.label()
        print(f" - {entity_type}: {entity_name}")
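For downstream use it is often more convenient to collect entities into a data structure than to print them. The sketch below groups entity names by type. To keep it runnable without NLTK data, it uses a hypothetical `Chunk` stand-in that mimics the two features of `nltk.Tree` the loop above relies on (a `label()` method and iteration over `(word, tag)` leaves). The sample chunks and their labels are hand-written, not actual `ne_chunk` output.

```python
from collections import defaultdict


class Chunk(list):
    """Hypothetical stand-in for nltk.Tree: a list of leaves with a label()."""

    def __init__(self, label, leaves):
        super().__init__(leaves)
        self._label = label

    def label(self):
        return self._label


def group_entities(chunks):
    """Map each entity type to the list of entity names found in `chunks`."""
    grouped = defaultdict(list)
    for chunk in chunks:
        # Plain (word, tag) tuples have no label() and are skipped.
        if hasattr(chunk, "label"):
            name = " ".join(word for word, _tag in chunk)
            grouped[chunk.label()].append(name)
    return dict(grouped)


# Hand-written sample in the shape of a top-level ne_chunk result.
sample = [
    Chunk("PERSON", [("Sundar", "NNP"), ("Pichai", "NNP")]),
    ("visited", "VBD"),
    Chunk("GPE", [("London", "NNP")]),
    ("last", "JJ"),
    ("week", "NN"),
]
print(group_entities(sample))
```

Because the function relies only on `label()` and iteration, passing the tree returned by `nltk.ne_chunk` in place of `sample` should work the same way.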
3. Practical Applications
- Information Extraction: Automatically extracting structured information (e.g., person names, company names) from unstructured text.
- Question Answering Systems: Identifying key entities in questions to retrieve relevant answers.
- Chatbots: Understanding user intent and extracting relevant keywords.
- Search Engines: Indexing documents based on named entities.
- Sentiment Analysis: Identifying the entities toward which sentiment is expressed.
- Machine Translation: Ensuring proper translation of named entities.
Further Topics:
- Customizing POS taggers.
- More advanced chunking techniques.
- Training custom NER models (often with libraries like spaCy, Stanford NER, or deep learning frameworks).
- Using pre-trained NLP models.
POS tagging and NER are crucial steps for deep linguistic analysis and information extraction from text, laying the groundwork for more sophisticated NLP tasks.