Practical NLP Techniques: Text Processing & Similarity
Natural Language Processing Fundamentals with NLTK
This section demonstrates fundamental Natural Language Processing (NLP) techniques using the NLTK library in Python.
Setup and Imports
# !pip install nltk  # install NLTK first if it is not already available
import nltk
import re
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('punkt_tab')   # Punkt tokenizer models (used by sent_tokenize / word_tokenize)
nltk.download('wordnet')     # WordNet data (used by WordNetLemmatizer)
nltk.download('stopwords')   # stop word lists
Sample Text for Processing
text = "Natural Language Processing is an exciting field. It involves making computers understand human languages. We will explore tokenization, stemming, and lemmatization."
Original Text:
Natural Language Processing is an exciting field. It involves making computers understand human languages. We will explore tokenization, stemming, and lemmatization.
Text Preprocessing: Tokenization & Normalization
1. Tokenization
Tokenization is the process of breaking down text into smaller units, such as sentences or words.
Sentence Tokenization
sentences = sent_tokenize(text)
Sentences:
['Natural Language Processing is an exciting field.', 'It involves making computers understand human languages.', 'We will explore tokenization, stemming, and lemmatization.']
Word Tokenization
words = word_tokenize(text.lower()) # Convert to lowercase for consistency
Words:
['natural', 'language', 'processing', 'is', 'an', 'exciting', 'field', '.', 'it', 'involves', 'making', 'computers', 'understand', 'human', 'languages', '.', 'we', 'will', 'explore', 'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', '.']
2. Stemming
Stemming reduces words to their root or base form, often by removing suffixes. The resulting “stem” may not be a valid word.
porter = PorterStemmer()
stemmed_words = [porter.stem(word) for word in words]
Stemmed Words:
['natur', 'languag', 'process', 'is', 'an', 'excit', 'field', '.', 'it', 'involv', 'make', 'comput', 'understand', 'human', 'languag', '.', 'we', 'will', 'explor', 'token', ',', 'stem', ',', 'and', 'lemmat', '.']
3. Lemmatization
Lemmatization reduces words to their dictionary form (lemma), ensuring the result is a valid word. Supplying Part-of-Speech (POS) tags improves its accuracy.
lemmatizer = WordNetLemmatizer()
# Note: Lemmatization can be improved by providing Part-of-Speech tags
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
Lemmatized Words (without POS tags):
['natural', 'language', 'processing', 'is', 'an', 'exciting', 'field', '.', 'it', 'involves', 'making', 'computer', 'understand', 'human', 'language', '.', 'we', 'will', 'explore', 'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', '.']
Example with POS tag (simplified for demonstration):
lemmatized_words_pos = [lemmatizer.lemmatize(word, pos='v') for word in words] # Assuming all are verbs
# print("Lemmatized Words (assuming verbs):")
4. Stop Words Removal
Stop words are common words (e.g., ‘is’, ‘a’, ‘the’) that often carry little meaning and are removed to reduce noise in text analysis.
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.isalnum() and word not in stop_words] # also remove punctuation
Words after Stopword Removal:
['natural', 'language', 'processing', 'exciting', 'field', 'involves', 'making', 'computers', 'understand', 'human', 'languages', 'explore', 'tokenization', 'stemming', 'lemmatization']
5. Regular Expressions in NLP
Regular expressions (regex) are powerful tools for pattern matching and manipulation in text.
a. Search for a Pattern
pattern_search = r"\blang\w*"
found = re.findall(pattern_search, text, re.IGNORECASE)
Words matching '\blang\w*':
['Language', 'languages']
b. Split Text by a Delimiter
pattern_split = r"[.!?]"
sentences_re = [s.strip() for s in re.split(pattern_split, text) if s.strip()]  # strip spaces and drop empty strings
Sentences split by regex:
['Natural Language Processing is an exciting field', 'It involves making computers understand human languages', 'We will explore tokenization, stemming, and lemmatization']
c. Substitution
pattern_sub = r"\bNLP\b"
substituted_text = re.sub(pattern_sub, "Natural Language Processing", text)  # "NLP" does not occur in the sample text, so it is returned unchanged
Text after substitution:
Natural Language Processing is an exciting field. It involves making computers understand human languages. We will explore tokenization, stemming, and lemmatization.
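Since the pattern never matches here, a small variant on a made-up string (short_text is not part of the original examples) shows the replacement actually taking effect:
short_text = "NLP is fun, and NLP powers many applications."
print(re.sub(r"\bNLP\b", "Natural Language Processing", short_text))
# -> Natural Language Processing is fun, and Natural Language Processing powers many applications.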
Processing a Sample Paragraph
This section applies the learned preprocessing techniques to another sample paragraph.
paragraph = """Tokenization is the first step. Stemming reduces words to their root form, which may not be a valid word.
Lemmatization, on the other hand, reduces words to their dictionary form (lemma).
Stop words like 'is', 'a', 'the' are often removed."""
Processing Sample Paragraph:
words_p = word_tokenize(paragraph.lower())
print("Tokens:", words_p)
Tokens:
['tokenization', 'is', 'the', 'first', 'step', '.', 'stemming', 'reduces', 'words', 'to', 'their', 'root', 'form', ',', 'which', 'may', 'not', 'be', 'a', 'valid', 'word', '.', 'lemmatization', ',', 'on', 'the', 'other', 'hand', ',', 'reduces', 'words', 'to', 'their', 'dictionary', 'form', '(', 'lemma', ')', '.', 'stop', 'words', 'like', "'is", "'", ',', "'a", "'", ',', "'the", "'", 'are', 'often', 'removed', '.']
stemmed_p = [porter.stem(w) for w in words_p if w.isalnum()]
print("Stemmed:", stemmed_p)
Stemmed:
['token', 'is', 'the', 'first', 'step', 'stem', 'reduc', 'word', 'to', 'their', 'root', 'form', 'which', 'may', 'not', 'be', 'a', 'valid', 'word', 'lemmat', 'on', 'the', 'other', 'hand', 'reduc', 'word', 'to', 'their', 'dictionari', 'form', 'lemma', 'stop', 'word', 'like', 'are', 'often', 'remov']
lemmatized_p = [lemmatizer.lemmatize(w) for w in words_p if w.isalnum()]
print("Lemmatized:", lemmatized_p)
Lemmatized:
['tokenization', 'is', 'the', 'first', 'step', 'stemming', 'reduces', 'word', 'to', 'their', 'root', 'form', 'which', 'may', 'not', 'be', 'a', 'valid', 'word', 'lemmatization', 'on', 'the', 'other', 'hand', 'reduces', 'word', 'to', 'their', 'dictionary', 'form', 'lemma', 'stop', 'word', 'like', 'are', 'often', 'removed']
filtered_p = [w for w in words_p if w.isalnum() and w not in stop_words]
print("Stop words removed:", filtered_p)
Stop words removed:
['tokenization', 'first', 'step', 'stemming', 'reduces', 'words', 'root', 'form', 'may', 'valid', 'word', 'lemmatization', 'hand', 'reduces', 'words', 'dictionary', 'form', 'lemma', 'stop', 'words', 'like', 'often', 'removed']
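For convenience, these steps can be bundled into one helper; the sketch below (preprocess is a name introduced here, not from the original) reuses the tokenizer, lemmatizer, and stop word set defined above:
def preprocess(raw_text):
    """Lowercase, tokenize, drop punctuation and stop words, then lemmatize."""
    tokens = word_tokenize(raw_text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens if t.isalnum() and t not in stop_words]

print(preprocess(paragraph))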
Text Vectorization: Bag of Words and TF-IDF
This section covers methods to convert text into numerical representations, essential for machine learning algorithms.
Setup and Sample Documents
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Sample documents
doc1 = "The cat sat on the mat."
doc2 = "The dog played in the park."
doc3 = "The cat and the dog are friends."
corpus = [doc1, doc2, doc3]
Corpus:
['The cat sat on the mat.', 'The dog played in the park.', 'The cat and the dog are friends.']
1. Bag of Words (BoW)
Bag of Words represents text as a collection of word counts, disregarding grammar and word order.
vectorizer_bow = CountVectorizer()
bow_matrix = vectorizer_bow.fit_transform(corpus)
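The vocabulary and dense matrix shown below can be inspected with calls along these lines (a sketch; get_feature_names_out() requires scikit-learn >= 1.0, older releases expose get_feature_names()):
print("Feature Names (Vocabulary):", vectorizer_bow.get_feature_names_out().tolist())
print("BoW Matrix (dense):")
print(bow_matrix.toarray())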
Feature Names (Vocabulary):
['and', 'are', 'cat', 'dog', 'friends', 'in', 'mat', 'on', 'park', 'played', 'sat', 'the']
BoW Matrix (dense):
[[0 0 1 0 0 0 1 1 0 0 1 2]
[0 0 0 1 0 1 0 0 1 1 0 2]
[1 1 1 1 1 0 0 0 0 0 0 2]]
2. TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF is a numerical statistic that reflects how important a word is to a document in a corpus, accounting for both its frequency within a document and its rarity across the corpus.
vectorizer_tfidf = TfidfVectorizer()
tfidf_matrix = vectorizer_tfidf.fit_transform(corpus)
Feature Names (Vocabulary):
['and', 'are', 'cat', 'dog', 'friends', 'in', 'mat', 'on', 'park', 'played', 'sat', 'the']
TF-IDF Matrix (dense):
[[0. 0. 0.4754326 0. 0. 0.
0.5900744 0.5900744 0. 0. 0.5900744 0.28088749]
[0. 0. 0. 0.4754326 0. 0.5900744
0. 0. 0.5900744 0.5900744 0. 0.28088749]
[0.4754326 0.4754326 0.38299007 0.38299007 0.4754326 0.
0. 0. 0. 0. 0. 0.45600336]]
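By default scikit-learn's TfidfVectorizer applies a smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2-normalizes each document row. The per-term IDF weights learned from this corpus can be inspected as a quick check (illustrative code, not original output):
for term, idf in zip(vectorizer_tfidf.get_feature_names_out(), vectorizer_tfidf.idf_):
    print(f"{term}: {idf:.4f}")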
Document Similarity with Cosine Similarity
Cosine similarity measures the cosine of the angle between two non-zero vectors, indicating how similar two documents are regardless of their size.
Setup and Sample Documents
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Sample documents
doc1 = "Natural Language Processing is a field of artificial intelligence"
doc2 = "Artificial intelligence includes the field of Natural Language Processing"
Steps to Compute Cosine Similarity
- Vectorize the text using TF-IDF:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([doc1, doc2])
- Compute cosine similarity:
cos_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
Cosine Similarity between the two documents:
0.7999999999999999
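The same value can be computed directly from the definition cos(A, B) = (A · B) / (||A|| ||B||); a minimal check using the dense TF-IDF vectors (illustrative, not part of the original output):
import numpy as np

a = tfidf_matrix[0].toarray().ravel()  # TF-IDF vector of doc1
b = tfidf_matrix[1].toarray().ravel()  # TF-IDF vector of doc2
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # same value as cosine_similarity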
Probabilistic Context-Free Grammars (PCFG)
A Probabilistic Context-Free Grammar (PCFG) extends a Context-Free Grammar by attaching a probability to each production rule (the probabilities of all rules sharing a left-hand side sum to 1.0), which lets a parser rank competing parse trees and resolve ambiguity.
Setup and Grammar Definition
import nltk
from nltk import PCFG
from nltk.parse import ViterbiParser
# Define the PCFG grammar
grammar = PCFG.fromstring("""
S -> NP VP [1.0]
NP -> Det Noun [0.5] | ProperNoun [0.5]
VP -> Verb NP [1.0]
Det -> 'a' [0.5] | 'the' [0.5]
Noun -> 'cat' [0.5] | 'dog' [0.5]
ProperNoun -> 'Alice' [1.0]
Verb -> 'chased' [1.0]
""")
Parsing a Sentence with PCFG
# Create the parser from the grammar; ViterbiParser uses the rule probabilities to return the most probable parse
parser = ViterbiParser(grammar)
# Define the input sentence
sentence = ['Alice', 'chased', 'a', 'dog']
# Parse the sentence using the PCFG
for tree in parser.parse(sentence):
    print(tree)   # Print the parse tree in bracketed notation, with its probability
    tree.draw()   # Draw the parse tree using NLTK's GUI viewer
Parse Tree Output:
(S
(NP (ProperNoun Alice))
(VP (Verb chased) (NP (Det a) (Noun dog)))) (p=0.0625)
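The probability attached to the tree is the product of the probabilities of every rule used in the derivation: 1.0 (S -> NP VP) × 0.5 (NP -> ProperNoun) × 1.0 (ProperNoun -> 'Alice') × 1.0 (VP -> Verb NP) × 1.0 (Verb -> 'chased') × 0.5 (NP -> Det Noun) × 0.5 (Det -> 'a') × 0.5 (Noun -> 'dog') = 0.0625. If the grammar licensed a competing parse, it would receive its own probability and the parser would prefer the higher-scoring tree.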