Practical NLP Techniques: Text Processing & Similarity
Natural Language Processing Fundamentals with NLTK
This section demonstrates fundamental Natural Language Processing (NLP) techniques using the NLTK library in Python.
Setup and Imports
# !pip install nltk  # install NLTK first if it is not already available
import nltk
import re
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('punkt_tab')   # Punkt tokenizer models (used by sent_tokenize / word_tokenize)
nltk.download('wordnet')     # WordNet data (used by WordNetLemmatizer)
nltk.download('stopwords')   # stop word lists
Sample Text for Processing
text = "Natural Language Processing is an exciting field. It involves making computers understand human languages. We will explore tokenization, stemming, and lemmatization."
Original Text:
Natural Language Processing is an exciting field. It involves making computers understand human languages. We will explore tokenization, stemming, and lemmatization.
Text Preprocessing: Tokenization & Normalization
1. Tokenization
Tokenization is the process of breaking down text into smaller units, such as sentences or words.
Sentence Tokenization
sentences = sent_tokenize(text)
Sentences:
['Natural Language Processing is an exciting field.', 'It involves making computers understand human languages.', 'We will explore tokenization, stemming, and lemmatization.']
Word Tokenization
words = word_tokenize(text.lower()) # Convert to lowercase for consistency
Words:
['natural', 'language', 'processing', 'is', 'an', 'exciting', 'field', '.', 'it', 'involves', 'making', 'computers', 'understand', 'human', 'languages', '.', 'we', 'will', 'explore', 'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', '.']
2. Stemming
Stemming reduces words to their root or base form, often by removing suffixes. The resulting “stem” may not be a valid word.
porter = PorterStemmer()
stemmed_words = [porter.stem(word) for word in words]
Stemmed Words:
['natur', 'languag', 'process', 'is', 'an', 'excit', 'field', '.', 'it', 'involv', 'make', 'comput', 'understand', 'human', 'languag', '.', 'we', 'will', 'explor', 'token', ',', 'stem', ',', 'and', 'lemmat', '.']
3. Lemmatization
Lemmatization reduces words to their dictionary form (lemma), ensuring the result is a valid word. Supplying Part-of-Speech (POS) tags improves its accuracy.
lemmatizer = WordNetLemmatizer()
# Note: Lemmatization can be improved by providing Part-of-Speech tags
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
Lemmatized Words (without POS tags):
['natural', 'language', 'processing', 'is', 'an', 'exciting', 'field', '.', 'it', 'involves', 'making', 'computer', 'understand', 'human', 'language', '.', 'we', 'will', 'explore', 'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', '.']
Example with POS tag (simplified for demonstration):
lemmatized_words_pos = [lemmatizer.lemmatize(word, pos='v') for word in words] # Assuming all are verbs
# print("Lemmatized Words (assuming verbs):")
4. Stop Words Removal
Stop words are common words (e.g., ‘is’, ‘a’, ‘the’) that often carry little meaning and are removed to reduce noise in text analysis.
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.isalnum() and word not in stop_words] # also remove punctuation
Words after Stopword Removal:
['natural', 'language', 'processing', 'exciting', 'field', 'involves', 'making', 'computers', 'understand', 'human', 'languages', 'explore', 'tokenization', 'stemming', 'lemmatization']
5. Regular Expressions in NLP
Regular expressions (regex) are powerful tools for pattern matching and manipulation in text.
a. Search for a Pattern
pattern_search = r"\blang\w*"
found = re.findall(pattern_search, text, re.IGNORECASE)
Words matching '\blang\w*':
['Language', 'languages']
b. Split Text by a Delimiter
pattern_split = r"[.!?]"
sentences_re = [s.strip() for s in re.split(pattern_split, text) if s.strip()]  # strip spaces and drop empty strings
Sentences split by regex:
['Natural Language Processing is an exciting field', 'It involves making computers understand human languages', 'We will explore tokenization, stemming, and lemmatization']
c. Substitution
pattern_sub = r"\bNLP\b"
substituted_text = re.sub(pattern_sub, "Natural Language Processing", text)  # "NLP" does not occur in the sample text, so it is returned unchanged
Text after substitution:
Natural Language Processing is an exciting field. It involves making computers understand human languages. We will explore tokenization, stemming, and lemmatization.
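Since the pattern never matches here, a small variant on a made-up string (short_text is not part of the original examples) shows the replacement actually taking effect:
short_text = "NLP is fun, and NLP powers many applications."
print(re.sub(r"\bNLP\b", "Natural Language Processing", short_text))
# -> Natural Language Processing is fun, and Natural Language Processing powers many applications.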
Processing a Sample Paragraph
This section applies the learned preprocessing techniques to another sample paragraph.
paragraph = """Tokenization is the first step. Stemming reduces words to their root form, which may not be a valid word.
Lemmatization, on the other hand, reduces words to their dictionary form (lemma).
Stop words like 'is', 'a', 'the' are often removed."""
Processing Sample Paragraph:
words_p = word_tokenize(paragraph.lower())
print("Tokens:", words_p)
Tokens:
['tokenization', 'is', 'the', 'first', 'step', '.', 'stemming', 'reduces', 'words', 'to', 'their', 'root', 'form', ',', 'which', 'may', 'not', 'be', 'a', 'valid', 'word', '.', 'lemmatization', ',', 'on', 'the', 'other', 'hand', ',', 'reduces', 'words', 'to', 'their', 'dictionary', 'form', '(', 'lemma', ')', '.', 'stop', 'words', 'like', "'is", "'", ',', "'a", "'", ',', "'the", "'", 'are', 'often', 'removed', '.']
stemmed_p = [porter.stem(w) for w in words_p if w.isalnum()]
print("Stemmed:", stemmed_p)
Stemmed:
['token', 'is', 'the', 'first', 'step', 'stem', 'reduc', 'word', 'to', 'their', 'root', 'form', 'which', 'may', 'not', 'be', 'a', 'valid', 'word', 'lemmat', 'on', 'the', 'other', 'hand', 'reduc', 'word', 'to', 'their', 'dictionari', 'form', 'lemma', 'stop', 'word', 'like', 'are', 'often', 'remov']
lemmatized_p = [lemmatizer.lemmatize(w) for w in words_p if w.isalnum()]
print("Lemmatized:", lemmatized_p)
Lemmatized:
['tokenization', 'is', 'the', 'first', 'step', 'stemming', 'reduces', 'word', 'to', 'their', 'root', 'form', 'which', 'may', 'not', 'be', 'a', 'valid', 'word', 'lemmatization', 'on', 'the', 'other', 'hand', 'reduces', 'word', 'to', 'their', 'dictionary', 'form', 'lemma', 'stop', 'word', 'like', 'are', 'often', 'removed']
filtered_p = [w for w in words_p if w.isalnum() and w not in stop_words]
print("Stop words removed:", filtered_p)
Stop words removed:
['tokenization', 'first', 'step', 'stemming', 'reduces', 'words', 'root', 'form', 'may', 'valid', 'word', 'lemmatization', 'hand', 'reduces', 'words', 'dictionary', 'form', 'lemma', 'stop', 'words', 'like', 'often', 'removed']
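For convenience, these steps can be bundled into one helper; the sketch below (preprocess is a name introduced here, not from the original) reuses the tokenizer, lemmatizer, and stop word set defined above:
def preprocess(raw_text):
    """Lowercase, tokenize, drop punctuation and stop words, then lemmatize."""
    tokens = word_tokenize(raw_text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens if t.isalnum() and t not in stop_words]

print(preprocess(paragraph))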
Text Vectorization: Bag of Words and TF-IDF
This section covers methods to convert text into numerical representations, essential for machine learning algorithms.
Setup and Sample Documents
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Sample documents
doc1 = "The cat sat on the mat."
doc2 = "The dog played in the park."
doc3 = "The cat and the dog are friends."
corpus = [doc1, doc2, doc3]
Corpus:
['The cat sat on the mat.', 'The dog played in the park.', 'The cat and the dog are friends.']
1. Bag of Words (BoW)
Bag of Words represents text as a collection of word counts, disregarding grammar and word order.
vectorizer_bow = CountVectorizer()
bow_matrix = vectorizer_bow.fit_transform(corpus)
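The vocabulary and dense matrix shown below can be inspected with calls along these lines (a sketch; get_feature_names_out() requires scikit-learn >= 1.0, older releases expose get_feature_names()):
print("Feature Names (Vocabulary):", vectorizer_bow.get_feature_names_out().tolist())
print("BoW Matrix (dense):")
print(bow_matrix.toarray())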
Feature Names (Vocabulary):
['and', 'are', 'cat', 'dog', 'friends', 'in', 'mat', 'on', 'park', 'played', 'sat', 'the']
BoW Matrix (dense):
[[0 0 1 0 0 0 1 1 0 0 1 2]
[0 0 0 1 0 1 0 0 1 1 0 2]
[1 1 1 1 1 0 0 0 0 0 0 2]]
2. TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF is a numerical statistic that reflects how important a word is to a document in a corpus, accounting for both its frequency within a document and its rarity across the corpus.
vectorizer_tfidf = TfidfVectorizer()
tfidf_matrix = vectorizer_tfidf.fit_transform(corpus)
Feature Names (Vocabulary):
['and', 'are', 'cat', 'dog', 'friends', 'in', 'mat', 'on', 'park', 'played', 'sat', 'the']
TF-IDF Matrix (dense):
[[0. 0. 0.4754326 0. 0. 0.
0.5900744 0.5900744 0. 0. 0.5900744 0.28088749]
[0. 0. 0. 0.4754326 0. 0.5900744
0. 0. 0.5900744 0.5900744 0. 0.28088749]
[0.4754326 0.4754326 0.38299007 0.38299007 0.4754326 0.
0. 0. 0. 0. 0. 0.45600336]]
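By default scikit-learn's TfidfVectorizer applies a smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2-normalizes each document row. The per-term IDF weights learned from this corpus can be inspected as a quick check (illustrative code, not original output):
for term, idf in zip(vectorizer_tfidf.get_feature_names_out(), vectorizer_tfidf.idf_):
    print(f"{term}: {idf:.4f}")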
Document Similarity with Cosine Similarity
Cosine similarity measures the cosine of the angle between two non-zero vectors, indicating how similar two documents are regardless of their size.
Setup and Sample Documents
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Sample documents
doc1 = "Natural Language Processing is a field of artificial intelligence"
doc2 = "Artificial intelligence includes the field of Natural Language Processing"
Steps to Compute Cosine Similarity
- Vectorize the text using TF-IDF:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([doc1, doc2])
- Compute cosine similarity:
cos_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
Cosine Similarity between the two documents:
0.7999999999999999
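The same value can be computed directly from the definition cos(A, B) = (A · B) / (||A|| ||B||); a minimal check using the dense TF-IDF vectors (illustrative, not part of the original output):
import numpy as np

a = tfidf_matrix[0].toarray().ravel()  # TF-IDF vector of doc1
b = tfidf_matrix[1].toarray().ravel()  # TF-IDF vector of doc2
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # same value as cosine_similarity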
Probabilistic Context-Free Grammars (PCFG)
A Probabilistic Context-Free Grammar (PCFG) extends a Context-Free Grammar by attaching a probability to each production rule (the probabilities of all rules sharing a left-hand side sum to 1.0), which lets a parser rank competing parse trees and resolve ambiguity.
Setup and Grammar Definition
import nltk
from nltk import PCFG
from nltk.parse import ViterbiParser
# Define the PCFG grammar
grammar = PCFG.fromstring("""
S -> NP VP [1.0]
NP -> Det Noun [0.5] | ProperNoun [0.5]
VP -> Verb NP [1.0]
Det -> 'a' [0.5] | 'the' [0.5]
Noun -> 'cat' [0.5] | 'dog' [0.5]
ProperNoun -> 'Alice' [1.0]
Verb -> 'chased' [1.0]
""")
Parsing a Sentence with PCFG
# Create the parser from the grammar; ViterbiParser uses the rule probabilities to return the most probable parse
parser = ViterbiParser(grammar)
# Define the input sentence
sentence = ['Alice', 'chased', 'a', 'dog']
# Parse the sentence using the PCFG
for tree in parser.parse(sentence):
    print(tree)   # Print the parse tree in bracketed notation, with its probability
    tree.draw()   # Draw the parse tree using NLTK's GUI viewer
Parse Tree Output:
(S
(NP (ProperNoun Alice))
(VP (Verb chased) (NP (Det a) (Noun dog)))) (p=0.0625)
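The probability attached to the tree is the product of the probabilities of every rule used in the derivation: 1.0 (S -> NP VP) × 0.5 (NP -> ProperNoun) × 1.0 (ProperNoun -> 'Alice') × 1.0 (VP -> Verb NP) × 1.0 (Verb -> 'chased') × 0.5 (NP -> Det Noun) × 0.5 (Det -> 'a') × 0.5 (Noun -> 'dog') = 0.0625. If the grammar licensed a competing parse, it would receive its own probability and the parser would prefer the higher-scoring tree.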