NLP with Python: Morphological Analysis and N-Grams
NLP Tasks with NLTK
Aim
Write a Python program to:
- Perform morphological analysis using the NLTK library.
- Generate n-grams using NLTK's ngrams utility function.
- Implement n-gram smoothing.
Description
- Morphological Analysis: This involves analyzing the internal structure of words (roots, affixes, inflections) to recover their base forms and grammatical properties. NLTK provides tools like stemming and lemmatization for this purpose.
- N-Grams Generation: N-grams are contiguous sequences of n items from a text. NLTK provides functions to generate these from tokenized lists.
- N-Grams Smoothing: This technique addresses the sparsity problem in language models by assigning non-zero probabilities to unseen n-grams. We implement add-one (Laplace) smoothing.
Python Implementation
import nltk
from nltk.util import ngrams
from nltk.lm import Laplace
from nltk.tokenize import word_tokenize

# Run once if the required corpora are missing:
# nltk.download('punkt')
# nltk.download('wordnet')

def morphological_analysis(word):
    """Performs morphological analysis using NLTK's WordNet lemmatizer."""
    lemmatizer = nltk.WordNetLemmatizer()
    return lemmatizer.lemmatize(word)

def generate_ngrams(text, n):
    """Generates n-grams from the given text."""
    tokens = word_tokenize(text)
    return list(ngrams(tokens, n))

def ngram_smoothing(ngrams_list):
    """Trains a Laplace (add-one) smoothed model on a list of n-grams."""
    # The model's vocabulary must be built from individual tokens,
    # not from the n-gram tuples themselves.
    tokens = [token for gram in ngrams_list for token in gram]
    laplace = Laplace(order=len(ngrams_list[0]))
    laplace.fit([ngrams_list], vocabulary_text=tokens)
    return laplace

def main():
    word = "running"
    print(f"Morphological analysis of '{word}': {morphological_analysis(word)}")

    text = "The quick brown fox jumps over the lazy dog"
    print(f"\nOriginal text: {text}")
    n = 3
    trigrams = generate_ngrams(text, n)
    print(f"Generated {n}-grams: {trigrams}")

    laplace_model = ngram_smoothing(trigrams)
    print("\nN-gram probabilities after Laplace smoothing:")
    print(f"P('brown' | 'The quick') = {laplace_model.score('brown', ('The', 'quick')):.4f}")
    print(f"P('dog' | 'The quick') = {laplace_model.score('dog', ('The', 'quick')):.4f}")

if __name__ == "__main__":
    main()
Output
Morphological analysis of 'running': running

Original text: The quick brown fox jumps over the lazy dog
Generated 3-grams: [('The', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps'), ('fox', 'jumps', 'over'), ('jumps', 'over', 'the'), ('over', 'the', 'lazy'), ('the', 'lazy', 'dog')]

N-gram probabilities after Laplace smoothing:
P('brown' | 'The quick') = 0.1818
P('dog' | 'The quick') = 0.0909
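To make the smoothing arithmetic concrete, the add-one estimate P(w | context) = (count(context, w) + 1) / (count(context) + V) can be reproduced by hand with only the standard library (a minimal sketch; note that NLTK's vocabulary size V includes a slot for the unknown-word symbol, so the nine distinct tokens of the example sentence give V = 10):

```python
from collections import Counter

tokens = "The quick brown fox jumps over the lazy dog".split()

# Count trigrams and their two-word contexts.
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
trigram_counts = Counter(trigrams)
context_counts = Counter(g[:2] for g in trigrams)

# NLTK's vocabulary reserves a slot for the unknown-word symbol <UNK>,
# so V = distinct tokens + 1 (here 9 + 1 = 10).
V = len(set(tokens)) + 1

def laplace_prob(context, word):
    """Add-one estimate: (count(context, word) + 1) / (count(context) + V)."""
    return (trigram_counts[context + (word,)] + 1) / (context_counts[context] + V)

print(laplace_prob(("The", "quick"), "brown"))  # seen trigram:   2/11 ~ 0.1818
print(laplace_prob(("The", "quick"), "dog"))    # unseen trigram: 1/11 ~ 0.0909
```

Every unseen trigram with a seen context receives the same small non-zero probability, which is exactly how add-one smoothing removes the zeros that would otherwise make the language model assign zero probability to new text.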
