N-Gram Models and Statistical Language Modeling
N-Gram Models: Unigram, Bigram, and Trigram
An N-gram model is a statistical language model that predicts the next word from the previous N-1 words.
Types of N-Grams
Unigram (N=1)
Considers each word independently.
Example sentence: I love NLP
Unigrams:
- I
- love
- NLP
Probability:
$$P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i)$$
Bigram (N=2)
Considers one previous word.
Bigrams:
- I love
- love NLP
Probability:
$$P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$
Trigram (N=3)
Considers two previous words.
Trigrams:
- I love NLP
Probability:
$$P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})$$
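The unigram, bigram, and trigram lists above can be reproduced with a few lines of Python. This is a minimal sketch that assumes whitespace tokenization; the helper name `ngrams` is chosen for illustration only.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace tokenization is an assumption; real systems use a proper tokenizer.
tokens = "I love NLP".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A three-word sentence yields three unigrams, two bigrams, and one trigram, matching the lists above.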
Applications of N-Grams
- Text prediction
- Spell checking
- Speech recognition
- Machine translation
- Chatbots
Statistical Language Models in NLP
A Statistical Language Model (SLM) assigns probabilities to sequences of words and predicts the likelihood of a sentence.
Probability Formula
$$P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})$$
Example:
- Sentence A: I like NLP
- Sentence B: NLP like I
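Sentence A gets a higher probability because its word order matches patterns that occur frequently in the training text. The model has no explicit notion of grammar; grammatical sentences score higher only because they resemble what the model has seen.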
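The comparison between Sentence A and Sentence B can be sketched with a tiny bigram model. The three-sentence corpus, the `<s>` and `</s>` boundary markers, and the function name `sentence_probability` are all assumptions made for illustration; the estimates are unsmoothed, so any unseen word pair contributes a probability of 0.

```python
from collections import Counter

# Hypothetical toy corpus; real models are trained on far more text.
corpus = ["I like NLP", "I like AI", "I love NLP"]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def sentence_probability(sentence):
    """Multiply unsmoothed bigram probabilities C(prev word) / C(prev)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if unigram_counts[prev] == 0:
            return 0.0  # the context word was never seen in training
        prob *= bigram_counts[(prev, word)] / unigram_counts[prev]
    return prob

p_a = sentence_probability("I like NLP")
p_b = sentence_probability("NLP like I")
print(p_a, p_b)
```

On this corpus Sentence A scores above Sentence B, whose word pairs never occur in the training sentences.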
Importance of SLMs
- Predicts next words.
- Improves machine translation.
- Supports speech recognition.
- Helps text generation.
- Used in autocomplete systems.
Evaluation Techniques and Perplexity
Common Evaluation Metrics
- Accuracy
- Cross-Entropy
- Perplexity
Understanding Perplexity
Perplexity measures how well a language model predicts text.
Formula:
$$PP(W) = P(W)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(W)}}$$
Where:
- N = Number of words
- P(W) = Probability of the sentence
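Perplexity is the inverse probability of the text, normalized by the number of words. The sketch below computes it in log space to avoid numerical underflow on long texts. The per-word probabilities are invented for illustration, and the function name `perplexity` is an assumption.

```python
import math

def perplexity(word_probs):
    """Perplexity from the probability the model assigned to each word."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)  # log P(W)
    return math.exp(-log_prob / n)                   # P(W) ** (-1 / N)

# Hypothetical per-word probabilities for a three-word sentence.
confident = perplexity([0.5, 0.5, 0.5])
uncertain = perplexity([0.1, 0.1, 0.1])
print(confident, uncertain)
```

A model that assigns each word a probability of 0.5 behaves as if it were choosing among 2 equally likely words at every step; at 0.1 per word it is choosing among 10.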
Interpretation of Results
- Lower Perplexity: Better Model
- Higher Perplexity: Poor Prediction
Example Comparison:
| Model | Perplexity |
|---|---|
| Model A | 50 |
| Model B | 120 |
In this case, Model A performs better.
Importance of Perplexity
- Standard evaluation metric.
- Measures prediction quality.
- Compares different language models.
Sentence Generation and Sampling
Sentence Generation Process
Language models generate sentences by predicting one word at a time.
Example:
- Start: I
- Model predicts: love
- Next prediction: NLP
- Generated sentence: I love NLP
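The word-by-word process above can be sketched as a loop that always takes the most probable next word (greedy decoding). The probability table, the `<end>` marker, and the function name `generate_greedy` are hypothetical values chosen so the loop reproduces the example.

```python
# Hypothetical next-word probabilities; "<end>" marks the end of the sentence.
next_word_probs = {
    "I": {"love": 0.6, "like": 0.3, "hate": 0.1},
    "love": {"NLP": 0.7, "AI": 0.3},
    "NLP": {"<end>": 1.0},
}

def generate_greedy(start, max_words=10):
    """Build a sentence by repeatedly picking the most probable next word."""
    words = [start]
    while len(words) < max_words:
        candidates = next_word_probs.get(words[-1])
        if not candidates:
            break  # no statistics for this word, so stop
        best = max(candidates, key=candidates.get)
        if best == "<end>":
            break
        words.append(best)
    return " ".join(words)

sentence = generate_greedy("I")
print(sentence)
```

Greedy decoding always produces the same sentence for the same start word.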
Sampling Methods
Words are selected according to their calculated probabilities.
Example after the word “I”:
| Word | Probability |
|---|---|
| love | 0.6 |
| like | 0.3 |
| hate | 0.1 |
Random sampling draws each word with a chance equal to its probability, so “love” is chosen about 60% of the time, “like” about 30%, and “hate” about 10%.
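A minimal sketch of this kind of sampling, using the probabilities from the table above and Python's standard `random.choices`. The fixed seed and the helper name `sample_next_word` are assumptions made so the demonstration is repeatable.

```python
import random

words = ["love", "like", "hate"]
probs = [0.6, 0.3, 0.1]

def sample_next_word(rng):
    """Draw one word; each word's chance equals its probability in the table."""
    return rng.choices(words, weights=probs, k=1)[0]

rng = random.Random(0)  # fixed seed only so the demo is repeatable
draws = [sample_next_word(rng) for _ in range(10_000)]
share_love = draws.count("love") / len(draws)
print(share_love)
```

Over many draws the share of “love” settles near 0.6, while any single draw can still return “like” or “hate”, which is what gives sampled text its variety.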
Applications
- Chatbots
- Text generation
- Story generation
- AI assistants
Smoothing Techniques in Language Modeling
Smoothing is a technique that assigns small probabilities to unseen words or N-grams by shifting a little probability mass away from the events that were seen.
Why is Smoothing Needed?
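Some word combinations may never appear in the training data, so their estimated probability is zero. Because sentence probabilities are products, a single zero forces the probability of the whole sentence to zero and leaves perplexity undefined.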
Common Smoothing Techniques
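1. Laplace (Add-One) Smoothing
$$P(w) = \frac{C(w) + 1}{N + V}$$
2. Add-k Smoothing
$$P(w) = \frac{C(w) + k}{N + kV}$$
Where 0 < k < 1, C(w) is the count of word w, N is the total word count, and V is the vocabulary size.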
3. Good-Turing Smoothing
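Re-estimates the count of events seen c times from the number of events seen c + 1 times, and reserves the mass of events seen once for events never seen.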
4. Kneser-Ney Smoothing
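An advanced technique that combines absolute discounting with a continuation probability (how many distinct contexts a word follows). It is widely regarded as the standard smoothing method for N-gram language models.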
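Add-k smoothing from the list above can be sketched for bigram probabilities as follows. Applying it to conditional bigram counts, the choice k = 0.5, and the function name `add_k_bigram_prob` are assumptions made for illustration; setting k = 1 gives Laplace smoothing.

```python
from collections import Counter

def add_k_bigram_prob(prev, word, bigram_counts, unigram_counts, vocab_size, k=0.5):
    """P(word | prev) with k added to every bigram count."""
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * vocab_size)

tokens = "I love NLP".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
vocab_size = len(unigram_counts)

seen = add_k_bigram_prob("I", "love", bigram_counts, unigram_counts, vocab_size)
unseen = add_k_bigram_prob("I", "NLP", bigram_counts, unigram_counts, vocab_size)
print(seen, unseen)
```

The unseen pair receives a small non-zero probability, and the seen pair gives up some of its mass to pay for it.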
Benefits of Smoothing
- Prevents zero probabilities.
- Handles unseen data effectively.
- Improves overall model performance.
Laplace Smoothing Explained with Examples
Laplace Smoothing (also known as Add-One Smoothing) adds 1 to the count of every word in the vocabulary.
The Formula
$$P(w) = \frac{C(w) + 1}{N + V}$$
Where:
- C(w) = Count of word w
- N = Total word count
- V = Vocabulary size
Practical Example
Training Data:
| Word | Count |
|---|---|
| NLP | 3 |
| AI | 2 |
| ML | 0 |
Total words = 5; Vocabulary size = 3.
Probability of “ML”:
- Without smoothing: $P(\text{ML}) = \frac{0}{5} = 0$
- With Laplace smoothing: $P(\text{ML}) = \frac{0 + 1}{5 + 3} = \frac{1}{8} = 0.125$
Thus, unseen words receive a non-zero probability.
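The worked example can be checked with exact fractions. This sketch takes the counts from the table above; the function name `laplace` is an assumption.

```python
from fractions import Fraction

counts = {"NLP": 3, "AI": 2, "ML": 0}
total = sum(counts.values())  # N = 5
vocab_size = len(counts)      # V = 3

def laplace(word):
    """(count + 1) / (N + V), kept as an exact fraction."""
    return Fraction(counts[word] + 1, total + vocab_size)

smoothed = {word: laplace(word) for word in counts}
for word, prob in smoothed.items():
    print(word, prob)
```

The smoothed values still sum to 1: the 1/8 given to “ML” is taken from “NLP” (3/5 down to 1/2) and “AI” (2/5 down to 3/8).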
Primary Advantage
It prevents zero probabilities for unseen events, so a single unseen word no longer forces the probability of an entire sentence to zero.
Data Sparsity Challenges in Language Models
Data Sparsity occurs when many possible word combinations are absent or rarely occur in the training data.
Example of Sparsity
Training Data: “I love NLP”
Bigrams present: “I love”, “love NLP”.
The bigram “NLP is” never appears. Consequently, the probability becomes:
$$P(\text{is} \mid \text{NLP}) = \frac{C(\text{NLP is})}{C(\text{NLP})} = \frac{0}{1} = 0$$
Associated Problems
- Zero probabilities in calculations.
- Poor predictions for new text.
- Reduced model accuracy.
Solutions to Data Sparsity
- Smoothing techniques.
- Utilizing larger datasets.
- Backoff models.
- Interpolation methods.
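Of the solutions listed above, interpolation is the easiest to sketch: the bigram estimate is mixed with the unigram estimate so that an unseen pair of known words still gets some probability. The weight `lam = 0.7` and the function name `interpolated_prob` are illustrative assumptions; in practice the weight is tuned on held-out data.

```python
from collections import Counter

tokens = "I love NLP".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
total = len(tokens)

def interpolated_prob(prev, word, lam=0.7):
    """lam * P(word | prev) + (1 - lam) * P(word); lam is an illustrative weight."""
    unigram = unigram_counts[word] / total
    prev_count = unigram_counts[prev]
    bigram = bigram_counts[(prev, word)] / prev_count if prev_count else 0.0
    return lam * bigram + (1 - lam) * unigram

p_seen = interpolated_prob("I", "love")    # bigram occurs in the data
p_unseen = interpolated_prob("love", "I")  # bigram never occurs
print(p_seen, p_unseen)
```

A word that is missing from the vocabulary altogether still gets zero here, so interpolation is usually combined with smoothing of the unigram estimate.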
Probability Estimation and Markov Assumption
Probability Estimation
Probability is estimated based on word frequencies within the corpus.
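- Unigram: $P(w_i) = \frac{C(w_i)}{N}$
- Bigram: $P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}$
- Trigram: $P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}$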
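These count-based estimates can be sketched by dividing the count of each N-gram by the count of its context. The seven-word corpus and the helper name `count_ngrams` are assumptions made for illustration.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical toy corpus.
tokens = "I love NLP and I love AI".split()
uni, bi, tri = (count_ngrams(tokens, n) for n in (1, 2, 3))

p_unigram = uni[("love",)] / len(tokens)                   # C(love) / N
p_bigram = bi[("I", "love")] / uni[("I",)]                 # C(I love) / C(I)
p_trigram = tri[("I", "love", "NLP")] / bi[("I", "love")]  # C(I love NLP) / C(I love)
print(p_unigram, p_bigram, p_trigram)
```

In this corpus “love” always follows “I”, so the bigram estimate is 1, while “I love” is followed by “NLP” only half of the time.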
The Markov Assumption
The Markov Assumption states that the probability of a word depends only on a limited number of previous words rather than the entire history.
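Bigram Assumption
$$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-1})$$
Trigram Assumption
$$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$$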
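In code, the Markov assumption amounts to discarding all but the last N-1 words of the history before looking up a probability. The sketch below shows only that truncation step; the function name `markov_context` and the example history are assumptions.

```python
def markov_context(history, n):
    """Keep only the last n-1 words of the history, as an n-gram model does."""
    return tuple(history[-(n - 1):]) if n > 1 else ()

history = "I really love learning about".split()

unigram_context = markov_context(history, 1)  # no context at all
bigram_context = markov_context(history, 2)   # one previous word
trigram_context = markov_context(history, 3)  # two previous words
print(unigram_context, bigram_context, trigram_context)
```

However long the history grows, the model only ever stores and looks up contexts of fixed length.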
Advantages
- Significantly reduces computation requirements.
- Simplifies the language modeling process.
- Makes real-time prediction practical.
