N-Gram Models and Statistical Language Modeling

Posted on Jun 25, 2026 in Computer Engineering

N-Gram Models: Unigram, Bigram, and Trigram

An N-Gram Model is a statistical language model that predicts the next word based on the previous $N-1$ words.

Types of N-Grams

Unigram (N=1)

Considers each word independently.

Example sentence: I love NLP

Unigrams:

I
love
NLP

Probability: $P(w_i)$

Bigram (N=2)

Considers one previous word.

Bigrams:

I love
love NLP

Probability: $P(w_i \mid w_{i-1})$

Trigram (N=3)

Considers two previous words.

Trigrams:

I love NLP

Probability: $P(w_i \mid w_{i-2}, w_{i-1})$
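The unigram, bigram, and trigram lists above can be reproduced with a short Python sketch. The helper name `ngrams` and the whitespace tokenization are assumptions made for illustration; real systems usually use a proper tokenizer.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Simple whitespace tokenization (an assumption for this sketch)
tokens = "I love NLP".split()

unigrams = ngrams(tokens, 1)  # [('I',), ('love',), ('NLP',)]
bigrams = ngrams(tokens, 2)   # [('I', 'love'), ('love', 'NLP')]
trigrams = ngrams(tokens, 3)  # [('I', 'love', 'NLP')]
```

A sentence with fewer than N words simply yields an empty list.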

Applications of N-Grams

Text prediction
Spell checking
Speech recognition
Machine translation
Chatbots

Statistical Language Models in NLP

A Statistical Language Model (SLM) assigns probabilities to sequences of words and predicts the likelihood of a sentence.

Probability Formula

$$P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})$$

Example:

Sentence A: I like NLP
Sentence B: NLP like I

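Sentence A gets a higher probability because its word order matches patterns that are frequent in the training data. The model has no explicit notion of grammar, but grammatical word orders are the ones it has seen most often.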
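As an illustration, applying the chain rule of probability to Sentence A expands its probability into one factor per word, each conditioned on the words before it:

$$P(\text{I like NLP}) = P(\text{I}) \cdot P(\text{like} \mid \text{I}) \cdot P(\text{NLP} \mid \text{I}, \text{like})$$

Sentence B is scored the same way, but its factors, such as $P(\text{like} \mid \text{NLP})$, are typically much smaller because those word orders are rare in training text.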

Importance of SLMs

Predicts next words.
Improves machine translation.
Supports speech recognition.
Helps text generation.
Used in autocomplete systems.

Evaluation Techniques and Perplexity

Common Evaluation Metrics

Accuracy
Cross-Entropy
Perplexity

Understanding Perplexity

Perplexity measures how well a language model predicts text.

Formula: $PP(W) = P(W)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(W)}}$

Where:

N = Number of words
P(W) = Probability of the sentence

Interpretation of Results

Lower Perplexity: Better Model
Higher Perplexity: Poor Prediction

Example Comparison:

Model	Perplexity
Model A	50
Model B	120

In this case, Model A performs better.
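A minimal sketch of the perplexity calculation follows. It assumes the model's probability for each word of the test text is already available; the function name `perplexity` and the sample probabilities are hypothetical. Working with logarithms avoids numerical underflow when many small probabilities are multiplied.

```python
import math

def perplexity(word_probs):
    """Perplexity P(W)^(-1/N), given the model's probability for each of the N words."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)  # log P(W)
    return math.exp(-log_prob / n)

# Hypothetical per-word probabilities assigned by some model
pp_example = perplexity([0.2, 0.5, 0.1])

# Sanity check: a model that gives every word probability 1/4
# is as uncertain as a uniform choice among 4 words.
pp_uniform = perplexity([0.25, 0.25, 0.25, 0.25])
```

The uniform case shows why perplexity is often read as the effective number of words the model is choosing between at each step.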

Importance of Perplexity

Standard evaluation metric.
Measures prediction quality.
Compares different language models.

Sentence Generation and Sampling

Sentence Generation Process

Language models generate sentences by predicting one word at a time.

Example:

Start: I
Model predicts: love
Next prediction: NLP
Generated sentence: I love NLP
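The word-by-word process above can be sketched as a loop that always picks the most probable next word (greedy generation). The `model` table and its probabilities are hypothetical values chosen for illustration, and conditioning on only the last word is a simplifying assumption.

```python
# Hypothetical next-word probabilities, keyed by the previous word
model = {
    "I": {"love": 0.6, "like": 0.3, "hate": 0.1},
    "love": {"NLP": 0.7, "AI": 0.3},
}

def generate_greedy(start, model, max_words=10):
    """Extend `start` one word at a time, always taking the most probable next word."""
    words = [start]
    # Stop when the last word has no known continuation or the length limit is hit
    while len(words) < max_words and words[-1] in model:
        candidates = model[words[-1]]
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)

sentence = generate_greedy("I", model)  # "I love NLP"
```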

Sampling Methods

Words are selected according to their calculated probabilities.

Example after the word “I”:

Word	Probability
love	0.6
like	0.3
hate	0.1

Random sampling draws the next word in proportion to these probabilities, so “love” is chosen about 60% of the time, “like” about 30%, and “hate” about 10%.
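One possible way to sample from the table above uses `random.choices` from Python's standard library. The helper name `sample_next_word` is an assumption, and the fixed seed is there only to make the sketch reproducible.

```python
import random

# Probabilities for the word following "I", taken from the table above
next_word_probs = {"love": 0.6, "like": 0.3, "hate": 0.1}

def sample_next_word(probs, rng=random):
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    weights = list(probs.values())
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed, only for reproducibility
samples = [sample_next_word(next_word_probs, rng) for _ in range(10_000)]
share_love = samples.count("love") / len(samples)  # close to 0.6
```

Unlike always picking the most probable word, sampling can produce different sentences on each run, which is usually desirable for text and story generation.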

Applications

Chatbots
Text generation
Story generation
AI assistants

Smoothing Techniques in Language Modeling

Smoothing is a technique used to assign small probabilities to unseen words or N-grams.

Why is Smoothing Needed?

Some word combinations may never appear in training data, resulting in a probability of zero, which can break model calculations.

Common Smoothing Techniques

1. Laplace (Add-One) Smoothing

$$P(w) = \frac{C(w) + 1}{N + V}$$

Here $C(w)$ is the count of word $w$, $N$ is the total word count, and $V$ is the vocabulary size.

2. Add-k Smoothing

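$$P(w) = \frac{C(w) + k}{N + kV}$$

Where $0 < k < 1$, and $C(w)$, $N$, and $V$ are the count of word $w$, the total word count, and the vocabulary size.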

3. Good-Turing Smoothing

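Re-estimates the probability of unseen and rare N-grams from the "frequency of frequencies": how many N-grams occur exactly once, exactly twice, and so on.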

4. Kneser-Ney Smoothing

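An advanced technique that combines absolute discounting with a lower-order distribution based on how many distinct contexts a word appears in. It is widely regarded as one of the most effective smoothing methods for N-gram models.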
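The add-one and add-k estimates from the list above differ only in the constant added to each count, so both can be sketched with one function. The function name `add_k_probability` and the numbers in the example are hypothetical.

```python
def add_k_probability(count, total, vocab_size, k=1.0):
    """Add-k estimate (count + k) / (total + k * vocab_size); k=1 gives Laplace."""
    if k <= 0:
        raise ValueError("k must be positive")
    return (count + k) / (total + k * vocab_size)

# Hypothetical setting: an unseen word, 1000 training words, 50-word vocabulary
p_laplace = add_k_probability(0, 1000, 50, k=1.0)  # 1 / 1050
p_add_k = add_k_probability(0, 1000, 50, k=0.1)    # 0.1 / 1005
```

A smaller k moves less probability mass from seen words to unseen ones, which is the usual motivation for preferring add-k over add-one.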

Benefits of Smoothing

Prevents zero probabilities.
Handles unseen data effectively.
Improves overall model performance.

Laplace Smoothing Explained with Examples

Laplace Smoothing (also known as Add-One Smoothing) adds 1 to the count of every word in the vocabulary before normalizing, so no word is left with a count of zero.

The Formula

$$P(w) = \frac{C(w) + 1}{N + V}$$

Where:

C(w) = Count of word w
N = Total word count
V = Vocabulary size

Practical Example

Training Data:

Word	Count
NLP	3
AI	2
ML	0

Total words = 5; Vocabulary size = 3.

Probability of “ML”:

Without smoothing: $P(\text{ML}) = \frac{0}{5} = 0$
With Laplace smoothing: $P(\text{ML}) = \frac{0 + 1}{5 + 3} = \frac{1}{8} = 0.125$

Thus, unseen words receive a non-zero probability.
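The worked example can be checked with a few lines of Python. This sketch uses the counts from the table above and `fractions.Fraction` to keep the results exact; the variable names are illustrative.

```python
from fractions import Fraction

# Counts from the training data table
counts = {"NLP": 3, "AI": 2, "ML": 0}
total = sum(counts.values())  # N = 5
vocab_size = len(counts)      # V = 3

# Unsmoothed estimate: C(w) / N
unsmoothed = {w: Fraction(c, total) for w, c in counts.items()}

# Laplace estimate: (C(w) + 1) / (N + V)
laplace = {w: Fraction(c + 1, total + vocab_size) for w, c in counts.items()}
```

The smoothed values are 1/2 for "NLP", 3/8 for "AI", and 1/8 for "ML". They still sum to 1, so the result remains a valid probability distribution; the mass given to "ML" is taken from the seen words.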

Primary Advantage

It prevents zero probabilities for unseen events, ensuring the model remains functional.

Data Sparsity Challenges in Language Models

Data Sparsity occurs when many possible word combinations are absent or rarely occur in the training data.

Example of Sparsity

Training Data: “I love NLP”

Bigrams present: “I love”, “love NLP”.

The bigram “NLP is” never appears. Consequently, the probability becomes: $P(\text{is} \mid \text{NLP}) = \frac{C(\text{NLP is})}{C(\text{NLP})} = \frac{0}{1} = 0$

Associated Problems

Zero probabilities in calculations.
Poor predictions for new text.
Reduced model accuracy.

Solutions to Data Sparsity

Smoothing techniques.
Utilizing larger datasets.
Backoff models.
Interpolation methods.
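As one illustration of the interpolation idea, the sketch below mixes a bigram estimate with a unigram estimate so that an unseen bigram still receives some probability. The toy corpus, the function names, and the weight `lam=0.7` are all assumptions; in practice the weight is tuned on held-out data.

```python
from collections import Counter

# Hypothetical toy corpus of two sentences
sentences = [["I", "love", "NLP"], ["I", "like", "AI"]]

unigram_counts = Counter(w for s in sentences for w in s)
bigram_counts = Counter(pair for s in sentences for pair in zip(s, s[1:]))
# How often each word occurs as the first word of a bigram
context_counts = Counter(prev for s in sentences for prev in s[:-1])
total_words = sum(unigram_counts.values())

def p_unigram(word):
    return unigram_counts[word] / total_words

def p_bigram(prev, word):
    # Unsmoothed estimate; 0 if the context was never seen
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

def p_interpolated(prev, word, lam=0.7):
    """Weighted mix of the bigram and unigram estimates."""
    return lam * p_bigram(prev, word) + (1 - lam) * p_unigram(word)

p_raw = p_bigram("love", "AI")        # 0.0, the bigram is unseen
p_mix = p_interpolated("love", "AI")  # 0.3 * (1/6), no longer zero
```

A backoff model follows a related idea but uses the lower-order estimate only when the higher-order count is zero, rather than always mixing the two.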

Probability Estimation and Markov Assumption

Probability Estimation

Probability is estimated based on word frequencies within the corpus.

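Unigram: $P(w_i) = \frac{C(w_i)}{N}$
Bigram: $P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}$
Trigram: $P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}$

Here $C(\cdot)$ is the number of times a word or word sequence occurs in the corpus, and $N$ is the total number of words.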
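A small sketch of frequency-based estimation: each probability is the count of an N-gram divided by the count of its context. The toy corpus and the helper name `count_ngrams` are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus, tokenized on whitespace
tokens = "I love NLP and I love AI".split()

def count_ngrams(tokens, n):
    """Count every contiguous n-gram in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

unigrams = count_ngrams(tokens, 1)
bigrams = count_ngrams(tokens, 2)
trigrams = count_ngrams(tokens, 3)

# Unigram: C(love) / N
p_love = unigrams[("love",)] / len(tokens)
# Bigram: C(I love) / C(I)
p_love_given_i = bigrams[("I", "love")] / unigrams[("I",)]
# Trigram: C(I love NLP) / C(I love)
p_nlp_given_i_love = trigrams[("I", "love", "NLP")] / bigrams[("I", "love")]
```

In this corpus "love" makes up 2 of the 7 words, "I" is always followed by "love", and "I love" is followed by "NLP" in one of its two occurrences.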

The Markov Assumption

The Markov Assumption states that the probability of a word depends only on a limited number of previous words rather than the entire history.

Bigram Assumption

$$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-1})$$

Trigram Assumption

$$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$$
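Under the bigram assumption, a whole sentence can be scored by multiplying one factor per word, each conditioned only on the word before it. The sketch below assumes a hypothetical table of bigram probabilities and a sentence-start marker `<s>`; neither comes from a real model.

```python
import math

# Hypothetical bigram probabilities P(word | previous word); "<s>" marks sentence start
bigram_probs = {
    ("<s>", "I"): 0.5,
    ("I", "love"): 0.6,
    ("love", "NLP"): 0.2,
}

def sentence_probability(words, bigram_probs):
    """Multiply P(w_i | w_{i-1}) over the sentence; an unseen bigram gives 0."""
    log_prob = 0.0
    for prev, word in zip(["<s>"] + words, words):
        p = bigram_probs.get((prev, word), 0.0)
        if p == 0.0:
            return 0.0
        log_prob += math.log(p)  # sum logs instead of multiplying small numbers
    return math.exp(log_prob)

p_a = sentence_probability(["I", "love", "NLP"], bigram_probs)  # 0.5 * 0.6 * 0.2
p_b = sentence_probability(["NLP", "love", "I"], bigram_probs)  # unseen bigram, so 0.0
```

Each step needs only a single table lookup, regardless of how long the sentence is.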

Advantages

Significantly reduces computation requirements.
Simplifies the language modeling process.
Makes real-time prediction practical.