N-Gram Models and Statistical Language Modeling

N-Gram Models: Unigram, Bigram, and Trigram

An N-Gram Model is a statistical language model that predicts the next word based on the previous $N-1$ words.

Types of N-Grams

Unigram (N=1)

Considers each word independently.

Example sentence: I love NLP

Unigrams:

  • I
  • love
  • NLP

Probability: $P(w_i)$

Bigram (N=2)

Considers one previous word.

Bigrams:

  • I love
  • love NLP

Probability: $P(w_i \mid w_{i-1})$

Trigram (N=3)

Considers two previous words.

Trigrams:

  • I love NLP

Probability: $P(w_i \mid w_{i-2}, w_{i-1})$
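The unigram, bigram, and trigram lists above can be reproduced with a few lines of code. This is a minimal sketch, assuming whitespace tokenization; the helper name `ngrams` is hypothetical and not taken from any library.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Example sentence from the text, split on whitespace (an assumption).
tokens = "I love NLP".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

for name, grams in [("Unigrams", unigrams), ("Bigrams", bigrams), ("Trigrams", trigrams)]:
    print(name, [" ".join(g) for g in grams])
```

A sentence shorter than N simply yields an empty list, which is one reason real systems often pad sentences with start and end markers.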

Applications of N-Grams

  • Text prediction
  • Spell checking
  • Speech recognition
  • Machine translation
  • Chatbots

Statistical Language Models in NLP

A Statistical Language Model (SLM) assigns probabilities to sequences of words and predicts the likelihood of a sentence.

Probability Formula

$P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})$

Example:

  • Sentence A: I like NLP
  • Sentence B: NLP like I

Sentence A gets a higher probability because it follows grammatical English word order, which is far more common in training text than the ordering in Sentence B.
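To make the comparison concrete, here is a rough sketch that scores both sentences with an unsmoothed bigram model. The three-sentence corpus, the `<s>` start marker, and the function name `sentence_probability` are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
corpus = ["I like NLP", "I like AI", "I love NLP"]

bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split()  # "<s>" marks the sentence start
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

def sentence_probability(sentence):
    """Product of unsmoothed bigram probabilities; 0.0 if any bigram is unseen."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if bigram_counts[(prev, word)] == 0:
            return 0.0
        prob *= bigram_counts[(prev, word)] / context_counts[prev]
    return prob

prob_a = sentence_probability("I like NLP")
prob_b = sentence_probability("NLP like I")
print(prob_a, prob_b)
```

On this toy corpus Sentence B scores exactly zero because no training sentence starts with "NLP"; a smoothed model would give it a small non-zero probability instead.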

Importance of SLMs

  • Predicts next words.
  • Improves machine translation.
  • Supports speech recognition.
  • Helps text generation.
  • Used in autocomplete systems.

Evaluation Techniques and Perplexity

Common Evaluation Metrics

  1. Accuracy
  2. Cross-Entropy
  3. Perplexity

Understanding Perplexity

Perplexity measures how well a language model predicts text.

Formula: $PP(W) = P(W)^{-1/N}$

Where:

  • N = Number of words
  • P(W) = Probability of the sentence
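As a small sketch of the calculation, the function below computes perplexity from the probability the model assigns to each word of a sentence. It works in log space to avoid underflow on long sentences; the function name and the input probabilities are hypothetical.

```python
import math

def perplexity(word_probs):
    """Perplexity of a sentence given the model's probability for each word."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)  # log P(W)
    return math.exp(-log_prob / n)                   # equals P(W) ** (-1/N)

# Hypothetical per-word probabilities for a three-word sentence.
ppl = perplexity([0.5, 0.25, 0.5])
print(round(ppl, 3))
```

A model that assigns probability 0.5 to every word has a perplexity of 2, which matches the intuition that it is as uncertain as a fair choice between two words.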

Interpretation of Results

  • Lower Perplexity: Better Model
  • Higher Perplexity: Poor Prediction

Example Comparison:

Model      Perplexity
Model A    50
Model B    120

In this case, Model A performs better.

Importance of Perplexity

  • Standard evaluation metric.
  • Measures prediction quality.
  • Compares different language models.

Sentence Generation and Sampling

Sentence Generation Process

Language models generate sentences by predicting one word at a time.

Example:

  • Start: I
  • Model predicts: love
  • Next prediction: NLP
  • Generated sentence: I love NLP
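The word-by-word process can be sketched as a loop that always picks the most probable next word (greedy generation). The probability table and the function name `generate` are assumptions for illustration, not output from a trained model.

```python
# Hypothetical next-word probabilities; the values are invented for illustration.
next_word_probs = {
    "I": {"love": 0.6, "like": 0.3, "hate": 0.1},
    "love": {"NLP": 0.7, "AI": 0.3},
}

def generate(start, max_words=10):
    """Greedy generation: repeatedly append the most probable next word."""
    words = [start]
    # Stop at the length limit or when the last word has no known continuation.
    while len(words) < max_words and words[-1] in next_word_probs:
        candidates = next_word_probs[words[-1]]
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)

sentence = generate("I")
print(sentence)
```

Greedy selection always produces the same sentence for the same start word; sampling from the probabilities instead gives more varied output.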

Sampling Methods

Words are selected according to their calculated probabilities.

Example after the word “I”:

Word    Probability
love    0.6
like    0.3
hate    0.1

Random sampling draws each word in proportion to its probability, so "love" is chosen about 60% of the time, "like" about 30%, and "hate" about 10%.
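One possible way to implement this kind of sampling is with the standard library's `random.choices`, which accepts per-item weights. The fixed seed and the sample size are arbitrary choices made so the run is repeatable.

```python
import random

# Next-word distribution after "I", taken from the table above.
words = ["love", "like", "hate"]
probs = [0.6, 0.3, 0.1]

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = rng.choices(words, weights=probs, k=10_000)

# Empirical frequencies should land close to the target probabilities.
frequencies = {w: samples.count(w) / len(samples) for w in words}
print(frequencies)
```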

Applications

  • Chatbots
  • Text generation
  • Story generation
  • AI assistants

Smoothing Techniques in Language Modeling

Smoothing is a technique used to assign small probabilities to unseen words or N-grams.

Why is Smoothing Needed?

Some word combinations never appear in the training data, so their estimated probability is zero. A single zero factor drives the probability of an entire sentence to zero and makes perplexity undefined.

Common Smoothing Techniques

1. Laplace (Add-One) Smoothing

$P(w) = \frac{C(w) + 1}{N + V}$

2. Add-k Smoothing

$P(w) = \frac{C(w) + k}{N + kV}$

Where $0 < k < 1$.
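A minimal sketch of add-k smoothing for word counts follows. The function name, the default `k=0.5`, and the example counts are illustrative assumptions; setting `k=1` recovers Laplace smoothing.

```python
def add_k_probability(count, total, vocab_size, k=0.5):
    """Add-k estimate: (count + k) / (total + k * vocab_size)."""
    return (count + k) / (total + k * vocab_size)

# Illustrative numbers: an unseen word, 5 training words, 3 vocabulary entries.
p_unseen = add_k_probability(count=0, total=5, vocab_size=3, k=0.5)
p_laplace = add_k_probability(count=0, total=5, vocab_size=3, k=1)
print(p_unseen, p_laplace)
```

A smaller k moves less probability mass away from observed words, which is why k is usually tuned on held-out data rather than fixed at 1.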

3. Good-Turing Smoothing

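Re-estimates counts from the frequency of frequencies (how many N-grams occur once, twice, and so on), shifting probability mass from rare events to unseen ones.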

4. Kneser-Ney Smoothing

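An advanced technique that combines absolute discounting with a continuation probability based on how many distinct contexts a word appears in. It is widely used in N-gram language modeling.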

Benefits of Smoothing

  • Prevents zero probabilities.
  • Handles unseen data effectively.
  • Improves overall model performance.

Laplace Smoothing Explained with Examples

Laplace Smoothing (also known as Add-One Smoothing) adds 1 to the count of every word in the vocabulary, including words that never appear in the training data.

The Formula

$P(w) = \frac{C(w) + 1}{N + V}$

Where:

  • C(w) = Count of word w
  • N = Total word count
  • V = Vocabulary size

Practical Example

Training Data:

Word    Count
NLP     3
AI      2
ML      0

Total words = 5; Vocabulary size = 3.

Probability of “ML”:

  • Without smoothing: $P(\text{ML}) = \frac{0}{5} = 0$
  • With Laplace smoothing: $P(\text{ML}) = \frac{0 + 1}{5 + 3} = \frac{1}{8} = 0.125$

Thus, unseen words receive a non-zero probability.
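The same arithmetic can be checked for the whole vocabulary at once. This short sketch uses the counts from the training data above; the variable names are arbitrary.

```python
# Word counts from the practical example.
counts = {"NLP": 3, "AI": 2, "ML": 0}
total = sum(counts.values())   # N = 5
vocab_size = len(counts)       # V = 3

unsmoothed = {w: c / total for w, c in counts.items()}
laplace = {w: (c + 1) / (total + vocab_size) for w, c in counts.items()}

for word in counts:
    print(word, unsmoothed[word], laplace[word])
```

The smoothed values still sum to 1, but the seen words give up some probability: "NLP" drops from 3/5 to 4/8 so that "ML" can rise from 0 to 1/8.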

Primary Advantage

It prevents zero probabilities for unseen events, ensuring the model remains functional.

Data Sparsity Challenges in Language Models

Data Sparsity occurs when many possible word combinations are absent or rarely occur in the training data.

Example of Sparsity

Training Data: “I love NLP”

Bigrams present: “I love”, “love NLP”.

The bigram “NLP is” never appears. Consequently, the probability becomes: $P(\text{is} \mid \text{NLP}) = 0$

Associated Problems

  • Zero probabilities in calculations.
  • Poor predictions for new text.
  • Reduced model accuracy.

Solutions to Data Sparsity

  • Smoothing techniques.
  • Utilizing larger datasets.
  • Backoff models.
  • Interpolation methods.
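As one illustration of the last item, linear interpolation mixes a higher-order estimate with a lower-order one so that an unseen bigram still receives some probability. This is a simplified sketch: the weight `lam` and the input probabilities are made-up values, and in practice the weight is tuned on held-out data.

```python
def interpolated_probability(p_bigram, p_unigram, lam=0.7):
    """Linear interpolation of a bigram estimate and a unigram estimate."""
    return lam * p_bigram + (1 - lam) * p_unigram

# Hypothetical values: the bigram "NLP is" is unseen, but "is" has a unigram estimate.
p = interpolated_probability(p_bigram=0.0, p_unigram=0.05, lam=0.7)
print(p)
```

A backoff model differs in that it uses the lower-order estimate only when the higher-order count is zero, rather than always mixing the two.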

Probability Estimation and Markov Assumption

Probability Estimation

Probability is estimated based on word frequencies within the corpus.

  • Unigram: $P(w_i) = \frac{C(w_i)}{N}$
  • Bigram: $P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}$
  • Trigram: $P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}$
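These frequency-based estimates all follow one pattern: the count of the full N-gram divided by the count of its context. The sketch below assumes a tiny made-up corpus and hypothetical helper names, and it handles the unigram case by dividing by the total word count.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_probability(tokens, context, word):
    """Estimate P(word | context) as C(context + word) / C(context)."""
    context = tuple(context)
    numerator = ngram_counts(tokens, len(context) + 1)[context + (word,)]
    if not context:  # unigram case: divide by the total word count
        return numerator / len(tokens)
    denominator = ngram_counts(tokens, len(context))[context]
    return numerator / denominator if denominator else 0.0

# Hypothetical toy corpus, invented for illustration only.
tokens = "I love NLP and I love AI".split()

p_unigram = mle_probability(tokens, [], "love")
p_bigram = mle_probability(tokens, ["I"], "love")
p_trigram = mle_probability(tokens, ["I", "love"], "NLP")
print(p_unigram, p_bigram, p_trigram)
```

Recounting the corpus on every call keeps the sketch short; a practical implementation would build the count tables once and reuse them.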

The Markov Assumption

The Markov Assumption states that the probability of a word depends only on a limited number of previous words rather than the entire history.

Bigram Assumption

$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-1})$

Trigram Assumption

$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$
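As a worked illustration, consider the sentence "I love NLP". The first line below is the exact chain-rule factorization; the second applies the bigram assumption, so the last factor no longer conditions on "I". For a three-word sentence, the trigram assumption leaves the exact factorization unchanged.

```latex
P(\text{I love NLP})
  = P(\text{I}) \, P(\text{love} \mid \text{I}) \, P(\text{NLP} \mid \text{I}, \text{love})
  \approx P(\text{I}) \, P(\text{love} \mid \text{I}) \, P(\text{NLP} \mid \text{love})
```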

Advantages

  • Significantly reduces computation requirements.
  • Simplifies the language modeling process.
  • Makes real-time prediction practical.