Natural Language Processing (NLP): Definitions, Applications & Techniques

NLP — Definition & Applications

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that enables computers to understand, analyze, and generate human language (text or speech). NLP serves as a bridge between human language and computer language.

Applications of NLP (Explain Any Four)

1. Machine Translation

Machine translation automatically converts text from one language to another using NLP techniques.

Example: Google Translate
Use: Helps people communicate across languages.

2. Sentiment Analysis

Sentiment analysis identifies emotion or opinion in text.

Types: Positive / Negative / Neutral
Example: Product and movie reviews
Use: Customer feedback analysis and reputation monitoring.

3. Chatbots / Virtual Assistants

Chatbots and virtual assistants interact with users using natural language.

Example: Customer support chatbots, Alexa, Google Assistant
Use: 24×7 support; reduces manual work and improves response speed.

4. Speech Recognition

Speech recognition converts spoken words into text using NLP techniques.

Example: Voice typing, voice assistants
Use: Hands-free interaction and accessibility.

NLP in Real-World Applications

NLP is widely used in translation, chatbots, sentiment analysis, and many other applications, making human-computer interaction easier and more efficient.

Purpose of Text Normalization in NLP

Text normalization is a preprocessing step used to convert raw text into a standard, clean form so machines can process it reliably. Real-world text is often noisy and inconsistent; normalization reduces variation, removes unnecessary information, and improves the accuracy of NLP models.

Main purposes:

  • Make text uniform and consistent
  • Reduce vocabulary size
  • Improve performance of NLP algorithms
  • Remove noise from text data

Common Tasks in Text Normalization

1. Tokenization

Tokenization is the process of breaking text into smaller units called tokens.

Example: Sentence: “I love NLP” — Tokens: I | love | NLP

Purpose:

  • Helps in word-level analysis
  • Forms the base for further NLP tasks
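
The splitting step above can be sketched in a few lines of Python. This is a minimal whitespace/regex tokenizer for illustration, not a full NLP tokenizer (real tokenizers also handle punctuation, contractions, and special tokens):

```python
import re

def tokenize(text):
    """Split text into word tokens; punctuation is dropped.
    A minimal regex tokenizer for illustration only."""
    return re.findall(r"[A-Za-z']+", text)

print(tokenize("I love NLP"))  # ['I', 'love', 'NLP']
```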

2. Stop Word Removal

Stop words are common words that do not add much meaning to a sentence.

Examples: is, am, the, a, an, and, of

Purpose:

  • Reduces text size
  • Improves processing speed
  • Removes unnecessary words
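
A sketch of stop word removal, using the small illustrative stop word list from the examples above (real systems use much larger lists, e.g. NLTK's):

```python
STOP_WORDS = {"is", "am", "the", "a", "an", "and", "of"}  # small illustrative list

def remove_stop_words(tokens):
    """Drop common function words; comparison is case-insensitive."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "dogs", "and", "the", "cats"]))
# ['dogs', 'cats']
```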

3. Stemming / Lemmatization

Stemming and lemmatization reduce words to their base or root form.

Examples: running → run, studies → study

Difference (one line): Stemming mechanically strips suffixes and may produce non-words; lemmatization uses vocabulary and morphology to return a valid dictionary form (the lemma).

Purpose:

  • Reduce vocabulary size
  • Improve matching of similar words
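
A deliberately crude suffix-stripping stemmer, to show the idea behind the examples above. The suffix list and the doubled-consonant rule are illustrative assumptions; real stemmers such as the Porter stemmer use many more rules:

```python
def crude_stem(word):
    """Very crude suffix stripping for illustration only."""
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            if suffix == "ies":
                return stem + "y"        # studies -> study
            if len(stem) > 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]         # runn -> run
            return stem
    return word

print(crude_stem("running"))  # run
print(crude_stem("studies"))  # study
```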

Text normalization plays an important role in NLP by cleaning and standardizing text data, which improves the accuracy and efficiency of NLP systems.

Minimum Edit Distance Algorithm

The Minimum Edit Distance (MED) algorithm computes the minimum number of operations required to convert one string into another.

Allowed operations: Insertion — add a character; Deletion — remove a character; Substitution — replace a character. Each operation has cost = 1.

Purpose of Minimum Edit Distance: Spell checking, autocorrect systems, machine translation, and measuring similarity between words.

Algorithm (Working of MED)

  1. Create a matrix of size (m+1) × (n+1), where m = length of source word and n = length of target word.
  2. Initialize the first row from 0 to n and the first column from 0 to m.
  3. Fill the matrix using:

Deletion → top cell + 1
Insertion → left cell + 1
Substitution → diagonal cell + cost
cost = 0 (if characters same), cost = 1 (if characters different)

Final cell gives the Minimum Edit Distance.
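
The three steps above translate directly into a dynamic-programming table. A sketch with unit costs, as defined earlier:

```python
def min_edit_distance(source, target):
    """Edit distance with cost 1 for insertion, deletion, substitution."""
    m, n = len(source), len(target)
    # (m+1) x (n+1) matrix; D[i][j] = distance between the prefixes
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i          # first column: 0..m
    for j in range(n + 1):
        D[0][j] = j          # first row: 0..n
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,         # deletion (top cell + 1)
                D[i][j - 1] + 1,         # insertion (left cell + 1)
                D[i - 1][j - 1] + cost,  # substitution (diagonal + cost)
            )
    return D[m][n]

print(min_edit_distance("EXECUTION", "INTENTION"))  # 5
```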

Compute Minimum Edit Distance (Example)

Given words: Word 1: EXECUTION Word 2: INTENTION

[Edit-distance matrix figure omitted.] Minimum Edit Distance between EXECUTION and INTENTION = 5

Explain Vector Space Model of Information Retrieval

The Vector Space Model (VSM) is used in Information Retrieval (IR) to represent documents and queries as vectors in a multi-dimensional space. Each dimension corresponds to a term from the collection, and term values are typically computed using TF-IDF weighting.

Working of Vector Space Model

  1. Convert all documents into vectors.
  2. Each word becomes a dimension in vector space.
  3. Assign weights to terms using TF-IDF.
  4. Convert the user query into a vector.
  5. Calculate similarity between document and query using cosine similarity.

Cosine similarity measures the angle between two vectors: a smaller angle indicates more similarity, and documents with higher similarity scores are ranked higher.
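
The five steps above can be sketched in plain Python. The documents, query, and IDF formula (log(N/df) + 1) are illustrative assumptions, not a fixed standard:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(docs, query):
    """Rank documents against a query using TF-IDF vectors."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    idf = {w: math.log(n / sum(w in t for t in tokenized)) + 1 for w in vocab}

    def vectorize(tokens):
        tf = Counter(tokens)
        return [tf[w] * idf[w] for w in vocab]  # query words outside vocab are ignored

    q_vec = vectorize(query.lower().split())
    scores = [(cosine(vectorize(t), q_vec), d) for t, d in zip(tokenized, docs)]
    return sorted(scores, reverse=True)  # highest similarity first

docs = ["search engines rank documents",
        "nlp makes search easy",
        "cats sleep all day"]
for score, doc in rank(docs, "search documents"):
    print(round(score, 3), doc)
```

The document sharing both query terms ranks first; the one with no overlap scores 0.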

Advantages of Vector Space Model

  • Simple and easy to understand
  • Supports partial matching
  • Provides a ranked result list

Disadvantages of Vector Space Model

  • Ignores word order
  • Does not capture semantic meaning
  • High-dimensional vector space

The Vector Space Model is widely used in search engines because it provides efficient and ranked retrieval of documents based on similarity to the user query.

What Is a Language Model? N-Gram Language Model

Language Model: A probabilistic model in NLP that predicts the next word in a sequence based on previous words. It assigns probabilities to word sequences and helps machines understand how language is formed.

Need of a Language Model

  • Speech recognition
  • Machine translation
  • Text generation
  • Autocomplete systems

N-Gram Language Model

The N-gram model predicts a word using the previous (N−1) words, where N is the number of words in the window considered.

Types of N-Gram Models

Unigram (N=1): uses no previous context; each word is treated independently
Bigram (N=2): predicts a word from the previous one word
Trigram (N=3): predicts a word from the previous two words

Working of N-Gram Model

The probability of a word depends only on the last (n-1) words and uses frequency counts from training data (e.g., P(word | previous words)).
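
A minimal bigram model built from frequency counts, as described above. The toy training corpus is an assumption for illustration:

```python
from collections import Counter

corpus = "i love nlp i love coding nlp is fun".split()

# Count unigrams and bigrams from the training text
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    """P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(bigram_prob("i", "love"))    # 1.0  ("i" is always followed by "love")
print(bigram_prob("love", "nlp"))  # 0.5  ("love" is followed by "nlp" half the time)
```

Unseen bigrams get probability 0 here, which is exactly the data-sparsity problem noted below; real systems apply smoothing.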

Advantages of N-Gram Model

  • Simple and easy to implement
  • Fast computation
  • Works well for small datasets

Limitations of N-Gram Model

  • Requires large data for larger n
  • Suffers from data sparsity
  • Cannot capture long-term context

The N-gram model is a basic but important approach in NLP and serves as a foundation for more advanced language models.

Relevance Ranking Algorithm

Relevance ranking algorithms in Information Retrieval sort documents by their relevance to a user query. When a user submits a query, many documents may match; ranking algorithms compute similarity and display the most relevant documents first.

Purpose: Provide the most relevant results to users, sort documents by importance, and improve search accuracy and efficiency.

Working: Query preprocessing (tokenization, stop word removal) → represent documents using TF-IDF or Vector Space Model → compute similarity (e.g., cosine similarity) → rank documents by score.

Advantages: Produces ranked results and improves search accuracy; widely used in search engines.


Parse Tree Example (Top-Down Expansion)

Start with the start symbol: S → NP VP

Expand NP: NP → ART N (choose this because the sentence is The dogs cried)

So, NP → ART N → The dogs

Expand VP: VP → V → cried

Combine

S
├── NP
│   ├── ART → The
│   └── N → dogs
└── VP
    └── V → cried

Explanation: Start from S (sentence). Expand NP first (top-down). NP → ART + N matches “The dogs”. Then expand VP → V to match “cried”. Combine NP and VP to complete the sentence parse: “The dogs cried.”
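
The top-down expansion above can be sketched as a tiny recursive parser. The grammar and lexicon encode exactly the rules used in the worked example:

```python
# Grammar and lexicon taken from the worked example above
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["ART", "N"]],
    "VP": [["V"]],
}
LEXICON = {"ART": {"the"}, "N": {"dogs"}, "V": {"cried"}}

def parse(symbol, tokens, pos):
    """Top-down: try to expand `symbol` starting at tokens[pos].
    Returns (tree, next_pos) on success, or None on failure."""
    if symbol in LEXICON:  # pre-terminal: match a word
        if pos < len(tokens) and tokens[pos] in LEXICON[symbol]:
            return (symbol, tokens[pos]), pos + 1
        return None
    for rule in GRAMMAR.get(symbol, []):  # try each production
        children, p = [], pos
        for child in rule:
            result = parse(child, tokens, p)
            if result is None:
                break
            subtree, p = result
            children.append(subtree)
        else:
            return (symbol, children), p
    return None

tree, end = parse("S", "the dogs cried".lower().split(), 0)
print(tree)
# ('S', [('NP', [('ART', 'the'), ('N', 'dogs')]), ('VP', [('V', 'cried')])])
```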

Importance of Sentiment Analysis

Sentiment analysis identifies the emotion or opinion in text (positive, negative, or neutral) and helps extract insights from large volumes of text data.

1. Customer Feedback Analysis

  • Helps companies analyze customer reviews
  • Identifies customer satisfaction levels
  • Improves product and service quality

2. Business Decision Making

  • Supports better organizational decisions
  • Helps understand market response
  • Reduces business risk

3. Brand Reputation Monitoring

  • Tracks public opinion about a brand
  • Identifies negative feedback early
  • Helps maintain brand image

4. Social Media Analysis

  • Analyzes comments and posts
  • Understands public mood
  • Helps in trend analysis

5. Product and Movie Reviews

  • Classifies reviews as positive or negative
  • Helps users make decisions
  • Supports product improvements

Sentiment analysis is widely used across business, marketing, and social media to understand opinions and emotions in text data.

Text Classification

Text classification is a process in NLP where text documents are automatically assigned to predefined categories. It is used in spam detection, sentiment analysis, news classification, and email filtering. The goal is to make machines understand and correctly classify text.

How Text Classification Works

  1. Text collection: Gather the text data to classify (emails, reviews, articles).
  2. Text preprocessing: Clean text by lowercasing, removing punctuation, tokenizing, removing stop words, and applying stemming or lemmatization.
  3. Feature extraction: Convert text into numeric features using Bag of Words (BoW), TF-IDF, or word embeddings.
  4. Model training: Train a machine learning model (Naive Bayes, SVM, Logistic Regression, or deep learning) on labeled data.
  5. Classification/prediction: Use the trained model to predict categories for new text.
  6. Evaluation: Evaluate performance with accuracy, precision, recall, and F1-score.

Applying Logistic Regression for Text Classification

  1. Data collection: Collect and split data into training and testing sets.
  2. Text preprocessing: Lowercase, remove punctuation and numbers, remove stop words, tokenize, and apply stemming or lemmatization.
  3. Feature extraction: Convert text into numeric features with BoW or TF-IDF.
  4. Model training: Train a Logistic Regression model to learn the relationship between features and categories. Effective for binary and multi-class classification.
  5. Prediction: Model outputs probabilities for each class and assigns the most likely label.
  6. Evaluation: Measure accuracy, precision, recall, and F1-score; refine preprocessing or features if needed.
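
The pipeline above can be sketched end to end in pure Python: a bag-of-words vectorizer plus a binary logistic regression trained by gradient descent. The tiny spam/ham training set and hyperparameters (learning rate, epochs) are assumptions for illustration; real systems would use a library such as scikit-learn:

```python
import math

# Toy labeled data (assumed for illustration): 1 = spam, 0 = not spam
train = [("win money now", 1), ("free prize win", 1),
         ("meeting at noon", 0), ("see you at lunch", 0)]

vocab = sorted({w for text, _ in train for w in text.split()})

def bow(text):
    """Bag-of-words count vector over the training vocabulary."""
    toks = text.split()
    return [toks.count(w) for w in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Train with per-example gradient descent on the log-loss
weights = [0.0] * len(vocab)
bias = 0.0
lr = 0.5
for _ in range(200):
    for text, label in train:
        x = bow(text)
        pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        err = pred - label
        weights = [w - lr * err * xi for w, xi in zip(weights, x)]
        bias -= lr * err

def predict(text):
    """Probability that `text` belongs to class 1 (spam)."""
    x = bow(text)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

print(round(predict("win a free prize"), 2))       # close to 1 (spam)
print(round(predict("lunch meeting at noon"), 2))  # close to 0 (not spam)
```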

NLP Text Summarization

Text summarization converts a long document into a shorter summary while preserving the main meaning and important information. The primary aims are to save time, reduce text length, and help users quickly understand content.

Types of Text Summarization: There are two main approaches:

1. Extractive Text Summarization

  • Selects important sentences directly from the original text.
  • Does not create new sentences.
  • Works by finding sentences with high importance or score.
  • Common techniques: TF-IDF, TextRank.
  • Simple and reliable, though summaries can sometimes lack smoothness.

Example: Selecting three important sentences from a long paragraph.
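
A minimal extractive sketch: sentences are scored by summed word frequency, a simple stand-in for the TF-IDF/TextRank scoring mentioned above. The sample text is an assumption, and this scoring favors longer sentences, one reason real systems normalize or use TextRank:

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Pick the top-scoring sentences and return them in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))
    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return " ".join(s for s in sentences if s in top)

text = ("NLP systems process text. Summarization shortens text while "
        "keeping key information. Some sentences add little. "
        "Extractive methods pick important sentences from the text.")
print(extractive_summary(text))
```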

2. Abstractive Text Summarization

  • Generates new sentences to express the core meaning.
  • Understands the content and then produces a concise summary.
  • Similar to how humans write summaries.
  • Advanced models used: Seq2Seq, Transformer models (T5, BART, Pegasus).
  • Produces more fluent and meaningful summaries but is more complex.

Text summarization is a useful NLP technique for handling large volumes of text efficiently.