Natural Language Processing Fundamentals and Applications

Understanding Ambiguity in NLP

Ambiguity occurs when a word, phrase, or sentence has more than one possible meaning. It is present at all levels of NLP (lexical, syntactic, semantic, discourse, and pragmatic).

  • Example 1: “The chicken is ready to eat” – is the chicken (food) ready to be eaten, or is the chicken (bird) ready to eat something?
  • Example 2: “The man saw the girl with the telescope” – who has the telescope?

Types of Ambiguity

  1. Lexical Ambiguity – A word having multiple meanings (e.g., bat, bank).
  2. Syntactic (Structural) Ambiguity – Multiple possible parse structures (attachment and scope ambiguity).
  3. Semantic Ambiguity – Even after syntax is resolved, sentences have multiple interpretations.
  4. Discourse Ambiguity – Ambiguity in pronoun/reference resolution (“The horse ran up the hill. It was steep.”).
  5. Pragmatic Ambiguity – Meaning depends on context, intention, beliefs, and real-world knowledge.

Ambiguity is one of the major challenges in NLP and needs techniques like POS-tagging, probabilistic models, and Word Sense Disambiguation (WSD) to resolve.
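
As a small illustration of WSD, the sketch below applies NLTK’s simplified Lesk algorithm to disambiguate the word “bank” (this assumes NLTK and its WordNet data are available, and Lesk is only one of many WSD techniques):

```python
# Word Sense Disambiguation with NLTK's simplified Lesk algorithm.
# Assumes NLTK is installed and the 'wordnet' data package has been downloaded.
from nltk.wsd import lesk

context = "I went to the bank to deposit my money".split()
sense = lesk(context, "bank")

print(sense)               # whichever WordNet synset Lesk selects for "bank"
print(sense.definition())  # gloss (dictionary definition) of that sense
```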

Text Classification and Categorization

Text Classification (also called text categorization) is the task of automatically assigning documents to predefined categories.

Examples of Text Classification

  • Spam vs. non-spam email
  • News categorization (sports, politics, entertainment)
  • Sentiment classification (positive/negative)

Uses of Text Classification

  1. Filtering content
  2. Spam filtering
  3. Survey coding
  4. Topic spotting
  5. Document identification

It is commonly used in information retrieval, sentiment analysis, and many NLP applications.
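
A minimal sketch of how such a classifier can be built with scikit-learn (assumed installed; the tiny spam/ham data set below is invented for illustration):

```python
# Bag-of-words Naive Bayes text classifier on a toy spam/ham data set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "win a free prize now",               # spam
    "limited offer click here",           # spam
    "meeting rescheduled to monday",      # ham
    "please review the attached report",  # ham
]
train_labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

print(model.predict(["claim your free offer now"]))        # expected: ['spam']
print(model.predict(["please review the meeting notes"]))  # expected: ['ham']
```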

Finite State Transducers (FST)

A Finite State Transducer (FST) is an extension of a finite automaton that maps between two levels of representation. It processes input strings and produces corresponding output strings.

Key Features of FST

  • Used in morphological parsing, stemming, phonological rules, etc.
  • Consists of states, transitions, input symbols, and output symbols.
  • Each transition reads an input symbol and writes an output symbol (either may be the empty symbol ε).

Examples of FST Applications

  • Converting cats → cat + plural
  • Mapping base form to inflected form or vice-versa
  • Lexicon construction and Porter stemming implementation

FSTs are powerful tools in NLP for analyzing and generating word forms.
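
The sketch below hand-codes a tiny transducer as a transition table to show the idea; the states and symbols are made up for illustration and are not taken from any standard FST toolkit:

```python
# A toy FST that maps surface forms to lexical forms, e.g. "cats" -> "cat+N+PL".
transitions = {
    # (state, input symbol) -> (next state, output string)
    ("q0", "c"): ("q1", "c"),
    ("q1", "a"): ("q2", "a"),
    ("q2", "t"): ("q3", "t"),
    ("q3", "s"): ("q4", "+N+PL"),  # the plural 's' is rewritten as morphological tags
}
final_output = {"q3": "+N+SG", "q4": ""}  # extra output emitted in accepting states

def transduce(word):
    state, output = "q0", ""
    for symbol in word:
        state, out = transitions[(state, symbol)]  # KeyError = word rejected
        output += out
    return output + final_output[state]

print(transduce("cat"))   # cat+N+SG
print(transduce("cats"))  # cat+N+PL
```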

Stemming Techniques in Text Processing

Stemming is the process of reducing a word to its root form by removing prefixes or suffixes without considering the context. It is an approximation method and may produce non-dictionary words.

Examples of Stemming

  • studies → studi
  • writing → writ

Characteristics of Stemming

  • Fast and simple
  • Accuracy is lower than lemmatization
  • Works well when exact meaning is not important (e.g., spam detection)

Common Algorithm: The Porter Stemmer uses a series of rules to strip suffixes.
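
A quick way to try this is NLTK’s implementation of the Porter Stemmer (NLTK assumed installed); note that the outputs are stems, not necessarily dictionary words:

```python
# Stemming a few words with NLTK's PorterStemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["studies", "caresses", "ponies", "running"]:
    print(word, "->", stemmer.stem(word))
# studies -> studi, caresses -> caress, ponies -> poni, running -> run
```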

Key Applications of Natural Language Processing

NLP has a wide range of applications as listed in the study material:

  1. Machine Translation – e.g., Google Translate
  2. Information Retrieval – search engines like Google, Yahoo
  3. Text Categorization – spam filtering, content filtering
  4. Information Extraction – extracting structured data from unstructured text
  5. Grammar Checking – spelling and grammar correction (e.g., MS Word)
  6. Sentiment Analysis – detecting emotions/opinions in text
  7. Question Answering Systems – systems that answer natural language questions
  8. Spam Detection – identifying unwanted emails
  9. Chatbots – customer service bots
  10. Speech Recognition – converting speech to text
  11. Text Summarization – generating summaries of long documents

These applications demonstrate the importance of NLP in real-world systems.

N-Gram Statistical Language Models

An N-Gram model is a statistical language model that predicts the next word in a sequence based on the previous N–1 words. It is used to estimate the probability of word sequences.

Key Points of N-Grams

  • An N-gram is a contiguous sequence of n items (words or characters).
  • Unigram (1-gram): P(w₁)
  • Bigram (2-gram): P(w₂ | w₁)
  • Trigram (3-gram): P(w₃ | w₁ w₂)
  • Used in language modeling, spelling correction, text generation, speech recognition, etc.

Purpose: To compute probabilities of sentences and help predict the most likely word sequence.
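
A minimal bigram sketch using maximum-likelihood estimates over a tiny invented corpus (counts only; real models add smoothing for unseen N-grams):

```python
# Bigram language model: P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1}).
from collections import Counter

corpus = ["<s> I like NLP </s>",
          "<s> I like coffee </s>",
          "<s> I love NLP </s>"]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("I", "like"))    # 2/3: "I like" occurs twice, "I" three times
print(bigram_prob("like", "NLP"))  # 1/2: "like NLP" occurs once, "like" twice
```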

Prepositional Phrases and Syntactic Structure

A Prepositional Phrase (PP) is a syntactic structure consisting of a preposition (P) followed by a noun phrase (NP).

Structure: PP = P + NP, where the NP is typically D + N (Determiner + Noun).
Example: “on the table”, “in the city”.

Role in Sentences

  • Adds additional information (place, time, manner).
  • Attaches either to the noun phrase or verb phrase, often causing attachment ambiguity.

Example: “The man saw the girl with the telescope.” – “with the telescope” can attach to the verb “saw” (the man used the telescope) or to the noun phrase “the girl” (the girl had the telescope). Prepositional phrases play a major role in syntactic analysis and semantic interpretation.
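
The attachment ambiguity can be made explicit with a small context-free grammar; the toy grammar below is a hand-written assumption, and NLTK’s chart parser (assumed installed) returns one parse tree per attachment:

```python
# Two parses for the same sentence: PP attached to the VP or to the object NP.
import nltk

grammar = nltk.CFG.fromstring("""
  S  -> NP VP
  VP -> V NP | V NP PP
  NP -> Det N | Det N PP
  PP -> P NP
  Det -> 'the'
  N  -> 'man' | 'girl' | 'telescope'
  V  -> 'saw'
  P  -> 'with'
""")

parser = nltk.ChartParser(grammar)
tokens = "the man saw the girl with the telescope".split()
for tree in parser.parse(tokens):
    print(tree)  # one tree per reading of "with the telescope"
```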

Text Summarization Methods

Text summarization is the process of producing a short and meaningful summary of a longer document while preserving its core content.

Types of Summarization

  • Extractive Summarization – selecting important sentences from the text.
  • Abstractive Summarization – generating new sentences based on an understanding of the content.

Methods and Algorithms

  • LexRank – a graph-based algorithm for extractive summarization (see the sketch after this list).
  • Optimization-based approaches – optimize objective functions to produce concise summaries.
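
A LexRank-style sketch under simplifying assumptions (scikit-learn and networkx assumed installed; the real algorithm thresholds similarities and builds the graph more carefully):

```python
# Graph-based extractive summarization: sentences are nodes, edges are
# TF-IDF cosine similarities, and PageRank centrality picks the summary.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Text summarization shortens documents while keeping the core content.",
    "Extractive methods select the most important sentences from the text.",
    "Abstractive methods generate new sentences based on understanding.",
    "Football is a popular sport played all over the world.",
]

tfidf = TfidfVectorizer().fit_transform(sentences)
similarity = cosine_similarity(tfidf)

graph = nx.from_numpy_array(similarity)   # weighted sentence-similarity graph
scores = nx.pagerank(graph)               # centrality score per sentence

top = sorted(scores, key=scores.get, reverse=True)[:2]
for index in sorted(top):                 # keep original sentence order
    print(sentences[index])
```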

Applications of Summarization

  • News summarization
  • Research paper summaries
  • Search engine snippets
  • Email or report reduction

The goal is to reduce reading time while keeping the essential meaning intact.

Sentiment Analysis and Opinion Mining

Sentiment Analysis, also called Opinion Mining, is the process of identifying and classifying the emotional tone expressed in a text. It determines whether the sentiment is positive, negative, or neutral.

Key Points of Sentiment Analysis

  • Used to analyze behavior, attitude, and the emotional state of the user.
  • Implemented using a combination of NLP and statistics.
  • Works by assigning polarity values (positive/negative/neutral) to words, phrases, or sentences.
  • Uses affective lexicons, machine learning models, or hybrid approaches.
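
As a concrete lexicon-based example, NLTK ships the VADER analyzer; the sketch below assumes NLTK and its vader_lexicon data package are installed:

```python
# Lexicon-based sentiment scoring with NLTK's VADER analyzer.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
reviews = [
    "I love this phone, the camera is great!",
    "The battery is terrible and it keeps crashing.",
]
for text in reviews:
    scores = analyzer.polarity_scores(text)  # neg/neu/pos plus a compound score
    if scores["compound"] >= 0.05:
        label = "positive"
    elif scores["compound"] <= -0.05:
        label = "negative"
    else:
        label = "neutral"
    print(label, scores)
```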

Applications of Sentiment Analysis

  • Product review analysis
  • Social media monitoring
  • Customer feedback processing
  • Market analysis

Sentiment Analysis helps machines understand human emotions within text.

Porter’s Stemming Algorithm

Porter’s Stemmer is the most widely used stemming algorithm for English. It removes common suffixes from words to reduce them to their root or stem form.

Key Points of Porter’s Stemmer

  • Works using a sequence of five phases of rule-based suffix stripping.
  • Each phase applies a set of conditions to remove or replace endings like -ing, -ed, -ly, -ation, etc.
  • Produces stems that may not always be real dictionary words (because stemming is approximate).
  • Example: “studies” → “studi”, “writing” → “writ”.
  • Used where exact word meaning is not needed, such as information retrieval, spam detection, and search engines.

The Porter Stemmer is efficient, simple, and helps reduce morphological variations of words.
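
To illustrate the flavour of rule-based suffix stripping, the sketch below implements a few of the Step 1a rewrite rules only; the full algorithm has five phases with measure-based conditions, so this is illustrative rather than a complete implementation:

```python
# A few Porter Step 1a suffix rules, applied as ordered rewrite patterns.
import re

RULES = [
    (r"sses$", "ss"),  # caresses -> caress
    (r"ies$",  "i"),   # studies  -> studi
    (r"ss$",   "ss"),  # caress   -> caress (unchanged)
    (r"s$",    ""),    # cats     -> cat
]

def strip_suffix(word):
    for pattern, replacement in RULES:
        if re.search(pattern, word):  # first matching rule wins
            return re.sub(pattern, replacement, word)
    return word

for word in ["caresses", "studies", "cats", "caress"]:
    print(word, "->", strip_suffix(word))
# caresses -> caress, studies -> studi, cats -> cat, caress -> caress
```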