Fundamentals of NLP: From Tokenization to Semantics

Part-of-Speech Tagging in NLP

Part-of-Speech (POS) Tagging is the process of assigning a specific grammatical category (such as noun, verb, adjective, or adverb) to each word in a text, based on its definition and context. Since many words function as different parts of speech depending on usage (e.g., “book” as a noun vs. a verb), POS tagging is essential for disambiguation.

The Need for POS Tagging

POS tagging serves as a foundational preprocessing step for complex language tasks:

  • Word Sense Disambiguation: Identifies a word’s role to clarify meaning (e.g., “light” as a noun, adjective, or verb).
  • Syntactic Parsing: A prerequisite for dependency parsing and building parse trees.
  • Feature Engineering: Acts as a critical feature for Sentiment Analysis, Named Entity Recognition (NER), and Question Answering.
  • Information Extraction: Allows systems to filter for specific types of information, such as prioritizing nouns and noun phrases.
  • Text-to-Speech (TTS): Ensures correct pronunciation (e.g., “record” as a noun vs. a verb).
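The disambiguation role described above can be illustrated with a toy rule-based tagger. The lexicon and the context rules below are hand-picked assumptions for illustration; real taggers (e.g., NLTK's `pos_tag` or statistical HMM taggers) learn these decisions from annotated corpora.

```python
# Toy rule-based POS tagger showing context-dependent disambiguation.
# LEXICON is an illustrative assumption, not a real tag dictionary.

LEXICON = {
    "the": {"DET"}, "a": {"DET"},
    "book": {"NOUN", "VERB"}, "flight": {"NOUN"},
    "i": {"PRON"}, "read": {"VERB"},
}

def tag(tokens):
    tags = []
    for word in tokens:
        candidates = LEXICON.get(word.lower(), {"NOUN"})
        if len(candidates) == 1:
            tags.append(next(iter(candidates)))
        else:
            # Disambiguate with the previous tag: after a determiner,
            # prefer the noun reading; otherwise prefer the verb.
            prev = tags[-1] if tags else None
            if prev == "DET" and "NOUN" in candidates:
                tags.append("NOUN")
            elif "VERB" in candidates:
                tags.append("VERB")
            else:
                tags.append(sorted(candidates)[0])
    return list(zip(tokens, tags))

print(tag("I read the book".split()))   # "book" after "the" -> NOUN
print(tag("Book the flight".split()))   # "Book" sentence-initial -> VERB
```

The same word "book" receives different tags depending on its left context, which is exactly the ambiguity POS tagging resolves.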

Tokenization and Its Types

Tokenization is the process of breaking down a sequence of strings into smaller units called tokens. These tokens are the basic building blocks that an NLP model processes. This step transforms unstructured text into a structured format for numerical analysis.

Types of Tokenization

  • Word Tokenization: Splits text based on delimiters like spaces or punctuation. It is simple but can lead to a massive vocabulary size.
  • Character Tokenization: Breaks text into individual characters. It results in a very small vocabulary but loses semantic meaning and creates long sequences.
  • Sub-word Tokenization: The standard for models like BERT and GPT. It breaks rare words into meaningful chunks, balancing vocabulary size and semantic meaning.
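The three strategies can be compared on one word. The sub-word vocabulary below is a hand-picked assumption; real sub-word tokenizers (BPE, WordPiece) learn their vocabulary from a corpus. The segmentation loop is a greedy longest-match-first sketch in the spirit of WordPiece.

```python
# Word, character, and toy sub-word tokenization of the same string.

text = "unhappiness"

word_tokens = text.split()        # ['unhappiness'] -- one huge vocab entry
char_tokens = list(text)          # tiny vocab, but a long, meaning-poor sequence

VOCAB = {"un", "happi", "ness", "happy"}   # assumed learned sub-words

def subword_tokenize(word, vocab):
    """Greedy longest-match-first segmentation (WordPiece-style sketch)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest span first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])          # fall back to a single character
            i += 1
    return pieces

print(subword_tokenize(text, VOCAB))   # ['un', 'happi', 'ness']
```

A rare word like "unhappiness" is split into frequent, meaningful chunks, which is the balance between vocabulary size and semantics described above.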

What is Natural Language Processing (NLP)?

Natural Language Processing (NLP) is a field of Artificial Intelligence focused on the interaction between computers and human language. Its goal is to enable machines to interpret and generate language in a contextually relevant way.

Key Stages of NLP

  1. Morphological & Lexical Analysis: Involves analyzing individual words (lexemes) and their structure, including tokenization and lemmatization.
  2. Syntactic Analysis (Parsing): Checks grammar and word arrangement to map sentence structure via Parse Trees.

Challenges in Natural Language Processing

  1. Ambiguity: Words or sentences with multiple meanings (e.g., “The bat flew away”).
  2. Context and Coreference Resolution: Linking a pronoun back to its correct antecedent is difficult, since it may require tracking entities across an entire document.
  3. Slang, Idioms, and Sarcasm: Models struggle with informalities and non-literal meanings.
  4. Named Entity Recognition (NER) Complexity: Difficulty distinguishing between proper nouns and common nouns (e.g., “Apple” the company vs. the fruit).
  5. Data Sparsity: Lack of digitized text for low-resource languages.
  6. Cultural Nuance and Domain Specificity: Language varies significantly across cultures and professional fields.
Natural Language vs. Programming Language

  • Origin: Natural language evolved naturally; programming languages are man-made/designed.
  • Ambiguity: High in natural language (context matters); low in programming languages (always literal).
  • Redundancy: High in natural language (extra words help); low in programming languages (concise).
  • Syntax/Rules: Flexible in natural language; rigid in programming languages.
  • Goal: Expression vs. task execution.

Derivational and Inflectional Morphology

1. Inflectional Morphology

Adds an affix to express grammatical properties like tense or number without changing the word class (e.g., Cat to Cats).

2. Derivational Morphology

Adds an affix to create a new word, often changing the part of speech or core meaning (e.g., Happy to Happiness).
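The two kinds of morphology can be sketched with a toy suffix analyzer. The suffix tables are simplified assumptions (real morphological analyzers handle spelling changes such as happy → happi-, irregular forms, and many more affixes).

```python
# Toy suffix analysis: classify common English suffixes as inflectional
# (grammatical marking, word class unchanged) or derivational (new word,
# often a new class). Tables are illustrative assumptions.

INFLECTIONAL = {"s": "plural / 3rd person", "ed": "past tense", "ing": "progressive"}
DERIVATIONAL = {"ness": "ADJ -> NOUN", "er": "VERB -> NOUN", "ly": "ADJ -> ADV"}

def analyze(word):
    suffixes = {**INFLECTIONAL, **DERIVATIONAL}
    # Try longer suffixes first so "-ness" wins over "-s".
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            kind = "inflectional" if suffix in INFLECTIONAL else "derivational"
            return word[: -len(suffix)], suffix, kind
    return word, "", "none"

print(analyze("cats"))       # ('cat', 's', 'inflectional')
print(analyze("happiness"))  # ('happi', 'ness', 'derivational')
```

Note the stem "happi" reflects the orthographic y → i change, which this sketch does not undo.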

Context-Free Grammar (CFG)

CFG is a set of recursive rewriting rules used to generate strings in a language. It is defined by a 4-tuple: G = (V, Σ, R, S), representing variables, terminals, production rules, and the start symbol.
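The 4-tuple definition can be made concrete with a minimal grammar and a recursive generator that applies the rewriting rules. The grammar's contents (a toy fragment of English) are an illustrative assumption.

```python
# A minimal CFG G = (V, Sigma, R, S) and a recursive generator.
# V = set of variables (RULES keys), Sigma = TERMINALS, S = "S".
import random

RULES = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"], ["V"]],
    "Det": [["the"], ["a"]],
    "N":   [["dog"], ["cat"]],
    "V":   [["sees"], ["sleeps"]],
}
TERMINALS = {"the", "a", "dog", "cat", "sees", "sleeps"}

def generate(symbol, rng):
    """Expand a symbol by recursively rewriting it with a chosen production."""
    if symbol in TERMINALS:
        return [symbol]
    production = rng.choice(RULES[symbol])
    return [tok for sym in production for tok in generate(sym, rng)]

rng = random.Random(0)
print(" ".join(generate("S", rng)))  # e.g. a derivation like "the cat sleeps"
```

Every derivation starts from S and bottoms out in terminals, which is exactly what the recursive rewriting rules of a CFG describe.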

Semantic Relationships

  • Homonymy: Same spelling/pronunciation, unrelated meanings (e.g., Bank).
  • Polysemy: Same word, related meanings (e.g., Head).
  • Synonymy: Different words, same meaning (e.g., Big/Large).
  • Hyponymy: Specific instance of a general word (e.g., Rose/Flower).
  • Antonymy: Opposite meanings (e.g., Hot/Cold).
  • Hypernymy: Broader category (e.g., Color/Red).
  • Meronymy: Part-of relationship (e.g., Wheel/Car).
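Several of these relations can be encoded in a tiny hand-built semantic network. All entries below are illustrative assumptions; real systems use resources like WordNet, where hypernym chains can be walked the same way.

```python
# A tiny semantic network encoding some of the relations above.
# Entries are illustrative assumptions, not a real lexical database.

HYPERNYM = {"rose": "flower", "flower": "plant", "red": "color"}  # hyponym -> hypernym
ANTONYM  = {"hot": "cold", "cold": "hot"}
MERONYM  = {"wheel": "car"}   # a wheel is part of a car

def hypernym_chain(word):
    """Walk hyponym -> hypernym links up to the most general known category."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

print(hypernym_chain("rose"))   # ['rose', 'flower', 'plant']
print(ANTONYM["hot"])           # 'cold'
```

Hypernymy and hyponymy are the same link viewed from opposite ends: "flower" is the hypernym of "rose", and "rose" a hyponym of "flower".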

Techniques for Semantic Analysis

  1. Lexical Semantics: Word-level analysis including Word Sense Disambiguation and NER.
  2. Formal Semantics: Uses logic and Lambda Calculus to determine meaning based on constituent parts.
  3. Semantic Role Labeling (SRL): Identifies the predicate-argument structure (Who did what to whom).

Morphology: Stems and Affixes

Morphology studies the internal structure of words. Morphemes are divided into:

  • The Stem: The primary part carrying semantic content. Can be free (stands alone) or bound (requires an affix).
  • The Affix: A bound morpheme that modifies the stem’s meaning or category.