NLP Fundamentals: Morphology, Semantics, and Parsing
Word Structure and Components in NLP
In linguistics and Natural Language Processing (NLP), word structure refers to how a word is internally organized from meaningful building blocks. Words are not always indivisible; many are formed by combining smaller units called morphemes, the smallest units of meaning.
Components of Word Structure
- Root / Base: The core element carrying the primary meaning. Example: play in replay, player, and playful.
- Stem: The form to which affixes attach. It may be just the root or a root plus a derivational affix. Example: play is a stem; player is the stem for players.
- Affixes: Bound morphemes added to modify meaning or grammar.
  - Prefix (before the stem): unhappy, replay
  - Suffix (after the stem): happiness, played
  - Infix (inside the stem): rare in English
- Derivational Morphemes: Create new words or change the word class. Example: happy (adjective) → happiness (noun).
- Inflectional Morphemes: Modify grammatical features without changing the core meaning or class. Example: play → played → playing.
Example Breakdown: Unhappiness
- un- (prefix, negation)
- happy (root)
- -ness (suffix, noun formation)
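The breakdown above can be sketched as a simple rule-based segmenter. The prefix and suffix inventories below are illustrative, not an exhaustive list of English affixes:

```python
# A minimal sketch of rule-based morpheme segmentation.
# The affix inventories are illustrative, not exhaustive.
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ness", "ful", "er", "ing", "ed", "s"]

def segment(word):
    """Strip at most one known prefix and one known suffix."""
    morphemes = []
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p) + 2:
            morphemes.append(p + "-")
            word = word[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s) + 2:
            suffix = "-" + s
            word = word[:-len(s)]
            break
    morphemes.append(word)
    if suffix:
        morphemes.append(suffix)
    return morphemes

print(segment("unhappiness"))  # ['un-', 'happi', '-ness']
```

Note that the recovered root surfaces as happi rather than happy: spelling changes at morpheme boundaries (y → i) are one reason real morphological analyzers need more than plain string stripping.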
Importance in NLP
Understanding word structure is essential for:
- Stemming and lemmatization
- Morphological analysis
- Machine translation
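The contrast between the first two applications can be shown in a small sketch: stemming strips suffixes by rule, while lemmatization falls back on a dictionary for irregular forms. The suffix list and lemma table here are illustrative:

```python
# Sketch contrasting stemming (rule-based suffix stripping) with
# lemmatization (dictionary lookup); both mappings are illustrative.
def stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Irregular forms that no suffix rule can recover.
LEMMAS = {"went": "go", "better": "good", "mice": "mouse"}

def lemmatize(word):
    return LEMMAS.get(word, stem(word))

print(stem("playing"))    # play
print(lemmatize("went"))  # go (stemming alone cannot recover this)
```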
Morphological Models and Illustrations
Dictionary Lookup Model
The Dictionary Lookup Model stores all valid word forms in a lexicon (dictionary). Morphological analysis is performed by directly matching the input word with entries in the dictionary.
How it Works
- Input word is received.
- System searches the dictionary.
- If found, it retrieves features (lemma, POS, tense, etc.).
- If not found, it is marked as unknown or an error.
Illustration
Input: went
Lookup Result: Lemma → go; POS → Verb; Tense → Past
Input: cats
Lookup Result: Lemma → cat; POS → Noun; Number → Plural
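The lookup procedure above amounts to a dictionary access; a minimal sketch, with an illustrative two-entry lexicon:

```python
# Minimal dictionary-lookup morphological analyzer.
# The lexicon entries are illustrative.
LEXICON = {
    "went": {"lemma": "go", "pos": "Verb", "tense": "Past"},
    "cats": {"lemma": "cat", "pos": "Noun", "number": "Plural"},
}

def analyze(word):
    # Return stored features if the word is known, else flag it.
    return LEXICON.get(word.lower(), {"error": "unknown word"})

print(analyze("went"))  # {'lemma': 'go', 'pos': 'Verb', 'tense': 'Past'}
print(analyze("goed"))  # {'error': 'unknown word'}
```

The failure on goed illustrates the limitation noted below: the model cannot generalize beyond its stored entries.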
Pros and Cons
- Advantages: Simple, fast, and accurate for known words.
- Limitations: Requires large storage, cannot handle unseen words, and lacks generalization.
Unification-Based Morphology
This model represents words using feature structures (attribute–value pairs). Morphological rules combine stems and affixes through unification (feature matching).
Core Idea
Words are built by merging compatible features.
Illustration
Lexical Entry (Stem): play (Category: Verb, Tense: Base)
Suffix Rule (-ed): Requires: Verb, Adds: Tense = Past
Unification Process: play + ed → played (Verb, Past)
Stem: cat (Category: Noun, Number: Singular)
Suffix (-s): Requires: Noun, Adds: Number = Plural
Result: cats (Noun, Plural)
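The unification steps above can be sketched with feature structures as plain dictionaries; the rule format is a simplified assumption for illustration:

```python
# Sketch of unification-based affixation: a suffix rule unifies with a
# stem's feature structure if its required features match, then adds
# its own features. The rule representation is a simplification.
def unify(stem_feats, rule):
    for key, value in rule["requires"].items():
        if stem_feats.get(key) != value:
            return None  # unification fails: incompatible features
    result = dict(stem_feats)
    result.update(rule["adds"])
    return result

past_ed = {"requires": {"cat": "Verb"}, "adds": {"tense": "Past"}}
plural_s = {"requires": {"cat": "Noun"}, "adds": {"number": "Plural"}}

play_stem = {"cat": "Verb", "tense": "Base"}
cat_stem = {"cat": "Noun", "number": "Singular"}

print(unify(play_stem, past_ed))  # {'cat': 'Verb', 'tense': 'Past'}
print(unify(cat_stem, past_ed))   # None: -ed does not attach to nouns
```

The None result shows how feature matching blocks ill-formed combinations such as attaching a verbal suffix to a noun.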
Pros and Cons
- Advantages: Handles unseen forms and offers compact representation.
- Limitations: Computationally complex and requires detailed grammar rules.
Morphology in Language Modeling and Semantics
How Morphological Structure Helps Language Modeling
Morphological structure improves language models by capturing internal word patterns rather than treating words as isolated tokens.
- Reduces Data Sparsity: Different word forms share the same root (e.g., play, plays, played).
- Better Generalization: Models can understand unseen forms (e.g., inferring walked from walk).
- Improved Probability Estimation: Models learn shared features instead of independent probabilities.
- Handles Morphologically Rich Languages: Essential for languages like Hindi, Turkish, and Finnish.
- Efficient Vocabulary Usage: Breaking words into morphemes results in a smaller vocabulary.
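The sparsity and vocabulary points above can be made concrete with a toy comparison: word-level modeling needs one entry per surface form, while morpheme-level modeling shares the stem across forms. The suffix list and `##` subword convention are illustrative:

```python
# Toy illustration of why morpheme-level units shrink the vocabulary.
words = ["play", "plays", "played", "playing", "walk", "walks", "walked"]

word_vocab = set(words)  # one entry per surface form: 7 tokens
morph_vocab = set()
for w in words:
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix):
            # Split into stem + suffix unit (marked "##" by convention).
            morph_vocab.update([w[: -len(suffix)], "##" + suffix])
            break
    else:
        morph_vocab.add(w)

print(len(word_vocab), len(morph_vocab))  # 7 vs 5
print(sorted(morph_vocab))  # ['##ed', '##ing', '##s', 'play', 'walk']
```

Seven surface forms reduce to five shared units, and a form never seen in training, such as walking, would still decompose into known pieces.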
Handling Semantics in NLP
Semantics deals with meaning interpretation through various techniques:
- Lexical Semantics: Understanding word relations like synonymy, antonymy, and polysemy (e.g., WordNet).
- Distributional Semantics: Meaning from context using word embeddings (e.g., king – man + woman ≈ queen).
- Compositional Semantics: Meaning of phrases from their parts (e.g., “red apple”).
- Semantic Role Labeling (SRL): Identifies “who did what to whom” (Agent, Action, Object).
- Named Entity Recognition (NER): Identifies real-world entities like people or locations.
- Contextual Models: Modern transformers (BERT, GPT) capture context-dependent meaning.
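The distributional-semantics analogy can be sketched with hand-made two-dimensional vectors (the dimensions loosely encode "royalty" and "gender"); real embeddings are learned from corpora and have hundreds of dimensions:

```python
# Toy vector-arithmetic analogy: king - man + woman ~ queen.
# The vectors are hand-made for illustration, not learned.
vectors = {
    "king":  (0.9, 0.9),
    "queen": (0.9, 0.1),
    "man":   (0.1, 0.9),
    "woman": (0.1, 0.1),
}

def add(a, b): return (a[0] + b[0], a[1] + b[1])
def sub(a, b): return (a[0] - b[0], a[1] - b[1])

target = add(sub(vectors["king"], vectors["man"]), vectors["woman"])

# Nearest stored vector to the target, by squared Euclidean distance.
nearest = min(vectors, key=lambda w: (vectors[w][0] - target[0]) ** 2
                                     + (vectors[w][1] - target[1]) ** 2)
print(nearest)  # queen
```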
Multilingual Tokenization and Parsing Challenges
Tokenization and parsing are complex in multilingual content due to differences in script, grammar, and morphology.
Tokenization Challenges
- Script Diversity: Different writing systems (Latin, Devanagari, Arabic).
- Word Boundary Ambiguity: Languages like Chinese (我喜欢学习) lack spaces.
- Agglutinative Languages: Long complex words formed by many morphemes (e.g., Turkish).
- Clitics and Contractions: Examples like l’amour or don’t.
- Code-Switching: Mixing languages (e.g., “Kal meeting hai”).
- Entities and Emojis: Ensuring @OpenAI or emojis are not split incorrectly.
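Two of the challenges above, clitic splitting and keeping entities whole, can be sketched with a small regex tokenizer. The patterns are illustrative and handle only the English n't clitic and @-mentions:

```python
import re

# Sketch of a regex tokenizer that keeps @-mentions whole and splits
# off the English clitic "n't"; the patterns are illustrative.
TOKEN = re.compile(r"@\w+|\w+n't|\w+|[^\w\s]")

def tokenize(text):
    tokens = []
    for tok in TOKEN.findall(text):
        if tok.endswith("n't") and len(tok) > 3:
            tokens.extend([tok[:-3], "n't"])  # don't -> do + n't
        else:
            tokens.append(tok)
    return tokens

print(tokenize("Follow @OpenAI, don't stop!"))
# ['Follow', '@OpenAI', ',', 'do', "n't", 'stop', '!']
```

Note the alternation order matters: `@\w+` must come before `\w+` so that the mention is matched as one token rather than split at the `@`.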
Parsing Challenges
- Grammatical Variations: Different word orders like SVO (English) vs. SOV (Hindi).
- Rich Morphology: Case, gender, and tense encoded in suffixes.
- Structural Ambiguity: One sentence resulting in multiple parse trees.
- Resource Scarcity: Lack of annotated corpora for many languages.
- Idioms: Literal parsing fails for expressions like “kick the bucket.”
Predicate-Argument Structure Examples
A predicate expresses an action or state, while arguments are the participating entities.
- Example 1: Ram ate a mango.
  Predicate: ate; Arguments: Agent (Ram), Theme (mango).
  Structure: eat(Ram, mango)
- Example 2: She gave him a book.
  Predicate: gave; Arguments: Agent (She), Recipient (him), Theme (book).
  Structure: give(She, him, book)
- Example 3: The boy kicked the ball.
  Predicate: kicked; Arguments: Agent (boy), Theme (ball).
  Structure: kick(boy, ball)
- Example 4 (State): The sky is blue.
  Predicate: is; Arguments: Theme (sky), Attribute (blue).
  Structure: be(sky, blue)
- Example 5 (Location): They live in Delhi.
  Predicate: live; Arguments: Agent (They), Location (Delhi).
  Structure: live(They, Delhi)
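Predicate-argument structures like these can be represented as simple records; the role labels follow the examples above, and the record type here is an illustrative choice:

```python
from collections import namedtuple

# Sketch of predicate-argument structures as simple records; the
# role labels (Agent, Theme, Recipient, ...) follow the examples above.
Pred = namedtuple("Pred", ["predicate", "args"])

examples = [
    Pred("eat",  {"Agent": "Ram", "Theme": "mango"}),
    Pred("give", {"Agent": "She", "Recipient": "him", "Theme": "book"}),
    Pred("be",   {"Theme": "sky", "Attribute": "blue"}),
]

for p in examples:
    arg_list = ", ".join(p.args.values())
    print(f"{p.predicate}({arg_list})")  # e.g., eat(Ram, mango)
```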
