NLP Applications, Challenges, and Morphological Models Explained

SET-2

1. Applications and Challenges in NLP

Applications of Natural Language Processing (NLP)

Natural Language Processing is widely used across industries to enable machines to understand and generate human language. Major applications include:

  • Machine Translation: Converts text between languages.
  • Sentiment Analysis: Used to detect opinions in reviews and social media.
  • Chatbots and Virtual Assistants: For automated customer support.
  • Speech Recognition: In voice-controlled systems.
  • Text Summarization: To condense long documents.
  • Information Retrieval: In search engines.
  • Question Answering Systems.
  • Spam Filtering and Text Classification.
  • Named Entity Recognition.
  • Healthcare NLP: For analyzing clinical records.

Challenges in NLP

Despite advancements, NLP faces several difficulties:

  1. Ambiguity: Lexical, syntactic, and semantic ambiguity makes interpretation complex.
  2. Context Understanding: Pragmatic meaning and discourse context remain hard for machines.
  3. Figurative Language: Handling sarcasm, idioms, and metaphor is challenging.
  4. Multilingual Processing: Requires managing diverse grammars and scripts.
  5. Data Scarcity: Annotated data is scarce for low-resource languages, limiting performance.
  6. Text Noise: Spelling errors and slang affect accuracy.
  7. Other Issues: Bias in training data, domain adaptation, coreference resolution, world knowledge integration, and maintaining privacy and ethics.

2. Explain One Morphological Model: Unification-Based Model

A. Unification-Based Morphological Model

A Unification-Based Morphological Model represents words using feature structures rather than simple string rules. Each morpheme (root, prefix, suffix) is described by a bundle of grammatical features such as:

  • Category (Noun, Verb)
  • Number (Singular, Plural)
  • Tense (Past, Present)
  • Gender / Person / Case, etc.

Instead of just attaching affixes, the model checks whether the features are compatible. Word formation happens through unification, a process that merges feature structures if they do not conflict.

How It Works
  1. Lexicon Entries: Each morpheme is stored with features.

    Example: play → {cat: verb, tense: base}; -ed → {requires: verb, tense: past}

  2. Unification Process: The suffix -ed can combine only if the stem has {cat: verb}. Features merge → played → {cat: verb, tense: past}.
  3. Constraint Checking: Invalid combinations are blocked.

    Example: happy (adjective) + -ed (requires verb) ❌

Further Example:

  • walk → {cat: verb}
  • -ing → {tense: progressive, requires: verb}

Unification → walking → {cat: verb, tense: progressive}
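The steps above can be sketched in a few lines of Python. This is a minimal illustration, not any standard library's API: the feature names (cat, tense, requires) follow the examples in this section, and for simplicity an affix's features override the stem's defaults (so tense: base on play gives way to tense: past from -ed), while a full unification formalism would treat any conflicting values as failure.

```python
# Minimal sketch of feature-structure unification for morphology.
# Feature names follow the examples above; this is illustrative code,
# not an implementation from any particular toolkit.

def unify(stem, affix):
    """Merge a stem's and an affix's feature structures, or return None."""
    # Constraint checking: the affix may require a category on the stem,
    # e.g. -ed carries {"requires": "verb"}.
    required = affix.get("requires")
    if required is not None and stem.get("cat") != required:
        return None  # invalid combination blocked (e.g. happy + -ed)
    merged = dict(stem)
    for key, value in affix.items():
        if key != "requires":          # "requires" is a constraint,
            merged[key] = value        # not a surface feature; affix
    return merged                      # features override stem defaults

play = {"cat": "verb", "tense": "base"}
ed = {"requires": "verb", "tense": "past"}
happy = {"cat": "adj"}

print(unify(play, ed))   # {'cat': 'verb', 'tense': 'past'}  -> "played"
print(unify(happy, ed))  # None -> *happied is blocked
```

The same function covers the walk + -ing example: unifying {cat: verb} with {requires: verb, tense: progressive} yields {cat: verb, tense: progressive}.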

Advantages and Limitations

Advantages:

  • ✔ Captures agreement & grammar constraints.
  • ✔ Handles complex inflectional systems.
  • ✔ Prevents invalid word forms.

Limitations:

  • ✖ Computationally heavier.
  • ✖ Requires detailed feature annotation.

3. Challenging Issues of Morphological Models

A. Difficulties in Morphological Processing

Morphological models aim to analyze and generate word structures, but several challenges complicate accurate processing:

  1. Ambiguity: A single surface word may have multiple morphological interpretations.

    Example: “unlockable” → un + lockable (not able to be locked) OR unlock + able (able to be unlocked). Resolving the ambiguity requires contextual and syntactic clues.

  2. Allomorphy: Morphemes often appear in different surface forms.

    Example (Plural Suffix): cats (/s/), dogs (/z/), buses (/ɪz/). Models must capture phonological variation rules.

  3. Irregular Forms: Languages contain exceptions that break standard rules (e.g., go → went, mouse → mice). Rule-based systems struggle; lexicon-heavy models increase complexity.
  4. Data Sparsity: Rare word forms or low-resource languages lack sufficient training data, reducing model accuracy.
  5. Rich Morphology: Highly inflected or agglutinative languages (e.g., Turkish, Finnish) produce long complex words, increasing segmentation difficulty.
  6. Morphophonemic Changes: Word formation may alter stems (e.g., city → cities, run → running). Requires integration of phonology and morphology.
  7. Compounding: Multiple roots combine into one word (e.g., toothbrush). Boundaries are not always obvious.
  8. Domain & Language Variation: Slang, neologisms, and dialect differences challenge fixed-rule models.
  9. Computational Complexity: Feature-rich or unification-based models demand high processing resources.
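The ambiguity problem in point 1 is easy to demonstrate with a toy exhaustive segmenter. The lexicon below is hypothetical and deliberately tiny; a real analyzer would also score the candidate analyses, which is exactly where the contextual clues mentioned above come in.

```python
# Toy exhaustive morpheme segmenter illustrating morphological ambiguity.
# The lexicon is hypothetical; entries like "lockable" stand in for
# derived stems a real system might list.

LEXICON = {"un", "lock", "able", "unlock", "lockable"}

def segmentations(word, lexicon=LEXICON):
    """Return every way to split `word` into morphemes from `lexicon`."""
    if not word:
        return [[]]  # one way to segment the empty string: no morphemes
    results = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in lexicon:
            for rest in segmentations(word[i:], lexicon):
                results.append([prefix] + rest)
    return results

for analysis in segmentations("unlockable"):
    print(analysis)
# Both readings surface among the analyses, e.g.
# ['un', 'lockable'] vs. ['unlock', 'able'] -- the segmenter alone
# cannot decide between them.
```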

4. Various Types of Parsers in NLP

A. NLP Parsing Techniques

Parsing is the process of analyzing sentence structure according to grammatical rules. Different parsers are used depending on linguistic goals and computational methods.

  • Top-Down Parser: Starts from the start symbol (S) and expands rules to match the sentence. Builds the parse tree from root → leaves. Simple to implement, but may suffer from heavy backtracking and loops on left-recursive rules.
  • Bottom-Up Parser: Begins with the input words and combines them into higher structures. Builds tree from leaves → root. Avoids unnecessary expansions but may create spurious parses.
  • Chart Parser: Uses a chart (table) to store intermediate results, preventing repeated computations. Efficient for ambiguous sentences (common algorithms: Earley, CYK).
  • Dependency Parser: Focuses on word-to-word relationships (head–dependent). Useful for free word order languages. Example: In “She eats apples,” eats is the head, She & apples are dependents.
  • Constituency Parser (Phrase Structure Parser): Identifies phrases (NP, VP, PP) and produces a hierarchical tree structure.
  • Rule-Based Parser: Uses handcrafted grammar rules. Accurate but hard to scale.
  • Statistical Parser: Uses probabilities from treebanks and chooses the most likely parse among alternatives.
  • Neural Parser: Based on deep learning models (RNN, Transformer). High accuracy, handles ambiguity well, requires large data.
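A top-down parser with backtracking can be sketched in a few lines. The toy grammar and lexicon below are illustrative only (the sentence is a variant of the dependency example above, with a determiner added so the NP → Det N rule applies); a full parse succeeds only when no input words are left over.

```python
# Minimal top-down (recursive-descent) parser with backtracking.
# Grammar:  S -> NP VP ;  NP -> Det N | Pro ;  VP -> V NP
# The grammar, lexicon, and sentence are illustrative, not from a treebank.

GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["Pro"]],
    "VP": [["V", "NP"]],
}
LEXICON = {"she": "Pro", "eats": "V", "the": "Det", "apples": "N"}

def parse(symbol, words):
    """Expand `symbol` over `words`; return (tree, remaining words) or None."""
    if symbol in LEXICON.values():               # pre-terminal: match one word
        if words and LEXICON.get(words[0]) == symbol:
            return (symbol, words[0]), words[1:]
        return None
    for rule in GRAMMAR.get(symbol, []):          # try each expansion in turn
        children, rest = [], words                # (this is the backtracking)
        for part in rule:
            result = parse(part, rest)
            if result is None:
                break                             # rule fails; try the next one
            child, rest = result
            children.append(child)
        else:
            return (symbol, children), rest
    return None

tree, leftover = parse("S", ["she", "eats", "the", "apples"])
print(tree)       # nested (symbol, children) tuples, root "S"
print(leftover)   # [] -> the whole sentence was consumed
```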

5. Multilingual Issues in Detail

Multilingual NLP Challenges

Multilingual NLP deals with processing multiple languages, which introduces unique linguistic and technical challenges.

  1. Linguistic Diversity: Languages differ significantly in morphology (English vs. Turkish), syntax (SVO vs. SOV), and word order flexibility. Models trained on one language often fail to generalize.
  2. Script and Encoding Variations: Different writing systems (Latin, Devanagari, Arabic, Chinese characters) introduce issues with Unicode handling, tokenization, and font normalization.
  3. Tokenization Challenges: Word boundaries vary (space-separated in English vs. no spaces in Chinese/Thai), requiring language-specific segmentation techniques.
  4. Morphological Complexity: Richly inflected languages generate many word forms, leading to data sparsity and increased model complexity.
  5. Ambiguity Differences: Lexical and syntactic ambiguity varies across languages, affecting translation and parsing accuracy.
  6. Low-Resource Languages: Many languages lack large corpora, annotated datasets, or treebanks, leading to poor performance compared to high-resource languages.
  7. Code-Switching: Mixing languages in one sentence (e.g., “I will call you kal.”, with Hindi kal, ‘tomorrow’) is difficult for tagging, parsing, and sentiment analysis.
  8. Translation & Alignment Issues: Word-to-word mapping is not always direct due to idioms, cultural expressions, and structural mismatches.
  9. Cultural & Semantic Variations: Meaning depends on culture and context; sentiment polarity or politeness may vary by language.
  10. Bias & Fairness: Models often favor dominant languages (English-centric bias), reducing inclusivity and accuracy for others.
  11. Evaluation Difficulties: Lack of standardized benchmarks across languages complicates fair comparison.