Understanding NLP: From Tokenization to Grammar Parsing
What is Natural Language Processing?
Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that helps a computer understand, analyze, and generate human language (such as English or Hindi) in the form of text or speech. It allows computers to communicate with humans in natural language by analyzing natural-language input and converting it into a useful form.
- Input: Text or Speech
- Output: Meaningful information, response, or action
Examples of NLP in Real Life
- Google Assistant / Alexa
- Chatbots
- Email spam detection
- Language translation
- Search engines
Stages of NLP
1. Lexical Analysis
Also called morphological analysis, this is the first stage of NLP. It divides text into paragraphs, sentences, and words and identifies lexemes (meaningful word units).
2. Syntactic Analysis (Parsing)
Checks grammar and sentence structure, finds relationships among words, and rejects grammatically incorrect sentences.
3. Semantic Analysis
Deals with the meaning of words and sentences. It focuses on literal or dictionary meaning to ensure a sentence is meaningful.
4. Discourse Integration
This stage looks beyond a single sentence: the meaning of a sentence can depend on the sentences before and after it. It also resolves pronouns (he, she, it) to the entities they refer to.
5. Pragmatic Analysis
The last stage of NLP. It deals with real-world knowledge and intention, understanding what is actually meant rather than just what is said.
Understanding Tokenization
Tokenization is the process of breaking a given text into smaller meaningful units called tokens (sentences, words, sub-words, or characters). It is the first and most important step in NLP because machines cannot understand raw text directly.
Types of Tokenization
- Sentence Tokenization: Splits text into sentences.
- Word Tokenization: Splits text into individual words (most common).
- Subword Tokenization: Breaks words into smaller meaningful parts (used in BERT and GPT).
- Character Tokenization: Splits text into individual characters.
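The sentence and word tokenization described above can be sketched with plain regular expressions. This is a minimal illustration, not a production tokenizer; real systems use trained tokenizers (e.g., NLTK's sentence tokenizer, or subword schemes like the ones BERT and GPT use).

```python
import re

def sentence_tokenize(text):
    # Split on sentence-ending punctuation (. ! ?) followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize(sentence):
    # Capture words (allowing internal apostrophes), numbers, and punctuation.
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+|[^\w\s]", sentence)

text = "NLP is fun. Isn't it?"
print(sentence_tokenize(text))     # ["NLP is fun.", "Isn't it?"]
print(word_tokenize("Isn't it?"))  # ["Isn't", 'it', '?']
```

Note how even this toy version must decide whether "Isn't" is one token or two — exactly the kind of choice that makes tokenization a non-trivial first step.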
Natural Language vs. Programming Language
| Natural Language | Programming Language |
|---|---|
| Vocabulary is very large and flexible | Vocabulary is very small and fixed |
| Easily understood by humans | Easily understood by machines |
| Inherently ambiguous | Unambiguous |
| Grammar rules are not strict | Grammar rules are very strict |
| Allows redundancy | No redundancy |
| Meaning depends on context | Meaning is independent of context |
| Evolves naturally over time | Changes only when updated |
Challenges in NLP
- Contextual Words and Homonyms: Words like “run” have different meanings based on context.
- Synonyms: Different words with similar meanings can cause confusion.
- Irony and Sarcasm: Systems struggle to detect when the literal meaning is the opposite of the intended meaning.
- Ambiguity: Sentences can have multiple interpretations.
- Errors: Spelling, grammar, and accent issues in human input.
- Idioms and Slang: Phrases that lack literal meaning and are culture-specific.
- Domain-Specific Language: Models trained for one field (e.g., medical) may fail in another (e.g., legal).
- Low-Resource Languages: Lack of data for non-popular languages.
Part-of-Speech (POS) Tagging
POS tagging is the process of assigning a grammatical label (Noun, Verb, Adjective, etc.) to each word in a sentence. It is essential for understanding word roles, reducing ambiguity, and improving higher-level tasks like machine translation.
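A toy tagger makes the idea concrete: look each word up in a small hand-built lexicon, and fall back on suffix heuristics for unknown words. Real taggers are statistical or neural (e.g., HMM- or transformer-based), but the input/output shape is the same. The lexicon and rules below are illustrative assumptions.

```python
# Tiny hand-built lexicon mapping words to POS tags (illustrative only).
LEXICON = {
    "the": "DET", "a": "DET", "dog": "NOUN", "cat": "NOUN",
    "runs": "VERB", "barks": "VERB", "quickly": "ADV", "big": "ADJ",
}

def tag(words):
    tags = []
    for w in words:
        lw = w.lower()
        if lw in LEXICON:                      # known word: dictionary lookup
            tags.append((w, LEXICON[lw]))
        elif lw.endswith("ly"):                # suffix heuristic for adverbs
            tags.append((w, "ADV"))
        elif lw.endswith(("ing", "ed")):       # suffix heuristic for verbs
            tags.append((w, "VERB"))
        else:
            tags.append((w, "NOUN"))           # default guess
    return tags

print(tag(["The", "big", "dog", "barks", "loudly"]))
# [('The', 'DET'), ('big', 'ADJ'), ('dog', 'NOUN'), ('barks', 'VERB'), ('loudly', 'ADV')]
```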
Morphology and Word Formation
Morphology is the study of how words are formed. A morpheme is the smallest unit of meaning. Morphemes are classified as Stems (free morphemes that carry core meaning) or Affixes (bound morphemes like prefixes, suffixes, infixes, and circumfixes).
Types of Morphology
- Derivational: Creates new words or changes word class (e.g., happy to happiness).
- Inflectional: Changes the grammatical form without changing the word class (e.g., play to played).
- Compounding: Combining two independent words to form a new one (e.g., blackboard).
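The stem/affix split can be sketched as a simple suffix stripper. This is a deliberately naive sketch (the suffix lists are illustrative assumptions, and it cannot undo spelling changes like *happy* to *happi-*); real morphological analyzers use finite-state transducers or learned models.

```python
# Toy suffix stripper illustrating inflectional vs. derivational affixes.
INFLECTIONAL = ["ed", "ing", "s"]       # change grammatical form only
DERIVATIONAL = ["ness", "ment", "er"]   # create new words / change word class

def analyze(word):
    for suffix in DERIVATIONAL + INFLECTIONAL:
        # Require a reasonably long remaining stem to avoid absurd splits.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            kind = "derivational" if suffix in DERIVATIONAL else "inflectional"
            return (word[: -len(suffix)], suffix, kind)
    return (word, "", "none")

print(analyze("happiness"))  # ('happi', 'ness', 'derivational')
print(analyze("played"))     # ('play', 'ed', 'inflectional')
```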
Lexical Relationships
- Synonymy: Words with similar meanings (e.g., Big/Large).
- Hyponymy: An “is-a” relationship (e.g., Rose is a Flower).
- Antonymy: Words with opposite meanings (e.g., Hot/Cold).
- Hypernymy: The general category (e.g., Animal is the hypernym of Dog).
- Meronymy: Part-whole relationship (e.g., Wheel is part of a Car).
- Homonymy: Same spelling/pronunciation, unrelated meanings (e.g., Bank).
- Polysemy: One word with multiple related meanings (e.g., Head).
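These relations are exactly what lexical databases such as WordNet store at scale. A miniature version, with hand-built hypernym (is-a) and meronym (part-of) tables as illustrative assumptions, can look like this:

```python
# Tiny hand-built lexicon of semantic relations (illustrative only).
HYPERNYMS = {"rose": "flower", "dog": "animal", "flower": "plant"}  # is-a
MERONYMS = {"wheel": "car", "petal": "flower"}                      # part-of

def hypernym_chain(word):
    # Follow is-a links upward until no more general category exists.
    chain = [word]
    while chain[-1] in HYPERNYMS:
        chain.append(HYPERNYMS[chain[-1]])
    return chain

print(hypernym_chain("rose"))  # ['rose', 'flower', 'plant']
print(MERONYMS["wheel"])       # 'car' — a wheel is part of a car
```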
Context-Free Grammar (CFG)
CFG is a set of rules used to generate correct sentences. It consists of non-terminals (variables), terminals (actual words), production rules, and a start symbol. It is widely used in compiler design and NLP parsing.
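A small CFG can be written directly as a dictionary from non-terminals to lists of production rules, with a function that expands the start symbol until only terminals (actual words) remain. The grammar below is a toy example chosen for illustration.

```python
# A tiny CFG: non-terminals map to lists of productions; anything not in
# the dictionary is a terminal (an actual word).
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"]],
    "Det": [["the"]],
    "N":   [["dog"], ["cat"]],
    "V":   [["chases"]],
}

def generate(symbol="S", choice=0):
    # Expand each non-terminal using the production at index `choice`
    # (modulo the number of alternatives), recursing until terminals remain.
    if symbol not in GRAMMAR:  # terminal: a real word
        return [symbol]
    rule = GRAMMAR[symbol][choice % len(GRAMMAR[symbol])]
    words = []
    for sym in rule:
        words.extend(generate(sym, choice))
    return words

print(" ".join(generate()))          # the dog chases the dog
print(" ".join(generate(choice=1)))  # the cat chases the cat
```

Every sentence this grammar can produce is grammatical by construction, which is exactly why CFGs are useful for both generating and checking sentence structure.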
Probabilistic Context-Free Grammar (PCFG)
PCFG extends CFG by assigning a probability to each production rule. This allows systems to resolve ambiguity by selecting the most likely parse tree based on real-world data, significantly improving accuracy in NLP applications like speech recognition and machine translation.
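The key computation in a PCFG is that a parse tree's probability is the product of the probabilities of the rules used to build it; the parser then prefers the highest-probability tree. The grammar and probabilities below are toy values for illustration.

```python
from math import prod

# Each production carries a probability; the probabilities of all
# productions for one non-terminal sum to 1 (toy values).
PCFG = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 1.0,
    ("VP", ("V", "NP")): 1.0,
    ("Det", ("the",)): 1.0,
    ("N", ("dog",)): 0.7,
    ("N", ("cat",)): 0.3,
    ("V", ("chases",)): 1.0,
}

def tree_probability(tree):
    # A tree is (symbol, [children]); leaf children are plain word strings.
    symbol, children = tree
    labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = PCFG[(symbol, labels)]  # probability of the rule used at this node
    return p * prod(tree_probability(c) for c in children
                    if not isinstance(c, str))

# Parse tree of "the dog chases the cat":
parse = ("S", [("NP", [("Det", ["the"]), ("N", ["dog"])]),
               ("VP", [("V", ["chases"]),
                       ("NP", [("Det", ["the"]), ("N", ["cat"])])])])
print(tree_probability(parse))  # ≈ 0.21 (i.e., 0.7 × 0.3)
```

When a sentence has several possible parse trees, comparing their probabilities this way is how a PCFG parser chooses the most likely reading.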
