Natural Language Processing Core Concepts and Techniques
Text Preprocessing and Feature Space Reduction
Which of the following text preprocessing steps can reduce the dimensionality of a bag-of-words feature space?
- A. Converting all text to lowercase
- B. Removing common stop words (e.g., “the”, “and”, “of”)
- D. Stemming or lemmatizing words (e.g., “running” → “run”)
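For illustration, here is a small pure-Python sketch of how each of these steps shrinks a bag-of-words vocabulary. The three documents, the tiny stop-word list, and the crude suffix-stripping rule are all made up for the example (the rule stands in for a real stemmer such as Porter):

```python
# Sketch: lowercasing, stop-word removal, and crude stemming each shrink the BoW vocabulary.
docs = ["The cat runs", "the cats ran", "A cat Running fast"]

STOP_WORDS = {"the", "a", "and", "of"}  # illustrative stop-word list

def crude_stem(token):
    # Toy suffix stripping, standing in for a real stemmer.
    for suffix in ("ning", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def vocab(docs, lowercase=False, drop_stops=False, stem=False):
    terms = set()
    for doc in docs:
        for tok in doc.split():
            if lowercase:
                tok = tok.lower()
            if drop_stops and tok in STOP_WORDS:
                continue
            if stem:
                tok = crude_stem(tok)
            terms.add(tok)
    return terms

print(len(vocab(docs)))                                               # 9 distinct terms (raw)
print(len(vocab(docs, lowercase=True)))                               # 8 after lowercasing
print(len(vocab(docs, lowercase=True, drop_stops=True)))              # 6 after stop-word removal
print(len(vocab(docs, lowercase=True, drop_stops=True, stem=True)))   # 4 after stemming
```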
Limitations of Bag-of-Words Representation
Which of the following are limitations of the bag-of-words (unigram) text representation?
- A. It ignores the order of words in the text (loses word sequence information).
- B. It can lead to a very high-dimensional and sparse feature space.
- C. It cannot represent words that were not seen in the training corpus (out-of-vocabulary words).
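A two-line illustration of the first limitation: two sentences with opposite meanings receive identical unigram counts.

```python
from collections import Counter

# Opposite meanings, identical bag-of-words vector: word order is discarded.
a = Counter("man bites dog".split())
b = Counter("dog bites man".split())
print(a == b)  # True: the unigram representation cannot tell them apart
```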
Stemming and Lemmatization
Which statement about stemming and lemmatization is true?
- A. Stemming is a crude heuristic that may produce non-real word forms (e.g., “computational” → “comput”), whereas lemmatization uses vocabulary and morphological analysis to return real root forms (e.g., “computational” → “compute”).
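A quick contrast using NLTK; this assumes NLTK is installed and its WordNet data has been downloaded, and the outputs shown in comments are the expected ones:

```python
# Rule-based stemmer vs. dictionary-based lemmatizer.
# Requires `pip install nltk` plus nltk.download("wordnet").
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                   # "studi" -- not a real word
print(lemmatizer.lemmatize("studies"))           # "study" -- valid dictionary form
print(stemmer.stem("running"))                   # "run"
print(lemmatizer.lemmatize("running", pos="v"))  # "run" (needs the verb POS hint)
```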
Supervised Machine Learning Fundamentals
Which of the following were identified as essential components of a supervised machine learning system for text classification?
- A numerical feature representation of the input text.
- A model (classification function) that maps the input features to a prediction.
- An objective (loss) function to optimize during training.
- An algorithm to learn the model’s parameters (e.g., training procedure like SGD).
- An automatic hyperparameter tuning module built into the model. (Note: This is typically not considered an essential core component.)
Which of the following models are generative (as opposed to discriminative)?
- A. An $n$-gram Language Model that assigns probabilities to sequences of words.
- C. A Naive Bayes text classifier for spam detection.
Correct Answer(s): A, C
Classification Models and Optimization
Which statement(s) about logistic regression for binary text classification are true?
- A. Logistic regression outputs a probability (between 0 and 1) that the input belongs to the positive class.
- C. The decision boundary of a logistic regression model is linear in the feature space.
Which of the following statement(s) about Stochastic Gradient Descent (SGD) are true?
- A. SGD updates model weights incrementally using one (or a few) training example(s) at a time, rather than the entire training set.
- B. For large datasets, SGD can reach a good solution faster (in wall-clock time or number of updates) than batch gradient descent.
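The following NumPy sketch ties the two previous questions together: a logistic regression model (a sigmoid over a linear score, hence a linear decision boundary) trained with SGD, updating the weights one example at a time. The dataset, learning rate, and epoch count are toy choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny made-up dataset: 2-D feature vectors X with binary labels y.
X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([0, 0, 1, 1])

w = np.zeros(2)   # model parameters (weights)
b = 0.0           # bias
lr = 0.1          # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stochastic gradient descent: one example per update, not the full batch.
for epoch in range(100):
    for i in rng.permutation(len(X)):
        p = sigmoid(X[i] @ w + b)   # predicted P(y=1 | x), always in (0, 1)
        grad = p - y[i]             # gradient of the cross-entropy loss w.r.t. the score
        w -= lr * grad * X[i]
        b -= lr * grad

# The decision boundary w.x + b = 0 is a line, i.e. linear in the feature space.
print(w, b)
print([round(float(sigmoid(x @ w + b)), 3) for x in X])
```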
Evaluating Classification Performance
In a binary classification task with a highly imbalanced dataset (very rare positive class), which evaluation metric is more informative than raw accuracy?
- B. F1-Score – the harmonic mean of precision and recall.
Correct Answer: B
What is the typical effect of increasing the decision threshold for the positive class (making the classifier more “selective” about predicting positive)?
- A. It increases precision (fewer false positives) but decreases recall (more false negatives).
The F1-score used in evaluating classifiers can be defined as:
- A. The harmonic mean of the precision and recall values.
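A small pure-Python sketch, with made-up scores and labels, showing how precision, recall, and their harmonic mean (F1) respond when the decision threshold is raised:

```python
# Toy classifier scores and gold labels.
scores = [0.95, 0.90, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20]
labels = [1,    1,    0,    1,    1,    1,    0,    0   ]

def prf(scores, labels, threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf(scores, labels, 0.50))  # precision 0.80, recall 0.80
print(prf(scores, labels, 0.85))  # precision rises to 1.00, recall falls to 0.40
```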
Word Embeddings and Vector Representations
Which statement best reflects the distributional hypothesis in linguistics (the basis for word embeddings)?
- A. Words that occur in similar contexts tend to have similar meanings.
Word2Vec Algorithms and Training
Which of the following algorithms are part of the original Word2Vec framework for learning word embeddings?
- A. Skip-gram with Negative Sampling (SGNS)
- B. Continuous Bag-of-Words (CBOW)
How do the CBOW and Skip-gram models differ in the Word2Vec approach?
- A. CBOW predicts a target word given its surrounding context words, whereas Skip-gram predicts the context words given a target word.
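For reference, both variants can be trained with the gensim library. This is a sketch assuming gensim 4.x and a made-up toy corpus; the `sg` flag switches between CBOW and Skip-gram, and `negative` enables negative sampling:

```python
# Requires `pip install gensim` (parameter names follow the gensim 4.x API).
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "cat", "and", "a", "dog", "played"],
]

# sg=0 -> CBOW: predict the target word from its surrounding context words.
cbow = Word2Vec(sentences, vector_size=50, window=2, sg=0, negative=5, min_count=1)

# sg=1 -> Skip-gram (with negative sampling): predict context words from the target word.
sgns = Word2Vec(sentences, vector_size=50, window=2, sg=1, negative=5, min_count=1)

print(cbow.wv["cat"].shape)         # dense 50-dimensional embedding
print(sgns.wv.most_similar("cat"))  # nearest neighbours by cosine similarity
```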
Compared to one-hot vector representations of words, learned dense word embeddings typically offer which advantages?
- A. They usually have a much lower dimensionality than the size of the vocabulary, making computations more efficient.
- B. They capture semantic similarities (words with similar meanings end up with vectors that are close together in space).
What is the main purpose of using negative sampling in training Word2Vec models?
- A. To update the model efficiently by only considering a small random sample of “negative” words (noise examples) for each training instance instead of the entire vocabulary.
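A NumPy sketch of the skip-gram negative-sampling objective for a single (center word, context word) pair. All sizes and vectors are made-up toy values, and real Word2Vec draws negatives from a smoothed unigram distribution rather than uniformly:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 1000, 50, 5  # vocabulary size, embedding dimension, number of negatives

center_vecs = rng.normal(scale=0.1, size=(V, d))   # "input" embeddings
context_vecs = rng.normal(scale=0.1, size=(V, d))  # "output" embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

center, true_context = 17, 42
negatives = rng.integers(0, V, size=k)  # k random "noise" words (toy: uniform sampling)

v_c = center_vecs[center]
# Objective per pair: log sigma(u_o . v_c) + sum over negatives of log sigma(-u_neg . v_c).
# Only 1 + k output vectors are involved, instead of a full softmax over all V words.
loss = -np.log(sigmoid(context_vecs[true_context] @ v_c))
loss -= np.sum(np.log(sigmoid(-context_vecs[negatives] @ v_c)))
print(loss)
```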
Measuring Similarity and Relationships
Cosine similarity is commonly used in the context of word embeddings. What does cosine similarity measure?
- A. The orientation (angle) between two vectors, indicating how similar their directions are (regardless of magnitude).
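A minimal NumPy illustration with toy vectors:

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between u and v: direction only, magnitude cancels out.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 10 * a))                         # 1.0: same direction, different length
print(cosine_similarity(a, np.array([-1.0, -2.0, -3.0])))   # -1.0: opposite direction
```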
Word analogy tasks (e.g., “king : man :: queen : woman”) demonstrate that word embeddings can capture relationships. What vector operation on embeddings is used to solve analogies like “king – man + woman ≈ queen”?
- A. Vector addition and subtraction of the respective word vectors.
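A toy sketch of that arithmetic with made-up 3-D embeddings; real systems search over learned embeddings and likewise exclude the query words from the candidate set:

```python
import numpy as np

# Made-up embeddings, chosen only to show the vector arithmetic.
emb = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "queen": np.array([0.8, 0.1, 0.9]),
    "man":   np.array([0.2, 0.9, 0.1]),
    "woman": np.array([0.2, 0.1, 0.9]),
    "apple": np.array([0.1, 0.1, 0.1]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target = emb["king"] - emb["man"] + emb["woman"]
candidates = [w for w in emb if w not in {"king", "man", "woman"}]
print(max(candidates, key=lambda w: cosine(emb[w], target)))  # "queen"
```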
Language Models (LMs) and Applications
Which of the following applications typically use a language model as a component?
- A. Speech recognition – to predict likely word sequences from ambiguous acoustic input.
- B. Machine translation – to ensure the output is fluent and grammatically correct in the target language.
- C. Spelling or autocorrect systems – to decide if a sequence of words is probable and detect errors.
N-gram Models and Smoothing Techniques
In an $n$-gram language model, what does the Markov assumption state?
- A. The probability of each word depends only on the previous $n-1$ words (the recent context), rather than the entire preceding text.
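A minimal pure-Python sketch of a bigram model estimated from a toy corpus (the sentences are made up): under the Markov assumption, each word's probability is conditioned only on the single previous word.

```python
from collections import Counter

# Bigram LM sketch: P(w_i | history) is approximated by P(w_i | w_{i-1}),
# estimated here with maximum-likelihood counts.
corpus = "<s> the cat sat </s> <s> the dog sat </s> <s> the cat ran </s>".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p(word, prev):
    # Maximum-likelihood estimate: count(prev, word) / count(prev).
    return bigrams[(prev, word)] / unigrams[prev]

print(p("cat", "the"))  # 2/3: "the cat" occurs twice out of three occurrences of "the"
print(p("sat", "cat"))  # 1/2
```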
Why is smoothing necessary in $n$-gram language models?
- A. To avoid assigning zero probability to any sequence of words that didn’t occur in the training data (by redistributing some probability mass to unseen $n$-grams).
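For example, add-one (Laplace) smoothing for a bigram model replaces the maximum-likelihood estimate with $P_{\text{Add-1}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + V}$, where $V$ is the vocabulary size, so every bigram, seen or unseen, receives a small non-zero probability.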
Which of the following is NOT a known smoothing technique for language modeling?
- A. Add-One (Laplace) smoothing
- B. Kneser-Ney smoothing
- C. Interpolation or Backoff smoothing
- D. Stemming smoothing
Correct Answer: D
Evaluating Language Models (Perplexity)
Language Model A and Language Model B are evaluated on the same test dataset. Which result indicates that Model A is better than Model B?
- A. Model A has a lower perplexity on the test set than Model B.
- C. Model A achieves higher accuracy at predicting the next word (treating it like a classification task) than Model B.
- D. Model A uses a larger vocabulary than Model B. (Note: vocabulary size by itself does not indicate which model is better; perplexities are only comparable over the same vocabulary.)
When comparing the perplexity of two language models, which condition must be met to ensure a fair comparison?
- A. Evaluate both models on the same test set using the same vocabulary.
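A small sketch of how perplexity is computed from per-token probabilities; the probability values assigned by the two hypothetical models are made up for illustration.

```python
import math

def perplexity(token_probs):
    # Perplexity = exp of the average negative log-probability per token. Lower is better.
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

model_a = [0.20, 0.10, 0.25, 0.15]  # Model A's probabilities for the same 4 test tokens
model_b = [0.05, 0.10, 0.08, 0.02]  # Model B's probabilities for the same tokens

print(perplexity(model_a))  # ~6.0  -> lower perplexity, better model
print(perplexity(model_b))  # ~18.8 -> higher perplexity, worse model
```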
Neural Language Models and Decoding
Which of the following are advantages of neural language models (e.g., using neural networks and embeddings) compared to traditional $n$-gram models?
- A. They use distributed word representations (embeddings), allowing the model to generalize to word combinations not seen during training (mitigating the sparsity problem).
- D. They can potentially capture longer-range dependencies beyond a fixed context window (for example, RNN/Transformer-based LMs can incorporate a flexible history).
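As an illustration of the first advantage, here is a minimal sketch of a fixed-window feed-forward neural language model, assuming PyTorch is available; the class name, layer sizes, and context length are illustrative choices, not a reference implementation. (A recurrent or Transformer LM would additionally relax the fixed context window, as in option D.)

```python
import torch
import torch.nn as nn

class FFLanguageModel(nn.Module):
    """Fixed-window feed-forward LM: embeddings let the model share statistics
    between similar words, unlike raw n-gram counts."""
    def __init__(self, vocab_size, emb_dim=64, context=3, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # distributed word representations
        self.ff = nn.Sequential(
            nn.Linear(context * emb_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, vocab_size),              # scores over the whole vocabulary
        )

    def forward(self, context_ids):      # context_ids: (batch, context)
        e = self.embed(context_ids)      # (batch, context, emb_dim)
        return self.ff(e.flatten(1))     # (batch, vocab_size) logits for the next word

model = FFLanguageModel(vocab_size=5000)
logits = model(torch.randint(0, 5000, (2, 3)))  # two dummy 3-word contexts
print(logits.shape)                              # torch.Size([2, 5000])
```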
Which of these are common decoding strategies for generating text from a language model?
- A. Greedy decoding: At each step, choose the highest-probability next word.
- B. Beam search: Expand multiple candidate sequences and choose the most probable sequence overall.
- C. Random sampling: Sample the next word according to the model’s predicted probability distribution (possibly with temperature adjustments).
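A toy illustration of greedy decoding and temperature-controlled sampling over a single made-up next-word distribution; beam search is only described in a comment, since it needs a full model and partial hypotheses to be meaningful.

```python
import random

# Hypothetical next-word distribution from one LM step.
probs = {"the": 0.4, "a": 0.3, "cat": 0.2, "ran": 0.1}

# Greedy decoding: always take the single most probable next word.
greedy = max(probs, key=probs.get)

# Random sampling (with a temperature knob): sample from the predicted distribution.
def sample(probs, temperature=1.0):
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return random.choices(list(probs), weights=weights, k=1)[0]

print(greedy)                           # "the"
print(sample(probs, temperature=0.7))   # most often "the", but other words are possible

# Beam search (not shown) would instead keep the k best partial sequences at each
# step and return the highest-probability complete sequence at the end.
```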
