NLP Foundations: From Text Processing to Large Language Models
Week 1: Working with Words
Tokenization:
Splitting text into discrete units (tokens), typically words or punctuation. Techniques vary (a simple split on whitespace vs. advanced tokenizers); challenges include handling punctuation, contractions, multi-word names, and different languages (e.g., Chinese has no spaces between words). Good tokenization is foundational for all NLP tasks.
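A minimal sketch contrasting a naive whitespace split with a regex-based tokenizer (plain Python, no external libraries; the regex is an illustrative choice, not a standard):

    import re

    text = "They arrived in New York City, didn't they?"

    # Naive approach: split on whitespace only; punctuation stays glued to words.
    naive_tokens = text.split()
    # ['They', 'arrived', 'in', 'New', 'York', 'City,', "didn't", 'they?']

    # Slightly better: treat runs of word characters (with optional apostrophes)
    # or single punctuation marks as separate tokens.
    tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
    # ['They', 'arrived', 'in', 'New', 'York', 'City', ',', "didn't", 'they', '?']

    print(naive_tokens)
    print(tokens)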
Bag-of-Words (BoW):
Representing a document by the counts of each word in a predefined vocabulary, ignoring order. The vocabulary is all unique words in the corpus (size = V). Each document becomes a sparse V-dimensional vector whose entries give term frequencies. Example: “the cat sat on the hat” vs. “the dog ate the cat and the hat” yield different count vectors (see the sketch below). BoW features can feed into classifiers but give no sense of word meaning beyond frequency.
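A minimal sketch of the count vectors for the two example sentences, assuming scikit-learn's CountVectorizer is available:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the hat",
            "the dog ate the cat and the hat"]

    # Build the vocabulary from the corpus and count each word per document.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)          # sparse matrix, shape (2, V)

    print(vectorizer.get_feature_names_out())
    # ['and' 'ate' 'cat' 'dog' 'hat' 'on' 'sat' 'the']
    print(X.toarray())
    # [[0 0 1 0 1 1 1 2]
    #  [1 1 1 1 1 0 0 3]]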
Term Frequency–Inverse Document Frequency (TF–IDF):
A weighting scheme to highlight important words in a document by balancing frequency with rarity.
Term Frequency (TF) = count of term t in document d.
Document Frequency (DF) = number of documents in the corpus containing t.
Inverse Document Frequency (IDF) = the inverse of DF, often idf(t) = log(N / df(t)) (N = total number of documents).
TF–IDF = TF × IDF. A high TF–IDF score means the word is frequent in this document and rare in the corpus (thus more indicative of this document’s content). This yields a real-valued vector for each document, often normalized.
Cosine Similarity:
A measure to compare text vectors (e.g., TF–IDF vectors) by their angle. Cosine similarity between vectors A and B is cos(θ) = (A · B) / (‖A‖‖B‖), i.e., the dot product divided by the product of the magnitudes. It ranges from 1 (identical direction) to 0 (orthogonal) to –1 (opposite directions). In text mining, high cosine similarity between TF–IDF vectors indicates that two documents share similar content/terms irrespective of length.
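A minimal sketch computing TF–IDF vectors for the two example sentences and comparing them with cosine similarity, assuming scikit-learn (whose TfidfVectorizer uses a smoothed IDF variant rather than the bare log(N/df) above):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["the cat sat on the hat",
            "the dog ate the cat and the hat"]

    # TF-IDF rows are L2-normalized by default, so the dot product of two rows
    # is already their cosine similarity.
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)

    sim = cosine_similarity(X[0], X[1])[0, 0]
    print(f"cosine similarity: {sim:.3f}")   # shared terms ("the", "cat", "hat") push this above 0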
Part-of-Speech (POS) Tagging:
Assigning each token a tag indicating its grammatical role (noun, verb, adjective, etc.). POS tagging uses dictionaries and learned models to capture context (e.g., “bear” as noun vs. verb). It reveals sentence structure (syntax) by identifying subjects, objects, verbs, etc., which can be useful for downstream tasks (e.g., extracting relationships). For example, in “They arrived in New York City,” “They” is tagged as a pronoun (and is the subject of “arrived”), and “arrived” as a verb.
Lemmatization:
Reducing words to their base or dictionary form (the lemma). This accounts for inflection, conjugation, etc. (e.g., “arrived” → “arrive”; “eyes” → “eye”; “making” → “make”). Unlike a stem, the lemma is always an actual word.
Lemmatization uses a vocabulary and morphological analysis (e.g., the WordNet lemmatizer) and is slower than crude stemming. It helps normalize text for better comparison (treating “cats” and “cat” as the same term, etc.).
Named Entity Recognition (NER):
Identifying spans of text that correspond to named entities – real-world objects such as persons, locations, organizations, dates, monetary values, etc. An NER system tags each entity mention with a category label (e.g., “Julien ROSSI” → PERSON, “New York City” → LOCATION). NER uses context to disambiguate (e.g., “Jordan” as a person vs. the country). Models are often language- and domain-specific and are trained on annotated data (text with entity spans labeled).
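A minimal sketch of POS tagging, lemmatization, and NER in one pass, assuming spaCy and its small English model (en_core_web_sm) are installed:

    import spacy

    # Requires: pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("They arrived in New York City.")

    # POS tags and lemmas for each token.
    for token in doc:
        print(token.text, token.pos_, token.lemma_)
    # e.g. "They" -> PRON they, "arrived" -> VERB arrive, "New" -> PROPN New, ...

    # Named-entity spans with their labels.
    for ent in doc.ents:
        print(ent.text, ent.label_)
    # e.g. "New York City" -> GPE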
Basic Text Classification:
Assigning a text (sentence, document) to a category (spam vs. ham, sentiment classes, topics, etc.). A simple pipeline: convert the text to features (BoW or TF–IDF vectors, possibly with preprocessing such as lowercasing and stopword removal) and then apply a classifier (e.g., Naïve Bayes, logistic regression, SVM). For example, for sentiment analysis on movie reviews, we might use TF–IDF features of words and train a logistic regression to predict “positive” vs. “negative”. Modern approaches may use word embeddings or fine-tuned language models, but classic text classification often starts with these sparse features.
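A minimal sketch of such a pipeline on a toy sentiment dataset (the texts and labels are made up for illustration), assuming scikit-learn:

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Toy training data (illustrative only).
    texts = ["a wonderful, moving film", "terrible acting and a dull plot",
             "I loved every minute", "boring and far too long"]
    labels = ["positive", "negative", "positive", "negative"]

    # TF-IDF features followed by a logistic regression classifier.
    clf = make_pipeline(TfidfVectorizer(lowercase=True, stop_words="english"),
                        LogisticRegression())
    clf.fit(texts, labels)

    print(clf.predict(["a wonderful film, I loved it"]))   # expected: ['positive'] on this toy data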
Week 2: Named Entities, Linking & Knowledge Graphs
Named Entity Recognition (NER):
(See Week 1 above.) NER is often the first step in information extraction: find entity mentions in text and classify their type (PERSON, ORG, etc.). Accuracy improves with larger models and domain-specific training, at the cost of more compute.
Entity Linking (EL):
Disambiguating and linking NER mentions to canonical entries in a knowledge base. For example, after NER tags “New York” as a location, EL might link it to the specific entity New York City in Wikipedia or Wikidata (Q60). This grounds the text span to a knowledge graph node, resolving ambiguity (“Michael Jordan” could link to the basketball player or the machine learning professor, etc.). EL systems consider context and often use an ontology or knowledge base lookup to decide the correct entity. One entity mention in text yields one knowledge base entity (or NIL if there is no match), and conversely one knowledge base entity can be mentioned in many texts.
Knowledge Graphs:
Structured knowledge bases organizing entities and their relationships as a graph. Entities are nodes, and relations (predicates) are edges, often represented as RDF triples of the form (subject, predicate, object). Example: (New York City) – [is a city of] → (United States). An ontology defines the allowed relation types (edges) and entity types. Knowledge graphs (e.g., Wikidata, DBpedia) enable AI to reason about real-world facts. When text is linked to a knowledge graph, we can traverse relations (New York City → country → United States) or augment text analysis with structured data.
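A minimal sketch of storing and traversing such triples in plain Python (entity names and relations are illustrative; a real system would use a triple store or a library such as rdflib, with canonical IDs like Wikidata's Q60):

    # Each fact is a (subject, predicate, object) triple.
    triples = [
        ("New York City", "country", "United States"),
        ("New York City", "instance of", "city"),
        ("United States", "instance of", "country"),
    ]

    def objects(subject, predicate):
        """Follow an edge: all objects reachable from `subject` via `predicate`."""
        return [o for s, p, o in triples if s == subject and p == predicate]

    # Traverse: New York City -> country -> United States
    print(objects("New York City", "country"))   # ['United States']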
NER & EL in Practice:
NER and EL are crucial for tasks like question answering, semantic search, and news analytics.
Ambiguity handling:
EL must handle cases like “Michael Jordan” (multiple people) by using context or user input to pick the right entity.
Domain adaptation:
Generic NER may use categories like PERSON, ORG, GPE (geo-political entity), but specialized domains (medical, financial) require custom entity types and training data (e.g., drug names, ticker symbols). Models can be trained on domain-specific corpora with custom annotations.
Applications – Financial Information Mining:
An example use case is the Bloomberg Knowledge Graph for financial news. NER identifies entities in the news (companies, currencies, officials), EL links them to a financial knowledge graph, and relations in the graph help categorize and interpret the news. For instance, if a news snippet mentions “the dollar” and “Sweden’s central bank deputy governor Per Jansson”, the system links “dollar” to the Currencies/Forex domain and “Sweden’s central bank” to an entity categorized under Central Banks.
This allows real-time analytics such as detecting that certain news items impact Forex markets or specific industries. The structured knowledge-graph verticals (e.g., Agriculture, Pharma) help route information to traders interested in those sectors. In summary, NER and EL feed downstream analytics by turning unstructured text into linked data.
Week 3: Embeddings & Transformers
Skip-gram Word2Vec architecture: A simple neural network with one hidden layer is trained to predict context words from a target word. The input is a one-hot vector for the target word (length = vocabulary size), and the output is a probability distribution over the vocabulary indicating which words are likely to appear nearby. During training, the network sees word pairs from a corpus (a target word and one of its context words) and learns to assign higher output probability to true context words than to random words. No nonlinear activation is used in the hidden layer, and after training, the hidden-layer weight matrix (size V × N, where N is the embedding dimension) contains the word embeddings – each row is a dense vector for a word. These vectors encode semantic relationships (e.g., “Soviet” is close to “Union”/“Russia”, not to “watermelon”) and can be used as features in NLP tasks.
Distributional Semantics:
“You shall know a word by the company it keeps.” – J.R. Firth, 1957. The meaning of a word can be inferred from the contexts in which it appears. Word embeddings operationalize this idea: words used in similar contexts end up with similar vectors. For example, “boat” and “ship”, or “Paris” and “Berlin” (in country contexts), get vectors that are close in the embedding space.
Word Embeddings (Word2Vec):
Dense vector representations of words that capture semantic similarity. The Word2Vec algorithm (Mikolov et al. 2013) introduced two architectures:
Skip-Gram: predict the surrounding words given a target word (as illustrated above).
CBOW (Continuous Bag-of-Words): predict a target word given its context words.
Training uses a large corpus of text in a self-supervised manner. A common objective is to maximize the probability of observing true context words and minimize that of random “noise” words.
Negative Sampling is a technique to make training efficient: instead of a full softmax over the vocabulary, the model is trained on binary classification subtasks – it should output a high score for a genuine (target, context) pair and a low score for randomly paired negatives. The negative-sampling objective for a target word I and one true context word O (with k negative samples) is:
L_I = \log\,\sigma(v’_O \cdot v_I) \;+\; \sum_{i=1}^{k} \log\,\sigma(-\,v’_i \cdot v_I)\,,
where $v_I$ is the embedding of word I and $v’_O, v’_i$ are the “output” vectors for the context word O and a negative word i. By maximizing this, the model increases dot products for real pairs and decreases them for random pairs, yielding useful vector representations. These embeddings famously exhibit linear structure: analogies like king – man + woman ≈ queen can be solved by vector arithmetic, and cosine similarity on embeddings reflects semantic similarity (e.g., paris – france + netherlands ≈ amsterdam as vectors).
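A minimal sketch of training skip-gram embeddings with negative sampling, assuming the gensim library and a corpus of pre-tokenized sentences (the tiny corpus here is only a placeholder; real training needs millions of tokens before analogies work):

    from gensim.models import Word2Vec

    # Placeholder corpus: a list of tokenized sentences.
    sentences = [["the", "cat", "sat", "on", "the", "hat"],
                 ["the", "dog", "ate", "the", "cat", "and", "the", "hat"]]

    model = Word2Vec(
        sentences,
        vector_size=100,   # embedding dimension N
        window=5,          # context window size
        sg=1,              # 1 = skip-gram (0 = CBOW)
        negative=5,        # k negative samples per positive pair
        min_count=1,
    )

    vec = model.wv["cat"]                    # dense 100-dim vector for "cat"
    print(model.wv.most_similar("cat"))      # nearest neighbours by cosine similarity
    # With a large corpus, analogies work too:
    # model.wv.most_similar(positive=["king", "woman"], negative=["man"])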
Recurrent Neural Networks (RNNs):
Neural networks designed to handle sequences of arbitrary length by maintaining a hidden state that is updated at each time step. Formally, at word position k, $h_{k} = f(x_{k}, h_{k-1})$, where $x_k$ is the current word (often its embedding) and $h_{k-1}$ is the prior state. The same function f (with the same weights) is applied recurrently at each position. This enables the network to “remember” earlier words when processing later ones. Crucially, RNNs capture word order: e.g., RNN(“blue paint”) ≠ RNN(“paint blue”) because the sequence of states differs. Variants like LSTM and GRU use gating mechanisms to better preserve long-term dependencies and mitigate the vanishing-gradient problem (so the network can carry information across many time steps). RNNs were widely used for language modeling and translation before Transformers.
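A minimal sketch of the recurrence $h_k = f(x_k, h_{k-1})$ for a vanilla (Elman-style) RNN cell, using NumPy with randomly initialized weights for illustration:

    import numpy as np

    emb_dim, hidden_dim = 8, 16
    rng = np.random.default_rng(0)

    # The same weights are reused at every time step.
    W_xh = rng.normal(scale=0.1, size=(hidden_dim, emb_dim))
    W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    b = np.zeros(hidden_dim)

    def rnn_step(x_k, h_prev):
        """One recurrence: h_k = tanh(W_xh x_k + W_hh h_{k-1} + b)."""
        return np.tanh(W_xh @ x_k + W_hh @ h_prev + b)

    # Process a toy sequence of word embeddings; order matters because
    # each state depends on the previous one.
    sequence = [rng.normal(size=emb_dim) for _ in range(3)]
    h = np.zeros(hidden_dim)
    for x_k in sequence:
        h = rnn_step(x_k, h)
    print(h.shape)   # (16,) -- final hidden state summarizing the sequence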
Encoder–Decoder Architecture:
A design for sequence-to-sequence tasks (like translation) with two components:
Encoder:
processes an input sequence into a fixed-length vector or a set of vectors (e.g., the final hidden state of an RNN encoder, or all hidden states of a Transformer encoder). Essentially, Encoder(text) = representation. For an RNN encoder, one might take the final hidden state $c = h_N$ as the sentence embedding. For a Transformer encoder, the output is a set of contextualized token vectors (one per position).
Decoder:
generates an output sequence (e.g., the translated sentence) from the encoder’s representation, typically one token at a time. It can be an RNN that starts from the encoder’s vector $c$ and produces output words sequentially, or a Transformer decoder that attends to the encoder outputs. The decoder may use attention to focus on relevant parts of the encoder output when producing each token (see below).
Transformer architecture (encoder–decoder): The Transformer (Vaswani et al. 2017) eliminates recurrence and instead uses self-attention to handle dependencies. The encoder (left stack) consists of multiple layers; each layer transforms the input token embeddings into contextualized embeddings via multi-head self-attention (tokens attend to other tokens in the sequence) followed by position-wise feed-forward networks (with residual connections and layer normalization). The decoder (right stack) similarly has self-attention layers (masked to prevent seeing future tokens) plus encoder–decoder attention that lets it attend to the encoder’s output vectors (connecting the source and target sequences). This architecture enables efficient parallel processing of sequences (no sequential recurrence) at the cost of quadratic time/memory in the sequence length. Transformers capture long-range context better than RNNs and have become the foundation of modern NLP models.
Attention Mechanism:
A technique that allows a model to focus on relevant parts of the input when performing a task. In the context of Transformers, self-attention computes interactions between every pair of positions in a sequence to build contextual representations. For each token i, the model computes a weighted combination of all token values, with weights indicating the relevance of token j to token i. Specifically, each token i is associated with a query vector $q_i$, and every token j (including i) has a key $k_j$ and a value $v_j$. The attention weight for i→j is obtained from the scaled dot product of $q_i$ and $k_j$:
$\text{score}_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}\,,$
where $d_k$ is the key/query vector dimension. These scores are normalized with a softmax to produce attention weights $\alpha_{ij}$. Finally, the output for token i is a weighted sum of the value vectors: $y_i = \sum_j \alpha_{ij}\, v_j$. The “multi-head” variant runs this process in parallel multiple times with different learned projections, allowing the model to attend to different aspects of the input.
Cross-attention (in a decoder) works similarly, except the queries come from the decoder sequence and the keys/values come from the encoder outputs. Intuitively, attention enables the model to dynamically focus on relevant words: e.g., when producing a particular target word during translation, the decoder can attend to the specific source word or phrase it corresponds to, rather than relying on a single fixed-size context vector.
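A minimal sketch of single-head scaled dot-product self-attention in NumPy (random queries/keys/values stand in for learned projections of token embeddings):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(Q, K, V):
        """Q, K: (n, d_k); V: (n, d_v). Returns contextualized outputs of shape (n, d_v)."""
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)        # score_ij = q_i . k_j / sqrt(d_k)
        alpha = softmax(scores, axis=-1)       # attention weights; each row sums to 1
        return alpha @ V                       # y_i = sum_j alpha_ij * v_j

    rng = np.random.default_rng(0)
    n, d_k, d_v = 5, 8, 8                      # 5 tokens in the sequence
    Q, K, V = (rng.normal(size=(n, d)) for d in (d_k, d_k, d_v))
    print(self_attention(Q, K, V).shape)       # (5, 8)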
Transformers vs. RNNs:
Transformers have no recurrent state and can examine all positions simultaneously with self-attention, which leads to better parallelization and often richer context representations. They do have a limit on input length (e.g., 512 tokens for BERT-Base) due to positional encoding and memory use. RNNs process tokens sequentially and have difficulty with very long-range dependencies (though LSTMs mitigate this). In practice, Transformers (and their derivatives) have largely replaced RNNs in NLP due to their superior performance on tasks such as translation and language modeling.
BERT (Bidirectional Encoder Representations from Transformers):
A landmark pre-trained language model introduced by Devlin et al. (2018). BERT uses the Transformer encoder architecture (12–24 layers) to learn deep bidirectional representations of text. Its training tasks are:
Masked Language Modeling (MLM): Randomly mask out some tokens in the input and train the model to predict them. This forces the model to learn bidirectional context (both left and right of a word).
Next Sentence Prediction (NSP): (Used in original BERT) Train the model to predict if one sentence follows another, to encourage understanding of sentence relationships. (Later variants like RoBERTa removed NSP.)
BERT’s pre-training is done on massive text corpora (Wikipedia, books) and produces a model that can be fine-tuned for specific tasks. BERT provides a powerful initialization for NLP tasks, often yielding contextual token vectors, or a special [CLS] token vector as an aggregate sequence representation.
Fine-tuned BERT models have achieved state-of-the-art results on tasks like QA, classification, and NER, because the model already encodes a lot of linguistic knowledge from pre-training.
Fine-Tuning vs. Pre-Training:
Pre-training refers to training a model on a general task over a large corpus (typically self-supervised, like language modeling) to learn broad language patterns.
Fine-tuning means taking that pre-trained model and training it further on a specific task with labeled data. For example, BERT is pre-trained once on a huge corpus; then, for a sentiment classification task, we fine-tune it on a smaller movie-review dataset by adding a classification layer. During fine-tuning, the model’s weights are updated (often only slightly) and the new task-specific layer is learned.
Pre-training provides a “fast learner” that requires fewer task-specific examples to achieve high performance. This two-phase training is now standard in NLP.
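A minimal sketch of setting up BERT for fine-tuning on a two-class task, assuming the Hugging Face transformers library and PyTorch (the model name and label count are illustrative; the actual training loop, e.g. with the Trainer API, is omitted):

    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Load the pre-trained encoder plus a freshly initialized classification head.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )

    # Tokenize a toy batch; fine-tuning would feed batches like this (with labels)
    # through the model and update the weights with a small learning rate.
    batch = tokenizer(
        ["a wonderful, moving film", "terrible acting and a dull plot"],
        padding=True, truncation=True, return_tensors="pt"
    )
    outputs = model(**batch)
    print(outputs.logits.shape)   # (2, 2): one score per class for each example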
Week 4: Language Models & Generation
Language Models (LMs):
Models that assign probabilities to sequences of words. A language model computes $P(w_1, w_2, …, w_N)$ – the likelihood of a sequence – and can also yield probabilities for the next word given a history, $P(w_{n} | w_{1}\dots w_{n-1})$.
N-gram models are classic LMs that use the Markov assumption: approximate $P(\text{word} | \text{history})$ by $P(w_n | w_{n-(N-1)}, …, w_{n-1})$ (i.e., only the last N−1 words matter). They count occurrences in a large corpus to estimate these probabilities. For example, a 3-gram model might estimate $P(\text{“is”} | \text{“the cat”}) = 0.6$ by looking at how often “the cat is” occurs vs. other completions.
N-gram LMs suffer from data sparsity – many possible word combinations are never seen, especially for large N. A higher N yields more contextually accurate predictions but demands exponentially more data to cover all combinations (sparsity can be alleviated with smoothing techniques, interpolation, or skip-grams).
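A minimal sketch of estimating such probabilities from bigram counts in plain Python (the tiny corpus is only illustrative; real models need smoothing and far more data):

    from collections import Counter

    corpus = "the cat is here . the cat is happy . the cat sat".split()

    # Count bigrams (w_{n-1}, w_n) and unigram histories w_{n-1}.
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus[:-1])

    def p_next(word, history):
        """Maximum-likelihood estimate of P(word | history) for a bigram model."""
        return bigrams[(history, word)] / unigrams[history]

    print(p_next("is", "cat"))    # 2/3: "cat is" occurs twice, "cat sat" once
    print(p_next("sat", "cat"))   # 1/3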
Evaluating LMs:
One metric is perplexity, which is essentially the inverse probability of the test set normalized by the number of words (the lower, the better – meaning the model assigned higher likelihood to the test data). If an LM predicts real text well, it will have low perplexity on it. N-gram models often serve as a baseline; modern Transformer LMs achieve much lower perplexities on large corpora thanks to their capacity to model long-range dependencies.
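Written out (a standard definition, consistent with the description above), the perplexity of a model on a test sequence $w_1 \dots w_N$ is:
$PP(w_1 \dots w_N) = P(w_1, \dots, w_N)^{-1/N} = \exp\!\left(-\frac{1}{N} \sum_{n=1}^{N} \log P(w_n | w_1 \dots w_{n-1})\right).$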
Transformer-Based LMs:
Modern LMs are usually large Transformer networks trained to predict the next token. For example, GPT (Generative Pre-trained Transformer) models use a Transformer decoder architecture (so they attend only to past context, via masked self-attention) and are trained on huge amounts of text to model $P(\text{next token} | \text{previous tokens})$. These models, with billions of parameters, can capture complex language patterns and world knowledge.
Large Language Model (LLM) has come to denote extremely large Transformers (with dozens of layers and very large hidden sizes) trained on massive corpora. LLMs like GPT-3, GPT-4, PaLM, etc., have demonstrated emergent abilities (solving tasks they weren’t explicitly trained for) and often undergo an extra fine-tuning step on instructions.
Pre-training & Fine-tuning (in LLMs):
Follows the same concept as with BERT.
Pre-training is usually self-supervised learning on unlabeled text – e.g., next-word prediction for GPT or masked-word prediction for BERT – done once on billions of tokens.
Fine-tuning can happen multiple times: e.g., OpenAI’s GPT-3 was fine-tuned with Reinforcement Learning from Human Feedback (RLHF) to align with user intentions (ChatGPT). Fine-tuning adapts the general LM to specific domains or to following instructions. The flexibility of fine-tuning (or even prompting, see below) means one pre-trained model can power many applications.
Text Generation:
Using a language model to generate new text, one token at a time. Given a prompt (starting text), the LM produces a probability distribution over the vocabulary for the next token. A token is sampled (possibly with techniques to control randomness like temperature or top-k/top-p sampling), then appended to the prompt, and the process repeats. For example, if a model gives $P(\text{“is”}| \text{“The cat”})=0.6$ and $P(\text{“will”}| \text{“The cat”})=0.3$, etc., it likely chooses “is” next. Over many steps, this generates a continuation.
Beam search can be used for more deterministic outputs (particularly in tasks like translation). Contemporary systems (e.g., chatbots) often add safeguards or additional coherence checks in the generation loop. Key issues in text generation include maintaining coherence, avoiding repetition, and controlling style and factual accuracy.
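A minimal sketch of one sampling step with temperature and top-k filtering, using NumPy over a toy next-token distribution (the vocabulary and scores are invented for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["is", "will", "sat", "runs", "the"]
    logits = np.array([2.0, 1.3, 0.4, 0.1, -1.0])   # model scores for the next token

    def sample_next(logits, temperature=1.0, top_k=3):
        """Sample one token id: scale logits by temperature, keep top-k, renormalize."""
        scaled = logits / temperature            # low temperature -> sharper distribution
        top = np.argsort(scaled)[-top_k:]        # indices of the k highest-scoring tokens
        kept = np.full_like(scaled, -np.inf)
        kept[top] = scaled[top]
        probs = np.exp(kept - kept.max())
        probs /= probs.sum()                     # softmax over the kept tokens only
        return rng.choice(len(logits), p=probs)

    next_token = vocab[sample_next(logits, temperature=0.8, top_k=3)]
    print(next_token)   # usually "is", the highest-probability candidate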
Prompt Engineering:
The process of crafting the input text (prompt) to guide an LLM to produce the desired output. Because LLMs can follow natural-language instructions, how you ask matters a great deal. Prompt engineering involves techniques like:
Providing clear instructions (e.g., “Explain the following in simple terms:”).
Giving context or a role (e.g., “You are an expert translator. Translate this sentence…”).
Using few-shot examples in the prompt to illustrate the task (the model then continues the pattern).
Specifying the format or constraints (e.g., “Answer in JSON.”).
The goal is to maximize the quality of the output without changing the model’s weights (see the example prompt below). This has become an important skill, since a well-designed prompt can significantly improve the relevance and correctness of the model’s response. Prompt engineering is iterative and often empirical, requiring testing and refining prompts to get optimal results. Because LLMs are black boxes in deployment, prompt design is how a user “programs” them.
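A short illustrative prompt combining several of these techniques (a role, an instruction, a format constraint, and one few-shot example; the wording is invented):

    You are an expert financial news analyst.
    Classify the sentiment of each headline as "positive", "negative", or "neutral".
    Answer in JSON with the keys "headline" and "sentiment".

    Headline: "Shares surge after record quarterly earnings"
    {"headline": "Shares surge after record quarterly earnings", "sentiment": "positive"}

    Headline: "Regulators open probe into accounting practices"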
Retrieval-Augmented Generation (RAG):
A technique to enhance LLMs with up-to-date or specialized knowledge by integrating a retrieval step.
In RAG systems, when a query comes in, the system first uses a retriever (e.g., vector similarity search) to fetch relevant documents or facts from an external database, and then the generative model produces an answer conditioned on both the query and the retrieved context. In practice, this often means:
Encoder (Retrieval):
Use an embedding model (often the LLM’s own encoder) to convert the user query and a large set of documents into vectors. Find the documents nearest to the query in this vector space (using a vector index such as FAISS or Elasticsearch).
Augment Prompt:
Insert the retrieved text (or a summary of it) into the prompt along with the user’s question. For example: “Document 1 says … Document 2 says … Question: … Answer:”.
Decoder (Generation):
The LLM (often fine-tuned to handle such inputs) generates an answer that incorporates the provided documents.
RAG is essentially “open-book” generation – the model isn’t limited to its parametric knowledge and can cite specific sources. This improves factual accuracy and allows updates (just update the database, without retraining the model). It is used in applications like customer support (the bot retrieves relevant articles from a knowledge base), conversational search engines, or any “chat with your documents” scenario. A key component is the retriever’s quality – often improved by fine-tuning the embeddings to bring related questions and answers closer together in vector space. The takeaway: RAG combines the strengths of information retrieval (precision, up-to-date information) with the generative flexibility of LLMs, enabling more reliable and informed responses.
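A minimal sketch of the retrieve-then-generate flow, using NumPy for the vector search and placeholder embed/generate steps (in a real system these would be an embedding model, a vector database such as FAISS, and an LLM call):

    import numpy as np

    documents = [
        "The Riksbank is Sweden's central bank.",
        "FAISS is a library for efficient similarity search over dense vectors.",
        "TF-IDF weights words by frequency and rarity.",
    ]

    def embed(text):
        """Placeholder embedding: a real system would call an embedding model here,
        so similarity below is not semantically meaningful in this sketch."""
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = rng.normal(size=64)
        return v / np.linalg.norm(v)

    doc_vectors = np.stack([embed(d) for d in documents])

    def retrieve(query, k=2):
        """Return the k documents whose embeddings are most similar to the query's."""
        scores = doc_vectors @ embed(query)      # cosine similarity (unit vectors)
        return [documents[i] for i in np.argsort(scores)[::-1][:k]]

    query = "Which institution is Sweden's central bank?"
    context = "\n".join(retrieve(query))
    prompt = f"{context}\n\nQuestion: {query}\nAnswer:"
    # The prompt would now be sent to the LLM, which answers using the retrieved context.
    print(prompt)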