Natural Language Processing (NLP) Techniques and Models Overview
DSA4213 – Natural Language Processing (NLP)
Text Preprocessing Pipeline: Word Tokenization -> POS Tagging -> Noun Phrases, Named Entity Recognition -> Feature Engineering (BOW, N-grams, TFIDF, Word2Vec, Skip grams, CBOW)
Word Tokenization
-> split text up further into words for deeper analysis. Word-based models use subword tokenization, applying learned tokenization algorithms to each input:
- Byte Pair Encoding (BPE): iteratively replace the most frequent pair of symbols in the corpus with a single new symbol, reducing the size of the data used and tokenizing text into subwords
- Example: with corpus ‘hug’ ‘bug’ ‘pun’, we create a vocab with ‘b’, ‘h’, ‘g’, ‘n’, ‘ug’, ‘un’, ‘hug’, ‘hun’… mix and matching with tokens in corpus. Could result in ambiguous final token vocab, but can handle rare and out-of-vocab words
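The merge step above can be sketched in a few lines of pure Python on the notes' toy corpus (`most_frequent_pair` and `merge_pair` are illustrative names, not from any library):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all tokenized words."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus from the notes: each word starts as a tuple of characters
corpus = {("h", "u", "g"): 1, ("b", "u", "g"): 1, ("p", "u", "n"): 1}
pair = most_frequent_pair(corpus)   # ('u', 'g') occurs twice
corpus = merge_pair(corpus, pair)   # 'hug' becomes ('h', 'ug'), etc.
```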
- WordPiece: Starts from a small vocab including special tokens from model and initial alphabet. Each word is initially split by adding prefix to each character in the word. And learns merge rules by computing a score for each pair using formula (pair frequency) / (first ele’s freq x second ele’s freq) and iteratively merged like BPE until we reach desired vocab size, and only saves the final vocab
- Unigram: Start from a large initial vocab (e.g. produced by applying BPE to the corpus with a large vocab size). Compute the loss over the corpus given the current vocabulary and see how little the overall loss would increase if a specific symbol were removed -> the least-needed symbols are removed until the desired vocab size is reached
- Limitations: still have to remove stop words, clean HTML or special characters, standardize (slang to formal forms), and normalize (words to base forms using Part-Of-Speech tags)
Feature Engineering for Classical ML used in NLP: BOW representations, N-grams, TF-IDF, Word2Vec, Skip grams, CBOW
- BOW representations: From a given sentence, create a vector as the count of the word in generated vocabulary (E.g. I am on the top of the world. -> Vector for I, am, on, the, top, of, world = [1, 1, 1, 2, 1, 1, 1] -> sparse vector based on frequency)
- This featurization vector would be stored as a sparse vector of (word_id, frequency).
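The BOW featurization on the example sentence can be sketched as follows (a dict of counts stands in for the sparse (word_id, frequency) pairs):

```python
from collections import Counter

def bow_vector(sentence):
    """Bag-of-words: map each word to its count in the sentence (sparse form)."""
    tokens = sentence.lower().split()
    counts = Counter(tokens)
    # Only non-zero entries are stored, mirroring a sparse vector
    return dict(counts)

vec = bow_vector("I am on the top of the world")
# 'the' appears twice; every other word appears once
```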
- N-grams: captures word order, especially the order of nearby words. This is done by modelling tuples of consecutive words. The length of word orders measured depends on the N chosen. If N = 2, ‘the-cat’, ‘cat-sat’, If N = 3, ‘the-cat-sat’ etc
- Unigrams have higher counts and can detect weaker influences, while bigrams and trigrams are more specific. Limitations: Struggles with feature set size – for original set size |V|, number of N-grams is |V|^N.
- TF-IDF (Term Frequency – Inverse Document Frequency): TF of term t in document d = (# appearances in doc) / (# terms in docu), IDF of term t in list of documents = log((# documents in corpora) / (# docs per corpora with term ‘t’))
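The TF and IDF definitions above can be checked on a made-up toy corpus; note that a term appearing in every document gets IDF log(1) = 0:

```python
import math

def tf(term, doc):
    """Term frequency: appearances in the doc over total terms in the doc."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency over a list of tokenized documents."""
    n_containing = sum(term in d for d in docs)
    return math.log(len(docs) / n_containing)

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
# "the" is in every doc -> idf = log(3/3) = 0, so its tf-idf is 0 everywhere
score = tf("cat", docs[0]) * idf("cat", docs)   # (1/3) * log(3/2)
```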
- Skip grams: Train context by taking pairs of a word in the sentence, with other words around it. (E.g. (the quick brown fox jumps over…) would have (the, quick), (the, brown), (fox, quick), etc. Learns Syntactic and Semantic Relationships
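A sketch of skip-gram training-pair generation with a context window of 2, matching the example pairs above:

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the quick brown fox jumps".split())
# ("the", "quick") and ("the", "brown") are both training pairs
```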
Topic Modelling:
Unsupervised method of finding topics in collections of documents – divides a corpus into the list of topics its documents cover. Used to extract hidden topical patterns, annotate documents according to topics, and organize/search/summarize texts
- Different Methods of Topic Modeling:
- Latent Dirichlet Allocation (LDA) – Allocate Words to Topics and Documents consist of Words based on a distribution of distributions. Fast to run for long documents, generalize topics but requires a lot of fine-tuning for human topics. Choose the top x words to represent the topic.
- Bayesian Statistics: Gibbs sampling is used to sample from the conditional distributions of the variables. A biased random walk explores the distribution of probabilities and converges to the true distribution.
- Matrix Factorization -> determine the topic mixture of each document by factorizing the document-term matrix.
- Non-negative Matrix Factorization (NMF): reduces the dimension of the input corpora by factorizing the document-term matrix into non-negative factors (unlike SVD-based methods, whose topics are orthogonal and can contain negative weights)
- Evaluation of Topic Models: eyeballing models by checking the top-N words and topics/documents; intrinsic evaluation metrics such as how well the model captures semantics and the topics' interpretability; human judgements; evaluation on downstream tasks
Neural Networks and Deep Learning
- Feedforward NNs with a nonlinear hidden layer and activation units allow nonlinear function approximation / curve fitting. -> Next-word prediction: the output at time t-1 is passed into the model as input at time t. The error between the actual word and the output is backpropagated.
- Enlarging the window enlarges the weights to train: W = L (embedding size) * n (window size) * h (size of hidden dim). This makes it difficult to capture information across sequential data (context for all words). Word2vec also has limited context length
- Recurrent Neural Networks (RNN) -> use layer normalization to be independent of batch size and sequence length, making them more suitable for variable-length sequences
- Vanishing and Exploding Gradients – exploding gradients are solved with gradient clipping (when the gradient is large, scale it down before applying). Vanishing gradients remain unsolved because long-range dependencies in long sentences are needed to maintain context
- LSTM Architecture
- Contains a Forget gate (delete from the state vector), Input gate (add to the state vector), and Output gate (control info flow to the next hidden state). Allows state info to be retained over longer periods of time; decides what info should be thrown away via a sigmoid whose output ranges from 0 (forget) to 1 (keep)
- Generalization Techniques
- L2 Regularization: adds the sum of squares of the weights (λ · Σw²) to the loss, shrinking weights towards 0. Not to be confused with L2 normalization, which divides a vector by its L2 norm: e.g. for [3, 4], the norm is sqrt(3² + 4²) = 5, giving (1/5) · [3, 4] = [0.6, 0.8]
- Dropout: drop neurons in the hidden layers during training, with a fixed probability p, using a dropout mask. No dropout happens during testing; instead all weights are multiplied by 1-p to match the expected activations
Attention Mechanism and Transformers
- Attention: Created to solve issue where we lose information on individual words – Relative importance the model assigns to each input token -> synthesize all these objects and nouns together to create the full context.
- Query Vector: A question about context represented as a vector of scalars, which is the product of matrix W_Q and embedding vector of word. Key Vector: a vector of scalars that represents the answer to those questions asked by query vectors.
- The query space is typically much smaller than the embedding space (e.g. 128 dimensions). Query and key vectors produce a larger dot product if the query matches up strongly with the response from the key.
- Value Vector: a vector of scalars that represents the actual information provided by the token.
- Formula for Attention: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
- A softmax is then applied on all dot products of queries over keys -> probability distribution of the query generating the key.
- For numerical stability (to prevent the softmax from reaching regions with small gradients), the dot products are divided by √d_k, the square root of the key/query dimension. (In the original Transformer, d_model = 512 and d_k = d_model / h = 64.)
- This entire table of key/query is called an attention pattern. Allows us to better explain the output of our transformer.
- Outputs the same sequence of input embeddings X before applying any non-linearity
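The attention formula can be sketched without any libraries; the optional causal flag implements the masking described later in these notes (function names are illustrative):

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        if causal:  # positions after i must not be attended to
            scores = [s if j <= i else float("-inf")
                      for j, s in enumerate(scores)]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

# Two tokens, 2-d head: with a causal mask, token 0 only attends to itself
Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, K, V, causal=True)
```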
- Self Attention – Apply attention on own sentence to represent word on input source sentence as well using a decoder-only model. “The animal didn’t cross the street because it was too tired” -> self attention allows it to associate ‘it’ with ‘animal’
- Transformers – seq2seq model based on self-attention, built for machine translation with parallel computation. Decoder side: Output -> Positional Encoding -> Masked MHA -> Add & Norm -> MHA over encoder outputs (cross-attention) -> Add & Norm -> FF -> Add & Norm -> Linear -> Softmax -> Probabilities
- Process: Tokenization -> Padding (adding tokens) and Truncation (removing tokens) for same length -> Given a TokenID -> Learn & Input 512-length embedding vector into model -> so on…
- Positional Encodings – include information about sequence due to self-attention mechanism -> encode the order of the sentence in K Q and V using positional encoding. Number of params in embedding = d_hidden * vocab size
- Multiheaded Attention -> K Q and V are split N-ways and passes each split through a separate Attention Head -> Combined to produce a final Attention score called Multiheaded Attention. Allows the Transformer to encode multiple relationships and nuances.
- Masked Attention -> makes the model causal (during the training stage): the output at a given position must not be able to see future words. For every position after the word itself, Q·K is set to -inf, so softmax sets it to 0
- Layer Normalization -> Performed only across the feature dimension. Help models train faster by cutting down on uninformative variation in hidden vector values through normalization.
- Residual Connections -> Allows for gradients to flow directly through the network and learn the residual from the previous layer, mitigating the vanishing gradient problem
- Feed-Forward Network -> processes the output of the self-attention mechanism and applies non-linearity to the model's output
- Transformer Training -> the encoder outputs, for each word, a vector that captures not only its embedding and position but also its interaction with other words through multiheaded attention. The output is then processed to obtain a cross-entropy loss at each word level
- Transformer Inference (Decoder Only) -> Done through Greedy method, Beam Search or Exhaustive Search. Sampling Mechanisms/Hyperparams: Top P Nucleus Sampling, Top K, Temperature
- Greedy method: select the word with the maximum softmax value. The model is unable to explore diverse combinations of words
- Exhaustive search: scores every possible sequence – the model has bad latency
- Beam Search: select the top B words at each step, evaluate all possible next words for each, and keep the B most probable sequences. At every step the decoder accepts the previous token (the start token if at the beginning) + encoder output, and a softmax is applied to select the word to append. A compromise between Greedy and Exhaustive Search
- Top K: sample from the K tokens with the highest probability. Top P (nucleus): filter for the smallest set of tokens with summed probability higher than the Top-P param, then renormalize this set of probabilities. Temperature: alters the softmax transformation of the output probs – the higher, the more random.
- Evaluating Quality of Generated Text
- Precision: whether it correctly predicts if words are in the target sentence. May have repetition with same words: “He He He” has precision of 1
- Clipped Precision: Precision, but limit the count for each correct word to the max number of times the word occurs in the target sentence. Predicted Sentence: He He He eats tasty fruit -> He is clipped to only 1 time instead of 3
- BLEU Score: clipped precision for n-grams. Find the geometric average of the clipped n-gram precisions (up to order N) and the Brevity Penalty, then compute BLEU = BP · exp(Σ_n w_n · log p_n)
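Clipped unigram precision on the notes' example can be sketched as follows (BLEU additionally combines several n-gram orders and the brevity penalty):

```python
from collections import Counter

def clipped_precision(candidate, reference, n=1):
    """Clip each candidate n-gram count by its count in the reference."""
    def ngrams(toks):
        return Counter(zip(*[toks[i:] for i in range(n)]))
    cand, ref = ngrams(candidate), ngrams(reference)
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / max(1, sum(cand.values()))

cand = "He He He eats tasty fruit".split()
ref = "He eats a tasty fruit".split()
p1 = clipped_precision(cand, ref)   # "He" is counted once, not three times
```

Here p1 = 4/6: of the six candidate unigrams, "He" contributes only one clipped match plus "eats", "tasty", "fruit".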
- BERT: made up of layers of encoders of the Transformer model. BERT-base has an embedding vector size of 768, compared to BERT-large's 1024
- Positional embeddings learned during training and limited to 512 positions, Uses the WordPiece tokenizer which also allows sub-word tokens, up to 30k tokens. 12/24 encoder layers, 3072/4096 size of hidden layer, 12/16 attention heads, 110M/340M params
- Pretraining is done to perform language modeling and learn general structure of language. -> Then finetuning is performed to adapt to the task required. Segmentation embeddings [SEP] are introduced to understand which tokens belong to sentence 1, 2, etc
- Pretraining Architectures
- Encoders: Gets bidirectional context – can condition on future. But how to train them to build strong representations?
- Encoder-Decoders: Language modeling decoder, and encoder benefits from bidirectional context. E.g. T5: In the case of span corruption, we replace different length spans from the input with unique placeholders and decode out the spans removed
- Decoders: Language models; nice to generate from but can’t condition on future words.
- Generative Pre-trained Transformers (GPT-3): performs autoregressive modeling, with a vocabulary of ~50K BPE tokens and a context window of 2048 tokens processed by 96 decoder layers. Unsupervised pre-training with a large dataset of text to learn 175B params -> fine-tuning to become better at certain tasks
- Can perform In-context Learning due to large amt of params – learn without gradient steps from examples provided within contexts given by user. Allows for few-shot learning to become zero-shot learning
- Neural Scaling Laws: the ideal parameter-to-token ratio for training is about 1.7 text tokens per param (Kaplan 2020) -> if we want smaller models that retain quality, we must trade off model size against the number of training tokens
- Additional Tasks performed to fine-tune models:
- Reinforcement Learning with Human Feedback (RLHF): prompts and human-ranked model outputs -> train our reward model. A new prompt is sampled from the dataset; the policy generates output which is evaluated by the reward model. Can result in hallucination
- Fine-Tuning Paradigm: Fine-tune for many tasks and force them to adapt to all. Collect examples of (prediction, output) pairs across many tasks and finetune an LM. -> fine-tunes the model to emulate a correct style or behavior & preserve generalization
- Supervised Fine-Tuning: different training architectures are used for each task. Limitations: expensive to collect ground-truth data for tasks; token-prediction errors are penalized equally even when semantically severe (e.g. "Please send this package to Tokyo" vs "Pittsburgh"); unclean training data -> bad responses
- LLM Architectures
- LLaMa 2: Input -> Embedding Vectors -> RMS Normalization layer -> Self Attention (multiquery attention) with KV Cache & RoPE -> RMS Norm + SwiGLU feedforward -> RMS Norm -> Linear function + Softmax -> Output
- Rotary Positional Embeddings (RoPE): Incorporate explicit relative position dependency. Only applies to the Keys and Queries after the vector q and k have been multiplied by relative weights W in the attention mechanism
- Normalization: scales feature values into a stable range so GD converges faster on data with large magnitudes, and deals with Internal Covariate Shift (which slows training, as neurons must keep switching direction and magnitude when the input distribution changes).
- Layer Normalization: normalization across the feature dimension of each layer: y = γ · (x − μ)/σ + β. Helps models train faster, with γ and β being "gain" and "bias" parameters that help modulate each value on the layer
- RMSNorm: only provides re-scaling invariance (no re-centering), regularizing the summed inputs by the RMS statistic: y = x / RMS(x) · γ, where RMS(x) = sqrt(mean(x²)). Requires less computation than Layer Normalization and works well in practice
- KV Cache: avoid wasting time recalculating dot products that have already been computed for the attention mechanism. We store the previous keys and values so we only calculate attention for the new token
- We trade memory against compute: Per-token memory consumption in bytes, of the KV cache of a MHA model is 2 * n_layers * n_heads * dim_heads * precision
- For large models and long sequences, caching keys and values can exceed the memory required to load the model weights -> limits the number of long sequences. Mitigations: reduce the batch size, the total sequence length, the number of layers, or the number of attention heads
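Plugging hypothetical 7B-class numbers (32 layers, 32 heads, head dim 128, FP16 = 2 bytes) into the per-token formula shows the scale of the trade-off:

```python
def kv_cache_bytes_per_token(n_layers, n_heads, dim_head, precision_bytes):
    # Factor 2: one cache for keys and one for values
    return 2 * n_layers * n_heads * dim_head * precision_bytes

# Hypothetical 7B-class MHA config at FP16
per_token = kv_cache_bytes_per_token(32, 32, 128, 2)   # 512 KiB per token
total_gb = per_token * 4096 / 1e9                      # a full 4096-token context
```

A single full-length sequence already costs over 2 GB of cache in this configuration, which is why multi-query and grouped-query attention cut the number of K/V heads.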
- Multi-Query Attention: Keep only 1 head for Key and Values. Significantly reduces memory overhead of managing K, V matrices. The stripping choice is more aggressive for larger models (i.e. going from 64 heads to 1 causes significant cut in model’s representation capacity)
- Grouped Multi-Query Attention: Provide a middle ground for that tradeoff on representation vs memory overhead. Have 1 Key and Value head handle multiple heads from Query heads, by splitting all Query heads into a few groups. Used by LLaMa 2
- SwiGLU Activation Function: SwiGLU(x) = Swish(xW) ⊗ (xV), where Swish(x) = x · σ(βx). In LLaMA's feedforward block: FFN(x) = W₂ · (SiLU(W₁x) ⊗ W₃x)
- Mixtral: Mixture of Experts Transformer Architecture. Input -> Embedding -> RMS Norm -> KQV goes into Sliding Window Attention w/ Rolling Buffer KV Cache -> RMSNorm -> FF SILU -> RMSNorm -> Linear -> Output
- Sliding Window Attention: further adds a mask so that words sufficiently far from the position being attended are also set to 0 by the softmax (i.e. a word in chapter 1 might not be relevant after chapter 6) -> keeps local context in focus and reduces the dot products to perform
- Rolling Buffer KV Cache: Follows up with Sliding Window Attention: Since we are using Sliding Window Attention with size W, we can also limit attention calc to keep only the latest W tokens.
- Key and Values are stored in cache slot determined by i mod W, where W = fixed cache size. After replacing a previous token, we “unroll” the cache by putting all items initially after the rewritten token, to the front of this token. Cuts down memory usage.
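The i mod W slot assignment and the "unroll" step can be sketched for a window of 4 (a toy simulation that stores token positions instead of real K/V vectors):

```python
def run_cache(n_tokens, window):
    """Simulate a rolling buffer KV cache; entries are token positions."""
    cache = [None] * window
    for pos in range(n_tokens):
        cache[pos % window] = pos          # overwrite slot i mod W
    head = n_tokens % window
    # Unroll: the oldest surviving token sits right after the write head
    return cache[head:] + cache[:head]

ordered = run_cache(6, window=4)           # only the latest 4 tokens survive
```

After 6 tokens the raw cache holds [4, 5, 2, 3]; unrolling restores chronological order [2, 3, 4, 5].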
- Pre-fill and Chunking: the cache is pre-filled with the provided prompt to enhance context and manage the cache size, allowing the input sequence to be fed in smaller chunks. For long prompts, chunking divides the prompt into smaller segments, each attended with both the cache and the current chunk -> optimizing the process. Chunk size is determined by the sliding-attention window size
- Sparse Mixture of Experts: in Mixtral 8x7B, the MoE FF block is present at every layer. Input -> RMSNorm -> Sliding Window Attention w/ Rolling Buffer KV Cache -> RMSNorm -> Gate (selects the top 2 experts based on logits, combined by weighted sum) -> 8 FF SiLU expert layers -> Output.
- Model Sharding: When the model becomes too big to fit in a single GPU, we can divide the model into “groups of layers” and place each group of layers in a GPU. E.g. if I have 32 encoder layers, I can put 8 encoding layers into 1 GPU, given a total of 4 GPUs
Retrieval Augmented Generation: makes use of in-context learning to help the LLM personalize and perform non-language tasks beyond its memory while reducing hallucination
- Long context models: computationally expensive (complexity of the attention layer is O(n_tokens² · embedding_size²)). Hard to extract information from the middle of a long context, and the cost of sending many tokens slows down computation – so RAG holds importance
- Long Context Evaluation – Needle In A Haystack: Tests in-context retrieval ability of long context LLMs by placing random fact in mid of long context and ask model to retrieve it
- Internal Knowledge Base is Chunked, Indexed and Stored by DB. Chunking Considerations: Nature of Content (long document or tweets), Embedding Model, Complexity of User Queries, Use Case, Size of Chunks, Character Splitter to use
- Character Chunking, Recursive Chunking, Special Chunking -> chunk by document format and header markup; Semantic Chunking -> take embeddings of every sentence and compare similarity across sentences to chunk
- Calculating time taken to chunk:
total time taken = num_chunks * chunk_size / processing_speed
- Embedding Considerations: different kinds of user queries; Sparse (SR) vs Dense (DR) Retrieval. SR algorithms like TF-IDF or BM25: low latency and explainable, but only a sparse, token-based representation of the text. DR encodes queries and documents as single dense vectors: leverages rich semantics from queries and passages, and fine-tuning allows the model to better align with the application.
- Embedding Model: BERT encodes input text effectively, but was more intended for specific downstream tasks so it should not be used for sentence embedding. Rather, train BERT to learn sentence similarity:
- Siamese Networks compare the embedding vectors of 2 different sentences and either apply a softmax classifier to the vector difference or cosine similarity to both vectors; trained and evaluated with Triplet Loss or a Cosine Similarity objective
- BGE-M3: Trained for use with RAG Systems. Multi Vec Retrieval uses the entire output embedding for the representation of query and passage, Multi-Granularity – process inputs from short sentences to long docs up to 8192 tokens
- Retrieval of chunks similar to the user query can take very long: with N embedding vectors of dimension D, the complexity of searching for the top K vectors is O(N·D) -> use a Vector DB with specific indexing techniques such as KD-Trees and HNSW
- HNSW: uses a Skip List (a linked list with layers, each layer having half the nodes of the one below; this allows the pointer to descend layer by layer through value comparisons until we reach the desired element) with O(log n) complexity
- NSW (Greedy Algo that starts at a predefined point in a graph and selects nodes closer to the target node, measured by Euclidean or Cosine sim until reaching nearest neighbors of target)
- Extended by inheriting Skip List properties: Top layer has fewest data pts w/ longest connections, with # elements increasing down the hierarchy till lowest layer. Select a random starting point and move to local best on layer, before descending below and moving to local best on new layer till hitting bottom layer
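The greedy NSW-style descent on a single layer can be sketched as a walk over a toy graph (squared Euclidean distance; node coordinates and links below are made up for illustration):

```python
def greedy_search(graph, coords, start, query):
    """Greedily hop to whichever neighbor is closest to the query."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    current = start
    while True:
        best = min(graph[current], key=lambda n: dist(coords[n], query))
        if dist(coords[best], query) >= dist(coords[current], query):
            return current                 # local minimum: no closer neighbor
        current = best

# Tiny 1-D example: nodes on a line, each linked to its neighbors
coords = {0: (0.0,), 1: (1.0,), 2: (2.0,), 3: (3.0,)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
nearest = greedy_search(graph, coords, start=0, query=(2.2,))
```

The walk hops 0 -> 1 -> 2 and stops at node 2, the true nearest neighbor of 2.2. HNSW runs this descent per layer, using the stopping node as the entry point of the layer below.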
Fine-Tuning and Quantization
Counting Parameters in DL Models and Transformers
- FFNN – Let i = input size, h = size of hidden layer, o = output size. for 1 hidden layer, # params = (i x h + h x o) + (h + o)
- RNNs – Let g = # FFNNs in a unit (RNN = 1, GRU = 3, LSTM = 4), h = size of hidden units, i = dimension/size of input. Each FFNN maps [input; hidden] -> hidden, so it has h * (h + i) + h params -> # params = g x [h * (h + i) + h]
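Both counting formulas can be sanity-checked in a few lines; the LSTM case with i = 10, h = 20 gives the 2,480 params that standard framework summaries also report:

```python
def ffnn_params(i, h, o):
    # Weights (i*h + h*o) plus biases (h + o) for one hidden layer
    return (i * h + h * o) + (h + o)

def rnn_cell_params(g, i, h):
    # Each of the g internal FFNNs maps [input; hidden] -> hidden, plus bias
    return g * (h * (h + i) + h)

ffnn = ffnn_params(i=3, h=4, o=2)         # (12 + 8) + (4 + 2) = 26
lstm = rnn_cell_params(g=4, i=10, h=20)   # 4 * (20*30 + 20) = 2480
```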
- LLaMa 2 – # params = embed_params + (# layers x attention_module params) + mlp_block_params + per_layer_rms_norm_params + pre_lm_head_rms_norm_params + lm_head_params= 13,015,864,320
- Serving for Inference – roughly 2 bytes per parameter at FP16, e.g. a 7B model needs about 14GB of GPU memory to serve for inference. Slow training and inference are due to model scale (large # of weights and computations), architecture (attention ops are quadratic wrt token length), and the decoding approach at inference
- How to improve computations: To fix Model Scale – Fix Structure Design E.g. MoE (Mixture of Experts), Multi-query Attention
- To fix Quadratic computation, memory access and memory footprint – Model Compression
- Quantization: Process to convert and store weights of NNs using lower precision data types (typically lower precision FPs). De-Quantization: Convert back to floating values. Done through a mapping mechanism
- Floating Point Representations: sign -> 1 bit (0 indicates positive, 1 negative); exponent -> usually 8 bits, encodes negative or positive exponents via a bias; mantissa -> the remaining bits (depends on FP16 vs FP32).
- FP32 has high computational and memory footprint in exchange for high precision and numerical stability compared to FP16
- FP Value in Decimal: value = (−1)^sign × 2^(exponent − bias) × (1 + mantissa fraction). Bit layout (sign-exponent-mantissa): FP32: 1-8-23; FP16: 1-5-10; BF16: 1-8-7. BF16 keeps FP32's exponent range, reducing underflow/overflow; it does not significantly impact model performance and is a useful compromise for DL
- Usually, weights are held in FP32 and FP16/BF16 (half-precision) is used for computation in forward and backward pass
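The decimal-value formula can be checked by decoding the FP16 (1-5-10, bias 15) layout by hand (a sketch; infinities and NaN are omitted):

```python
def fp16_value(bits):
    """Decode a 16-bit IEEE 754 half-precision pattern (1-5-10, bias 15)."""
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F
    mantissa = bits & 0x3FF
    if exponent == 0:                          # subnormal: no implicit leading 1
        value = (mantissa / 2**10) * 2**(1 - 15)
    else:                                      # normal: implicit leading 1
        value = (1 + mantissa / 2**10) * 2**(exponent - 15)
    return -value if sign else value

x = fp16_value(0b0_01111_1000000000)   # exponent 15-15=0, mantissa 0.5 -> 1.5
```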
- Types of Quantization
- Symmetric Quantization – the FP grid aligns with the fixed-point grid around zero: x_q = round(x / s), with s = max|x| / (2^(b−1) − 1) and zero-point z = 0
- Absolute Maximum (ABSMAX) Quantization – x_q = round(x / max|x| × 127), mapping inputs to [-127, 127]; 127 is the largest signed integer representable in 8 bits.
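ABSMAX quantization in a few lines (illustrative values; note that the largest-magnitude input maps exactly to ±127 and dequantization recovers it almost exactly):

```python
def absmax_quantize(xs, bits=8):
    """Symmetric quantization: scale by the tensor's absolute maximum."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits
    scale = max(abs(x) for x in xs) / qmax
    q = [round(x / scale) for x in xs]
    dq = [v * scale for v in q]           # dequantize to inspect rounding error
    return q, dq

q, dq = absmax_quantize([0.4, -1.0, 0.2])
# -1.0 has the largest magnitude, so it maps to -127
```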
- Asymmetric Quantization – shifts the range with a zero-point so that skewed distributions (e.g. post-ReLU activations, which are all non-negative) use the full integer grid; less error in dequantization but more calculation overhead
- Uniform Affine Quantization: has 3 quantization parameters – scale factor s, zero-point z, and bit-width b. Quantize: x_q = clamp(round(x / s) + z, 0, 2^b − 1), where s = (x_max − x_min) / (2^b − 1) and z = −round(x_min / s). Dequantization step: x ≈ s · (x_q − z)
- Naive 8-bit (zero-point) quantization – useful for the output of a ReLU, where values are all non-negative. Dequantization step: x ≈ s · (x_q − z)
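The affine (zero-point) scheme can be sketched with made-up values; dequantization recovers the inputs only approximately:

```python
def affine_quantize(xs, bits=8):
    """Uniform affine quantization to unsigned `bits`-bit integers."""
    qmax = 2 ** bits - 1
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / qmax
    zero_point = round(-lo / scale)
    q = [min(qmax, max(0, round(x / scale) + zero_point)) for x in xs]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [scale * (v - zero_point) for v in q]

q, s, z = affine_quantize([0.0, 1.0, 2.0, 5.1])
approx = dequantize(q, s, z)              # close to, not equal to, the inputs
```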
- Quantized inference: compute the output as Y_q = X_q (input) · W_q (weight) + B_q (bias), performing all operations with integer arithmetic, then dequantize the result back to floating point
- Different ways to perform Weight Quantization: Post-Training Quantization (PTQ): weight conversion happens after model is trained, Quantization-Aware Training (QAT): weight conversion process during pre-training/fine-tuning – computationally exp but enhances perf.
- vLLM: better-optimized LLM inference using PagedAttention: handles attention K and V by storing different heads and layers in blocks -> maps the blocks to non-contiguous physical memory -> the PagedAttention kernel fetches the KV blocks via a page table during attention
- Optimizing LLM Fine-tuning: Requires retraining all the model params (up to billions for LLMs), and increases the risk of overfitting especially when the new data is small or noisy -> Catastrophic forgetting effect
- Parameter Efficient Fine Tuning (PEFT) only updates a small subset of parameters (15-20%).
- Adapters: special submodules added to pre-trained LMs after the MHA/FF layers in transformers, to modify their hidden representation during fine-tuning. Has 2 FF projection layers connected by a non-linear activation layer, with a skip connection bypassing the FF layers
- LoRA: represent pre-trained over-parameterized models in lower dimensions (SVD intuition), then perform GD to optimize weights at lower rank: h = W₀x + ΔWx = W₀x + BAx, with B ∈ R^(d×r), A ∈ R^(r×k), and r ≪ min(d, k). Adds a secondary path that can be used with any fully connected layer; the attention weight matrices are the recommended target
- Maximize fine-tuning performance by using more weight matrices of a lower rank r rather than fewer of higher rank. The goldilocks zone for the rank r is 4 or 8
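A toy forward pass with the low-rank path (pure Python, row-vector convention h = xW0 + xAB; as in the LoRA setup, one factor is zero-initialized so the adapter starts as a no-op):

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_forward(x, W0, A, B):
    """h = x W0 + x (A B): frozen base weight plus a rank-r update."""
    base = matmul(x, W0)
    update = matmul(matmul(x, A), B)      # go through the r-dim bottleneck
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(base, update)]

d, r = 4, 2                               # r << d is the point of LoRA
random.seed(0)
W0 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen
A = [[random.gauss(0, 0.01) for _ in range(r)] for _ in range(d)]   # d x r
B = [[0.0] * d for _ in range(r)]         # r x d, zero init: no-op at start
x = [[1.0, 2.0, 3.0, 4.0]]
h = lora_forward(x, W0, A, B)             # equals x W0 while B is zero
```

Only A and B (2·d·r values) are trained instead of the d·d base matrix.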
- QLoRA (Quantized Low-Rank Adaptation): upgrades LoRA through the use of 4-bit NormalFloat (NF4) for the normally distributed weights in NNs, double quantization (quantizing the quantization constants), and paged optimizers to manage memory spikes
- Transforms all weights to a single fixed distribution by scaling with the SD into the range [-1, 1] (mean 0, SD 1). Quantization levels are chosen so that each bucket holds an approximately equal number of weight values per block of block_size (b) elements
- Other fine-tuning methods
- Prefix-tuning: freezes the LM params and optimizes a task-specific prefix -> a set of free params trained with the LM that subsequent tokens can attend to. Extrapolates better to examples with unseen topics, and outperforms fine-tuning in low-data settings
- Prompt tuning: store a small task-specific prompt for each task as trainable word embeddings. Outperforms few-shot learning of GPT-3 and becomes more competitive as the model size increases -> robust model transfer
