LLM Cheatsheet: Essential Concepts and Architectures
Posted on Nov 18, 2025 in Mathematics and Computer Science
Everything you need to recall about Large Language Models (LLMs) — summarized on one page.
What is a Large Language Model (LLM)?
A Large Language Model (LLM) is a deep learning model trained on vast amounts of text data to understand, generate, and manipulate human language.
- Examples: GPT-4, Claude, LLaMA, Gemini, Mistral, Falcon
- Core Technology: The Transformer Architecture
LLM Core Components
| Component | Description |
|---|---|
| Tokenizer | Converts text into tokens (subwords, words) for numerical input. |
| Embedding | Maps tokens to vectors in high-dimensional space. |
| Transformer | Uses attention mechanisms to understand context and sequence relationships. |
| Decoder | Generates the next tokens based on context (e.g., in GPT models). |
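The tokenizer → embedding steps in the table can be sketched with a toy example. The vocabulary, embedding values, and dimension below are invented for illustration; real LLMs use learned subword vocabularies (e.g., BPE or SentencePiece) and embedding dimensions in the thousands.

```python
# Toy tokenizer -> embedding pipeline (vocabulary and vectors are made up).
import random

vocab = {"<unk>": 0, "large": 1, "language": 2, "models": 3, "are": 4, "fun": 5}

def tokenize(text: str) -> list[int]:
    """Map whitespace-separated words to integer token IDs."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

random.seed(0)
dim = 4  # embedding dimension (real models use hundreds or thousands)
embedding_table = [[random.uniform(-1, 1) for _ in range(dim)] for _ in vocab]

def embed(token_ids: list[int]) -> list[list[float]]:
    """Look up a dense vector for each token ID."""
    return [embedding_table[t] for t in token_ids]

ids = tokenize("Large language models are fun")
vectors = embed(ids)  # one 4-dimensional vector per token
```

Unknown words fall back to the `<unk>` token, which is why real tokenizers prefer subwords: they can compose rare words from known pieces instead of discarding them.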
Key Concepts in LLM Architecture
| Term | Meaning |
|---|---|
| Attention | Focuses on relevant parts of the input (via learned weights). |
| Self-Attention | Mechanism where each word attends to every other word in the sequence. |
| Positional Encoding | Adds order information to input tokens (since Transformers are permutation-invariant). |
| Parameters | Weights learned during training (e.g., GPT-3: 175B). |
| Context Window | Maximum number of tokens the model can process simultaneously (e.g., GPT-4: ~128k tokens). |
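Self-attention from the table above can be demonstrated in a few lines of pure Python. This is scaled dot-product attention with Q = K = V set to the raw token vectors for simplicity; real Transformer layers first apply learned projection matrices (W_Q, W_K, W_V), which this sketch omits.

```python
# Minimal scaled dot-product self-attention: each position becomes a
# softmax-weighted mix of every position's value vector.
import math

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x: list[list[float]]) -> list[list[float]]:
    """x: token vectors; Q = K = V = x here (no learned projections)."""
    d = len(x[0])
    out = []
    for q in x:
        # similarity of this query against every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)  # weights sum to 1 across positions
        # weighted sum of value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = self_attention(tokens)  # same shape as input, context-mixed
```

Note that nothing here depends on token order, which is exactly why positional encodings are needed: without them, a permutation of the input rows simply permutes the output rows.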
How Large Language Models Work (Simplified)
- Input text is tokenized.
- Tokens are converted to embeddings.
- Embeddings are passed through a stack of Transformer layers (attention plus feed-forward blocks).
- The model predicts the next token (autoregressive generation).
- The process repeats until a stop condition or maximum length is reached.
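The loop described above can be sketched as follows. `fake_model` is a stand-in for a real next-token predictor (it just returns canned continuations); the point is the control flow: predict, append, repeat until a stop token or the length limit.

```python
# Sketch of the autoregressive generation loop.

def fake_model(tokens: list[str]) -> str:
    """Toy 'model': deterministic canned continuations for the demo."""
    continuations = {"Once": "upon", "upon": "a", "a": "time", "time": "<eos>"}
    return continuations.get(tokens[-1], "<eos>")

def generate(prompt: list[str], max_tokens: int = 10) -> list[str]:
    tokens = list(prompt)
    for _ in range(max_tokens):
        next_token = fake_model(tokens)  # predict the next token
        if next_token == "<eos>":        # stop condition reached
            break
        tokens.append(next_token)        # feed the longer sequence back in
    return tokens

result = generate(["Once"])  # ['Once', 'upon', 'a', 'time']
```

Because each step re-reads the whole sequence, generation cost grows with output length, which is one reason long context windows are expensive at inference time.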
Common LLM Architectures
- Decoder-only: Used primarily for generation (e.g., GPT, LLaMA).
- Encoder-only: Used for understanding tasks like classification or search (e.g., BERT).
- Encoder-Decoder: Used for sequence-to-sequence tasks like translation or summarization (e.g., T5, BART).
LLM Training Phases
| Phase | Goal |
|---|---|
| Pretraining | Self-supervised next-token prediction on massive text datasets (web, books, code). |
| Finetuning | Task-specific tuning using labeled data (e.g., summarization, Q&A). |
| RLHF (Reinforcement Learning from Human Feedback) | Aligns model behavior with human preferences via a learned reward model. |
Important Generation Parameters
| Hyperparameter | Role |
|---|---|
| Temperature | Controls randomness (lower values → more deterministic/precise output; higher values → more diverse/creative output). |
| Top-K | The model selects the next token only from the top K most probable tokens. |
| Top-P (Nucleus Sampling) | The model samples from the smallest set of tokens whose cumulative probability reaches the threshold p (0 < p ≤ 1). |
| Max Tokens | Sets the limit on the response length generated by the model. |
| Stop Tokens | Specific tokens that, when generated, immediately end the sequence. |
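These knobs interact in one sampling step, which the sketch below makes concrete. The five-token `logits` are invented; the flow (temperature-scale → softmax → top-k cut → top-p cut → sample) mirrors how common inference stacks apply these filters, though real implementations differ in detail.

```python
# Toy next-token sampling combining temperature, top-k, and top-p.
import math
import random

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    # Temperature rescales logits before softmax (lower -> sharper).
    scaled = [l / max(temperature, 1e-8) for l in logits]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]

    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k is not None:
        ranked = ranked[:top_k]          # keep only the K most probable
    if top_p is not None:                # nucleus: smallest set with mass >= p
        kept, mass = [], 0.0
        for i in ranked:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        ranked = kept

    kept_probs = [probs[i] for i in ranked]
    total = sum(kept_probs)              # renormalize over survivors
    return rng.choices(ranked, weights=[p / total for p in kept_probs])[0]

logits = [2.0, 1.0, 0.5, 0.1, -1.0]
token = sample(logits, temperature=0.7, top_k=3, top_p=0.9)
```

With `top_k=3`, tokens 3 and 4 can never be sampled regardless of temperature, which is the practical difference between the two filters: top-k fixes the candidate count, top-p fixes the candidate probability mass.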
Prompt Engineering Techniques
- Zero-shot Prompting: Direct task instruction without providing any examples.
- Few-shot Prompting: Providing the task instruction along with several input/output examples.
- Chain-of-Thought (CoT): Instructing the model to reason step-by-step before answering.
- ReAct: Combining Reasoning (CoT) and Action (using external tools).
- System Prompt: Setting the initial behavior, persona, and constraints for the model.
- Temperature: Lower temperature yields precise results; higher temperature yields creative results.
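The first three techniques are, at bottom, string construction. A minimal sketch, with an invented sentiment-classification task and made-up examples:

```python
# Building zero-shot, few-shot, and chain-of-thought prompts as plain strings.

system_prompt = "You are a concise sentiment classifier."

# Zero-shot: instruction only, no examples.
zero_shot = "Classify the sentiment of: 'I loved this movie.'"

# Few-shot: prepend a handful of input/output demonstrations.
examples = [
    ("The food was cold.", "negative"),
    ("What a wonderful day!", "positive"),
]
few_shot = "\n".join(f"Text: {t}\nSentiment: {s}" for t, s in examples)
few_shot += "\nText: 'I loved this movie.'\nSentiment:"

# Chain-of-thought: ask for intermediate reasoning before the answer.
cot = zero_shot + " Think step by step before giving the final label."

# The system prompt is typically sent separately, but conceptually it
# just frames everything that follows.
full_prompt = f"{system_prompt}\n\n{few_shot}"
```

The few-shot examples teach the output format as much as the task: ending on `Sentiment:` nudges the model to complete with a single label.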
LLM Customization Methods: Fine-tuning, RAG, Prompting
| Method | Primary Use Case |
|---|---|
| Prompting | Using the base model effectively via clever input design. |
| Fine-tuning | Training new model weights on specific, task-oriented data. |
| RAG (Retrieval-Augmented Generation) | Combining the LLM with external, up-to-date knowledge sources (e.g., documents, databases) via search. |
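The RAG row can be illustrated end to end with a toy retriever. The documents are invented, and bag-of-words cosine similarity stands in for the dense vector search (embeddings + vector database) that production RAG systems use.

```python
# Minimal RAG flow: retrieve the most similar document, then prepend
# it to the prompt as context.
import math
from collections import Counter

docs = [
    "The Eiffel Tower is in Paris and was completed in 1889.",
    "Python is a programming language created by Guido van Rossum.",
]

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str) -> str:
    """Return the document most similar to the query (bag-of-words)."""
    q = Counter(query.lower().split())
    return max(docs, key=lambda d: cosine(q, Counter(d.lower().split())))

def build_prompt(query: str) -> str:
    context = retrieve(query)
    return f"Context: {context}\n\nQuestion: {query}\nAnswer:"

prompt = build_prompt("Who created Python?")
```

The LLM itself is unchanged; RAG moves the knowledge problem into the prompt, which is why it is the usual answer to stale or private data without retraining.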
Understanding Memory in LLMs
- Stateless: The model has no memory of previous chat turns outside the current prompt.
- Contextual Memory: Past conversation history is included within the current prompt (limited by context window size).
- Long-term Memory: Achieved via custom extensions or tools that store and retrieve past interactions (e.g., vector databases).
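Contextual memory is literally string concatenation under a budget. A sketch, with an artificially tiny window (measured in whitespace-split words rather than real tokens) so the truncation logic is visible:

```python
# Contextual memory: replay chat history inside each new prompt,
# dropping the oldest turns once the (toy) window budget is exceeded.

MAX_TOKENS = 30  # pretend context window, counted in whitespace words

history: list[tuple[str, str]] = []  # (role, message) pairs

def add_turn(role: str, message: str) -> None:
    history.append((role, message))

def build_prompt(user_message: str) -> str:
    turns = history + [("user", user_message)]
    # Drop oldest turns until the prompt fits the window.
    while sum(len(m.split()) for _, m in turns) > MAX_TOKENS and len(turns) > 1:
        turns.pop(0)
    return "\n".join(f"{role}: {msg}" for role, msg in turns)

add_turn("user", "My name is Ada.")
add_turn("assistant", "Nice to meet you, Ada!")
prompt = build_prompt("What is my name?")  # history still fits, so Ada survives
```

Once the window overflows, the oldest turns vanish from the model's view entirely, which is precisely the failure mode that long-term memory (summarization, vector-store retrieval) is meant to patch.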
Key Limitations of LLMs
- Hallucinations: Generating confident but factually incorrect information.
- Lack of real-world awareness (unless external tools or RAG are utilized).
- Can reflect biases present in the training data.
- Costly to train and slower performance with longer context windows.
LLM Evaluation Metrics
| Metric | Description |
|---|---|
| Perplexity | Measures how well the model predicts the next token (lower is better). |
| BLEU/ROUGE | Used to measure the similarity of generated text to reference text (common in translation/summarization). |
| Accuracy/F1 Score | Standard metrics used for classification and structured prediction tasks. |
| Human Evaluation | Actual human quality rating based on relevance, coherence, and helpfulness. |
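Perplexity from the table has a one-line definition worth internalizing: the exponential of the mean negative log-likelihood of the tokens. The per-token probabilities below are invented to show the contrast.

```python
# Perplexity = exp(mean negative log-likelihood) over predicted tokens.
import math

def perplexity(token_probs: list[float]) -> float:
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

confident = perplexity([0.9, 0.8, 0.95])  # model predicted well -> low value
uncertain = perplexity([0.1, 0.2, 0.05])  # model predicted poorly -> high value
```

A useful sanity check: a model that assigns every token probability 1/N has perplexity exactly N, so perplexity reads as "the model is as confused as if it were choosing uniformly among this many options."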
Practical LLM Use Cases
- Chatbots and Writing Assistants
- Code Generation (e.g., using models like Codex or DeepSeek)
- Document Summarization and Information Extraction
- Search and Retrieval Systems
- Autonomous Agents and Workflow Automation
- Data Analysis, SQL Generation, and Mathematical Reasoning
LLM Ecosystem: Popular Tools and Frameworks
| Tool/Library | Primary Use Case |
|---|---|
| Hugging Face | Platform for hosting, sharing, and training models. |
| LangChain | Framework for connecting LLMs with tools, memory, and data sources. |
| LlamaIndex | Specialized framework for Retrieval-Augmented Generation (RAG). |
| OpenAI API | Programmatic access to models like GPT-3.5 and GPT-4. |
| Transformers (HF) | Python library for using pretrained models in PyTorch, TensorFlow, or JAX. |
| Gradio/Streamlit | Tools for quickly building interactive LLM user interfaces (UIs). |
LLM Agents: Combining LLMs with Tools
Agents utilize LLMs to make complex decisions, plan workflows, and take actions in dynamic environments.
- Agents use external tools (e.g., calculators, web search APIs).
- They follow structured plans (e.g., ReAct, Tree-of-Thought (ToT)).
- They can chain multiple steps, utilize memory, and incorporate feedback loops.
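The ReAct-style loop described above can be sketched with a scripted stand-in for the LLM. The tool name, the `Thought`/`Action`/`Final Answer` format, and the canned turns are invented for the demo; the structure (parse the model's action, run the tool, feed the observation back) is the core of the pattern.

```python
# Toy ReAct-style agent loop with one tool and a scripted 'LLM'.

def calculator(expression: str) -> str:
    # Restricted eval for the demo; a real agent would sandbox this properly.
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS = {"calculator": calculator}

scripted_llm_turns = iter([
    "Thought: I need to compute 17 * 23.\nAction: calculator[17 * 23]",
    "Thought: I have the result.\nFinal Answer: 391",
])

def run_agent(max_steps: int = 5) -> str:
    observation = ""
    for _ in range(max_steps):
        turn = next(scripted_llm_turns)   # a real agent would call the LLM
        if "Final Answer:" in turn:
            return turn.split("Final Answer:")[1].strip()
        tool, arg = turn.split("Action: ")[1].split("[", 1)
        observation = TOOLS[tool](arg.rstrip("]"))  # fed back on the next turn
    return "gave up"

answer = run_agent()
```

The `max_steps` cap matters in practice: without it, a model that never emits a final answer loops (and bills) forever.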
Quick Recap Mnemonic: TAP-TOP-RAG
A simple way to remember key LLM concepts:
- Tokenizer → Attention → Prompts
- Temperature → Output control → Plans (ReAct)
- Retrieval → Agents → Generation