LLM Fundamentals

Understand the core concepts, architectures, training methods, and capabilities of Large Language Models that power modern AI systems.

45 min read · Intermediate

🧠 What are Large Language Models?

Large Language Models (LLMs) are neural networks trained on vast amounts of text data to understand and generate human-like language. They are the technology behind ChatGPT, GPT-4, Claude, and many other modern AI systems.

Scale

Billions to trillions of parameters trained on massive text corpora

Capabilities

Text generation, reasoning, code writing, translation, and more

Training

Self-supervised next-token prediction at scale, followed by alignment with human feedback

🏗️ LLM Architectures

Transformer Architecture

Self-attention based architecture powering modern LLMs

Key Components

  • Multi-Head Attention
  • Feed-Forward Networks
  • Layer Normalization
  • Positional Encoding

Advantages

  • Parallelizable training
  • Long-range dependencies
  • Scalable to large sizes

Implementation Example

import torch
import torch.nn as nn
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Calculate attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        attention_weights = torch.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, V)
        
        return output, attention_weights
    
    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        
        # Linear projections
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        
        # Apply attention
        attention_output, attention_weights = self.scaled_dot_product_attention(Q, K, V, mask)
        
        # Concatenate heads
        attention_output = attention_output.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model)
        
        return self.W_o(attention_output)

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_output = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        
        # Feed-forward with residual connection
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        
        return x

# Example: GPT-style decoder-only model
class GPTModel(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, max_length):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_length, d_model)
        
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_model * 4)
            for _ in range(num_layers)
        ])
        
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
    
    def forward(self, input_ids):
        seq_length = input_ids.size(1)
        
        # Create causal mask
        mask = torch.tril(torch.ones(seq_length, seq_length, device=input_ids.device)).unsqueeze(0).unsqueeze(0)
        
        # Embeddings
        positions = torch.arange(0, seq_length, device=input_ids.device).expand(input_ids.size(0), seq_length)
        x = self.token_embedding(input_ids) + self.position_embedding(positions)
        
        # Transformer blocks
        for block in self.transformer_blocks:
            x = block(x, mask)
        
        # Final layer norm and output projection
        x = self.ln_f(x)
        logits = self.head(x)
        
        return logits
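
For a quick sanity check, here is a minimal usage sketch of the GPTModel defined above; the vocabulary size, dimensions, and random token IDs are illustrative assumptions rather than settings from any real model.

# Minimal usage sketch (hyperparameters and token IDs are illustrative assumptions)
model = GPTModel(
    vocab_size=50257,  # assumed GPT-2-style vocabulary size
    d_model=256,       # small dimensions so the example runs quickly on CPU
    num_heads=8,
    num_layers=4,
    max_length=512
)

# Random token IDs standing in for a tokenized batch: (batch_size=2, seq_len=16)
input_ids = torch.randint(0, 50257, (2, 16))

logits = model(input_ids)
print(logits.shape)  # torch.Size([2, 16, 50257]): one next-token distribution per position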

🎯 Training Phases

Pre-Training

Objective

Next token prediction (autoregressive language modeling)

Data

Massive unlabeled text (books, web pages, articles)

Scale

Hundreds of billions to trillions of tokens

Implementation

# Pre-training objective: Predict next token
import torch
import torch.nn.functional as F

def pretraining_loss(model, input_ids):
    """
    Autoregressive language modeling loss
    """
    # Shift inputs: predict token at position i+1 given tokens 0..i
    inputs = input_ids[:, :-1]  # Remove last token
    targets = input_ids[:, 1:]  # Remove first token
    
    # Forward pass
    logits = model(inputs)  # (batch_size, seq_len, vocab_size)
    
    # Compute cross-entropy loss
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100  # Positions labeled -100 (e.g., padding) are excluded from the loss
    )
    
    return loss

# Training characteristics:
# - Self-supervised learning (no labels needed)
# - Massive scale (GPT-3: 300B tokens, PaLM: 780B tokens)
# - Emergent capabilities appear at scale
# - Foundation for all downstream tasks

class PreTrainingDataset:
    def __init__(self, tokenizer, block_size=1024):
        self.tokenizer = tokenizer
        self.block_size = block_size
    
    def tokenize_and_chunk(self, text):
        """Convert text to training examples"""
        tokens = self.tokenizer.encode(text)
        
        # Create chunks of block_size + 1 tokens (inputs plus their shifted next-token targets)
        examples = []
        for i in range(0, len(tokens) - self.block_size, self.block_size):
            chunk = tokens[i:i + self.block_size + 1]  # +1 for target
            examples.append(torch.tensor(chunk))
        
        return examples
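
To tie the pieces together, below is a minimal pre-training loop sketch that combines the GPTModel from the architecture section with pretraining_loss and PreTrainingDataset; the tokenizer, corpus, and hyperparameters are placeholder assumptions.

# Minimal pre-training loop sketch (tokenizer, corpus, and hyperparameters are assumptions)
from torch.utils.data import DataLoader

def pretrain(model, examples, epochs=1, lr=3e-4, batch_size=8):
    """Run a toy pre-training loop over pre-chunked token tensors."""
    loader = DataLoader(examples, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    model.train()
    for epoch in range(epochs):
        for input_ids in loader:
            optimizer.zero_grad()
            loss = pretraining_loss(model, input_ids)  # next-token prediction loss
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch}: loss {loss.item():.4f}")

# Example wiring (assumes a `tokenizer` with .encode() and a raw text corpus `raw_text`):
# dataset = PreTrainingDataset(tokenizer, block_size=1024)
# examples = dataset.tokenize_and_chunk(raw_text)
# pretrain(GPTModel(vocab_size, d_model=256, num_heads=8, num_layers=4, max_length=1024), examples)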

⚡ Core Capabilities

Text Generation

Generate coherent, contextually relevant text

Applications

  • Creative writing
  • Code generation
  • Documentation
  • Summarization

Implementation

# Text generation with different strategies
import torch
import torch.nn.functional as F

def greedy_decode(model, input_ids, max_length=100, eos_token_id=None):
    """Generate text using greedy decoding"""
    generated = input_ids.clone()
    
    for _ in range(max_length):
        with torch.no_grad():
            outputs = model(generated)
            logits = outputs[:, -1, :]  # Last token logits
            next_token = torch.argmax(logits, dim=-1, keepdim=True)
            generated = torch.cat([generated, next_token], dim=1)
            
            # Stop if the end-of-sequence token is generated
            if eos_token_id is not None and next_token.item() == eos_token_id:
                break
    
    return generated

def nucleus_sampling(model, input_ids, max_length=100, top_p=0.9, temperature=1.0):
    """Generate text using nucleus (top-p) sampling"""
    generated = input_ids.clone()
    
    for _ in range(max_length):
        with torch.no_grad():
            outputs = model(generated)
            logits = outputs[:, -1, :] / temperature
            
            # Apply nucleus sampling
            sorted_logits, sorted_indices = torch.sort(logits, descending=True)
            cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
            
            # Remove tokens with cumulative probability above threshold
            sorted_indices_to_remove = cumulative_probs > top_p
            sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
            sorted_indices_to_remove[..., 0] = 0
            
            indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
            logits[indices_to_remove] = -float('inf')
            
            # Sample from the filtered distribution
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            generated = torch.cat([generated, next_token], dim=1)
    
    return generated

# Advanced generation techniques:
# - Beam search: Maintain multiple hypotheses
# - Contrastive search: Balance coherence and diversity
# - Typical sampling: Sample from "typical" probability mass
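
Both decoding functions can be exercised on the small GPTModel instance from the architecture section; the prompt below is a random stand-in for tokenized text, so the output is meaningless and the sketch only illustrates the calling convention.

# Usage sketch (reuses the small GPTModel instance defined earlier; prompt IDs are random stand-ins)
prompt_ids = torch.randint(0, 50257, (1, 8))

# Deterministic: always extends with the single most probable next token
greedy_out = greedy_decode(model, prompt_ids, max_length=20, eos_token_id=None)

# Stochastic: samples from the smallest token set covering 90% of the probability mass
sampled_out = nucleus_sampling(model, prompt_ids, max_length=20, top_p=0.9, temperature=0.8)

print(greedy_out.shape, sampled_out.shape)  # both grow to (1, 8 + 20) tokens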

📈 Scaling Laws & Emergent Abilities

Scaling Laws

  • Model size: More parameters → better performance
  • Data scale: More training data → better generalization
  • Compute budget: More compute → higher quality models
  • Power-law relationships: Loss improves along predictable power-law curves (see the sketch after this list)
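
As a rough illustration of the power-law form, the sketch below evaluates a Chinchilla-style loss estimate L(N, D) = E + A/N^alpha + B/D^beta as parameter count N and token count D grow; the coefficients are illustrative assumptions, not constants fitted to real training runs.

# Illustrative Chinchilla-style scaling law: loss = E + A / N**alpha + B / D**beta
# (N = parameters, D = training tokens; coefficient values below are assumptions)
def scaling_law_loss(N, D, E=1.7, A=400.0, alpha=0.34, B=410.0, beta=0.28):
    return E + A / N**alpha + B / D**beta

for N, D in [(1e9, 2e10), (1e10, 2e11), (1e11, 2e12)]:
    print(f"N={N:.0e}, D={D:.0e} -> predicted loss {scaling_law_loss(N, D):.3f}")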

Emergent Abilities

  • Chain-of-thought reasoning: Tends to appear around the ~10B-parameter scale
  • In-context learning: Few-shot task adaptation (see the prompt example after this list)
  • Complex instruction following: Multi-step task execution
  • Code generation: Programming in multiple languages
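
A small sketch of in-context learning as a raw prompt string; the task and reviews are invented for illustration, and no model weights are updated.

# Few-shot prompt sketch: the task is "learned" from examples in the context alone
# (the reviews and labels here are made up for illustration)
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The battery lasts all day and the screen is gorgeous."
Sentiment: Positive

Review: "It stopped working after a week and support never replied."
Sentiment: Negative

Review: "Setup took five minutes and everything just worked."
Sentiment:"""

# A capable LLM typically completes this with "Positive" despite never being
# fine-tuned on this exact task: adaptation happens purely in context.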

Key Insight

Many capabilities are not explicitly trained for but emerge with scale. This suggests that next-token language modeling is a powerful objective that captures many aspects of intelligence and reasoning.

⚠️ Limitations & Challenges

Current Limitations

  • Hallucination: Generate plausible but false information
  • Knowledge cutoff: Limited to training data time period
  • Context limits: Finite context window for long documents
  • Computational cost: Expensive inference and training
  • Alignment challenges: Difficult to ensure desired behavior

Active Research Areas

  • Retrieval-augmented generation: Connect to external knowledge
  • Tool use: Integration with calculators, APIs, databases
  • Constitutional AI: Self-improving safety mechanisms
  • Multimodal models: Text, images, audio, video
  • Efficient architectures: Mixture of experts, sparse models

🎯 Key Takeaways

Transformers revolutionized NLP: Self-attention enables parallel processing and captures long-range dependencies

Scale matters: Many capabilities emerge only at large model sizes and training scales

Training is multi-phase: Pre-training provides the foundation; fine-tuning and RLHF align the model with human preferences (see the reward-model sketch below)
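
The RLHF step referenced here is not shown in code elsewhere in this section; as a rough sketch under simplifying assumptions, the pairwise (Bradley-Terry) objective commonly used to train the reward model looks like the following, where the reward_model interface (token IDs in, scalar reward out) is hypothetical.

# Sketch of a pairwise reward-model loss used in RLHF
# (reward_model is a hypothetical module mapping token IDs to a scalar reward per sequence)
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry style objective: the human-preferred response should score higher."""
    r_chosen = reward_model(chosen_ids)      # rewards for preferred responses
    r_rejected = reward_model(rejected_ids)  # rewards for rejected responses
    # Maximize the probability that the preferred response outranks the rejected one
    return -F.logsigmoid(r_chosen - r_rejected).mean()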

In-context learning is powerful: Models can adapt to new tasks with just examples in the prompt

Challenges remain: Hallucination, alignment, and computational efficiency are active research areas

📝 LLM Fundamentals Mastery Check

What is the primary training objective for pre-training LLMs?