
LLM Fundamentals

Understand the core concepts, architectures, training methods, and capabilities of Large Language Models that power modern AI systems.

45 min read · Intermediate

🧠 What are Large Language Models?

Large Language Models (LLMs) are neural networks trained on vast amounts of text data to understand and generate human-like language. They represent a breakthrough in AI that has enabled applications like ChatGPT, GPT-4, Claude, and numerous other AI systems.

Scale

Billions to trillions of parameters trained on massive text corpora

Capabilities

Text generation, reasoning, code writing, translation, and more

Training

Self-supervised next-token prediction on unlabeled text, followed by alignment with human feedback (e.g., RLHF)
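As a minimal sketch of what next-token prediction means (using a toy whitespace tokenizer for illustration; real LLMs use subword tokenizers such as BPE), each training example pairs a prefix with the token that follows it:

# Toy illustration of next-token prediction targets
text = "the cat sat on the mat"
tokens = text.split()  # stand-in for a real subword tokenizer

for i in range(1, len(tokens)):
    prefix, target = tokens[:i], tokens[i]
    print(prefix, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ... the model is trained to assign high probability to each target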

🏗️ LLM Architectures

Transformer Architecture

Self-attention based architecture powering modern LLMs

Key Components

  • Multi-Head Attention
  • Feed-Forward Networks
  • Layer Normalization
  • Positional Encoding

Advantages

  • Parallelizable training
  • Captures long-range dependencies
  • Scales to very large model sizes

Implementation Example

import torch
import torch.nn as nn
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Calculate attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        attention_weights = torch.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, V)
        
        return output, attention_weights
    
    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        
        # Linear projections
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        
        # Apply attention
        attention_output, attention_weights = self.scaled_dot_product_attention(Q, K, V, mask)
        
        # Concatenate heads
        attention_output = attention_output.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model)
        
        return self.W_o(attention_output)

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_output = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        
        # Feed-forward with residual connection
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        
        return x

# Example: GPT-style decoder-only model
class GPTModel(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, max_length):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_length, d_model)
        
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_model * 4)
            for _ in range(num_layers)
        ])
        
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
    
    def forward(self, input_ids):
        seq_length = input_ids.size(1)
        device = input_ids.device
        
        # Causal mask: each position may only attend to itself and earlier positions
        mask = torch.tril(torch.ones(seq_length, seq_length, device=device)).unsqueeze(0).unsqueeze(0)
        
        # Token + learned position embeddings
        positions = torch.arange(0, seq_length, device=device).expand(input_ids.size(0), seq_length)
        x = self.token_embedding(input_ids) + self.position_embedding(positions)
        
        # Transformer blocks
        for block in self.transformer_blocks:
            x = block(x, mask)
        
        # Final layer norm and output projection
        x = self.ln_f(x)
        logits = self.head(x)
        
        return logits
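
The component list above mentions positional encoding, while the GPT-style example uses learned position embeddings. For reference, here is a sketch of the fixed sinusoidal encoding from the original Transformer paper, which can serve as a drop-in alternative (assumes an even d_model):

import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Fixed sin/cos positional encoding from 'Attention Is All You Need'."""
    def __init__(self, d_model, max_length=5000):
        super().__init__()
        position = torch.arange(max_length).unsqueeze(1)  # (max_length, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_length, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        self.register_buffer("pe", pe)
    
    def forward(self, x):
        # x: (batch_size, seq_length, d_model)
        return x + self.pe[:x.size(1)]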

🎯 Training Phases

Pre-training

Objective

Next-token prediction (autoregressive language modeling)

Data

Massive unlabeled text (books, web pages, articles)

Scale

Hundreds of billions to trillions of tokens

Implementation

# Pre-training objective: Predict next token
import torch
import torch.nn.functional as F

def pretraining_loss(model, input_ids):
    """
    Autoregressive language modeling loss
    """
    # Shift inputs: predict token at position i+1 given tokens 0..i
    inputs = input_ids[:, :-1]  # Remove last token
    targets = input_ids[:, 1:]  # Remove first token
    
    # Forward pass
    logits = model(inputs)  # (batch_size, seq_len, vocab_size)
    
    # Compute cross-entropy loss
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100  # Ignore padding tokens
    )
    
    return loss

# Training characteristics:
# - Self-supervised learning (no labels needed)
# - Massive scale (GPT-3: 300B tokens, PaLM: 780B tokens)
# - Emergent capabilities appear at scale
# - Foundation for all downstream tasks

class PreTrainingDataset:
    def __init__(self, tokenizer, block_size=1024):
        self.tokenizer = tokenizer
        self.block_size = block_size
    
    def tokenize_and_chunk(self, text):
        """Convert text to training examples"""
        tokens = self.tokenizer.encode(text)
        
        # Split into chunks of block_size + 1 tokens (inputs plus shifted targets)
        examples = []
        for i in range(0, len(tokens) - self.block_size, self.block_size):
            chunk = tokens[i:i + self.block_size + 1]  # +1 for target
            examples.append(torch.tensor(chunk))
        
        return examples
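
A minimal sketch of how the pieces above could be wired into a training step (the hyperparameters are illustrative only; batch is a tensor of shape (batch_size, block_size + 1) built by stacking chunks from PreTrainingDataset):

# Tiny illustrative configuration; real LLMs are orders of magnitude larger
model = GPTModel(vocab_size=50257, d_model=256, num_heads=8,
                 num_layers=4, max_length=1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

def train_step(batch):
    """One optimization step on a batch of token chunks."""
    optimizer.zero_grad()
    loss = pretraining_loss(model, batch)
    loss.backward()
    optimizer.step()
    return loss.item()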

⚡ Core Capabilities

Text Generation

Generate coherent, contextually relevant text

Applications

  • Creative writing
  • Code generation
  • Documentation
  • Summarization

Implementation

# Text generation with different strategies
import torch
import torch.nn.functional as F

def greedy_decode(model, input_ids, max_length=100, eos_token_id=None):
    """Generate text using greedy decoding"""
    generated = input_ids.clone()
    
    for _ in range(max_length):
        with torch.no_grad():
            outputs = model(generated)
            logits = outputs[:, -1, :]  # Last token logits
            next_token = torch.argmax(logits, dim=-1, keepdim=True)
            generated = torch.cat([generated, next_token], dim=1)
            
            # Stop if the EOS token was generated (assumes batch size 1)
            if eos_token_id is not None and next_token.item() == eos_token_id:
                break
    
    return generated

def nucleus_sampling(model, input_ids, max_length=100, top_p=0.9, temperature=1.0):
    """Generate text using nucleus (top-p) sampling"""
    generated = input_ids.clone()
    
    for _ in range(max_length):
        with torch.no_grad():
            outputs = model(generated)
            logits = outputs[:, -1, :] / temperature
            
            # Apply nucleus sampling
            sorted_logits, sorted_indices = torch.sort(logits, descending=True)
            cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
            
            # Remove tokens with cumulative probability above threshold
            sorted_indices_to_remove = cumulative_probs > top_p
            sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
            sorted_indices_to_remove[..., 0] = 0
            
            indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
            logits[indices_to_remove] = -float('inf')
            
            # Sample from the filtered distribution
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            generated = torch.cat([generated, next_token], dim=1)
    
    return generated

# Advanced generation techniques:
# - Beam search: Maintain multiple hypotheses
# - Contrastive search: Balance coherence and diversity
# - Typical sampling: Sample from "typical" probability mass
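
A short usage sketch for the two decoders above, assuming model is a GPT-style language model; the token ids and eos_token_id are placeholders for whatever tokenizer the model was trained with:

# Hypothetical prompt ids; in practice these come from the model's tokenizer
prompt_ids = torch.tensor([[464, 3290, 318]])  # shape: (batch_size=1, seq_len=3)

greedy_output = greedy_decode(model, prompt_ids, max_length=50, eos_token_id=50256)
sampled_output = nucleus_sampling(model, prompt_ids, max_length=50, top_p=0.9, temperature=0.8)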

📈 Scaling Laws & Emergent Abilities

Scaling Laws

  • Model size: More parameters → better performance
  • Data scale: More training data → better generalization
  • Compute budget: More compute → higher quality models
  • Power law relationships: Predictable scaling curves
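
To make the power-law point concrete, here is a minimal sketch of the model-size term of the Kaplan et al. (2020) scaling law, L(N) ≈ (N_c / N)^α_N; the constants are approximate published values and are used purely for illustration:

# Sketch of a scaling-law curve: loss as a power law in parameter count
# Approximate constants from Kaplan et al. (2020), for illustration only
alpha_N = 0.076   # power-law exponent for model size
N_c = 8.8e13      # critical parameter count

def predicted_loss(num_params):
    """Predicted cross-entropy loss (in nats) for a given non-embedding parameter count."""
    return (N_c / num_params) ** alpha_N

for n in [1e8, 1e9, 1e10, 1e11, 1e12]:
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.2f} nats")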

Emergent Abilities

  • Chain-of-thought reasoning: Emerges only in models with tens to hundreds of billions of parameters
  • In-context learning: Few-shot task adaptation
  • Complex instruction following: Multi-step task execution
  • Code generation: Programming in multiple languages

Key Insight

Many capabilities are not explicitly trained but emerge from scale. This suggests that language modeling is a powerful objective that captures many aspects of intelligence and reasoning.

⚠️ Limitations & Challenges

Current Limitations

  • Hallucination: Generate plausible but false information
  • Knowledge cutoff: Limited to training data time period
  • Context limits: Finite context window for long documents
  • Computational cost: Expensive inference and training
  • Alignment challenges: Difficult to ensure desired behavior

Active Research Areas

  • Retrieval-augmented generation: Connect to external knowledge
  • Tool use: Integration with calculators, APIs, databases
  • Constitutional AI: Self-improving safety mechanisms
  • Multimodal models: Text, images, audio, video
  • Efficient architectures: Mixture of experts, sparse models
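
As one concrete illustration, retrieval-augmented generation can be sketched in a few lines; search_index and generate below are hypothetical stand-ins for a real vector store and an LLM call:

def retrieval_augmented_answer(question, search_index, generate, k=3):
    """Minimal RAG sketch: retrieve supporting passages, then condition generation on them."""
    passages = search_index(question, k=k)  # hypothetical: returns top-k relevant text passages
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)  # hypothetical LLM call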

🎯 Key Takeaways

Transformers revolutionized NLP: Self-attention enables parallel processing and captures long-range dependencies

Scale matters: Many capabilities emerge only at large model sizes and training scales

Training is multi-phase: Pre-training provides foundation, fine-tuning and RLHF align with human preferences

In-context learning is powerful: Models can adapt to new tasks with just examples in the prompt

Challenges remain: Hallucination, alignment, and computational efficiency are active research areas
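
To make the in-context learning point concrete, a few-shot prompt specifies a task entirely through examples, with no weight updates (toy sentiment example, illustrative only):

few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: Positive

Review: It stopped working after a week and support never replied.
Sentiment: Negative

Review: Setup took five minutes and everything just worked.
Sentiment:"""
# The model is expected to continue with "Positive", based only on the pattern in the prompt.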
