🧠 What are Large Language Models?
Large Language Models (LLMs) are AI systems trained on massive amounts of text data that can understand, generate, and manipulate human language. They power ChatGPT, Claude, and most modern AI applications.
Think of LLMs as: Incredibly well-read assistants who have read billions of pages and can help with almost any text task, but sometimes confidently make things up.
📊 What are Parameters?
Parameters are the learned weights in the neural network—think of them as the "knowledge" the model has absorbed during training. A model with 70 billion parameters has 70 billion numbers that determine how it responds. More parameters typically mean more capacity to understand complex patterns, but also higher costs and slower responses.
💬 What are Tokens?
LLMs don't read words—they read tokens, which are chunks of text (roughly 0.75 words each). "Hello world!" might be 3 tokens. Both input and output are measured in tokens, which directly affects cost and context limits. Think: 1,000 tokens ≈ 750 words.
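A quick way to see this in practice is to count tokens with a tokenizer library. The sketch below assumes the tiktoken package and its cl100k_base encoding; exact counts vary slightly between tokenizers:

```python
# Count tokens with tiktoken (assumes: pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by several OpenAI models

for text in ["Hello world!", "Large Language Models are AI systems."]:
    tokens = enc.encode(text)
    pieces = [enc.decode([t]) for t in tokens]           # the actual text chunks
    print(f"{text!r} -> {len(tokens)} tokens: {pieces}")
# "Hello world!" comes out as 3 tokens here: ['Hello', ' world', '!']
```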
📏 What is Context Window?
The context window is the maximum number of tokens (input + output) the model can handle at once. A 128K context window is roughly 96,000 words in total, about the length of a short novel. Longer context = more expensive and slower.
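For rough planning, you can check whether a document plus the expected reply fits a given window using the 1,000 tokens ≈ 750 words rule. This is plain arithmetic, not an API call:

```python
# Rough context-window budgeting (rule of thumb: 1 token ≈ 0.75 words).
def fits_context(doc_words: int, reply_tokens: int, window_tokens: int = 128_000) -> bool:
    doc_tokens = int(doc_words / 0.75)          # estimate input tokens from word count
    return doc_tokens + reply_tokens <= window_tokens

print(fits_context(doc_words=90_000, reply_tokens=2_000))   # True  (~120K + 2K <= 128K)
print(fits_context(doc_words=120_000, reply_tokens=2_000))  # False (~160K > 128K)
```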
⚡ What are FLOPs?
FLOPs (Floating Point Operations) measure computational work. Training a 70B parameter model on 1 trillion tokens requires ~420 × 10²¹ FLOPs—millions of dollars in compute. This is why only big companies train frontier models.
✅ What LLMs Can Do
- Generate human-like text
- Answer questions
- Write and debug code
- Translate languages
- Summarize documents
- Reason through problems
❌ What LLMs Cannot Do
- Access real-time information
- Remember previous conversations
- Guarantee factual accuracy
- Perform actual computations
- Learn from your data
- Access external systems (without tools)
📊 Model Scale & Parameters
What are Parameters?
Parameters are the weights and biases in a neural network that get adjusted during training. Think of them as the "knowledge" the model has learned. More parameters generally mean more capacity to learn complex patterns, but also higher costs and slower inference.
Small Models
Fast, efficient, great for simple tasks
- Gemma 2B, 9B
- Qwen 2.5 (1.5B-7B)
- Llama 3.2 (1B, 3B)
Medium Models
Balanced performance and cost
- Mistral 7B, 24B
- Llama 3.1 (70B)
- Gemma 27B
Large Models
Top performance, highest costs
- Qwen 3 (235B MoE)
- Llama 3.1 (405B)
- DeepSeek-R1 (671B MoE)
FLOP Count & Training Cost
FLOPs (Floating Point Operations) measure computational cost. Training requires approximately 6ND FLOPs, where N = parameters and D = training tokens. With activation checkpointing (required for large models), this increases to ~8ND FLOPs.
💡 MoE (Mixture of Experts): Models like DeepSeek-R1 (671B total, 37B active) and Llama 4 Maverick (400B total, 17B active) use sparse activation to run efficiently. Only a subset of parameters are active per token, drastically reducing compute while maintaining capability.
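The 6ND rule is easy to sanity-check in a few lines. This is a back-of-the-envelope sketch, not a full cost model:

```python
# Sanity-check the 6ND approximation (and the ~8ND variant with recomputation).
def training_flops(params: float, tokens: float, multiplier: float = 6.0) -> float:
    return multiplier * params * tokens

N, D = 70e9, 1e12                                     # 70B parameters, 1T training tokens
print(f"{training_flops(N, D):.2e}")                  # 4.20e+23  (= 420 x 10^21, as quoted above)
print(f"{training_flops(N, D, multiplier=8.0):.2e}")  # 5.60e+23 with activation checkpointing

# MoE note: inference compute per token scales with *active* parameters
# (roughly 2 x 37e9 FLOPs for DeepSeek-R1's forward pass, not 2 x 671e9).
```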
⚙️ How LLMs Work
1️⃣ Training Phase
LLMs learn patterns from trillions of words of text from the internet, books, and other sources.
2️⃣ Tokenization
Text is broken into tokens (chunks of characters) that the model can process.
3️⃣ Attention Mechanism
The model focuses on relevant parts of the input to generate contextually appropriate responses.
4️⃣ Generation
The model predicts the next most likely token, one at a time, to build complete responses.
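The four steps above map onto a short script: tokenize the prompt, run the transformer (attention happens inside the forward pass), pick the next token, append, and repeat. This is a minimal greedy-decoding sketch using Hugging Face transformers; the model name is just an assumption and any small causal LM would work:

```python
# Minimal greedy next-token generation loop.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal language model will do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

input_ids = tok("Large Language Models are", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                                       # generate 20 tokens
        logits = model(input_ids).logits                      # [batch, seq_len, vocab]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # pick most likely next token
        input_ids = torch.cat([input_ids, next_id], dim=-1)   # append and repeat

print(tok.decode(input_ids[0], skip_special_tokens=True))
```

Production systems usually sample from the probability distribution (see Temperature below) rather than always taking the argmax.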
🔍 Compare Popular LLMs
GPT-5
by OpenAI
✅ Strengths
- Top-tier reasoning
- Agentic coding
- Tool use
- Long-context retrieval
⚠️ Weaknesses
- Higher output cost
- Proprietary
- No self-hosting
Best for: Complex multi-step tasks, top coding performance, agent workflows
✨ What's New in GPT‑5
Model family
- API variants: gpt-5, gpt-5-mini, gpt-5-nano
- ChatGPT non-reasoning: gpt-5-chat-latest
- Typical context: ~256K API (reasoning), ~196K in Chat
Controls & capabilities
- verbosity: low, medium, high
- reasoning_effort: minimal vs deeper thinking
- Custom tools: plaintext tool calls with constraints
- Better tool use, long-context retrieval, lower hallucinations (see the API sketch after this list)
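As a rough illustration, here is how those controls might be passed via the OpenAI Python SDK's Responses API. Treat the exact parameter shapes as assumptions and confirm against the current API reference:

```python
# Sketch: calling gpt-5 with verbosity and reasoning_effort controls
# (assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment).
from openai import OpenAI

client = OpenAI()

resp = client.responses.create(
    model="gpt-5",
    input="Summarize the trade-offs between dense and MoE architectures in 3 bullets.",
    reasoning={"effort": "minimal"},   # minimal vs deeper thinking
    text={"verbosity": "low"},         # low / medium / high
)
print(resp.output_text)
```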
🚀 Latest Open-Source Models (2025)
Llama 4 (Meta - April 2025)
First Llama with Mixture-of-Experts architecture
Key innovation: MoE architecture activates only a fraction of parameters per token, dramatically improving efficiency while maintaining top-tier performance. Outperforms GPT-4o and Gemini on many benchmarks.
DeepSeek-R1 (DeepSeek - January 2025)
Largest open reasoning model with efficient MoE
Breakthrough: 671 billion parameters with only 37B activated at a time using Mixture-of-Experts. Focuses on advanced reasoning tasks while maintaining efficiency comparable to much smaller models.
Qwen 3 (Alibaba - 2025)
Full family from tiny edge models to massive multilingual powerhouses. Excels at code, math, and supports 100+ languages.
Gemma 3 (Google - 2025)
Lightweight but powerful. The 27B variant performs like models 2x its size. Optimized for consumer GPUs and easy integration.
Mistral Small 3 (Mistral AI - January 2025)
State-of-the-art for its size. Competes with models 3-4x larger. Great for EU data residency and function calling.
Llama 3.1 (Meta - July 2024)
Industry standard. The 405B is the largest non-MoE open model. Proven reliability, extensive tooling, great community.
💡 2025 Trend: Open-source models are rapidly closing the gap with proprietary models. MoE architectures (Llama 4, DeepSeek-R1) enable massive scale with practical efficiency. Expect more specialized models optimized for specific tasks (code, math, multilingual) rather than general-purpose giants.
💰 Token Cost Calculator
💾 Prompt caching available: GPT-5 bills cached input tokens at roughly one-tenth of the standard input rate for repeated content.
⚙️ GPT-5 Specific Controls: Verbosity and reasoning effort affect response quality and length but not per-token pricing.
Split tokens into input vs output for cost math.
Percentage of input tokens served from cache (10x cheaper)
Note: Verbosity and reasoning effort affect quality, latency, and token usage, but not per‑token rates. Adjust the input/output split to reflect your use case.
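A sketch of the cost math behind the calculator. The per-million-token rates and the 10x cache discount are illustrative assumptions; substitute current pricing for your model:

```python
# Token cost estimate with prompt caching (illustrative rates, not official pricing).
def request_cost(input_tokens: int, output_tokens: int, cached_share: float = 0.0,
                 in_rate: float = 1.25, out_rate: float = 10.0,
                 cache_discount: float = 0.1) -> float:
    """Rates are USD per 1M tokens; cached_share is the fraction of input served from cache."""
    cached = input_tokens * cached_share
    fresh = input_tokens - cached
    return (fresh * in_rate + cached * in_rate * cache_discount + output_tokens * out_rate) / 1e6

# 10K-token prompt (half cached) producing a 1K-token answer:
print(f"${request_cost(10_000, 1_000, cached_share=0.5):.4f}")
```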
🏋️ Training Cost Calculator
💡 Why training costs millions: Training requires ~6ND FLOPs (N=parameters, D=tokens). Even with cutting-edge GPUs, large models take months and cost millions in compute.
Examples: 7 (Mistral), 70 (Llama), 405 (Llama 3.1)
Typical: 1000-2000 (1-2 trillion tokens)
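The calculator's estimate boils down to the 6ND rule plus assumed hardware numbers. Everything past the FLOP count below (GPU throughput, utilization, hourly price) is an illustrative assumption:

```python
# Rough training-cost estimate from the 6ND rule (hardware numbers are assumptions).
def training_cost(params_b: float, tokens_b: float,
                  gpu_flops: float = 1e15,      # assumed ~1 PFLOP/s per GPU (BF16, dense)
                  utilization: float = 0.4,     # assumed real-world utilization
                  usd_per_gpu_hour: float = 2.0):
    flops = 6 * (params_b * 1e9) * (tokens_b * 1e9)          # 6ND
    gpu_hours = flops / (gpu_flops * utilization * 3600)
    return gpu_hours, gpu_hours * usd_per_gpu_hour

hours, usd = training_cost(params_b=70, tokens_b=1000)       # 70B params, 1T tokens
print(f"~{hours:,.0f} GPU-hours, ~${usd:,.0f}")              # on the order of 10^5 GPU-hours
```

Plugging in 405 (parameters in billions) and 15,000 (tokens in billions) instead lands in the tens of millions of dollars under these same assumptions, which is why frontier training runs are limited to large labs.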
📖 Key Concepts
Tokens
Basic units of text (≈0.75 words). LLMs process text as tokens, not words.
"ChatGPT is amazing!" = 5 tokens
Why it matters: Affects cost and context limits
Parameters
Weights and biases in the neural network that store learned knowledge. More parameters = more capacity to learn patterns.
GPT-4: ~1.8T, Llama 3.1: 70B or 405B, Gemma 2: 27B
Why it matters: Determines model capability, size, and inference cost
FLOPs (Floating Point Operations)
Computational work required for training. Measured in petaFLOPs or exaFLOPs. Training ≈ 6ND FLOPs (N=params, D=tokens).
70B model on 1T tokens = ~420×10²¹ FLOPs
Why it matters: Determines training time, cost, and hardware requirements
Context Window
Maximum tokens an LLM can process in one request or conversation.
Gemini 1.5: up to 1M tokens (very long docs)
Why it matters: Limits conversation length and document size
MoE (Mixture of Experts)
Architecture that activates only a subset of parameters per token, keeping total params high but active params low.
DeepSeek-R1: 671B total, only 37B active per token
Why it matters: Enables massive scale with practical efficiency and lower inference costs
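A toy numpy sketch of the routing idea: each token is scored by a router and only the top-k experts run for it. The dimensions and expert count here are made up for illustration:

```python
# Toy top-k MoE routing: each token is sent to only k of E experts.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

x = rng.standard_normal(d_model)                               # one token's hidden state
router = rng.standard_normal((d_model, n_experts))             # router weights
experts = rng.standard_normal((n_experts, d_model, d_model))   # one weight matrix per expert

logits = x @ router                                  # score each expert for this token
top = np.argsort(logits)[-top_k:]                    # pick the top-k experts
weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen experts

# Only top_k expert matrices are used -> "active" params << total params
y = sum(w * (x @ experts[e]) for w, e in zip(weights, top))
print(y.shape)   # (16,)
```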
Temperature
Controls randomness in responses (0 = deterministic, 2 = very random).
Low (0.2): Factual answers | High (1.5): Creative writing
Why it matters: Balance between consistency and creativity
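A small numpy sketch of what the temperature knob does: logits are divided by T before the softmax, then a token is sampled from the resulting distribution (toy scores, not a real model):

```python
# Temperature scaling: divide logits by T before softmax, then sample.
import numpy as np

rng = np.random.default_rng(0)

def sample(logits, temperature=1.0):
    if temperature == 0:                   # T=0: deterministic, pick the most likely token
        return int(np.argmax(logits))
    z = np.asarray(logits, dtype=float) / temperature  # low T sharpens, high T flattens
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

logits = [2.0, 1.0, 0.2]                   # toy scores for 3 candidate tokens
print(sample(logits, temperature=0))       # always index 0
print([sample(logits, temperature=1.5) for _ in range(5)])  # more spread across tokens
```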
Hallucination
When LLMs generate plausible-sounding but incorrect information.
Inventing fake citations or historical events
Why it matters: Critical risk in production systems
⚠️ Common Pitfalls
Trusting outputs blindly
LLMs can hallucinate facts. Always verify critical information.
Ignoring token costs
GPT-4 can cost $0.12 per page. Budget accordingly for production.
Expecting perfect consistency
Same prompt can give different outputs. Use temperature=0 for consistency.
Overloading context
Performance degrades near context limits. Keep conversations focused.
🎯 Key Takeaways
LLMs are pattern matchers: They predict likely text based on training, not true understanding
Choose models wisely: GPT-5 for top reasoning/coding, GPT-5 mini for cost-effective assistants, Claude 3.5 for long docs, Gemini 1.5 for ultra-long context
Tokens = Money: Optimize prompts and responses to control costs
Hallucinations are inevitable: Build validation and verification into your systems
Context windows matter: Plan for conversation length and document size limits