🧠 What are Large Language Models?
Large Language Models (LLMs) are AI systems trained on massive amounts of text data that can understand, generate, and manipulate human language. They power ChatGPT, Claude, and most modern AI applications.
Think of LLMs as: Incredibly well-read assistants who have read billions of pages and can help with almost any text task, but sometimes confidently make things up.
📊 What are Parameters?
Parameters are the learned weights in the neural network—think of them as the "knowledge" the model has absorbed during training. A model with 70 billion parameters has 70 billion numbers that determine how it responds. More parameters typically mean more capacity to understand complex patterns, but also higher costs and slower responses.
💬 What are Tokens?
LLMs don't read words—they read tokens, which are chunks of text (roughly 0.75 words each). "Hello world!" might be 3 tokens. Both input and output are measured in tokens, which directly affects cost and context limits. Think: 1,000 tokens ≈ 750 words.
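A quick way to see this in practice is to count tokens with a tokenizer library. The sketch below assumes the tiktoken package and its cl100k_base encoding; exact counts vary slightly between tokenizers:

```python
# Count tokens with tiktoken (assumes: pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by several OpenAI models

for text in ["Hello world!", "Large Language Models are AI systems."]:
    tokens = enc.encode(text)
    pieces = [enc.decode([t]) for t in tokens]           # the actual text chunks
    print(f"{text!r} -> {len(tokens)} tokens: {pieces}")
# "Hello world!" comes out as 3 tokens here: ['Hello', ' world', '!']
```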
📏 What is Context Window?
The context window is the maximum number of tokens (input + output) the model can handle at once. A 128K context window is roughly 96,000 words in total, about the length of a short novel. Longer context = more expensive and slower.
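For rough planning, you can check whether a document plus the expected reply fits a given window using the 1,000 tokens ≈ 750 words rule. This is plain arithmetic, not an API call:

```python
# Rough context-window budgeting (rule of thumb: 1 token ≈ 0.75 words).
def fits_context(doc_words: int, reply_tokens: int, window_tokens: int = 128_000) -> bool:
    doc_tokens = int(doc_words / 0.75)          # estimate input tokens from word count
    return doc_tokens + reply_tokens <= window_tokens

print(fits_context(doc_words=90_000, reply_tokens=2_000))   # True  (~120K + 2K <= 128K)
print(fits_context(doc_words=120_000, reply_tokens=2_000))  # False (~160K > 128K)
```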
⚡ What are FLOPs?
FLOPs (Floating Point Operations) measure computational work. Training a 70B parameter model on 1 trillion tokens requires ~420 × 10²¹ FLOPs—millions of dollars in compute. This is why only big companies train frontier models.
✅ What LLMs Can Do
- Generate human-like text
- Answer questions
- Write and debug code
- Translate languages
- Summarize documents
- Reason through problems
❌ What LLMs Cannot Do
- Access real-time information
- Remember previous conversations
- Guarantee factual accuracy
- Perform actual computations
- Learn from your data
- Access external systems (without tools)
📊 Model Scale & Parameters
What are Parameters?
Parameters are the weights and biases in a neural network that get adjusted during training. Think of them as the "knowledge" the model has learned. More parameters generally mean more capacity to learn complex patterns, but also higher costs and slower inference.
Small Models
Fast, efficient, great for simple tasks
- Gemma 2B, 9B
- Qwen 2.5 (1.5B-7B)
- Llama 3.2 (1B, 3B)
Medium Models
Balanced performance and cost
- Mistral 7B, 24B
- Llama 3.1 (70B)
- Gemma 27B
Large Models
Top performance, highest costs
- Qwen 3 (235B MoE)
- Llama 3.1 (405B)
- DeepSeek-R1 (671B MoE)
FLOP Count & Training Cost
FLOPs (Floating Point Operations) measure computational cost. Training requires approximately 6ND FLOPs, where N = parameters and D = training tokens. With activation checkpointing (required for large models), this increases to ~8ND FLOPs.
💡 MoE (Mixture of Experts): Models like DeepSeek-R1 (671B total, 37B active) and Llama 4 Maverick (400B total, 17B active) use sparse activation to run efficiently. Only a subset of parameters are active per token, drastically reducing compute while maintaining capability.
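The 6ND rule is easy to sanity-check in a few lines. This is a back-of-the-envelope sketch, not a full cost model:

```python
# Sanity-check the 6ND approximation (and the ~8ND variant with recomputation).
def training_flops(params: float, tokens: float, multiplier: float = 6.0) -> float:
    return multiplier * params * tokens

N, D = 70e9, 1e12                                     # 70B parameters, 1T training tokens
print(f"{training_flops(N, D):.2e}")                  # 4.20e+23  (= 420 x 10^21, as quoted above)
print(f"{training_flops(N, D, multiplier=8.0):.2e}")  # 5.60e+23 with activation checkpointing

# MoE note: inference compute per token scales with *active* parameters
# (roughly 2 x 37e9 FLOPs for DeepSeek-R1's forward pass, not 2 x 671e9).
```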
⚙️ How LLMs Work
1️⃣ Training Phase
LLMs learn patterns from trillions of words of text from the internet, books, and other sources.
2️⃣ Tokenization
Text is broken into tokens (chunks of characters) that the model can process.
3️⃣ Attention Mechanism
The model focuses on relevant parts of the input to generate contextually appropriate responses.
4️⃣ Generation
The model predicts the next most likely token, one at a time, to build complete responses.
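The four steps above map onto a short script: tokenize the prompt, run the transformer (attention happens inside the forward pass), pick the next token, append, and repeat. This is a minimal greedy-decoding sketch using Hugging Face transformers; the model name is just an assumption and any small causal LM would work:

```python
# Minimal greedy next-token generation loop.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal language model will do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

input_ids = tok("Large Language Models are", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                                       # generate 20 tokens
        logits = model(input_ids).logits                      # [batch, seq_len, vocab]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # pick most likely next token
        input_ids = torch.cat([input_ids, next_id], dim=-1)   # append and repeat

print(tok.decode(input_ids[0], skip_special_tokens=True))
```

Production systems usually sample from the probability distribution (see Temperature below) rather than always taking the argmax.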
🔍 Compare Popular LLMs
GPT-5
by OpenAI
✅ Strengths
- Top-tier reasoning
- Agentic coding
- Tool use
- Long-context retrieval
⚠️ Weaknesses
- Higher output cost
- Proprietary
- No self-hosting
Best for: Complex multi-step tasks, top coding performance, agent workflows
✨ What's New in GPT‑5
Model family
- API variants: gpt-5, gpt-5-mini, gpt-5-nano
- ChatGPT non-reasoning: gpt-5-chat-latest
- Typical context: ~256K API (reasoning), ~196K in Chat
Controls & capabilities
- verbosity: low, medium, high
- reasoning_effort: minimal vs deeper thinking
- Custom tools: plaintext tool calls with constraints
- Better tool use, long-context retrieval, lower hallucinations (see the API sketch after this list)
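As a rough illustration, here is how those controls might be passed via the OpenAI Python SDK's Responses API. Treat the exact parameter shapes as assumptions and confirm against the current API reference:

```python
# Sketch: calling gpt-5 with verbosity and reasoning_effort controls
# (assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment).
from openai import OpenAI

client = OpenAI()

resp = client.responses.create(
    model="gpt-5",
    input="Summarize the trade-offs between dense and MoE architectures in 3 bullets.",
    reasoning={"effort": "minimal"},   # minimal vs deeper thinking
    text={"verbosity": "low"},         # low / medium / high
)
print(resp.output_text)
```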
🚀 Latest Open-Source Models (2025)
Llama 4 (Meta - April 2025)
First Llama with Mixture-of-Experts architecture
Key innovation: MoE architecture activates only a fraction of parameters per token, dramatically improving efficiency while maintaining top-tier performance. Outperforms GPT-4o and Gemini on many benchmarks.
DeepSeek-R1 (DeepSeek - January 2025)
Largest open reasoning model with efficient MoE
Breakthrough: 671 billion parameters with only 37B activated at a time using Mixture-of-Experts. Focuses on advanced reasoning tasks while maintaining efficiency comparable to much smaller models.
Qwen 3 (Alibaba - 2025)
Full family from tiny edge models to massive multilingual powerhouses. Excels at code, math, and supports 100+ languages.
Gemma 3 (Google - 2025)
Lightweight but powerful. The 27B variant performs like models 2x its size. Optimized for consumer GPUs and easy integration.
Mistral Small 3 (Mistral AI - January 2025)
State-of-the-art for its size. Competes with models 3-4x larger. Great for EU data residency and function calling.
Llama 3.1 (Meta - July 2024)
Industry standard. The 405B is the largest non-MoE open model. Proven reliability, extensive tooling, great community.
💡 2025 Trend: Open-source models are rapidly closing the gap with proprietary models. MoE architectures (Llama 4, DeepSeek-R1) enable massive scale with practical efficiency. Expect more specialized models optimized for specific tasks (code, math, multilingual) rather than general-purpose giants.
💰 Token Cost Calculator
💾 Prompt caching available: GPT-5 bills cached input tokens at roughly one-tenth of the standard input rate for repeated content.
⚙️ GPT-5 Specific Controls: Verbosity and reasoning effort affect response quality and length but not per-token pricing.
Split tokens into input vs output for cost math.
Percentage of input tokens served from cache (10x cheaper)
Note: Verbosity and reasoning effort affect quality, latency, and token usage, but not per‑token rates. Adjust the input/output split to reflect your use case.
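A sketch of the cost math behind the calculator. The per-million-token rates and the 10x cache discount are illustrative assumptions; substitute current pricing for your model:

```python
# Token cost estimate with prompt caching (illustrative rates, not official pricing).
def request_cost(input_tokens: int, output_tokens: int, cached_share: float = 0.0,
                 in_rate: float = 1.25, out_rate: float = 10.0,
                 cache_discount: float = 0.1) -> float:
    """Rates are USD per 1M tokens; cached_share is the fraction of input served from cache."""
    cached = input_tokens * cached_share
    fresh = input_tokens - cached
    return (fresh * in_rate + cached * in_rate * cache_discount + output_tokens * out_rate) / 1e6

# 10K-token prompt (half cached) producing a 1K-token answer:
print(f"${request_cost(10_000, 1_000, cached_share=0.5):.4f}")
```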
🏋️ Training Cost Calculator
💡 Why training costs millions: Training requires ~6ND FLOPs (N=parameters, D=tokens). Even with cutting-edge GPUs, large models take months and cost millions in compute.
Examples: 7 (Mistral), 70 (Llama), 405 (Llama 3.1)
Typical: 1000-2000 (1-2 trillion tokens)
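The calculator's estimate boils down to the 6ND rule plus assumed hardware numbers. Everything past the FLOP count below (GPU throughput, utilization, hourly price) is an illustrative assumption:

```python
# Rough training-cost estimate from the 6ND rule (hardware numbers are assumptions).
def training_cost(params_b: float, tokens_b: float,
                  gpu_flops: float = 1e15,      # assumed ~1 PFLOP/s per GPU (BF16, dense)
                  utilization: float = 0.4,     # assumed real-world utilization
                  usd_per_gpu_hour: float = 2.0):
    flops = 6 * (params_b * 1e9) * (tokens_b * 1e9)          # 6ND
    gpu_hours = flops / (gpu_flops * utilization * 3600)
    return gpu_hours, gpu_hours * usd_per_gpu_hour

hours, usd = training_cost(params_b=70, tokens_b=1000)       # 70B params, 1T tokens
print(f"~{hours:,.0f} GPU-hours, ~${usd:,.0f}")              # on the order of 10^5 GPU-hours
```

Plugging in 405 (parameters in billions) and 15,000 (tokens in billions) instead lands in the tens of millions of dollars under these same assumptions, which is why frontier training runs are limited to large labs.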
📖 Key Concepts
Tokens
Basic units of text (≈0.75 words). LLMs process text as tokens, not words.
"ChatGPT is amazing!" = 5 tokens
Why it matters: Affects cost and context limits
Parameters
Weights and biases in the neural network that store learned knowledge. More parameters = more capacity to learn patterns.
GPT-4: ~1.8T, Llama 3.1: 70B or 405B, Gemma 2: 27B
Why it matters: Determines model capability, size, and inference cost
FLOPs (Floating Point Operations)
Computational work required for training. Measured in petaFLOPs or exaFLOPs. Training ≈ 6ND FLOPs (N=params, D=tokens).
70B model on 1T tokens = ~420×10²¹ FLOPs
Why it matters: Determines training time, cost, and hardware requirements
Context Window
Maximum tokens an LLM can process in one request or conversation.
Gemini 1.5: up to 1M tokens (very long docs)
Why it matters: Limits conversation length and document size
MoE (Mixture of Experts)
Architecture that activates only a subset of parameters per token, keeping total params high but active params low.
DeepSeek-R1: 671B total, only 37B active per token
Why it matters: Enables massive scale with practical efficiency and lower inference costs
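A toy numpy sketch of the routing idea: each token is scored by a router and only the top-k experts run for it. The dimensions and expert count here are made up for illustration:

```python
# Toy top-k MoE routing: each token is sent to only k of E experts.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

x = rng.standard_normal(d_model)                               # one token's hidden state
router = rng.standard_normal((d_model, n_experts))             # router weights
experts = rng.standard_normal((n_experts, d_model, d_model))   # one weight matrix per expert

logits = x @ router                                  # score each expert for this token
top = np.argsort(logits)[-top_k:]                    # pick the top-k experts
weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen experts

# Only top_k expert matrices are used -> "active" params << total params
y = sum(w * (x @ experts[e]) for w, e in zip(weights, top))
print(y.shape)   # (16,)
```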
Temperature
Controls randomness in responses (0 = deterministic, 2 = very random).
Low (0.2): Factual answers | High (1.5): Creative writing
Why it matters: Balance between consistency and creativity
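A small numpy sketch of what the temperature knob does: logits are divided by T before the softmax, then a token is sampled from the resulting distribution (toy scores, not a real model):

```python
# Temperature scaling: divide logits by T before softmax, then sample.
import numpy as np

rng = np.random.default_rng(0)

def sample(logits, temperature=1.0):
    if temperature == 0:                   # T=0: deterministic, pick the most likely token
        return int(np.argmax(logits))
    z = np.asarray(logits, dtype=float) / temperature  # low T sharpens, high T flattens
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

logits = [2.0, 1.0, 0.2]                   # toy scores for 3 candidate tokens
print(sample(logits, temperature=0))       # always index 0
print([sample(logits, temperature=1.5) for _ in range(5)])  # more spread across tokens
```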
Hallucination
When LLMs generate plausible-sounding but incorrect information.
Inventing fake citations or historical events
Why it matters: Critical risk in production systems
⚠️ Common Pitfalls
Trusting outputs blindly
LLMs can hallucinate facts. Always verify critical information.
Ignoring token costs
GPT-4 can cost $0.12 per page. Budget accordingly for production.
Expecting perfect consistency
Same prompt can give different outputs. Use temperature=0 for consistency.
Overloading context
Performance degrades near context limits. Keep conversations focused.
🎯 Key Takeaways
LLMs are pattern matchers: They predict likely text based on training, not true understanding
Choose models wisely: GPT-5 for top reasoning/coding, GPT-5 mini for cost-effective assistants, Claude 3.5 for long docs, Gemini 1.5 for ultra-long context
Tokens = Money: Optimize prompts and responses to control costs
Hallucinations are inevitable: Build validation and verification into your systems
Context windows matter: Plan for conversation length and document size limits