Data Modeling Basics
Learn the fundamentals of structuring data effectively. Master entity relationships, normalization principles, and schema design patterns that scale from startup to enterprise.
What is Data Modeling?
Data modeling is the process of creating a visual representation of data and its relationships. It's the blueprint that guides how information is stored, accessed, and maintained in your system.
Why Data Modeling Matters
- Performance: A well-designed schema can make common queries 10-100x faster than a poorly structured one
- Scalability: Properly modeled relationships handle growth gracefully
- Maintainability: A clear structure reduces bugs and technical debt
- Cost: Efficient storage and queries lower infrastructure spend
Conceptual Model
High-level business view of data entities and relationships
Logical Model
Detailed structure with attributes, keys, and constraints
Physical Model
Database-specific implementation with indexes and storage
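To make the three levels concrete, here is a small sketch of the same "customers place orders" idea at each level. The entity and column names are illustrative assumptions, not part of any particular system.

```python
from dataclasses import dataclass
import sqlite3

# Conceptual model: business entities and how they relate.
#   Customer --(places)--> Order        (one customer, many orders)

# Logical model: attributes, keys, and constraints, independent of any database.
@dataclass
class Customer:
    customer_id: int          # primary key
    email: str                # required, unique

@dataclass
class Order:
    order_id: int             # primary key
    customer_id: int          # foreign key -> Customer
    total_cents: int          # money stored as integer cents

# Physical model: a database-specific implementation with types and indexes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT NOT NULL UNIQUE
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        total_cents INTEGER NOT NULL
    );
    CREATE INDEX idx_orders_customer ON orders(customer_id);
""")
```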
Entity Relationships
Understanding how entities relate to each other is crucial for designing efficient and logical database schemas.
One-to-One (1:1)
Each record in table A relates to exactly one record in table B
One-to-Many (1:N)
One record in table A can relate to multiple records in table B
Many-to-Many (M:N)
Multiple records in table A can relate to multiple records in table B
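A minimal sketch of how each relationship type is typically implemented in a relational schema (the table names are invented for illustration): 1:1 via a unique or shared primary-key foreign key, 1:N via a foreign key on the "many" side, and M:N via a junction table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One-to-One (1:1): each user has at most one profile.
    -- Enforced by making the foreign key the primary key (so it is unique).
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        email   TEXT NOT NULL UNIQUE
    );
    CREATE TABLE user_profiles (
        user_id INTEGER PRIMARY KEY REFERENCES users(user_id),
        bio     TEXT
    );

    -- One-to-Many (1:N): one user writes many posts.
    -- The foreign key lives on the "many" side.
    CREATE TABLE posts (
        post_id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(user_id),
        body    TEXT NOT NULL
    );

    -- Many-to-Many (M:N): posts have many tags, tags apply to many posts.
    -- A junction table with a composite primary key prevents duplicate pairs.
    CREATE TABLE tags (
        tag_id INTEGER PRIMARY KEY,
        name   TEXT NOT NULL UNIQUE
    );
    CREATE TABLE post_tags (
        post_id INTEGER NOT NULL REFERENCES posts(post_id),
        tag_id  INTEGER NOT NULL REFERENCES tags(tag_id),
        PRIMARY KEY (post_id, tag_id)
    );
""")
```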
Database Normalization
Normalization eliminates data redundancy and ensures data integrity by organizing tables according to specific rules. Each level builds upon the previous one.
1NF (First Normal Form)
Each column contains atomic values, no repeating groups
2NF (Second Normal Form)
Must be in 1NF, and every non-key attribute depends on the entire primary key (no partial dependencies)
3NF (Third Normal Form)
Must be in 2NF, and no non-key attribute depends on another non-key attribute (no transitive dependencies)
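As a small worked example (with invented columns), the sketch below takes a denormalized orders table through the three forms: atomic values and a separate line-items table for 1NF, product attributes moved out because they depend on only part of the composite key for 2NF, and the transitive dependency customer → city → tax rate split out for 3NF.

```python
# Denormalized starting point (violates all three forms):
#   orders(order_id, product_ids, customer_name, customer_city, city_tax_rate)
#   - product_ids is a comma-separated list                  -> violates 1NF
#   - product name/price repeat for every order line         -> 2NF concern
#   - city_tax_rate depends on customer_city, not the order  -> 3NF violation
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- 1NF: one atomic value per column; order lines get their own table.
    CREATE TABLE order_items (
        order_id   INTEGER NOT NULL,
        product_id INTEGER NOT NULL,
        quantity   INTEGER NOT NULL,
        PRIMARY KEY (order_id, product_id)
    );

    -- 2NF: product attributes depend only on product_id (part of the key),
    -- so they move to their own table instead of repeating per order line.
    CREATE TABLE products (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        unit_price INTEGER NOT NULL
    );

    -- 3NF: tax rate depends on the city, not on the customer directly, so the
    -- transitive dependency customer -> city -> tax_rate is split out.
    CREATE TABLE cities (
        city_id  INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        tax_rate REAL NOT NULL
    );
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city_id     INTEGER NOT NULL REFERENCES cities(city_id)
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id)
    );
""")
```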
When NOT to Normalize
- Read-heavy applications where repeated JOINs hurt performance (see the sketch after this list)
- Data warehousing scenarios, where denormalized schemas are standard for analytics
- Caching layers where redundancy improves speed
- When the complexity of a fully normalized schema costs the team more than the redundancy it removes
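One common denormalization pattern is to keep a precomputed read model alongside the normalized source of truth, so the hot query needs no JOINs or aggregation. The table and column names below are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized source of truth.
    CREATE TABLE orders      (order_id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL);
    CREATE TABLE order_items (order_id INTEGER, product_id INTEGER, quantity INTEGER, unit_price INTEGER);

    -- Denormalized read model: one row per order with totals precomputed.
    CREATE TABLE order_summaries (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        item_count  INTEGER NOT NULL,
        total_cents INTEGER NOT NULL
    );
""")

def refresh_summary(conn, order_id):
    """Recompute one order's summary row after its items change."""
    conn.execute("""
        INSERT OR REPLACE INTO order_summaries
        SELECT o.order_id,
               o.customer_id,
               COUNT(i.product_id),
               COALESCE(SUM(i.quantity * i.unit_price), 0)
        FROM orders o LEFT JOIN order_items i USING (order_id)
        WHERE o.order_id = ?
        GROUP BY o.order_id
    """, (order_id,))
```

The trade-off is exactly the one listed above: reads get faster, but every write now has to keep the redundant copy in sync.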
Indexing for Performance
Indexes are data structures that improve query performance by creating shortcuts to your data. Choose the right indexing strategy based on your query patterns.
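A quick way to see an index's effect is to compare the query plan before and after creating it. The sketch below uses SQLite's EXPLAIN QUERY PLAN; the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(10_000)],
)

query = "SELECT user_id FROM users WHERE email = ?"

# Without an index the plan reports a full table scan ("SCAN users").
print(conn.execute("EXPLAIN QUERY PLAN " + query, ("user42@example.com",)).fetchall())

# Add an index on the frequently queried column.
conn.execute("CREATE INDEX idx_users_email ON users(email)")

# With the index the plan switches to "SEARCH users USING INDEX idx_users_email".
print(conn.execute("EXPLAIN QUERY PLAN " + query, ("user42@example.com",)).fetchall())
```

The trade-off: every index speeds up the reads it matches but must be updated on every write, which is why over-indexing slows inserts and updates.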
Real-World Example: E-commerce Schema
Let's apply data modeling principles to design a schema for an e-commerce platform. This example demonstrates relationships, normalization, and indexing decisions.
Core Entities
Design Decisions
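A minimal sketch of what such a schema might look like. The specific entities (users, products, orders, order_items) and the decisions noted in the comments are illustrative assumptions for the example, not a definitive design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Core entities: users, products, orders, and an order_items junction table
    -- (orders and products are many-to-many, resolved through order_items).
    CREATE TABLE users (
        user_id    INTEGER PRIMARY KEY,
        email      TEXT NOT NULL UNIQUE,           -- natural lookup key, kept unique
        created_at TEXT NOT NULL DEFAULT (datetime('now'))
    );

    CREATE TABLE products (
        product_id  INTEGER PRIMARY KEY,
        sku         TEXT NOT NULL UNIQUE,
        name        TEXT NOT NULL,
        price_cents INTEGER NOT NULL CHECK (price_cents >= 0)  -- money as integer cents
    );

    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL REFERENCES users(user_id),
        status     TEXT NOT NULL DEFAULT 'pending'
                   CHECK (status IN ('pending', 'paid', 'shipped', 'cancelled')),
        created_at TEXT NOT NULL DEFAULT (datetime('now'))
    );

    CREATE TABLE order_items (
        order_id         INTEGER NOT NULL REFERENCES orders(order_id),
        product_id       INTEGER NOT NULL REFERENCES products(product_id),
        quantity         INTEGER NOT NULL CHECK (quantity > 0),
        -- Design decision: copy the price at purchase time so later catalog
        -- price changes do not rewrite historical orders (deliberate redundancy).
        unit_price_cents INTEGER NOT NULL,
        PRIMARY KEY (order_id, product_id)
    );

    -- Design decision: index the access paths the storefront actually uses.
    CREATE INDEX idx_orders_user   ON orders(user_id, created_at);
    CREATE INDEX idx_items_product ON order_items(product_id);
""")
```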
Data Modeling Best Practices
Schema Design
- ☐ Start with business requirements, not technical constraints
- ☐ Use meaningful, consistent naming conventions
- ☐ Define clear primary and foreign key relationships
- ☐ Add constraints to enforce data integrity
- ☐ Plan for data evolution and migrations
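Planning for evolution is easiest to see in miniature. Below is a hedged sketch of a tiny, hand-rolled migration runner (real projects would typically use a migration tool); the table, column, and version details are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT NOT NULL)")

# Track the schema version in the database and apply pending migrations in order.
MIGRATIONS = [
    # 1: users gain a display name, backfilled from email for existing rows.
    """
    ALTER TABLE users ADD COLUMN display_name TEXT;
    UPDATE users SET display_name = email WHERE display_name IS NULL;
    """,
    # 2: add an index to support a new "search by name" feature.
    "CREATE INDEX idx_users_display_name ON users(display_name);",
]

def migrate(conn):
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version, sql in enumerate(MIGRATIONS, start=1):
        if version > current:
            conn.executescript(sql)
            conn.execute(f"PRAGMA user_version = {version}")

migrate(conn)
```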
Performance Optimization
- ☐ Index frequently queried columns
- ☐ Avoid over-indexing (impacts write performance)
- ☐ Consider denormalization for read-heavy workloads
- ☐ Use appropriate data types (save storage)
- ☐ Monitor query patterns and optimize accordingly
🎯 Data Modeling Decisions in Production
Learn from real-world data modeling challenges and solutions from major tech companies
1. Slack Message Storage Evolution
Context
How Slack evolved their message storage model to handle billions of messages
Outcome
Redesigned from single table to time-based partitioning with denormalized data for common queries. Accepted some data redundancy for massive performance gains.
Key Lessons
- Single table design fails at massive scale - time-based partitioning essential
- Denormalization acceptable when read performance is critical
- Message search requires separate indexing system (Elasticsearch)
- Channel metadata and message content have different access patterns - separate storage
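This is not Slack's actual schema; as a rough illustration of the time-based partitioning idea, the sketch below routes each message to a per-month table so that recent-history queries touch only a small, hot table (production databases such as PostgreSQL offer declarative partitioning instead of hand-rolled routing).

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")

def partition_for(ts: datetime) -> str:
    """Name of the monthly partition table a message belongs to."""
    return f"messages_{ts.year:04d}_{ts.month:02d}"

def insert_message(conn, channel_id: int, sender_id: int, body: str, ts: datetime):
    table = partition_for(ts)
    # Create the partition lazily the first time a month is seen.
    conn.execute(f"""
        CREATE TABLE IF NOT EXISTS {table} (
            message_id INTEGER PRIMARY KEY,
            channel_id INTEGER NOT NULL,
            sender_id  INTEGER NOT NULL,
            body       TEXT NOT NULL,
            sent_at    TEXT NOT NULL
        )
    """)
    conn.execute(f"CREATE INDEX IF NOT EXISTS idx_{table}_channel ON {table}(channel_id, sent_at)")
    conn.execute(
        f"INSERT INTO {table} (channel_id, sender_id, body, sent_at) VALUES (?, ?, ?, ?)",
        (channel_id, sender_id, body, ts.isoformat()),
    )

insert_message(conn, 1, 42, "hello", datetime(2024, 5, 1, tzinfo=timezone.utc))
```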
2. Shopify Product Catalog Schema
Context
Shopify's flexible product model supporting millions of merchants
Outcome
Hybrid model: structured data in relational tables, flexible attributes in JSON columns. Best of both worlds for e-commerce flexibility.
Key Lessons
- E-commerce requires extreme schema flexibility - different stores need different attributes
- JSON columns in PostgreSQL provide flexibility without sacrificing ACID compliance
- Core product relationships (pricing, inventory) remain in structured tables
- Separate indexes on JSON attributes enable fast filtering on custom fields
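Not Shopify's actual schema; a sketch of the hybrid pattern, shown with SQLite's JSON functions so it runs self-contained (PostgreSQL's jsonb columns with GIN or expression indexes are the production analogue). Table and attribute names are assumptions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Structured core stays relational...
    CREATE TABLE products (
        product_id  INTEGER PRIMARY KEY,
        shop_id     INTEGER NOT NULL,
        title       TEXT NOT NULL,
        price_cents INTEGER NOT NULL,
        -- ...while merchant-specific attributes live in a flexible JSON column.
        attributes  TEXT NOT NULL DEFAULT '{}'
    );
    -- Expression index on one JSON field enables fast filtering on a custom attribute.
    CREATE INDEX idx_products_color
        ON products (json_extract(attributes, '$.color'));
""")

conn.execute(
    "INSERT INTO products (shop_id, title, price_cents, attributes) VALUES (?, ?, ?, ?)",
    (1, "Wool scarf", 2500, json.dumps({"color": "red", "material": "wool"})),
)

rows = conn.execute(
    "SELECT title FROM products WHERE json_extract(attributes, '$.color') = ?",
    ("red",),
).fetchall()
print(rows)  # [('Wool scarf',)]
```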
3. Spotify Music Recommendation Data
Context
Spotify's user listening data model for personalization
Outcome
Dual-path architecture: real-time events for immediate recommendations, batch processing for deep analysis. Different data models optimized for each use case.
Key Lessons
- Listening events and user preferences have different modeling needs
- Real-time recommendations need fast key-value access patterns
- Historical analysis benefits from denormalized, aggregated tables
- Lambda architecture: batch + stream processing for comprehensive coverage
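Not Spotify's real pipeline; a toy sketch of the dual-path idea: raw listening events land in an append-only table, a batch job rolls them up into a denormalized aggregate for analysis, and a fast in-memory key-value view (standing in for a store like Redis) serves real-time lookups. All names are invented.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Append-only event log (the raw, streaming side).
    CREATE TABLE listen_events (
        user_id   INTEGER NOT NULL,
        track_id  INTEGER NOT NULL,
        played_at TEXT NOT NULL
    );
    -- Batch-built, denormalized aggregate used for deeper analysis.
    CREATE TABLE user_track_counts (
        user_id  INTEGER NOT NULL,
        track_id INTEGER NOT NULL,
        plays    INTEGER NOT NULL,
        PRIMARY KEY (user_id, track_id)
    );
""")

# Real-time path: each user's most recent tracks, kept in memory for fast access.
recent_tracks: dict[int, list[int]] = defaultdict(list)

def record_listen(user_id: int, track_id: int, played_at: str):
    conn.execute("INSERT INTO listen_events VALUES (?, ?, ?)", (user_id, track_id, played_at))
    recent_tracks[user_id] = (recent_tracks[user_id] + [track_id])[-50:]  # keep last 50

def batch_rollup():
    """Periodic job: rebuild the aggregate table from the event log."""
    conn.execute("DELETE FROM user_track_counts")
    conn.execute("""
        INSERT INTO user_track_counts
        SELECT user_id, track_id, COUNT(*) FROM listen_events
        GROUP BY user_id, track_id
    """)

record_listen(7, 1001, "2024-06-01T12:00:00Z")
batch_rollup()
```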
4. Uber Trip Data Architecture
Context
Uber's trip and location data model handling millions of rides
Outcome
Separated hot data (active trips) from cold data (historical). Hot data in Redis/memory, cold data in time-partitioned tables with aggressive compression.
Key Lessons
- Trip lifecycle data has very different access patterns - model accordingly
- Active trip data needs microsecond access - in-memory storage essential
- Historical trip data compressed and partitioned by date for analytics
- Geographic data requires specialized indexing (geospatial indexes)
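Not Uber's actual implementation; a toy sketch of the hot/cold split only: active trips live in an in-memory map for fast updates, and when a trip completes it is flushed to a date-partitioned historical table. Geospatial indexing is out of scope here and would use a dedicated index type in a real database.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")

# Hot path: active trips kept in memory (a stand-in for Redis), keyed by trip id.
active_trips: dict[str, dict] = {}

def start_trip(trip_id: str, rider_id: int, lat: float, lon: float):
    active_trips[trip_id] = {
        "rider_id": rider_id,
        "started_at": datetime.now(timezone.utc),
        "last_lat": lat,
        "last_lon": lon,
    }

def update_location(trip_id: str, lat: float, lon: float):
    active_trips[trip_id]["last_lat"] = lat
    active_trips[trip_id]["last_lon"] = lon

def complete_trip(trip_id: str):
    """Move the trip from the hot store to a cold, date-partitioned table."""
    trip = active_trips.pop(trip_id)
    table = "trips_" + trip["started_at"].strftime("%Y_%m_%d")
    conn.execute(f"""
        CREATE TABLE IF NOT EXISTS {table} (
            trip_id  TEXT PRIMARY KEY,
            rider_id INTEGER NOT NULL,
            end_lat  REAL NOT NULL,
            end_lon  REAL NOT NULL
        )
    """)
    conn.execute(
        f"INSERT INTO {table} VALUES (?, ?, ?, ?)",
        (trip_id, trip["rider_id"], trip["last_lat"], trip["last_lon"]),
    )

start_trip("t-1", 99, 37.77, -122.42)
update_location("t-1", 37.78, -122.41)
complete_trip("t-1")
```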
5. GitHub Repository Metadata
Context
GitHub's repository and commit data modeling for code hosting
Outcome
Leveraged Git's existing object model for file storage, PostgreSQL for queryable metadata. Avoided reinventing proven data structures.
Key Lessons
- Don't reinvent proven data models - Git already solves code versioning
- Separate file content storage from queryable metadata
- Repository relationships (forks, stars, follows) fit relational model well
- Search across code requires specialized full-text indexing (Elasticsearch)
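Not GitHub's real design; a sketch of the general split between content-addressed file storage (the idea Git itself uses, keying objects by hash) and relational tables for queryable metadata. Names and columns are illustrative.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Queryable metadata lives in relational tables.
    CREATE TABLE repositories (
        repo_id INTEGER PRIMARY KEY,
        owner   TEXT NOT NULL,
        name    TEXT NOT NULL,
        stars   INTEGER NOT NULL DEFAULT 0,
        UNIQUE (owner, name)
    );
    -- Commits reference file content by hash rather than embedding it.
    CREATE TABLE commits (
        sha     TEXT PRIMARY KEY,
        repo_id INTEGER NOT NULL REFERENCES repositories(repo_id),
        message TEXT NOT NULL
    );
""")

# File content stored separately, addressed by its own hash
# (a dict stands in for a blob store / Git object database).
blob_store: dict[str, bytes] = {}

def store_blob(content: bytes) -> str:
    sha = hashlib.sha1(content).hexdigest()
    blob_store[sha] = content          # idempotent: same content, same key
    return sha

readme_sha = store_blob(b"# hello world\n")
conn.execute("INSERT INTO repositories (owner, name) VALUES (?, ?)", ("octocat", "hello"))
```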
6. Stripe Payment Data Compliance
Context
Stripe's payment data model balancing performance with compliance
Outcome
Encrypted sensitive data at rest, tokenized card data, audit logs for every operation. Compliance requirements drove data model design.
Key Lessons
- Financial data modeling must prioritize compliance over convenience
- Sensitive data (PII, card numbers) encrypted with separate key management
- Audit trail requirements affect every table design decision
- Real-time fraud detection needs denormalized feature tables for ML
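Not Stripe's real system; a toy sketch of two of the patterns above: tokenization (the application database stores only an opaque token, never the card number, which lives in a separate vault with its own controls) and an append-only audit trail with one row per sensitive operation. All names are invented.

```python
import secrets
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The application database only ever sees tokens, never raw card numbers.
    CREATE TABLE payment_methods (
        token       TEXT PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        card_last4  TEXT NOT NULL          -- safe-to-store display data only
    );
    -- Append-only audit trail: one row per sensitive operation.
    CREATE TABLE audit_log (
        id         INTEGER PRIMARY KEY,
        actor      TEXT NOT NULL,
        action     TEXT NOT NULL,
        object_ref TEXT NOT NULL,
        at         TEXT NOT NULL
    );
""")

# Stand-in for an isolated card vault; in production this would be a separate,
# heavily restricted and encrypted service with its own key management.
card_vault: dict[str, str] = {}

def tokenize_card(customer_id: int, card_number: str) -> str:
    token = "tok_" + secrets.token_hex(12)
    card_vault[token] = card_number                    # only the vault holds the number
    conn.execute(
        "INSERT INTO payment_methods VALUES (?, ?, ?)",
        (token, customer_id, card_number[-4:]),
    )
    conn.execute(
        "INSERT INTO audit_log (actor, action, object_ref, at) VALUES (?, ?, ?, ?)",
        ("api", "tokenize_card", token, datetime.now(timezone.utc).isoformat()),
    )
    return token

token = tokenize_card(501, "4242424242424242")
```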