Data Modeling Basics

25 min read · Intermediate

Learn the fundamentals of structuring data effectively. Master entity relationships, normalization principles, and schema design patterns that scale from startup to enterprise.

What is Data Modeling?

Data modeling is the process of creating a visual representation of data and its relationships. It's the blueprint that guides how information is stored, accessed, and maintained in your system.

Why Data Modeling Matters

  • Performance: A well-designed schema can make queries 10-100x faster
  • Scalability: Proper relationships handle growth gracefully
  • Maintainability: Clear structure reduces bugs and technical debt
  • Cost: Efficient storage and queries save infrastructure costs

Conceptual Model

High-level business view of data entities and relationships

Example: "Customers place orders for products"

Logical Model

Detailed structure with attributes, keys, and constraints

Example: Customer(id, email, name), Order(id, customer_id, date)

Physical Model

Database-specific implementation with indexes and storage

Example: PostgreSQL tables with specific data types and indexes
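
As a sketch of how the logical model above might become physical, here is one possible PostgreSQL rendering. The types and index names are illustrative choices, not prescribed by the model:

  CREATE TABLE customers (
      id    BIGSERIAL PRIMARY KEY,
      email TEXT NOT NULL UNIQUE,
      name  TEXT NOT NULL
  );

  CREATE TABLE orders (
      id          BIGSERIAL PRIMARY KEY,
      customer_id BIGINT NOT NULL REFERENCES customers (id),
      order_date  DATE   NOT NULL
  );

  -- A physical-level decision: index the foreign key for JOIN performance.
  CREATE INDEX idx_orders_customer_id ON orders (customer_id);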

Entity Relationships

Understanding how entities relate to each other is crucial for designing efficient and logical database schemas.

1. One-to-One (1:1)

Each record in table A relates to exactly one record in table B.

  • Example: User ↔ User Profile
  • Implementation: Foreign key with a unique constraint in either table
  • Common use cases: User authentication details, profile extensions
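
A minimal PostgreSQL sketch of this pattern (table and column names are illustrative): the UNIQUE constraint on the foreign key is what enforces "exactly one".

  CREATE TABLE users (
      id    BIGSERIAL PRIMARY KEY,
      email TEXT NOT NULL UNIQUE
  );

  CREATE TABLE user_profiles (
      id      BIGSERIAL PRIMARY KEY,
      -- UNIQUE turns an ordinary 1:N foreign key into a 1:1 link.
      user_id BIGINT NOT NULL UNIQUE REFERENCES users (id),
      bio     TEXT
  );
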
2. One-to-Many (1:N)

One record in table A can relate to multiple records in table B.

  • Example: Customer → Orders
  • Implementation: Foreign key in the "many" table
  • Common use cases: Blog posts and comments, orders and items
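
A sketch using the blog posts and comments use case (names illustrative). Note that the foreign key sits on the "many" side:

  CREATE TABLE posts (
      id    BIGSERIAL PRIMARY KEY,
      title TEXT NOT NULL
  );

  CREATE TABLE comments (
      id      BIGSERIAL PRIMARY KEY,
      -- The "many" side carries the foreign key: one post, many comments.
      post_id BIGINT NOT NULL REFERENCES posts (id),
      body    TEXT NOT NULL
  );
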
3. Many-to-Many (M:N)

Multiple records in table A can relate to multiple records in table B.

  • Example: Students ↔ Courses
  • Implementation: Junction table with foreign keys to both tables
  • Common use cases: User permissions, product tags
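
A sketch of the junction-table implementation for the students/courses example (names illustrative). The composite primary key also prevents duplicate enrollments:

  CREATE TABLE students (
      id   BIGSERIAL PRIMARY KEY,
      name TEXT NOT NULL
  );

  CREATE TABLE courses (
      id    BIGSERIAL PRIMARY KEY,
      title TEXT NOT NULL
  );

  -- Junction table: each row links one student to one course.
  CREATE TABLE enrollments (
      student_id BIGINT NOT NULL REFERENCES students (id),
      course_id  BIGINT NOT NULL REFERENCES courses (id),
      PRIMARY KEY (student_id, course_id)
  );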

Database Normalization

Normalization eliminates data redundancy and ensures data integrity by organizing tables according to specific rules. Each level builds upon the previous one.

1. 1NF (First Normal Form)

Each column contains atomic values, with no repeating groups.

  • Example: Split an attendees value like "John, Jane" into separate rows, one per person
  • Benefit: Eliminates multi-valued columns
  • Complexity: Simple
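
A sketch of the fix, assuming a hypothetical meetings table whose attendees column held "John, Jane" as a single value:

  -- Before (violates 1NF): meetings(id, attendees), attendees = 'John, Jane'

  -- After: one atomic value per row.
  CREATE TABLE meeting_attendees (
      meeting_id BIGINT NOT NULL,
      attendee   TEXT   NOT NULL,
      PRIMARY KEY (meeting_id, attendee)
  );
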
2. 2NF (Second Normal Form)

Must be in 1NF, and every non-key attribute must depend on the entire primary key (no partial dependencies).

  • Example: Move customer details from the order table into a separate customer table
  • Benefit: Reduces redundancy, easier updates
  • Complexity: Medium
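
A sketch of removing a partial dependency, using a hypothetical order_items table keyed on (order_id, product_id): product_name depends on product_id alone, not on the whole key, so it moves out.

  -- Before (violates 2NF):
  --   order_items(order_id, product_id, product_name, quantity)
  --   product_name depends only on product_id, a partial dependency.

  -- After: the partially dependent attribute lives with its real key.
  CREATE TABLE products (
      id   BIGSERIAL PRIMARY KEY,
      name TEXT NOT NULL
  );

  CREATE TABLE order_items (
      order_id   BIGINT NOT NULL,
      product_id BIGINT NOT NULL REFERENCES products (id),
      quantity   INT    NOT NULL,
      PRIMARY KEY (order_id, product_id)
  );
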
3. 3NF (Third Normal Form)

Must be in 2NF, with no transitive dependencies (non-key attributes must not depend on other non-key attributes).

  • Example: Move city/state data that is determined by zip code into a separate location table
  • Benefit: Eliminates indirect dependencies
  • Complexity: High
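
A sketch of the zip-code example (names illustrative): city and state depend on zip_code rather than on the address key, so they move to a lookup table.

  -- Before (violates 3NF): addresses(id, street, zip_code, city, state)

  -- After: the transitively dependent attributes follow zip_code.
  CREATE TABLE zip_codes (
      zip_code TEXT PRIMARY KEY,
      city     TEXT NOT NULL,
      state    TEXT NOT NULL
  );

  CREATE TABLE addresses (
      id       BIGSERIAL PRIMARY KEY,
      street   TEXT NOT NULL,
      zip_code TEXT NOT NULL REFERENCES zip_codes (zip_code)
  );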

When NOT to Normalize

  • Read-heavy applications where JOINs hurt performance (see the sketch after this list)
  • Data warehousing scenarios (denormalization for analytics)
  • Caching layers where redundancy improves speed
  • When the complexity of a fully normalized schema costs the team more than the redundancy it prevents
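
One way to denormalize for a read-heavy path is to precompute a JOIN. A minimal PostgreSQL sketch using a materialized view, assuming customer and order tables like the ones sketched earlier:

  -- Readers hit the precomputed view and pay no JOIN cost.
  CREATE MATERIALIZED VIEW order_summaries AS
  SELECT o.id, o.order_date, c.name AS customer_name
  FROM orders o
  JOIN customers c ON c.id = o.customer_id;

  -- The redundant copy must be refreshed when the source data changes.
  REFRESH MATERIALIZED VIEW order_summaries;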

Indexing for Performance

Indexes are data structures that improve query performance by creating shortcuts to your data. Choose the right indexing strategy based on your query patterns.

Query Performance Comparison

Approximate query time for each strategy, relative to an optimized primary-key lookup (1x baseline):

  • Primary Index: clustered index on the primary key. ~1x (baseline). Best for: ID-based lookups
  • Secondary Index: non-clustered index on frequently queried columns. ~10x slower. Best for: email and username searches
  • Composite Index: index on multiple columns together. ~50x slower. Best for: multi-field filters
  • Full Table Scan: no index, reads the entire table. ~1000x slower. Best for: nothing; avoid at all costs
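
To see which of these plans a query actually gets, PostgreSQL's EXPLAIN ANALYZE is the standard tool. A quick illustrative check, assuming a users table with an email column:

  EXPLAIN ANALYZE
  SELECT * FROM users WHERE email = 'ada@example.com';
  -- With an index on email, the plan shows an Index Scan;
  -- without one, it falls back to a Seq Scan (full table scan).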

When to Add Indexes

  • WHERE clauses (columns used in filtering): high priority
  • JOIN conditions (foreign key relationships): high priority
  • ORDER BY (sorting operations): medium priority
  • GROUP BY (aggregation queries): medium priority
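
Illustrative PostgreSQL statements matching those priorities; the table and index names are assumptions borrowed from the e-commerce example below:

  CREATE INDEX idx_users_email       ON users  (email);       -- WHERE filtering
  CREATE INDEX idx_orders_user_id    ON orders (user_id);     -- JOIN on a foreign key
  CREATE INDEX idx_orders_created_at ON orders (created_at);  -- ORDER BY / date ranges
  CREATE INDEX idx_orders_status     ON orders (status);      -- GROUP BY / admin views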

Index Trade-offs

  • Pros: faster queries, better user experience
  • Cons: slower writes, more storage space

Real-World Example: E-commerce Schema

Let's apply data modeling principles to design a schema for an e-commerce platform. This example demonstrates relationships, normalization, and indexing decisions.

Core Entities

  • Users: id, email, password_hash, created_at. Indexes: email (login), created_at (analytics)
  • Products: id, name, description, price, category_id. Indexes: category_id (browsing), price (filtering)
  • Orders: id, user_id, total, status, created_at. Indexes: user_id (user history), status (admin)
  • Order_Items: order_id, product_id, quantity, price. Composite primary key: (order_id, product_id)
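
A possible PostgreSQL rendering of these entities. This is a sketch under the assumptions above, using NUMERIC for money (exact arithmetic, unlike FLOAT) and snapshotting the price in order_items:

  CREATE TABLE users (
      id            BIGSERIAL PRIMARY KEY,
      email         TEXT NOT NULL UNIQUE,
      password_hash TEXT NOT NULL,
      created_at    TIMESTAMPTZ NOT NULL DEFAULT now()
  );

  CREATE TABLE categories (
      id        BIGSERIAL PRIMARY KEY,
      name      TEXT NOT NULL,
      parent_id BIGINT REFERENCES categories (id)  -- flexible taxonomy
  );

  CREATE TABLE products (
      id          BIGSERIAL PRIMARY KEY,
      name        TEXT NOT NULL,
      description TEXT,
      price       NUMERIC(10,2) NOT NULL,
      category_id BIGINT REFERENCES categories (id)
  );

  CREATE TABLE orders (
      id         BIGSERIAL PRIMARY KEY,
      user_id    BIGINT NOT NULL REFERENCES users (id),
      total      NUMERIC(10,2) NOT NULL,
      status     TEXT NOT NULL,
      created_at TIMESTAMPTZ NOT NULL DEFAULT now()
  );

  CREATE TABLE order_items (
      order_id   BIGINT NOT NULL REFERENCES orders (id),
      product_id BIGINT NOT NULL REFERENCES products (id),
      quantity   INT NOT NULL CHECK (quantity > 0),
      price      NUMERIC(10,2) NOT NULL,  -- snapshot at purchase time
      PRIMARY KEY (order_id, product_id)
  );

  CREATE INDEX idx_users_created_at  ON users    (created_at);
  CREATE INDEX idx_products_category ON products (category_id);
  CREATE INDEX idx_products_price    ON products (price);
  CREATE INDEX idx_orders_user_id    ON orders   (user_id);
  CREATE INDEX idx_orders_status     ON orders   (status);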

Design Decisions

  • Normalization choice: 3NF for consistency; denormalize a product cache for reads
  • Price storage: store the unit price in order_items so past orders are unaffected by later price changes
  • Category hierarchy: a separate categories table for a flexible taxonomy
  • Audit trail: created_at and updated_at on all tables for debugging

Data Modeling Best Practices

Schema Design

  • ☐ Start with business requirements, not technical constraints
  • ☐ Use meaningful, consistent naming conventions
  • ☐ Define clear primary and foreign key relationships
  • ☐ Add constraints to enforce data integrity
  • ☐ Plan for data evolution and migrations

Performance Optimization

  • ☐ Index frequently queried columns
  • ☐ Avoid over-indexing (impacts write performance)
  • ☐ Consider denormalization for read-heavy workloads
  • ☐ Use appropriate data types (save storage)
  • ☐ Monitor query patterns and optimize accordingly

🎯 Data Modeling Decisions in Production

Learn from real-world data modeling challenges and solutions from major tech companies

Scenarios

  • Slack Message Storage Evolution: how Slack evolved its message storage model to handle billions of messages
  • Shopify Product Catalog Schema: Shopify's flexible product model supporting millions of merchants
  • Spotify Music Recommendation Data: Spotify's user listening data model for personalization
  • Uber Trip Data Architecture: Uber's trip and location data model handling millions of rides
  • GitHub Repository Metadata: GitHub's repository and commit data modeling for code hosting
  • Stripe Payment Data Compliance: Stripe's payment data model balancing performance with compliance

Context

The first scenario in detail: how Slack evolved its message storage model to handle billions of messages.

Metrics

  • Daily messages: 10+ billion
  • Original model: single messages table
  • New model: time-partitioned shards
  • Query performance: 100x faster

Outcome

Redesigned from single table to time-based partitioning with denormalized data for common queries. Accepted some data redundancy for massive performance gains.

Key Lessons

  • Single-table design fails at massive scale; time-based partitioning is essential (see the sketch after this list)
  • Denormalization is acceptable when read performance is critical
  • Message search requires a separate indexing system (Elasticsearch)
  • Channel metadata and message content have different access patterns, so they warrant separate storage
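
A minimal PostgreSQL sketch of time-based range partitioning. This illustrates the technique only; it is not Slack's actual schema:

  CREATE TABLE messages (
      id         BIGSERIAL,
      channel_id BIGINT      NOT NULL,
      body       TEXT        NOT NULL,
      created_at TIMESTAMPTZ NOT NULL,
      PRIMARY KEY (id, created_at)   -- the partition key must be in the PK
  ) PARTITION BY RANGE (created_at);

  -- One partition per month; old partitions can be archived or dropped cheaply.
  CREATE TABLE messages_2024_01 PARTITION OF messages
      FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
  CREATE TABLE messages_2024_02 PARTITION OF messages
      FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');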

📝 Data Modeling Quiz


In Third Normal Form (3NF), what additional constraint is added beyond Second Normal Form?