Data Modeling Basics
Learn the fundamentals of structuring data effectively. Master entity relationships, normalization principles, and schema design patterns that scale from startup to enterprise.
What is Data Modeling?
Data modeling is the process of creating a visual representation of data and its relationships. It's the blueprint that guides how information is stored, accessed, and maintained in your system.
Why Data Modeling Matters
- Performance: A well-designed schema can make common queries 10-100x faster than a poorly structured one
- Scalability: Properly modeled relationships handle growth gracefully
- Maintainability: A clear structure reduces bugs and technical debt
- Cost: Efficient storage and queries lower infrastructure spend
Conceptual Model
High-level business view of data entities and relationships
Logical Model
Detailed structure with attributes, keys, and constraints
Physical Model
Database-specific implementation with indexes and storage
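To make the three levels concrete, here is a small sketch of the same "customers place orders" idea at each level. The entity and column names are illustrative assumptions, not part of any particular system.

```python
from dataclasses import dataclass
import sqlite3

# Conceptual model: business entities and how they relate.
#   Customer --(places)--> Order        (one customer, many orders)

# Logical model: attributes, keys, and constraints, independent of any database.
@dataclass
class Customer:
    customer_id: int          # primary key
    email: str                # required, unique

@dataclass
class Order:
    order_id: int             # primary key
    customer_id: int          # foreign key -> Customer
    total_cents: int          # money stored as integer cents

# Physical model: a database-specific implementation with types and indexes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT NOT NULL UNIQUE
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        total_cents INTEGER NOT NULL
    );
    CREATE INDEX idx_orders_customer ON orders(customer_id);
""")
```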
Entity Relationships
Understanding how entities relate to each other is crucial for designing efficient and logical database schemas.
One-to-One (1:1)
Each record in table A relates to exactly one record in table B
One-to-Many (1:N)
One record in table A can relate to multiple records in table B
Many-to-Many (M:N)
Multiple records in table A can relate to multiple records in table B
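A minimal sketch of how each relationship type is typically implemented in a relational schema (the table names are invented for illustration): 1:1 via a unique or shared primary-key foreign key, 1:N via a foreign key on the "many" side, and M:N via a junction table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One-to-One (1:1): each user has at most one profile.
    -- Enforced by making the foreign key the primary key (so it is unique).
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        email   TEXT NOT NULL UNIQUE
    );
    CREATE TABLE user_profiles (
        user_id INTEGER PRIMARY KEY REFERENCES users(user_id),
        bio     TEXT
    );

    -- One-to-Many (1:N): one user writes many posts.
    -- The foreign key lives on the "many" side.
    CREATE TABLE posts (
        post_id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(user_id),
        body    TEXT NOT NULL
    );

    -- Many-to-Many (M:N): posts have many tags, tags apply to many posts.
    -- A junction table with a composite primary key prevents duplicate pairs.
    CREATE TABLE tags (
        tag_id INTEGER PRIMARY KEY,
        name   TEXT NOT NULL UNIQUE
    );
    CREATE TABLE post_tags (
        post_id INTEGER NOT NULL REFERENCES posts(post_id),
        tag_id  INTEGER NOT NULL REFERENCES tags(tag_id),
        PRIMARY KEY (post_id, tag_id)
    );
""")
```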
Database Normalization
Normalization eliminates data redundancy and ensures data integrity by organizing tables according to specific rules. Each level builds upon the previous one.
1NF (First Normal Form)
Each column contains atomic values, no repeating groups
2NF (Second Normal Form)
Must be in 1NF, and every non-key attribute depends on the entire primary key (no partial dependencies)
3NF (Third Normal Form)
Must be in 2NF, and no non-key attribute depends on another non-key attribute (no transitive dependencies)
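As a small worked example (with invented columns), the sketch below takes a denormalized orders table through the three forms: atomic values and a separate line-items table for 1NF, product attributes moved out because they depend on only part of the composite key for 2NF, and the transitive dependency customer → city → tax rate split out for 3NF.

```python
# Denormalized starting point (violates all three forms):
#   orders(order_id, product_ids, customer_name, customer_city, city_tax_rate)
#   - product_ids is a comma-separated list                  -> violates 1NF
#   - product name/price repeat for every order line         -> 2NF concern
#   - city_tax_rate depends on customer_city, not the order  -> 3NF violation
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- 1NF: one atomic value per column; order lines get their own table.
    CREATE TABLE order_items (
        order_id   INTEGER NOT NULL,
        product_id INTEGER NOT NULL,
        quantity   INTEGER NOT NULL,
        PRIMARY KEY (order_id, product_id)
    );

    -- 2NF: product attributes depend only on product_id (part of the key),
    -- so they move to their own table instead of repeating per order line.
    CREATE TABLE products (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        unit_price INTEGER NOT NULL
    );

    -- 3NF: tax rate depends on the city, not on the customer directly, so the
    -- transitive dependency customer -> city -> tax_rate is split out.
    CREATE TABLE cities (
        city_id  INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        tax_rate REAL NOT NULL
    );
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city_id     INTEGER NOT NULL REFERENCES cities(city_id)
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id)
    );
""")
```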
When NOT to Normalize
- Read-heavy applications where repeated JOINs hurt performance (see the sketch after this list)
- Data warehousing scenarios, where denormalized schemas are standard for analytics
- Caching layers where redundancy improves speed
- When the complexity of a fully normalized schema costs the team more than the redundancy it removes
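One common denormalization pattern is to keep a precomputed read model alongside the normalized source of truth, so the hot query needs no JOINs or aggregation. The table and column names below are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized source of truth.
    CREATE TABLE orders      (order_id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL);
    CREATE TABLE order_items (order_id INTEGER, product_id INTEGER, quantity INTEGER, unit_price INTEGER);

    -- Denormalized read model: one row per order with totals precomputed.
    CREATE TABLE order_summaries (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        item_count  INTEGER NOT NULL,
        total_cents INTEGER NOT NULL
    );
""")

def refresh_summary(conn, order_id):
    """Recompute one order's summary row after its items change."""
    conn.execute("""
        INSERT OR REPLACE INTO order_summaries
        SELECT o.order_id,
               o.customer_id,
               COUNT(i.product_id),
               COALESCE(SUM(i.quantity * i.unit_price), 0)
        FROM orders o LEFT JOIN order_items i USING (order_id)
        WHERE o.order_id = ?
        GROUP BY o.order_id
    """, (order_id,))
```

The trade-off is exactly the one listed above: reads get faster, but every write now has to keep the redundant copy in sync.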
Indexing for Performance
Indexes are data structures that improve query performance by creating shortcuts to your data. Choose the right indexing strategy based on your query patterns.
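A quick way to see an index's effect is to compare the query plan before and after creating it. The sketch below uses SQLite's EXPLAIN QUERY PLAN; the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(10_000)],
)

query = "SELECT user_id FROM users WHERE email = ?"

# Without an index the plan reports a full table scan ("SCAN users").
print(conn.execute("EXPLAIN QUERY PLAN " + query, ("user42@example.com",)).fetchall())

# Add an index on the frequently queried column.
conn.execute("CREATE INDEX idx_users_email ON users(email)")

# With the index the plan switches to "SEARCH users USING INDEX idx_users_email".
print(conn.execute("EXPLAIN QUERY PLAN " + query, ("user42@example.com",)).fetchall())
```

The trade-off: every index speeds up the reads it matches but must be updated on every write, which is why over-indexing slows inserts and updates.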
Real-World Example: E-commerce Schema
Let's apply data modeling principles to design a schema for an e-commerce platform. This example demonstrates relationships, normalization, and indexing decisions.
Core Entities
Design Decisions
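A minimal sketch of what such a schema might look like. The specific entities (users, products, orders, order_items) and the decisions noted in the comments are illustrative assumptions for the example, not a definitive design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Core entities: users, products, orders, and an order_items junction table
    -- (orders and products are many-to-many, resolved through order_items).
    CREATE TABLE users (
        user_id    INTEGER PRIMARY KEY,
        email      TEXT NOT NULL UNIQUE,           -- natural lookup key, kept unique
        created_at TEXT NOT NULL DEFAULT (datetime('now'))
    );

    CREATE TABLE products (
        product_id  INTEGER PRIMARY KEY,
        sku         TEXT NOT NULL UNIQUE,
        name        TEXT NOT NULL,
        price_cents INTEGER NOT NULL CHECK (price_cents >= 0)  -- money as integer cents
    );

    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL REFERENCES users(user_id),
        status     TEXT NOT NULL DEFAULT 'pending'
                   CHECK (status IN ('pending', 'paid', 'shipped', 'cancelled')),
        created_at TEXT NOT NULL DEFAULT (datetime('now'))
    );

    CREATE TABLE order_items (
        order_id         INTEGER NOT NULL REFERENCES orders(order_id),
        product_id       INTEGER NOT NULL REFERENCES products(product_id),
        quantity         INTEGER NOT NULL CHECK (quantity > 0),
        -- Design decision: copy the price at purchase time so later catalog
        -- price changes do not rewrite historical orders (deliberate redundancy).
        unit_price_cents INTEGER NOT NULL,
        PRIMARY KEY (order_id, product_id)
    );

    -- Design decision: index the access paths the storefront actually uses.
    CREATE INDEX idx_orders_user   ON orders(user_id, created_at);
    CREATE INDEX idx_items_product ON order_items(product_id);
""")
```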
Data Modeling Best Practices
Schema Design
- ☐ Start with business requirements, not technical constraints
- ☐ Use meaningful, consistent naming conventions
- ☐ Define clear primary and foreign key relationships
- ☐ Add constraints to enforce data integrity
- ☐ Plan for data evolution and migrations
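Planning for evolution is easiest to see in miniature. Below is a hedged sketch of a tiny, hand-rolled migration runner (real projects would typically use a migration tool); the table, column, and version details are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT NOT NULL)")

# Track the schema version in the database and apply pending migrations in order.
MIGRATIONS = [
    # 1: users gain a display name, backfilled from email for existing rows.
    """
    ALTER TABLE users ADD COLUMN display_name TEXT;
    UPDATE users SET display_name = email WHERE display_name IS NULL;
    """,
    # 2: add an index to support a new "search by name" feature.
    "CREATE INDEX idx_users_display_name ON users(display_name);",
]

def migrate(conn):
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version, sql in enumerate(MIGRATIONS, start=1):
        if version > current:
            conn.executescript(sql)
            conn.execute(f"PRAGMA user_version = {version}")

migrate(conn)
```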
Performance Optimization
- ☐ Index frequently queried columns
- ☐ Avoid over-indexing (impacts write performance)
- ☐ Consider denormalization for read-heavy workloads
- ☐ Use appropriate data types (save storage)
- ☐ Monitor query patterns and optimize accordingly
🎯 Data Modeling Decisions in Production
Learn from real-world data modeling challenges and solutions from major tech companies
1. Slack Message Storage Evolution
Context
How Slack evolved their message storage model to handle billions of messages
Outcome
Redesigned from single table to time-based partitioning with denormalized data for common queries. Accepted some data redundancy for massive performance gains.
Key Lessons
- Single table design fails at massive scale - time-based partitioning essential
- Denormalization acceptable when read performance is critical
- Message search requires separate indexing system (Elasticsearch)
- Channel metadata and message content have different access patterns - separate storage
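This is not Slack's actual schema; as a rough illustration of the time-based partitioning idea, the sketch below routes each message to a per-month table so that recent-history queries touch only a small, hot table (production databases such as PostgreSQL offer declarative partitioning instead of hand-rolled routing).

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")

def partition_for(ts: datetime) -> str:
    """Name of the monthly partition table a message belongs to."""
    return f"messages_{ts.year:04d}_{ts.month:02d}"

def insert_message(conn, channel_id: int, sender_id: int, body: str, ts: datetime):
    table = partition_for(ts)
    # Create the partition lazily the first time a month is seen.
    conn.execute(f"""
        CREATE TABLE IF NOT EXISTS {table} (
            message_id INTEGER PRIMARY KEY,
            channel_id INTEGER NOT NULL,
            sender_id  INTEGER NOT NULL,
            body       TEXT NOT NULL,
            sent_at    TEXT NOT NULL
        )
    """)
    conn.execute(f"CREATE INDEX IF NOT EXISTS idx_{table}_channel ON {table}(channel_id, sent_at)")
    conn.execute(
        f"INSERT INTO {table} (channel_id, sender_id, body, sent_at) VALUES (?, ?, ?, ?)",
        (channel_id, sender_id, body, ts.isoformat()),
    )

insert_message(conn, 1, 42, "hello", datetime(2024, 5, 1, tzinfo=timezone.utc))
```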
2. Shopify Product Catalog Schema
Context
Shopify's flexible product model supporting millions of merchants
Outcome
Hybrid model: structured data in relational tables, flexible attributes in JSON columns. Best of both worlds for e-commerce flexibility.
Key Lessons
- E-commerce requires extreme schema flexibility - different stores need different attributes
- JSON columns in PostgreSQL provide flexibility without sacrificing ACID compliance
- Core product relationships (pricing, inventory) remain in structured tables
- Separate indexes on JSON attributes enable fast filtering on custom fields
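Not Shopify's actual schema; a sketch of the hybrid pattern, shown with SQLite's JSON functions so it runs self-contained (PostgreSQL's jsonb columns with GIN or expression indexes are the production analogue). Table and attribute names are assumptions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Structured core stays relational...
    CREATE TABLE products (
        product_id  INTEGER PRIMARY KEY,
        shop_id     INTEGER NOT NULL,
        title       TEXT NOT NULL,
        price_cents INTEGER NOT NULL,
        -- ...while merchant-specific attributes live in a flexible JSON column.
        attributes  TEXT NOT NULL DEFAULT '{}'
    );
    -- Expression index on one JSON field enables fast filtering on a custom attribute.
    CREATE INDEX idx_products_color
        ON products (json_extract(attributes, '$.color'));
""")

conn.execute(
    "INSERT INTO products (shop_id, title, price_cents, attributes) VALUES (?, ?, ?, ?)",
    (1, "Wool scarf", 2500, json.dumps({"color": "red", "material": "wool"})),
)

rows = conn.execute(
    "SELECT title FROM products WHERE json_extract(attributes, '$.color') = ?",
    ("red",),
).fetchall()
print(rows)  # [('Wool scarf',)]
```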
3. Spotify Music Recommendation Data
Context
Spotify's user listening data model for personalization
Outcome
Dual-path architecture: real-time events for immediate recommendations, batch processing for deep analysis. Different data models optimized for each use case.
Key Lessons
- Listening events and user preferences have different modeling needs
- Real-time recommendations need fast key-value access patterns
- Historical analysis benefits from denormalized, aggregated tables
- Lambda architecture: batch + stream processing for comprehensive coverage
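Not Spotify's real pipeline; a toy sketch of the dual-path idea: raw listening events land in an append-only table, a batch job rolls them up into a denormalized aggregate for analysis, and a fast in-memory key-value view (standing in for a store like Redis) serves real-time lookups. All names are invented.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Append-only event log (the raw, streaming side).
    CREATE TABLE listen_events (
        user_id   INTEGER NOT NULL,
        track_id  INTEGER NOT NULL,
        played_at TEXT NOT NULL
    );
    -- Batch-built, denormalized aggregate used for deeper analysis.
    CREATE TABLE user_track_counts (
        user_id  INTEGER NOT NULL,
        track_id INTEGER NOT NULL,
        plays    INTEGER NOT NULL,
        PRIMARY KEY (user_id, track_id)
    );
""")

# Real-time path: each user's most recent tracks, kept in memory for fast access.
recent_tracks: dict[int, list[int]] = defaultdict(list)

def record_listen(user_id: int, track_id: int, played_at: str):
    conn.execute("INSERT INTO listen_events VALUES (?, ?, ?)", (user_id, track_id, played_at))
    recent_tracks[user_id] = (recent_tracks[user_id] + [track_id])[-50:]  # keep last 50

def batch_rollup():
    """Periodic job: rebuild the aggregate table from the event log."""
    conn.execute("DELETE FROM user_track_counts")
    conn.execute("""
        INSERT INTO user_track_counts
        SELECT user_id, track_id, COUNT(*) FROM listen_events
        GROUP BY user_id, track_id
    """)

record_listen(7, 1001, "2024-06-01T12:00:00Z")
batch_rollup()
```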
4. Uber Trip Data Architecture
Context
Uber's trip and location data model handling millions of rides
Outcome
Separated hot data (active trips) from cold data (historical). Hot data in Redis/memory, cold data in time-partitioned tables with aggressive compression.
Key Lessons
- Trip lifecycle data has very different access patterns - model accordingly
- Active trip data needs microsecond access - in-memory storage essential
- Historical trip data compressed and partitioned by date for analytics
- Geographic data requires specialized indexing (geospatial indexes)
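Not Uber's actual implementation; a toy sketch of the hot/cold split only: active trips live in an in-memory map for fast updates, and when a trip completes it is flushed to a date-partitioned historical table. Geospatial indexing is out of scope here and would use a dedicated index type in a real database.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")

# Hot path: active trips kept in memory (a stand-in for Redis), keyed by trip id.
active_trips: dict[str, dict] = {}

def start_trip(trip_id: str, rider_id: int, lat: float, lon: float):
    active_trips[trip_id] = {
        "rider_id": rider_id,
        "started_at": datetime.now(timezone.utc),
        "last_lat": lat,
        "last_lon": lon,
    }

def update_location(trip_id: str, lat: float, lon: float):
    active_trips[trip_id]["last_lat"] = lat
    active_trips[trip_id]["last_lon"] = lon

def complete_trip(trip_id: str):
    """Move the trip from the hot store to a cold, date-partitioned table."""
    trip = active_trips.pop(trip_id)
    table = "trips_" + trip["started_at"].strftime("%Y_%m_%d")
    conn.execute(f"""
        CREATE TABLE IF NOT EXISTS {table} (
            trip_id  TEXT PRIMARY KEY,
            rider_id INTEGER NOT NULL,
            end_lat  REAL NOT NULL,
            end_lon  REAL NOT NULL
        )
    """)
    conn.execute(
        f"INSERT INTO {table} VALUES (?, ?, ?, ?)",
        (trip_id, trip["rider_id"], trip["last_lat"], trip["last_lon"]),
    )

start_trip("t-1", 99, 37.77, -122.42)
update_location("t-1", 37.78, -122.41)
complete_trip("t-1")
```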
5. GitHub Repository Metadata
Context
GitHub's repository and commit data modeling for code hosting
Outcome
Leveraged Git's existing object model for file storage, PostgreSQL for queryable metadata. Avoided reinventing proven data structures.
Key Lessons
- Don't reinvent proven data models - Git already solves code versioning
- Separate file content storage from queryable metadata
- Repository relationships (forks, stars, follows) fit relational model well
- Search across code requires specialized full-text indexing (Elasticsearch)
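Not GitHub's real design; a sketch of the general split between content-addressed file storage (the idea Git itself uses, keying objects by hash) and relational tables for queryable metadata. Names and columns are illustrative.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Queryable metadata lives in relational tables.
    CREATE TABLE repositories (
        repo_id INTEGER PRIMARY KEY,
        owner   TEXT NOT NULL,
        name    TEXT NOT NULL,
        stars   INTEGER NOT NULL DEFAULT 0,
        UNIQUE (owner, name)
    );
    -- Commits reference file content by hash rather than embedding it.
    CREATE TABLE commits (
        sha     TEXT PRIMARY KEY,
        repo_id INTEGER NOT NULL REFERENCES repositories(repo_id),
        message TEXT NOT NULL
    );
""")

# File content stored separately, addressed by its own hash
# (a dict stands in for a blob store / Git object database).
blob_store: dict[str, bytes] = {}

def store_blob(content: bytes) -> str:
    sha = hashlib.sha1(content).hexdigest()
    blob_store[sha] = content          # idempotent: same content, same key
    return sha

readme_sha = store_blob(b"# hello world\n")
conn.execute("INSERT INTO repositories (owner, name) VALUES (?, ?)", ("octocat", "hello"))
```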
6. Stripe Payment Data Compliance
Context
Stripe's payment data model balancing performance with compliance
Outcome
Encrypted sensitive data at rest, tokenized card data, audit logs for every operation. Compliance requirements drove data model design.
Key Lessons
- Financial data modeling must prioritize compliance over convenience
- Sensitive data (PII, card numbers) encrypted with separate key management
- Audit trail requirements affect every table design decision
- Real-time fraud detection needs denormalized feature tables for ML
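Not Stripe's real system; a toy sketch of two of the patterns above: tokenization (the application database stores only an opaque token, never the card number, which lives in a separate vault with its own controls) and an append-only audit trail with one row per sensitive operation. All names are invented.

```python
import secrets
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The application database only ever sees tokens, never raw card numbers.
    CREATE TABLE payment_methods (
        token       TEXT PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        card_last4  TEXT NOT NULL          -- safe-to-store display data only
    );
    -- Append-only audit trail: one row per sensitive operation.
    CREATE TABLE audit_log (
        id         INTEGER PRIMARY KEY,
        actor      TEXT NOT NULL,
        action     TEXT NOT NULL,
        object_ref TEXT NOT NULL,
        at         TEXT NOT NULL
    );
""")

# Stand-in for an isolated card vault; in production this would be a separate,
# heavily restricted and encrypted service with its own key management.
card_vault: dict[str, str] = {}

def tokenize_card(customer_id: int, card_number: str) -> str:
    token = "tok_" + secrets.token_hex(12)
    card_vault[token] = card_number                    # only the vault holds the number
    conn.execute(
        "INSERT INTO payment_methods VALUES (?, ?, ?)",
        (token, customer_id, card_number[-4:]),
    )
    conn.execute(
        "INSERT INTO audit_log (actor, action, object_ref, at) VALUES (?, ?, ?, ?)",
        ("api", "tokenize_card", token, datetime.now(timezone.utc).isoformat()),
    )
    return token

token = tokenize_card(501, "4242424242424242")
```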