
DuckDB

Master DuckDB: in-process analytical database, OLAP queries, and high-performance analytics.

25 min read · Intermediate

What is DuckDB?

DuckDB is an in-process OLAP database designed for analytical query workloads. Often called the "SQLite for analytics," it combines the simplicity of an embedded database with the performance of columnar storage and vectorized execution, making it a natural fit for data science and analytics workflows.

  • In-Process - no server required; runs embedded in your application
  • Columnar - column-oriented storage optimized for fast analytical queries
  • Zero-Config - works out of the box with minimal setup
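
Because DuckDB runs in-process, a query is just a function call. A minimal sketch in Python (the query itself is illustrative):

import duckdb

# No server and no configuration: the database lives inside the process
print(duckdb.sql("SELECT 42 AS answer").fetchall())  # [(42,)]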

[Interactive widget: DuckDB Performance Calculator - estimates working memory, queries per second, scan speed, storage used, and query time for a chosen dataset size, compression ratio, and user count.]

Getting Started with DuckDB

DuckDB can be used from multiple programming languages and provides both persistent and in-memory database options.

Installation
# Install DuckDB CLI (v0.9.0 shown; substitute the latest release)
wget https://github.com/duckdb/duckdb/releases/download/v0.9.0/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
chmod +x duckdb

# Install Python client
pip install duckdb

# Install Node.js client  
npm install duckdb

# Install R client (run from within an R session)
install.packages("duckdb")

Python Usage

import duckdb
import pandas as pd

# Create in-memory database
con = duckdb.connect()

# Or create persistent database
con = duckdb.connect('analytics.duckdb')

# Query CSV files directly
result = con.execute("""
    SELECT region, SUM(sales) as total_sales
    FROM 'sales_data.csv'
    GROUP BY region
    ORDER BY total_sales DESC
""").fetchall()

# Work with pandas DataFrames - DuckDB can scan a DataFrame
# directly by its Python variable name
df = pd.read_csv('data.csv')
con.execute("CREATE TABLE sales AS SELECT * FROM df")

# Query the table
sales_summary = con.sql("SELECT * FROM sales WHERE amount > 1000")
print(sales_summary.to_df())
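
Relying on variable-name lookup is convenient in scripts; in larger programs it can be clearer to register the DataFrame under an explicit view name and to pass filter values as query parameters. A short sketch (all names and values are illustrative):

import duckdb
import pandas as pd

con = duckdb.connect()
df = pd.DataFrame({"region": ["US", "EU"], "amount": [1200, 800]})

# Register the DataFrame under an explicit view name
con.register("sales_df", df)

# '?' placeholders keep user input out of the SQL string
rows = con.execute(
    "SELECT region, amount FROM sales_df WHERE amount > ?", [1000]
).fetchall()
print(rows)  # [('US', 1200)]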

CLI Usage

# Start the DuckDB CLI on a persistent database file (shell command)
./duckdb analytics.duckdb

-- Create table and load data
CREATE TABLE sales AS 
SELECT * FROM 'sales_data.csv';

-- Analytical queries
SELECT 
    DATE_TRUNC('month', order_date) as month,
    SUM(amount) as monthly_sales,
    COUNT(*) as order_count
FROM sales
GROUP BY month
ORDER BY month;

-- Query Parquet files directly
SELECT * FROM 'large_dataset.parquet'
WHERE created_at >= '2023-01-01'
LIMIT 100;
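
The same SQL works through every client API, so results prepared in the CLI can be exported to columnar files for downstream tools. A sketch from Python, assuming the sales table created above (the output file name is illustrative):

import duckdb

con = duckdb.connect('analytics.duckdb')

# COPY streams a result set straight into a Parquet file
con.execute("""
    COPY (SELECT * FROM sales WHERE amount > 1000)
    TO 'large_orders.parquet' (FORMAT PARQUET)
""")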

Real-World Examples

Data Science Workflows

Data scientists use DuckDB for fast exploratory data analysis and prototyping.

  • Often an order of magnitude faster than equivalent pandas operations on large datasets
  • Direct querying of CSV/Parquet files without an ETL step
  • Seamless integration with Jupyter notebooks and the Python ecosystem
  • Memory-efficient processing of datasets larger than RAM (sketched below)
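
As a sketch of the larger-than-RAM pattern: an aggregation over a Parquet file streams through the data in vectorized chunks, reading only the referenced columns, so the working set stays far below the file size (the file name is illustrative):

import duckdb

con = duckdb.connect()

# The scan streams over the file; only two columns are read, and the
# per-group aggregation state fits comfortably in memory
summary = con.sql("""
    SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM 'events_larger_than_ram.parquet'
    GROUP BY region
""").to_df()
print(summary)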

Business Intelligence & Reporting

Small to medium businesses leverage DuckDB for self-service analytics and reporting.

  • Sub-second response times on million-row datasets (see the sketch after this list)
  • Zero-maintenance embedded deployment
  • Direct connection from BI tools like Tableau, Power BI
  • SQL compatibility with existing reporting workflows
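
One way to sanity-check the million-row claim is to generate synthetic data with the range() table function and time an aggregation; a sketch (the schema and values are made up):

import duckdb
import time

con = duckdb.connect()

# One million synthetic orders; range() yields a column named "range"
con.execute("""
    CREATE TABLE orders AS
    SELECT range AS id,
           range % 50 AS store_id,
           (range % 997) * 1.37 AS amount
    FROM range(1000000)
""")

start = time.time()
con.execute("SELECT store_id, SUM(amount) FROM orders GROUP BY store_id").fetchall()
print(f"aggregated 1M rows in {time.time() - start:.3f}s")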

Edge Analytics & IoT

DuckDB enables real-time analytics at the edge with minimal resource requirements.

  • Memory footprint on the order of 50 MB for basic operations
  • Battery-efficient processing on mobile devices
  • Offline-capable analytics without network dependency
  • Real-time aggregation of sensor data streams (sketched below)
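
A minimal sketch of the edge pattern, assuming a hypothetical readings table stored in a local database file:

import duckdb

con = duckdb.connect("sensors.duckdb")
con.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id INTEGER, ts TIMESTAMP, value DOUBLE
    )
""")

# Append a batch of readings, then aggregate locally - no network required
con.execute("INSERT INTO readings VALUES (1, now(), 21.7), (2, now(), 19.4)")
print(con.sql("""
    SELECT sensor_id, AVG(value) AS avg_value
    FROM readings
    GROUP BY sensor_id
""").to_df())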

Advanced Features

Window Functions & Analytics

-- Advanced analytical queries
SELECT 
    product_id,
    sales,
    ROW_NUMBER() OVER (ORDER BY sales DESC) as sales_rank,
    LAG(sales) OVER (PARTITION BY product_id ORDER BY date) as prev_sales,
    SUM(sales) OVER (
        PARTITION BY product_id
        ORDER BY date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as rolling_7day_sum
FROM daily_sales
WHERE date >= '2023-01-01';

-- Percentiles and statistical functions
SELECT 
    region,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY revenue) as median_revenue,
    STDDEV(revenue) as revenue_stddev
FROM sales_data
GROUP BY region;

File Format Support

-- Query various file formats directly
-- CSV files
SELECT * FROM 'data/*.csv';

-- Parquet files with partitioning
SELECT * FROM 'data/year=2023/month=*/sales.parquet'
WHERE region = 'US';

-- JSON files (top-level keys become columns; json_extract reads nested fields)
SELECT json_extract(data, '$.user_id') as user_id
FROM 'events.json';

-- Excel files (requires the excel extension in recent DuckDB versions)
SELECT * FROM 'financial_report.xlsx';

-- Remote files (require the httpfs extension; see the sketch below)
SELECT * FROM 'https://example.com/data.csv';
SELECT * FROM 's3://mybucket/data.parquet';
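
Remote access goes through the httpfs extension, which DuckDB downloads once and caches. A sketch from Python (the URL is illustrative, and s3:// paths additionally need credentials configured):

import duckdb

con = duckdb.connect()

# Install and load the extension that adds http(s):// and s3:// support
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

count = con.execute(
    "SELECT COUNT(*) FROM 'https://example.com/data.csv'"
).fetchone()
print(count)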

Best Practices

✅ Do

  • Use columnar file formats (Parquet) for better performance
  • Partition large datasets by commonly filtered columns
  • Use appropriate data types to minimize storage
  • Query only needed columns to reduce I/O
  • Use persistent connections for multiple queries (see the sketch after this list)
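
A sketch combining two of the items above, reusing one connection and selecting only the needed columns (the file name is illustrative):

import duckdb

# One connection reused across queries avoids repeated startup cost
con = duckdb.connect("analytics.duckdb")

# Listing only the needed columns lets the Parquet reader skip the rest
# of the file entirely (projection pushdown)
top_regions = con.sql("""
    SELECT region, SUM(amount) AS total_sales
    FROM 'sales.parquet'
    GROUP BY region
    ORDER BY total_sales DESC
    LIMIT 5
""").to_df()
print(top_regions)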

❌ Don't

  • Use DuckDB for high-concurrency OLTP workloads
  • Store frequently updated data in DuckDB
  • Query uncompressed CSV files for large datasets
  • Use SELECT * on wide tables unnecessarily
  • Expect full database server features (users, permissions)