What is DuckDB?
DuckDB is an in-process analytical (OLAP) database built for analytical query workloads. Often called the "SQLite for analytics," it combines the simplicity of an embedded database with columnar storage and vectorized execution, making it well suited for data science and analytics workflows.
In-Process
No server required - runs embedded in your application
Columnar
Optimized column storage for fast analytical queries
Zero-Config
Works out of the box with minimal setup
Getting Started with DuckDB
DuckDB can be used from multiple programming languages and provides both persistent and in-memory database options.
# Install DuckDB CLI
wget https://github.com/duckdb/duckdb/releases/download/v0.9.0/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
chmod +x duckdb
# Install Python client
pip install duckdb
# Install Node.js client
npm install duckdb
# Install R client
install.packages("duckdb")
Python Usage
import duckdb
import pandas as pd
# Create in-memory database
con = duckdb.connect()
# Or create persistent database
con = duckdb.connect('analytics.duckdb')
# Query CSV files directly
result = con.execute("""
SELECT region, SUM(sales) as total_sales
FROM 'sales_data.csv'
GROUP BY region
ORDER BY total_sales DESC
""").fetchall()
# Work with pandas DataFrames
df = pd.read_csv('data.csv')
con.execute("CREATE TABLE sales AS SELECT * FROM df")
# Query the table
sales_summary = con.sql("SELECT * FROM sales WHERE amount > 1000")
print(sales_summary.to_df())
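Results can be written back out as well. A minimal sketch using a COPY statement to export a query to Parquet (the output path is illustrative):
# Export filtered results to a Parquet file (illustrative path)
con.execute("""
    COPY (SELECT * FROM sales WHERE amount > 1000)
    TO 'high_value_sales.parquet' (FORMAT PARQUET)
""")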
CLI Usage
# Start the DuckDB CLI from your shell
./duckdb analytics.duckdb
-- Create table and load data
CREATE TABLE sales AS
SELECT * FROM 'sales_data.csv';
-- Analytical queries
SELECT
DATE_TRUNC('month', order_date) as month,
SUM(amount) as monthly_sales,
COUNT(*) as order_count
FROM sales
GROUP BY month
ORDER BY month;
-- Query Parquet files directly
SELECT * FROM 'large_dataset.parquet'
WHERE created_at >= '2023-01-01'
LIMIT 100;
Real-World Examples
Data Science Workflows
Data scientists use DuckDB for fast exploratory data analysis and prototyping.
- Often roughly 10x faster than equivalent pandas operations on large datasets
- Direct querying of CSV/Parquet files without ETL
- Seamless integration with Jupyter notebooks and the Python ecosystem
- Memory-efficient processing of datasets larger than RAM (see the sketch below)
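A minimal sketch of the larger-than-RAM pattern: cap DuckDB's memory, give it a spill directory, and let an aggregation stream through a Parquet file (the file name, limit, and path are illustrative):
import duckdb

con = duckdb.connect()
# Illustrative values: cap working memory and spill overflow to disk
con.execute("SET memory_limit = '2GB'")
con.execute("SET temp_directory = '/tmp/duckdb_spill'")

# The aggregation streams through the file rather than loading it into RAM
totals = con.sql("""
    SELECT region, SUM(sales) AS total_sales
    FROM 'large_dataset.parquet'
    GROUP BY region
""").to_df()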
Business Intelligence & Reporting
Small to medium businesses leverage DuckDB for self-service analytics and reporting.
- Sub-second response times on million-row datasets
- Zero-maintenance embedded deployment
- Direct connection from BI tools like Tableau and Power BI
- SQL compatibility with existing reporting workflows
Edge Analytics & IoT
DuckDB enables real-time analytics at the edge with minimal resource requirements.
- Roughly 50 MB memory footprint for basic operations
- Battery-efficient processing on mobile devices
- Offline-capable analytics without network dependency
- Real-time aggregation of sensor data streams (see the sketch below)
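A sketch of that last pattern, assuming a hypothetical readings table and on-device database file: micro-batches of sensor values are appended locally, then summarized with a time-bucketed aggregate.
import duckdb
from datetime import datetime

con = duckdb.connect('edge.duckdb')  # hypothetical on-device database file
con.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id INTEGER,
        ts TIMESTAMP,
        value DOUBLE
    )
""")

# Append a micro-batch of readings (values are illustrative)
con.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(1, datetime(2023, 6, 1, 12, 0, 1), 21.4),
     (1, datetime(2023, 6, 1, 12, 0, 2), 21.6)],
)

# Per-minute aggregates, computed locally with no network dependency
summary = con.sql("""
    SELECT sensor_id,
           DATE_TRUNC('minute', ts) AS minute,
           AVG(value) AS avg_value
    FROM readings
    GROUP BY sensor_id, minute
    ORDER BY minute
""").to_df()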
Advanced Features
Window Functions & Analytics
-- Advanced analytical queries
SELECT
product_id,
sales,
ROW_NUMBER() OVER (ORDER BY sales DESC) as rank,
LAG(sales) OVER (ORDER BY date) as prev_sales,
SUM(sales) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as rolling_7day_sum
FROM daily_sales
WHERE date >= '2023-01-01';
-- Percentiles and statistical functions
SELECT
region,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY revenue) as median_revenue,
STDDEV(revenue) as revenue_stddev
FROM sales_data
GROUP BY region;
File Format Support
-- Query various file formats directly
-- CSV files
SELECT * FROM 'data/*.csv';
-- Parquet files with partitioning
SELECT * FROM 'data/year=2023/month=*/sales.parquet'
WHERE region = 'US';
-- JSON files (assuming a JSON column named data)
SELECT json_extract(data, '$.user_id') as user_id
FROM 'events.json';
-- Excel files (requires an extension, e.g. spatial's st_read)
SELECT * FROM st_read('financial_report.xlsx');
-- Remote files (requires the httpfs extension)
SELECT * FROM 'https://example.com/data.csv';
SELECT * FROM 's3://mybucket/data.parquet';
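Remote access depends on the httpfs extension. A minimal sketch of loading it from Python, with placeholder S3 credentials:
import duckdb

con = duckdb.connect()
# httpfs adds HTTP(S) and S3 support; install once, load per session
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Placeholder credentials; supply your own region and keys
con.execute("SET s3_region = 'us-east-1'")
con.execute("SET s3_access_key_id = '...'")
con.execute("SET s3_secret_access_key = '...'")

df = con.sql("SELECT * FROM 's3://mybucket/data.parquet' LIMIT 10").to_df()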
Best Practices
✅ Do
- Use columnar file formats (Parquet) for better performance
- Partition large datasets by commonly filtered columns
- Use appropriate data types to minimize storage
- Query only needed columns to reduce I/O (see the sketch after these lists)
- Use persistent connections for multiple queries
❌ Don't
- Use DuckDB for high-concurrency OLTP workloads
- Store frequently updated data in DuckDB
- Query uncompressed CSV files for large datasets
- Use SELECT * on wide tables unnecessarily
- Expect full database server features (users, permissions)
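A sketch combining several of the "Do" items above: convert CSV to Parquet once, keep a persistent connection, and project only the columns a query needs (file and column names are illustrative):
import duckdb

# Persistent connection, reused across queries
con = duckdb.connect('analytics.duckdb')

# One-time conversion: scan the CSV once, store it as columnar Parquet
con.execute("""
    COPY (SELECT * FROM 'sales_data.csv')
    TO 'sales_data.parquet' (FORMAT PARQUET)
""")

# Select only the needed columns; DuckDB reads just those from the Parquet file
top_regions = con.sql("""
    SELECT region, SUM(amount) AS total
    FROM 'sales_data.parquet'
    GROUP BY region
    ORDER BY total DESC
    LIMIT 5
""").to_df()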