What is Riak?
Riak is a distributed key-value database designed for high availability, fault tolerance, and operational simplicity. Built on the principles of Amazon's Dynamo paper, it provides tunable consistency levels and automatic data distribution across nodes.
Key Features
- Masterless architecture with no single point of failure
- Tunable CAP trade-offs via N, R, and W values
- Automatic data distribution via consistent hashing (see the sketch after this list)
- Built-in conflict detection with vector clocks (resolution is left to the application)
- Multi-datacenter replication
- MapReduce for distributed computation
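To make the consistent-hashing point concrete, here is a minimal illustrative sketch of how a bucket/key pair lands on a hash ring and on a preference list of N partitions. Riak does hash keys with SHA-1 onto a 2^160 ring divided into partitions, but the partition count and helper names below are simplified assumptions, not the client API.

import hashlib

RING_SIZE = 2 ** 160        # Riak's ring spans the SHA-1 hash space
NUM_PARTITIONS = 64         # simplified; matches the default ring_creation_size
N_VAL = 3                   # replicas per key

def partition_for(bucket: bytes, key: bytes) -> int:
    # Hash the bucket/key pair onto the ring, then find its partition.
    h = int.from_bytes(hashlib.sha1(bucket + key).digest(), 'big')
    return h // (RING_SIZE // NUM_PARTITIONS)

def preference_list(bucket: bytes, key: bytes) -> list[int]:
    # Replicas live on the next N consecutive partitions around the ring.
    first = partition_for(bucket, key)
    return [(first + i) % NUM_PARTITIONS for i in range(N_VAL)]

print(preference_list(b'users', b'user123'))  # e.g. [41, 42, 43]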
CAP Theorem Configuration
Riak tunes the consistency/availability trade-off with three values, set per bucket or per request: N (number of replicas), R (replicas that must answer a read), and W (replicas that must acknowledge a write). A typical starting point is N=3, R=2, W=2; the sketch after the configurations below shows the quorum arithmetic.
High Availability
R=1, W=1 (fast, but only eventually consistent)
- Fastest operations
- Tolerates the most node failures
- Risk of stale reads
Strong Consistency
R + W > N (read and write quorums always overlap)
- Always-consistent reads
- Higher latency
- Less fault tolerant
Balanced
N=3, R=2, W=2 (a common production choice)
- Good consistency
- Reasonable performance
- Tolerates one node failure
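A quick worked check of the arithmetic behind these presets: when R + W > N, any read quorum must overlap any write quorum, so a read always sees at least one replica that took the latest write; meanwhile writes survive N - W replica failures and reads survive N - R. The helper below is purely illustrative and not part of any Riak client.

def quorum_check(n, r, w):
    # R + W > N guarantees the read set intersects the write set.
    overlap = r + w > n
    print(f"N={n} R={r} W={w}: "
          f"{'quorum-consistent reads' if overlap else 'eventual consistency'}; "
          f"writes tolerate {n - w} down replica(s), reads tolerate {n - r}")

quorum_check(3, 2, 2)  # balanced: consistent, survives one failure
quorum_check(3, 1, 1)  # high availability: fast, may serve stale reads
quorum_check(3, 3, 3)  # strict: consistent, but any failure blocks operations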
Real-World Examples
EA Sports
EA uses Riak for game statistics and player data across multiple game franchises.
- 50M+ player profiles globally distributed
- 99.99% uptime requirement for live games
- Multi-datacenter replication for low latency
Comcast
Comcast uses Riak for customer data and service provisioning across their network infrastructure.
- 30M+ customer records
- Geographic data distribution
- Integration with legacy billing systems
NHS (UK Healthcare)
NHS uses Riak for patient data storage requiring high availability and data sovereignty.
- 65M+ patient records
- Strict data locality requirements
- 24/7 availability for emergency services
Basic Operations
import riak
# Connect to Riak cluster
client = riak.RiakClient(host='127.0.0.1', pb_port=8087)
# Create a bucket with custom properties
bucket = client.bucket('users')
bucket.set_properties({
    'n_val': 3,  # number of replicas
    'r': 2,      # read quorum
    'w': 2,      # write quorum
    'pr': 1,     # primary read quorum
    'pw': 1      # primary write quorum
})
# Store an object
user_data = {
    'name': 'John Doe',
    'email': 'john@example.com',
    'created': '2024-01-15T10:00:00Z'
}
user_obj = bucket.new('user123', data=user_data)
user_obj.store()
# Retrieve an object
retrieved_user = bucket.get('user123')
print(retrieved_user.data['name']) # 'John Doe'
# Update an existing object; store() sends the fetched vector clock back
# so Riak can track causality (see the sibling-resolution sketch below)
retrieved_user.data['last_login'] = '2024-01-16T14:30:00Z'
retrieved_user.store()
# Search using Secondary Indexes (2i); requires the leveldb or memory backend
bucket.new('user124', data={
    'name': 'Jane Smith',
    'email': 'jane@example.com',
    'department': 'engineering'
}).add_index('department_bin', 'engineering').store()
# Query by index
engineering_users = bucket.get_index('department_bin', 'engineering')
for key in engineering_users:
    user = bucket.get(key)
    print(f"User: {user.data['name']}")
# MapReduce example (runs across the cluster; too slow for real-time request paths)
mr = client.add('users')
mr.map('function(v) { var data = JSON.parse(v.values[0].data); return [data.department]; }')
mr.reduce('function(values) { return values.sort(); }')
departments = mr.run()
print(departments)
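When allow_mult is enabled on a bucket, concurrent writes are kept as siblings and the application must pick a winner, which is what the "implement application-level conflict resolution" advice below means in practice. A minimal sketch using the same Python client; the newest-wins policy here is just one possible choice (the library ships an equivalent riak.resolver.last_written_resolver):

bucket.set_properties({'allow_mult': True})  # keep conflicting writes as siblings

def newest_wins(riak_object):
    # A resolver must leave exactly one sibling on the object.
    if len(riak_object.siblings) > 1:
        newest = max(riak_object.siblings, key=lambda s: s.last_modified)
        riak_object.siblings = [newest]

bucket.resolver = newest_wins   # applied automatically on get()
obj = bucket.get('user123')     # any siblings are resolved here
obj.store()                     # write the winner back with its vector clock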
Best Practices
✅ Do
- Choose N, R, and W values based on your consistency needs
- Use meaningful bucket names and consistent key naming conventions
- Implement application-level conflict resolution (see the sibling sketch above)
- Use secondary indexes (2i) for simple queries
- Monitor cluster health and ring status regularly (a minimal probe follows this list)
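For the monitoring item above, the client can at least verify node liveness from application code; ring membership and transfer status come from riak-admin member-status and riak-admin ring-status on the nodes themselves. A minimal probe, reusing the client from Basic Operations:

try:
    alive = client.ping()  # round-trips to the connected node
except Exception:
    alive = False
print('node reachable' if alive else 'node unreachable')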
❌ Don't
- Use Riak for complex analytical queries
- Store large objects (>50MB) without chunking (see the sketch after this list)
- Ignore sibling conflicts in your application
- Set R=1, W=1 for data that needs consistent reads
- Use MapReduce for real-time operations
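For the chunking item above, one common application-level pattern is to split a large value into fixed-size pieces and store a manifest under the logical key. The key scheme below is a hypothetical convention, not a Riak feature, again reusing the client and bucket objects from Basic Operations:

CHUNK_SIZE = 1024 * 1024  # 1 MB pieces keep individual Riak objects small

def store_chunked(bucket, key, blob):
    # Write each chunk under its own key, then a manifest listing them.
    chunk_keys = []
    for i in range(0, len(blob), CHUNK_SIZE):
        chunk_key = f"{key}:chunk:{i // CHUNK_SIZE}"
        bucket.new(chunk_key, encoded_data=blob[i:i + CHUNK_SIZE],
                   content_type='application/octet-stream').store()
        chunk_keys.append(chunk_key)
    bucket.new(key, data={'chunks': chunk_keys}).store()

def fetch_chunked(bucket, key):
    # Reassemble the value by following the manifest.
    manifest = bucket.get(key).data
    return b''.join(bucket.get(ck).encoded_data for ck in manifest['chunks'])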