etcd: Distributed Key-Value Store
Master etcd for distributed coordination, service discovery, and consistent configuration management
What is etcd?
etcd is a distributed, reliable key-value store for the most critical data of a distributed system. It provides a consistent way to store data that needs to be accessed by a cluster of machines, gracefully handles leader elections during network partitions, and tolerates machine failures, including failure of the leader node.
Core Features
- Distributed key-value store
- Watch for real-time updates
- Distributed locking
- Leader election (locking and election are both sketched below)
- ACID transactions
Key Guarantees
- Strong consistency (Raft)
- High availability
- Durability
- Atomic operations
- Sequential consistency
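Two of the features above, distributed locking and leader election, are usually built with the official Go client's concurrency helpers rather than raw key operations. A minimal sketch using go.etcd.io/etcd/client/v3/concurrency; the lock prefix /locks/jobs, election prefix /elections/scheduler, and the node name are illustrative:

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session holds a lease; locks and elections bound to it are released
	// automatically if this client dies and the lease expires.
	session, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Distributed lock: only one holder of "/locks/jobs" at a time.
	mutex := concurrency.NewMutex(session, "/locks/jobs")
	if err := mutex.Lock(context.Background()); err != nil {
		log.Fatal(err)
	}
	log.Println("lock acquired, doing critical work")
	if err := mutex.Unlock(context.Background()); err != nil {
		log.Fatal(err)
	}

	// Leader election: Campaign blocks until this node becomes leader.
	election := concurrency.NewElection(session, "/elections/scheduler")
	if err := election.Campaign(context.Background(), "node-1"); err != nil {
		log.Fatal(err)
	}
	log.Println("elected leader")
}

Both primitives ride on the session lease, so a crashed lock holder or leader is cleaned up automatically once its lease expires.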
Core Features
Distributed Key-Value Store
Strongly consistent, distributed key-value store with MVCC and transactional support
Use Case: Configuration management, service discovery, distributed coordination, feature flags
Implementation Examples
# Basic key-value operations
# Set a key-value pair
etcdctl put /config/database/host "db.example.com"
etcdctl put /config/database/port "5432"
etcdctl put /config/database/name "myapp"
# Get a single key
etcdctl get /config/database/host
# Get all keys with a prefix
etcdctl get /config/database/ --prefix
# Get only the values for keys under a prefix
etcdctl get /config/database/ --prefix --print-value-only

# Get keys with values and full metadata as JSON
etcdctl get /config/database/ --prefix -w json
# Watch for changes
etcdctl watch /config/database/ --prefix
# Delete keys
etcdctl del /config/database/host
etcdctl del /config/database/ --prefix
# Atomic operations with transactions
# Non-interactive txn format: compare lines, a blank line, success requests,
# a blank line, then failure requests. Here: if the flag has never been
# written (mod revision 0), create both keys; otherwise read the current value.
etcdctl txn <<<'mod("/config/feature/enabled") = "0"

put /config/feature/enabled "true"
put /config/feature/rollout "10"

get /config/feature/enabled

'
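The same guarded write can be issued programmatically with the Go client's transaction API. A minimal sketch, assuming a connected *clientv3.Client named cli and the usual context import; the helper name setFeatureFlagOnce is illustrative:

// setFeatureFlagOnce enables the feature flags only if they have never been
// written (mod revision 0); otherwise it leaves them untouched and reports
// whether the guarded write happened.
func setFeatureFlagOnce(ctx context.Context, cli *clientv3.Client) (bool, error) {
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.ModRevision("/config/feature/enabled"), "=", 0)).
		Then(
			clientv3.OpPut("/config/feature/enabled", "true"),
			clientv3.OpPut("/config/feature/rollout", "10"),
		).
		Else(clientv3.OpGet("/config/feature/enabled")).
		Commit()
	if err != nil {
		return false, err
	}
	return resp.Succeeded, nil
}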
Implementation Patterns
Service Discovery Pattern
Dynamic service registry with health checking and load balancing
Implementation: Services register themselves and discover others through etcd
Complete Implementation
// Service registration and discovery
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

type ServiceInfo struct {
	Name         string            `json:"name"`
	Address      string            `json:"address"`
	Port         int               `json:"port"`
	Health       string            `json:"health"`
	Metadata     map[string]string `json:"metadata"`
	RegisteredAt time.Time         `json:"registered_at"`
}

type ServiceRegistry struct {
	client    *clientv3.Client
	leaseID   clientv3.LeaseID
	keepAlive <-chan *clientv3.LeaseKeepAliveResponse
}

func NewServiceRegistry(endpoints []string) (*ServiceRegistry, error) {
	client, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return nil, err
	}
	return &ServiceRegistry{client: client}, nil
}

func (sr *ServiceRegistry) Register(service ServiceInfo) error {
	// Create a lease with a 30-second TTL
	resp, err := sr.client.Grant(context.Background(), 30)
	if err != nil {
		return err
	}
	sr.leaseID = resp.ID

	// Keep the lease alive for as long as this process runs
	sr.keepAlive, err = sr.client.KeepAlive(context.Background(), sr.leaseID)
	if err != nil {
		return err
	}

	// Drain keep-alive responses; the channel closes if the lease expires
	go func() {
		for ka := range sr.keepAlive {
			if ka != nil {
				log.Printf("Lease renewed: %x", ka.ID)
			}
		}
		log.Printf("Keep-alive channel closed; registration may have expired")
	}()

	// Serialize service info
	serviceData, err := json.Marshal(service)
	if err != nil {
		return err
	}

	// Register the service under a key bound to the lease
	key := fmt.Sprintf("/services/%s/%s:%d", service.Name, service.Address, service.Port)
	_, err = sr.client.Put(context.Background(), key, string(serviceData),
		clientv3.WithLease(sr.leaseID))
	return err
}

func (sr *ServiceRegistry) Discover(serviceName string) ([]ServiceInfo, error) {
	resp, err := sr.client.Get(context.Background(),
		fmt.Sprintf("/services/%s/", serviceName),
		clientv3.WithPrefix())
	if err != nil {
		return nil, err
	}

	var services []ServiceInfo
	for _, kv := range resp.Kvs {
		var service ServiceInfo
		if err := json.Unmarshal(kv.Value, &service); err != nil {
			continue
		}
		services = append(services, service)
	}
	return services, nil
}

func (sr *ServiceRegistry) Watch(serviceName string, callback func([]ServiceInfo)) {
	watchChan := sr.client.Watch(context.Background(),
		fmt.Sprintf("/services/%s/", serviceName),
		clientv3.WithPrefix())

	// Re-list the service set on every change event
	for range watchChan {
		services, err := sr.Discover(serviceName)
		if err != nil {
			log.Printf("Error discovering services: %v", err)
			continue
		}
		callback(services)
	}
}

// Usage example
func main() {
	registry, err := NewServiceRegistry([]string{"localhost:2379"})
	if err != nil {
		log.Fatal(err)
	}
	defer registry.client.Close()

	// Register this service
	service := ServiceInfo{
		Name:    "api-server",
		Address: "192.168.1.100",
		Port:    8080,
		Health:  "healthy",
		Metadata: map[string]string{
			"version": "1.2.3",
			"region":  "us-east-1",
		},
		RegisteredAt: time.Now(),
	}
	if err := registry.Register(service); err != nil {
		log.Fatal(err)
	}

	// Start health check endpoint
	http.HandleFunc("/health", healthCheckHandler)
	go http.ListenAndServe(":8080", nil)

	// Watch for other API servers
	go registry.Watch("api-server", func(services []ServiceInfo) {
		log.Printf("Available API servers: %d", len(services))
		for _, svc := range services {
			log.Printf("  - %s:%d (health: %s)", svc.Address, svc.Port, svc.Health)
		}
		updateLoadBalancer(services)
	})

	// Keep running
	select {}
}

func healthCheckHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("OK"))
}

func updateLoadBalancer(services []ServiceInfo) {
	// Keep only healthy services, then push them to the load balancer
	healthyServices := make([]ServiceInfo, 0)
	for _, svc := range services {
		if svc.Health == "healthy" {
			healthyServices = append(healthyServices, svc)
		}
	}
	// Update routing rules with healthyServices...
}
Key Considerations
- Service health checking integration
- Lease TTL configuration
- Network partition handling (see the re-registration sketch below)
- Service metadata management
- Load balancer integration
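For network partition handling, one workable approach is to periodically confirm the lease is still alive and re-register if it has expired. A sketch under that assumption, building on the ServiceRegistry type above; the polling interval and the helper name ensureRegistered are illustrative:

// ensureRegistered polls the registration lease and re-registers the service
// if the lease has expired, e.g. after a partition prevented keep-alives.
func (sr *ServiceRegistry) ensureRegistered(ctx context.Context, service ServiceInfo) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			resp, err := sr.client.TimeToLive(ctx, sr.leaseID)
			if err != nil || resp.TTL <= 0 {
				// Lease gone (or etcd unreachable): try to register again.
				if err := sr.Register(service); err != nil {
					log.Printf("re-register failed: %v", err)
				}
			}
		}
	}
}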
etcd vs Alternatives
| Feature | etcd | Consul | ZooKeeper | Redis |
|---|---|---|---|---|
| Consensus Algorithm | Raft | Raft | Zab (Paxos-like) | Master-Replica |
| HTTP/gRPC API | ✓ Both | ✓ HTTP | ✗ Custom | ○ RESP |
| Watch/Subscribe | ✓ Efficient | ✓ Built-in | ○ Limited | ✓ Pub/Sub |
| Multi-Version (MVCC) | ✓ Yes | ✗ No | ✗ No | ✗ No |
| Performance | High | Medium | Medium | Very High |
| Operational Complexity | Low | Medium | High | Low |
Best Practices
Cluster Setup
- Use an odd number of nodes (3, 5, 7)
- Use a separate network for peer communication
- Use dedicated storage for the data directory
- Enable TLS for security (client-side TLS is sketched below)
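On the client side, enabling TLS mostly means handing a *tls.Config to clientv3.Config. A minimal sketch; the certificate paths and endpoint names are placeholders:

package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"os"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func newTLSClient() (*clientv3.Client, error) {
	// Load the client certificate and the CA that signed the etcd server certs.
	cert, err := tls.LoadX509KeyPair("/etc/etcd/client.crt", "/etc/etcd/client.key")
	if err != nil {
		return nil, err
	}
	caData, err := os.ReadFile("/etc/etcd/ca.crt")
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caData)

	return clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-1:2379", "https://etcd-2:2379", "https://etcd-3:2379"},
		DialTimeout: 5 * time.Second,
		TLS: &tls.Config{
			Certificates: []tls.Certificate{cert},
			RootCAs:      pool,
		},
	})
}

func main() {
	cli, err := newTLSClient()
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
}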
Performance
- Use SSD storage for the WAL and data directory
- Tune heartbeat and election timeouts
- Run regular compaction and defragmentation (see the sketch after this list)
- Monitor memory and disk usage
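Compaction and defragmentation can be automated through the Go client's KV and maintenance APIs. A hedged sketch, assuming a connected *clientv3.Client named cli; the probe key and helper name are illustrative:

// compactAndDefrag compacts key history up to the current revision and then
// defragments each member to reclaim the disk space freed by compaction.
func compactAndDefrag(ctx context.Context, cli *clientv3.Client, endpoints []string) error {
	// Read any key to learn the cluster's current revision from the response header.
	resp, err := cli.Get(ctx, "compaction-probe")
	if err != nil {
		return err
	}
	rev := resp.Header.Revision

	// Drop all key history older than the current revision.
	if _, err := cli.Compact(ctx, rev); err != nil {
		return err
	}

	// Defragment each member's backend database, one endpoint at a time.
	for _, ep := range endpoints {
		if _, err := cli.Defragment(ctx, ep); err != nil {
			return err
		}
	}
	return nil
}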
Application Design
- Use an appropriate key hierarchy
- Implement proper lease management
- Handle watch reconnections gracefully (see the sketch after this list)
- Use transactions for atomic operations
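Handling watch reconnections gracefully generally means remembering the last revision you processed and resuming the watch from the next revision when the channel closes. A sketch along those lines; the helper name watchWithResume is illustrative, and a production version would also handle the case where the resume revision has already been compacted:

// watchWithResume restarts the watch from just after the last delivered
// revision whenever the server cancels it or the channel closes.
func watchWithResume(ctx context.Context, cli *clientv3.Client, prefix string) {
	var lastRev int64
	for {
		opts := []clientv3.OpOption{clientv3.WithPrefix()}
		if lastRev > 0 {
			// Resume just after the last event we have already seen.
			opts = append(opts, clientv3.WithRev(lastRev+1))
		}
		wch := cli.Watch(clientv3.WithRequireLeader(ctx), prefix, opts...)
		for wresp := range wch {
			if err := wresp.Err(); err != nil {
				log.Printf("watch error: %v", err)
				break
			}
			for _, ev := range wresp.Events {
				lastRev = ev.Kv.ModRevision
				log.Printf("%s %s = %s", ev.Type, ev.Kv.Key, ev.Kv.Value)
			}
		}
		if ctx.Err() != nil {
			return
		}
		time.Sleep(time.Second) // brief backoff before re-establishing the watch
	}
}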
Operations
- Automate backup and restore (snapshot sketch below)
- Monitor cluster health continuously
- Plan for rolling upgrades
- Set up proper alerting
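Backups and basic health monitoring can also be scripted with the maintenance API. A minimal sketch that streams a snapshot to disk and logs per-endpoint status, assuming a connected *clientv3.Client named cli and the context, io, os, and log imports; the output path is a placeholder:

// saveSnapshot streams a point-in-time snapshot of the etcd keyspace to path,
// suitable for later restore.
func saveSnapshot(ctx context.Context, cli *clientv3.Client, path string) error {
	rc, err := cli.Snapshot(ctx)
	if err != nil {
		return err
	}
	defer rc.Close()

	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(f, rc)
	return err
}

// checkHealth logs basic status (leader, DB size, raft term) for each endpoint.
func checkHealth(ctx context.Context, cli *clientv3.Client, endpoints []string) {
	for _, ep := range endpoints {
		st, err := cli.Status(ctx, ep)
		if err != nil {
			log.Printf("%s: unreachable: %v", ep, err)
			continue
		}
		log.Printf("%s: leader=%x dbSize=%d raftTerm=%d", ep, st.Leader, st.DbSize, st.RaftTerm)
	}
}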