etcd: Distributed Key-Value Store

Master etcd for distributed coordination, service discovery, and consistent configuration management

What is etcd?

etcd is a distributed, reliable key-value store for the most critical data of a distributed system. It provides a reliable way to store data that needs to be accessed by a distributed system or cluster of machines. It gracefully handles leader elections during network partitions and can tolerate machine failure, even in the leader node.

Core Features

  • Distributed key-value store
  • Watch for real-time updates
  • Distributed locking
  • Leader election
  • ACID transactions

Key Guarantees

  • Strong consistency (Raft)
  • High availability
  • Durability
  • Atomic operations
  • Sequential consistency

Core Features

Distributed Key-Value Store

Strongly consistent, distributed key-value store with MVCC and transactional support

Use Case: Configuration management, service discovery, distributed coordination, feature flags

Implementation Examples

# Basic key-value operations
# Set a key-value pair
etcdctl put /config/database/host "db.example.com"
etcdctl put /config/database/port "5432"
etcdctl put /config/database/name "myapp"

# Get a single key
etcdctl get /config/database/host

# Get all keys with a prefix
etcdctl get /config/database/ --prefix

# Print only the values, without the keys
etcdctl get /config/database/ --prefix --print-value-only

# Print keys, values, and revision metadata as JSON (keys and values are base64-encoded)
etcdctl get /config/database/ --prefix -w json

# Watch for changes
etcdctl watch /config/database/ --prefix

# Delete keys
etcdctl del /config/database/host
etcdctl del /config/database/ --prefix

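# Atomic operations with transactions
# Input is three sections separated by blank lines: compares, success ops, failure ops.
# The input must not start with a blank line, or the compare section is read as empty.
# mod(key) = "0" holds only while the key does not exist.
etcdctl txn <<< 'mod("/config/feature/enabled") = "0"

put /config/feature/enabled "true"
put /config/feature/rollout "10"

get /config/feature/enabled
'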
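
The transaction above is a guarded write: the puts apply only if the compare holds, and they apply together. To make that behaviour concrete, here is a toy in-memory model in Go. It is only a sketch: `toyStore` and `putIfAbsent` are hypothetical names, the store is single-process, and nothing here reflects how etcd is implemented internally.

```go
package main

import "fmt"

// entry holds the two pieces of per-key state this sketch needs.
type entry struct {
	value       string
	modRevision int64 // revision of the last write; 0 means "never written"
}

// toyStore is a hypothetical single-process stand-in for a key space.
type toyStore struct {
	revision int64
	keys     map[string]entry
}

func newToyStore() *toyStore {
	return &toyStore{keys: make(map[string]entry)}
}

// putIfAbsent models "if mod(guardKey) == 0 then apply all writes".
// Every write in one transaction shares a single new revision.
func (s *toyStore) putIfAbsent(guardKey string, writes map[string]string) bool {
	if s.keys[guardKey].modRevision != 0 {
		return false // compare failed: the guard key already exists
	}
	s.revision++
	for k, v := range writes {
		s.keys[k] = entry{value: v, modRevision: s.revision}
	}
	return true
}

func main() {
	s := newToyStore()
	writes := map[string]string{
		"/config/feature/enabled": "true",
		"/config/feature/rollout": "10",
	}
	fmt.Println(s.putIfAbsent("/config/feature/enabled", writes)) // true: key was absent
	fmt.Println(s.putIfAbsent("/config/feature/enabled", writes)) // false: key now exists
}
```

Running the same guarded write twice shows why this pattern suits one-time initialization: the second attempt sees a non-zero modification revision and is rejected.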

Key Benefits

Strong consistency (linearizable)
ACID transactions
Multi-version concurrency control
Efficient range queries
Built-in watching capabilities
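
A prefix query such as `etcdctl get /config/database/ --prefix` is a range query over the half-open interval from the prefix up to the first key that no longer starts with it. The sketch below shows one way to compute that upper bound. `prefixRangeEnd` is a name chosen for this example, and the `"\x00"` return value assumes the etcd API convention that a range end of `\0` means "no upper bound".

```go
package main

import "fmt"

// prefixRangeEnd returns the smallest key that sorts after every key
// starting with prefix, so [prefix, end) covers exactly the prefixed keys.
func prefixRangeEnd(prefix string) string {
	end := []byte(prefix)
	// Walk backwards to the last byte that can be incremented,
	// increment it, and drop everything after it.
	for i := len(end) - 1; i >= 0; i-- {
		if end[i] < 0xff {
			end[i]++
			return string(end[:i+1])
		}
	}
	// Empty or all-0xff prefix: no finite upper bound exists.
	return "\x00"
}

func main() {
	// '/' is 0x2f, so incrementing it yields '0' (0x30).
	fmt.Printf("%q\n", prefixRangeEnd("/config/database/")) // "/config/database0"
}
```

Because keys are stored in sorted order, a prefix read can scan one contiguous range rather than filtering the whole key space.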

Implementation Patterns

Service Discovery Pattern

Dynamic service registry with health checking and load balancing

Implementation: Services register themselves and discover others through etcd

Complete Implementation

// Service registration and discovery
package main

import (
    "context"
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "time"

    // Module path for etcd v3.5+; older releases used "go.etcd.io/etcd/clientv3"
    clientv3 "go.etcd.io/etcd/client/v3"
)

type ServiceInfo struct {
    Name         string            `json:"name"`
    Address      string            `json:"address"`
    Port         int               `json:"port"`
    Health       string            `json:"health"`
    Metadata     map[string]string `json:"metadata"`
    RegisteredAt time.Time         `json:"registered_at"`
}

type ServiceRegistry struct {
    client    *clientv3.Client
    leaseID   clientv3.LeaseID
    keepAlive <-chan *clientv3.LeaseKeepAliveResponse
}

func NewServiceRegistry(endpoints []string) (*ServiceRegistry, error) {
    client, err := clientv3.New(clientv3.Config{
        Endpoints:   endpoints,
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        return nil, err
    }

    return &ServiceRegistry{client: client}, nil
}

func (sr *ServiceRegistry) Register(service ServiceInfo) error {
    // Create lease with 30-second TTL
    resp, err := sr.client.Grant(context.Background(), 30)
    if err != nil {
        return err
    }
    sr.leaseID = resp.ID

    // Keep the lease alive
    sr.keepAlive, err = sr.client.KeepAlive(context.Background(), sr.leaseID)
    if err != nil {
        return err
    }

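    // Drain keep-alive responses; the channel closes when the lease can no longer be renewed
    go func() {
        for ka := range sr.keepAlive {
            if ka != nil {
                log.Printf("Lease renewed: %x", ka.ID)
            }
        }
        log.Printf("Keep-alive channel closed for lease %x; the registration will expire", sr.leaseID)
    }()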

    // Serialize service info
    serviceData, err := json.Marshal(service)
    if err != nil {
        return err
    }

    // Register service with lease
    key := fmt.Sprintf("/services/%s/%s:%d", service.Name, service.Address, service.Port)
    _, err = sr.client.Put(context.Background(), key, string(serviceData), 
        clientv3.WithLease(sr.leaseID))

    return err
}

func (sr *ServiceRegistry) Discover(serviceName string) ([]ServiceInfo, error) {
    resp, err := sr.client.Get(context.Background(), 
        fmt.Sprintf("/services/%s/", serviceName), 
        clientv3.WithPrefix())
    if err != nil {
        return nil, err
    }

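    var services []ServiceInfo
    for _, kv := range resp.Kvs {
        var service ServiceInfo
        if err := json.Unmarshal(kv.Value, &service); err != nil {
            // Skip malformed entries, but leave a trace for debugging
            log.Printf("Skipping malformed entry %s: %v", kv.Key, err)
            continue
        }
        services = append(services, service)
    }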

    return services, nil
}

func (sr *ServiceRegistry) Watch(serviceName string, callback func([]ServiceInfo)) {
    watchChan := sr.client.Watch(context.Background(), 
        fmt.Sprintf("/services/%s/", serviceName), 
        clientv3.WithPrefix())

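    for watchResp := range watchChan {
        // Surface watch errors (for example, a compacted revision) instead of ignoring them
        if err := watchResp.Err(); err != nil {
            log.Printf("Watch error: %v", err)
            continue
        }
        // Any change under the prefix triggers a full re-read of the service list
        services, err := sr.Discover(serviceName)
        if err != nil {
            log.Printf("Error discovering services: %v", err)
            continue
        }
        callback(services)
    }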
}

// Usage example
func main() {
    registry, err := NewServiceRegistry([]string{"localhost:2379"})
    if err != nil {
        log.Fatal(err)
    }
    defer registry.client.Close()

    // Register this service
    service := ServiceInfo{
        Name:    "api-server",
        Address: "192.168.1.100",
        Port:    8080,
        Health:  "healthy",
        Metadata: map[string]string{
            "version": "1.2.3",
            "region":  "us-east-1",
        },
        RegisteredAt: time.Now(),
    }

    if err := registry.Register(service); err != nil {
        log.Fatal(err)
    }

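    // Start health check endpoint; log the error if the listener fails
    http.HandleFunc("/health", healthCheckHandler)
    go func() {
        if err := http.ListenAndServe(":8080", nil); err != nil {
            log.Printf("Health endpoint stopped: %v", err)
        }
    }()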

    // Watch for other API servers
    go registry.Watch("api-server", func(services []ServiceInfo) {
        log.Printf("Available API servers: %d", len(services))
        for _, svc := range services {
            log.Printf("  - %s:%d (health: %s)", svc.Address, svc.Port, svc.Health)
        }
        updateLoadBalancer(services)
    })

    // Keep running
    select {}
}

func healthCheckHandler(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("OK"))
}

func updateLoadBalancer(services []ServiceInfo) {
    // Update load balancer configuration
    // Filter only healthy services
    healthyServices := make([]ServiceInfo, 0)
    for _, svc := range services {
        if svc.Health == "healthy" {
            healthyServices = append(healthyServices, svc)
        }
    }
    // Update routing rules...
}

Key Considerations

  • Service health checking integration
  • Lease TTL configuration
  • Network partition handling
  • Service metadata management
  • Load balancer integration

etcd vs Alternatives

Feature                | etcd        | Consul     | ZooKeeper        | Redis
Consensus Algorithm    | Raft        | Raft       | Zab (Paxos-like) | Master-Replica
HTTP/gRPC API          | ✓ Both      | ✓ HTTP     | ✗ Custom         | ○ RESP
Watch/Subscribe        | ✓ Efficient | ✓ Built-in | ○ Limited        | ✓ Pub/Sub
Multi-Version          | ✓ MVCC      | ✗ No       | ✗ No             | ✗ No
Performance            | High        | Medium     | Medium           | Very High
Operational Complexity | Low         | Medium     | High             | Low

Best Practices

Cluster Setup

  • Use an odd number of nodes (3, 5, or 7)
  • Use a separate network for peer communication
  • Use dedicated storage for the data directory
  • Enable TLS for client and peer traffic
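
The advice to use an odd number of nodes follows from majority-quorum arithmetic: a cluster of n members needs n/2 + 1 (integer division) of them to agree, so adding a fourth node to a three-node cluster raises the quorum without raising the number of failures it can survive. A short Go sketch of that calculation, with function names chosen for this example:

```go
package main

import "fmt"

// quorum is the number of members that must agree for a write to commit.
func quorum(members int) int {
	return members/2 + 1
}

// faultTolerance is how many members can fail while a quorum remains.
func faultTolerance(members int) int {
	return members - quorum(members)
}

func main() {
	for n := 1; n <= 7; n++ {
		fmt.Printf("members=%d quorum=%d tolerates=%d\n", n, quorum(n), faultTolerance(n))
	}
}
```

Three and four members both tolerate one failure, and five and six both tolerate two, which is why even sizes add cost without adding resilience.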

Performance

  • Use SSD storage for WAL and data
  • Tune heartbeat and election timeouts
  • Regular compaction and defragmentation
  • Monitor memory and disk usage

Application Design

  • Use appropriate key hierarchy
  • Implement proper lease management
  • Handle watch reconnections gracefully
  • Use transactions for atomic operations

Operations

  • Automated backup and restore
  • Monitor cluster health continuously
  • Plan for rolling upgrades
  • Set up proper alerting
