
DVC Data Versioning

Master DVC for data versioning, pipeline automation, and ML experiment tracking. Learn data-centric MLOps with hands-on examples.

40 min read · Intermediate

What is DVC Data Versioning?

DVC (Data Version Control) is an open-source tool that brings Git-like version control to machine learning projects, specifically designed for managing large datasets, ML models, and complex pipelines that Git cannot handle efficiently.

  • Version Control: for data & models
  • Pipeline Management: reproducible workflows
  • Experiment Tracking: track metrics & artifacts

📊 DVC Storage Example

Deduplication and compression determine how much of your raw data actually lands in remote storage, and therefore what it costs. A worked example for a 50.0 GB raw dataset:

  • Raw storage: 50.0 GB
  • Compressed storage: 15.0 GB
  • Storage savings: 35.0 GB
  • Monthly cost: ~$0.34
  • Sync time: ~2.0 min
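
The cost and sync figures are simple arithmetic over the stored size. A minimal sketch of that calculation, assuming an S3-like rate of roughly $0.023/GB-month, a 70% dedup/compression saving, and ~128 MB/s sync bandwidth (all illustrative assumptions, not DVC constants):

storage_estimate.py - Storage Cost Sketch
def estimate_storage(raw_gb: float, savings_ratio: float = 0.70,
                     price_per_gb_month: float = 0.023,
                     sync_mb_per_s: float = 128.0) -> dict:
    """Rough storage/cost estimate for a DVC remote (illustrative only)."""
    stored_gb = raw_gb * (1 - savings_ratio)  # what actually lands in the remote
    return {
        "raw_gb": raw_gb,
        "stored_gb": stored_gb,
        "saved_gb": raw_gb - stored_gb,
        "monthly_cost_usd": stored_gb * price_per_gb_month,
        "sync_minutes": stored_gb * 1024 / sync_mb_per_s / 60,
    }

print(estimate_storage(50.0))
# -> 15.0 GB stored, 35.0 GB saved, ~$0.34/month, ~2.0 min sync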

Core DVC Concepts

Data Versioning

  • Track large files and datasets with Git
  • Efficient storage using deduplication
  • Branch and merge data like source code
  • Remote storage integration (S3, GCS, etc.)

Pipeline Management

  • Define reproducible ML workflows
  • Automatic dependency tracking
  • Incremental pipeline execution
  • Stage-by-stage caching

Experiment Tracking

  • Compare experiment results
  • Track hyperparameters and metrics
  • Visualize experiment history
  • Share experiments with teams

Model Registry

  • Version control for trained models
  • Model metadata and lineage
  • Deployment artifact management
  • Model performance tracking

DVC Setup and Basic Usage

Installation and Initialization

Install and Setup DVC
# Install DVC
pip install "dvc[s3]"  # quoted so the brackets survive zsh; or "dvc[gcs]", "dvc[azure]", "dvc[all]"

# Initialize DVC in your Git repository
cd my-ml-project
git init
dvc init

# Add remote storage
dvc remote add -d myremote s3://my-bucket/dvc-storage
dvc remote modify myremote region us-west-2

# Configure credentials (AWS example)
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"

Data Versioning Workflow

Track and Version Data
# Add data to DVC tracking
dvc add data/train.csv
dvc add models/my_model.pkl

# Commit the .dvc files to Git
git add data/train.csv.dvc models/my_model.pkl.dvc .dvc/config
git commit -m "Add training data and model"

# Push data to remote storage
dvc push

# Pull data from remote storage
dvc pull

# Check data status
dvc status
dvc diff
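
Once data is tracked and pushed, any Git revision of it can be read programmatically, which is the payoff of versioning it in the first place. A minimal sketch using the DVC Python API (the v1.0 tag is a hypothetical example):

import dvc.api
import pandas as pd

# Open the dataset as it existed at a given Git revision (tag, branch,
# or commit). "v1.0" is a hypothetical tag for illustration.
with dvc.api.open("data/train.csv", rev="v1.0") as f:
    df_v1 = pd.read_csv(f)

# Or read the current workspace version as raw text
raw = dvc.api.read("data/train.csv")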

DVC Pipelines

Pipeline Definition (dvc.yaml)

dvc.yaml - Pipeline Configuration
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
    - src/prepare.py
    - data/raw
    params:
    - prepare.seed
    - prepare.split
    outs:
    - data/prepared

  train:
    cmd: python src/train.py
    deps:
    - src/train.py
    - data/prepared
    params:
    - train.lr
    - train.epochs
    - train.batch_size
    - train.n_estimators
    - train.max_depth
    outs:
    - models/model.pkl
    metrics:
    - metrics/train.json

  evaluate:
    cmd: python src/evaluate.py
    deps:
    - src/evaluate.py
    - models/model.pkl
    - data/prepared
    metrics:
    - metrics/eval.json
    plots:
    - plots/confusion_matrix.json
    - plots/roc_curve.json

Parameters Configuration

params.yaml - Hyperparameters
prepare:
  seed: 42
  split: 0.2

train:
  lr: 0.001
  epochs: 10
  batch_size: 32
  model_type: "random_forest"
  n_estimators: 100
  max_depth: 10

evaluate:
  threshold: 0.5
  metrics:
    - accuracy
    - precision
    - recall
    - f1_score
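
The prepare stage above references src/prepare.py, which this guide never shows. A minimal sketch that consumes prepare.seed and prepare.split (the data/raw/data.csv filename and the target column are assumptions for illustration):

src/prepare.py - Prepare Stage (sketch)
import os
import yaml
import pandas as pd
from sklearn.model_selection import train_test_split

# Read raw data, split it, and write the data/prepared outputs that the
# train stage depends on. Assumes a single data/raw/data.csv with a
# 'target' column.
with open("params.yaml") as f:
    params = yaml.safe_load(f)["prepare"]

df = pd.read_csv("data/raw/data.csv")
train_df, test_df = train_test_split(
    df, test_size=params["split"], random_state=params["seed"]
)

os.makedirs("data/prepared", exist_ok=True)
train_df.to_csv("data/prepared/train.csv", index=False)
test_df.to_csv("data/prepared/test.csv", index=False)
print(f"Prepared {len(train_df)} train / {len(test_df)} test rows")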

Production DVC Implementation

ML Training Script with DVC

src/train.py - DVC-integrated Training
import os
import json
import yaml
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def load_params(params_path="params.yaml"):
    """Load parameters from params.yaml"""
    with open(params_path, 'r') as file:
        params = yaml.safe_load(file)
    return params

def train_model(data_path, params, output_dir="models", metrics_dir="metrics"):
    """Train model with DVC parameter management"""
    
    # Create output directories
    os.makedirs(output_dir, exist_ok=True)
    os.makedirs(metrics_dir, exist_ok=True)
    
    # Load prepared data
    train_df = pd.read_csv(f"{data_path}/train.csv")
    
    # Extract features and target
    X = train_df.drop(['target'], axis=1)
    y = train_df['target']
    
    # Split data
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=params['prepare']['seed']
    )
    
    # Initialize model with parameters
    model = RandomForestClassifier(
        n_estimators=params['train']['n_estimators'],
        max_depth=params['train']['max_depth'],
        random_state=params['prepare']['seed']
    )
    
    # Train model
    print("Training model...")
    model.fit(X_train, y_train)
    
    # Evaluate on validation set
    val_predictions = model.predict(X_val)
    
    # Calculate metrics
    metrics = {
        'accuracy': float(accuracy_score(y_val, val_predictions)),
        'precision': float(precision_score(y_val, val_predictions, average='weighted')),
        'recall': float(recall_score(y_val, val_predictions, average='weighted')),
        'f1_score': float(f1_score(y_val, val_predictions, average='weighted')),
        'n_estimators': params['train']['n_estimators'],
        'max_depth': params['train']['max_depth']
    }
    
    # Save model
    model_path = f"{output_dir}/model.pkl"
    joblib.dump(model, model_path)
    
    # Save metrics for DVC
    metrics_path = f"{metrics_dir}/train.json"
    with open(metrics_path, 'w') as f:
        json.dump(metrics, f, indent=2)
    
    print(f"Model saved to {model_path}")
    print(f"Metrics saved to {metrics_path}")
    print(f"Training completed - Accuracy: {metrics['accuracy']:.4f}")
    
    return model, metrics

if __name__ == "__main__":
    # Load parameters
    params = load_params()
    
    # Train model
    model, metrics = train_model(
        data_path="data/prepared",
        params=params
    )
    
    print("Training pipeline completed successfully!")

DVC API Integration

src/dvc_utils.py - DVC Python API
import dvc.api
import dvc.repo
import pandas as pd
from typing import Dict, Any, List
import os

class DVCManager:
    """DVC operations manager for ML pipelines"""
    
    def __init__(self, repo_path: str = "."):
        self.repo_path = repo_path
        self.repo = dvc.repo.Repo(repo_path)
    
    def get_data(self, path: str, rev: str = None) -> pd.DataFrame:
        """Get data from DVC tracking"""
        try:
            with dvc.api.open(path, repo=self.repo_path, rev=rev) as f:
                return pd.read_csv(f)
        except Exception as e:
            print(f"Error loading data from {path}: {e}")
            raise
    
    def get_params(self, params_file: str = "params.yaml", rev: str = None) -> Dict[str, Any]:
        """Get parameters from a DVC params file"""
        return dvc.api.params_show(params_file, repo=self.repo_path, rev=rev)
    
    def get_metrics(self, rev: str = None) -> Dict[str, Any]:
        """Get metrics from DVC metrics files"""
        return dvc.api.metrics_show(repo=self.repo_path, rev=rev)
    
    def compare_experiments(self, revisions: List[str]) -> pd.DataFrame:
        """Compare experiments across different revisions"""
        comparison_data = []
        
        for rev in revisions:
            try:
                metrics = self.get_metrics(rev=rev)
                params = self.get_params(rev=rev)
                
                row = {'revision': rev}
                
                # Flatten metrics (metrics_show merges metric files into one dict)
                for metric_name, metric_value in metrics.items():
                    if isinstance(metric_value, dict):
                        for sub_name, sub_value in metric_value.items():
                            row[f"metric_{metric_name}_{sub_name}"] = sub_value
                    else:
                        row[f"metric_{metric_name}"] = metric_value
                
                # Flatten params
                for category, category_params in params.items():
                    if isinstance(category_params, dict):
                        for param_name, param_value in category_params.items():
                            row[f"param_{category}_{param_name}"] = param_value
                    else:
                        row[f"param_{category}"] = category_params
                
                comparison_data.append(row)
                
            except Exception as e:
                print(f"Error processing revision {rev}: {e}")
        
        return pd.DataFrame(comparison_data)
    
    def run_pipeline(self, stage: str = None) -> bool:
        """Run DVC pipeline stages"""
        try:
            if stage:
                self.repo.reproduce(stage)
            else:
                self.repo.reproduce()
            return True
        except Exception as e:
            print(f"Pipeline execution failed: {e}")
            return False
    
    def push_data(self, targets: List[str] = None) -> bool:
        """Push data to remote storage"""
        try:
            # targets=None pushes all tracked outputs in one operation
            self.repo.push(targets=targets)
            return True
        except Exception as e:
            print(f"Data push failed: {e}")
            return False
    
    def pull_data(self, targets: List[str] = None) -> bool:
        """Pull data from remote storage"""
        try:
            # targets=None pulls all tracked outputs in one operation
            self.repo.pull(targets=targets)
            return True
        except Exception as e:
            print(f"Data pull failed: {e}")
            return False

# Usage example
if __name__ == "__main__":
    dvc_manager = DVCManager()
    
    # Get current experiment data
    data = dvc_manager.get_data("data/prepared/train.csv")
    params = dvc_manager.get_params()
    metrics = dvc_manager.get_metrics()
    
    print(f"Data shape: {data.shape}")
    print(f"Current params: {params}")
    print(f"Current metrics: {metrics}")
    
    # Compare multiple experiments
    experiments = dvc_manager.compare_experiments(['HEAD', 'experiment-1', 'experiment-2'])
    print(f"Experiment comparison:\n{experiments}")

Real-World Examples

Netflix

Uses DVC for managing petabyte-scale datasets and ML model versioning across recommendation systems.

  • 10+ PB of versioned data
  • 1000+ ML experiments monthly
  • Multi-region data synchronization

Spotify

Leverages DVC for music recommendation pipeline management and A/B testing data versioning.

  • 500+ TB audio feature datasets
  • Real-time pipeline orchestration
  • Cross-team experiment sharing

Airbnb

Implements DVC for pricing model data management and feature engineering pipeline versioning.

  • 100+ TB booking/pricing data
  • Automated feature pipelines
  • Model performance tracking

DVC Best Practices

✅ Do's

  • Use meaningful commit messages for data changes
  • Organize data in logical directory structures
  • Use remote storage for team collaboration
  • Version control your dvc.yaml and params.yaml
  • Use DVC pipelines for reproducible workflows
  • Monitor data drift and model performance

❌ Don'ts

  • Don't add large files directly to Git
  • Don't add .dvc files to your .gitignore (they belong in Git)
  • Don't modify .dvc files manually
  • Don't commit without running `dvc repro` first
  • Don't store credentials in repository files
  • Don't skip data validation in pipelines

Advanced DVC Features

Experiment Management

DVC Experiments Workflow
# Run experiment with different parameters
dvc exp run -S train.lr=0.01 -S train.epochs=20

# Compare experiments
dvc exp show --include-metrics accuracy,f1_score

# Apply best experiment
dvc exp apply exp-123-456

# Clean up experiments
dvc exp gc --workspace
dvc exp remove exp-789-012
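
Recent DVC releases (3.x) also expose experiments through the Python API via dvc.api.exp_show; a small sketch:

import dvc.api
import pandas as pd

# Each returned dict mirrors one row of `dvc exp show`: experiment name
# and rev, plus one key per tracked metric and param.
df = pd.DataFrame(dvc.api.exp_show())
print(df.head())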

Data Registry and Import

DVC Data Registry
# Import data from another DVC project
dvc import https://github.com/example/dataset data/raw.csv

# Import and track updates
dvc import --rev main https://github.com/example/dataset data/features.csv
dvc update data/features.csv.dvc

# Create data registry
dvc add shared/common_features.csv
dvc push
git add shared/common_features.csv.dvc
git commit -m "Add shared features dataset"
git tag -a "v1.0-features" -m "Feature set v1.0"
git push --tags
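
Consumers can then read a pinned version of the registry artifact without cloning the whole project. A sketch using the DVC Python API (the registry URL is a hypothetical location for the project above):

import dvc.api
import pandas as pd

REGISTRY = "https://github.com/example/registry"  # hypothetical registry repo URL

# Resolve where the artifact lives in the registry's remote storage...
url = dvc.api.get_url("shared/common_features.csv", repo=REGISTRY, rev="v1.0-features")
print(f"Stored at: {url}")

# ...or stream it directly at the pinned tag.
with dvc.api.open("shared/common_features.csv", repo=REGISTRY, rev="v1.0-features") as f:
    features = pd.read_csv(f)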

Multi-stage Pipeline Optimization

Advanced Pipeline Configuration
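dvc.yaml also supports templating: a foreach/do block expands one definition into several parallel stages, and only stages whose inputs changed are re-run. A hedged sketch, assuming per-region data under data/prepared/ and a hypothetical src/featurize.py:

stages:
  featurize:
    foreach:          # expands into one stage per region
      - us
      - eu
      - apac
    do:
      cmd: python src/featurize.py --region ${item}
      deps:
      - src/featurize.py
      - data/prepared/${item}
      outs:
      - data/features/${item}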