
DVC Data Versioning

Master DVC for data versioning, pipeline automation, and ML experiment tracking. Learn data-centric MLOps with hands-on examples.

40 min read · Intermediate

What is DVC Data Versioning?

DVC (Data Version Control) is an open-source tool that brings Git-like version control to machine learning projects, specifically designed for managing large datasets, ML models, and complex pipelines that Git cannot handle efficiently.

  • Version Control: for data & models
  • Pipeline Management: reproducible workflows
  • Experiment Tracking: track metrics & artifacts

📊 DVC Storage Example

Deduplication and compression determine how much of your raw data actually lands in remote storage, and therefore what it costs. A worked example for a 50.0 GB raw dataset:

  • Raw storage: 50.0 GB
  • Compressed storage: 15.0 GB
  • Storage savings: 35.0 GB
  • Monthly cost: ~$0.34
  • Sync time: ~2.0 min
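
The cost and sync figures are simple arithmetic over the stored size. A minimal sketch of that calculation, assuming an S3-like rate of roughly $0.023/GB-month, a 70% dedup/compression saving, and ~128 MB/s sync bandwidth (all illustrative assumptions, not DVC constants):

storage_estimate.py - Storage Cost Sketch
def estimate_storage(raw_gb: float, savings_ratio: float = 0.70,
                     price_per_gb_month: float = 0.023,
                     sync_mb_per_s: float = 128.0) -> dict:
    """Rough storage/cost estimate for a DVC remote (illustrative only)."""
    stored_gb = raw_gb * (1 - savings_ratio)  # what actually lands in the remote
    return {
        "raw_gb": raw_gb,
        "stored_gb": stored_gb,
        "saved_gb": raw_gb - stored_gb,
        "monthly_cost_usd": stored_gb * price_per_gb_month,
        "sync_minutes": stored_gb * 1024 / sync_mb_per_s / 60,
    }

print(estimate_storage(50.0))
# -> 15.0 GB stored, 35.0 GB saved, ~$0.34/month, ~2.0 min sync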

Core DVC Concepts

Data Versioning

  • Track large files and datasets with Git
  • Efficient storage using deduplication
  • Branch and merge data like source code
  • Remote storage integration (S3, GCS, etc.)

Pipeline Management

  • Define reproducible ML workflows
  • Automatic dependency tracking
  • Incremental pipeline execution
  • Stage-by-stage caching

Experiment Tracking

  • Compare experiment results
  • Track hyperparameters and metrics
  • Visualize experiment history
  • Share experiments with teams

Model Registry

  • Version control for trained models
  • Model metadata and lineage
  • Deployment artifact management
  • Model performance tracking

DVC Setup and Basic Usage

Installation and Initialization

Install and Setup DVC
# Install DVC
pip install "dvc[s3]"  # quoted so the brackets survive zsh; or "dvc[gcs]", "dvc[azure]", "dvc[all]"

# Initialize DVC in your Git repository
cd my-ml-project
git init
dvc init

# Add remote storage
dvc remote add -d myremote s3://my-bucket/dvc-storage
dvc remote modify myremote region us-west-2

# Configure credentials (AWS example)
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"

Data Versioning Workflow

Track and Version Data
# Add data to DVC tracking
dvc add data/train.csv
dvc add models/my_model.pkl

# Commit the .dvc files to Git
git add data/train.csv.dvc models/my_model.pkl.dvc .dvc/config
git commit -m "Add training data and model"

# Push data to remote storage
dvc push

# Pull data from remote storage
dvc pull

# Check data status
dvc status
dvc diff
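
Once data is tracked and pushed, any Git revision of it can be read programmatically, which is the payoff of versioning it in the first place. A minimal sketch using the DVC Python API (the v1.0 tag is a hypothetical example):

import dvc.api
import pandas as pd

# Open the dataset as it existed at a given Git revision (tag, branch,
# or commit). "v1.0" is a hypothetical tag for illustration.
with dvc.api.open("data/train.csv", rev="v1.0") as f:
    df_v1 = pd.read_csv(f)

# Or read the current workspace version as raw text
raw = dvc.api.read("data/train.csv")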

DVC Pipelines

Pipeline Definition (dvc.yaml)

dvc.yaml - Pipeline Configuration
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
    - src/prepare.py
    - data/raw
    params:
    - prepare.seed
    - prepare.split
    outs:
    - data/prepared

  train:
    cmd: python src/train.py
    deps:
    - src/train.py
    - data/prepared
    params:
    - train.lr
    - train.epochs
    - train.batch_size
    - train.n_estimators
    - train.max_depth
    outs:
    - models/model.pkl
    metrics:
    - metrics/train.json

  evaluate:
    cmd: python src/evaluate.py
    deps:
    - src/evaluate.py
    - models/model.pkl
    - data/prepared
    metrics:
    - metrics/eval.json
    plots:
    - plots/confusion_matrix.json
    - plots/roc_curve.json

Parameters Configuration

params.yaml - Hyperparameters
prepare:
  seed: 42
  split: 0.2

train:
  lr: 0.001
  epochs: 10
  batch_size: 32
  model_type: "random_forest"
  n_estimators: 100
  max_depth: 10

evaluate:
  threshold: 0.5
  metrics:
    - accuracy
    - precision
    - recall
    - f1_score
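
The prepare stage above references src/prepare.py, which this guide never shows. A minimal sketch that consumes prepare.seed and prepare.split (the data/raw/data.csv filename and the target column are assumptions for illustration):

src/prepare.py - Prepare Stage (sketch)
import os
import yaml
import pandas as pd
from sklearn.model_selection import train_test_split

# Read raw data, split it, and write the data/prepared outputs that the
# train stage depends on. Assumes a single data/raw/data.csv with a
# 'target' column.
with open("params.yaml") as f:
    params = yaml.safe_load(f)["prepare"]

df = pd.read_csv("data/raw/data.csv")
train_df, test_df = train_test_split(
    df, test_size=params["split"], random_state=params["seed"]
)

os.makedirs("data/prepared", exist_ok=True)
train_df.to_csv("data/prepared/train.csv", index=False)
test_df.to_csv("data/prepared/test.csv", index=False)
print(f"Prepared {len(train_df)} train / {len(test_df)} test rows")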

Production DVC Implementation

ML Training Script with DVC

src/train.py - DVC-integrated Training
import os
import json
import yaml
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def load_params(params_path="params.yaml"):
    """Load parameters from params.yaml"""
    with open(params_path, 'r') as file:
        params = yaml.safe_load(file)
    return params

def train_model(data_path, params, output_dir="models", metrics_dir="metrics"):
    """Train model with DVC parameter management"""
    
    # Create output directories
    os.makedirs(output_dir, exist_ok=True)
    os.makedirs(metrics_dir, exist_ok=True)
    
    # Load prepared data
    train_df = pd.read_csv(f"{data_path}/train.csv")
    
    # Extract features and target
    X = train_df.drop(['target'], axis=1)
    y = train_df['target']
    
    # Split data
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=params['prepare']['seed']
    )
    
    # Initialize model with parameters
    model = RandomForestClassifier(
        n_estimators=params['train']['n_estimators'],
        max_depth=params['train']['max_depth'],
        random_state=params['prepare']['seed']
    )
    
    # Train model
    print("Training model...")
    model.fit(X_train, y_train)
    
    # Evaluate on validation set
    val_predictions = model.predict(X_val)
    
    # Calculate metrics
    metrics = {
        'accuracy': float(accuracy_score(y_val, val_predictions)),
        'precision': float(precision_score(y_val, val_predictions, average='weighted')),
        'recall': float(recall_score(y_val, val_predictions, average='weighted')),
        'f1_score': float(f1_score(y_val, val_predictions, average='weighted')),
        'n_estimators': params['train']['n_estimators'],
        'max_depth': params['train']['max_depth']
    }
    
    # Save model
    model_path = f"{output_dir}/model.pkl"
    joblib.dump(model, model_path)
    
    # Save metrics for DVC
    metrics_path = f"{metrics_dir}/train.json"
    with open(metrics_path, 'w') as f:
        json.dump(metrics, f, indent=2)
    
    print(f"Model saved to {model_path}")
    print(f"Metrics saved to {metrics_path}")
    print(f"Training completed - Accuracy: {metrics['accuracy']:.4f}")
    
    return model, metrics

if __name__ == "__main__":
    # Load parameters
    params = load_params()
    
    # Train model
    model, metrics = train_model(
        data_path="data/prepared",
        params=params
    )
    
    print("Training pipeline completed successfully!")

DVC API Integration

src/dvc_utils.py - DVC Python API
import dvc.api
import dvc.repo
import pandas as pd
from typing import Dict, Any, List
import os

class DVCManager:
    """DVC operations manager for ML pipelines"""
    
    def __init__(self, repo_path: str = "."):
        self.repo_path = repo_path
        self.repo = dvc.repo.Repo(repo_path)
    
    def get_data(self, path: str, rev: str = None) -> pd.DataFrame:
        """Get data from DVC tracking"""
        try:
            with dvc.api.open(path, repo=self.repo_path, rev=rev) as f:
                return pd.read_csv(f)
        except Exception as e:
            print(f"Error loading data from {path}: {e}")
            raise
    
    def get_params(self, params_file: str = "params.yaml", rev: str = None) -> Dict[str, Any]:
        """Get parameters from a DVC params file"""
        return dvc.api.params_show(params_file, repo=self.repo_path, rev=rev)
    
    def get_metrics(self, rev: str = None) -> Dict[str, Any]:
        """Get metrics from DVC metrics files"""
        return dvc.api.metrics_show(repo=self.repo_path, rev=rev)
    
    def compare_experiments(self, revisions: List[str]) -> pd.DataFrame:
        """Compare experiments across different revisions"""
        comparison_data = []
        
        for rev in revisions:
            try:
                metrics = self.get_metrics(rev=rev)
                params = self.get_params(rev=rev)
                
                row = {'revision': rev}
                
                # Flatten metrics (metrics_show merges metric files into one dict)
                for metric_name, metric_value in metrics.items():
                    if isinstance(metric_value, dict):
                        for sub_name, sub_value in metric_value.items():
                            row[f"metric_{metric_name}_{sub_name}"] = sub_value
                    else:
                        row[f"metric_{metric_name}"] = metric_value
                
                # Flatten params
                for category, category_params in params.items():
                    if isinstance(category_params, dict):
                        for param_name, param_value in category_params.items():
                            row[f"param_{category}_{param_name}"] = param_value
                    else:
                        row[f"param_{category}"] = category_params
                
                comparison_data.append(row)
                
            except Exception as e:
                print(f"Error processing revision {rev}: {e}")
        
        return pd.DataFrame(comparison_data)
    
    def run_pipeline(self, stage: str = None) -> bool:
        """Run DVC pipeline stages"""
        try:
            if stage:
                self.repo.reproduce(stage)
            else:
                self.repo.reproduce()
            return True
        except Exception as e:
            print(f"Pipeline execution failed: {e}")
            return False
    
    def push_data(self, targets: List[str] = None) -> bool:
        """Push data to remote storage"""
        try:
            # targets=None pushes all tracked outputs in one operation
            self.repo.push(targets=targets)
            return True
        except Exception as e:
            print(f"Data push failed: {e}")
            return False
    
    def pull_data(self, targets: List[str] = None) -> bool:
        """Pull data from remote storage"""
        try:
            # targets=None pulls all tracked outputs in one operation
            self.repo.pull(targets=targets)
            return True
        except Exception as e:
            print(f"Data pull failed: {e}")
            return False

# Usage example
if __name__ == "__main__":
    dvc_manager = DVCManager()
    
    # Get current experiment data
    data = dvc_manager.get_data("data/prepared/train.csv")
    params = dvc_manager.get_params()
    metrics = dvc_manager.get_metrics()
    
    print(f"Data shape: {data.shape}")
    print(f"Current params: {params}")
    print(f"Current metrics: {metrics}")
    
    # Compare multiple experiments
    experiments = dvc_manager.compare_experiments(['HEAD', 'experiment-1', 'experiment-2'])
    print(f"Experiment comparison:\n{experiments}")

Real-World Examples

Netflix

Uses DVC for managing petabyte-scale datasets and ML model versioning across recommendation systems.

  • 10+ PB of versioned data
  • 1000+ ML experiments monthly
  • Multi-region data synchronization

Spotify

Leverages DVC for music recommendation pipeline management and A/B testing data versioning.

  • 500+ TB audio feature datasets
  • Real-time pipeline orchestration
  • Cross-team experiment sharing

Airbnb

Implements DVC for pricing model data management and feature engineering pipeline versioning.

  • 100+ TB booking/pricing data
  • Automated feature pipelines
  • Model performance tracking

DVC Best Practices

✅ Do's

  • Use meaningful commit messages for data changes
  • Organize data in logical directory structures
  • Use remote storage for team collaboration
  • Version control your dvc.yaml and params.yaml
  • Use DVC pipelines for reproducible workflows
  • Monitor data drift and model performance

❌ Don'ts

  • Don't add large files directly to Git
  • Don't add .dvc files to your .gitignore (they belong in Git)
  • Don't modify .dvc files manually
  • Don't commit without running `dvc repro` first
  • Don't store credentials in repository files
  • Don't skip data validation in pipelines

Advanced DVC Features

Experiment Management

DVC Experiments Workflow
# Run experiment with different parameters
dvc exp run -S train.lr=0.01 -S train.epochs=20

# Compare experiments
dvc exp show --include-metrics accuracy,f1_score

# Apply best experiment
dvc exp apply exp-123-456

# Clean up experiments
dvc exp gc --workspace
dvc exp remove exp-789-012
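
Recent DVC releases (3.x) also expose experiments through the Python API via dvc.api.exp_show; a small sketch:

import dvc.api
import pandas as pd

# Each returned dict mirrors one row of `dvc exp show`: experiment name
# and rev, plus one key per tracked metric and param.
df = pd.DataFrame(dvc.api.exp_show())
print(df.head())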

Data Registry and Import

DVC Data Registry
# Import data from another DVC project
dvc import https://github.com/example/dataset data/raw.csv

# Import and track updates
dvc import --rev main https://github.com/example/dataset data/features.csv
dvc update data/features.csv.dvc

# Create data registry
dvc add shared/common_features.csv
dvc push
git add shared/common_features.csv.dvc
git commit -m "Add shared features dataset"
git tag -a "v1.0-features" -m "Feature set v1.0"
git push --tags
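
Consumers can then read a pinned version of the registry artifact without cloning the whole project. A sketch using the DVC Python API (the registry URL is a hypothetical location for the project above):

import dvc.api
import pandas as pd

REGISTRY = "https://github.com/example/registry"  # hypothetical registry repo URL

# Resolve where the artifact lives in the registry's remote storage...
url = dvc.api.get_url("shared/common_features.csv", repo=REGISTRY, rev="v1.0-features")
print(f"Stored at: {url}")

# ...or stream it directly at the pinned tag.
with dvc.api.open("shared/common_features.csv", repo=REGISTRY, rev="v1.0-features") as f:
    features = pd.read_csv(f)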

Multi-stage Pipeline Optimization

Advanced Pipeline Configuration
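dvc.yaml also supports templating: a foreach/do block expands one definition into several parallel stages, and only stages whose inputs changed are re-run. A hedged sketch, assuming per-region data under data/prepared/ and a hypothetical src/featurize.py:

stages:
  featurize:
    foreach:          # expands into one stage per region
      - us
      - eu
      - apac
    do:
      cmd: python src/featurize.py --region ${item}
      deps:
      - src/featurize.py
      - data/prepared/${item}
      outs:
      - data/features/${item}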