What is DVC Data Versioning?
DVC (Data Version Control) is an open-source tool that brings Git-like version control to machine learning projects, specifically designed for managing large datasets, ML models, and complex pipelines that Git cannot handle efficiently.
- Version Control: for data and models
- Pipeline Management: reproducible workflows
- Experiment Tracking: track metrics and artifacts
Core DVC Concepts
Data Versioning
- Track large files and datasets with Git
- Efficient storage using deduplication
- Branch and merge data like source code
- Remote storage integration (S3, GCS, etc.)
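For context, after `dvc add data/train.csv` Git tracks only a small pointer file like the sketch below (hash and size values are illustrative); the content hash is what enables deduplication in the DVC cache:

# data/train.csv.dvc
outs:
- md5: 3863c0bb5d25f295ba94aaf34d2aad6f
  size: 104857600
  path: train.csv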
Pipeline Management
- Define reproducible ML workflows
- Automatic dependency tracking
- Incremental pipeline execution
- Stage-by-stage caching
Experiment Tracking
- Compare experiment results
- Track hyperparameters and metrics
- Visualize experiment history
- Share experiments with teams
Model Registry
- Version control for trained models
- Model metadata and lineage
- Deployment artifact management
- Model performance tracking
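In practice this often means publishing models from one Git+DVC repository and fetching them elsewhere; a sketch with a hypothetical registry URL and tag:

# Download a tagged model artifact without cloning the registry repo
dvc get --rev v2.1.0 https://github.com/example/model-registry models/model.pkl
# Or import it so the dependency and its lineage stay tracked
dvc import --rev v2.1.0 https://github.com/example/model-registry models/model.pkl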
DVC Setup and Basic Usage
Installation and Initialization
Install and Setup DVC
# Install DVC
pip install "dvc[s3]"  # or "dvc[gcs]", "dvc[azure]", "dvc[all]"; quotes stop the shell from expanding the brackets
# Initialize DVC in your Git repository
cd my-ml-project
git init
dvc init
# Add remote storage
dvc remote add -d myremote s3://my-bucket/dvc-storage
dvc remote modify myremote region us-west-2
# Configure credentials (AWS example)
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
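Environment variables work, but for per-developer setups you can instead keep credentials in DVC's local config (.dvc/config.local), which DVC keeps out of Git:

# Stored in .dvc/config.local, never committed
dvc remote modify --local myremote access_key_id 'your-access-key'
dvc remote modify --local myremote secret_access_key 'your-secret-key'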
Data Versioning Workflow
Track and Version Data
# Add data to DVC tracking
dvc add data/train.csv
dvc add models/my_model.pkl
# Commit the .dvc files to Git
git add data/train.csv.dvc models/my_model.pkl.dvc .dvc/config
git commit -m "Add training data and model"
# Push data to remote storage
dvc push
# Pull data from remote storage
dvc pull
# Check data status
dvc status
dvc diff
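Because the .dvc pointer files live in Git, restoring an earlier data version is a two-step checkout: first the pointer, then the data it references (the tag name below is illustrative):

# Restore the pointer file from an earlier commit or tag
git checkout v1.0 -- data/train.csv.dvc
# Sync the workspace data to match the restored pointer
dvc checkout data/train.csv.dvc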
DVC Pipelines
Pipeline Definition (dvc.yaml)
dvc.yaml - Pipeline Configuration
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw
    params:
      - prepare.seed
      - prepare.split
    outs:
      - data/prepared
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/prepared
    params:
      # Track the parameters the train.py script below actually reads
      - train.n_estimators
      - train.max_depth
    outs:
      - models/model.pkl
    metrics:
      - metrics/train.json
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/prepared
    metrics:
      - metrics/eval.json
    plots:
      - plots/confusion_matrix.json
      - plots/roc_curve.json
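With the stages defined, DVC computes the dependency graph and re-runs only what changed:

# Run the full pipeline (stages with unchanged deps are skipped)
dvc repro
# Run a single stage plus its upstream dependencies
dvc repro train
# Visualize the stage dependency graph
dvc dag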
Parameters Configuration
params.yaml - Hyperparameters
prepare:
  seed: 42
  split: 0.2

train:
  lr: 0.001        # lr, epochs, and batch_size apply to other model types;
  epochs: 10       # the random-forest script reads n_estimators and max_depth
  batch_size: 32
  model_type: "random_forest"
  n_estimators: 100
  max_depth: 10

evaluate:
  threshold: 0.5
  metrics:
    - accuracy
    - precision
    - recall
    - f1_score
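The pipeline's prepare stage is referenced but not shown; a minimal sketch of src/prepare.py that consumes these parameters could look like this (the raw file name data/raw/data.csv and the presence of a test split are assumptions):

# src/prepare.py - illustrative sketch, not the original project's script
import os
import yaml
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the tracked seed and split ratio from params.yaml
with open('params.yaml') as f:
    params = yaml.safe_load(f)['prepare']

os.makedirs('data/prepared', exist_ok=True)

# data/raw/data.csv is a placeholder name for the raw input
df = pd.read_csv('data/raw/data.csv')
train_df, test_df = train_test_split(
    df, test_size=params['split'], random_state=params['seed']
)
train_df.to_csv('data/prepared/train.csv', index=False)
test_df.to_csv('data/prepared/test.csv', index=False)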
Production DVC Implementation
ML Training Script with DVC
src/train.py - DVC-integrated Training
import os
import json
import yaml
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


def load_params(params_path="params.yaml"):
    """Load parameters from params.yaml"""
    with open(params_path, 'r') as file:
        return yaml.safe_load(file)


def train_model(data_path, params, output_dir="models", metrics_dir="metrics"):
    """Train model with DVC parameter management"""
    # Create output directories
    os.makedirs(output_dir, exist_ok=True)
    os.makedirs(metrics_dir, exist_ok=True)

    # Load prepared data
    train_df = pd.read_csv(f"{data_path}/train.csv")

    # Extract features and target
    X = train_df.drop(['target'], axis=1)
    y = train_df['target']

    # Split data
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=params['prepare']['seed']
    )

    # Initialize model with parameters
    model = RandomForestClassifier(
        n_estimators=params['train']['n_estimators'],
        max_depth=params['train']['max_depth'],
        random_state=params['prepare']['seed']
    )

    # Train model
    print("Training model...")
    model.fit(X_train, y_train)

    # Evaluate on validation set
    val_predictions = model.predict(X_val)

    # Calculate metrics
    metrics = {
        'accuracy': float(accuracy_score(y_val, val_predictions)),
        'precision': float(precision_score(y_val, val_predictions, average='weighted')),
        'recall': float(recall_score(y_val, val_predictions, average='weighted')),
        'f1_score': float(f1_score(y_val, val_predictions, average='weighted')),
        'n_estimators': params['train']['n_estimators'],
        'max_depth': params['train']['max_depth']
    }

    # Save model
    model_path = f"{output_dir}/model.pkl"
    joblib.dump(model, model_path)

    # Save metrics for DVC
    metrics_path = f"{metrics_dir}/train.json"
    with open(metrics_path, 'w') as f:
        json.dump(metrics, f, indent=2)

    print(f"Model saved to {model_path}")
    print(f"Metrics saved to {metrics_path}")
    print(f"Training completed - Accuracy: {metrics['accuracy']:.4f}")
    return model, metrics


if __name__ == "__main__":
    params = load_params()
    model, metrics = train_model(data_path="data/prepared", params=params)
    print("Training pipeline completed successfully!")
DVC API Integration
src/dvc_utils.py - DVC Python API
import dvc.api
import dvc.repo
import pandas as pd
from typing import Any, Dict, List, Optional


class DVCManager:
    """DVC operations manager for ML pipelines"""

    def __init__(self, repo_path: str = "."):
        self.repo_path = repo_path
        self.repo = dvc.repo.Repo(repo_path)

    def get_data(self, path: str, rev: Optional[str] = None) -> pd.DataFrame:
        """Get data from DVC tracking"""
        try:
            with dvc.api.open(path, repo=self.repo_path, rev=rev) as f:
                return pd.read_csv(f)
        except Exception as e:
            print(f"Error loading data from {path}: {e}")
            raise

    def get_params(self, rev: Optional[str] = None) -> Dict[str, Any]:
        """Get parameters tracked in the repo's params files"""
        return dvc.api.params_show(repo=self.repo_path, rev=rev)

    def get_metrics(self, rev: Optional[str] = None) -> Dict[str, Any]:
        """Get metrics from DVC metrics files"""
        return dvc.api.metrics_show(repo=self.repo_path, rev=rev)

    def compare_experiments(self, revisions: List[str]) -> pd.DataFrame:
        """Compare experiments across different revisions"""
        comparison_data = []
        for rev in revisions:
            try:
                metrics = self.get_metrics(rev=rev)
                params = self.get_params(rev=rev)
                row = {'revision': rev}

                # Flatten metrics; values may or may not be grouped by metrics file
                for key, value in metrics.items():
                    if isinstance(value, dict):
                        for metric_name, metric_value in value.items():
                            row[f"metric_{metric_name}"] = metric_value
                    else:
                        row[f"metric_{key}"] = value

                # Flatten params
                for category, category_params in params.items():
                    if isinstance(category_params, dict):
                        for param_name, param_value in category_params.items():
                            row[f"param_{category}_{param_name}"] = param_value
                    else:
                        row[f"param_{category}"] = category_params

                comparison_data.append(row)
            except Exception as e:
                print(f"Error processing revision {rev}: {e}")
        return pd.DataFrame(comparison_data)

    def run_pipeline(self, stage: Optional[str] = None) -> bool:
        """Run DVC pipeline stages (all stages when none is given)"""
        try:
            if stage:
                self.repo.reproduce(stage)
            else:
                self.repo.reproduce()
            return True
        except Exception as e:
            print(f"Pipeline execution failed: {e}")
            return False

    def push_data(self, targets: Optional[List[str]] = None) -> bool:
        """Push data to remote storage (everything when no targets are given)"""
        try:
            self.repo.push(targets=targets)
            return True
        except Exception as e:
            print(f"Data push failed: {e}")
            return False

    def pull_data(self, targets: Optional[List[str]] = None) -> bool:
        """Pull data from remote storage (everything when no targets are given)"""
        try:
            self.repo.pull(targets=targets)
            return True
        except Exception as e:
            print(f"Data pull failed: {e}")
            return False


# Usage example
if __name__ == "__main__":
    dvc_manager = DVCManager()

    # Get current experiment data
    data = dvc_manager.get_data("data/prepared/train.csv")
    params = dvc_manager.get_params()
    metrics = dvc_manager.get_metrics()
    print(f"Data shape: {data.shape}")
    print(f"Current params: {params}")
    print(f"Current metrics: {metrics}")

    # Compare multiple experiments (revision names are illustrative)
    experiments = dvc_manager.compare_experiments(['HEAD', 'experiment-1', 'experiment-2'])
    print(f"Experiment comparison:\n{experiments}")
Real-World Examples
Netflix
Uses DVC for managing petabyte-scale datasets and ML model versioning across recommendation systems.
- 10+ PB of versioned data
- 1000+ ML experiments monthly
- Multi-region data synchronization
Spotify
Leverages DVC for music recommendation pipeline management and A/B testing data versioning.
- 500+ TB audio feature datasets
- Real-time pipeline orchestration
- Cross-team experiment sharing
Airbnb
Implements DVC for pricing model data management and feature engineering pipeline versioning.
- 100+ TB booking/pricing data
- Automated feature pipelines
- Model performance tracking
DVC Best Practices
✅ Do's
- Use meaningful commit messages for data changes
- Organize data in logical directory structures
- Use remote storage for team collaboration
- Version control your dvc.yaml and params.yaml
- Use DVC pipelines for reproducible workflows
- Monitor data drift and model performance
❌ Don'ts
- Don't add large files directly to Git
- Don't ignore .dvc files in your .gitignore
- Don't modify .dvc files manually
- Don't commit without running `dvc repro` first
- Don't store credentials in repository files
- Don't skip data validation in pipelines
Advanced DVC Features
Experiment Management
DVC Experiments Workflow
# Run experiment with different parameters
dvc exp run -S train.lr=0.01 -S train.epochs=20
# Compare experiments
dvc exp show --include-metrics accuracy,f1_score
# Apply best experiment
dvc exp apply exp-123-456
# Clean up experiments
dvc exp gc --workspace
dvc exp remove exp-789-012
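For parameter sweeps, recent DVC versions let you queue several runs and execute them as one batch; a sketch assuming DVC 3.x's queue commands:

# Queue two runs with different parameters, then execute the batch
dvc exp run --queue -S train.n_estimators=200
dvc exp run --queue -S train.n_estimators=500
dvc queue start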
Data Registry and Import
DVC Data Registry
# Import data from another DVC project
dvc import https://github.com/example/dataset data/raw.csv
# Import and track updates
dvc import --rev main https://github.com/example/dataset data/features.csv
dvc update data/features.csv.dvc
# Create data registry
dvc add shared/common_features.csv
dvc push
git add shared/common_features.csv.dvc
git commit -m "Add shared features dataset"
git tag -a "v1.0-features" -m "Feature set v1.0"
git push --tags
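Consuming projects can then discover the registry's contents and pin the exact tagged version (the registry URL below is illustrative):

# Browse what the registry tracks
dvc list https://github.com/example/dataset-registry shared
# Pin the tagged version in a consuming project
dvc import --rev v1.0-features https://github.com/example/dataset-registry shared/common_features.csv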
Multi-stage Pipeline Optimization
Advanced Pipeline Configuration
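The pipeline shown earlier is strictly linear; when a stage needs to fan out over several variants, DVC's foreach/do syntax (available since DVC 2.0) templates one stage definition across a list of items. A minimal sketch, assuming src/train.py accepts an --n-estimators flag:

stages:
  train:
    foreach:
      - 100
      - 500
    do:
      cmd: python src/train.py --n-estimators ${item}
      deps:
        - src/train.py
        - data/prepared
      outs:
        - models/model-${item}.pkl

DVC expands this into stages named train@100 and train@500, each cached and reproduced independently.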