What is Amazon CloudWatch?

Amazon CloudWatch is AWS's comprehensive monitoring and observability service that provides data and actionable insights to monitor applications, respond to system-wide performance changes, and optimize resource utilization. It collects monitoring and operational data in the form of logs, metrics, and events from AWS resources, applications, and services.

CloudWatch enables you to monitor your complete stack (applications, infrastructure, network, and services) and use alarms, logs, and events data to take automated actions and reduce mean time to resolution (MTTR). It's used by companies like Netflix for monitoring 15+ petabytes of daily streaming data and by Airbnb for monitoring millions of property searches and bookings.

CloudWatch Cost & Performance Calculator

Sample inputs:
• Custom metrics per hour: 10,000
• Daily log volume: 50 GB
• Log retention: 30 days
• Number of alarms: 25

Estimated results:
• Monthly cost: $2,164
• Query latency (dashboard load): 1.8 s
• Total storage: 1,500 GB
• Monthly metrics: 7,200,000
• Alarm cost: $2.50/month
• Log cost: $1.50/month
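
The calculator figures above are consistent with simple per-unit arithmetic. The Python sketch below reproduces them. The three unit rates are not stated on the page; they are inferred here from the displayed totals (an assumption), so treat them as placeholders and check the AWS pricing page before relying on them.

```python
# Unit rates inferred from the calculator's displayed totals (assumptions,
# not official AWS pricing): 25 alarms -> $2.50, 50 GB -> $1.50, and the
# rest of the $2,164 total spread over 7,200,000 monthly metric data points.
RATE_PER_ALARM = 0.10        # USD per alarm per month
RATE_PER_LOG_GB = 0.03       # USD per GB of daily log volume
RATE_PER_DATAPOINT = 0.0003  # USD per metric data point


def estimate(metrics_per_hour, daily_log_gb, retention_days, alarms):
    """Return a calculator-style breakdown for a 30-day month."""
    monthly_metrics = metrics_per_hour * 24 * 30
    total_storage_gb = daily_log_gb * retention_days
    alarm_cost = alarms * RATE_PER_ALARM
    log_cost = daily_log_gb * RATE_PER_LOG_GB
    metric_cost = monthly_metrics * RATE_PER_DATAPOINT
    return {
        "monthly_metrics": monthly_metrics,
        "total_storage_gb": total_storage_gb,
        "alarm_cost": round(alarm_cost, 2),
        "log_cost": round(log_cost, 2),
        "monthly_cost": round(alarm_cost + log_cost + metric_cost),
    }


result = estimate(metrics_per_hour=10_000, daily_log_gb=50,
                  retention_days=30, alarms=25)
print(result)
```

With these inferred rates, metric volume accounts for almost all of the estimated bill, which is why the metric-related guidance later on this page matters most for cost.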

CloudWatch Core Components

Metrics

Time-ordered set of data points published to CloudWatch.

• Standard AWS service metrics
• Custom application metrics
• High-resolution metrics (down to 1-second granularity)
• Retention of up to 15 months, with older data points rolled up to coarser granularity
• Metric math for derived values
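
As a rough illustration of the custom and high-resolution metrics listed above, the sketch below builds a single data point in the shape the PutMetricData API expects. The metric name, namespace, and dimension are hypothetical, and the actual boto3 call is left as a comment because it needs AWS credentials.

```python
from datetime import datetime, timezone


def build_datum(name, value, unit="Count", dimensions=None, high_resolution=False):
    """Build one MetricData entry for PutMetricData."""
    datum = {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        "Timestamp": datetime.now(timezone.utc),
        # StorageResolution 1 = high-resolution (1-second); 60 = standard.
        "StorageResolution": 1 if high_resolution else 60,
    }
    if dimensions:
        datum["Dimensions"] = [
            {"Name": key, "Value": val} for key, val in dimensions.items()
        ]
    return datum


# Hypothetical metric for illustration only.
datum = build_datum("CheckoutLatency", 182.5, unit="Milliseconds",
                    dimensions={"Service": "checkout"}, high_resolution=True)
print(datum)

# With boto3 installed and credentials configured, this could be sent as:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(
#       Namespace="MyApp/Checkout", MetricData=[datum])
```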

Logs

Centralized log collection and analysis from AWS services and applications.

• Real-time log streaming
• Logs Insights for querying
• Configurable retention periods
• Log groups and streams
• Subscription filters
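
To make the Logs Insights item above more concrete, here is a minimal sketch that builds the arguments for a StartQuery call searching recent log events for the string ERROR. The log group name is hypothetical, and the boto3 calls are shown only as comments.

```python
import time


def insights_query_params(log_group, minutes=60, limit=20):
    """Build keyword arguments for the CloudWatch Logs StartQuery API."""
    end = int(time.time())
    query = (
        "fields @timestamp, @message "
        "| filter @message like /ERROR/ "
        "| sort @timestamp desc "
        f"| limit {limit}"
    )
    return {
        "logGroupName": log_group,
        "startTime": end - minutes * 60,  # epoch seconds
        "endTime": end,
        "queryString": query,
    }


params = insights_query_params("/my-app/api", minutes=15)  # hypothetical log group
print(params["queryString"])

# With boto3 installed and credentials configured:
#   import boto3
#   logs = boto3.client("logs")
#   query_id = logs.start_query(**params)["queryId"]
#   results = logs.get_query_results(queryId=query_id)  # poll until "Complete"
```

Keeping the time window narrow, as the `minutes` argument does here, limits how much data each query has to scan.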

Alarms

Automated monitoring with threshold-based and anomaly detection alerts.

• Metric and composite alarms
• Anomaly detection models
• SNS integration
• Auto Scaling actions
• EC2 instance actions
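
A threshold-based metric alarm with an SNS notification, as listed above, might be defined along these lines. This is a sketch: the instance ID, topic ARN, threshold, and evaluation settings are illustrative assumptions, and the function only builds the arguments for PutMetricAlarm.

```python
def cpu_alarm_params(instance_id, sns_topic_arn, threshold=80.0):
    """Build keyword arguments for PutMetricAlarm on EC2 CPU utilization."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,             # length of each evaluation window, in seconds
        "EvaluationPeriods": 3,    # consecutive breaching windows before ALARM
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": [sns_topic_arn],  # notify through SNS on ALARM
    }


# Hypothetical identifiers for illustration only.
params = cpu_alarm_params(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",
)
print(params["AlarmName"])

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```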

Real-World CloudWatch Implementations

Netflix

Monitors 15+ petabytes of daily streaming data with custom metrics for video quality, buffering rates, and user engagement.

• 2+ billion custom metrics per day
• Sub-second anomaly detection
• Automated scaling based on viewing patterns
• Real-time quality degradation alerts

Airbnb

Uses CloudWatch for monitoring millions of property searches, bookings, and host interactions globally.

• Custom business metrics tracking
• Search performance optimization
• Booking funnel analysis
• Real-time fraud detection alerts

Lyft

Monitors ride-sharing operations with real-time metrics for driver-rider matching, ETAs, and surge pricing.

• Real-time location tracking metrics
• Dynamic pricing algorithm monitoring
• Driver utilization analytics
• Service availability dashboards

Spotify

Tracks music streaming performance, recommendation engine effectiveness, and user engagement patterns.

• Music streaming quality metrics
• Recommendation algorithm performance
• User behavior analytics
• Playlist generation monitoring

Advanced CloudWatch Features

Container Insights

Comprehensive monitoring for containerized applications running on Amazon EKS, ECS, and Fargate.

Enable Container Insights

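# Enable EKS control plane logging to CloudWatch Logs
# (control plane logs only; Container Insights metrics come from the agent installed below)
aws eks update-cluster-config \
  --region us-west-2 \
  --name my-cluster \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'

# Create the amazon-cloudwatch namespace, the first step of the DaemonSet-based agent install.
# The service account, ConfigMap, and DaemonSet manifests in the same directory must be applied afterwards.
kubectl apply -f https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/cloudwatch-namespace.yaml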

Lambda Insights

Detailed monitoring and troubleshooting for serverless applications running on AWS Lambda.

Enable Lambda Insights

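# Add the Lambda Insights extension layer to the function.
# Note: --layers replaces the function's entire layer list, so include any existing layers too.
# The execution role also needs the CloudWatchLambdaInsightsExecutionRolePolicy managed policy.
aws lambda update-function-configuration \
  --function-name MyFunction \
  --layers arn:aws:lambda:us-east-1:580247275435:layer:LambdaInsightsExtension:21

# Optional: turn on debug logging for the extension.
# Note: --environment replaces all existing environment variables on the function.
aws lambda update-function-configuration \
  --function-name MyFunction \
  --environment "Variables={LAMBDA_INSIGHTS_LOG_LEVEL=info}"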

Application Insights

Automated application monitoring that uses machine learning to detect and diagnose issues.

Features:
• Automatic application discovery
• Component relationship mapping
• Anomaly detection and root cause analysis
• Integration with AWS Systems Manager OpsCenter

CloudWatch Best Practices

✅ Do

• Use appropriate retention periods to control costs
• Batch metric publishing (PutMetricData accepts up to 1,000 metrics per request)
• Create composite alarms for complex conditions
• Use anomaly detection for dynamic thresholds
• Implement structured logging with consistent formats
• Use metric filters to extract metrics from logs
• Set up cross-region dashboards for global monitoring
• Use custom namespaces for application metrics

❌ Don't

• Leave log groups with default (never expire) retention
• Create too many high-resolution (1-second) metrics
• Use high-cardinality dimension values (such as user or request IDs), since each unique dimension combination is stored and billed as a separate metric
• Ignore alarm state changes (OK to ALARM and back)
• Create alarms without proper notification channels
• Store sensitive data in log messages
• Query large time ranges without filtering
• Mix different units in the same custom metric
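
For the first item in the list above, one way to find log groups that never expire is to look for entries in DescribeLogGroups output that have no `retentionInDays` key. The sketch below works on already-fetched entries; the group names and the 30-day value are hypothetical, and the boto3 calls are shown as comments.

```python
def groups_without_retention(log_groups):
    """Return the names of log groups that have no retention policy.

    Each item is a dict shaped like an entry in DescribeLogGroups output;
    groups that never expire have no 'retentionInDays' key.
    """
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]


sample = [
    {"logGroupName": "/my-app/api", "retentionInDays": 30},
    {"logGroupName": "/my-app/worker"},
]
print(groups_without_retention(sample))  # ['/my-app/worker']

# With boto3 installed and credentials configured:
#   import boto3
#   logs = boto3.client("logs")
#   pages = logs.get_paginator("describe_log_groups").paginate()
#   groups = [g for page in pages for g in page["logGroups"]]
#   for name in groups_without_retention(groups):
#       logs.put_retention_policy(logGroupName=name, retentionInDays=30)
```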

Cost Optimization Guidelines

• Logs: set retention policies (from 1 day to 10 years, based on needs)
• Metrics: batch API calls (send multiple metrics per PutMetricData request)
• Alarms: use composite alarms (consolidate related alarms and reduce alarm noise)
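
A composite alarm combines the states of existing alarms through a rule expression. As a minimal sketch, the helper below builds a rule that fires only when every listed child alarm is in the ALARM state. The alarm names are hypothetical, and the boto3 call is shown as a comment.

```python
def all_in_alarm(alarm_names):
    """Build an AlarmRule that is true only when every child alarm is in ALARM."""
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


rule = all_in_alarm(["high-cpu", "high-latency"])  # hypothetical alarm names
print(rule)  # ALARM("high-cpu") AND ALARM("high-latency")

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("cloudwatch").put_composite_alarm(
#       AlarmName="service-degraded", AlarmRule=rule)
```

Attaching notification actions to the composite alarm only, and not to each child alarm, is one way to keep a single incident from producing several separate alerts.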
