What is Amazon CloudWatch?
Amazon CloudWatch is AWS's comprehensive monitoring and observability service that provides data and actionable insights to monitor applications, respond to system-wide performance changes, and optimize resource utilization. It collects monitoring and operational data in the form of logs, metrics, and events from AWS resources, applications, and services.
CloudWatch enables you to monitor your complete stack (applications, infrastructure, network, and services) and use alarms, logs, and events data to take automated actions and reduce mean time to resolution (MTTR). It's used by companies like Netflix for monitoring 15+ petabytes of daily streaming data and by Airbnb for monitoring millions of property searches and bookings.
CloudWatch Core Components
Metrics
Time-ordered set of data points published to CloudWatch.
• Custom application metrics
• High-resolution metrics (1-second)
• 15-month retention period
• Metric math for derived values
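As a rough illustration of custom and high-resolution metrics, the sketch below builds a payload for the AWS CLI `put-metric-data` command. The namespace `MyApp`, the metric name `CheckoutLatency`, and the helper function names are hypothetical; only `put-metric-data` and its `--namespace` and `--metric-data` options come from the AWS CLI.

```shell
# Build a JSON PutMetricData payload for a single data point.
# Arguments: metric name, value, unit, storage resolution (1 or 60 seconds, default 60).
build_metric_datum() {
  local name="$1" value="$2" unit="$3" resolution="${4:-60}"
  printf '[{"MetricName":"%s","Value":%s,"Unit":"%s","StorageResolution":%s}]' \
    "$name" "$value" "$unit" "$resolution"
}

# Publish the data point. Requires the AWS CLI and cloudwatch:PutMetricData permission.
publish_metric() {
  local namespace="$1"; shift
  aws cloudwatch put-metric-data \
    --namespace "$namespace" \
    --metric-data "$(build_metric_datum "$@")"
}

# Example (not run here): a 1-second high-resolution latency sample
# publish_metric "MyApp" "CheckoutLatency" 123 "Milliseconds" 1
```

Keeping the payload builder separate from the publishing call makes it possible to inspect the JSON before anything is sent to CloudWatch.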
Logs
Centralized log collection and analysis from AWS services and applications.
• Logs Insights for querying
• Configurable retention periods
• Log groups and streams
• Subscription filters
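A Logs Insights query can be started from the AWS CLI with `aws logs start-query`, which takes a time window in epoch seconds. The following is a minimal sketch: the query text, the helper names, and the default 60-minute window are assumptions chosen for illustration.

```shell
# Print "START END" in epoch seconds, covering the last N minutes.
query_window() {
  local minutes="$1" now
  now="$(date +%s)"
  echo "$(( now - minutes * 60 )) ${now}"
}

# Hypothetical query: the 20 most recent log lines containing "ERROR".
QUERY='fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20'

# Start the query against one log group and print the query ID.
run_error_query() {
  local log_group="$1" minutes="${2:-60}" window start end
  window="$(query_window "$minutes")"
  start="${window% *}"
  end="${window#* }"
  aws logs start-query \
    --log-group-name "$log_group" \
    --start-time "$start" \
    --end-time "$end" \
    --query-string "$QUERY" \
    --query 'queryId' --output text
}

# Results are fetched afterwards with:
#   aws logs get-query-results --query-id <id>
```

Narrowing the time window is usually the cheapest way to speed up a query, since Logs Insights charges by the amount of data scanned.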
Alarms
Automated monitoring with threshold-based and anomaly detection alerts.
• Anomaly detection models
• SNS integration
• Auto Scaling actions
• EC2 instance actions
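A threshold alarm with an SNS notification might be created as in the sketch below. The alarm name, the 80% threshold, and the three 5-minute evaluation periods are illustrative choices, and the `AWS_CLI` override variable is a convenience of this example rather than an AWS feature.

```shell
# CLI binary to call; set AWS_CLI=echo to print the command instead of running it.
AWS_CLI="${AWS_CLI:-aws}"

# Create a CPU alarm for one EC2 instance that notifies an SNS topic.
# Arguments: instance ID, SNS topic ARN.
create_cpu_alarm() {
  local instance_id="$1" topic_arn="$2"
  "$AWS_CLI" cloudwatch put-metric-alarm \
    --alarm-name "high-cpu-${instance_id}" \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=${instance_id}" \
    --statistic Average \
    --period 300 \
    --evaluation-periods 3 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "$topic_arn"
}

# Example (not run here):
# create_cpu_alarm i-0123456789abcdef0 arn:aws:sns:us-east-1:123456789012:ops-alerts
```

With these settings the alarm fires only after average CPU stays above 80% for three consecutive 5-minute periods, which filters out short spikes.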
Real-World CloudWatch Implementations
Netflix
Monitors 15+ petabytes of daily streaming data with custom metrics for video quality, buffering rates, and user engagement.
• 2+ billion custom metrics per day
• Sub-second anomaly detection
• Automated scaling based on viewing patterns
• Real-time quality degradation alerts
Airbnb
Uses CloudWatch for monitoring millions of property searches, bookings, and host interactions globally.
• Custom business metrics tracking
• Search performance optimization
• Booking funnel analysis
• Real-time fraud detection alerts
Lyft
Monitors ride-sharing operations with real-time metrics for driver-rider matching, ETAs, and surge pricing.
• Real-time location tracking metrics
• Dynamic pricing algorithm monitoring
• Driver utilization analytics
• Service availability dashboards
Spotify
Tracks music streaming performance, recommendation engine effectiveness, and user engagement patterns.
• Music streaming quality metrics
• Recommendation algorithm performance
• User behavior analytics
• Playlist generation monitoring
Advanced CloudWatch Features
Container Insights
Comprehensive monitoring for containerized applications running on Amazon EKS, ECS, and Fargate.
# Enable EKS control plane logging to CloudWatch Logs
# (this complements Container Insights; it does not enable it by itself)
aws eks update-cluster-config \
  --region us-west-2 \
  --name my-cluster \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'
# Create the amazon-cloudwatch namespace, the first step of deploying the CloudWatch agent as a DaemonSet
kubectl apply -f https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/cloudwatch-namespace.yaml
Lambda Insights
Detailed monitoring and troubleshooting for serverless applications running on AWS Lambda.
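# Add the Lambda Insights extension layer to the function
# (the layer ARN and version are Region-specific; --layers replaces the function's existing layer list)
aws lambda update-function-configuration \
  --function-name MyFunction \
  --layers arn:aws:lambda:us-east-1:580247275435:layer:LambdaInsightsExtension:21
# Optional: turn on extension logging for troubleshooting
# (--environment replaces all existing environment variables on the function)
aws lambda update-function-configuration \
  --function-name MyFunction \
  --environment "Variables={LAMBDA_INSIGHTS_LOG_LEVEL=info}"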
Application Insights
Automated application monitoring that uses machine learning to detect and diagnose issues.
• Automatic application discovery
• Component relationship mapping
• Anomaly detection and root cause analysis
• Integration with AWS Systems Manager OpsCenter
CloudWatch Best Practices
✅ Do
• Use appropriate retention periods to control costs
• Batch metric publishing (up to 1,000 metrics per PutMetricData call)
• Create composite alarms for complex conditions
• Use anomaly detection for dynamic thresholds
• Implement structured logging with consistent formats
• Use metric filters to extract metrics from logs
• Set up cross-region dashboards for global monitoring
• Use custom namespaces for application metrics
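Structured logging and metric filters work well together: once log lines are JSON with consistent keys, a filter pattern can count matching events as a metric. The sketch below assumes a hypothetical three-key log format, a filter named `error-count`, and a `MyApp` namespace; `put-metric-filter` and its options are from the AWS CLI.

```shell
# Emit one JSON log line with a fixed set of keys.
# Arguments: level, event, message. Values must not contain double quotes
# or backslashes, because this sketch does not escape them.
log_json() {
  printf '{"level":"%s","event":"%s","message":"%s"}\n' "$1" "$2" "$3"
}

# CLI binary to call; set AWS_CLI=echo to print the command instead of running it.
AWS_CLI="${AWS_CLI:-aws}"

# Count every ERROR-level JSON log line in a log group as one ErrorCount data point.
create_error_metric_filter() {
  local log_group="$1"
  "$AWS_CLI" logs put-metric-filter \
    --log-group-name "$log_group" \
    --filter-name "error-count" \
    --filter-pattern '{ $.level = "ERROR" }' \
    --metric-transformations 'metricName=ErrorCount,metricNamespace=MyApp,metricValue=1'
}

# Example (not run here):
# log_json ERROR payment_failed "card declined"
# create_error_metric_filter /my-app/production
```

The resulting `ErrorCount` metric can then be graphed or alarmed on like any other custom metric.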
❌ Don't
• Leave log groups with default (never expire) retention
• Create too many high-resolution (1-second) metrics
• Use high-cardinality dimension values (such as user or request IDs), since each unique dimension combination becomes a separate metric
• Ignore alarm state changes (OK to ALARM and back)
• Create alarms without proper notification channels
• Store sensitive data in log messages
• Query large time ranges without filtering
• Mix different units in the same custom metric
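Log groups left on the default never-expire retention can be found and fixed from the AWS CLI. The following is a sketch under assumptions: the helper names and the 30-day value are illustrative, and the JMESPath filter relies on `retentionInDays` being absent for log groups that never expire.

```shell
# CLI binary to call; set AWS_CLI=echo to print the command instead of running it.
AWS_CLI="${AWS_CLI:-aws}"

# Print the names of log groups with no retention setting, one per line.
list_never_expiring_log_groups() {
  "$AWS_CLI" logs describe-log-groups \
    --query 'logGroups[?!not_null(retentionInDays)].logGroupName' \
    --output text | tr '\t' '\n'
}

# Apply a retention period to one log group.
# Arguments: log group name, retention in days.
set_retention() {
  "$AWS_CLI" logs put-retention-policy \
    --log-group-name "$1" \
    --retention-in-days "$2"
}

# Example (not run here): give every never-expiring group a 30-day retention
# list_never_expiring_log_groups | while read -r group; do
#   set_retention "$group" 30
# done
```

CloudWatch Logs accepts only specific retention values (for example 7, 30, or 90 days), so a real script should pick one of the values the service documents.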