Performance Metrics


Learn to measure what matters. Master the Four Golden Signals of monitoring and understand how top tech companies track system performance and user experience.

The Four Golden Signals (Google SRE)

Google's Site Reliability Engineering team identified four key metrics that provide comprehensive insight into system health. These signals form the foundation of effective monitoring.

Why These Four?

They cover the complete user experience: how fast (latency), how much load (traffic), how often it breaks (errors), and when it might break (saturation). Together, they predict both current user experience and future system stability.

1. Latency

Time taken to serve a request

Good: < 100ms
Alert: > 500ms
How to Measure: P50, P95, P99 response time
Business Impact: User experience, conversion rates

2. Traffic

Amount of demand on your system

Good: Within capacity
Alert: > 80% capacity
How to Measure: Requests per second, concurrent users
Business Impact: System utilization, scaling needs

3. Errors

Rate of failed requests

Good: < 0.1%
Alert: > 1%
How to Measure: Error rate, HTTP 5xx responses
Business Impact: User satisfaction, revenue loss

4. Saturation

How "full" your service is

Good: < 70%
Alert: > 90%
How to Measure: CPU, memory, disk utilization
Business Impact: Performance degradation, outages
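
As a rough illustration of how the four signals might be derived from raw data and checked against the alert thresholds listed above, here is a minimal Python sketch. The names `Request`, `golden_signals`, and `firing_alerts` are hypothetical, as are the sample numbers. Using P99 as the latency that is compared to the 500ms threshold, and CPU as the saturated resource, are assumptions: the text above does not say which percentile or resource to alert on.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code


# Alert thresholds copied from the four signal descriptions above.
# Applying the latency threshold to P99 is an assumption of this sketch.
ALERT_THRESHOLDS = {
    "latency_p99_ms": 500.0,
    "traffic_pct_of_capacity": 80.0,
    "error_rate_pct": 1.0,
    "saturation_pct": 90.0,
}


def golden_signals(requests, window_s, capacity_rps, cpu_pct):
    """Summarize one observation window (at least two requests) into four signals."""
    durations = [r.duration_ms for r in requests]
    server_errors = sum(1 for r in requests if 500 <= r.status <= 599)
    rps = len(requests) / window_s
    return {
        # quantiles(..., n=100) returns 99 cut points; index 98 is P99.
        "latency_p99_ms": quantiles(durations, n=100)[98],
        "traffic_pct_of_capacity": 100 * rps / capacity_rps,
        "error_rate_pct": 100 * server_errors / len(requests),
        "saturation_pct": cpu_pct,
    }


def firing_alerts(signals):
    """Return the names of all signals above their alert threshold."""
    return sorted(
        name for name, value in signals.items() if value > ALERT_THRESHOLDS[name]
    )


# Made-up 10-second window: 200 requests, 4 of them HTTP 503, CPU at 95%.
window = (
    [Request(50, 200)] * 186
    + [Request(80, 200)] * 10
    + [Request(60, 503)] * 4
)
signals = golden_signals(window, window_s=10, capacity_rps=100, cpu_pct=95.0)
print(firing_alerts(signals))
```

In this made-up window, latency and traffic stay under their thresholds while the 2% error rate and 95% CPU both exceed theirs, so only those two alerts fire.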

Real-World Performance Benchmarks

These are commonly cited performance targets from companies serving billions of users. They represent world-class user experiences and are worth aspiring to.

  • Google Search: response time < 200ms, including network round trip
  • Netflix Streaming: startup time < 3s for video playback initialization
  • Amazon Web Store: page load < 1s (100ms delay = 1% sales drop)
  • Facebook News Feed: time to feed < 500ms for personalized content delivery
  • Uber Ride Matching: match time < 30s for driver-rider pairing
  • WhatsApp Message: delivery time < 100ms for end-to-end message delivery

Key Insight

Notice how these companies set aggressive performance targets. They understand that every millisecond matters for user experience and business outcomes. Speed is a competitive advantage.

Understanding Percentiles

Averages can be misleading. A large number of fast requests can pull the average down and hide the slow requests that some users actually experience. Percentiles show the whole distribution, including the slow tail.

Why Percentiles Matter

Response Times: 150ms average, 800ms P99

Example: An average of 150ms looks good, but a P99 of 800ms means 1% of requests take 800ms or longer, so those users wait nearly a second.
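
To make the gap between the average and the tail concrete, the sketch below computes the mean and a few percentiles for a small made-up sample. The `percentile` helper is a hypothetical name and uses the nearest-rank definition, which is one of several common definitions. Monitoring tools often interpolate or estimate from histograms instead, so their numbers can differ slightly.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    # round() guards against floating-point noise before taking the ceiling.
    rank = max(1, math.ceil(round(p * len(ordered) / 100, 9)))
    return ordered[rank - 1]


# Made-up latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [120] * 95 + [400] * 3 + [800] * 2

summary = {
    "mean": mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}
print(summary)
```

With this sample the mean is 142ms and both P50 and P95 are 120ms, yet P99 is 800ms: the slow tail is invisible until the 99th percentile is examined.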

Common Percentiles

  • P50 (Median), 50%: Half of all requests are faster
  • P95, 95%: Typical 'worst case' user experience
  • P99, 99%: Catches outliers and edge cases
  • P99.9, 99.9%: Ultra-rare but critical failures

P50 (Median)

Represents the typical user experience. Good for general performance tracking.

P95

Primary SLA metric. Balances user experience with engineering practicality.

P99

Catches edge cases and system stress. Critical for high-scale applications.

Performance Metrics Quick Reference

Essential Metrics

  • ☐ P95 response time < 200ms
  • ☐ Error rate < 0.1%
  • ☐ Availability > 99.9%
  • ☐ Apdex score > 0.85
  • ☐ CPU utilization < 70%
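
The Apdex score in the checklist above is the least self-explanatory item, so here is a minimal sketch of the standard Apdex formula: requests at or below a target time T count as satisfied, requests between T and 4T count as tolerating (half weight), and anything slower counts as frustrated. The function name `apdex`, the 200ms value for T, and the sample latencies are assumptions for illustration.

```python
def apdex(latencies_ms, threshold_ms):
    """Apdex = (satisfied + tolerating / 2) / total samples.

    satisfied:  latency <= T
    tolerating: T < latency <= 4T
    frustrated: latency > 4T (contributes nothing to the score)
    """
    if not latencies_ms:
        raise ValueError("apdex() needs at least one sample")
    satisfied = sum(1 for t in latencies_ms if t <= threshold_ms)
    tolerating = sum(
        1 for t in latencies_ms if threshold_ms < t <= 4 * threshold_ms
    )
    return (satisfied + tolerating / 2) / len(latencies_ms)


# Made-up sample: 80 fast, 15 tolerable, 5 frustrating requests, with T = 200ms.
sample_ms = [100] * 80 + [500] * 15 + [1000] * 5
score = apdex(sample_ms, threshold_ms=200)
print(f"Apdex: {score:.3f}, meets > 0.85 target: {score > 0.85}")
```

Here the score is (80 + 15/2) / 100 = 0.875, which clears the 0.85 target from the checklist.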

Monitoring Tools

  • Prometheus + Grafana
  • New Relic, Datadog (SaaS)
  • CloudWatch (AWS)
  • Application Performance Monitoring (APM)
  • Synthetic monitoring / uptime checks

🧮 SLA Performance Calculator

Calculate the impact of different performance targets on user experience and costs

Example Result

99.9% availability means 8.76 hours downtime per year

Downtime/Year: 8.76 hours
Revenue Loss: $438,000
Infra Cost Multiplier: 2x
Total Infra Cost: $20,000/month
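
The downtime figure follows directly from the availability target: a year has 365 × 24 = 8,760 hours, and 0.1% of that is 8.76 hours. The sketch below reproduces that arithmetic. The function names are hypothetical, and the $50,000/hour revenue figure is an assumed input, chosen only because it is consistent with the $438,000 revenue loss shown above. The infrastructure cost multiplier is not modeled because the calculator's formula for it is not given.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_pct):
    """Hours of allowed downtime per year for a given availability target."""
    return (100 - availability_pct) / 100 * HOURS_PER_YEAR


def revenue_loss_per_year(availability_pct, revenue_per_hour):
    """Revenue at risk if all allowed downtime is actually used."""
    return downtime_hours_per_year(availability_pct) * revenue_per_hour


for target in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(target)
    # $50,000/hour is an assumed input, not a value stated in the text.
    loss = revenue_loss_per_year(target, revenue_per_hour=50_000)
    print(f"{target}% -> {hours:.2f} hours/year, about ${loss:,.0f} at risk")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, from 87.6 hours at 99% to 8.76 hours at 99.9% and about 53 minutes at 99.99%.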

🎯 Performance Metrics in Action

Real-world examples of how performance metrics drive business decisions

Scenarios

  • E-commerce Site Performance Crisis: Major retailer sees 40% drop in conversions during peak shopping season
  • Streaming Service Buffer Rate Analysis: Video platform reduces churn by optimizing startup time and buffering
  • API Gateway Saturation Event: Microservices platform experiences cascading failures due to ignored saturation metrics
  • Mobile App Performance Optimization: Social media app improves user engagement through latency optimization
  • Database Performance Degradation: SaaS platform detects and resolves gradual performance decline
  • Real-time Chat System Scaling: Messaging platform maintains low latency while scaling to millions of users

Context

Major retailer sees 40% drop in conversions during peak shopping season

Metrics

Page Load Time: 3.2 seconds
Conversion Rate: -40%
Error Rate: 2.1%
Revenue Impact: $2M lost/day

Outcome

Performance optimization became top priority. CDN implementation and database optimization reduced load times to under 1 second, recovering conversion rates.

Key Lessons

  • A commonly cited e-commerce rule of thumb is that every 100ms of added delay costs about 1% in conversions
  • Peak traffic periods expose hidden performance bottlenecks
  • Error rate above 1% indicates system stress requiring immediate attention
  • Performance monitoring should trigger automatic alerts before user impact
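
One way to alert before users are affected is to extrapolate a saturation trend and estimate how long remains until the alert threshold is reached. The sketch below fits a least-squares line to recent utilization samples. The function name, the sample data, and the choice of a simple linear trend are all assumptions: real utilization is rarely this linear, so treat the result as a rough early warning rather than a forecast.

```python
def hours_until_threshold(samples, threshold_pct=90.0):
    """Estimate hours until utilization reaches threshold_pct.

    samples: list of (hour, utilization_pct) pairs in time order.
    Returns None when there is no upward trend to extrapolate,
    and 0.0 when the latest sample is already at or above the threshold.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope in percentage points per hour.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    _, latest_pct = samples[-1]
    if latest_pct >= threshold_pct:
        return 0.0
    # Project forward from the latest observed value at the fitted rate.
    return (threshold_pct - latest_pct) / slope


# Made-up hourly CPU samples climbing by 2 percentage points per hour.
cpu_history = [(0, 60.0), (1, 62.0), (2, 64.0), (3, 66.0)]
print(hours_until_threshold(cpu_history))
```

For this history the estimate is 12 hours until the 90% alert threshold, which leaves time to scale up before latency or error rates degrade.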

📝 Performance Metrics Quiz

Question 1 of 5

Which of the Four Golden Signals is most important for predicting future outages?