Performance Metrics
Learn to measure what matters. Master the Four Golden Signals of monitoring and understand how top tech companies track system performance and user experience.
The Four Golden Signals (Google SRE)
Google's Site Reliability Engineering team identified four key metrics that provide comprehensive insight into system health. These signals form the foundation of effective monitoring.
Why These Four?
They cover the complete user experience: how fast (latency), how much load (traffic), how often it breaks (errors), and when it might break (saturation). Together, they predict both current user experience and future system stability.
Latency
Time taken to serve a request
Traffic
Amount of demand on your system
Errors
Rate of failed requests
Saturation
How "full" your service is
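The four definitions above can be sketched as a small computation over a window of request records. This is a minimal illustration, not a production pipeline: the `Request` record, the window size, and the `capacity_rps` estimate used for saturation are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float  # time taken to serve this request
    failed: bool       # did the request error out?

def golden_signals(window: list[Request], window_seconds: float,
                   capacity_rps: float) -> dict:
    """Compute the Four Golden Signals over a window of requests.

    capacity_rps is an estimate of maximum sustainable throughput,
    used here as a crude stand-in for saturation.
    """
    traffic_rps = len(window) / window_seconds
    avg_latency = sum(r.latency_ms for r in window) / len(window)
    error_rate = sum(r.failed for r in window) / len(window)
    return {
        "latency_ms": avg_latency,              # how fast
        "traffic_rps": traffic_rps,             # how much load
        "error_rate": error_rate,               # how often it breaks
        "saturation": traffic_rps / capacity_rps,  # how "full"
    }

# Hypothetical 2-second window with one failed request
reqs = [Request(120, False), Request(80, False),
        Request(200, True), Request(100, False)]
signals = golden_signals(reqs, window_seconds=2.0, capacity_rps=10.0)
```

In a real system these numbers come from instrumentation (metrics libraries, load balancer logs), not in-memory lists, but the arithmetic is the same.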
Real-World Performance Benchmarks
Performance standards from companies serving billions of users. These benchmarks represent world-class user experiences and are targets worth aspiring to.
Key Insight
Notice how these companies set aggressive performance targets. They understand that every millisecond matters for user experience and business outcomes. Speed is a competitive advantage.
Understanding Percentiles
Averages can be misleading: a healthy-looking mean can hide a long tail of slow requests that affects a meaningful share of users. Percentiles tell the complete story of user experience.
Common Percentiles
P50 (Median)
Represents the typical user experience. Good for general performance tracking.
P95
Primary SLA metric. Balances user experience with engineering practicality.
P99
Catches edge cases and system stress. Critical for high-scale applications.
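To see how the mean understates tail latency, here is a small nearest-rank percentile sketch. The latency sample is synthetic and the numbers are illustrative only:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value such that at least
    p% of samples are at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 94 fast requests plus 6 slow outliers (synthetic data)
latencies = [100.0] * 94 + [3000.0] * 6

mean = sum(latencies) / len(latencies)  # 274 ms: looks almost fine
p50 = percentile(latencies, 50)         # 100 ms: the typical user
p95 = percentile(latencies, 95)         # 3000 ms: the tail appears
p99 = percentile(latencies, 99)         # 3000 ms: edge cases exposed
```

The mean (274 ms) sits between the experiences of the two user groups and describes neither; the percentiles show that 6% of users waited 3 seconds.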
Performance Metrics Quick Reference
Essential Metrics
- ☐ P95 response time < 200ms
- ☐ Error rate < 0.1%
- ☐ Availability > 99.9%
- ☐ Apdex score > 0.85
- ☐ CPU utilization < 70%
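The Apdex target in the checklist above comes from the standard Apdex formula: score = (satisfied + tolerating/2) / total, where "satisfied" means latency ≤ T and "tolerating" means T < latency ≤ 4T. A minimal sketch with a hypothetical sample and T = 200 ms:

```python
def apdex(latencies_ms: list[float], t_ms: float = 200.0) -> float:
    """Apdex score: (satisfied + tolerating/2) / total.

    satisfied:  latency <= T
    tolerating: T < latency <= 4T
    frustrated: latency > 4T (counts as zero)
    """
    satisfied = sum(1 for x in latencies_ms if x <= t_ms)
    tolerating = sum(1 for x in latencies_ms if t_ms < x <= 4 * t_ms)
    return (satisfied + tolerating / 2) / len(latencies_ms)

# Hypothetical sample: 80 satisfied, 15 tolerating, 5 frustrated requests
sample = [150.0] * 80 + [500.0] * 15 + [1200.0] * 5
score = apdex(sample, t_ms=200.0)  # (80 + 15/2) / 100 = 0.875
```

A score of 0.875 would clear the checklist target of 0.85; tightening T raises the bar without changing the formula.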
Monitoring Tools
- Prometheus + Grafana
- New Relic, Datadog (SaaS)
- CloudWatch (AWS)
- Application Performance Monitoring (APM)
- Synthetic monitoring / uptime checks
🧮 SLA Performance Calculator
Calculate the impact of different availability targets on user experience and costs. For example, a 99.9% availability target allows 8.76 hours of downtime per year.
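The downtime figure above follows directly from the availability percentage. A small sketch of the calculation, using a 365-day year:

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours_per_year(availability_pct: float) -> float:
    """Annual downtime budget implied by an availability target."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

# Common "number of nines" tiers
for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_hours_per_year(target):.2f} h/year")
```

Each extra nine cuts the budget by 10x: 99.9% allows 8.76 hours per year, while 99.99% allows roughly 53 minutes.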
🎯 Performance Metrics in Action
Real-world examples of how performance metrics drive business decisions
1. E-commerce Site Performance Crisis
Context
Major retailer sees 40% drop in conversions during peak shopping season
Outcome
Performance optimization became top priority. CDN implementation and database optimization reduced load times to under 1 second, recovering conversion rates.
Key Lessons
- Every 100ms delay costs 1% in conversions for e-commerce
- Peak traffic periods expose hidden performance bottlenecks
- Error rate above 1% indicates system stress requiring immediate attention
- Performance monitoring should trigger automatic alerts before user impact
2. Streaming Service Buffer Rate Analysis
Context
Video platform reduces churn by optimizing startup time and buffering
Outcome
Aggressive content pre-positioning and adaptive bitrate algorithms kept users watching. Startup time under 2 seconds became competitive advantage.
Key Lessons
- Video startup time over 2 seconds causes 6% of viewers to abandon
- Buffer events have 3x higher impact on user satisfaction than startup delay
- Geographic content distribution reduces latency for global audiences
- Real-time quality adaptation prevents buffer events during network congestion
3. API Gateway Saturation Event
Context
Microservices platform experiences cascading failures due to ignored saturation metrics
Outcome
Resource saturation at API gateway caused widespread timeouts. Horizontal scaling and circuit breakers prevented future incidents.
Key Lessons
- Saturation metrics predict failures before they impact users
- CPU utilization over 80% requires immediate scaling action
- Circuit breakers prevent cascade failures in microservice architectures
- Load testing should validate performance under sustained high utilization
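The circuit-breaker pattern mentioned above can be sketched in a few lines: after a run of consecutive failures the breaker "opens" and fails fast instead of piling load onto a struggling dependency, then allows a trial call after a cooldown. This is a minimal illustration, not a production implementation (no half-open call budget, no metrics, not thread-safe):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after max_failures consecutive
    failures, reject calls while open, allow a trial call after
    reset_seconds."""

    def __init__(self, max_failures: int = 5, reset_seconds: float = 30.0):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_seconds:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result

# Usage (names hypothetical): breaker.call(fetch_user_profile, user_id=42)
```

Failing fast is what breaks the cascade: upstream services get an immediate error they can handle instead of a timeout that ties up their own resources.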
4. Mobile App Performance Optimization
Context
Social media app improves user engagement through latency optimization
Outcome
Feed pre-loading and image optimization delivered sub-second load times. User engagement metrics improved across all demographics.
Key Lessons
- Mobile networks amplify latency issues compared to desktop
- Image optimization has outsized impact on mobile performance
- Feed pre-loading during app startup improves perceived performance
- Performance improvements directly correlate with user engagement
5. Database Performance Degradation
Context
SaaS platform detects and resolves gradual performance decline
Outcome
Proactive monitoring caught slowly degrading query performance. Index optimization and query refactoring restored normal operation.
Key Lessons
- Performance degradation often happens gradually and goes unnoticed
- Database index maintenance is critical for sustained performance
- Query performance monitoring should track trends, not just absolute values
- Automated performance regression detection prevents user impact
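Trend-based regression detection, as opposed to a fixed absolute threshold, can be as simple as comparing a recent window of query timings against the prior baseline. A minimal sketch; the window size and 1.2x threshold are arbitrary assumptions for illustration:

```python
def regression_detected(history_ms: list[float], window: int = 7,
                        threshold: float = 1.2) -> bool:
    """Flag a regression when the mean of the most recent `window`
    samples exceeds the prior baseline mean by `threshold`x.

    This tracks a trend relative to the system's own history, so slow
    drift is caught even while every sample stays under a fixed limit.
    """
    if len(history_ms) < 2 * window:
        return False  # not enough data to form a baseline
    baseline = history_ms[:-window]
    recent = history_ms[-window:]
    baseline_mean = sum(baseline) / len(baseline)
    recent_mean = sum(recent) / len(recent)
    return recent_mean > threshold * baseline_mean

# Synthetic daily query timings
steady = [100.0] * 30                    # no drift
drifting = [100.0] * 23 + [130.0] * 7    # 30% slower over the last week
```

Here `regression_detected(drifting)` fires even though 130 ms might still be well under an absolute alert limit, which is exactly the gradual degradation the lesson above describes.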
6. Real-time Chat System Scaling
Context
Messaging platform maintains low latency while scaling to millions of users
Outcome
Regional message routing and connection pooling enabled massive scale while maintaining chat-quality latency requirements.
Key Lessons
- Real-time applications require consistent latency, not just low average latency
- WebSocket connection management critical for chat application performance
- Message routing optimization reduces cross-region latency
- Connection pooling and load balancing essential for million-user scale