Key Monitoring Metrics

Essential metrics and KPIs for system monitoring and observability

Key Metrics

Response Time

Time taken to process and respond to a request

Formula:

Response Time = Request End Time - Request Start Time

Thresholds:

Good: < 200ms
Warning: 200ms - 1s
Critical: > 1s

Examples:

API endpoint response time
Database query execution time
Page load time

Monitoring Tools:

Prometheus
New Relic
Datadog
Grafana

Recommended Alerts:

P95 response time > 500ms
Average response time > 200ms
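
The two recommended alerts can be evaluated from a window of recorded response times. What follows is a minimal sketch rather than a production implementation: the function names `p95` and `response_time_alerts` are hypothetical, the percentile uses the simple nearest-rank method (monitoring tools such as Prometheus typically estimate percentiles from histogram buckets instead), and the sample values are invented for illustration.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of durations (ms)."""
    ordered = sorted(samples_ms)
    # Integer form of ceil(0.95 * n); avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def response_time_alerts(samples_ms, p95_limit_ms=500, avg_limit_ms=200):
    """Return the names of the recommended alerts that would fire."""
    alerts = []
    if p95(samples_ms) > p95_limit_ms:
        alerts.append("p95")
    if sum(samples_ms) / len(samples_ms) > avg_limit_ms:
        alerts.append("average")
    return alerts


# Illustrative data: 18 fast requests and 2 slow ones.
samples = [100] * 18 + [900, 950]
print(p95(samples), response_time_alerts(samples))
```

With this data the average stays below 200ms while the P95 does not, which is one reason to alert on both values rather than on the average alone.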

Category: Application Performance
Priority: critical
Type: gauge

Throughput (RPS)

Number of requests processed per second

Formula:

Throughput = Total Requests / Time Period

Thresholds:

Good: > 1000 RPS
Warning: 500 - 1000 RPS
Critical: < 500 RPS

Examples:

HTTP requests per second
Database transactions per second
Message queue processing rate

Monitoring Tools:

Prometheus
Grafana
CloudWatch
New Relic

Category: Application Performance
Priority: critical
Type: gauge

Error Rate

Percentage of requests that result in errors

Formula:

Error Rate = (Error Count / Total Requests) × 100

Thresholds:

Good: < 0.1%
Warning: 0.1% - 1%
Critical: > 1%

Examples:

4xx client errors
5xx server errors
Database connection errors

Monitoring Tools:

Prometheus
Sentry
Bugsnag
CloudWatch

Recommended Alerts:

Error rate > 1%
5xx errors > 0.5%
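
The error-rate formula and the two alerts can be worked through on a batch of HTTP status codes. This is a sketch under stated assumptions: `error_rate` and `classify_error_rate` are hypothetical helper names, treating every status of 400 or above as an error is a choice made here (some teams count only 5xx), and the status codes are made-up sample data.

```python
def error_rate(status_codes, is_error=lambda code: code >= 400):
    """Percentage of requests whose status code counts as an error."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if is_error(code))
    return errors * 100 / len(status_codes)


def classify_error_rate(rate_percent):
    """Map a rate to the bands above; putting exactly 1% in 'warning' is a choice."""
    if rate_percent < 0.1:
        return "good"
    if rate_percent <= 1.0:
        return "warning"
    return "critical"


# Illustrative data: 1000 requests, 4 client errors, 6 server errors.
codes = [200] * 990 + [404] * 4 + [500] * 6
overall = error_rate(codes)
server_only = error_rate(codes, is_error=lambda code: code >= 500)
print(overall, classify_error_rate(overall), server_only)
```

In this sample the overall rate sits at the edge of the warning band, yet the 5xx share alone already exceeds the 0.5% alert limit, so the second alert would fire first.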

Category: Infrastructure
Priority: high
Type: gauge

CPU Utilization

Percentage of CPU capacity being used

Formula:

CPU Utilization = (CPU Used / CPU Total) × 100

Thresholds:

Good: < 70%
Warning: 70% - 85%
Critical: > 85%

Examples:

Server CPU usage
Container CPU usage
Database CPU usage

Monitoring Tools:

Prometheus
CloudWatch
Datadog
Grafana

Category: Infrastructure
Priority: high
Type: gauge

Memory Utilization

Percentage of memory being used

Formula:

Memory Utilization = (Memory Used / Memory Total) × 100

Thresholds:

Good: < 80%
Warning: 80% - 90%
Critical: > 90%

Examples:

RAM usage
JVM heap usage
Container memory usage

Monitoring Tools:

Prometheus
CloudWatch
Datadog
Grafana

Category: Infrastructure
Priority: high
Type: gauge

Disk Utilization

Percentage of disk space being used

Formula:

Disk Utilization = (Disk Used / Disk Total) × 100

Thresholds:

Good: < 80%
Warning: 80% - 90%
Critical: > 90%

Examples:

File system usage
Database storage usage
Log file storage

Monitoring Tools:

Prometheus
CloudWatch
Nagios
Zabbix
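
The disk utilization formula can be tried directly with Python's standard library. The sketch below assumes the root path `/` is the file system of interest; `utilization_percent` and `classify_disk` are hypothetical helper names, and the result may differ slightly from what `df` reports because some file systems reserve blocks.

```python
import shutil


def utilization_percent(used, total):
    """(used / total) x 100, the same shape as the CPU and memory formulas."""
    if total <= 0:
        raise ValueError("total must be positive")
    return used * 100 / total


def classify_disk(percent):
    """Map a percentage to the disk bands above (exactly 90% counts as warning here)."""
    if percent < 80:
        return "good"
    if percent <= 90:
        return "warning"
    return "critical"


usage = shutil.disk_usage("/")  # named tuple of total, used, free (bytes)
percent = utilization_percent(usage.used, usage.total)
print(f"{percent:.1f}% used -> {classify_disk(percent)}")
```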

Category: Database
Priority: high
Type: gauge

Database Connection Pool

Share of the pool's maximum database connections that are currently active

Formula:

Pool Utilization = (Active Connections / Max Connections) × 100

Thresholds:

Good: < 70%
Warning: 70% - 85%
Critical: > 85%

Examples:

PostgreSQL connections
MySQL connections
MongoDB connections

Monitoring Tools:

Prometheus
Datadog
New Relic
CloudWatch

Category: Database
Priority: critical
Type: histogram

Database Query Time

Time taken to execute database queries

Formula:

Query Time = Query End Time - Query Start Time

Thresholds:

Good: < 100ms
Warning: 100ms - 500ms
Critical: > 500ms

Examples:

SELECT query execution time
INSERT/UPDATE query time
Complex JOIN query time

Monitoring Tools:

Prometheus
Datadog
New Relic
pganalyze
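
Query time is usually recorded by timing each query and sorting the duration into buckets. The following sketch illustrates that idea with the standard library only. `QueryTimeHistogram` is a hypothetical class, the bucket bounds are an assumption chosen to line up with the 100ms and 500ms thresholds, and the buckets here are stored per range, whereas Prometheus stores cumulative counts.

```python
import bisect
import time
from contextlib import contextmanager

# Upper bounds in seconds (assumed values, not taken from any tool's defaults).
BUCKETS = [0.05, 0.1, 0.25, 0.5, 1.0]


class QueryTimeHistogram:
    def __init__(self, bounds=BUCKETS):
        self.bounds = list(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the overflow bucket
        self.total_seconds = 0.0

    def observe(self, seconds):
        # bisect_left returns the first bucket whose upper bound is >= seconds.
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1
        self.total_seconds += seconds

    @contextmanager
    def time_query(self):
        """Query Time = Query End Time - Query Start Time, measured around a block."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.observe(time.perf_counter() - start)


hist = QueryTimeHistogram()
with hist.time_query():
    time.sleep(0.01)  # stand-in for a real database call
for seconds in (0.03, 0.2, 0.7):
    hist.observe(seconds)
print(hist.counts)
```

Counts in the slots above the 0.5 second bound correspond to queries in the critical band.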

Category: Database
Priority: high
Type: counter

Database Deadlocks

Number of database deadlocks per time period

Formula:

Deadlock Rate = Deadlocks / Time Period

Thresholds:

Good: 0 per hour
Warning: 1 - 5 per hour
Critical: > 5 per hour

Examples:

Transaction deadlocks
Lock wait timeouts
Concurrent update conflicts

Monitoring Tools:

Prometheus
Datadog
CloudWatch
Database logs
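
Because deadlocks are tracked as a counter, the rate comes from the difference between two readings rather than from the raw value. The sketch below shows one way to compute it. `rate_per_hour` and `classify_deadlocks` are hypothetical names, the sample readings are invented, and treating any rate above zero and up to 5 per hour as a warning is an assumption that fills the gap between the listed bands.

```python
def rate_per_hour(previous, current):
    """Rate between two (unix_seconds, counter_value) samples, per hour."""
    (t0, v0), (t1, v1) = previous, current
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    # A counter only goes up; a smaller value means it was reset (for example
    # by a restart), so count from zero instead of producing a negative rate.
    increase = v1 - v0 if v1 >= v0 else v1
    return increase * 3600 / elapsed


def classify_deadlocks(per_hour):
    if per_hour == 0:
        return "good"
    if per_hour <= 5:
        return "warning"
    return "critical"


rate = rate_per_hour((0, 12), (1800, 14))  # 2 deadlocks in 30 minutes
print(rate, classify_deadlocks(rate))
```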

Category: Network
Priority: high
Type: histogram

Network Latency

Time for data to travel between network endpoints

Formula:

Latency ≈ Round Trip Time / 2 (one-way estimate, assuming a symmetric path)

Thresholds:

Good: < 50ms
Warning: 50ms - 100ms
Critical: > 100ms

Examples:

Service-to-service latency
Database connection latency
CDN latency

Monitoring Tools:

Prometheus
Grafana
Pingdom
CloudWatch

Category: Network
Priority: medium
Type: gauge

Bandwidth Utilization

Percentage of network bandwidth being used

Formula:

Bandwidth Utilization = (Used Bandwidth / Total Bandwidth) × 100

Thresholds:

Good: < 70%
Warning: 70% - 85%
Critical: > 85%

Examples:

Ingress bandwidth usage
Egress bandwidth usage
Internal network traffic

Monitoring Tools:

Prometheus
CloudWatch
Datadog
SNMP monitoring

Category: Business
Priority: high
Type: gauge

Active Users

Number of users actively using the system

Formula:

Active Users = Unique Users in Time Window

Thresholds:

Good: > 10,000
Warning: 5,000 - 10,000
Critical: < 5,000

Examples:

Daily Active Users (DAU)
Monthly Active Users (MAU)
Concurrent users

Monitoring Tools:

Google Analytics
Mixpanel
Amplitude
Custom metrics
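
"Unique users in time window" means each user is counted once, no matter how many events they generate. A minimal sketch of that counting follows. The `active_users` function, the `(user_id, timestamp)` event shape, and the sample events are assumptions made for illustration, and a 30-day window stands in for "monthly".

```python
from datetime import datetime, timedelta


def active_users(events, now, window):
    """Count distinct user IDs with at least one event in (now - window, now]."""
    cutoff = now - window
    return len({user_id for user_id, ts in events if cutoff < ts <= now})


now = datetime(2024, 1, 31, 12, 0)
events = [
    ("alice", datetime(2024, 1, 31, 9, 0)),
    ("alice", datetime(2024, 1, 31, 11, 0)),  # same user, counted once
    ("bob", datetime(2024, 1, 30, 13, 0)),
    ("carol", datetime(2024, 1, 10, 8, 0)),
]
dau = active_users(events, now, timedelta(days=1))
mau = active_users(events, now, timedelta(days=30))
print(dau, mau)
```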

Category: Business
Priority: critical
Type: gauge

Conversion Rate

Percentage of users who complete desired actions

Formula:

Conversion Rate = (Conversions / Total Visitors) × 100

Thresholds:

Good: > 5%
Warning: 2% - 5%
Critical: < 2%

Examples:

Sign-up conversion rate
Purchase conversion rate
Email subscription rate

Monitoring Tools:

Google Analytics
Mixpanel
Optimizely
Custom dashboards

Category: Security
Priority: high
Type: counter

Failed Login Attempts

Number of unsuccessful authentication attempts

Formula:

Failed Login Rate = Failed Logins / Time Period

Thresholds:

Good: < 100 per hour
Warning: 100 - 500 per hour
Critical: > 500 per hour

Examples:

Brute force attempts
Invalid credentials
Account lockouts

Monitoring Tools:

Prometheus
Splunk
ELK Stack
Security logs

Recommended Alerts:

Failed logins > 1000/hour
Account lockout threshold reached
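
A per-hour failed login rate can be kept up to date with a sliding window over event timestamps. The sketch below is one possible approach, not a reference implementation: `FailedLoginWindow` and `classify_failed_logins` are hypothetical names, timestamps are plain seconds and are assumed to arrive in order, and the traffic pattern is invented.

```python
from collections import deque


class FailedLoginWindow:
    """Counts failed logins seen during the trailing window (default one hour)."""

    def __init__(self, window_seconds=3600):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        self._timestamps.append(timestamp)

    def count(self, now):
        # Drop entries that have aged out of the window (oldest are at the left).
        while self._timestamps and self._timestamps[0] <= now - self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps)


def classify_failed_logins(per_hour):
    if per_hour < 100:
        return "good"
    if per_hour <= 500:
        return "warning"
    return "critical"


window = FailedLoginWindow()
for second in range(0, 7200, 12):  # one failure every 12 seconds for two hours
    window.record(second)
current = window.count(now=7200)
print(current, classify_failed_logins(current))
```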

Category: Security
Priority: critical
Type: counter

Security Events

Number of security-related events detected

Formula:

Security Event Rate = Security Events / Time Period

Thresholds:

Good: 0 per hour
Warning: 1 - 5 per hour
Critical: > 5 per hour

Examples:

Intrusion attempts
Malware detection
Privilege escalation

Monitoring Tools:

SIEM tools
Splunk
ELK Stack
Security scanners

Monitoring Strategy

Priority levels: critical, high, medium, low

Monitoring Best Practice: Start with Critical and High priority metrics to establish baseline monitoring. Implement alerting for all Critical metrics. Use dashboards to visualize trends and correlations across different metric categories.
