
Key Monitoring Metrics

Monitoring Metrics

Essential metrics and KPIs for system monitoring and observability

15 Key Metrics

Application Performance
Priority: critical | Type: histogram

Response Time

Time taken to process and respond to a request

Formula:

Response Time = Request End Time - Request Start Time

Thresholds:

Good: < 200ms
Warning: 200ms - 1s
Critical: > 1s

Examples:

  • API endpoint response time
  • Database query execution time
  • Page load time

Monitoring Tools:

Prometheus, New Relic, DataDog, Grafana

Recommended Alerts:

  • P95 response time > 500ms
  • Average response time > 200ms
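
As a rough illustration of the thresholds and the two recommended alerts above, the sketch below computes a nearest-rank P95 and an average from raw response times and classifies a value into the listed bands. The function names (`percentile`, `classify_response_time`) and the sample data are hypothetical. In practice the monitoring tool would usually derive percentiles from histogram buckets rather than from raw samples.

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile of a non-empty list of durations in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]

def classify_response_time(ms):
    """Map a response time onto the Good / Warning / Critical bands listed above."""
    if ms < 200:
        return "good"
    if ms <= 1000:
        return "warning"
    return "critical"

# Hypothetical samples: mostly fast requests with one slow outlier.
samples = [120, 95, 180, 210, 150, 1300, 170, 160, 140, 130]
p95 = percentile(samples, 95)
average = sum(samples) / len(samples)

p95_alert = p95 > 500          # "P95 response time > 500ms"
average_alert = average > 200  # "Average response time > 200ms"
print(p95, classify_response_time(p95), p95_alert, average_alert)
```

The single slow request is enough to push both the P95 and the average over their alert limits here, which is one reason the two alerts are usually read together rather than in isolation.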
Application Performance
Priority: critical | Type: gauge

Throughput (RPS)

Number of requests processed per second

Formula:

Throughput = Total Requests / Time Period

Thresholds:

Good: > 1000 RPS
Warning: 500-1000 RPS
Critical: < 500 RPS

Examples:

  • HTTP requests per second
  • Database transactions per second
  • Message queue processing rate

Monitoring Tools:

Prometheus, Grafana, CloudWatch, New Relic
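
The throughput formula above can be sketched as the difference between two readings of a request counter divided by the elapsed time. This is only a minimal sketch: `throughput_rps`, `classify_throughput`, and the counter readings are hypothetical, and the bands are copied from the list above, so they would need tuning to the expected load of a given service.

```python
def throughput_rps(count_start, count_end, seconds):
    """Requests per second between two readings of an ever-increasing request counter."""
    if seconds <= 0:
        raise ValueError("time period must be positive")
    if count_end < count_start:
        # A smaller second reading usually means the counter was reset (e.g. a restart).
        raise ValueError("counter went backwards")
    return (count_end - count_start) / seconds

def classify_throughput(rps):
    """Map a throughput value onto the Good / Warning / Critical bands listed above."""
    if rps > 1000:
        return "good"
    if rps >= 500:
        return "warning"
    return "critical"

# Hypothetical readings taken 60 seconds apart: 45,000 requests in between.
rps = throughput_rps(1_200_000, 1_245_000, 60)
print(rps, classify_throughput(rps))
```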
Application Performance
Priority: critical | Type: gauge

Error Rate

Percentage of requests that result in errors

Formula:

Error Rate = (Error Count / Total Requests) × 100

Thresholds:

Good: < 0.1%
Warning: 0.1% - 1%
Critical: > 1%

Examples:

  • 4xx client errors
  • 5xx server errors
  • Database connection errors

Monitoring Tools:

Prometheus, Sentry, Bugsnag, CloudWatch

Recommended Alerts:

  • Error rate > 1%
  • 5xx errors > 0.5%
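
A minimal sketch of the error rate formula and the two recommended alerts above. The function names and the request counts are hypothetical. Returning 0.0 when there is no traffic is an assumption made here to avoid a division by zero; some teams prefer to treat "no requests" as a separate condition.

```python
def error_rate_percent(error_count, total_requests):
    """Error Rate = (Error Count / Total Requests) x 100, with 0.0 for zero traffic."""
    if total_requests == 0:
        return 0.0
    return error_count / total_requests * 100

def should_alert(total_requests, all_errors, server_errors):
    """True if either alert above fires: overall rate > 1% or 5xx rate > 0.5%."""
    overall = error_rate_percent(all_errors, total_requests)
    server = error_rate_percent(server_errors, total_requests)
    return overall > 1.0 or server > 0.5

# Hypothetical window: 20,000 requests, 150 errors in total, 120 of them 5xx.
print(error_rate_percent(150, 20_000), should_alert(20_000, 150, 120))
```

In this window the overall rate stays inside the Warning band, but the 5xx alert still fires because server errors make up most of the failures.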
Infrastructure
Priority: high | Type: gauge

CPU Utilization

Percentage of CPU capacity being used

Formula:

CPU Utilization = (CPU Used / CPU Total) × 100

Thresholds:

Good: < 70%
Warning: 70% - 85%
Critical: > 85%

Examples:

  • Server CPU usage
  • Container CPU usage
  • Database CPU usage

Monitoring Tools:

Prometheus, CloudWatch, DataDog, Grafana
Infrastructure
Priority: high | Type: gauge

Memory Utilization

Percentage of memory being used

Formula:

Memory Utilization = (Memory Used / Memory Total) × 100

Thresholds:

Good: < 80%
Warning: 80% - 90%
Critical: > 90%

Examples:

  • RAM usage
  • JVM heap usage
  • Container memory usage

Monitoring Tools:

Prometheus, CloudWatch, DataDog, Grafana
Infrastructure
Priority: high | Type: gauge

Disk Utilization

Percentage of disk space being used

Formula:

Disk Utilization = (Disk Used / Disk Total) × 100

Thresholds:

Good: < 80%
Warning: 80% - 90%
Critical: > 90%

Examples:

  • File system usage
  • Database storage usage
  • Log file storage

Monitoring Tools:

Prometheus, CloudWatch, Nagios, Zabbix
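
The CPU, memory, and disk metrics above share one shape: used divided by total, times 100, compared against a warning and a critical limit. The sketch below shows that shape once and applies it to disk space using Python's standard `shutil.disk_usage`. The helper names and the choice of the current directory (`"."`) as the path to inspect are assumptions for illustration.

```python
import shutil

def utilization_percent(used, total):
    """Utilization = (Used / Total) x 100."""
    if total <= 0:
        raise ValueError("total capacity must be positive")
    return used / total * 100

def classify(percent, warning_at, critical_at):
    """Good below warning_at, Critical above critical_at, Warning in between."""
    if percent > critical_at:
        return "critical"
    if percent >= warning_at:
        return "warning"
    return "good"

# Disk bands from the list above: warning at 80%, critical above 90%.
usage = shutil.disk_usage(".")  # named tuple with total, used, free (bytes)
disk_percent = utilization_percent(usage.used, usage.total)
print(f"disk: {disk_percent:.1f}% ({classify(disk_percent, 80, 90)})")

# The same helper with the CPU bands (70% / 85%) on a hypothetical reading.
print(classify(75, 70, 85))
```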
Database
Priority: high | Type: gauge

Database Connection Pool

Share of the pool's maximum database connections that are currently active

Formula:

Pool Utilization = (Active Connections / Max Connections) × 100

Thresholds:

Good: < 70%
Warning: 70% - 85%
Critical: > 85%

Examples:

  • PostgreSQL connections
  • MySQL connections
  • MongoDB connections

Monitoring Tools:

Prometheus, DataDog, New Relic, CloudWatch
Database
Priority: critical | Type: histogram

Database Query Time

Time taken to execute database queries

Formula:

Query Time = Query End Time - Query Start Time

Thresholds:

Good: < 100ms
Warning: 100ms - 500ms
Critical: > 500ms

Examples:

  • SELECT query execution time
  • INSERT/UPDATE query time
  • Complex JOIN query time

Monitoring Tools:

Prometheus, DataDog, New Relic, pganalyze
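
One way to collect the query time defined above is to wrap each query in a timer that records end time minus start time. The sketch below does this with a context manager around an in-memory SQLite database. The `timed_query` helper, the `orders` table, and the plain list used as a sink are all hypothetical; a real setup would feed the duration into a histogram in the chosen monitoring tool.

```python
import sqlite3
import time
from contextlib import contextmanager

query_times_ms = []  # hypothetical sink; stands in for a histogram metric

@contextmanager
def timed_query(sink):
    """Record Query End Time - Query Start Time, in milliseconds, into sink."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the query raises, so failed queries are timed too.
        sink.append((time.perf_counter() - start) * 1000)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany(
    "INSERT INTO orders (total) VALUES (?)",
    [(i * 1.5,) for i in range(1000)],
)

with timed_query(query_times_ms):
    rows = conn.execute("SELECT COUNT(*), SUM(total) FROM orders").fetchall()

conn.close()
print(rows[0][0], "rows counted in", round(query_times_ms[0], 3), "ms")
```

`time.perf_counter` is used rather than wall-clock time because it is monotonic, so a clock adjustment during the query cannot produce a negative duration.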
Database
Priority: high | Type: counter

Database Deadlocks

Number of database deadlocks per time period

Formula:

Deadlock Rate = Deadlocks / Time Period

Thresholds:

Good: 0 per hour
Warning: 1-5 per hour
Critical: > 5 per hour

Examples:

  • Transaction deadlocks
  • Lock wait timeouts
  • Concurrent update conflicts

Monitoring Tools:

Prometheus, DataDog, CloudWatch, Database logs
Network
Priority: high | Type: histogram

Network Latency

Time for data to travel between network endpoints

Formula:

One-Way Latency ≈ Round Trip Time / 2 (assumes a symmetric network path)

Thresholds:

Good: < 50ms
Warning: 50ms - 100ms
Critical: > 100ms

Examples:

  • Service-to-service latency
  • Database connection latency
  • CDN latency

Monitoring Tools:

Prometheus, Grafana, Pingdom, CloudWatch
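
The latency formula above halves the round trip time, which only approximates one-way latency when the outbound and return paths behave alike. The sketch below applies it to a handful of round trip samples. Taking the median of the samples before halving is a choice made here to damp outliers, not something the formula prescribes, and the function names and sample values are hypothetical.

```python
import statistics

def one_way_latency_ms(rtt_samples_ms):
    """Estimate one-way latency as half the median round trip time (symmetric path assumed)."""
    if not rtt_samples_ms:
        raise ValueError("need at least one round trip sample")
    return statistics.median(rtt_samples_ms) / 2

def classify_latency(ms):
    """Map a latency value onto the Good / Warning / Critical bands listed above."""
    if ms < 50:
        return "good"
    if ms <= 100:
        return "warning"
    return "critical"

# Hypothetical round trip times in milliseconds, with one spike.
rtts = [84.0, 90.0, 88.0, 250.0, 86.0]
latency = one_way_latency_ms(rtts)
print(latency, classify_latency(latency))
```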
Network
Priority: medium | Type: gauge

Bandwidth Utilization

Percentage of network bandwidth being used

Formula:

Bandwidth Utilization = (Used Bandwidth / Total Bandwidth) × 100

Thresholds:

Good: < 70%
Warning: 70% - 85%
Critical: > 85%

Examples:

  • Ingress bandwidth usage
  • Egress bandwidth usage
  • Internal network traffic

Monitoring Tools:

Prometheus, CloudWatch, DataDog, SNMP monitoring
Business
Priority: high | Type: gauge

Active Users

Number of users actively using the system

Formula:

Active Users = Unique Users in Time Window

Thresholds:

Good: > 10,000
Warning: 5,000 - 10,000
Critical: < 5,000

Examples:

  • Daily Active Users (DAU)
  • Monthly Active Users (MAU)
  • Concurrent users

Monitoring Tools:

Google Analytics, Mixpanel, Amplitude, Custom metrics
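
"Unique users in time window" can be sketched as a set of user IDs filtered by event timestamp, where only the window length differs between DAU and MAU. The event tuples, user names, and the `active_users` helper below are hypothetical, and a 30-day window is assumed for "monthly".

```python
from datetime import datetime, timedelta

def active_users(events, window_end, window):
    """Count distinct user IDs with at least one event in (window_end - window, window_end]."""
    window_start = window_end - window
    return len({user for user, ts in events if window_start < ts <= window_end})

now = datetime(2024, 1, 31, 12, 0)
events = [
    ("alice", now - timedelta(hours=1)),
    ("alice", now - timedelta(hours=3)),   # same user twice: counted once
    ("bob", now - timedelta(days=2)),
    ("carol", now - timedelta(days=40)),   # outside both windows
]

dau = active_users(events, now, timedelta(days=1))
mau = active_users(events, now, timedelta(days=30))
print(dau, mau)
```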
Business
Priority: critical | Type: gauge

Conversion Rate

Percentage of users who complete desired actions

Formula:

Conversion Rate = (Conversions / Total Visitors) × 100

Thresholds:

Good: > 5%
Warning: 2% - 5%
Critical: < 2%

Examples:

  • Sign-up conversion rate
  • Purchase conversion rate
  • Email subscription rate

Monitoring Tools:

Google Analytics, Mixpanel, Optimizely, Custom dashboards
Security
Priority: high | Type: counter

Failed Login Attempts

Number of unsuccessful authentication attempts

Formula:

Failed Login Rate = Failed Logins / Time Period

Thresholds:

Good: < 100 per hour
Warning: 100 - 500 per hour
Critical: > 500 per hour

Examples:

  • Brute force attempts
  • Invalid credentials
  • Account lockouts

Monitoring Tools:

Prometheus, Splunk, ELK Stack, Security logs

Recommended Alerts:

  • Failed logins > 1000/hour
  • Account lockout threshold reached
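
A failed login rate per hour can be tracked with a sliding window that drops timestamps older than one hour. The class below is a minimal sketch of that idea: `FailedLoginWindow` and `classify_failed_logins` are hypothetical names, timestamps are plain seconds, and failures are assumed to be recorded in chronological order.

```python
from collections import deque

class FailedLoginWindow:
    """Counts failed logins seen within the last window_seconds."""

    def __init__(self, window_seconds=3600):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, ts):
        """Register one failed login at time ts (seconds, non-decreasing)."""
        self._timestamps.append(ts)

    def count(self, now):
        """Drop entries at or before now - window_seconds, then return what is left."""
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)

def classify_failed_logins(per_hour):
    """Map an hourly count onto the Good / Warning / Critical bands listed above."""
    if per_hour < 100:
        return "good"
    if per_hour <= 500:
        return "warning"
    return "critical"

window = FailedLoginWindow()
for second in range(150):          # hypothetical burst: one failure per second
    window.record(second)

during_burst = window.count(now=200)
an_hour_later = window.count(now=3700)
print(during_burst, classify_failed_logins(during_burst))
print(an_hour_later, classify_failed_logins(an_hour_later))
```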
Security
Priority: critical | Type: counter

Security Events

Number of security-related events detected

Formula:

Security Event Rate = Security Events / Time Period

Thresholds:

Good: 0 per hour
Warning: 1-5 per hour
Critical: > 5 per hour

Examples:

  • Intrusion attempts
  • Malware detection
  • Privilege escalation

Monitoring Tools:

SIEM tools, Splunk, ELK Stack, Security scanners

Monitoring Strategy

Critical priority: 6
High priority: 8
Medium priority: 1
Low priority: 0

Monitoring Best Practice: Start with Critical and High priority metrics to establish baseline monitoring. Implement alerting for all Critical metrics. Use dashboards to visualize trends and correlations across different metric categories.
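
The priority breakdown and the "alert on all Critical metrics" advice can be checked with a small lookup table. The sketch below lists the 15 metrics from this page with their priorities, tallies them, and selects the ones that should have alerting. The `METRICS` dictionary and variable names are hypothetical; only the metric names and priorities come from the page.

```python
from collections import Counter

# Metric name -> priority, as listed on this page.
METRICS = {
    "Response Time": "critical",
    "Throughput (RPS)": "critical",
    "Error Rate": "critical",
    "CPU Utilization": "high",
    "Memory Utilization": "high",
    "Disk Utilization": "high",
    "Database Connection Pool": "high",
    "Database Query Time": "critical",
    "Database Deadlocks": "high",
    "Network Latency": "high",
    "Bandwidth Utilization": "medium",
    "Active Users": "high",
    "Conversion Rate": "critical",
    "Failed Login Attempts": "high",
    "Security Events": "critical",
}

counts = Counter(METRICS.values())

# Start with critical and high priority metrics; alert on every critical one.
baseline = [name for name, p in METRICS.items() if p in ("critical", "high")]
alert_on = sorted(name for name, p in METRICS.items() if p == "critical")

for priority in ("critical", "high", "medium", "low"):
    print(priority, counts[priority])
print(alert_on)
```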