Prometheus Deep Dive: Alerts, Debugging, and What Actually Happens When Things Don't Work (Week 2)

DevOps and Cloud Engineer sharing hands-on projects, real-world labs, and lessons from building and automating cloud infrastructure.

Last week I spun up a working observability stack from scratch. This week I went deeper. And I hit real problems.

I'm documenting both — because the debugging is where the actual learning happens.


What I set out to build

Three things:

  • Recording rules — pre-computing expensive PromQL queries

  • Alerting rules — five production-grade alerts with proper severity tiers and for durations

  • A live alert firing — watching the full lifecycle from inactive to pending to firing with my own eyes

I built all three. But not without friction.


First: understanding why one metric returns multiple results

Before writing any rules, I ran this query in Prometheus:

node_cpu_seconds_total

It returned not 1 result but many. Each one looked like this:

node_cpu_seconds_total{cpu="0", mode="idle",   instance="node-exporter:9100"}
node_cpu_seconds_total{cpu="0", mode="iowait", instance="node-exporter:9100"}
node_cpu_seconds_total{cpu="0", mode="system", instance="node-exporter:9100"}
node_cpu_seconds_total{cpu="0", mode="user",   instance="node-exporter:9100"}
node_cpu_seconds_total{cpu="1", mode="idle",   instance="node-exporter:9100"}
...

My machine has 2 CPU cores. Each core has multiple modes: idle, iowait, system, user, softirq, steal, irq, nice. Prometheus creates one completely separate time series per unique label combination.

This is the core of how Prometheus works. The metric name alone doesn't identify a series. The metric name PLUS its full label set identifies a series. Change any label value and you get a completely different data stream.
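To make the "name plus full label set" idea concrete, here is a minimal sketch in Python. It is a toy in-memory model, not how Prometheus stores data internally; `series_key`, `store`, and `append` are hypothetical names, and the sample values are made up.

```python
# Toy model: a series is identified by its metric name plus its complete
# label set, not by the name alone. Names and values are illustrative.
def series_key(name, labels):
    """Return a hashable identity for a time series."""
    return (name, frozenset(labels.items()))

store = {}

def append(name, labels, value):
    """Append a sample to the series identified by name + labels."""
    store.setdefault(series_key(name, labels), []).append(value)

append("node_cpu_seconds_total",
       {"cpu": "0", "mode": "idle", "instance": "node-exporter:9100"}, 1200.0)
append("node_cpu_seconds_total",
       {"cpu": "0", "mode": "user", "instance": "node-exporter:9100"}, 35.5)
append("node_cpu_seconds_total",
       {"cpu": "0", "mode": "idle", "instance": "node-exporter:9100"}, 1214.8)

# Same metric name, two distinct label sets, so two separate series.
print(len(store))
```

Changing only the `mode` value produced a second key, which mirrors the "change any label value and you get a different data stream" behaviour.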

This is also why I need avg by(instance) in my CPU query. Without it, the mode="idle" query draws one line per CPU core on my graph, and dropping the mode filter multiplies that by the number of modes. With it, all those series collapse into one number per machine.

# Without avg — multiple lines, unreadable
rate(node_cpu_seconds_total{mode="idle"}[5m])

# With avg by(instance) — one clean line per machine  
avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m]))
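What avg by(instance) does to those series can be sketched in a few lines of Python. This is only an illustration of the grouping step under made-up inputs; `avg_by` and the per-core idle rates are hypothetical.

```python
from collections import defaultdict

def avg_by(samples, label):
    """Group samples by one label and average each group, like avg by(label)."""
    groups = defaultdict(list)
    for labels, value in samples:
        groups[labels[label]].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

# Made-up per-core idle rates (fraction of each second spent idle).
idle_rates = [
    ({"cpu": "0", "instance": "node-exporter:9100"}, 0.90),
    ({"cpu": "1", "instance": "node-exporter:9100"}, 0.94),
]

# Two input series (one per core) collapse into one value per instance.
result = avg_by(idle_rates, "instance")
print(result)
```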

The four metric types — locked in before writing rules

Every metric in Prometheus is one of four types. Getting this wrong means writing queries that return garbage.

Counter: only ever goes up, apart from a reset to zero when the process restarts. Wrap it in rate() or increase() instead of querying it raw. The raw value only tells you the total since the last restart.

Gauge: goes up and down. Current value is what matters. Query directly.

Histogram: samples into buckets. Used for latency percentiles. Gives you _bucket, _count, _sum metrics.

Summary: similar to histogram but calculates quantiles client-side, so they cannot be aggregated across instances afterwards. Less flexible. Prefer histograms in modern stacks.
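Why a counter needs rate() is easier to see with a simplified calculation. The sketch below is an approximation, not the real implementation: it treats any drop in value as a reset to zero and skips the extrapolation that rate() performs at the window edges. `simple_rate` and the samples are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase across (timestamp, value) samples of a counter.

    Simplified: any drop in value is treated as a reset to zero, and there
    is no extrapolation at the window boundaries.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts from 0, so the whole
        # current value counts as increase.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# Made-up samples every 15 s; the process restarts before the last sample.
counter = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0)]
print(simple_rate(counter))
```

The raw values jump from 160 down to 10, which would look like a collapse on a graph, while the per-second rate stays positive because the reset is accounted for.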


Recording rules — pre-computing expensive queries

Some PromQL queries are expensive. If 10 Grafana panels each re-run the same complex expression every 15 seconds, that's 40 heavy computations per minute. Under load, Prometheus starts choking.
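The arithmetic behind that figure, as a quick sketch. The comparison case of a single evaluation per minute is an assumption for illustration, and `evaluations_per_minute` is a hypothetical helper.

```python
def evaluations_per_minute(consumers, refresh_seconds):
    """How many times an expression gets evaluated per minute."""
    return consumers * (60 / refresh_seconds)

# Numbers from the paragraph above: 10 panels refreshing every 15 s.
per_panel_queries = evaluations_per_minute(10, 15)

# Assumed comparison: the same expression evaluated once per minute
# by a single scheduled job instead of by every panel.
scheduled_once = evaluations_per_minute(1, 60)

print(per_panel_queries, scheduled_once)
```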

Recording rules solve this. Run the query on a schedule, store the result as a new metric. Grafana queries the cheap pre-computed result.

groups:
  - name: node_recording_rules
    interval: 1m
    rules:
      - record: job:node_cpu_usage:avg_percent
        expr: |
          100 - (
            avg by(instance)(
              rate(node_cpu_seconds_total{mode="idle"}[5m])
            ) * 100
          )

      - record: job:node_memory_used:percent
        expr: |
          (1 - (
            node_memory_MemAvailable_bytes /
            node_memory_MemTotal_bytes
          )) * 100

      - record: job:node_disk_used:percent
        expr: |
          (1 - (
            node_filesystem_avail_bytes{mountpoint="/"} /
            node_filesystem_size_bytes{mountpoint="/"}
          )) * 100
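The three expressions share two simple shapes, which the following Python sketch works through on made-up numbers. The function names are hypothetical and the inputs are not measurements from this stack.

```python
def cpu_used_percent(avg_idle_rate):
    """100 - idle%: avg_idle_rate is the idle fraction (0..1) averaged over cores."""
    return 100 - avg_idle_rate * 100

def used_percent(available, total):
    """(1 - available / total) * 100, the shape shared by the memory and disk rules."""
    return (1 - available / total) * 100

GIB = 1024 ** 3

# Made-up inputs: cores idle 92% of the time, 6 GiB available out of 8 GiB.
cpu = cpu_used_percent(0.92)
mem = used_percent(6 * GIB, 8 * GIB)
print(cpu, mem)
```

Both functions assume `total` is non-zero and that the inputs are already averaged or selected the way the PromQL expressions do it.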

After loading these rules, job:node_cpu_usage:avg_percent becomes a real queryable metric. I verified it:

curl -s http://localhost:9091/api/v1/rules | \
  jq '.data.groups[].rules[].name'

Output:

"HighCPUUsage"
"CriticalCPUUsage" 
"HighMemoryUsage"
"DiskFillingUp"
"InstanceDown"
"job:node_cpu_usage:avg_percent"
"job:node_memory_used:percent"
"job:node_disk_used:percent"

All 8 rules loaded. 5 alerting, 3 recording.
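The alerting versus recording split can also be checked programmatically. Each rule object in the /api/v1/rules response carries a "type" field. The sketch below counts by that field on a trimmed, hand-written payload; `count_rule_types` is a hypothetical helper and a real response contains more fields and all eight rules.

```python
import json
from collections import Counter

def count_rule_types(payload):
    """Count rules by their "type" field in a /api/v1/rules response body."""
    return Counter(
        rule["type"]
        for group in payload["data"]["groups"]
        for rule in group["rules"]
    )

# Trimmed, hand-written payload for illustration only.
sample = json.loads("""
{"status": "success", "data": {"groups": [
  {"name": "node_alerts", "rules": [
    {"name": "HighCPUUsage", "type": "alerting"},
    {"name": "InstanceDown", "type": "alerting"}]},
  {"name": "node_recording_rules", "rules": [
    {"name": "job:node_cpu_usage:avg_percent", "type": "recording"}]}
]}}
""")

counts = count_rule_types(sample)
print(counts["alerting"], counts["recording"])
```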


Alerting rules

groups:
  - name: node_alerts
    rules:
      - alert: HighCPUUsage
        expr: job:node_cpu_usage:avg_percent > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU at {{ $value | humanize }}% for 2+ minutes"

      - alert: CriticalCPUUsage
        expr: job:node_cpu_usage:avg_percent > 95
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU on {{ $labels.instance }}"
          description: "CPU at {{ $value | humanize }}%, act now"

      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
          description: "{{ $labels.job }} unreachable for 1 minute"

Three things that make these production-grade:

The for duration filters noise. Without for: 2m, the alert fires the first time the expression evaluates true, so one brief spike can wake someone up. With it, the condition must stay true at every evaluation for 2 full minutes before anything fires.

Template variables give context. {{ $labels.instance }} injects the actual machine name. {{ $value | humanize }} formats the number cleanly. Without these, every alert says "something is wrong" with zero context.

Two severity tiers for CPU. Warning at 80% — look at it soon. Critical at 95% — wake someone up now. One threshold is never enough for resource alerts.
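How the for duration filters a brief spike can be sketched as a small state machine. This is a simplified model written for illustration, not Prometheus code: `alert_states`, the 30-second evaluation interval, and the input sequences are all assumptions.

```python
def alert_states(condition_by_eval, eval_interval, for_seconds):
    """Return the alert state after each rule evaluation.

    Simplified model: pending starts at the first true evaluation, firing
    begins once the condition has stayed true for for_seconds, and any
    false evaluation resets the alert to inactive.
    """
    states = []
    active_since = None
    for i, is_true in enumerate(condition_by_eval):
        now = i * eval_interval
        if not is_true:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states

# Evaluations every 30 s with for: 2m (120 s); inputs are made up.
spike = [False, True, False, False, False, False]
sustained = [False, True, True, True, True, True]

print(alert_states(spike, 30, 120))
print(alert_states(sustained, 30, 120))
```

The single true evaluation in `spike` only ever reaches pending, while `sustained` reaches firing on the evaluation where the condition has held for the full 120 seconds.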

Setting up the stack — containers running

With the config files written, I spun up the stack:

docker compose up -d
docker compose ps

Three things happened in sequence that matter:

First, Docker created a dedicated bridge network called module-2_monitoring. Every container joins this network automatically. This is why Prometheus can reach node-exporter by name — node-exporter:9100 — without any IP address. Docker's internal DNS resolves container names on this network.

Second, named volumes were created for Prometheus data. Without these, every container restart loses all collected metrics. The volume persists the time-series database across restarts.

Third, the rules directory was mounted into the Prometheus container at /etc/prometheus/rules. Prometheus reads every .yml file in that directory on startup and on hot-reload. Adding a new rule file requires no restart — just drop the file and call the reload endpoint.

Verification that the rules actually loaded: the same /api/v1/rules query shown earlier returned the same eight rule names.

8 rules loaded. Recording rules creating real queryable metrics. Alert rules watching real conditions.

I also verified the recording rule was actually computing data:

curl -s http://localhost:9091/api/v1/query \
  --data-urlencode 'query=job:node_cpu_usage:avg_percent' | \
  jq '.data.result'

Output showed "job:node_cpu_usage:avg_percent" returning 8.43% — a real computed metric now queryable just like any native Prometheus metric.

Problem 1 — the alert never fired

I ran stress --cpu 4 --timeout 180s and waited. The alert stayed INACTIVE. I ran it again. Same result.

I didn't guess. I queried the actual data:

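# start/end are Unix timestamps; date -d is GNU date syntax
curl -sG http://localhost:9091/api/v1/query_range \
  --data-urlencode 'query=job:node_cpu_usage:avg_percent' \
  --data-urlencode "start=$(date -d '15 minutes ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=30s' | \
  jq '[.data.result[0].values[] |
    {time: (.[0] | todate), cpu: .[1]}]'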

Output showed CPU peaked at 54% — never crossed 80%.

My working explanation was WSL2. It runs the Linux kernel inside a lightweight virtual machine, so stress does not get the unrestricted CPU access it would have on bare-metal Linux. The Windows host ultimately schedules the CPU time.
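One factor worth checking independently of the host is the [5m] window in the recording rule. rate() averages over the whole window, so a burst shorter than five minutes is blended with the quiet time around it. A rough estimate of that effect, assuming the CPU is fully saturated during the burst and sits near the idle baseline otherwise (`windowed_cpu_percent` and both percentages are assumptions, not measurements):

```python
def windowed_cpu_percent(burst_seconds, window_seconds, busy_percent, idle_percent):
    """Average CPU usage over a window that only partly overlaps a load burst."""
    burst = min(burst_seconds, window_seconds)
    quiet = window_seconds - burst
    return (burst * busy_percent + quiet * idle_percent) / window_seconds

# Assumed inputs: 100% during the burst, about 8% otherwise.
short_run = windowed_cpu_percent(180, 300, 100, 8)  # 180 s burst, 5m window
long_run = windowed_cpu_percent(300, 300, 100, 8)   # burst covers the window
print(short_run, long_run)
```

Under these assumptions a 180-second run cannot read much above the low 60s even on an unthrottled machine, so the length of the stress run is worth ruling out before blaming the platform.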

The fix was not to work around WSL2. The fix was to adjust the thresholds to match reality.

In production you never copy alert thresholds from the internet. You observe your system's actual baseline and set thresholds based on real data. Running the stress test and querying the actual values — that IS the right approach.

My data showed:

  • Idle baseline: 7-10%

  • Maximum under stress: ~54%

  • 80% threshold: never reached during the 180-second stress runs

So I updated the thresholds:

- alert: HighCPUUsage
  expr: job:node_cpu_usage:avg_percent > 40
  for: 2m

- alert: CriticalCPUUsage  
  expr: job:node_cpu_usage:avg_percent > 50
  for: 1m

Hot-reloaded without restarting (the reload endpoint only responds when Prometheus is started with --web.enable-lifecycle):

curl -X POST http://localhost:9091/-/reload

Problem 2: missing the state change because stress finished too fast

In the first few attempts I ran stress and then switched to the browser, but the run was too short relative to the for duration. Any state change had already come and gone before I saw it in the browser.

The solution: two terminals side by side. Watch command running BEFORE the stress test starts.

Left terminal:

watch -n 10 "curl -s http://localhost:9091/api/v1/alerts | \
  jq '.data.alerts[] | 
    {name: .labels.alertname, state: .state}'"

Right terminal:

stress --cpu 4 --timeout 300s
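The jq filter in the watch command can be mirrored in Python to log only the state changes between polls. The sketch below works on hand-written payloads shaped like the /api/v1/alerts response; `alert_states_by_name` and `transitions` are hypothetical helpers, and keying by alertname alone assumes one instance per alert, as in this single-node setup.

```python
def alert_states_by_name(payload):
    """Map alertname -> state from a /api/v1/alerts response body."""
    return {a["labels"]["alertname"]: a["state"] for a in payload["data"]["alerts"]}

def transitions(previous, current):
    """List (name, old_state, new_state) for alerts whose state changed.

    Alerts absent from a poll are treated as inactive, since the endpoint
    only lists pending and firing alerts.
    """
    changes = []
    for name in sorted(set(previous) | set(current)):
        old = previous.get(name, "inactive")
        new = current.get(name, "inactive")
        if old != new:
            changes.append((name, old, new))
    return changes

# Two hand-written polls, trimmed to the fields used here.
poll_1 = {"data": {"alerts": [
    {"labels": {"alertname": "HighCPUUsage"}, "state": "pending"}]}}
poll_2 = {"data": {"alerts": [
    {"labels": {"alertname": "HighCPUUsage"}, "state": "firing"}]}}

first = transitions({}, alert_states_by_name(poll_1))
second = transitions(alert_states_by_name(poll_1), alert_states_by_name(poll_2))
print(first)
print(second)
```

Recording transitions with a timestamp on each poll would give the same lifecycle view as the two-terminal setup without having to watch the screen at the right moment.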

The alert lifecycle — watching it happen in real time

With both terminals running, I watched this exact sequence:

00:16:55 — PENDING

{"name": "HighCPUUsage",    "state": "pending"}
{"name": "CriticalCPUUsage","state": "pending"}

00:17:38 — FIRING

{"name": "HighCPUUsage", "state": "firing"}

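43 seconds between the first pending state my watch loop caught and firing. That is shorter than the for: 2m window, which suggests the alert had already been pending for a while before the first poll showed it. Either way, the alert promoted from pending to firing as designed.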

The browser confirmed it — FIRING (2) in red. Both CPU alerts active simultaneously. CPU was at 88.4% according to the alert details panel. Two severity tiers working correctly — warning fired first, critical followed.


What the expanded alert detail showed

When I clicked into the firing HighCPUUsage alert, the details showed:

State:        FIRING
Active since: 6m 15s
Value:        88.40366747659778
Instance:     node-exporter:9100
Severity:     warning

The template variables filled in correctly. The instance label showed exactly which machine. The value showed the actual CPU percentage. This is what makes alerts useful at 3am — complete context, no guessing.


The real lesson from this module

Debugging an alert that doesn't fire is real production work. The steps I took — query the actual metric values, check what CPU really reached, adjust thresholds based on data — that's exactly what you do in production.

Most tutorials show you a clean run. Clean runs don't teach you to think. Hitting a wall, diagnosing it systematically, and fixing it — that's what actually makes you better.

Alerts are a contract. When an alert fires, someone's phone buzzes. Bad alert rules break that contract. Getting the for duration right, setting thresholds based on real observed data, and writing meaningful template messages — that's how you maintain trust in your alerting system.


Module 2 checkpoint — all verified

Prometheus M2 health: Prometheus Server is Healthy.
Rules loaded: 8 rules
Recording rule works: 1 result(s)
Alert rules: 5 alerting rules

Module 3 next — Grafana dashboards as code. Provisioning from config files, variables, thresholds, and a dashboard that survives container restarts without any manual steps.

#DevOps #Prometheus #Observability #SRE #LearningInPublic #CloudEngineering #PromQL #Monitoring