Grafana Dashboards as Code

DevOps and Cloud Engineer sharing hands-on projects, real-world labs, and lessons from building and automating cloud infrastructure.

Grafana Dashboards as Code: Destroy the Container, the Dashboard Survives (Week 3)

Someone commented on my last post:

"Don't forget the nice gradient shade, give it some bling."

Challenge accepted.

But before I got to the visuals, I had to solve a more fundamental problem first.


The problem with clicking dashboards together

In Module 1, I built a Grafana dashboard by running a curl command against the API. It worked. The graphs showed up. I was happy.

Then I thought about production.

What happens when the Grafana container crashes and gets recreated without its data volume? What happens when you deploy to a new environment? Every dashboard built through the UI is gone. You'd have to remember what you built, recreate every panel, reconnect the datasource. Manual work. Error-prone. Unacceptable.

The solution is provisioning. You declare your dashboards, datasources, and settings as files. Grafana reads those files on startup. The container becomes completely disposable — destroy it, recreate it, everything comes back exactly as it was.

That's what this module builds.


The directory structure

module-3/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── rules/
│       ├── recording.rules.yml
│       └── alerting.rules.yml
└── grafana/
    ├── provisioning/
    │   ├── datasources/
    │   │   └── prometheus.yml
    │   └── dashboards/
    │       └── default.yml
    └── dashboards/
        └── node-overview.json

This structure is not arbitrary. It mirrors exactly how production Grafana deployments are managed. Everything under grafana/provisioning/ is mounted into the container at /etc/grafana/provisioning/. Everything under grafana/dashboards/ is mounted at /var/lib/grafana/dashboards/. Grafana knows where to look at startup.


Part 1 — Datasource as code

Instead of adding Prometheus through the API as in Module 1, I declared it in a YAML file:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: "15s"
      httpMethod: POST

Three fields matter beyond the obvious:

editable: false — the datasource cannot be modified through the UI. The file is the only source of truth. Nobody can accidentally break the connection by clicking the wrong thing.

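timeInterval: "15s" tells Grafana the scrape interval, which it uses as the lower bound for the query step. It matches the Prometheus scrape interval. If the step drops below the scrape interval, Grafana asks for more points than Prometheus has samples, and short rate windows can end up with too few samples to compute anything, which shows up as gaps.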

httpMethod: POST — complex PromQL queries with long label matchers can exceed GET request URL length limits. POST handles them cleanly.
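To make the timeInterval point concrete, here is a rough Python sketch of how a minimum interval clamps the query step. It is a simplification and an assumption on my part: Grafana's real calculation also rounds to tidy step sizes, and the function name and the 1000-point panel width are made up for illustration.

```python
def query_step_seconds(range_seconds: float, max_data_points: int,
                       min_interval_seconds: float) -> float:
    """Spread the time range over the available points, never below the minimum interval."""
    if max_data_points <= 0:
        raise ValueError("max_data_points must be positive")
    raw_step = range_seconds / max_data_points
    return max(raw_step, min_interval_seconds)


# Last 15 minutes on a 1000-point panel: raw step is 0.9s, clamped up to 15s
print(query_step_seconds(15 * 60, 1000, 15))
# Last 24 hours on the same panel: raw step is 86.4s, already above the minimum
print(query_step_seconds(24 * 3600, 1000, 15))
```

On short time ranges the clamp is what keeps the step from dropping below one scrape.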


Part 2 — Dashboard provisioning config

This file tells Grafana where to find dashboard JSON:

apiVersion: 1

providers:
  - name: default
    orgId: 1
    folder: "Node Monitoring"
    type: file
    disableDeletion: true
    updateIntervalSeconds: 30
    allowUiUpdates: false
    options:
      path: /var/lib/grafana/dashboards

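disableDeletion: true means Grafana keeps a provisioned dashboard in its database even if the JSON file later disappears from disk, so a botched mount or a bad deploy can't silently wipe dashboards. Provisioned dashboards can't be deleted through the UI either way. The trade-off is that removing a dashboard for good takes more than deleting its file.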

updateIntervalSeconds: 30 — Grafana checks for changes to dashboard JSON files every 30 seconds. Edit the file, save it, the dashboard updates live without any restart. This is how you iterate on dashboards in production without touching a running system.

allowUiUpdates: false means that if you edit a panel in the Grafana UI and try to save, Grafana will not write the change over the provisioned dashboard. The file is always the source of truth. This prevents dashboard drift, where the UI version and the file version silently diverge and nobody knows which one is correct.
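Because the files are the only source of truth, a broken JSON file is a broken dashboard. Here is a minimal sketch of a pre-commit check I could run over grafana/dashboards/. The function names are hypothetical, and requiring a "uid" is my own preference for stable dashboard URLs, not something Grafana enforces.

```python
import json
from pathlib import Path


def dashboard_problems(name: str, text: str) -> list[str]:
    """Return human-readable problems found in one dashboard JSON document."""
    try:
        dashboard = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"{name}: invalid JSON ({exc.msg})"]
    if not isinstance(dashboard, dict):
        return [f"{name}: top level is not a JSON object"]
    # "title" and "uid" are the two fields this sketch insists on
    return [f"{name}: missing '{key}'" for key in ("title", "uid") if not dashboard.get(key)]


def check_directory(directory: str) -> list[str]:
    """Run dashboard_problems over every *.json file in a directory."""
    problems = []
    for path in sorted(Path(directory).glob("*.json")):
        problems.extend(dashboard_problems(path.name, path.read_text(encoding="utf-8")))
    return problems


if __name__ == "__main__":
    for problem in check_directory("grafana/dashboards"):
        print(problem)
```

An empty result means every file at least parses and carries a title and uid. It says nothing about whether the panels themselves are correct.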


Part 3 — The dashboard JSON

Four panels: CPU usage, memory used, disk used, and target health. Here are the parts that matter beyond basic panel configuration.

Variables — the Instance dropdown

"templating": {
  "list": [
    {
      "name": "instance",
      "type": "query",
      "datasource": "Prometheus",
      "query": "label_values(up, instance)",
      "refresh": 2,
      "includeAll": true,
      "allValue": ".*",
      "label": "Instance",
      "sort": 1
    }
  ]
}

label_values(up, instance) queries all unique values of the instance label from the up metric automatically. The dropdown auto-populates with every scrape target. Add a new target to Prometheus — it appears in the dropdown with zero dashboard changes.

includeAll: true with allValue: ".*" means selecting "All" uses a regex that matches everything. Every panel query uses {instance=~"$instance"} so selecting one instance filters all panels to that machine simultaneously.
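A small Python sketch of what the =~ matcher does with the variable's value. Two assumptions to flag: Prometheus uses RE2 rather than Python's re module (they agree on simple patterns like these), and the target names are made up. Prometheus anchors regex matchers at both ends, which is why fullmatch is the right stand-in.

```python
import re


def matching_instances(selected: str, instances: list[str]) -> list[str]:
    """Mimic instance=~"<selected>" with a fully anchored regex match."""
    pattern = re.compile(selected)
    return [name for name in instances if pattern.fullmatch(name)]


targets = ["node-exporter:9100", "localhost:9090"]  # hypothetical scrape targets

print(matching_instances(".*", targets))                  # "All" selected
print(matching_instances("node-exporter:9100", targets))  # one instance selected
print(matching_instances("node", targets))                # partial text matches nothing
```

The last call is the useful surprise: because the match is anchored, a partial instance name selects nothing rather than everything that contains it.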

The bling — and why most tutorials get it wrong

Most Grafana tutorials show you how to set threshold values. What they don't tell you is that setting the values alone does nothing to the line colour. You need one additional field that ties everything together:

"fieldConfig": {
  "defaults": {
    "color": { "mode": "thresholds" },
    "thresholds": {
      "mode": "absolute",
      "steps": [
        { "color": "green",  "value": null },
        { "color": "yellow", "value": 40   },
        { "color": "red",    "value": 50   }
      ]
    },
    "custom": {
      "lineWidth": 2,
      "fillOpacity": 15,
      "gradientMode": "scheme",
      "thresholdsStyle": { "mode": "line+area" }
    }
  }
}

"color": { "mode": "thresholds" } — this is the missing piece. Without it, thresholds are defined but Grafana never applies them to the line colour. The graph stays the default green regardless of what value it hits. This single field activates everything.

gradientMode: "scheme" — applies the gradient fill under the line using the threshold colour scheme. Below 40% the fill is green-tinted. Above 40% it shifts to yellow. Above 50% it turns red. The fill changes colour continuously as the value changes — not just a solid block but a smooth gradient that makes the severity immediately obvious.

thresholdsStyle with mode "line+area" draws visible horizontal lines at exactly 40% and 50% directly on the graph panel. You can see where the boundaries are even before any value crosses them. The "area" half shades each threshold band behind the graph in its threshold colour. The lines are solid in this mode; "dashed+area" is the variant that draws them dashed.

The result: a line that is green when healthy, yellow when warning, and red when critical. A gradient fill that changes colour with it. Threshold markers sitting on the graph showing exactly where the limits are. This is the "bling" — and it carries real meaning, not just decoration.
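My reading of how the steps resolve to a colour: the highest step whose value the reading has reached wins, and the first step (value null) is the base colour. A minimal Python sketch of that rule, assuming the steps are sorted ascending as in the JSON above. The function name is mine, not Grafana's.

```python
def threshold_colour(value: float, steps: list[dict]) -> str:
    """Return the colour of the highest step whose value the reading has reached."""
    colour = steps[0]["color"]  # base step, its value is None
    for step in steps[1:]:
        if value >= step["value"]:
            colour = step["color"]
    return colour


STEPS = [
    {"color": "green", "value": None},
    {"color": "yellow", "value": 40},
    {"color": "red", "value": 50},
]

for reading in (7, 40, 49.9, 50, 93):
    print(reading, threshold_colour(reading, STEPS))
```

Note that a reading of exactly 40 is already yellow and exactly 50 is already red, so the boundaries belong to the more severe band.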

Target Health panel

{
  "type": "stat",
  "title": "Target Health",
  "targets": [{ "expr": "up" }],
  "fieldConfig": {
    "defaults": {
      "color": { "mode": "thresholds" },
      "mappings": [{
        "type": "value",
        "options": {
          "0": { "text": "DOWN", "color": "red"   },
          "1": { "text": "UP",   "color": "green" }
        }
      }]
    }
  },
  "options": { "colorMode": "background" }
}

The up metric returns 1 for reachable targets and 0 for unreachable. The value mapping converts those numbers into human-readable text with full background colours. At a glance you can see which targets are healthy without reading a single number.
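Roughly what the value mapping does, as a Python sketch. The lookup logic and the fallback for unmapped values are my assumptions about the behaviour, and the target names are invented.

```python
def display_value(sample: float, mappings: dict) -> dict:
    """Look up a sample in a {"<value>": {"text": ..., "color": ...}} mapping."""
    # Mapping keys are strings, so 1.0 has to become "1" before the lookup
    key = str(int(sample)) if float(sample).is_integer() else str(sample)
    # Unmapped values fall back to the raw number with no colour override
    return mappings.get(key, {"text": str(sample), "color": None})


UP_MAPPING = {
    "0": {"text": "DOWN", "color": "red"},
    "1": {"text": "UP", "color": "green"},
}

for target, sample in {"node-exporter:9100": 1.0, "localhost:9090": 0.0}.items():
    shown = display_value(sample, UP_MAPPING)
    print(target, shown["text"], shown["color"])
```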


Part 4 — Wiring it together in docker-compose

The key part that makes provisioning work:

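grafana:
  volumes:
    # Grafana's internal database
    - grafana_data:/var/lib/grafana
    # Datasource and dashboard provider configs
    - ./grafana/provisioning:/etc/grafana/provisioning
    # Dashboard JSON files
    - ./grafana/dashboards:/var/lib/grafana/dashboards
  environment:
    - GF_SECURITY_ADMIN_USER=admin
    - GF_SECURITY_ADMIN_PASSWORD=devops123
    - GF_USERS_ALLOW_SIGN_UP=false
    # Keep this on one line: a wrapped value folds into a string with a space after "="
    - GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH=/var/lib/grafana/dashboards/node-overview.json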

Three volume mounts, each with a specific purpose:

grafana_data:/var/lib/grafana — persists Grafana's internal database across restarts.

./grafana/provisioning:/etc/grafana/provisioning — mounts your datasource and dashboard provider configs into the container. Grafana reads this directory on startup automatically.

./grafana/dashboards:/var/lib/grafana/dashboards — mounts your dashboard JSON files. Grafana polls this directory every 30 seconds for changes.

GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH — sets your Node Overview as the home page. Every engineer who opens Grafana lands directly on your dashboard, not the default welcome screen.


Part 5 — Starting the stack

docker compose up -d
docker compose ps

Six resources created with green checkmarks — network, two volumes, three containers: prometheus-m3, node-exporter-m3, grafana-m3. All running and healthy.

Startup verification:

Prometheus: Prometheus Server is Healthy.
Grafana: ok
Datasource: Prometheus
Dashboards: 1 dashboard(s)

Dashboard loads at http://localhost:3001. Instance dropdown populated automatically. All four panels showing live data. Both targets showing UP in green.
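A minimal Python sketch of the two health checks behind that output. It uses Prometheus's /-/healthy endpoint and Grafana's /api/health endpoint. The Grafana port 3001 comes from this post; the Prometheus host port 9090 is an assumption, since the port mapping isn't shown here.

```python
import json
import urllib.request


def fetch_text(url: str, timeout: float = 5.0) -> str:
    """GET a URL and return the response body as stripped text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def grafana_database_status(health_body: str) -> str:
    """Pull the "database" field out of Grafana's /api/health JSON."""
    return json.loads(health_body).get("database", "unknown")


def main() -> None:
    try:
        # Assumed host port for Prometheus; adjust to your compose file
        print("Prometheus:", fetch_text("http://localhost:9090/-/healthy"))
        print("Grafana:", grafana_database_status(fetch_text("http://localhost:3001/api/health")))
    except OSError as exc:
        # URLError and timeouts are both OSError subclasses
        print("Stack not reachable:", exc)


if __name__ == "__main__":
    main()
```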


Part 6 — The real test: destroy and rebuild

This is the proof that provisioning actually works.

# Destroy Grafana completely
docker compose stop grafana
docker compose rm -f grafana
docker volume rm module-3_grafana_data

# Rebuild from zero
docker compose up -d grafana

Wait 15 seconds. Open http://localhost:3001.

The dashboard is back. Datasource connected. All panels showing data. Instance dropdown populated. Zero manual steps. Zero clicking. Zero reconfiguring.

The container is completely disposable. The configuration files are the system. That is infrastructure as code working exactly as it should.
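To confirm the rebuild without opening a browser, here is a hedged Python sketch that asks Grafana's /api/search endpoint which dashboards exist. The helper names are hypothetical, and the credentials are the lab ones from the compose file above.

```python
import base64
import json
import urllib.request


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def dashboard_titles(search_body: str) -> list[str]:
    """Extract dashboard titles (not folders) from an /api/search JSON array."""
    items = json.loads(search_body)
    return sorted(item["title"] for item in items if item.get("type") == "dash-db")


def fetch_dashboards(base_url: str, user: str, password: str) -> list[str]:
    """Query a running Grafana for its dashboards."""
    request = urllib.request.Request(
        f"{base_url}/api/search?type=dash-db",
        headers={"Authorization": basic_auth_header(user, password)},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return dashboard_titles(response.read().decode("utf-8"))
```

With the stack running, fetch_dashboards("http://localhost:3001", "admin", "devops123") should include the provisioned dashboard's title if the rebuild worked.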


Part 7 — Watching the dashboard respond live

With the dashboard open in the browser and the time range set to Last 15 minutes, I ran a stress test:

stress --cpu 4 --timeout 300s &

And watched the dashboard and alerts page simultaneously.

What happened:

The CPU panel line was flat green at ~7% baseline. As the stress test loaded the CPUs, the line climbed. When it crossed 40%, the line and fill shifted from green to yellow, the warning zone. When it crossed 50%, they turned red, the critical zone. The threshold lines drawn at 40% and 50% made the boundaries completely clear.

On the Prometheus alerts page at the same time, HighCPUUsage went inactive → pending → firing. CriticalCPUUsage followed.

When stress ended, the CPU line fell back through yellow and returned to green. Alerts resolved automatically. The full incident lifecycle (rise, threshold breach, recovery) was visible in one colour-coded dashboard view.

This is what observability customisation actually means. Not just having dashboards, but having dashboards that tell the same story as your alerts, in a visual language any engineer can read instantly.


One thing that confused me — and the lesson in it

When I switched the time range to Last 15 minutes during a quiet period, all the panels showed flat lines. No spike visible.

My first instinct was that something broke.

Nothing broke. The spike had happened 45 minutes earlier, and Last 15 minutes only shows the most recent 15-minute window. Switch back to Last 1 hour and the spike reappears.

This is fundamental to understanding time-series dashboards. The time range never changes or deletes stored data; it only decides which slice of it gets queried and rendered. Understanding this stops a whole category of "why is my dashboard empty" confusion.
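The same situation as a tiny Python sketch with made-up samples: one reading every five minutes for an hour, with a spike 45 minutes ago. The data never changes; only the window does.

```python
def visible_samples(samples, now, window_seconds):
    """Keep only (timestamp, value) pairs inside [now - window, now]."""
    start = now - window_seconds
    return [(ts, value) for ts, value in samples if start <= ts <= now]


def peak(rows):
    """Highest value among (timestamp, value) pairs."""
    return max(value for _, value in rows)


NOW = 10_000  # arbitrary "current" time in seconds
SPIKE_AT = NOW - 45 * 60

# Baseline of 7% everywhere except the spike 45 minutes ago
samples = [
    (NOW - minutes * 60, 93.0 if NOW - minutes * 60 == SPIKE_AT else 7.0)
    for minutes in range(60, -1, -5)
]

print(peak(visible_samples(samples, NOW, 15 * 60)))  # 7.0, spike is outside the window
print(peak(visible_samples(samples, NOW, 60 * 60)))  # 93.0, spike is inside the window
```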


Module 3 checkpoint — all verified

Prometheus health: Prometheus Server is Healthy.
Grafana health: ok
Datasource provisioned: Prometheus
Dashboard provisioned: Node Overview
Recording rules: 3 rules
=== Checkpoint complete ===

Provisioning working. Dashboard surviving container restarts. Colour-coded thresholds active. Live incident response visible in real time.


What I understand now that I didn't before

Clicking dashboards together is not engineering. It produces something that works today and breaks silently tomorrow. Real dashboards live in version control as JSON files, deploy automatically, and survive any infrastructure event without manual intervention.

Visual consistency is not optional. Dashboard thresholds and alert rule thresholds must match exactly. If your graph turns red at 50% but your alert fires at 80%, engineers stop trusting either one. Consistency is what makes an observability system reliable enough to act on.

Variables make dashboards scalable. A dashboard hardcoded for one instance serves one instance forever. A variable that auto-populates from Prometheus label values works for one instance today and fifty instances next month without a single dashboard edit.

The missing field changes everything. "color": { "mode": "thresholds" } is one line of JSON. Without it, all your threshold configuration is invisible. With it, the entire visual language of the dashboard activates. Details like this are what separate engineers who copy configs from engineers who understand them.
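The consistency point is easy to check mechanically. Here is a minimal Python sketch that compares dashboard threshold steps against alert thresholds. The alert values are hypothetical stand-ins, since the alerting rules file isn't shown in this post, and keying alerts by severity colour is just a convention I picked for the example.

```python
def threshold_mismatches(dashboard_steps: list[dict], alert_thresholds: dict) -> list[str]:
    """List every severity whose dashboard threshold differs from its alert threshold."""
    dashboard = {
        step["color"]: step["value"]
        for step in dashboard_steps
        if step["value"] is not None  # skip the base step
    }
    problems = []
    for colour, alert_value in alert_thresholds.items():
        if dashboard.get(colour) != alert_value:
            problems.append(f"{colour}: dashboard={dashboard.get(colour)} alert={alert_value}")
    return problems


STEPS = [
    {"color": "green", "value": None},
    {"color": "yellow", "value": 40},
    {"color": "red", "value": 50},
]

# Hypothetical alert thresholds: warning alert at 40, critical alert at 50 or 80
print(threshold_mismatches(STEPS, {"yellow": 40, "red": 50}))
print(threshold_mismatches(STEPS, {"yellow": 40, "red": 80}))
```

The first call returns an empty list; the second flags exactly the 50-versus-80 mismatch described above.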


Module 4 next — Loki log aggregation. Promtail pipelines, LogQL queries, and structured logging. The "why did it break" pillar of observability. The piece that metrics alone can never tell you.