Grafana Dashboards as Code

Grafana Dashboards as Code: Destroy the Container, the Dashboard Survives (Week 3)
Someone commented on my last post:
"Don't forget the nice gradient shade, give it some bling."
Challenge accepted.
But before I got to the visuals, I had to solve a more fundamental problem first.
The problem with clicking dashboards together
In Module 1, I built a Grafana dashboard by running a curl command against the API. It worked. The graphs showed up. I was happy.
Then I thought about production.
What happens when the Grafana container crashes and gets recreated without its data volume? What happens when you deploy to a new environment? Every dashboard built through the UI or the API lives only in Grafana's internal database, so it is gone. You'd have to remember what you built, recreate every panel, reconnect the datasource. Manual work. Error-prone. Unacceptable.
The solution is provisioning. You declare your dashboards, datasources, and settings as files. Grafana reads those files on startup. The container becomes completely disposable — destroy it, recreate it, everything comes back exactly as it was.
That's what this module builds.
The directory structure
module-3/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── rules/
│       ├── recording.rules.yml
│       └── alerting.rules.yml
└── grafana/
    ├── provisioning/
    │   ├── datasources/
    │   │   └── prometheus.yml
    │   └── dashboards/
    │       └── default.yml
    └── dashboards/
        └── node-overview.json
This structure is not arbitrary. It follows the layout that Grafana's file-based provisioning expects. Everything under grafana/provisioning/ is mounted into the container at /etc/grafana/provisioning/, which Grafana reads at startup. Everything under grafana/dashboards/ is mounted at /var/lib/grafana/dashboards/, the path the dashboard provider file points to.
Part 1 — Datasource as code
Instead of adding Prometheus via API like Module 1, I declared it as a YAML file:
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: "15s"
      httpMethod: POST
Three fields matter beyond the obvious:
editable: false — the datasource cannot be modified through the UI. The file is the only source of truth. Nobody can accidentally break the connection by clicking the wrong thing.
timeInterval: "15s" sets the lower bound for the step Grafana uses when it queries this datasource. It matches the Prometheus scrape interval. With a finer step, Grafana would ask for points that were never scraped, which can show up as gaps.
httpMethod: POST — complex PromQL queries with long label matchers can exceed GET request URL length limits. POST handles them cleanly.
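To make the httpMethod point concrete, here is a small Python sketch that measures how long a GET URL gets when a query carries a long regex matcher. The instance names, the fleet size, and the 2,000-character budget are illustrative assumptions, not documented Grafana or Prometheus limits; real limits depend on whatever proxies and servers sit in the path.

```python
from urllib.parse import urlencode

# Prometheus instant-query endpoint on the host used in the datasource file.
BASE_URL = "http://prometheus:9090/api/v1/query"

def get_url_for(query: str) -> str:
    """Return the full URL a GET request would need for this PromQL query."""
    return BASE_URL + "?" + urlencode({"query": query})

# Hypothetical fleet of 100 targets, selected one by one in a regex matcher.
instances = [f"node-{i:03d}.example.internal:9100" for i in range(100)]
query = 'up{instance=~"' + "|".join(instances) + '"}'

url = get_url_for(query)
URL_BUDGET = 2000  # illustrative budget only; not a documented limit

print("query length:", len(query))
print("GET URL length:", len(url))
print("over budget:", len(url) > URL_BUDGET)
```

With POST, the same query travels form-encoded in the request body, so the URL stays short no matter how long the matcher grows.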
Part 2 — Dashboard provisioning config
This file tells Grafana where to find dashboard JSON:
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: "Node Monitoring"
    type: file
    disableDeletion: true
    updateIntervalSeconds: 30
    allowUiUpdates: false
    options:
      path: /var/lib/grafana/dashboards
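disableDeletion: true means Grafana keeps a provisioned dashboard in its database even if the JSON file later disappears from disk. With the default of false, removing the file removes the dashboard. Either way, a file-provisioned dashboard cannot be deleted through the UI.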
updateIntervalSeconds: 30 — Grafana checks for changes to dashboard JSON files every 30 seconds. Edit the file, save it, the dashboard updates live without any restart. This is how you iterate on dashboards in production without touching a running system.
allowUiUpdates: false — if you edit a panel in the Grafana UI and save, that change does not persist. The file is always the source of truth. This prevents dashboard drift — where the UI version and the file version silently diverge and nobody knows which one is correct.
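The polling behaviour behind updateIntervalSeconds can be pictured with a short Python sketch. This is not Grafana's implementation, only a minimal illustration of the idea under my own assumptions: scan the folder, remember each file's modification time, and compare two scans. The helper names snapshot and diff are hypothetical.

```python
import os
import tempfile
from pathlib import Path

def snapshot(directory: Path) -> dict[str, float]:
    """Map each dashboard JSON file name to its last-modified time."""
    return {p.name: p.stat().st_mtime for p in directory.glob("*.json")}

def diff(old: dict[str, float], new: dict[str, float]) -> dict[str, list[str]]:
    """Classify files as added, changed, or removed between two scans."""
    return {
        "added": sorted(n for n in new if n not in old),
        "changed": sorted(n for n in new if n in old and new[n] != old[n]),
        "removed": sorted(n for n in old if n not in new),
    }

# Simulate two polling cycles around one edit to node-overview.json.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    dashboard = folder / "node-overview.json"
    dashboard.write_text("{}")
    os.utime(dashboard, (1000, 1000))   # pin the modification time
    first_scan = snapshot(folder)
    os.utime(dashboard, (2000, 2000))   # stand-in for "edit the file, save it"
    second_scan = snapshot(folder)
    result = diff(first_scan, second_scan)

print(result)
```

A real poller would run the scan on a timer (every 30 seconds in this setup) and reload only the files in the changed and added buckets.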
Part 3 — The dashboard JSON
Four panels: CPU usage, memory used, disk used, and target health. Here are the parts that matter beyond basic panel configuration.
Variables: the Instance dropdown
"templating": {
  "list": [
    {
      "name": "instance",
      "type": "query",
      "datasource": "Prometheus",
      "query": "label_values(up, instance)",
      "refresh": 2,
      "includeAll": true,
      "allValue": ".*",
      "label": "Instance",
      "sort": 1
    }
  ]
}
label_values(up, instance) queries all unique values of the instance label from the up metric automatically. The dropdown auto-populates with every scrape target. Add a new target to Prometheus — it appears in the dropdown with zero dashboard changes.
includeAll: true with allValue: ".*" means selecting "All" uses a regex that matches everything. Every panel query uses {instance=~"$instance"} so selecting one instance filters all panels to that machine simultaneously.
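Here is a simplified Python sketch of what the variable does to a panel query. It is an illustration under assumptions, not Grafana's interpolation code: it skips the regex escaping Grafana performs, and the two instance names are hypothetical. It relies on the fact that PromQL regex matchers are fully anchored, which is why re.fullmatch is the right comparison.

```python
import re

def interpolate(expr: str, selected: list[str], all_value: str = ".*") -> str:
    """Replace $instance in a panel query.

    An empty selection stands for "All" and expands to all_value;
    several selections are joined with "|" (no regex escaping here).
    """
    value = all_value if not selected else "|".join(selected)
    return expr.replace("$instance", value)

def matches(pattern: str, instance: str) -> bool:
    """PromQL regex matchers are fully anchored, so compare with fullmatch."""
    return re.fullmatch(pattern, instance) is not None

# Hypothetical instance label values returned by label_values(up, instance).
targets = ["localhost:9090", "node-exporter:9100"]
expr = 'up{instance=~"$instance"}'

print(interpolate(expr, []))                      # "All" selected
print(interpolate(expr, ["node-exporter:9100"]))  # one instance selected
print([t for t in targets if matches(".*", t)])
print([t for t in targets if matches("node-exporter:9100", t)])
```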
The bling — and why most tutorials get it wrong
Most Grafana tutorials show you how to set threshold values. What they don't tell you is that setting the values alone does nothing to the line colour. You need one additional field that ties everything together:
"fieldConfig": {
  "defaults": {
    "color": { "mode": "thresholds" },
    "thresholds": {
      "mode": "absolute",
      "steps": [
        { "color": "green", "value": null },
        { "color": "yellow", "value": 40 },
        { "color": "red", "value": 50 }
      ]
    },
    "custom": {
      "lineWidth": 2,
      "fillOpacity": 15,
      "gradientMode": "scheme",
      "thresholdsStyle": { "mode": "line+area" }
    }
  }
}
"color": { "mode": "thresholds" } — this is the missing piece. Without it, thresholds are defined but Grafana never applies them to the line colour. The graph stays the default green regardless of what value it hits. This single field activates everything.
gradientMode: "scheme" — applies the gradient fill under the line using the threshold colour scheme. Below 40% the fill is green-tinted. Above 40% it shifts to yellow. Above 50% it turns red. The fill changes colour continuously as the value changes — not just a solid block but a smooth gradient that makes the severity immediately obvious.
thresholdsStyle: "line+area" draws visible horizontal lines at exactly 40% and 50% directly on the graph panel. You can see where the boundaries are even before any value crosses them. The area part shades the background band of each threshold range in that threshold's colour.
The result: a line that is green when healthy, yellow when warning, and red when critical. A gradient fill that changes colour with it. Threshold markers sitting on the graph showing exactly where the limits are. This is the "bling" — and it carries real meaning, not just decoration.
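The way absolute threshold steps pick a colour can be sketched in a few lines of Python. This reflects my reading of the steps array above, not Grafana's source: each step applies from its value upward until the next step takes over, and the null step is the base that covers everything below the first number. The function name active_color is hypothetical.

```python
import math

# The same steps as in the panel JSON; None stands in for JSON null.
STEPS = [
    {"color": "green", "value": None},   # base step, applies from minus infinity
    {"color": "yellow", "value": 40},
    {"color": "red", "value": 50},
]

def active_color(value: float, steps: list[dict] = STEPS) -> str:
    """Return the colour of the last step whose lower bound the value has reached."""
    color = steps[0]["color"]
    for step in steps:
        bound = -math.inf if step["value"] is None else step["value"]
        if value >= bound:
            color = step["color"]
        else:
            break  # steps are ascending, so no later step can apply
    return color

for cpu in (7, 39.9, 40, 49.9, 50, 95):
    print(cpu, active_color(cpu))
```

Note that a value sitting exactly on a boundary already takes the higher step's colour in this sketch.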
Target Health panel
{
  "type": "stat",
  "title": "Target Health",
  "targets": [{ "expr": "up" }],
  "fieldConfig": {
    "defaults": {
      "color": { "mode": "thresholds" },
      "mappings": [{
        "type": "value",
        "options": {
          "0": { "text": "DOWN", "color": "red" },
          "1": { "text": "UP", "color": "green" }
        }
      }]
    }
  },
  "options": { "colorMode": "background" }
}
The up metric returns 1 for reachable targets and 0 for unreachable. The value mapping converts those numbers into human-readable text with full background colours. At a glance you can see which targets are healthy without reading a single number.
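As a rough Python illustration of the value mapping, the lookup below turns raw up samples into the text and colour the stat panel shows. The fallback for unmapped values (raw number, colour "default") and the sample instance names are my own assumptions for the sketch.

```python
# The options object from the panel's value mapping.
MAPPINGS = {
    "0": {"text": "DOWN", "color": "red"},
    "1": {"text": "UP", "color": "green"},
}

def render(value: float) -> tuple[str, str]:
    """Return (text, colour) for a sample; unmapped values keep their raw number."""
    number = float(value)
    key = str(int(number)) if number.is_integer() else str(number)
    entry = MAPPINGS.get(key)
    if entry is None:
        return (key, "default")  # assumed fallback, chosen for this sketch
    return (entry["text"], entry["color"])

# Hypothetical samples of the up metric, keyed by instance label.
samples = {"localhost:9090": 1, "node-exporter:9100": 0}
for instance, value in samples.items():
    print(instance, render(value))
```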
Part 4 — Wiring it together in docker-compose
The key part that makes provisioning work:
grafana:
  volumes:
    - grafana_data:/var/lib/grafana
    - ./grafana/provisioning:/etc/grafana/provisioning
    - ./grafana/dashboards:/var/lib/grafana/dashboards
  environment:
    - GF_SECURITY_ADMIN_USER=admin
    - GF_SECURITY_ADMIN_PASSWORD=devops123
    - GF_USERS_ALLOW_SIGN_UP=false
    - GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH=/var/lib/grafana/dashboards/node-overview.json
Three volume mounts, each with a specific purpose:
grafana_data:/var/lib/grafana — persists Grafana's internal database across restarts.
./grafana/provisioning:/etc/grafana/provisioning — mounts your datasource and dashboard provider configs into the container. Grafana reads this directory on startup automatically.
./grafana/dashboards:/var/lib/grafana/dashboards — mounts your dashboard JSON files. Grafana polls this directory every 30 seconds for changes.
GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH — sets your Node Overview as the home page. Every engineer who opens Grafana lands directly on your dashboard, not the default welcome screen.
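Because the bind mounts depend on files existing on the host, a pre-flight check before docker compose up can catch a missing file early. The Python sketch below is one possible approach, not part of the module itself; the file list comes from the directory structure shown earlier, and the helper name missing_files is hypothetical.

```python
import tempfile
from pathlib import Path

# Files the compose stack expects, relative to module-3/.
EXPECTED = [
    "docker-compose.yml",
    "prometheus/prometheus.yml",
    "grafana/provisioning/datasources/prometheus.yml",
    "grafana/provisioning/dashboards/default.yml",
    "grafana/dashboards/node-overview.json",
]

def missing_files(root: Path, expected: list[str] = EXPECTED) -> list[str]:
    """Return the expected paths that are not regular files under root."""
    return [rel for rel in expected if not (root / rel).is_file()]

# Demo in a throwaway directory: create everything except the dashboard JSON.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in EXPECTED[:-1]:
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("")
    report = missing_files(root)

print(report)
```

In real use you would call missing_files(Path("module-3")) and stop if the list is not empty.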
Part 5 — Starting the stack
docker compose up -d
docker compose ps
Six resources created with green checkmarks — network, two volumes, three containers: prometheus-m3, node-exporter-m3, grafana-m3. All running and healthy.
Startup verification:
Prometheus: Prometheus Server is Healthy.
Grafana: ok
Datasource: Prometheus
Dashboards: 1 dashboard(s)
Dashboard loads at http://localhost:3001. Instance dropdown populated automatically. All four panels showing live data. Both targets showing UP in green.
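The two health lines above can be checked from a script. A minimal Python sketch, assuming Prometheus answers on its /-/healthy endpoint and Grafana on /api/health, and that Grafana's JSON response carries a "database" field: the parsing functions are pure so they can be tried on canned responses, and fetch is only defined here, not called.

```python
import json
from urllib.request import urlopen

def fetch(url: str, timeout: float = 5.0) -> str:
    """Return the response body of a GET request as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")

def prometheus_ok(body: str) -> bool:
    """True if the /-/healthy body reports a healthy server."""
    return "Healthy" in body

def grafana_ok(body: str) -> bool:
    """True if the /api/health JSON reports database: ok."""
    return json.loads(body).get("database") == "ok"

# Canned responses shaped like the verification output above.
print(prometheus_ok("Prometheus Server is Healthy.\n"))
print(grafana_ok('{"database": "ok", "version": "x.y.z"}'))
```

Against the running stack this would be grafana_ok(fetch("http://localhost:3001/api/health")). For Prometheus, the host port is an assumption; prometheus_ok(fetch("http://localhost:9090/-/healthy")) works only if 9090 is the published port.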
Part 6 — The real test: destroy and rebuild
This is the proof that provisioning actually works.
# Destroy Grafana completely
docker compose stop grafana
docker compose rm -f grafana
docker volume rm module-3_grafana_data
# Rebuild from zero
docker compose up -d grafana
Wait 15 seconds. Open http://localhost:3001.
The dashboard is back. Datasource connected. All panels showing data. Instance dropdown populated. Zero manual steps. Zero clicking. Zero reconfiguring.
The container is completely disposable. The configuration files are the system. That is infrastructure as code working exactly as it should.
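Instead of a fixed 15-second wait, a script can poll until Grafana answers. The Python helper below is a generic sketch with hypothetical names (wait_until, fake_check); the check function is injected, so the demo uses a fake that succeeds on its third attempt rather than a real HTTP call.

```python
import time
from typing import Callable

def wait_until(
    check: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except OSError:
            pass  # e.g. connection refused while the container is still starting
        if time.monotonic() >= deadline:
            return False
        sleep(interval)

# Demo with a fake check that only succeeds on the third attempt.
attempts = []

def fake_check() -> bool:
    attempts.append(1)
    return len(attempts) >= 3

print(wait_until(fake_check, timeout=5, interval=0.01), len(attempts))
```

In the destroy-and-rebuild test, check would be a function that requests http://localhost:3001/api/health and returns True once the response looks healthy.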
Part 7 — Watching the dashboard respond live
With the dashboard open in the browser on Last 15 minutes time range, I ran a stress test:
stress --cpu 4 --timeout 300s &
And watched the dashboard and alerts page simultaneously.
What happened:
The CPU panel line was flat green at ~7% baseline. As the stress test loaded the CPUs, the line climbed. When it crossed 40%, the line and fill shifted from green to yellow, the warning zone. When it crossed 50%, they turned red, the critical zone. The threshold lines drawn at 40% and 50% made the boundaries completely clear.
On the Prometheus alerts page at the same time, HighCPUUsage went inactive → pending → firing. CriticalCPUUsage followed.
When stress ended, the CPU line fell back through yellow and returned to green. Alerts resolved automatically. The full incident lifecycle (rise, threshold breach, recovery) was visible in one colour-coded dashboard view.
This is what observability customisation actually means. Not just having dashboards, but having dashboards that tell the same story as your alerts, in a visual language any engineer can read instantly.
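The inactive, pending, firing sequence can be modelled in a few lines of Python. The alerting rules file is not shown in this post, so everything specific here is an assumption: a threshold of 40 for HighCPUUsage (taken from the dashboard), a greater-than comparison, a for: duration worth two evaluation cycles, and made-up CPU samples.

```python
def alert_states(samples: list[float], threshold: float, for_steps: int) -> list[str]:
    """Return the alert state after each rule evaluation.

    for_steps is how many further consecutive breaching evaluations must
    follow the first one before pending becomes firing. It stands in for
    the rule's for: duration divided by the evaluation interval.
    """
    states = []
    streak = 0  # consecutive evaluations above the threshold
    for value in samples:
        if value > threshold:
            streak += 1
            states.append("firing" if streak - 1 >= for_steps else "pending")
        else:
            streak = 0
            states.append("inactive")
    return states

# Hypothetical CPU percentages, one per evaluation cycle.
cpu = [7, 45, 46, 47, 55, 30]
print(alert_states(cpu, threshold=40, for_steps=2))
```

The pending phase is the reason a short spike can turn the graph yellow without ever firing an alert.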
One thing that confused me — and the lesson in it
When I switched the time range to Last 15 minutes during a quiet period, all the panels showed flat lines. No spike visible.
My first instinct was that something broke.
Nothing broke. The spike happened 45 minutes ago. Last 15 minutes only shows the current 15-minute window. Switch back to Last 1 hour and the spike reappears.
This is fundamental to understanding time-series dashboards. Changing the time range does not delete or alter stored data; it only selects which window of that data gets queried and rendered. Understanding this stops a whole category of "why is my dashboard empty" confusion.
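A tiny Python model shows the effect. The samples are synthetic: one point per minute for an hour, a 7% baseline, and a spike to 90% that starts 45 minutes before "now". All of these numbers and the helper name visible are assumptions for the illustration.

```python
def visible(samples: list[tuple[int, float]], now: int, window_minutes: int) -> list[tuple[int, float]]:
    """Return the samples that fall inside the selected time window."""
    start = now - window_minutes * 60
    return [(t, v) for t, v in samples if start <= t <= now]

NOW = 3600  # seconds; the series covers the last hour
# Spike between 45 and 40 minutes ago (timestamps 900 to 1200), baseline elsewhere.
samples = [(t, 90.0 if 900 <= t <= 1200 else 7.0) for t in range(0, NOW + 1, 60)]

last_15 = visible(samples, NOW, 15)
last_60 = visible(samples, NOW, 60)

print("Last 15 minutes peak:", max(v for _, v in last_15))
print("Last 1 hour peak:", max(v for _, v in last_60))
```

The stored series is identical in both cases; only the window changes.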
Module 3 checkpoint — all verified
Prometheus health: Prometheus Server is Healthy.
Grafana health: ok
Datasource provisioned: Prometheus
Dashboard provisioned: Node Overview
Recording rules: 3 rules
=== Checkpoint complete ===
Provisioning working. Dashboard surviving container restarts. Colour-coded thresholds active. Live incident response visible in real time.
What I understand now that I didn't before
Clicking dashboards together is not engineering. It produces something that works today and breaks silently tomorrow. Real dashboards live in version control as JSON files, deploy automatically, and survive any infrastructure event without manual intervention.
Visual consistency is not optional. Dashboard thresholds and alert rule thresholds must match exactly. If your graph turns red at 50% but your alert fires at 80%, engineers stop trusting either one. Consistency is what makes an observability system reliable enough to act on.
Variables make dashboards scalable. A dashboard hardcoded for one instance serves one instance forever. A variable that auto-populates from Prometheus label values works for one instance today and fifty instances next month without a single dashboard edit.
The missing field changes everything. "color": { "mode": "thresholds" } is one line of JSON. Without it, all your threshold configuration is invisible. With it, the entire visual language of the dashboard activates. Details like this are what separate engineers who copy configs from engineers who understand them.
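Since the dashboard lives in a JSON file, the missing-field mistake can be caught by a small lint before the file is committed. The Python sketch below is one possible check, not an official tool: it flags any top-level panel that defines more than one threshold step without setting the colour mode to thresholds. It ignores panels nested inside rows, and the inline dashboard is a made-up example.

```python
import json

def panels_missing_threshold_colour(dashboard: dict) -> list[str]:
    """Return titles of panels whose thresholds would not colour the series."""
    missing = []
    for panel in dashboard.get("panels", []):
        defaults = panel.get("fieldConfig", {}).get("defaults", {})
        steps = defaults.get("thresholds", {}).get("steps", [])
        mode = defaults.get("color", {}).get("mode")
        if len(steps) > 1 and mode != "thresholds":
            missing.append(panel.get("title", "<untitled>"))
    return missing

# Made-up dashboard: the second panel forgets the colour mode.
EXAMPLE = json.loads("""
{
  "panels": [
    {"title": "CPU Usage", "fieldConfig": {"defaults": {
      "color": {"mode": "thresholds"},
      "thresholds": {"steps": [{"color": "green", "value": null},
                               {"color": "red", "value": 50}]}}}},
    {"title": "Memory Used", "fieldConfig": {"defaults": {
      "color": {"mode": "palette-classic"},
      "thresholds": {"steps": [{"color": "green", "value": null},
                               {"color": "red", "value": 50}]}}}}
  ]
}
""")

print(panels_missing_threshold_colour(EXAMPLE))
```

For the real file, load node-overview.json with json.load and pass the result to the same function.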
Module 4 next — Loki log aggregation. Promtail pipelines, LogQL queries, and structured logging. The "why did it break" pillar of observability. The piece that metrics alone can never tell you.



