First Alert and Runbook

Create one end-to-end alert as a proof of concept: an alert rule that fires, delivers a notification to ntfy with a runbook link, and has a corresponding runbook doc.

What to Do

1. Choose the First Alert

The best candidate is a blackbox probe failure because:

Alloy’s blackbox exporter already probes 5 services (miniflux, kiwix, transmission, devpi, argocd) at 30s intervals
The metric probe_success is already in Prometheus
It maps directly to what services-check does (HTTP health checks)
A single alert rule with a service label can cover all probed services
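To illustrate the last point, the Python sketch below mimics what the expression probe_success == 0 selects: one condition, evaluated across every instance, yields one result per failing service. The sample values are invented for illustration, and failing_services is a hypothetical helper, not part of Alloy, Prometheus, or Grafana.

```python
from typing import List, Mapping

# Service names come from the list above; the sample values are made up.
EXAMPLE_SAMPLES = {
    "miniflux": 0,
    "kiwix": 1,
    "transmission": 1,
    "devpi": 1,
    "argocd": 1,
}


def failing_services(samples: Mapping[str, float]) -> List[str]:
    """Return the instances whose latest probe_success sample is 0.

    This mirrors the series that `probe_success == 0` would match:
    one entry per failing instance, all from a single condition.
    """
    return sorted(name for name, value in samples.items() if value == 0)


if __name__ == "__main__":
    print(failing_services(EXAMPLE_SAMPLES))
```

Because the condition is applied per instance, adding a sixth probed service would not require a new rule, only a new probe target.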

2. Create the Alert Rule

Provision the rule as YAML in the alerting provisioning ConfigMap. The rule should:

Query probe_success == 0 from Prometheus
Fire after the condition persists for 2 minutes (avoid flapping)
Include labels: severity: warning, service: {{ $labels.instance }}
Include annotations: summary, runbook_url pointing to the runbook doc
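The 2-minute persistence requirement can be pictured as a small state machine. The sketch below is a simplified model, not Grafana's implementation: PendingAlert and its state names are hypothetical, and it assumes one observation per 30s probe. It shows why a single failed probe does not notify anyone, while a failure lasting 2 minutes does.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingAlert:
    """Simplified model of an alert with a 'for' duration (assumed 120s)."""

    for_seconds: int = 120
    failing_since: Optional[float] = None

    def observe(self, timestamp: float, probe_success: int) -> str:
        """Feed one probe result and return 'normal', 'pending', or 'firing'."""
        if probe_success != 0:
            # Any successful probe resets the pending timer.
            self.failing_since = None
            return "normal"
        if self.failing_since is None:
            self.failing_since = timestamp
        if timestamp - self.failing_since >= self.for_seconds:
            return "firing"
        return "pending"


if __name__ == "__main__":
    alert = PendingAlert()
    for t in (0, 30, 60, 90, 120):
        print(t, alert.observe(t, 0))
```

A brief blip (one failed probe followed by a success) moves the model to pending and back to normal without ever firing, which is the flapping protection the 2-minute window is meant to provide.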

3. Create the Runbook

Write docs/how-to/runbooks/runbook-service-probe-failure.md as a how-to doc explaining:

What the alert means
How to check which service is down
Common causes and resolution steps
How to silence the alert if the downtime is planned

4. Verify End-to-End

Stop one of the probed services (e.g., scale miniflux to 0)
Wait for the alert to fire (~2 minutes)
Confirm ntfy notification arrives with correct summary and runbook link
Click the runbook link and verify it reaches docs.eblu.me
Scale the service back up
Confirm “resolved” notification arrives
Confirm no repeat notification during the 24h window

Key Details

Grafana alert rules can be provisioned as YAML files alongside contact points and notification policies
The blackbox probe metrics from Alloy use the job name blackbox and include an instance label with the service name
The runbook URL format: https://docs.eblu.me/how-to/runbooks/runbook-service-probe-failure
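As a rough sketch of how these details combine, the Python below assembles the labels and annotations described in step 2 for a given instance. The alert_fields helper and the summary wording are hypothetical; only the label names, the severity value, and the runbook URL come from this page.

```python
from typing import Dict

# Runbook URL as given in Key Details.
RUNBOOK_URL = "https://docs.eblu.me/how-to/runbooks/runbook-service-probe-failure"


def alert_fields(instance: str) -> Dict[str, Dict[str, str]]:
    """Build the labels and annotations for one failing instance.

    `instance` stands in for the value of the instance label on the
    blackbox probe metric, which carries the service name.
    """
    return {
        "labels": {"severity": "warning", "service": instance},
        "annotations": {
            # Hypothetical wording; the real summary text is up to the rule author.
            "summary": f"Probe failed for {instance}",
            "runbook_url": RUNBOOK_URL,
        },
    }


if __name__ == "__main__":
    print(alert_fields("miniflux"))
```

Every probed service shares the same runbook URL here, since the single runbook covers all probe failures.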

Verification

Alert rule appears in Grafana UI under Alerting → Alert Rules
Simulated failure triggers ntfy notification within ~3 minutes
Notification includes service name, summary, and clickable runbook link
Resolution triggers a “resolved” notification
No repeat notification within 24h window
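The ~3 minute expectation can be sanity-checked with simple arithmetic. The sketch below is a rough upper bound under stated assumptions: the 30s probe interval and 2-minute pending period come from this page, while the 30s rule evaluation interval is assumed, and notification grouping delays and delivery latency are ignored.

```python
def worst_case_seconds_to_fire(
    probe_interval_s: int = 30,       # from this page: probes run every 30s
    pending_period_s: int = 120,      # from this page: condition must hold for 2 minutes
    evaluation_interval_s: int = 30,  # assumption: not stated on this page
) -> int:
    """Estimate the longest delay between a service failing and the alert firing.

    Worst case: the service fails just after a probe, the failed sample
    lands just after a rule evaluation, and then the full pending period
    must elapse before the rule fires.
    """
    return probe_interval_s + evaluation_interval_s + pending_period_s


if __name__ == "__main__":
    seconds = worst_case_seconds_to_fire()
    print(f"{seconds}s (~{seconds / 60:.0f} minutes)")
```

If the actual evaluation interval is longer than assumed, the estimate grows accordingly, so a notification arriving somewhat after the 3-minute mark does not necessarily indicate a broken pipeline.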

Related

configure-grafana-alerting-pipeline: prerequisite; the pipeline must be working
deploy-infra-alerting: parent goal
port-services-check-alerts: next step; port the remaining checks
runbook-service-probe-failure: the runbook created for this alert
