Port services-check Alerts to Grafana

Systematically migrate the health checks from mise run services-check to Grafana alert rules, each with a corresponding runbook. Once this card is done, the alerting system covers everything services-check covers today.

What to Do

1. Inventory and Prioritize

Map each services-check probe to a data source and alert rule. Some checks already have metrics in Prometheus; others need new instrumentation.

Already have metrics (easy):

  • HTTP endpoint probes → Alloy blackbox exporter (probe_success)
  • PostgreSQL health → CNPG metrics (cnpg_pg_replication_streaming, cnpg_collector_up)
  • K8s pod health → kube-state-metrics (kube_pod_status_phase)
  • ArgoCD sync status → ArgoCD metrics (argocd_app_info with sync/health labels)
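
As a sketch, alert expressions for the already-instrumented checks could look like the following. Label values such as job, instance, and phase are assumptions to be matched against the real scrape configs; the metric names come from the list above.

```promql
# HTTP endpoint down (blackbox exporter)
probe_success{job="blackbox"} == 0

# CNPG exporter not reporting
cnpg_collector_up == 0

# Pod stuck in a bad phase (Failed/Pending are standard k8s phase label values)
sum by (namespace, pod) (kube_pod_status_phase{phase=~"Failed|Pending"}) > 0

# ArgoCD app out of sync or degraded (sync_status/health_status are
# labels on the argocd_app_info info metric, which always has value 1)
argocd_app_info{sync_status!="Synced"} == 1
argocd_app_info{health_status!="Healthy"} == 1
```

Each expression pairs with a `for:` window in the rule so transient blips don't page.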

Need new probes or metrics:

  • Local indri services (forgejo, alloy, borgmatic, zot via brew/launchctl) → Alloy host textfile or new probes
  • Metrics textfile freshness → node_textfile_mtime_seconds (the metric is already collected by Alloy on indri; only the staleness alert rule is missing)
  • Ringtail SSH/tailscale health → Alloy blackbox on ringtail or cross-cluster probe
  • Public services (docs, cv, forge via Fly.io) → Alloy on Fly.io or Grafana synthetic monitoring
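
For the brew/launchctl-managed indri services, a minimal textfile-collector probe could look like this. The launchd label, output directory, and metric name are assumptions; adjust to match the real Alloy textfile path on indri.

```shell
#!/bin/sh
# Sketch: probe a launchd-managed service and expose an up/down gauge
# via the textfile collector. Alloy picks up the .prom file on scrape.
SERVICE="${SERVICE:-homebrew.mxcl.forgejo}"     # assumption: launchd label
DIR="${TEXTFILE_DIR:-/tmp/alloy-textfile}"      # assumption: textfile dir

mkdir -p "$DIR"
OUT="$DIR/local_services.prom"

if launchctl list "$SERVICE" >/dev/null 2>&1; then
  up=1
else
  up=0
fi

# Write atomically so Alloy never scrapes a half-written file.
printf 'local_service_up{service="forgejo"} %s\n' "$up" > "$OUT.tmp"
mv "$OUT.tmp" "$OUT"
```

Run it from a cron/launchd timer; the freshness of the .prom file is then itself covered by the node_textfile_mtime_seconds check above.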

2. Add Missing Probes

Extend Alloy configurations where needed:

  • Alloy on indri: Add blackbox targets for forgejo, zot (local HTTP endpoints)
  • Alloy on ringtail: Add blackbox targets for ringtail-local services
  • Consider whether public endpoint probing belongs in the Fly.io Alloy instance or in a separate prober
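
A sketch of the indri additions in Alloy's River syntax. The ports, module name, config path, and the remote_write component name are assumptions to be reconciled with the existing Alloy config:

```river
// Local blackbox probes for services that only listen on indri.
prometheus.exporter.blackbox "indri_local" {
  config_file = "/opt/homebrew/etc/alloy/blackbox.yml"  // assumption: module defs

  target {
    name    = "forgejo"
    address = "http://127.0.0.1:3000"      // assumption: forgejo's local port
    module  = "http_2xx"
  }

  target {
    name    = "zot"
    address = "http://127.0.0.1:5000/v2/"  // assumption: zot's registry endpoint
    module  = "http_2xx"
  }
}

prometheus.scrape "indri_blackbox" {
  targets    = prometheus.exporter.blackbox.indri_local.targets
  forward_to = [prometheus.remote_write.default.receiver]  // assumption: existing sink
}
```

The ringtail additions follow the same shape with ringtail-local targets.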

3. Create Alert Rules

For each check category, create provisioned Grafana alert rules. Group related checks into alert rule groups (e.g., “indri-services”, “k8s-health”, “public-endpoints”).
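
A provisioned rule might look roughly like this. The UID, folder, datasource UID, threshold, and runbook filename are all hypothetical; the nested query model mirrors what the Grafana UI exports for a Prometheus instant query:

```yaml
apiVersion: 1
groups:
  - orgId: 1
    name: indri-services          # one group per check category
    folder: Infrastructure        # assumption: provisioned folder name
    interval: 1m
    rules:
      - uid: forgejo-probe-down   # hypothetical UID
        title: ForgejoProbeDown
        condition: A
        for: 5m
        noDataState: Alerting     # a silent exporter should page too
        execErrState: Error
        labels:
          severity: critical
        annotations:
          summary: Blackbox probe against the local forgejo endpoint is failing.
          runbook: docs/how-to/runbooks/runbook-forgejo-probe-down.md  # hypothetical name
        data:
          - refId: A
            relativeTimeRange: { from: 300, to: 0 }
            datasourceUid: prometheus   # assumption: actual datasource UID
            model:
              refId: A
              instant: true
              expr: probe_success{job="blackbox", instance="forgejo"} == 0
```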

4. Create Runbooks

One runbook per alert type in docs/how-to/runbooks/runbook-<name>.md. Each runbook should cover:

  • What the alert means
  • Diagnostic steps
  • Common fixes
  • How to silence for planned maintenance
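
A skeleton matching that checklist (section names, the alert name, and the forgejo port are suggestions, not an existing convention):

```markdown
# Runbook: ForgejoProbeDown

## What it means
The blackbox probe against forgejo's local HTTP endpoint has been
failing for longer than the rule's pending period.

## Diagnose
- Check the service: `brew services info forgejo` / `launchctl list`
- Hit the endpoint directly from indri: `curl -i http://127.0.0.1:3000`
- Check recent logs for crashes or port conflicts

## Common fixes
- Restart the service: `brew services restart forgejo`
- Clear a stuck port or stale pid file

## Silence for maintenance
Create a Grafana silence scoped to `alertname=ForgejoProbeDown`
with an explicit expiry.
```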

5. Remove from services-check

As each check is ported, remove it from the services-check script (or mark it as “now handled by alerting”). The goal is that services-check shrinks as alerting grows.
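
One low-ceremony way to mark a ported check without losing the trail (the function name and runbook filename are hypothetical; match whatever the real script uses):

```shell
# services-check excerpt: a ported check replaced by a pointer to the
# alert rule that now owns it, so the script output stays informative.
check_forgejo_http() {
  echo "forgejo HTTP: now handled by alerting (runbook-forgejo-probe-down.md)"
}

check_forgejo_http
```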

Key Details

  • Don’t try to port everything in one session — this card may span multiple work cycles within the C2 chain
  • Prioritize checks that have caught real problems in the past
  • Some checks (like ArgoCD sync status table) may remain in services-check as a human-readable summary even after alerting covers the failure cases
  • The Alloy blackbox exporter on k8s already covers 5 services; extending it to more is straightforward

Verification

  • All HTTP endpoint checks from services-check have corresponding alert rules
  • Pod health checks have corresponding alert rules
  • PostgreSQL health has a corresponding alert rule
  • Each alert rule has a runbook doc in docs/how-to/runbooks/
  • Test at least 2-3 failure scenarios end-to-end
  • services-check script has been updated to reflect ported checks