Deploy Infrastructure Alerting Pipeline

Replace the manual mise run services-check approach with Grafana Unified Alerting backed by ntfy push notifications, so that each infrastructure problem pages once and every notification includes an actionable runbook link.

Architecture

Prometheus (metrics) ──┐
                       ├──▶ Grafana Alert Rules ──▶ ntfy webhook ──▶ iOS push
Loki (logs) ──────────┘          │
                                 │
                          Notification Policy
                          (group_wait: 1m,
                           group_interval: 12h,
                           repeat_interval: 24h)
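
How the three timers in the notification policy interact can be sketched roughly as below. This is a simplified model that assumes Alertmanager-style semantics (first notification after group_wait, changed groups after group_interval, unchanged groups after repeat_interval). The function name is hypothetical, and the real scheduler evaluates on its own flush ticks, so actual send times may differ slightly.

```python
from datetime import datetime, timedelta

# Timer values taken from the notification policy in the diagram above.
GROUP_WAIT = timedelta(minutes=1)
GROUP_INTERVAL = timedelta(hours=12)
REPEAT_INTERVAL = timedelta(hours=24)


def next_notification_time(group_created, last_sent=None, group_changed=False):
    """Estimate when an alert group is next eligible to notify.

    Simplified model (an assumption, not Grafana's exact scheduler):
    - a new group waits GROUP_WAIT before its first notification
    - a group whose membership changed waits GROUP_INTERVAL since the last send
    - an unchanged, still-firing group waits REPEAT_INTERVAL since the last send
    """
    if last_sent is None:
        return group_created + GROUP_WAIT
    if group_changed:
        return last_sent + GROUP_INTERVAL
    return last_sent + REPEAT_INTERVAL
```

With these values, an alert group that keeps firing without changes is re-sent no sooner than 24 hours after its previous notification, which is what keeps paging to once per day.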

Design Decisions

| Decision     | Choice                                      | Rationale                                   |
|--------------|---------------------------------------------|---------------------------------------------|
| Alert engine | Grafana Unified Alerting                    | Already deployed, no new service needed     |
| Notification | ntfy webhook contact point                  | Already deployed on ringtail, iOS app works |
| Anti-noise   | 24h repeat interval                         | Page once per day max per alert group       |
| Runbooks     | docs/how-to/runbooks/<name>.md              | Clickable link in every notification        |
| Provisioning | Grafana provisioning YAML (GitOps)          | Alerts defined in repo, not just UI         |
| Topic        | infra-alerts (separate from frigate-alerts) | Different severity/audience                 |

Alerting Policy

  • Each alert group notifies once when it fires and re-notifies at most once every 24 hours while it keeps firing
  • A “resolved” notification is sent when the condition clears
  • Every alert rule carries a runbook_url annotation linking to its how-to doc
  • The ntfy message template renders the runbook URL as a clickable action button
  • Alerts are grouped by service to avoid notification storms
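
To make the target of the message template concrete, here is a minimal sketch of the JSON body that ntfy accepts for a notification with a "view" action button. The function name, title format, priorities, and tags are illustrative assumptions. How the payload is actually produced (a Grafana template or a small relay) is not specified here.

```python
import json


def build_ntfy_payload(alert_name, service, summary, runbook_url,
                       topic="infra-alerts", resolved=False):
    """Build an ntfy JSON publish body with a runbook action button.

    The title format, priorities, and tags are illustrative choices,
    not values taken from the deployed template.
    """
    status = "RESOLVED" if resolved else "FIRING"
    return {
        "topic": topic,
        "title": f"[{status}] {service}: {alert_name}",
        "message": summary,
        # ntfy priorities range from 1 (min) to 5 (max); 3 is the default.
        "priority": 3 if resolved else 4,
        "tags": ["white_check_mark"] if resolved else ["warning"],
        # A "view" action opens the URL when the button is tapped.
        "actions": [
            {"action": "view", "label": "Open runbook", "url": runbook_url},
        ],
    }


if __name__ == "__main__":
    # Hypothetical alert and placeholder URL, for illustration only.
    payload = build_ntfy_payload(
        alert_name="BlackboxProbeFailed",
        service="forgejo",
        summary="HTTP probe has been failing for 5 minutes",
        runbook_url="https://example.invalid/how-to/runbooks/blackbox-probe.md",
    )
    print(json.dumps(payload, indent=2))
```

Passing resolved=True yields the matching "resolved" message for the same alert, so both notifications carry the same runbook button.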

Migration Path

  1. Stand up the pipeline: Grafana alerting config, ntfy contact point, notification policy, message template
  2. Create the first alert + runbook as proof of concept (e.g., a blackbox probe failure)
  3. Port services-check health checks to Grafana alert rules, one by one, each with a runbook
  4. Refactor services-check to query the Grafana alerting API instead of doing its own probes
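
Step 4 could look roughly like the sketch below: services-check fetches the currently firing alerts from Grafana and summarizes them per service instead of probing on its own. It assumes Grafana's Alertmanager-compatible endpoint at /api/alertmanager/grafana/api/v2/alerts, a service account token for authentication, and a label named service on every alert. The function names are hypothetical.

```python
import json
import urllib.request
from collections import defaultdict

# Assumed endpoint: Grafana's built-in Alertmanager-compatible API.
ALERTS_PATH = "/api/alertmanager/grafana/api/v2/alerts"


def fetch_alerts(base_url, token):
    """Fetch the current alerts from Grafana (performs a network call)."""
    request = urllib.request.Request(
        base_url.rstrip("/") + ALERTS_PATH,
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def summarize_by_service(alerts):
    """Map each service to the sorted names of its active alerts.

    Alerts that are not in the "active" state (for example silenced ones)
    are skipped. The "service" label name is an assumption.
    """
    firing = defaultdict(list)
    for alert in alerts:
        if alert.get("status", {}).get("state") != "active":
            continue
        labels = alert.get("labels", {})
        service = labels.get("service", "unknown")
        firing[service].append(labels.get("alertname", "unnamed"))
    return {service: sorted(names) for service, names in firing.items()}
```

An empty summary would mean nothing is firing, so services-check could exit zero in that case and non-zero otherwise.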

What services-check Covers Today

These checks will be migrated to alert rules:

| Category               | Checks                                           | Data Source                             |
|------------------------|--------------------------------------------------|-----------------------------------------|
| Local services (indri) | forgejo, alloy, borgmatic, zot via brew/launchctl | Need new probes or textfile metrics     |
| Metrics textfiles      | freshness of .prom files                         | Existing node_textfile metrics          |
| K8s cluster health     | minikube API, k3s API                            | kube-state-metrics                      |
| HTTP endpoints         | ~12 services via Caddy                           | Alloy blackbox exporter (already exists) |
| Ringtail               | SSH, tailscale, k3s health                       | Need new probes                         |
| K3s pods               | ntfy, authentik, frigate, etc.                   | kube-state-metrics on ringtail          |
| Public services        | docs, cv, forge via Fly.io                       | Alloy on Fly.io or external probe       |
| PostgreSQL             | CNPG readiness                                   | CNPG metrics (already scraped)          |
| ArgoCD sync            | app sync/health status                           | ArgoCD metrics or API                   |
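
For the local services row, one option is to export the brew/launchctl status as textfile metrics. The sketch below renders a gauge in the Prometheus text exposition format and writes it atomically. The metric name local_service_up and both function names are hypothetical, and the step that determines whether each service is running is left out.

```python
import os
import tempfile


def render_service_metrics(statuses):
    """Render {service_name: is_running} as Prometheus text-format gauges."""
    lines = [
        "# HELP local_service_up Whether a locally managed service is running (1) or not (0).",
        "# TYPE local_service_up gauge",
    ]
    for name in sorted(statuses):
        value = 1 if statuses[name] else 0
        lines.append(f'local_service_up{{service="{name}"}} {value}')
    return "\n".join(lines) + "\n"


def write_textfile(path, text):
    """Write the metrics via a temp file and rename, so a scrape never
    sees a half-written .prom file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    # mkstemp creates the file as 0600; relax it so the collector can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)
```

An alert rule could then fire when local_service_up is 0, and the existing freshness check on .prom files would cover the case where the exporter itself stops running.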