# Runbook: Service Probe Failure

**Alert name:** `ServiceProbeFailure`

A blackbox HTTP health check has failed for 2+ minutes, meaning a service is not responding on its health endpoint.
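For orientation, the rule behind this alert presumably has roughly the following shape. This is a sketch only — the group name, threshold, and labels are assumptions; the real expression lives in the alerting pipeline config (see Related below). `probe_success` is the standard blackbox exporter metric.

```yaml
# Hypothetical shape of the alert rule -- names and labels are
# assumptions, not copied from the actual pipeline config.
groups:
  - name: blackbox
    rules:
      - alert: ServiceProbeFailure
        expr: probe_success == 0
        for: 2m
        annotations:
          summary: "Blackbox probe failing for {{ $labels.service }}"
```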
## Affected Services

This alert covers services probed by the Alloy blackbox exporter on indri's minikube cluster:
| Service | Health Endpoint |
|---|---|
| miniflux | /healthcheck |
| kiwix | / |
| transmission | /transmission/web/ |
| devpi | /+api |
| argocd | /healthz |
The failing service is identified by the `service` label on the alert (extracted from the `job` label).
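The extraction is a simple string operation. Assuming the `job` label embeds the service name as its last path segment — an assumption about this setup, not something the runbook confirms — it can be reproduced in shell:

```shell
#!/bin/sh
# Derive the service name from a blackbox job label.
# Assumes jobs are named like "integrations/blackbox/<service>";
# the exact job-label format here is hypothetical.
service_from_job() {
  job="$1"
  printf '%s\n' "${job##*/}"   # strip everything up to the last '/'
}

service_from_job "integrations/blackbox/miniflux"   # -> miniflux
```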
## Diagnostic Steps

1. **Check which service is down** — the alert's `service` label tells you. You can also run:

   ```shell
   kubectl get pods -n <namespace> --context=minikube-indri
   ```

2. **Check pod status** — look for `CrashLoopBackOff`, `OOMKilled`, or pending pods:

   ```shell
   kubectl describe pod -n <namespace> <pod-name> --context=minikube-indri
   ```

3. **Check pod logs:**

   ```shell
   kubectl logs -n <namespace> <pod-name> --context=minikube-indri --tail=50
   ```

4. **Check whether minikube itself is healthy:**

   ```shell
   ssh indri 'minikube status'
   ```

5. **Check NFS mounts** (kiwix and transmission depend on sifaka NFS):

   ```shell
   ssh indri 'df -h | grep Volumes'
   ```
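The steps above can be collected into a small helper that prints the commands to run for a given service, in runbook order. This is a sketch: it assumes the namespace matches the service name, which may not hold for every service on the cluster.

```shell
#!/bin/sh
# Print the diagnostic commands for one service, in runbook order.
# Assumes namespace == service name, which is a guess about this cluster.
CTX="--context=minikube-indri"

probe_failure_commands() {
  svc="$1"
  cat <<EOF
kubectl get pods -n $svc $CTX
kubectl describe pod -n $svc <pod-name> $CTX
kubectl logs -n $svc <pod-name> $CTX --tail=50
ssh indri 'minikube status'
ssh indri 'df -h | grep Volumes'
EOF
}

probe_failure_commands miniflux
```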
## Common Causes

- **Pod crashed** — check logs, restart with `kubectl delete pod`
- **NFS mount lost** — sifaka offline or AutoMounter not running; SSH to indri and check `/Volumes/`
- **Resource exhaustion** — check `kubectl top pods -n <namespace>` for memory/CPU pressure
- **Minikube paused/stopped** — `ssh indri 'minikube status'`, restart if needed
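Since a lost NFS mount is a common cause, the `df` check can be scripted into a pass/fail report. The `/Volumes/...` paths below are illustrative placeholders, not the real export list:

```shell
#!/bin/sh
# Report whether each expected NFS mount appears in `df` output.
# The /Volumes/... paths used in the usage example are hypothetical.
check_mounts() {
  df_output="$1"
  shift
  for mount in "$@"; do
    if printf '%s\n' "$df_output" | grep -q "$mount"; then
      echo "OK      $mount"
    else
      echo "MISSING $mount"
    fi
  done
}

# Typical use against indri (mount paths are placeholders):
#   check_mounts "$(ssh indri df -h)" /Volumes/media /Volumes/downloads
```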
## Silencing

For planned maintenance, silence this alert in Grafana:

1. Go to Alerting → Silences → Create Silence
2. Match label `alertname = ServiceProbeFailure`
3. Optionally match `service = <specific-service>` to silence only one service
4. Set the duration to cover your maintenance window
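For scripted maintenance windows, the same silence can be created against Grafana's Alertmanager-compatible API instead of the UI — on a default setup that is `POST /api/alertmanager/grafana/api/v2/silences`, with a body following the Alertmanager v2 silence schema. Timestamps, the optional `service` matcher, and the comment below are placeholders:

```json
{
  "matchers": [
    { "name": "alertname", "value": "ServiceProbeFailure", "isRegex": false },
    { "name": "service", "value": "miniflux", "isRegex": false }
  ],
  "startsAt": "2024-01-01T00:00:00Z",
  "endsAt": "2024-01-01T02:00:00Z",
  "createdBy": "runbook",
  "comment": "Planned maintenance"
}
```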
## Related

- deploy-infra-alerting — Alerting pipeline overview
- configure-grafana-alerting-pipeline — Pipeline configuration