Runbook: Service Probe Failure

Alert name: ServiceProbeFailure

A blackbox HTTP health check has failed for 2+ minutes, meaning a service is not responding to its health endpoint.
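
To confirm the failure by hand, the probe can be roughly reproduced with curl. This is a minimal sketch, assuming the probe uses the blackbox exporter's default HTTP behaviour (redirects followed, any 2xx response counts as success); the actual Alloy configuration may differ. The function names and the URL are placeholders, since this runbook does not list the health endpoints.

```shell
# Return 0 if an HTTP status code would count as success under the
# assumed default blackbox behaviour (any 2xx), non-zero otherwise.
is_probe_success() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Print only the final HTTP status code for a URL.
# curl prints 000 when it cannot connect or the request times out.
probe_status() {
  curl -sL -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Usage (replace the placeholder URL with the service's real health endpoint):
#   code=$(probe_status "http://<service-host>/<health-path>")
#   if is_probe_success "$code"; then echo "healthy ($code)"; else echo "failing ($code)"; fi
```

A result of 000 points at a network or DNS problem rather than an unhealthy application, which helps decide whether to look at the pod or at minikube first.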

Affected Services

This alert covers services probed by the Alloy blackbox exporter on indri’s minikube cluster:

The failing service is identified by the service label in the alert (extracted from the job label).

Check which service is down. The service label on the alert names it. To list the pods in that service's namespace, run:
```
kubectl get pods -n <namespace> --context=minikube-indri
```

Check pod status. Look for pods in CrashLoopBackOff, OOMKilled, or Pending state:

```
kubectl describe pod -n <namespace> <pod-name> --context=minikube-indri
```
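
When a namespace has many pods, it can help to filter the listing down to the ones worth describing. The helper below is a sketch, not part of the existing tooling: `unhealthy_pods` is a hypothetical name, and it assumes the default `kubectl get pods` table layout (NAME, READY, STATUS as the first three columns).

```shell
# Read `kubectl get pods` table output on stdin and print the name and
# status of every pod that is not Running with all containers ready.
# Completed pods (finished jobs) are skipped.
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")          # READY column looks like "1/1"
    if ($3 == "Completed") next
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $3
  }'
}

# Usage:
#   kubectl get pods -n <namespace> --context=minikube-indri | unhealthy_pods
```

Each line of output gives a pod name to pass to kubectl describe pod and kubectl logs.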

Check pod logs:

```
kubectl logs -n <namespace> <pod-name> --context=minikube-indri --tail=50
```

Check if minikube itself is healthy:
```
ssh indri 'minikube status'
```

Check NFS mounts (kiwix and transmission depend on NFS shares from sifaka):
```
ssh indri 'df -h | grep Volumes'
```
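
The mount check can be made explicit by comparing the df output against the mount points that are expected to exist. This is a sketch under assumptions: `missing_mounts` is a hypothetical helper, the share names are not listed in this runbook and must be supplied by the operator, and mount points are assumed to contain no spaces.

```shell
# Read `df` output on stdin and print each expected mount point (given as
# arguments) that does not appear in the last column.
# Returns non-zero if at least one mount point is missing.
missing_mounts() {
  mm_df_output=$(cat)
  mm_status=0
  for mm_mount in "$@"; do
    if ! printf '%s\n' "$mm_df_output" |
        awk -v m="$mm_mount" '$NF == m { found = 1 } END { exit !found }'; then
      printf '%s\n' "$mm_mount"
      mm_status=1
    fi
  done
  return "$mm_status"
}

# Usage (replace the placeholder with the real share names under /Volumes):
#   ssh indri 'df -h' | missing_mounts /Volumes/<share-name>
```

Empty output and a zero exit status mean every expected share is mounted; any printed path is a share to investigate on sifaka or in AutoMounter.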

Pod crashed: check the logs, then restart the pod with kubectl delete pod -n <namespace> <pod-name> --context=minikube-indri.
NFS mount lost: sifaka is offline or AutoMounter is not running. SSH to indri and check /Volumes/.
Resource exhaustion: check kubectl top pods -n <namespace> --context=minikube-indri for memory or CPU pressure.
Minikube paused or stopped: run ssh indri 'minikube status' and restart if needed.

For planned maintenance, silence this alert in Grafana: