Runbook: PostgreSQL Cluster Unhealthy

Alert name: PostgresClusterUnhealthy

The CNPG collector metrics endpoint is down, indicating the PostgreSQL cluster is not responding.

Affected Services

The blumeops-pg CNPG cluster on indri’s minikube runs databases for:

  • TeslaMate
  • Authentik (cross-cluster from ringtail)
  • Immich
  • Grafana dashboards (TeslaMate datasource)

Diagnostic Steps

  1. Check CNPG cluster status:

    kubectl get cluster blumeops-pg -n databases --context=minikube-indri
    kubectl get pods -n databases -l cnpg.io/cluster=blumeops-pg --context=minikube-indri
  2. Check pod logs:

    kubectl logs -n databases -l cnpg.io/cluster=blumeops-pg --context=minikube-indri --tail=30
  3. Check if pg_isready:

    pg_isready -h pg.ops.eblu.me -p 5432
  4. Check PVC storage:

    kubectl get pvc -n databases --context=minikube-indri

Common Causes

  • Pod crash — OOM, disk full, or configuration error
  • PVC storage full — check with kubectl exec into the pod and df -h
  • Minikube issue — if the node is under memory pressure, CNPG pods may be evicted
  • Network — Caddy L4 proxy (pg.ops.eblu.me) may be misconfigured

Silencing

For planned database maintenance:

  1. Grafana → Alerting → Silences → Create Silence
  2. Match alertname = PostgresClusterUnhealthy