Ollama

LLM inference server with GPU acceleration. Runs on ringtail with declarative model management via a sidecar.

Quick Reference

| Property | Value |
| --- | --- |
| URL | https://ollama.ops.eblu.me |
| Tailscale URL | https://ollama.tail8d86e.ts.net |
| Namespace | ollama |
| Cluster | ringtail k3s |
| Image | ollama/ollama:0.17.5 |
| Upstream | https://github.com/ollama/ollama |
| Manifests | argocd/manifests/ollama/ |
| API Port | 11434 |

Architecture

models.txt (ConfigMap, declarative)
    │
    ▼
model-sync sidecar ──ollama pull──► Ollama server (GPU)
    │                                    │
    │ reads /config/models.txt           │ serves /api/*
    │ polls every 30 min                 │ NVIDIA runtime (RTX 4080, time-sliced)
    │                                    │
    └────────────────────────────────────┘
                     │
                /models (200 Gi hostPath PV)
                /mnt/storage1/ollama on ringtail

Models

Declared in argocd/manifests/ollama/models.txt. The model-sync sidecar pulls missing models on startup and every 30 minutes.
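The sidecar's behavior can be sketched as a small shell loop. This is a hypothetical reconstruction based on the description above, not the actual sidecar script; function names and the poll implementation are assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of the model-sync sidecar loop.

# List declared models, skipping blank lines and # comments.
list_declared_models() {
  grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Pull every declared model; ollama pull is a no-op for models
# already present in /models.
sync_models() {
  list_declared_models /config/models.txt | while read -r model; do
    ollama pull "$model"
  done
}

# Poll loop, commented out so the sketch is safe to source:
# while true; do sync_models; sleep 1800; done   # 30 minutes
```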

| Model | Parameters |
| --- | --- |
| qwen2.5:14b | 14B |
| deepseek-r1:14b | 14B |
| phi4:14b | 14B |
| gemma3:12b | 12B |

To add or remove models, edit models.txt and sync via ArgoCD.
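A typical change looks like the following. The model name is an example, and the ArgoCD application name `ollama` is an assumption; check `argocd app list` for the real name.

```shell
# Add an example model to the declared list
echo 'mistral:7b' >> argocd/manifests/ollama/models.txt
git add argocd/manifests/ollama/models.txt
git commit -m "ollama: add mistral:7b"
git push
argocd app sync ollama   # re-renders the ConfigMap; the sidecar pulls on its next poll
```

Removing a line from models.txt stops the sidecar from pulling that model, but it does not delete already-downloaded weights from /models.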

GPU

Shares ringtail’s RTX 4080 with frigate via NVIDIA device plugin time-slicing (2 virtual GPU slots). Ollama is constrained to one loaded model and one parallel request to avoid VRAM contention with frigate.

| Setting | Value |
| --- | --- |
| OLLAMA_MAX_LOADED_MODELS | 1 |
| OLLAMA_NUM_PARALLEL | 1 |
| GPU limit | nvidia.com/gpu: "1" (time-sliced) |
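In the Deployment, these settings correspond roughly to the following container spec fragment. This is a sketch, not copied from the actual manifest; field placement is assumed.

```yaml
env:
  - name: OLLAMA_MAX_LOADED_MODELS
    value: "1"
  - name: OLLAMA_NUM_PARALLEL
    value: "1"
resources:
  limits:
    nvidia.com/gpu: "1"   # one time-sliced slot of the RTX 4080
```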

Storage

| Mount | Backend | Size |
| --- | --- | --- |
| /models | hostPath PV (/mnt/storage1/ollama) | 200 Gi |

PV reclaim policy is Retain, so the model data on disk survives deletion of the claim; Kubernetes releases the volume but never wipes the hostPath contents.
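A minimal sketch of the PV, assuming the shape described above. The metadata name and access mode are assumptions; the real manifest lives in argocd/manifests/ollama/.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ollama-models            # hypothetical name
spec:
  capacity:
    storage: 200Gi
  accessModes:
    - ReadWriteOnce              # assumed
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /mnt/storage1/ollama
```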

Networking

| Endpoint | Reachable from |
| --- | --- |
| https://ollama.ops.eblu.me | Public internet (Fly.io → Caddy) |
| https://ollama.tail8d86e.ts.net | Tailnet clients |
| http://ollama.ollama.svc.cluster.local:11434 | In-cluster (ringtail) |
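A quick smoke test from inside the cluster, using the standard Ollama HTTP API with one of the models declared above (requires network reachability to the service):

```shell
# List pulled models
curl -s http://ollama.ollama.svc.cluster.local:11434/api/tags

# Single non-streaming completion
curl -s http://ollama.ollama.svc.cluster.local:11434/api/generate \
  -d '{"model": "qwen2.5:14b", "prompt": "Why is the sky blue?", "stream": false}'
```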

Tailscale ingress uses ProxyGroup ingress — no explicit host: field (see tailscale-operator).