Local Models — Cluster-Local Inference¶
Run LLMs directly inside your Kubernetes cluster — no API keys, no external providers, no data leaving your network. Sympozium downloads the model weights, deploys a llama.cpp inference server, and exposes an OpenAI-compatible endpoint that your agents use automatically.
Quickstart: Zero to Local Inference in 5 Minutes¶
1. Deploy a Model¶
Open the web UI and navigate to Models in the sidebar. Click Deploy Model.
Pick a preset to get started quickly:
| Preset | Size | Context | Best for |
|---|---|---|---|
| Qwen3 8B (Q4) | 5 GB | 8K tokens | General use, good quality |
| Qwen3.5 9B (Q4) | 5.7 GB | 8K tokens | Latest Qwen, best quality |
| Phi-3 Mini 4K (Q4) | 2.2 GB | 4K tokens | Fast, lightweight |
| Qwen3 0.6B (Q8) | 0.6 GB | 4K tokens | Testing, minimal resources |
Click a preset to auto-fill the form, then click Deploy.
The model will progress through phases: Pending → Downloading → Loading → Ready. Download time depends on model size and network speed.
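If you prefer the command line, you can watch the same phase transitions with kubectl (the model name below matches the Qwen3 8B preset used later in this guide):
kubectl get model qwen3-8b-q4 -n sympozium-system -w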
2. Create an Instance¶
Once the model shows Ready, navigate to Instances → Create Instance.
The provider dropdown now shows your model at the top as (Local Model). Select it, and the wizard automatically:
- Skips the API key step (no key needed)
- Skips the model selector step (the model is already known)
- Jumps straight to skills configuration
Complete the wizard and your instance is ready to use.
3. Run an Agent¶
From the instance detail page, trigger an ad-hoc run or connect a channel. The agent will use your cluster-local model for all inference — no external API calls.
4. (Optional) Deploy a Team¶
The local-inference ensemble provides a pre-configured 2-agent team (assistant + coder) that runs entirely on a local model. To use it:
- Deploy a model named qwen3-8b-q4 (use the Qwen3 8B preset)
- Navigate to Ensembles → local-inference → Activate
- The ensemble waits for the model to be Ready, then stamps out instances automatically
How It Works¶
A Model is a declarative Kubernetes resource. The controller reconciles it into standard K8s primitives:
Model CR (you create)
│
├─ PersistentVolumeClaim (stores downloaded GGUF weights)
├─ Job (downloads GGUF from URL to PVC)
├─ Deployment (llama-server serving the model)
└─ Service (ClusterIP, OpenAI-compatible API)
Agents connect to the model via the Service endpoint (http://model-<name>.sympozium-system.svc:8080/v1). The LLMProvider interface treats it identically to OpenAI — no code changes needed.
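To sanity-check a Ready model, you can send a standard OpenAI-style chat completion to its Service endpoint from any pod in the cluster; a minimal sketch, with qwen3-8b-q4 standing in for your model name:
# run from any pod inside the cluster
curl -s http://model-qwen3-8b-q4.sympozium-system.svc:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello"}]}'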
Deployment Methods¶
Web UI¶
Models → Deploy Model → pick a preset or enter a custom GGUF URL → Deploy.
kubectl¶
apiVersion: sympozium.ai/v1alpha1
kind: Model
metadata:
name: qwen3-8b-q4
namespace: sympozium-system
spec:
source:
url: "https://huggingface.co/Qwen/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-Q4_K_M.gguf"
storage:
size: "8Gi"
resources:
memory: "8Gi"
cpu: "4"
inference:
contextSize: 8192
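Apply the manifest and watch the phase progress (the filename is arbitrary):
kubectl apply -f model.yaml
kubectl get model qwen3-8b-q4 -n sympozium-system -w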
Helm¶
# values-local.yaml
models:
enabled: true
items:
- name: qwen3-8b-q4
url: "https://huggingface.co/Qwen/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-Q4_K_M.gguf"
storageSize: "8Gi"
memory: "8Gi"
contextSize: 8192
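Then pass the values file on install or upgrade. A sketch assuming a release named sympozium installed into sympozium-system; substitute your actual chart reference:
# release name and chart reference are illustrative
helm upgrade --install sympozium <chart-reference> \
  -n sympozium-system -f values-local.yaml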
Connecting Models to Agents¶
Single Instance (modelRef)¶
Reference a Model by name on an AgentRun:
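A minimal sketch; only modelRef is specific to local models, and the rest of the AgentRun spec is elided:
apiVersion: sympozium.ai/v1alpha1
kind: AgentRun
metadata:
  name: my-local-run
  namespace: sympozium-system
spec:
  modelRef: qwen3-8b-q4   # name of the deployed Model CR
  # ... remaining AgentRun fields unchanged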
The controller sets provider: openai, baseURL: <endpoint>, and requires no auth secret.
Ensemble (modelRef)¶
Set modelRef at the ensemble level — all personas in the team use the same local model:
apiVersion: sympozium.ai/v1alpha1
kind: Ensemble
metadata:
name: my-local-team
spec:
modelRef: qwen3-8b-q4 # all personas use this model
personas:
- name: researcher
systemPrompt: "You are a research assistant..."
- name: writer
systemPrompt: "You are a technical writer..."
The controller waits for the Model to reach Ready before creating instances. If the model isn't deployed yet, the ensemble stays pending and retries every 10 seconds.
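Apply it like any other resource; instances appear once the referenced model reports Ready:
kubectl apply -f ensemble.yaml
kubectl get model qwen3-8b-q4 -n sympozium-system   # ensemble proceeds once this is Ready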
Web UI Auto-Wiring¶
When creating an Instance or activating an Ensemble via the web UI, Ready models appear automatically at the top of the provider dropdown as (Local Model). No manual endpoint configuration needed.
Model CRD Reference¶
Spec¶
| Field | Type | Default | Description |
|---|---|---|---|
| `source.url` | string | required | GGUF download URL |
| `source.filename` | string | `model.gguf` | Target filename on PVC |
| `storage.size` | string | `10Gi` | PVC storage size |
| `storage.storageClass` | string | cluster default | PVC storage class |
| `inference.image` | string | `ghcr.io/ggml-org/llama.cpp:server` | Inference server image |
| `inference.port` | int | `8080` | Server listen port |
| `inference.contextSize` | int | `4096` | Max context window (tokens) |
| `inference.args` | []string | `[]` | Extra llama-server args |
| `resources.gpu` | int | `0` | GPU count (`nvidia.com/gpu`); `0` = CPU-only |
| `resources.memory` | string | `16Gi` | Memory request/limit |
| `resources.cpu` | string | `4` | CPU request/limit |
| `nodeSelector` | map | `{}` | Node scheduling constraints |
| `tolerations` | []Toleration | `[]` | Node taint tolerations |
Status¶
| Field | Description |
|---|---|
| `phase` | `Pending`, `Downloading`, `Loading`, `Ready`, or `Failed` |
| `endpoint` | Cluster-internal OpenAI-compatible URL (set when `Ready`) |
| `message` | Human-readable status detail |
| `conditions` | `Downloaded`, `ServerReady` |
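You can read these fields directly with kubectl once the model is deployed; for example, to print the phase and endpoint:
kubectl get model qwen3-8b-q4 -n sympozium-system \
  -o jsonpath='{.status.phase}{"\n"}{.status.endpoint}{"\n"}'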
GPU Configuration¶
Models default to CPU-only inference (gpu: 0). For GPU acceleration:
spec:
resources:
gpu: 1
inference:
contextSize: 32768
args:
- "--n-gpu-layers"
- "99" # offload all layers to GPU
nodeSelector:
gpu-type: a100 # schedule on GPU nodes
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
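To confirm the GPU request actually landed on the inference pod, inspect the container resources; the label selector is the same one used in the troubleshooting table below:
kubectl get pod -n sympozium-system -l app.kubernetes.io/instance=qwen3-8b-q4 \
  -o jsonpath='{.items[0].spec.containers[0].resources}'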
Context Size¶
The contextSize field controls the maximum token window. Choose based on your model and available memory:
| Context Size | Memory Impact | Use Case |
|---|---|---|
| 2048 | Minimal | Short tasks, testing |
| 4096 | ~2 GB extra | Default, most tasks |
| 8192 | ~4 GB extra | Long documents, code |
| 32768 | ~16 GB extra | Full-context analysis (GPU recommended) |
Larger contexts require more memory. If the inference server OOMs, reduce contextSize or increase resources.memory.
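Both knobs can be adjusted on a live Model with a merge patch; the values below are illustrative:
kubectl patch model qwen3-8b-q4 -n sympozium-system --type merge \
  -p '{"spec":{"inference":{"contextSize":4096},"resources":{"memory":"24Gi"}}}'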
The "local-inference" Ensemble¶
Sympozium ships with a default ensemble designed for local models:
Personas:
- Local Assistant — general Q&A with persistent memory
- Local Coder — writes code with software-dev tools, handles tasks delegated by the assistant
Prerequisites: Deploy a Model named qwen3-8b-q4. The ensemble references it via modelRef and waits for it to be Ready.
Activate via UI: Ensembles → local-inference → Activate
Activate via kubectl:
kubectl patch ensemble local-inference -n sympozium-system \
--type merge -p '{"spec":{"enabled":true}}'
Cleanup¶
Delete the Model CR, and all owned resources (PVC, Job, Deployment, Service) are cleaned up automatically:
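kubectl delete model qwen3-8b-q4 -n sympozium-system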
Or via the web UI: Models → click model → Delete.
Troubleshooting¶
| Symptom | Check | Fix |
|---|---|---|
| Stuck in `Downloading` | `kubectl logs -n sympozium-system job/model-<name>-download` | Check URL, network, PVC size |
| Stuck in `Loading` | `kubectl logs -n sympozium-system deploy/model-<name>` | Increase memory, check for OOM |
| `Failed` phase | `kubectl describe model <name> -n sympozium-system` | See `status.message` |
| Context size error | Agent run logs show "exceeds available context" | Increase `inference.contextSize` or use a model with a larger native context |
| Pod Pending (GPU) | `kubectl describe pod -n sympozium-system -l app.kubernetes.io/instance=<name>` | Set `gpu: 0` for CPU-only, or ensure GPU nodes are available |
| Download 404 | Model phase shows `Failed` | Update `source.url` — controller retries on spec change |