Using llama-server with Sympozium

Sympozium supports llama-server (the official llama.cpp HTTP server) as an LLM provider. llama-server exposes an OpenAI-compatible API, so Sympozium treats it identically to any OpenAI-compatible endpoint. This lets you run agents on local GPU hardware with GGUF models.


Prerequisites

  • A running Kubernetes cluster (Kind, minikube, etc.)
  • Sympozium installed (sympozium install)
  • llama-server installed on your host machine (brew install llama.cpp or build from source)
  • A GGUF model downloaded or a Hugging Face model reference

Starting llama-server

llama-server \
    -hf unsloth/gemma-4-26B-A4B-it-GGUF:Q8_0 \
    --port 8080 \
    --n-gpu-layers 999 \
    --ctx-size 65536 \
    --batch-size 8192 \
    --ubatch-size 2048 \
    --threads 16 \
    --flash-attn on \
    --cont-batching

By default llama-server binds to 127.0.0.1:8080, a loopback address that pods inside the cluster cannot reach. To allow cluster access, bind to all interfaces:

llama-server --host 0.0.0.0 --port 8080 -hf <model>

Verify the server is running

curl http://localhost:8080/health
curl http://localhost:8080/v1/models
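
Large models can take a while to load, and during that time /health is expected to answer with an error status rather than 200. The following is a minimal sketch of a readiness wait built on the same curl check. The function name wait_for_health and its attempt and delay defaults are illustrative, not part of llama-server or Sympozium.

```shell
# Poll the /health endpoint until it answers with a 2xx status.
# Usage: wait_for_health [server_url] [attempts] [delay_seconds]
# Hypothetical helper; the defaults below are arbitrary.
wait_for_health() {
    wfh_url="${1:-http://localhost:8080}"
    wfh_attempts="${2:-30}"
    wfh_delay="${3:-2}"
    wfh_i=1
    while [ "$wfh_i" -le "$wfh_attempts" ]; do
        # -f makes curl exit non-zero on HTTP error statuses,
        # so a server that is still loading counts as "not ready".
        if curl -fsS -o /dev/null "$wfh_url/health" 2>/dev/null; then
            return 0
        fi
        # No need to sleep after the final attempt.
        if [ "$wfh_i" -lt "$wfh_attempts" ]; then
            sleep "$wfh_delay"
        fi
        wfh_i=$((wfh_i + 1))
    done
    return 1
}
```

For example, `wait_for_health http://localhost:8080 60 5 && curl http://localhost:8080/v1/models` waits up to about five minutes before listing the models.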

Finding the host gateway IP

Agent pods need a routable IP to reach llama-server on the host.

Kind:

docker exec kind-control-plane ip route | grep default | awk '{print $3}'

This typically returns something like 172.18.0.1.

minikube:

minikube ssh -- ip route | grep default | awk '{print $3}'

Cloud clusters: llama-server must be reachable from the cluster network. Use the machine's private IP.

The base URL

Once you have the host gateway IP, the base URL is:

http://<host-gateway-ip>:8080/v1

For example: http://172.18.0.1:8080/v1
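
The lookup and the URL construction above can be combined in a small script. This is a sketch only: gateway_from_routes and llama_base_url are hypothetical helper names, and the parsing assumes the usual `default via <ip> dev <iface>` line format printed by `ip route`.

```shell
# Print the default gateway found in `ip route` output read from stdin.
# Assumes a line of the form: default via <ip> dev <iface> ...
gateway_from_routes() {
    awk '$1 == "default" && $2 == "via" { print $3; exit }'
}

# Build the OpenAI-compatible base URL for a host IP and optional port.
# The port defaults to 8080, llama-server's default.
llama_base_url() {
    printf 'http://%s:%s/v1\n' "$1" "${2:-8080}"
}
```

On a Kind cluster this could be used as `llama_base_url "$(docker exec kind-control-plane ip route | gateway_from_routes)"`.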


Node discovery

The node-probe DaemonSet automatically discovers llama-server instances running on port 8080 (via the llama-cpp probe target). When llama-server is detected, the web UI wizard will show the node as available under the "Installed on Node" inference mode.

No additional configuration is needed — the existing llama-cpp node-probe target covers llama-server since they share the same endpoints (/health and /v1/models).


Creating a SympoziumInstance

llama-server does not require an API key, but the authRefs field is mandatory — create a Secret with a placeholder value.

kubectl create secret generic llama-server-key \
  --from-literal=OPENAI_API_KEY=not-needed

apiVersion: sympozium.ai/v1alpha1
kind: SympoziumInstance
metadata:
  name: llama-server-agent
spec:
  agents:
    default:
      model: unsloth/gemma-4-26B-A4B-it-GGUF
      baseURL: "http://172.18.0.1:8080/v1"
  authRefs:
    - provider: llama-server
      secret: llama-server-key
  skills:
    - skillPackRef: k8s-ops
  policyRef: default-policy

Running an AgentRun

apiVersion: sympozium.ai/v1alpha1
kind: AgentRun
metadata:
  name: llama-server-test
spec:
  instanceRef: llama-server-agent
  task: "List all pods across every namespace and summarise their status."
  model:
    provider: llama-server
    model: unsloth/gemma-4-26B-A4B-it-GGUF
    baseURL: "http://172.18.0.1:8080/v1"
    authSecretRef: llama-server-key
  skills:
    - k8s-ops
  timeout: "5m"

Save the manifest as llama-server-test.yaml, then apply it and watch the run:

kubectl apply -f llama-server-test.yaml
kubectl get agentrun llama-server-test -w

The phase transitions: Pending -> Running -> Succeeded (or Failed).
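
In scripts or CI, a blocking wait is often more convenient than an interactive watch. The sketch below polls the run until it reaches one of the terminal phases named above. It assumes the phase is exposed at .status.phase, which this guide does not confirm, so check the field path against your installed CRD. The function name wait_for_agentrun and its defaults are illustrative.

```shell
# Poll an AgentRun until it reports Succeeded or Failed.
# Usage: wait_for_agentrun <name> [attempts] [delay_seconds]
# Returns 0 on Succeeded, 1 on Failed, 2 if no terminal phase was seen.
# ASSUMPTION: the phase lives at .status.phase on the AgentRun resource.
wait_for_agentrun() {
    wfa_name="$1"
    wfa_attempts="${2:-60}"
    wfa_delay="${3:-5}"
    wfa_i=1
    while [ "$wfa_i" -le "$wfa_attempts" ]; do
        wfa_phase="$(kubectl get agentrun "$wfa_name" \
            -o jsonpath='{.status.phase}' 2>/dev/null)" || wfa_phase=""
        case "$wfa_phase" in
            Succeeded) echo "$wfa_name: Succeeded"; return 0 ;;
            Failed)    echo "$wfa_name: Failed" >&2; return 1 ;;
        esac
        # Pending, Running, or not yet readable: wait and try again.
        if [ "$wfa_i" -lt "$wfa_attempts" ]; then
            sleep "$wfa_delay"
        fi
        wfa_i=$((wfa_i + 1))
    done
    echo "$wfa_name: no terminal phase after $wfa_attempts checks" >&2
    return 2
}
```

With the defaults, `wait_for_agentrun llama-server-test` checks every 5 seconds for up to 5 minutes, which matches the 5m timeout in the manifest.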


Network policies

The default Sympozium network policies may not include egress on port 8080 for agent pods. If needed, add an egress rule:

# Allow llama-server (default port 8080)
- to: []
  ports:
    - protocol: TCP
      port: 8080

Add this to both sympozium-agent-allow-egress and sympozium-agent-server-allow-egress, then apply:

kubectl apply -f config/network/policies.yaml

Sandbox note: Pods with sympozium.ai/sandbox: "true" use the sympozium-sandbox-restricted policy that only allows DNS and localhost IPC. Sandboxed agents cannot reach llama-server directly.


Using with PersonaPacks

You can point an entire PersonaPack at llama-server by setting baseURL during onboarding:

apiVersion: sympozium.ai/v1alpha1
kind: PersonaPack
metadata:
  name: my-team
spec:
  baseURL: "http://172.18.0.1:8080/v1"
  authRefs:
    - provider: llama-server
      secret: llama-server-key
  personas:
    - name: assistant
      displayName: "Local Assistant"
      systemPrompt: |
        You are a helpful assistant running on local hardware.
      skills:
        - k8s-ops
      schedule:
        type: heartbeat
        interval: "1h"
        task: "Check cluster health."

llama-server vs LM Studio vs Ollama

| Feature | llama-server | LM Studio | Ollama |
|---|---|---|---|
| Interface | CLI | Desktop GUI | CLI + API |
| Default port | 8080 | 1234 | 11434 |
| Model format | GGUF / HF download | GGUF | Ollama-native (GGUF) |
| GPU support | CUDA, Metal, Vulkan | Metal, CUDA | Metal, CUDA |
| Tool calling | Supported (model dependent) | Supported (model dependent) | Supported (model dependent) |
| In-cluster deployment | Container image available | Not supported (desktop app) | Supported (container image) |
| Node discovery | Auto-discovery via node-probe | Manual baseURL | Auto-discovery via node-probe |
| Continuous batching | Yes | No | Yes |

Use llama-server when you want maximum control over inference parameters (context size, batch size, GPU layers) and are comfortable with CLI tooling. Use LM Studio for a GUI-first experience. Use Ollama for the simplest setup or in-cluster deployment.


Supported models

Any GGUF model works with llama-server, whether it is loaded from a local file or pulled from a Hugging Face repository with -hf. Popular choices:

| Model | Parameters | Tool calling | Notes |
|---|---|---|---|
| Meta-Llama-3-8B-Instruct | 8B | Yes | Good general-purpose model |
| Meta-Llama-3-70B-Instruct | 70B | Yes | Higher quality, needs more VRAM |
| Qwen2.5-7B-Instruct | 7B | Yes | Strong tool-calling support |
| gemma-4-26B-A4B-it | 26B (4B active) | Yes | Efficient MoE architecture |
| DeepSeek-R1-Distill | 7B | No | Reasoning model, no tool use |

Tool calling: Sympozium agents rely on tool calling to execute commands, read files, and interact with the cluster. Models without tool-calling support can still answer questions but cannot use skills or execute actions.


Local development note

llama-server's default port (8080) may conflict with the Sympozium API server during local development. The make dev target runs the API on port 8081 by default, leaving port 8080 free for llama-server. If you encounter a port conflict, either start llama-server on a different port (--port 8087) or override the API port: API_ADDR=:8082 make dev.


Troubleshooting

Agent pod fails to connect to llama-server

Symptom: AgentRun fails with a connection refused or timeout error.

Check the server is running:

curl http://172.18.0.1:8080/health

If this fails, ensure llama-server is started and bound to 0.0.0.0.

Check from inside the cluster:

kubectl run -it --rm debug --image=busybox -- wget -qO- http://172.18.0.1:8080/v1/models

Model not loading

Symptom: llama-server exits with an error or runs out of memory.

  • Use a smaller quantization (e.g. Q4_K_M instead of Q8_0)
  • Reduce --n-gpu-layers to offload fewer layers to GPU
  • Reduce --ctx-size to lower memory usage
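
To judge whether a smaller quantization is likely to fit, a rough estimate of the weight size helps: parameters multiplied by bits per weight, divided by 8. The sketch below does that arithmetic. It covers weights only and ignores the KV cache (which grows with --ctx-size) and runtime overhead, so treat the result as a lower bound. Q8_0 stores 8.5 bits per weight; the figure of roughly 4.8 to 4.9 bits for Q4_K_M is an approximation and varies by model. The helper name estimate_weights_gb is hypothetical.

```shell
# Estimate the size of the model weights in GB (10^9 bytes).
# Usage: estimate_weights_gb <params_in_billions> <bits_per_weight>
# Weights only: the KV cache and runtime overhead are NOT included.
estimate_weights_gb() {
    awk -v params_b="$1" -v bpw="$2" \
        'BEGIN { printf "%.1f\n", params_b * bpw / 8 }'
}
```

For the 26B model used in this guide, Q8_0 works out to roughly 27.6 GB of weights, while a quantization near 4.85 bits per weight comes to roughly 15.8 GB. For a mixture-of-experts model, use the total parameter count rather than the active count, since all experts are normally kept in memory.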

Network policy blocking traffic

Symptom: Agent pods time out, but llama-server is reachable from the host.

Verify the egress rule for port 8080 is in place:

kubectl get networkpolicy -A | grep sympozium
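
Listing the policies shows that they exist but not which ports they allow. A minimal sketch of a port check follows, reading the egress ports from the policy spec with a jsonpath query. The helper name policy_allows_egress_port is hypothetical, and the namespace argument is optional because this guide does not state where the policies are installed.

```shell
# Succeed if the named NetworkPolicy lists the port in any egress rule.
# Usage: policy_allows_egress_port <policy> <port> [namespace]
# Returns 0 if listed, 1 if not listed, 2 if the policy cannot be read.
# Caveat: an egress rule with no ports section allows every port but
# will not be detected by this check.
policy_allows_egress_port() {
    pae_ports="$(kubectl get networkpolicy "$1" ${3:+-n "$3"} \
        -o jsonpath='{.spec.egress[*].ports[*].port}' 2>/dev/null)" || return 2
    for pae_p in $pae_ports; do
        if [ "$pae_p" = "$2" ]; then
            return 0
        fi
    done
    return 1
}
```

For example, `policy_allows_egress_port sympozium-agent-allow-egress 8080 || echo "port 8080 is not listed"` checks the first of the two policies mentioned in the Network policies section.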

Slow responses

  • Use a smaller or more quantized model
  • Increase the AgentRun timeout: timeout: "15m"
  • Enable --flash-attn on for faster attention computation
  • Enable --cont-batching for better throughput with concurrent requests
  • Ensure GPU offloading is active (--n-gpu-layers 999)