Skip to content

Using Unsloth with Sympozium

Sympozium supports Unsloth as an LLM provider. Unsloth is primarily a fine-tuning library, but its Run Tutorials walk you through serving fine-tuned (or stock) models over an OpenAI-compatible HTTP API via llama.cpp's llama-server or vLLM. Sympozium treats Unsloth exactly like any other OpenAI-compatible endpoint, so you can point an instance at a locally-running Unsloth model and drive it with skills, channels, and schedules like any cloud-backed agent.


Prerequisites

  • A running Kubernetes cluster (Kind, minikube, etc.)
  • Sympozium installed (sympozium install)
  • Unsloth installed on your host machine (see the Unsloth install docs)
  • A model exported to GGUF (for llama.cpp) or served directly (for vLLM)

Starting the Unsloth server

Unsloth itself is a training library — it does not ship its own serve endpoint. Follow one of Unsloth's run tutorials (e.g. Run Gemma 3) to serve a model over HTTP. Two common paths:

Option A — llama.cpp llama-server (GGUF)

After exporting your model to GGUF with Unsloth:

./llama-server \
  --model ./unsloth.Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --jinja

This exposes an OpenAI-compatible API at http://localhost:8080/v1.

Option B — vLLM

vllm serve unsloth/gemma-3-12b-it --host 0.0.0.0 --port 8000

This exposes an OpenAI-compatible API at http://localhost:8000/v1.

Bind to 0.0.0.0: Agent pods cannot reach 127.0.0.1 on the host — always bind the server to 0.0.0.0 (or explicitly to the host gateway IP).

Finding the host gateway IP

Kind:

docker exec kind-control-plane ip route | grep default | awk '{print $3}'

minikube:

minikube ssh -- ip route | grep default | awk '{print $3}'

The base URL

http://<host-gateway-ip>:8080/v1    # llama-server
http://<host-gateway-ip>:8000/v1    # vLLM

Verify reachability:

curl http://172.18.0.1:8080/v1/models

Creating a SympoziumInstance

Unsloth-served models do not require an API key, but authRefs is mandatory — create a Secret with a placeholder value.

kubectl create secret generic unsloth-key \
  --from-literal=OPENAI_API_KEY=not-needed
apiVersion: sympozium.ai/v1alpha1
kind: SympoziumInstance
metadata:
  name: unsloth-agent
spec:
  agents:
    default:
      model: unsloth/gemma-3-12b-it
      baseURL: "http://172.18.0.1:8080/v1"
  authRefs:
    - provider: unsloth
      secret: unsloth-key
  skills:
    - skillPackRef: k8s-ops
  policyRef: default-policy

Note: The model field should match the ID reported by /v1/models — for llama-server this is usually the GGUF filename or the alias you passed via --alias; for vLLM it is the HuggingFace repo ID you loaded.


Running an AgentRun

apiVersion: sympozium.ai/v1alpha1
kind: AgentRun
metadata:
  name: unsloth-test
spec:
  instanceRef: unsloth-agent
  task: "List all pods across every namespace and summarise their status."
  model:
    provider: unsloth
    model: unsloth/gemma-3-12b-it
    baseURL: "http://172.18.0.1:8080/v1"
    authSecretRef: unsloth-key
  skills:
    - k8s-ops
  timeout: "5m"
kubectl apply -f unsloth-test.yaml
kubectl get agentrun unsloth-test -w

The phase transitions: PendingRunningSucceeded (or Failed).

Because Unsloth runs locally, Sympozium applies local-provider timeouts automatically (5 min per request, 30 min per run, 2 retries).


Network policies

The default Sympozium network policies do not open egress on 8080 or 8000. You need to add an egress rule for whichever port your Unsloth server listens on.

Add to both sympozium-agent-allow-egress and sympozium-agent-server-allow-egress in config/network/policies.yaml:

# Allow Unsloth via llama-server (port 8080) or vLLM (port 8000)
- to: []
  ports:
    - protocol: TCP
      port: 8080
    - protocol: TCP
      port: 8000

Apply:

kubectl apply -f config/network/policies.yaml

Sandbox note: Pods with sympozium.ai/sandbox: "true" use the sympozium-sandbox-restricted policy that only allows DNS and localhost IPC. Sandboxed agents cannot reach Unsloth directly.


Node discovery

Sympozium's node-probe DaemonSet already probes port 8080 under the llama-cpp target name and port 8000 under the vllm target — both of which will detect an Unsloth-served model running on those ports. The discovered models appear under the corresponding provider annotation on the node. There is intentionally no separate unsloth node-probe target to avoid port conflicts with those existing targets.


Using with PersonaPacks

apiVersion: sympozium.ai/v1alpha1
kind: PersonaPack
metadata:
  name: my-team
spec:
  baseURL: "http://172.18.0.1:8080/v1"
  authRefs:
    - provider: unsloth
      secret: unsloth-key
  personas:
    - name: assistant
      displayName: "Unsloth Assistant"
      systemPrompt: |
        You are a helpful assistant running on a locally-served Unsloth model.
      skills:
        - k8s-ops
      schedule:
        type: heartbeat
        interval: "1h"
        task: "Check cluster health."

Unsloth vs LM Studio vs Ollama

Feature Unsloth LM Studio Ollama
Primary role Fine-tuning + serve via llama.cpp/vLLM GUI model server CLI model server
GUI None (Python / Jupyter) Full desktop app CLI-first
Default port 8080 (llama-server) or 8000 (vLLM) 1234 11434
Model format GGUF / HF / vLLM GGUF Ollama-native
Tool calling Depends on serve layer (--jinja for llama-server) Supported (model dependent) Supported (model dependent)
In-cluster deployment Custom (requires packaging) Not supported Supported
Strengths Fast fine-tuning of your own LoRA, then serve Easy model browsing In-cluster + auto-discovery

Use Unsloth when you've fine-tuned a model with Unsloth and want to run Sympozium agents against that exact model.


Troubleshooting

Agent pod fails to connect

Symptom: AgentRun fails with connection refused or timeout.

curl http://172.18.0.1:8080/v1/models

If this fails, ensure your Unsloth serve process is running and bound to 0.0.0.0.

Tool calls never arrive

Symptom: Agent chats but never invokes skills.

Make sure llama-server was started with --jinja (for Gemma/Qwen/Llama3 tool-calling templates). Without this flag, tool-call JSON is emitted as plain text and never parsed into structured tool_calls.

Slow responses

  • Use a smaller quant (Q4_K_M instead of Q8_0)
  • Increase AgentRun timeout: timeout: "15m"
  • Verify GPU offload (--n-gpu-layers for llama-server)