
llama-server setup

I added llama-server as a provider because, in my testing, neither Ollama nor LM Studio can control how long a model spends thinking. The result was 30-minute waits that always fell back to standard mode. llama-server is the only local inference tool with deterministic reasoning-budget control, via --reasoning-budget, enforced at the sampling pipeline level.

This is for users comfortable with the command line. If you prefer a graphical interface, use Ollama or LM Studio. They work well in standard analysis mode.


Step 1: Install llama-server

Mac (Apple Silicon)

brew install llama.cpp

Metal GPU acceleration is automatic on Apple Silicon. No extra drivers needed. Update later with brew upgrade llama.cpp.

To build from source instead (for custom builds or bleeding-edge features):

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release

Windows

winget install llama.cpp

The winget package uses Vulkan, which works with both NVIDIA and AMD GPUs. Run llama-server directly from the command prompt after install.

For NVIDIA GPUs, CUDA is typically faster than Vulkan. Build from source with -DGGML_CUDA=ON for CUDA support, or use the automated PowerShell installer (Danmoreng/llama.cpp-installer) which can detect your GPU and build with the correct CUDA version.


Step 2: Launch the server

In a terminal (the backslash line continuations below are for macOS/Linux shells; on Windows, enter the command on a single line):

llama-server \
-hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--flash-attn on \
--ctx-size 32768 \
--jinja \
--port 9090 \
--reasoning-budget 2048 \
--host 0.0.0.0 \
--parallel 1
Model Loaded

Confirm that the model has been loaded and all slots are idle.

Partial output:

ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.019 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8 (1008)
...
llama_model_loader: - kv 48: quantize.imatrix.file str = Qwen3.5-35B-A3B-GGUF/imatrix_unsloth....
llama_model_loader: - kv 49: quantize.imatrix.dataset str = unsloth_calibration_Qwen3.5-35B-A3B.txt
llama_model_loader: - kv 50: quantize.imatrix.entries_count u32 = 510
llama_model_loader: - kv 51: quantize.imatrix.chunks_count u32 = 76
...
srv init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://0.0.0.0:9090
main: starting the main loop...
srv update_slots: all slots are idle

Step 3: Verify llama-server is running

Open http://localhost:9090 in your browser. Type a message and you should see a response with a collapsible "Reasoning" section above it showing the model's thinking trace.

To verify via curl:

curl http://localhost:9090/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5",
"messages": [
{"role": "user", "content": "What is 25 times 47?"}
]
}'

Example output:

{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"25 times 47 is **1,175**.\n\nHere is the breakdown:\n$$25 \\times 40 = 1,000$$\n$$25 \\times 7 = 175$$\n$$1,000 + 175 = 1,175$$","reasoning_content":"Thinking Process:\n\n1.  **Analyze the Request:** The user is asking for the product of 25 and 47 (25 times 47).\n\n2.  **Perform the Calculation:**\n    *   Method 1: Standard multiplication.\n        $$47 \\times 25$$\n        $$47 \\times 20 = 940$$\n        $$47 \\times 5 = 235$$\n        $$940 + 235 = 1175$$\n    *   Method 2: Using the property of 25.\n        $$25 \\times 47 = 25 \\times (4 \\times 12 - 1)$$? No, simpler:\n        $$25 \\times 47 = 25 \\times (48 - 1)$$? No.\n        $$25 \\times 47 = (100 / 4) \\times 47 = 100 \\times 47 / 4 = 4700 / 4$$\n        $$4700 / 2 = 2350$$\n        $$2350 / 2 = 1175$$\n    *   Method 3: Vertical multiplication.\n          47\n        x 25\n        ----\n         235  (5 * 47)\n         940  (20 * 47)\n        ----\n        1175\n\n3.  **Verify the Result:**\n    *   $$25 \\times 40 = 1000$$\n    *   $$25 \\times 7 = 175$$\n    *   $$1000 + 175 = 1175$$\n    *   Result is consistent.\n\n4.  **Formulate the Output:** State the answer clearly.\n\n5.  **Final Output:** 1175.cw\n"}}],"created":1774340570,"model":"unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M","system_fingerprint":"b8390-b6c83aad5","object":"chat.completion","usage":{"completion_tokens":516,"prompt_tokens":20,"total_tokens":536},"id":"chatcmpl-D1FZCYbFocAGXxQLIUscJQWgrxonnDNz","timings":{"cache_n":0,"prompt_n":20,"prompt_ms":296.017,"prompt_per_token_ms":14.80085,"prompt_per_second":67.56368722066638,"predicted_n":516,"predicted_ms":7545.334,"predicted_per_token_ms":14.62274031007752,"predicted_per_second":68.3866347069593}}

Try a reasoning-heavy prompt (maths, logic puzzle) to confirm the thinking trace appears and caps at 2048 tokens rather than running indefinitely.
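This check can be scripted. The sketch below assumes the response shape shown above (an OpenAI-compatible chat completion with a reasoning_content field); the check_reasoning helper and the trimmed sample are illustrative, not part of the platform.

```python
def check_reasoning(response_json: dict) -> tuple[bool, int]:
    """Return whether a thinking trace is present and how many
    completion tokens (reasoning plus answer) were generated."""
    msg = response_json["choices"][0]["message"]
    has_trace = bool(msg.get("reasoning_content"))
    tokens = response_json["usage"]["completion_tokens"]
    return has_trace, tokens

# Trimmed version of the response shown above.
sample = {
    "choices": [{"message": {"role": "assistant",
                             "content": "25 times 47 is 1,175.",
                             "reasoning_content": "25 x 40 = 1000; 25 x 7 = 175"}}],
    "usage": {"completion_tokens": 516, "prompt_tokens": 20, "total_tokens": 536},
}
has_trace, tokens = check_reasoning(sample)
```

If has_trace is False, the most common cause is launching without --jinja.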


Step 4: Configure the coaching platform

  1. Open the coaching platform and go to Settings > AI
  2. Select llama-server (Advanced) as your provider
  3. Enter the base URL from the table below
  4. Detect models and select the same model that the server was started with
  5. If you set an API key on the server, enter it in the API key field. If you did not use --api-key, leave it blank
  6. Set as active provider
  7. Click Save

Thinking is enabled by default for this provider. The server enforces the reasoning budget.

Edition | llama-server URL | Notes
Desktop (same Mac) | http://localhost:9090/v1 | llama-server and the coaching platform run on the same machine
Server (Docker on same Mac) | http://host.docker.internal:9090/v1 | Docker's special address for reaching the host machine
Server or K8s (different machine) | http://MACHINE-IP:9090/v1 | Replace with the IP of the machine running llama-server. Requires --host 0.0.0.0

To test the connection, open any session with a recording and transcript, then click Generate AI Assessment.
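A common misconfiguration is entering the base URL without the /v1 suffix. A hypothetical helper (not part of the platform) that normalises whatever is typed:

```python
def normalize_base_url(url: str) -> str:
    """Ensure the base URL carries the /v1 suffix expected by
    OpenAI-compatible clients, e.g.
    http://localhost:9090 -> http://localhost:9090/v1."""
    url = url.rstrip("/")
    if not url.endswith("/v1"):
        url += "/v1"
    return url
```

This makes "http://localhost:9090", "http://localhost:9090/" and "http://localhost:9090/v1" all resolve to the same endpoint.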


Additional information

Command flags

  • -hf handles the download automatically on first launch. Two files are downloaded: the main model (~21 GB for Q4_K_M) and the vision projector (~857 MB). You will see a progress bar for each. This only happens once. Subsequent launches start immediately from the cache at ~/Library/Caches/llama.cpp/ on macOS or %LOCALAPPDATA%\llama.cpp\ on Windows
  • --jinja is required. It enables the chat template that makes reasoning mode work. Without it, the model will not produce a thinking trace
  • --reasoning-budget 2048 caps thinking at 2048 tokens. This is a good default for ACC and PCC assessments. For MCC assessments (39 markers), increase to 4096 for more thorough reasoning. 0 disables thinking, -1 is unlimited (not recommended)
  • --temp 0.6 — Controls how creative or random the AI's responses are. Lower values (closer to 0) make it more focused and predictable. 0.6 is a good balance between consistency and natural-sounding language
  • --top-p 0.95 — Works alongside temperature. It tells the AI to only consider words that make up the top 95% of likely choices, cutting out very unlikely ones. This helps keep responses coherent
  • --top-k 20 — Another creativity control. At each step, the AI only considers its top 20 word choices. This prevents it from picking something completely off-track
  • --min-p 0.00 — A cutoff for how unlikely a word can be before it's excluded. Set to 0.00, meaning no extra filtering. The other settings above are doing the work here
  • --flash-attn on — A speed optimisation. It makes the AI process text faster and use less memory, with no change to the quality of responses
  • --ctx-size 32768 — The AI's "working memory" — how much text it can read and write in one go (about 25,000 words). This needs to be large enough to fit the coaching transcript plus the assessment output
  • --port 9090 — The port the server listens on (default is 8080)
  • --host 0.0.0.0 — Makes the server available to other devices on your network, not just your own computer. Remove this flag and the server will only listen on localhost
  • --parallel 1 sets a single request slot, so the server handles one request at a time. After the download, the server loads the model onto the GPU and starts serving

The sampling parameters (--temp, --top-p, --top-k, --min-p) are server defaults. The coaching platform sends its own values in each request, so these mainly affect the built-in web UI. The two parameters that matter most for the coaching platform are --reasoning-budget and --ctx-size.
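To illustrate the per-request override, here is a sketch of a chat-completions payload in the OpenAI-compatible schema that llama-server exposes; build_request and its values are illustrative, not the platform's actual payload.

```python
def build_request(transcript: str, temp: float = 0.6,
                  top_p: float = 0.95, top_k: int = 20) -> dict:
    """Build a chat-completions payload. Sampling fields sent per
    request override the server-side defaults set with
    --temp / --top-p / --top-k."""
    return {
        "model": "qwen3.5",  # the name is informational to llama-server
        "messages": [{"role": "user", "content": transcript}],
        "temperature": temp,
        "top_p": top_p,
        "top_k": top_k,  # llama-server accepts this extension to the schema
    }

payload = build_request("Assess this coaching transcript ...")
```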

Model downloads

GGUF model files are hosted on Hugging Face. No account is needed (Apache 2.0 licence, no gating). You can either download manually or let llama-server pull the model automatically on first launch (see Step 2).

For coaching assessments, Qwen3.5 35B A3B is the recommended model. It is a mixture-of-experts architecture that keeps active parameters low while maintaining quality.

Quant | File size | Good for
Q4_K_M | ~22 GB | Best balance of quality and size for most systems
Q5_K_M | ~26 GB | Slightly better quality if you have room
Q3_K_M | ~16 GB | Tight on memory
UD-Q4_K_XL | ~22 GB | Unsloth's dynamic quant, tested well in benchmarks
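As a rough guide to choosing a quant, the sketch below uses the approximate sizes from the table; the 8 GB headroom figure for KV cache and the OS is an illustrative assumption, not a measurement.

```python
# Approximate download sizes (GB) from the table above.
QUANT_SIZES_GB = {"Q3_K_M": 16, "Q4_K_M": 22, "Q5_K_M": 26, "UD-Q4_K_XL": 22}

def fits_in_memory(quant: str, ram_gb: float, headroom_gb: float = 8.0) -> bool:
    """Rough check: model weights plus headroom for the KV cache
    and the OS must fit in unified memory / RAM."""
    return QUANT_SIZES_GB[quant] + headroom_gb <= ram_gb

# Largest quant that fits on a hypothetical 32 GB machine.
choice = max((q for q in QUANT_SIZES_GB if fits_in_memory(q, 32)),
             key=QUANT_SIZES_GB.get, default=None)
```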

To download manually via the Hugging Face CLI:

huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF \
--include "Qwen3.5-35B-A3B-Q4_K_M.gguf" \
--local-dir ./models

Or download directly from your browser at huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF.

Choosing a reasoning budget

Budget | Use case | Typical time
0 | No thinking (standard mode) | ~2-3 min
1024 | Quick reasoning for routine sessions | ~5-8 min
2048 | Balanced, good default for ACC and PCC assessments | ~8-15 min
4096 | Thorough reasoning, recommended for MCC (39 markers) | ~15-25 min
-1 | Unlimited (not recommended) | Unpredictable

--reasoning-budget is server-level and applies to all requests. For different budgets per task, run multiple instances on different ports.
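For example, a client could route tasks to per-budget instances like this; the task names, ports 9091/9092, and the routing table are made up for illustration.

```python
# Hypothetical routing: one llama-server instance per reasoning budget,
# each started with its own --port and --reasoning-budget.
INSTANCES = {
    "standard": {"port": 9090, "budget": 0},
    "acc_pcc":  {"port": 9091, "budget": 2048},
    "mcc":      {"port": 9092, "budget": 4096},
}

def base_url_for(task: str, host: str = "localhost") -> str:
    """Pick the base URL for the instance whose budget suits the task."""
    inst = INSTANCES[task]
    return f"http://{host}:{inst['port']}/v1"
```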

The launch command above uses --parallel 1, so requests are handled one at a time. Increasing --parallel adds request slots so that multiple assessments can run concurrently; on the K8s edition with multiple coaches, assessments triggered by different users are then handled without queuing. Note that --ctx-size is the total context and is divided evenly across slots, so size it accordingly.
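Because the total --ctx-size is divided across slots, it is worth checking the per-slot window before raising --parallel:

```python
def ctx_per_slot(ctx_size: int, parallel: int) -> int:
    """llama-server splits --ctx-size evenly across --parallel slots,
    so each concurrent request gets ctx_size // parallel tokens."""
    return ctx_size // parallel

# The launch command above: 32768 tokens, one slot.
single = ctx_per_slot(32768, 1)
# With four slots, each assessment would get a quarter of the window.
quad = ctx_per_slot(32768, 4)
```

At four slots, 8192 tokens per request may be too small for a long transcript plus the assessment output; raise --ctx-size in proportion.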

Current limitations

  • Server-level budget only (no per-request control yet, being discussed upstream)
  • Some models may ignore the budget flag (Qwen3.5 works correctly)
  • Per-request thinking can be toggled on/off via the API but per-request budget amounts are not yet supported

Version notes

  • --flash-attn now requires an explicit value: on, off, or auto. Older tutorials may show it as a bare flag without a value, which no longer works.
  • --reasoning-budget with positive values (e.g. 2048) requires build 8300+ (March 2026). Homebrew and winget packages include this as of mid-March 2026. Run llama-server --version to check your build number.
  • Confirmed working on build 8390 with Qwen3.5-35B-A3B on Apple M2 Ultra (173 GB unified memory).

Troubleshooting

"Connection refused" error

  • Check that llama-server is running (you should see output in the terminal where you launched it)
  • Verify the URL matches your setup (see table above)
  • Check that the model finished loading (wait for the "all slots are idle" message in the server log)

Assessment takes a long time

  • Reduce --reasoning-budget to 1024 or 0 for faster results
  • Check that no other memory-intensive applications are competing for GPU

Model not found

  • If using -hf, check your internet connection (the model downloads on first launch)
  • If using a local GGUF file, verify the file path is correct