llama_server
Manage a local llama.cpp server: start with specific model and cache types, stop the process, or check its status.
Instructions
Manage a local llama.cpp server. "start" launches it with a specific model, KV cache type (turbo2/turbo3/turbo4 for TurboQuant), number of GPU layers, and context size. The server runs alongside LM Studio on a different port. Use "status" to check whether it is running and "stop" to kill it. Once it has started, use local_llm_run with the server's endpoint to query it.
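A typical session can be sketched as the argument objects a client might send for each action. The parameter names are taken from the input schema; the model path is a placeholder, `flash_attention` is assumed to be a boolean, and `check_args` is a hypothetical helper that only mirrors the two documented rules (action is required, and model_path is required for "start"). It is not part of the tool.

```python
VALID_ACTIONS = {"start", "stop", "status"}


def check_args(args: dict) -> dict:
    """Hypothetical client-side check mirroring the documented rules."""
    action = args.get("action")
    if action not in VALID_ACTIONS:
        raise ValueError(f"action must be one of {sorted(VALID_ACTIONS)}")
    if action == "start" and not args.get("model_path"):
        raise ValueError('model_path is required for "start"')
    return args


# Launch on the default port with asymmetric KV cache types.
start_args = check_args({
    "action": "start",
    "model_path": "/path/to/model.gguf",  # placeholder, substitute a real GGUF file
    "port": 8082,
    "cache_type_k": "q8_0",
    "cache_type_v": "turbo4",  # assumes a build that supports the turbo types
    "flash_attention": True,   # assumed to be a boolean switch
})

# Check on the server, then shut it down.
status_args = check_args({"action": "status"})
stop_args = check_args({"action": "stop"})
```

With the defaults above, the endpoint to hand to local_llm_run would presumably be on port 8082 of the local machine; the exact URL form is not specified here.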
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| port | No | Port to run the server on. The default avoids a conflict with LM Studio on 1234. | 8082 |
| action | Yes | "start" launches a llama-server process, "stop" kills it, "status" reports whether it is running. | |
| extra_args | No | Additional CLI arguments to pass to llama-server (e.g. ["--threads", "8"]). | |
| gpu_layers | No | Number of layers to offload to GPU (0 = CPU only). Use -1 for all layers. | |
| model_path | No | Path to GGUF model file. Required for "start". | |
| cache_type_k | No | KV cache type for keys. Options: q8_0, q4_0, f16, turbo2, turbo3, turbo4 (the last three from the TurboQuant fork). | f16 |
| cache_type_v | No | KV cache type for values. Same options as cache_type_k. An asymmetric config (e.g. q8_0 keys + turbo4 values) often gives the best results. | |
| context_size | No | Context window size in tokens. | 4096 |
| flash_attention | No | Enable flash attention (-fa). Recommended for long contexts. | |
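To make the parameters concrete, here is a rough sketch of how the "start" arguments could translate into a llama-server command line. This is an assumption about the mapping, not the tool's actual implementation. The function name `build_llama_server_argv` is hypothetical. The flags used (`-m`, `--port`, `-c`, `-ngl`, `--cache-type-k`, `--cache-type-v`, `-fa`) are standard llama-server options, though some builds expect a value after `-fa`. How the tool handles a `gpu_layers` value of -1 is not documented here, so the sketch passes it through unchanged.

```python
def build_llama_server_argv(args: dict) -> list[str]:
    """Sketch: turn schema-style arguments into a llama-server argv list."""
    if not args.get("model_path"):
        raise ValueError('model_path is required for "start"')

    argv = [
        "llama-server",
        "-m", args["model_path"],
        "--port", str(args.get("port", 8082)),             # documented default
        "-c", str(args.get("context_size", 4096)),         # documented default
        "--cache-type-k", args.get("cache_type_k", "f16"), # documented default
    ]
    # Only pass optional settings the caller actually supplied.
    if "cache_type_v" in args:
        argv += ["--cache-type-v", args["cache_type_v"]]
    if "gpu_layers" in args:
        argv += ["-ngl", str(args["gpu_layers"])]  # -1 passed through as-is (assumption)
    if args.get("flash_attention"):
        argv.append("-fa")
    # extra_args are appended verbatim at the end.
    argv += list(args.get("extra_args", []))
    return argv


argv = build_llama_server_argv({
    "model_path": "/path/to/model.gguf",  # placeholder path
    "gpu_layers": -1,
    "cache_type_k": "q8_0",
    "cache_type_v": "turbo4",
    "flash_attention": True,
    "extra_args": ["--threads", "8"],
})
print(" ".join(argv))
```

Building the command as a list rather than a single string avoids shell quoting problems if the model path contains spaces.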