Skip to main content
Glama

mcp-llama-swap

PyPI version License Python 3.10+

Hot-swap llama.cpp models inside a running Claude Code session. No context loss. One command.

Plan with a reasoning model. Implement with a coding model. Same session, same context, zero manual overhead.

Supports macOS (launchctl) and Linux (systemd).

Why

Running local LLMs means choosing between a strong reasoning model and a fast coding model. You can't load both on a single machine. Manually swapping models kills your conversation context and flow.

mcp-llama-swap solves this by giving Claude Code a tool to swap the model behind llama-server via your system's service manager (launchctl on macOS, systemd on Linux), while preserving the full conversation history client-side.

Quick Start

Install

# Option A: Run directly with uvx (no install needed)
uvx mcp-llama-swap

# Option B: Install from PyPI
pip install mcp-llama-swap

Configure Claude Code

Add to ~/.claude.json:

{
  "mcpServers": {
    "llama-swap": {
      "command": "uvx",
      "args": ["mcp-llama-swap"],
      "env": {
        "LLAMA_SWAP_CONFIG": "/path/to/config.json"
      }
    }
  }
}

Configure Models

Create config.json (macOS):

{
  "plists_dir": "~/.llama-plists",
  "health_url": "http://localhost:8000/health",
  "health_timeout": 30,
  "models": {
    "planner": "qwen35-thinking.plist",
    "coder": "qwen3-coder.plist",
    "fast": "glm-flash.plist"
  }
}

Or on Linux:

{
  "services_dir": "~/.llama-services",
  "health_url": "http://localhost:8000/health",
  "health_timeout": 30,
  "models": {
    "planner": "llama-server-planner.service",
    "coder": "llama-server-coder.service"
  }
}

Use

Inside Claude Code:

You: list models
You: swap to planner
You: <discuss architecture, define interfaces>
You: swap to coder and implement the plan

That's it. Context is preserved across swaps.

You can also generate new model configs directly:

You: create a model config named "reasoning" for /models/qwen3-30b.gguf with 8192 context

How It Works

Claude Code CLI
    |
    | Anthropic Messages API
    v
LiteLLM Proxy (:4000)         <-- translates Anthropic -> OpenAI format
    |
    | OpenAI Chat Completions API
    v
llama-server (:8000)          <-- model weights swapped via service manager
    ^
    |
mcp-llama-swap                <-- this project (launchctl or systemd)

Claude Code speaks Anthropic format. LiteLLM translates to OpenAI format for llama-server. This MCP server manages which model service is loaded via launchctl (macOS) or systemd (Linux).

Conversation context survives swaps because Claude Code holds the full message history client-side and re-sends it with every request.

Model Configuration

Define aliases for your models. Only mapped models are available. Other service configs in the directory are ignored.

macOS:

{
  "plists_dir": "~/.llama-plists",
  "health_url": "http://localhost:8000/health",
  "health_timeout": 30,
  "models": {
    "planner": "qwen35-35b-a3b-thinking.plist",
    "coder": "qwen3-coder.plist",
    "fast": "glm-4-7-flash.plist"
  }
}

Linux:

{
  "services_dir": "~/.llama-services",
  "health_url": "http://localhost:8000/health",
  "health_timeout": 30,
  "models": {
    "planner": "llama-server-planner.service",
    "coder": "llama-server-coder.service"
  }
}

Swap using your aliases: "swap to coder", "swap to planner".

Directory Mode

Set "models": {} to auto-discover all service configs. Filenames (without extension) become the aliases.

macOS:

{
  "plists_dir": "~/.llama-plists",
  "models": {}
}

Linux:

{
  "services_dir": "~/.llama-services",
  "models": {}
}

MCP Tools

Tool

Description

list_models

Lists all configured models with load status and current mode

get_current_model

Returns the alias of the currently loaded model

swap_model

Unloads current model, loads the specified one, waits for health check

create_model_config

Generates a new launchd plist (macOS) or systemd unit (Linux) for a model

MCP Resources

Resource

Description

llama-swap://config

Current configuration as JSON

llama-swap://status

Current model status, health, and platform info

MCP Prompts

Prompt

Description

swap-workflow

Guided plan-then-implement workflow template

Full Setup Guide

Prerequisites

  • macOS with launchctl, or Linux with systemd

  • llama-server (llama.cpp) installed

  • Model configurations as service files (launchd plists or systemd units)

  • Python 3.10+

  • Claude Code CLI pointed at a LiteLLM proxy

1. Install mcp-llama-swap

pip install mcp-llama-swap

2. Install and start LiteLLM proxy

pip install litellm

Create litellm_config.yaml:

model_list:
  - model_name: "*"
    litellm_params:
      model: "openai/*"
      api_base: "http://localhost:8000/v1"
      api_key: "sk-none"

litellm_settings:
  drop_params: true
  request_timeout: 300

Start it:

litellm --config litellm_config.yaml --port 4000

On macOS, you can use the included ai.litellm.proxy.plist.template to run it as a persistent launchd service (see setup.sh).

3. Point Claude Code at LiteLLM

Add to ~/.zshrc (macOS) or ~/.bashrc (Linux):

export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_API_KEY="sk-none"
export ANTHROPIC_MODEL="local"

4. Add MCP server to Claude Code

Add to ~/.claude.json:

{
  "mcpServers": {
    "llama-swap": {
      "command": "uvx",
      "args": ["mcp-llama-swap"],
      "env": {
        "LLAMA_SWAP_CONFIG": "/absolute/path/to/config.json"
      }
    }
  }
}

5. Create your config.json

Copy config.example.json (macOS) or config.example.linux.json (Linux) and edit with your model aliases and service filenames.

6. Create model service configs

You can create service configs manually, or use the create_model_config MCP tool inside Claude Code:

You: create a model config named "coder" for /path/to/model.gguf with 8192 context

This generates the appropriate launchd plist (macOS) or systemd unit file (Linux) in your services directory.

Automated Setup (macOS)

If you prefer a one-shot setup on macOS, clone this repo and run:

git clone https://github.com/oussama-kh/mcp-llama-swap.git ~/mcp-llama-swap
cd ~/mcp-llama-swap
chmod +x setup.sh
./setup.sh

The script creates a virtual environment, installs dependencies, configures the LiteLLM launchd service, and prints the exact config to add.

Configuration Reference

config.json fields:

Field

Default

Description

services_dir

~/.llama-plists (macOS) / ~/.llama-services (Linux)

Directory containing model service configs

plists_dir

macOS alias for services_dir (backwards compatible)

units_dir

Linux alias for services_dir

health_url

http://localhost:8000/health

llama-server health endpoint

health_timeout

30

Seconds to wait for health check after loading

models

{}

Alias-to-filename map. Empty = directory mode

platform

auto

Service manager: auto, launchctl, or systemd

launchctl_mode

legacy

macOS only: legacy (load/unload) or modern (bootstrap/bootout)

Override config path via the LLAMA_SWAP_CONFIG environment variable.

Platform Details

macOS (launchctl)

Models are managed as launchd services via plist files. Two launchctl modes are available:

  • Legacy (default): Uses launchctl load/unload/list. Works on all macOS versions.

  • Modern: Uses launchctl bootstrap/bootout/print. The officially supported API on newer macOS. Enable with "launchctl_mode": "modern" in config.

Linux (systemd)

Models are managed as systemd user services. Unit files in services_dir are symlinked to ~/.config/systemd/user/ and managed via systemctl --user start/stop.

Troubleshooting

LiteLLM not translating correctly: Check /tmp/litellm.stderr.log. Verify llama-server is running: curl http://localhost:8000/health.

Model swap times out: Increase health_timeout in config.json. Large models may need 30+ seconds to load weights into memory.

Claude Code cannot find the MCP server: Verify the LLAMA_SWAP_CONFIG path is absolute. Test directly: python -m mcp_llama_swap.

Mapped model not found: The service filename in models must match an actual file in your services directory.

systemd service won't start: Check journalctl --user -u llama-server-<name> for errors. Ensure llama-server is in your PATH.

launchctl modern mode issues: If bootstrap/bootout commands fail, fall back to "launchctl_mode": "legacy" in config.

Development

# Install with test dependencies
pip install -e ".[test]"

# Run tests
pytest -v

Use Case

This project enables a two-phase AI coding workflow entirely on local hardware:

  1. Planning phase: Load a reasoning model (e.g., Qwen3.5-35B-A3B with thinking). Discuss architecture, define interfaces, decompose requirements.

  2. Implementation phase: Swap to a coding model (e.g., Qwen3-Coder-30B). Execute the plan file by file with full conversation context from the planning phase.

No cloud APIs. No data leaving your machine. No context loss between phases.

License

Apache-2.0

Install Server
A
license - permissive license
A
quality
C
maintenance

Resources

Unclaimed servers have limited discoverability.

Looking for Admin?

If you are the server author, to access and configure the admin panel.

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/oussama-kh/mcp-llama-swap'

If you have feedback or need assistance with the MCP directory API, please join our Discord server