Skip to main content
Glama
pozii
by pozii

TokenSaver MCP

Cut your AI API costs by up to 97% — without changing a single prompt.

License Python MCP Tests

An MCP (Model Context Protocol) server that gives AI agents ten tools to measure, compress, cache, and prune token usage — so developers on limited plans can do more with less.


Why TokenSaver?

Every API call sends more tokens than necessary. Conversation history accumulates. Web pages arrive as raw HTML. Tool results get re-fetched on every turn. System prompts bloat over iterations.

TokenSaver intercepts each of these patterns and fixes them at the agent level — no model changes, no prompt engineering, no plan upgrades.

Scenario

Before

After

Saved

10-turn conversation history

40,000 tokens

8,000 tokens

80%

Webpage fetch (raw HTML)

22,000 tokens

1,200 tokens

94%

Bloated system prompt

600 tokens

220 tokens

63%

Repeated tool call (cached)

1,500 tokens

50 tokens

97%


Related MCP server: Manus Credit Optimizer

Tools

Tool

What it does

count_tokens

Measure token cost before sending — decide whether to compress first

compress_context

Shrink long text or conversation history with offline LSA summarization

cache_store / cache_get / cache_invalidate

Persist tool results to disk with TTL — never run the same lookup twice

extract_webpage

Fetch a URL and return only the readable content, not raw HTML

summarize_file

Get a structural + content summary of any file or directory

prune_conversation

Remove filler turns and compress old messages in conversation history

optimize_prompt

Shorten verbose system prompts while preserving constraints

advise_context_window

Diagnose token bloat and get targeted recommendations

All tools work fully offline — no API key required for core features.


Installation

git clone https://github.com/pozii/tokensaver.git
cd tokensaver
pip install -e .

Python 3.11+ required. On first use, compress_context will auto-download the NLTK punkt_tab tokenizer (~2 MB) if not already present.


How it connects to your AI client

TokenSaver has no URL and runs no background server by default. It uses stdio transport: the AI client reads your config, spawns python -m tokensaver as a child process, and talks to it through stdin/stdout. You never open a port or start anything manually — the client does it for you when it launches.

Your AI client  ──spawn──▶  python -m tokensaver  ──stdio──▶  tools available

The alternative is SSE transport, where you start the server yourself on a local port and the client connects over HTTP. This is useful for multi-agent setups or when multiple clients share the same server instance.


Setup

Claude Desktop

Config file location:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json

  • Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "tokensaver": {
      "command": "python",
      "args": ["-m", "tokensaver"]
    }
  }
}

Save the file and restart Claude Desktop. The tokensaver tools will appear in the tool list.

Claude Code

claude mcp add tokensaver -- python -m tokensaver

Or add manually to ~/.claude/settings.json:

{
  "mcpServers": {
    "tokensaver": {
      "command": "python",
      "args": ["-m", "tokensaver"]
    }
  }
}

OpenCode

Config file: ~/.config/opencode/config.json

{
  "mcp": {
    "servers": {
      "tokensaver": {
        "type": "local",
        "command": ["python", "-m", "tokensaver"]
      }
    }
  }
}

Any MCP-compatible client (SSE mode)

Start the server once:

python -m tokensaver --transport sse --port 8765

Then point your client at:

http://localhost:8765/sse

Fixing Python path issues

If your system has multiple Python versions and python resolves to the wrong one, use the full path:

# Find the right Python
which python3        # macOS / Linux
where python         # Windows

Then use the full path in your config:

{
  "mcpServers": {
    "tokensaver": {
      "command": "/usr/local/bin/python3",
      "args": ["-m", "tokensaver"]
    }
  }
}
{
  "mcpServers": {
    "tokensaver": {
      "command": "C:\\Python314\\python.exe",
      "args": ["-m", "tokensaver"]
    }
  }
}

Usage

Each turn:
  1. count_tokens          → How large is my current context?
  2. advise_context_window → Am I approaching the model's limit?

Before expensive tool calls:
  3. cache_get             → Did I already run this?

When fetching web content:
  4. extract_webpage       → Clean text, not raw HTML

When history grows long:
  5. prune_conversation    → Drop filler turns, compress old ones
  6. compress_context      → Shrink large injected context blocks

When writing system prompts:
  7. optimize_prompt       → Remove redundant phrasing

Tool reference

{
  "content": "Some long text or list of messages...",
  "model": "claude-sonnet-4",
  "include_message_overhead": true
}

Returns token_count, encoding_used, model. Accepts a plain string or an OpenAI-format message list.

{
  "text": "3,000-token context block...",
  "target_tokens": 600,
  "mode": "extractive"
}

extractive (default) uses LSA sentence ranking — free, offline, no API call.
abstractive uses claude-haiku for higher quality — requires ANTHROPIC_API_KEY.

Returns compressed, original_tokens, compressed_tokens, reduction_pct.

# Standard pattern: check before running
key = cache_key("extract_webpage", {"url": "https://example.com"})
hit = cache_get(key=key)

if not hit["hit"]:
    result = extract_webpage(url="https://example.com")
    cache_store(key=key, value=str(result), ttl_seconds=3600)

Cache is stored on disk at ~/.tokensaver/cache/ and survives server restarts.

{
  "url": "https://example.com/article",
  "max_tokens": 2000,
  "include_links": false,
  "include_metadata": true
}

Uses trafilatura with BeautifulSoup as fallback. Returns content, title, token_count, truncated.

{
  "path": "/home/user/myproject",
  "mode": "both",
  "max_tokens": 500,
  "file_extensions": [".py", ".md"],
  "max_depth": 3
}

mode options: "structure" (tree only), "content" (summarized text), "both".

{
  "messages": [...],
  "max_output_tokens": 2000,
  "keep_last_n": 4,
  "prune_strategy": "hybrid"
}

"remove" drops filler turns ("Sure!", "Got it.").
"compress" summarizes older turns in place.
"hybrid" does both — recommended for most cases.

Returns the pruned messages list, original_tokens, pruned_tokens, counts of removed/compressed turns.

{
  "prompt": "Please make sure to always answer questions...",
  "optimization_level": "medium",
  "preserve_constraints": true,
  "output_format": "prose"
}

"light" removes filler phrases. "medium" deduplicates sentences. "aggressive" restructures.
preserve_constraints: true always keeps sentences containing never, must, always, do not.

{
  "model": "gpt-4o",
  "current_tokens": 110000,
  "messages": [...],
  "target_utilization": 0.75
}

Returns status ("ok" / "warning" / "critical"), headroom_tokens, prioritized recommendations, and a per-turn breakdown sorted by token cost.

Supports: GPT-4o, GPT-4o-mini, Claude 3–4 series, Gemini 1.5/2.0/2.5, O1/O3, Llama 3, Mistral.


Optional: LLM-backed summarization

For higher-quality abstractive compression on very large texts (>5,000 tokens):

pip install "tokensaver-mcp[llm]"

Set ANTHROPIC_API_KEY in your environment or a .env file, then use mode: "abstractive" in compress_context.


Running Tests

pip install -e ".[dev]"
python -m pytest tests/ -v

38 tests — all offline, no API key or network required.


Project Structure

src/tokensaver/
  server.py          # FastMCP app, tool registration
  models.py          # Context window table, shared types
  tools/
    counter.py       # count_tokens
    compress.py      # compress_context
    cache.py         # cache_store / cache_get / cache_invalidate
    extractor.py     # extract_webpage
    summarizer.py    # summarize_file
    pruner.py        # prune_conversation
    optimizer.py     # optimize_prompt
    advisor.py       # advise_context_window
  utils/
    token_utils.py   # tiktoken wrapper
    text_utils.py    # sentence splitting, deduplication
Install Server
A
license - permissive license
A
quality
C
maintenance

Maintenance

Maintainers
Response time
Release cycle
Releases (12mo)
Commit activity

Resources

Unclaimed servers have limited discoverability.

Looking for Admin?

If you are the server author, to access and configure the admin panel.

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/pozii/tokensaver'

If you have feedback or need assistance with the MCP directory API, please join our Discord server