PromptThrift MCP
Integrates with Ollama to enable local AI-powered conversation compression using Gemma 4 models. Automatically detects and utilizes locally running Ollama instances to generate high-quality summaries of conversation history, reducing token usage by 70-90% with zero API cost and 100% on-hardware processing.
PromptThrift MCP — Smart Token Compression for LLM Apps
Cut 70-90% of your LLM API costs with intelligent conversation compression. Now with Gemma 4 local compression — smarter summaries, zero API cost.
The Problem
Every LLM API call resends your entire conversation history. A 20-turn chat costs 6x more per call than a 3-turn one — you're paying for the same old messages over and over.
Turn 1: ████ 700 tokens ($0.002)
Turn 5: ████████████████ 4,300 tokens ($0.013)
Turn 20: ████████████████████████████████████████ 12,500 tokens ($0.038)
↑ You're paying for THIS every call

The Solution
PromptThrift is an MCP server with 4 tools to slash your API costs:
| Tool | What it does | Impact |
|---|---|---|
| | Compress old turns into a smart summary | 50-90% fewer input tokens |
| | Track token usage & costs across 14 models | Know where the money goes |
| | Recommend the cheapest model for the task | 60-80% saved on simple tasks |
| `promptthrift_pin_facts` | Pin critical facts that survive compression | Never lose key context |
Why PromptThrift?
| | PromptThrift | Context Mode | Headroom |
|---|---|---|---|
| License | MIT (commercial OK) | ELv2 (no competing use) | Apache 2.0 |
| Compression type | Conversation memory | Tool schema virtualization | Tool output |
| Local LLM support | Gemma 4 via Ollama | No | No |
| Cost tracking | Multi-model comparison | No | No |
| Model routing | Built-in | No | No |
| Pinned facts | Never-Compress List | No | No |
Quick Start
Install
```bash
git clone https://github.com/woling-dev/promptthrift-mcp.git
cd promptthrift-mcp
pip install -r requirements.txt
```

Optional: Enable Gemma 4 Compression
For smarter AI-powered compression (free, runs locally):
```bash
# Install Ollama: https://ollama.com
ollama pull gemma4:4b
```

PromptThrift auto-detects Ollama. If it's running, Gemma 4 handles compression; if not, PromptThrift falls back to fast heuristic compression. Zero config needed.
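The auto-detection can be sketched as a simple health probe against Ollama's HTTP API (which serves `/api/tags` on `http://localhost:11434` by default). The function name and fallback flow here are illustrative, not PromptThrift's actual implementation:

```python
import json
import urllib.request
import urllib.error

def ollama_available(base_url: str = "http://localhost:11434",
                     timeout: float = 1.0) -> bool:
    """Return True if a local Ollama server answers on /api/tags."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.load(resp)  # valid JSON means a healthy Ollama endpoint
            return True
    except (urllib.error.URLError, ValueError, OSError):
        return False  # unreachable -> caller falls back to heuristic compression
```

If the probe succeeds, compression requests can go to the pulled `gemma4:4b` model; otherwise the rule-based compressor runs, which is why no configuration is required either way.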
Claude Desktop
Add to claude_desktop_config.json:
```json
{
  "mcpServers": {
    "promptthrift": {
      "command": "python",
      "args": ["/path/to/promptthrift-mcp/server.py"]
    }
  }
}
```

Cursor / Windsurf
Add to your MCP settings:
```json
{
  "mcpServers": {
    "promptthrift": {
      "command": "python",
      "args": ["/path/to/promptthrift-mcp/server.py"]
    }
  }
}
```

Real-World Example
A customer service bot handling olive oil product Q&A:
Before compression (sent every API call):
```text
Q: Can I drink olive oil straight?
A: Yes! Our extra virgin is drinkable. We have 500ml and 1000ml.
Q: What's the difference between PET and glass bottles?
A: Glass is our premium line. 1000ml PET is for heavy cooking families.
Q: Which one do you recommend?
A: For drinking: Extra Virgin 500ml. For salads/cooking: 1000ml.
Q: I also do a lot of frying.
A: For high-heat frying, our Pure Olive Oil 500ml (230°C smoke point).
```

~250 tokens × every subsequent API call
After Gemma 4 compression:
```text
[Compressed history]
Customer asks about olive oil products. Key facts:
- Extra virgin (500ml glass) for drinking, single-origin available
- 1000ml PET for cooking/salads (lower grade, family-size)
- Pure olive oil 500ml for high-heat frying (230°C smoke point)
[End compressed history]
```

~80 tokens — 68% saved on every call after this point
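The per-call saving quoted here is simple arithmetic, easy to verify:

```python
before = 250  # tokens of raw Q&A history resent on each call
after = 80    # tokens of the compressed summary

saved_pct = round((before - after) / before * 100)
print(f"{saved_pct}% of history tokens saved per call")  # 68%
```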
With 100 customers/day averaging 30 turns each on Claude Sonnet: ~$14/month saved from one bot.
Pinned Facts (Never-Compress List)
Some facts must never be lost during compression — user names, critical preferences, key decisions. Pin them:
```text
You: "Pin the fact that this customer is allergic to nuts"
→ promptthrift_pin_facts(action="add", facts=["Customer is allergic to nuts"])
→ This fact will appear in ALL future compressed summaries, guaranteed.
```

Supported Models (April 2026 pricing)
| Model | Input $/MTok | Output $/MTok | Local? |
|---|---|---|---|
| gemma-4-e2b | $0.00 | $0.00 | Ollama |
| gemma-4-e4b | $0.00 | $0.00 | Ollama |
| gemma-4-27b | $0.00 | $0.00 | Ollama |
| gemini-2.0-flash | $0.10 | $0.40 | |
| gpt-4.1-nano | $0.10 | $0.40 | |
| gpt-4o-mini | $0.15 | $0.60 | |
| gemini-2.5-flash | $0.15 | $0.60 | |
| gpt-4.1-mini | $0.40 | $1.60 | |
| claude-haiku-4.5 | $1.00 | $5.00 | |
| gemini-2.5-pro | $1.25 | $10.00 | |
| gpt-4.1 | $2.00 | $8.00 | |
| gpt-4o | $2.50 | $10.00 | |
| claude-sonnet-4.6 | $3.00 | $15.00 | |
| claude-opus-4.6 | $5.00 | $25.00 | |
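A per-call cost estimate from this table is a one-liner, and a routing choice falls out of the same data. This is a sketch of what the cost-tracking and routing tools compute, not PromptThrift's actual code; `call_cost` and `cheapest_model` are hypothetical names, and the dict holds a subset of the table above:

```python
# $ per million tokens (input, output), transcribed from the pricing table
PRICES = {
    "gemma-4-e4b": (0.00, 0.00),
    "gpt-4o-mini": (0.15, 0.60),
    "gemini-2.5-flash": (0.15, 0.60),
    "claude-haiku-4.5": (1.00, 5.00),
    "claude-sonnet-4.6": (3.00, 15.00),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single API call."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

def cheapest_model(candidates: list[str]) -> str:
    """Crude routing: pick the candidate with the lowest input rate."""
    return min(candidates, key=lambda m: PRICES[m][0])

# A 12,500-token uncompressed prompt vs. the same prompt compressed to 1,750 tokens
print(call_cost("claude-sonnet-4.6", 12_500, 500))  # 0.045
print(call_cost("claude-sonnet-4.6", 1_750, 500))   # 0.01275
```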
How It Works
Before (every API call sends ALL of this):
┌──────────────────────────────────┐
│ System prompt (500 tokens) │
│ Turn 1: user+asst (600 tokens) │ ← Repeated every call
│ Turn 2: user+asst (600 tokens) │ ← Repeated every call
│ ... │
│ Turn 8: user+asst (600 tokens) │ ← Repeated every call
│ Turn 9: user+asst (new) │
│ Turn 10: user (new) │
└──────────────────────────────────┘
Total: ~6,500 tokens per call
After PromptThrift compression:
┌──────────────────────────────────┐
│ System prompt (500 tokens) │
│ [Pinned facts] (50 tokens) │ ← Always preserved
│ [Compressed summary](200 tokens) │ ← Turns 1-8 in 200 tokens!
│ Turn 9: user+asst (kept) │
│ Turn 10: user (kept) │
└──────────────────────────────────┘
Total: ~1,750 tokens per call (73% saved!)

Compression Modes
| Mode | Method | Quality | Speed | Cost |
|---|---|---|---|---|
| Heuristic | Rule-based extraction | Good (50-60% reduction) | Instant | Free |
| LLM (Gemma 4) | AI-powered understanding | Excellent (70-90% reduction) | ~2s | Free (local) |
PromptThrift automatically uses the best available method. Install Ollama + Gemma 4 for maximum compression quality.
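The shape of the heuristic mode can be sketched in a few lines: keep the most recent turns verbatim, condense everything older into one summary block, and prepend pinned facts so they always survive. The summarization rule here (first sentence of each old turn) is a stand-in for PromptThrift's real extraction rules, which are not documented here:

```python
def compress_history(turns: list[str], pinned: list[str],
                     keep_recent: int = 2) -> list[str]:
    """Heuristic sketch: summarize old turns, keep recent turns verbatim."""
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    # Stand-in rule: keep only the first sentence of each old turn
    summary_lines = [t.split(". ")[0] for t in old]
    block = ["[Compressed history]"]
    block += [f"- {fact}" for fact in pinned]      # pinned facts always survive
    block += [f"- {line}" for line in summary_lines]
    block.append("[End compressed history]")
    return ["\n".join(block)] + recent

history = compress_history(
    ["Asked about drinking oil. Long answer", "Asked about bottles. Long answer",
     "Which do you recommend?", "For drinking: Extra Virgin 500ml."],
    pinned=["Customer is allergic to nuts"],
)
```

In LLM mode the summary block would instead come from a Gemma 4 prompt, but the surrounding structure (pinned block plus recent verbatim turns) stays the same.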
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| | No | | Ollama model for LLM compression |
| | No | | Ollama API endpoint |
| | No | | Default model for cost estimates |
Security

- All data processed locally by default — nothing leaves your machine
- Ollama compression runs 100% on your hardware
- Post-compression sanitizer strips prompt-injection patterns from summaries
- API keys are read from environment variables only, never hardcoded
- No persistent storage, no telemetry, no third-party calls
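The post-compression sanitizer matters because an LLM-written summary re-enters the prompt on every later call, so injected instructions would persist. A minimal sketch of such a sanitizer, with illustrative patterns (the real rule list is not published):

```python
import re

# Illustrative injection patterns only; not PromptThrift's actual rule set
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"you are now [^.\n]*", re.IGNORECASE),
    re.compile(r"system prompt:", re.IGNORECASE),
]

def sanitize_summary(summary: str) -> str:
    """Strip prompt-injection-looking phrases from a model-written summary."""
    for pattern in INJECTION_PATTERNS:
        summary = pattern.sub("[removed]", summary)
    return summary
```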
Roadmap

- [x] Heuristic conversation compression
- [x] Multi-model token counting (14 models)
- [x] Intelligent model routing
- [x] Gemma 4 local LLM compression via Ollama
- [x] Pinned facts (Never-Compress List)
- [x] Post-compression security sanitizer
- [ ] Cloud-based compression (Anthropic/OpenAI API fallback)
- [ ] Prompt caching optimization advisor
- [ ] Web dashboard for usage analytics
- [ ] VS Code extension
Contributing
PRs welcome! This project uses MIT license — fork it, improve it, ship it.
License
MIT License — Free for personal and commercial use.
Built by Woling Dev Lab
Star this repo if it saves you money!