Skip to main content
Glama

🧬 Leptin

Token-budgeted MCP memory for AI coding agents.

The satiety hormone for agent memory: a drop-in MCP server that puts your AI agent's long-term memory on a token budget, shows you the receipts, and guarantees it never silently forgot anything that mattered.

CI PyPI python core deps tests benchmark license

Quickstart · Why I built this · How it works · Benchmark · Self-tuning · Security


Persistent-memory MCP servers fixed "the agent forgets between sessions." But for many setups they introduce a quieter problem: the memory store can inflate every prompt and bill you for it. As the store grows, a top-k recall tends to inject more matched memories into context — eating the very window it was meant to protect — and it's often hard to see what that's costing. Turning on forgetting helps, but a plain decay scanner can drop a fact you'll want next week with no easy way to check or undo it.

Plenty of good tools address pieces of this. Leptin's bet is to put the whole loop — diet + scale + safety net — into one lean, local sidecar. Like the hormone it's named after, it tells your memory store when it's had enough, so it stops hoarding.

  A memory layer can quietly grow forever and bill you in the dark,
  or forget things and hope you don't notice.

  Leptin puts your memory on a budget, shows you the receipts,
  and proves it didn't forget anything that mattered.

Why I built this

(This is the project's origin story. If you're reading the code, this is the "why".)

I build with AI coding agents every day. Like everyone, I got tired of my agent forgetting everything between sessions — the stack, my preferences, the decisions we'd already made — so I bolted on a persistent-memory MCP server. It worked. For about a week.

Then sessions started getting slower and more expensive. As the store grew, recall injected a bigger pile of matched memories into the context window on most turns. A top-k recall of ~10 memories at a few hundred tokens each adds up to thousands of tokens per query — paid over and over, brushing against context limits, reaching for /compact. The thing I added to help my agent was now quietly competing with it for context, and I didn't have an easy way to see what it was costing me.

The agent-memory ecosystem is genuinely good and moving fast, and several tools already tackle parts of this well — some add token controls, some ship dashboards, some prune old memories. I'm not claiming nobody has solved it. What I couldn't find was one drop-in piece that combined the specific things I wanted, for the way I actually work: a solo dev running coding agents all day against a local store I'd rather not migrate off of.

The combination I wanted:

  1. Keep memory lean automatically (dedup, merge, decay).

  2. Show me the bill — tokens and dollars, on my own data, not a benchmark slide.

  3. Forget safely — never silently lose something I'd need, and let me undo anything.

So I built Leptin to scratch that itch — a focused, local-first, zero-dependency take aimed at people whose setup looks like mine. It's not trying to replace the bigger memory platforms; it's the lean sidecar I wanted. I cleaned it up, wrote tests, made the headline reproducible with one command, and published it in case it helps you too.

@lionellau. PRs, issues, and "this saved me X tokens" stories all welcome.


Related MCP server: Linksee Memory

The headline (reproduce it yourself)

leptin bench
  Leptin benchmark — naive top-k store vs. Leptin (offline, deterministic)
  ----------------------------------------------------------------
  corpus            : 49 inserts, 24 probes
  active memories   : naive   47   leptin   39   (dedup kept 8 out)
  recall budget     : 1500 tokens   |   naive dumps top-10
  ----------------------------------------------------------------
  memory tokens     : naive   3396   leptin   1147
  TOKEN REDUCTION   : 66.2%   (target ≥ 60%)
  recall            : naive 1.000   leptin 1.000
  RECALL LOSS       : 0.0%   (target ≤ 2%)
  est. $ saved      : $0.006966  (priced at claude-sonnet-4-6)
  ----------------------------------------------------------------
  HEADLINE          : PASS ✅  ≥60% fewer memory tokens at ≤2% recall loss

≥60% fewer memory tokens at ≤2% recall loss — runs fully offline, no API key, deterministic. The corpus, prompts, and models are pinned in code so the number is the same on your machine as on ours.

  • The baseline is a naive top-k dump — exactly what stock persistent-memory MCP servers do today. That's the real status quo Leptin competes against, not a strawman.

  • Savings come from two real mechanisms: mostly budgeted, relevance-packed recall, plus write-time dedup/merge. The output shows the dedup contribution separately (dedup kept N out).

  • The corpus is synthetic and illustrative — a bundled, deterministic LoCoMo-style set. To measure your own numbers on real LoCoMo with hosted embeddings: leptin bench --dataset locomo.json --embedding-model text-embedding-3-small.


Quickstart

1. Install

pip install leptin-mcp                 # once published to PyPI
uvx leptin-mcp serve                   # zero-install run (uv)

# from source today:
pip install "git+https://github.com/lionellau/leptin"

# optional: hosted embeddings + LLM merge (OpenAI / Voyage / Claude)
pip install "leptin-mcp[hosted]"

2. Connect it to Claude Code / Codex

leptin init

That prints a ready-to-paste MCP config block:

{
  "mcpServers": {
    "leptin": { "command": "leptin", "args": ["serve", "--db", "~/.leptin/memory.db"] }
  }
}

Restart the client. The agent now has 8 memory tools. Ask it to "remember that I prefer dark mode", then later "what are my preferences?" — and run leptin report to see the tokens and dollars saved.

Savings show up once your store has overlap (so dedup fires) or recall hits the budget. On a brand-new store, report honestly says it hasn't saved anything yet.

3. See the receipts

leptin dashboard      # local savings dashboard at http://127.0.0.1:8765
leptin doctor         # health check: store, schema, models, hosted readiness
leptin report         # or print the ledger as JSON

How it works

Five mechanisms, all behind the MCP interface:

  1. Write-time dedup / merge. On remember, near-duplicates within a subject are merged into one canonical memory; contradictions supersede the older fact. The store stops accumulating restatements.

  2. Time-decay forgetting. Each memory has a strength that decays exponentially (Ebbinghaus-style, configurable half-life) and is boosted on access. Weak, unused memories become prune-eligible.

  3. Budgeted, packed recall. Candidates ranked by similarity × strength, then greedy-packed under a hard token budget with a relevance gate — so off-topic padding never makes the cut.

  4. Savings ledger. Every op logs baseline_tokens (what a naive store would have injected) vs. actual_tokens, converted to $ via a per-model price table.

  5. Recall guardrail. Before any prune commits, a probe set (question → expected_fact) is re-run against the post-diet store inside a transaction; if recall would drop past a threshold, the whole prune is rolled back.

The theory, and why it matters for your projects

Context windows are a budget, and memory spends it silently. Every token a memory layer injects is a token your agent can't use for code, and a token you pay for on every turn. The naive design — embed the query, return the top-k matches, inject all of them — has a brutal failure mode: as the store grows, "top-k" is drawn from an ever-larger pool, so the matches get bigger and less precise, and you re-pay for them on every recall.

Concretely, the kind of thing that bites real projects:

  • A months-long coding project. By month three your agent has "remembered" hundreds of overlapping facts about the codebase. A single "how does auth work?" recall now injects 2–3k tokens of half-relevant history every turn. Multiply by hundreds of turns a day. Leptin's dedup collapses the restatements and its budgeted recall injects only the on-topic few — the cost stops growing with the store.

  • Preferences and decisions that change. You said "use pnpm" in week one and "actually, use bun" in week six. A naive store now holds both and may inject the stale one. Leptin's supersede keeps the newer fact active and the older one auditable-but-out-of-context.

  • Multi-agent / multi-session setups. Several agents hammering one memory store re-inject the same boilerplate constantly. Dedup + a token ceiling caps the blast radius.

  • "Just turn on forgetting." Decay alone is dangerous — the fact you query once a month is exactly the one a dumb decay scanner deletes. Leptin only prunes behind the recall guardrail, and quarantines (never hard-deletes) within a retention window, so forgetting is safe and reversible.

The decay model is the classic Ebbinghaus forgetting curve (strength(t) = strength₀ · e^(−λt), reinforced on access) — the same spacing-and-recency intuition human memory uses, applied to keep the useful facts strong and let genuinely-cold ones fade. The budgeted packer is a greedy knapsack on relevance-per-token. The recall guardrail is the piece I most wanted and rarely saw: it turns "forgetting" from a leap of faith into a checked, reversible operation.


🧬 Self-tuning — Leptin learns its own diet

Leptin doesn't just measure itself — it evolves. The self-tuning loop replays your own data under candidate policies and commits a change only when held-out evals prove it's a net win (more savings, no recall loss), else it leaves the config alone. Same trust DNA as the guardrail, applied to the policy itself.

leptin tune --dry-run     # preview the proposed change
leptin tune               # apply it (only if it's a proven net win)
leptin tune --history     # the evolution ledger
leptin tune --rollback    # undo the last change, exactly
  • Held-out gate + dual-metric accept — no overfitting the eval, no recall regressions.

  • Locked safety rails — the optimizer can tune recall/decay knobs but can never touch guardrail_max_drop.

  • Reversible — every change is an evolution-ledger row; roll back to any prior config.

  • Token/context efficient by construction — read-only evals on a bounded sample, zero LLM calls offline, cadence-triggered, tiny scorecard output. Opt-in (self_tune_enabled); manual leptin tune always works.


The 8 MCP tools

Tool

What it does

remember

Store a fact. Write-time dedup/merge; contradictions supersede the older fact (kept, not deleted).

recall

Retrieve under a token budget — packed for relevance, with tokens_saved vs. a naive top-k dump.

compact

Guardrailed decay-prune + merge + supersede. Auto-rolls-back any prune that hurts recall.

forget

Soft-delete by id or query → quarantine (reversible), never a hard delete.

restore

Bring a forgotten/quarantined memory back.

inspect

Full provenance, current strength, and event history for any memory.

diet_report

The "show me the receipts" tool: tokens & $ saved, op breakdown, guardrail status.

self_tune

Self-evolve the memory policy — commit only on a proven net win, else revert. Offline, zero LLM calls.


Where Leptin fits

Other memory tools are good at what they do — this isn't a teardown, it's about which combination Leptin focuses on. Compared to the common approaches (not any one product):

Top-k memory store

Hosted memory platform

Decay-based "forgetting"

Leptin

Persistent memory across sessions

Hard token budget on recall

usually no

sometimes

usually no

Savings ledger (tokens & $ on your data)

rare

rare

rare

Forgetting with a recall guardrail + rollback

n/a

rare

rare

Self-tuning policy

no

no

no

Local sidecar, no migration, zero infra

partial

no

varies

Runs fully offline, zero deps

varies

no

varies

If you need a full managed memory platform, use one. Leptin is the lean, local, auditable sidecar for when you don't.


Design

  • Zero core dependencies. The engine, MCP server, ledger, guardrail, dashboard, benchmark, and self-tuner run on the Python standard library alone. pip install is instant; uvx leptin-mcp just works.

  • Offline by default, hosted by upgrade. Default embedder is a deterministic hashing vectorizer; merges are heuristic — so everything (including the benchmark) runs with no API key and is reproducible. Install leptin-mcp[hosted] for real OpenAI/Voyage embeddings + Claude/GPT merging.

  • Graceful degradation. If the embedding/LLM API is unreachable, remember/recall retry then fall back to local — they never throw to the agent.

  • Glass box, reversible. Every merge/decay/forget/tune is logged with a reason; nothing is hard-deleted within the retention window.

⚠️ Offline-mode caveat: the default hashing embedder merges near-lexical duplicates well, but not deep paraphrases ("dark mode" vs "night theme"). For semantic dedup, configure hosted embeddings. The conservative defaults err toward keeping data — consistent with "never silently forget."


Running it in production

Capability

What it gives you

leptin doctor

One-command health check (store, schema version, models, hosted SDK/key readiness). Non-zero exit if unhealthy.

Schema migrations

Versioned on-disk schema; older stores upgrade in place, data preserved.

Concurrency

WAL + busy_timeout so server + dashboard + CLI share one DB file safely.

Scale

Parsed-embedding cache keeps recall in the low-ms over thousands of memories.

Hardened hosted mode

Retries transient API errors with backoff before degrading; caches embeddings to avoid re-billing.

Structured logging

LEPTIN_LOG=DEBUG|INFO|WARNING|ERROR (stderr only — stdout stays a clean MCP channel).


Configuration

Every tunable has a sane default (env LEPTIN_*, the config table, or a Config object):

Key

Default

Meaning

token_budget_default

1500

Hard token ceiling per recall

dedup_threshold

0.86

Cosine τ for near-duplicate merge

decay_half_life_days

14

Strength halving time

guardrail_max_drop

0.02

Max tolerated recall drop before rollback

embedding_model

local-hash

or text-embedding-3-small, voyage-3, …

llm_model

heuristic

or claude-haiku-4-5, gpt-4o-mini, …

self_tune_enabled

false

Run the self-tuning loop automatically after compaction


Security

Leptin is local-first and designed to be safe by default:

  • The MCP server speaks JSON-RPC over stdio — no network listener.

  • The dashboard binds to 127.0.0.1 only and rejects non-localhost Host headers (DNS-rebinding mitigation). It has no auth and is for single-user local use — don't expose it.

  • Memory content is treated as data, never executed.

  • Hosted embedding/LLM calls (opt-in [hosted]) send memory text to the configured provider — review their data policy first. API keys are read from env vars, never stored.

  • A user's memory database is never committable (.gitignore excludes *.db / *.sqlite).

Found a vulnerability? See SECURITY.md — please report privately.


Testing

uv venv && uv pip install -e ".[dev]" && pytest

112 tests cover the PRD acceptance criteria: budget guarantees, savings-ledger math, dedup/merge/supersede, decay, the guardrail rollback/commit invariants, self-tuning (offline zero-cost, lock enforcement, reversibility, determinism), glass-box reversibility, the MCP protocol surface (incl. a real leptin serve subprocess), the dashboard HTTP layer, hosted integration + retry/degradation paths, schema migrations, concurrent writers, recall latency at scale, and the reproducible benchmark. CI runs the suite, the benchmark, a clean wheel install, and the TS build on Python 3.10–3.13.


FAQ

Does it work with anything other than Claude Code? Yes — it's a standard MCP server (stdio), so any MCP client (Codex, Cursor, etc.) works. There's also a @leptin/client TypeScript SDK and a Python API (from leptin.api import Leptin).

Do I need an API key? No. The default mode is fully offline and deterministic. Hosted embeddings/LLM are an opt-in upgrade for semantic dedup.

Will it delete something I need? Not silently. Decay only prunes behind the recall guardrail, prunes are quarantined (not hard-deleted) and restorable, and you can add probes for anything you want protected.

Is the 66% number real? It's reproducible offline on a bundled synthetic corpus (leptin bench), and the harness runs on real LoCoMo data too (--dataset). See the benchmark note.

Where does my data live? In a local SQLite file (default ~/.leptin/memory.db). Zero infra. Adapters for Mem0/pgvector are on the roadmap.


Roadmap

Shipped in v1.0 — drop-in MCP server + 8 tools · SQLite backend · dedup/merge/supersede · decay · token-budgeted packed recall · savings ledger · recall guardrail + reversibility · 🧬 self-tuning · leptin doctor · schema migrations · reproducible leptin bench (+ real LoCoMo) · local dashboard · TS SDK · 112 tests + CI.

Forward roadmap — backend adapters (Mem0, pgvector) so Leptin diets a store you already run · sqlite-vec fast path · hosted prompt/intent optimization for self-tuning · shared/team memory.


Contributing

PRs welcome — especially backend adapters. See CONTRIBUTING.md and the Code of Conduct. Keep the core dependency-free, add a test, and don't weaken the guardrail.

License

MIT — see LICENSE.

A
license - permissive license
-
quality - not tested
A
maintenance

Maintenance

Maintainers
Response time
0dRelease cycle
3Releases (12mo)
Commit activity

Resources

Unclaimed servers have limited discoverability.

Looking for Admin?

If you are the server author, to access and configure the admin panel.

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/lionellau/leptin'

If you have feedback or need assistance with the MCP directory API, please join our Discord server