Skip to main content
Glama

weighted-compact

Claude Code forgets everything at /compact. weighted-compact decides what to keep — by reading your corrections, not guessing.

A local-first, inspectable substrate built from ~/.claude/projects/. Six automatic signals + one optional human label. No daemon, no cloud, no auto-injection. 8 pp measured advantage over LLM-summary /compact on reconstruction-fidelity (full table below).

License: MIT Python 3.11+ v0.2.0-beta.2 outbound-zero

   ~/.claude/projects/                    7 signals (6 automatic)
   ┌──────────────────┐                  ┌────────────────────┐
   │ session_1.jsonl  │  ──bootstrap──▶  │ misstep            │
   │ session_2.jsonl  │                  │ density            │
   │     ...          │                  │ label (optional)   │  ─importance.npz─┐
   │ session_N.jsonl  │                  │ span_keep / maybe  │                  │
   └──────────────────┘                  │ span_skip / think  │                  ▼
                                         │ topic_decay        │       ┌──────────────────┐
                                         │ cosine             │  ───▶ │  compacted       │
                                         └────────────────────┘       │  markdown        │
                                                  ▲                   │  + budget meta   │
                                              REM-decay               └──────────────────┘
                                          (nightly, wall-clock)                ▲
                                                                               │
                                                                          MCP / CLI
                                                                          (you query
                                                                           with intent)
# 30-second install — six automatic signals, no labeling step required
pipx install 'weighted-compact[mcp]'
weighted-compact bootstrap        # extract correction pairs from ~/.claude/projects/
weighted-compact importance       # compose the seven-signal score
weighted-compact mcp-serve        # local stdio MCP for Claude Desktop / IDE clients

Docker / Windows install · Operating guide · Bench vs claude-mem · Stability promise · Extension recipe · Pick your door


How it compares

A 30-second product map vs the dominant adjacent tool:

weighted-compact

claude-mem (77 k ★)

Capture

reads ~/.claude/projects/ on demand

5 always-on lifecycle hooks

Importance ranking

7 inspectable signals + REM-decay

ORDER BY recency DESC LIMIT k

Compaction

top-K vector selection → markdown

LLM-summary, opaque

Auto-injection

none — client polls

every session start

Outbound network

zero (default)

configurable LLM provider per turn

Substrate inspection

numpy columns on disk

SQLite + Chroma

Idle RAM

0 MB (nothing runs)

~80–150 MB worker

3.0× judge_yes (measured)

not measured by them

The trade is real: claude-mem is hook-installed in seconds and feels magic; weighted-compact is a substrate you query with intent. Neither is wrong; they answer different questions. The numbers above are reproducible from this repo — see docs/bench-vs-claude-mem.md.


The substrate is the artifact

~/.claude/projects/ already contains the record of every correction you pushed back on, every flag you restated, every constraint the model lost track of. weighted-compact reads those files once, parses them into per-pair objects (correction text, premise text, span tiers, session anchors), and decorates each pair with seven independent signals: misstep probability, density features, span coverage, topic position, label state, recency, cosine neighbourhood.

The decorated substrate lives at $XDG_DATA_HOME/weighted-compact/ — gitignored, never uploaded. It is the artifact. Compaction is the first reader; other consumers read the same files.

Six of the seven signals run without any human input. The seventh — label — is an optional power-tier, not a requirement: the published ablation puts its 95 % paired CI [−0.004, +0.109] crossing zero on the lower bound, with 67 % of paired runs showing zero difference. The substrate stands on the six automatic signals; the labeler at :18890/ is opt-in for users who want to add an explicit human judgement track.

A nightly REM pass (weighted-compact rem-pass, default 04:00 via the bundled systemd-user timer) lays a wall-clock half-life multiplier on top: yesterday × 0.91, week ago × 0.50, month ago × 0.05. Independent of the seven-signal mixture — content signals stay stable, time refreshes every day. See docs/rem-decay.md.

Consumers reading this substrate today

Consumer

What it does with the substrate

Status

Compaction layerbuild_compacted_context_with_meta()

top-K importance selection → markdown context + budget meta (chars, tokens, top-3 signals). Exercised via the qa-gate evaluation harness; standalone session-start delivery is the next-targeted feature.

shipped — library + harness + MCP tool

Schema extraction — third retrieval tier

extracts reusable (trigger, rule, anti-pattern, stable_since) rules from your memory dir; sits above chunk/episode retrieval as the cheap top-tier — when a recurring pattern fires, you get the rule, not raw chunks

Both consumers ship in this repo today. Four additional readers are in flight in adjacent projects (misstep, session-narrative, FKMF, misstep-foreign-models) — see docs/consumers-roadmap.md for their status. They are listed there, not here, because nothing about this repo's install path depends on them.


The substrate is structurally personal

The artifact described above is a record of one person's corrections, stumbles, span decisions, and labels. That shape has four properties that make it a different category of object from a vendor-shipped memory feature — not better, different.

Local persistence. Conversation text lives on disk under $XDG_DATA_HOME/weighted-compact/, in a path no remote service writes to. A vendor whose revenue model is cloud uplink does not put a writable user-controlled persistent store on your machine. Comparable "local" vendor patterns (Apple Intelligence, Microsoft Recall) still route memory through their sync infrastructure.

Anti-drift labeling. The labeler's anti-drift sidebar shows the five cosine-nearest prior labels alongside each candidate. This optimizes for you converging to yourself over time, not for you converging to the model's policy. A vendor's incentive runs the other direction — the more predictable a user is to the model, the better the model serves that user inside the vendor's product. The mechanism is structurally adversarial to vendor goals.

Per-user predictor. The misstep signal is a logistic regression trained on your stumble events from your correction embeddings; AUC numbers are corpus-dependent. A vendor's foundation model is one model serving millions; per-user trained auxiliary predictors at that scale are not economic. Any cloud-side approximation collapses to an averaged stumble pattern — which is exactly the loss of per-user shape that makes the substrate yours.

Multi-consumer substrate. misstep / session-narrative / FKMF / misstep-foreign-models all read the same per-pair files. A vendor ships memory as a feature embedded in one product. Substrate-as- substrate, queried by multiple independently developed readers, is a different shipping unit — and it requires open, scriptable, user-owned substrate.

These four properties compose into one statement: the substrate carries you, not a generalized inference about you. A vendor's memory feature can be more polished, better integrated, faster to adopt — and still be a categorically different object from this one. That difference is not a moat claim; it is a description of the shape, and the shape is what defines the project.


The methodology is inspectable

Adjacent memory tools capture every tool call, ship the stream to an LLM, store the summary, and at session start inject "relevant" snippets back. The capture is automatic, the summary is opaque, and the relevance function is a ORDER BY recency DESC LIMIT k against a vector index. There is no way to ask why this snippet was kept and that one was not except by reading the LLM's mind.

This project takes the inverse stance. The same correction-pair substrate that the runtime consumes is also queryable by an evaluation harness: hide one pair, reconstruct the answer from the rest, ask two independently-trained model families (Qwen generator, Gemma judge) to agree that the answer still holds. The cross-family choice is a methodology contract, not a runtime dependency — when both families agree the pair was recoverable, the signal weighting that recovered it is the part being tested, not any single model's verdict. Cohen's κ between cheap-local and paid-cloud judge is published (κ=0.469); the noise envelope around every claim is named in the same table as the claim.

The runtime stores no opaque summaries; the per-pair scores are columns in a numpy array on disk. The evaluation harness is reproducible on any user corpus. There is no black box between "this pair was recovered" and "this is why."

The published consumer is compaction. Its proof is on the reconstruction-fidelity loop: hide a pair, ask whether the compacted context still answers questions about it.

Method

Per-Q fidelity (gemma3:4b judge, N=62, k_drop=0.5)

Random selection (seed 42)

12.9 % (8/62)

7-signal mixture (this project)

11.3 % (7/62)

Recency-only

11.3 % (7/62)

Cosine retrieval (e5)

11.3 % (7/62)

Density (single signal)

9.7 % (6/62)

BM25 retrieval

9.7 % (6/62)

Naive /compact analog (qwen summarize)

3.2 % (2/62)

Reading. The 8 pp gap between any structured selection and the summary-bypass is the strongest signal in the table: querying the substrate beats discarding it for a one-pass LLM summary by 5 questions out of 62. Within structured selection, the mixture does not yet outperform a uniform random ranker at this N under the gemma3 judge (κ=0.47 agreement vs Sonnet). The pre-registered narrative target — mixture beats cheap baselines by ≥0.05 absolute — is not met.

What this proves: the substrate-vs-summary axis (parsing into pairs and querying beats one-pass summary). What it does not yet prove: the mixture-vs-naive-ranker axis (which weighting on the substrate is best). Both axes matter. The first is the architectural bet; the second is open work for v0.3.

Honest scope of the measurement. /compact in the harness is a qwen2.5:7b-driven summariser, not Claude Code's actual /compact prompt (closed, not in the loop). Replacing the simulator with captured real-/compact traces is filed for v0.3 — this is the single change that would harden the headline most.

Sonnet calibration sidebar (different harness, 2026-05-21). The earlier Sonnet 4.6 calibration on the larger 1718-triple set reported 3.8 % per-question fidelity under strict vector-and-anchor policy when the source pair was hidden. The cheap-judge agreement against that Sonnet pass came in at Cohen κ = 0.469 (precision 0.70, recall 0.51, 1433 paired predictions) — the noise envelope around the table above. An earlier label-weight ablation under the same cheap judge gave Δfidelity = +0.053 (95 % paired CI [−0.004, +0.109], 3/3 corpora positive); under the strict-judge view that signal needs to re-test. The 3.8 % number sits on a different N, different judge, and different test condition (k_drop=0 hide-source-only vs k_drop=0.5 here); apples- to-apples comparison is filed under v0.3.

Numbers are this corpus; the methodology reproduces on any user's substrate.


The case for a substrate, not a feature

Today your session log is parsed once by a summariser and tossed. Whatever the summariser misses is gone — the hostname you corrected an hour ago, the flag you restated three turns up, the edge case you described slowly. The summariser is one consumer with one task; its output is throwaway.

The bet behind this project is that the parsed substrate is more valuable than any single output computed from it. Once a session becomes structured per-pair objects with seven signals attached, the same artifact can be queried for compaction and stumble prediction and narrative recall and knowledge-gap detection and foreign-model observability. Same data, different readers. The substrate becomes infrastructure.

Compaction is the first published reader because it is the most tractable proof point — a fidelity loop with a quantifiable answer. The numbers above are about that reader. The wider claim — that the substrate is the right primary object — is supported by the other consumers listed above existing as separate projects at all, not by a benchmark on this README.

Seven signals compose into one importance score: a per-user misstep predictor as backbone, sixteen density features, four span tiers, one sparse human label, modulated by an unsupervised topic-decay multiplier. Vectors first; the classifier is a refinement layer, not a gatekeeper. Reconstruction fidelity — can the compacted context still answer questions about what was hidden from it — is the metric for the compaction reader. Not compression ratio.

The v0.3 direction is cross-session correlation — corrections from one session inform ranking in the next, so recurring constraints accumulate weight instead of resetting at each compact. This is where substrate-as-substrate starts paying compound interest that a one-shot /compact cannot: information from session 17 lifting ranking in session 84. The v0.4+ direction is the same signals running in reverse — predict the corrections you are about to make, the constraints you are about to set, the paths you typically reject. Substrate on disk; an IDE-side assistant reads it as lookahead; nothing leaves the host. The substrate becomes a memory you can interrogate, then a forecaster you can override.


Numbers and their meaning

Eight tests, all on the compaction reader (the published consumer). Four shipped with numbers; the rest with the gap named. Each result cell has the number on line one and the reading on line two.

#

Test

Status

Result

1

Ground-truth fidelity under cross-family judge

shipped

Sonnet 4.6, 573 pairs → 3.8 % per-Q fidelity floor; failure split: ~40 % IDK / ~24 % paraphrase / ~6 % ranking error.

≈96 % of pair-specific detail vanishes when that pair is hidden from context — the absolute starting point, before the mixture does any work.

2

Coefficient ablationlabel_weight ∈ {0, 0.15}

shipped, partial sweep

Δ=+0.053, 95 % paired CI [−0.004, +0.109]; per-corpus signs 3 / 3 positive; 38 / 57 paired runs (67 %) are ties.

Turning the human-label term on shifts per-Q fidelity by ≈5 percentage points but the CI crosses zero on the lower bound, and two-thirds of paired runs are unaffected either way. Read as: the six automatic signals carry the substrate by themselves; label is an optional power-tier, not a precondition.

3

Cross-corpus consistency — 3 disjoint session corpora

shipped

A +0.100 / B +0.028 / C +0.021; sign breakdown 13 pos · 6 neg · 38 ties.

Three independent maintainer corpora all show positive Δ — the row-2 effect is not an artefact of one session window.

4

Cheap-judge calibration — gemma3:4b vs Sonnet 4.6

shipped

κ = 0.469, precision 0.70, recall 0.51, zero "other" verdicts.

The free local judge agrees with the paid cloud judge at "moderate" Landis-Koch level — cheap enough for continuous monitoring, not strict enough for definitive scoring.

5

Anti-baseline — mixture vs naive /compact summary + cheap structured rankers

shipped

qwen-summarized /compact analog: 3.2 % (2/62). Mixture: 11.3 % (7/62) — 8 pp gap. Within structured (random / recency / cosine / mixture / density / bm25): all within ±1 question. Same harness, gemma3:4b judge, k_drop=0.5.

Structured selection of any kind beats one-pass summarisation by a substantial margin. The mixture's edge over cheap structured baselines is not yet measurable at this N under the cheap-judge proxy — that's the open question for v0.3.

6

Full coefficient grid — all 7 signal weights

filed

weighted-compact ablation --weights one-shot wrapper, v0.3.

Only one of the seven mixture weights has been swept so far; the other six contribute under their heuristic defaults.

7

Compositional / long-run — fidelity across rolling 48-pair windows

scaffold

recon_qa/gate.py computes buckets; downstream routing partial.

Fidelity is scored per pair, not as a function of accumulated history — cannot yet answer "did the last 30 sessions improve or degrade older recall."

8

Multi-user scaling — reproduction on second corpus

open invitation

Methodology is the contribution; magnitudes are not portable.

If you run on your corpus, the absolute numbers will not match — but the direction of the label-weight effect should reproduce.

9

REM-decay multiplier — wall-clock half-life ageing

shipped 2026-05-23

weighted-compact rem-pass [--half-life-days 7]; nightly timer template; --rem-decay flag end-to-end through qa-gate and run_eval.

Independent of the seven-signal mixture. Content signals are stable; time refreshes every day. Half-life sweep on the fidelity gate is filed for v0.3 — for now, 7 days is heuristic.

10

MCP stdio server — local-only client surface

shipped 2026-05-23

weighted-compact mcp-serve; three tools (search_pairs, compact_session, substrate_info); optional extra pipx install 'weighted-compact[mcp]'.

11

Docker / Windows install path

shipped 2026-05-23

Dockerfile + docker-compose.yml; non-root wc user; substrate on named volume; ~/.claude/projects bind-mounted read-only; labeler at 127.0.0.1:18890; optional ollama sidecar.

12

Cross-tool bench harness vs claude-mem

shipped 2026-05-23, script + spec; partial-run numbers pending

scripts/bench_vs_claude_mem.sh + docs/bench-vs-claude-mem.md; deterministic 30-pair sample (seed 42); same gemma3:4b judge for every method.

The label-weight ablation gives a directional read: paired mean Δ=+0.053, per-corpus signs 3/3 positive (+0.100 / +0.028 / +0.021). But the 95 % paired CI [−0.004, +0.109] crosses zero on the lower bound, and 38 of 57 paired runs (67 %) are exact ties — flipping the human-label weight on or off changes nothing for two-thirds of evaluations under the cheap-judge κ=0.47 noise envelope. The published default treats label as optional: the six automatic signals (misstep, density, span tiers, topic position, recency, cosine neighbourhood) carry the substrate on their own, and bootstrap ships with no labeling step. Users who want to add the human-judgement track open the labeler at :18890/ as a deliberate power-tier action, not as a precondition for installation. This is the adoption-wall removal that distinguishes the substrate from any approach requiring per-pair human labeling at install time.


Pick your door

Same substrate, five readers. The one that names you is yours.

🌱   If you use Claude Code daily

You correct the model. You restate flags. You watch summaries lose the hostname you typed thirty turns ago. The pain is concrete.

Would make this irrelevant to you: your sessions never run long enough to hit compact, or you reset between threads instead of compacting. The substrate's whole job kicks in at that boundary.

The claim: the substrate parses every correction into a queryable object. Today the compaction reader uses that to build a refill context shaped by what you marked. Tomorrow (v0.3) a cross-session reader will spot when this session's correction was the third time you made the same one across three sessions and weight it accordingly.

Action: pipx installweighted-compact compatbootstrap. Done. No labeler, no twenty-minute sitting — the six automatic signals derive from your session files alone. The labeler at :18890/ is opt-in if you want to add the seventh (sparse human-judgement) signal; the published ablation says you don't have to.

Install · Q1 — find your own stumble


🔬   If you read papers on memory, distillation, compaction

The framing is substrate, not policy: the user's own corrections are the training signal, extracted from ~/.claude/projects/ rather than from preference data or RL rollouts. Cross-family judge by contract (gemma3:4b Gemma judges qwen2.5:7b Qwen reconstructions), cheap-judge calibrated against Sonnet 4.6 on identical predictions.

Would make this irrelevant to you: you want SOTA on a public benchmark. This is one corpus, one user, 573 pairs. No CIMemories-style scaling table; no cross-method shoot-out. Reproduction is invited, not packaged.

The number: 3.8 % per-Q fidelity floor (Sonnet) defines the open question — what raises it. The +0.053 Δ label ablation has its 95 % CI crossing zero — direction holds, magnitude not statistically distinguishable from the six-automatic-signal baseline at this N. Six remaining weights are uncharted; that is the open ablation grid.

Action: read docs/importance-mixture.md for the full ablation table, reproduce on your own corpus, file an issue with the sign agreement (or disagreement) across your sessions.

Results · docs/05-roadmap.md


🔧   If you write code and want a substrate you can extend

Eight modules, eight black-box contracts (input, output, entry point, dependencies). Each contract is at the top of its own file. Replace any single box without touching the rest, as long as the I/O shape holds.

Would make this irrelevant to you: you want a finished tool with a config UI. This is a workbench. The contract surface is named and documented; the framework around it is light on purpose.

The number: ~30 LOC to add a new density feature; ~60 LOC to add a whole new signal with its own features_X.npz producer plus a weight entry in importance.py:WEIGHTS.

Action: open weighted_compact/density_features.py, append a regex feature, rerun bootstrap + qa-gate. The Δfidelity vs the previous run tells you whether the feature earned its column.

What is pluggable · docs/02-pipeline.md


🕵️   If you think a dialog can be reverse-engineered from signals

Structurally, this is an inverse problem. Hide one pair, hand the rest to a model, ask it the questions whose answers lived in the hidden pair. If the model answers, the importance mixture chose the right context. If not, you are missing a feature.

Would make this irrelevant to you: you want compaction that works out-of-the-box. The reconstructor angle is for people who want to add a signal and prove it earns its slot.

The number: the cross-family judge keeps the loop honest at κ=0.47 against ground truth — high enough to be a useful fitness gate, low enough that magnitudes don't survive without a sign-agreement check.

Action: write a regex or a classifier for a pattern the mixture isn't catching (hesitation markers, intent-shifts, rhetorical reversals). Re-run the loop. The numbers tell you whether your hypothesis holds up under paired evaluation.

Q3 — add a signal in thirty lines · docs/04-grep-vs-judge.md


🔒   If you care about local-first and privacy

Substrate lives in $XDG_DATA_HOME/weighted-compact/. Gitignored. The runtime is the substrate plus markdown render — no LLM in the loop. The Ollama models (qwen2.5:7b generator + gemma3:4b cheap judge) are evaluation harness, not runtime: they exist so anyone can re-run the reconstruction-fidelity test (hide a pair, ask if the compacted context still answers questions about it) on their own corpus with two independently-trained model families cross-checking each other. Zero outbound calls on either path.

Would make this irrelevant to you: you noticed Sonnet 4.6 in the results table. Yes — the calibration in Headline is a maintainer-side one-time cloud-judge run, explicitly disclosed. Users who want a Sonnet-grade verdict opt in to their own API key; the default binds nothing.

The number: zero outbound network calls on the default path; one CI job (scripts/leak-scan.sh) scans every commit for substrate filename patterns and hardcoded personal-home paths. The remote repo is an orphan-cut branch carrying only framework code; the maintainer's substrate, with personal session history, lives on a private mirror and never touches GitHub.

Action: run weighted-compact bootstrap on an air-gapped host. Read docs/invariants.md for the architectural commitments behind the guarantee.

Architectural invariants · docs/invariants.md


⚙️   If you want this to run nightly without thinking about it

Three commands plus a timer, and the substrate stays current on its own. Recent sessions outweigh distant ones every morning; six automatic signals refresh content importance whenever bootstrap is re-run; nothing talks to the cloud.

Would make this irrelevant to you: you don't want a long-lived substrate at all — you reset state between sessions or use the model as a stateless chat tool. The whole point of nightly automation is that the substrate is the artifact you keep around.

The number: one timer, fires at 04:00 local with a 15-minute randomized delay; one oneshot service writes rem_decay.npz and rotates the previous version to rem_decay.npz.bak.<UTC-ts>. Hardcoded path — no ~/.bashrc injection, no crontab -e, no shell hook.

Action:

weighted-compact install-units --force
systemctl --user daemon-reload
systemctl --user enable --now weighted-compact-rem-pass.timer
systemctl --user list-timers weighted-compact-rem-pass.timer

REM-decay docs · systemd/weighted-compact-rem-pass.timer · Operating guide — how / what / how much / why


What we crossed

Concepts the project ran into on the way. Each one forced a design decision — named, so you can see what the choice was.

Goodhart's law. Optimize a metric long enough and the metric stops being a metric. We do not gradient-descent the mixture weights against the recon-QA fidelity score — the loop is held out as a fitness gate, not used as a training loss. Mixture weights are heuristic and fixed before recon-QA runs. The price is that the weights are not optimal; the price of the alternative is that the score becomes meaningless.

Cohen's κ and the Landis-Koch scale. A free local judge is cheap to run but suspect; a paid cloud judge is the reference but cannot be the default. Cohen's κ between gemma3:4b and Sonnet 4.6 on identical predictions came in at 0.469 — "moderate" agreement per the Landis-Koch convention. Usable as a routine cheap proxy; not a substitute for definitive scoring. We publish κ instead of raw accuracy so the dispersion is legible.

Same-family judge agreement. When the judge model and the generator model share a backbone or training corpus, they tend to concur unfairly. The pipeline avoids this by contract: Gemma judges Qwen reconstructions; no Qwen judging Qwen. The cross-family choice is the architecture, not a robustness ablation. The cheap-judge κ above measures how much this default already costs you.

Sign agreement under a noisy magnitude. The label-weight ablation produced Δ=+0.053 with 95 % paired CI [−0.004, +0.109] — the magnitude crosses zero on the lower bound. The per-corpus signs, separately, were 3 / 3 positive (+0.100 / +0.028 / +0.021). Direction reproduces; magnitude wobbles. We report both and call the CI a qualifier on the magnitude, not a retraction of the direction.

Graceful degradation as contract. A pipeline that breaks when one piece fails is brittle. Each signal in the mixture is optional — missing misstep, the weights re-balance; missing spans, the four tier-terms collapse to zero. The system degrades to a vector baseline. A single-classifier gatekeeper design has no analog for this; we picked the weighted-sum mixture in part for exactly this property.


Honest limitations

  • One corpus, one user. 573 pairs from the maintainer's Claude Code sessions. Magnitudes are not portable; the methodology is what reproduces. A scaling story across users does not exist yet — it requires others running on their own substrate and reporting back. What this means for you: your absolute numbers will be different; the direction of effects should reproduce, the magnitudes likely will not.

  • Four of five substrate consumers are not shipped here. misstep, session-narrative, FKMF, misstep-foreign-models are listed in the consumer table because they exist and they read this substrate format. They are not packaged in this repo and they are not publicly available. They support the architectural claim ("the substrate has more than one reader"); they do not give you running tools today. What this means for you: if you install weighted-compact, you get the compaction reader and the substrate. You do not get the other four. The substrate format itself is what you can build on.

  • The /compact comparator is a local-LLM simulation, not the real /compact. The 8-pp gap in the headline is against a qwen2.5:7b-driven summariser, because Anthropic's actual /compact prompt is closed and not in the harness. Replacing this with captured real-/compact traces is filed for v0.3 and is the single change that would harden the headline most. What this means for you: read the 8-pp result as "structured selection beats one-pass local-LLM summarisation on this corpus", not "weighted-compact beats Claude Code /compact". The second claim has not been measured.

  • Per-question fidelity floor is 3.8 %. Most pair-specific detail is genuinely lost on compaction; what survives is anchor-rich content (entities, paths, numbers, short verbatim). This is the absolute starting position, not a regression. What this means for you: expect that most fine detail vanishes at the compact boundary without intervention. The substrate's job is to make what survives match what you marked — not to restore everything.

  • Misstep AUC 0.665 on the maintainer's corpus — useful as a backbone signal, not yet calibrated cross-user. What this means for you: if your sessions look very different from the maintainer's (different language, very short, very technical), the misstep signal may be weaker for you. Drop its weight or replace with your own predictor.

  • Classifier-as-fidelity-proxy is parked. A first attempt at training a Sonnet-fidelity predictor from 411-dim engineered features landed at AUC ≈ 0.5. Either the sample is too small for the imbalance, the features are optimised for ranking rather than fidelity prediction, or fidelity is emergent from retrieval+generator interaction. Not a current dev target. What this means for you: the substrate runs on the seven-signal mixture; the parked classifier you may see in logs is not weighted into the score. Nothing user-facing depends on it shipping.

  • Iter-chain mode-distinction QC is parked. All three modes (complement / refine / deepen) cluster in cosine drift [0.95, 1.00] under the current generator+prompt; calibrated bands would be too tight (σ ≈ 0.005-0.012) to be useful. What this means for you: the iter-chain modes in the inspector look usable but their quality signal doesn't differentiate yet. Treat those rows as illustrative until the redesign ships.

  • 50-sample baseline is the threshold for individual scores to read as calibrated rather than illustrative. Below that, you have an importance ranking; you do not yet have a stable fidelity measurement. What this means for you: if you just installed and the fidelity number looks weird, that's expected — keep using Claude Code, let your corpus accumulate, re-bootstrap, then trust the number.

  • Mixture's edge over cheap structured baselines is not yet measurable. At N=62 under gemma3:4b judge, random / recency / cosine all match the mixture within ±1 question (12.9 % / 11.3 % / 11.3 % / 11.3 %). The 8-pp gap is between any structured ranker and the summary-bypass — that survives. The pre-registered narrative bar (mixture beats cheap baselines by ≥0.05 absolute) is not met by this measurement. What this means for you: if you're picking weighted-compact over a uniform random ranker, the case currently rests on (a) the substantial gap over LLM-summary compaction, and (b) the architecture being designed for Sonnet-grade re-judge and larger corpora where the mixture's signal can express — not on a present measured advantage over cheap baselines.


What the pipeline does

Session files in. Eight modules score each turn against seven signals. The most important content rebuilds the compacted context. A judge from a different model family checks whether the result still answers questions about what got cut.

The composer formula:

importance = 0.40 × misstep + 0.25 × density + 0.15 × label
           + 0.20 × span_keep + 0.10 × span_maybe
           − 0.15 × span_skip + 0.05 × span_think

The eight modules are independent black boxes — each documented at the top of its own file, replaceable individually. The quality metric driving development is reconstruction fidelity, not compression ratio: ratio is easy to game, fidelity is harder. See docs/03-quality-driver.md for the argument, docs/architecture.md for the module map.


Architectural invariants

Three commitments — full text in docs/invariants.md.

  1. The system won't break when something's missing. Every turn becomes an e5-multilingual-small embedding. The importance mixture runs on those vectors. The classifier is a weighting layer on top, not a gatekeeper. If it degrades or is missing, the pipeline degrades to a vector baseline. (Vectors first, classifier as refinement.)

  2. You label only when there's a real reason to. Two triggers: an inline marker in a live session, or classifier disagreement. Twenty pairs in one sitting when there is a reason. The UI shows five cosine-nearest prior labels alongside the current pair — anti-drift scaffolding so you match your own past judgment, not the model's. (CAPTCHA labeling: gap-fill + ambiguity-merge, not bulk.)

  3. No Anthropic-side dependencies. A harness is the side that hosts an assistant — Claude Code itself, ChatGPT's app, Cursor's IDE plugin. Each one decides what system message goes in, what tools are exposed, how output gets delivered. weighted-compact assumes none of those privileges: it writes Markdown, you paste it, it runs standalone. If a future harness exposes more privileged delivery, that is a bonus, not a dependency.


What is pluggable

Every replacement point is named. Swap a component, rerun the recon-QA loop, watch Δfidelity — the loop tells you whether your version helped.

Replacement point

Default

What you can replace with

Pair extractor

extract_pairs.py walks ~/.claude/projects/

Any source producing pairs.jsonl with the documented schema

Embedder

intfloat/multilingual-e5-small (384-dim)

bge-m3, qwen3-emb, gte-multilingual, any sentence-transformer

Density features

16 regex signals

Your own regex bag, LM-derived features, custom entropy variants

Misstep predictor

logistic regression on stumble events (per-user)

Any model returning P(stumble) per pair

Span tier set

KEEP / MAYBE / SKIP / THINK

Locked at 4 in current schema

Topic segmenter

sliding-window cosine on e5 vectors

BERTopic, supervised classifier, your own boundary detector

Importance composer

weighted sum of 7 signals

Custom mixture, additional signals, GBM ensemble

Reconstruction model

qwen2.5:7b via Ollama

Anything behind an Ollama /api/generate endpoint

Judge model

gemma3:4b (Gemma family)

Any other family — distinct-family constraint is the only rule

Difficulty filter

EvoEnv-style two-k_drop bucketing

Your own bucketer returning the four-bucket dict

→ Module-by-module contracts: docs/02-pipeline.md


Quiz / Quest

Three concrete invitations. The fastest way to understand the system is to run it against your own corpus and watch the numbers move.

Q1 — Find your own stumble. Bootstrap. Open the labeler. Sort by misstep_score descending. Top five pairs should be moments where you corrected the model sharply and the conversation stabilised after. Were they? If yes, the predictor caught your stumbles. If not, your corpus is below the training threshold — keep using Claude Code and re-bootstrap in a week.

Q2 — Reproduce the label-weight ablation on your own corpus. Maintainer corpus, gemma3-judged, N=57 paired: Δ=+0.053, sign positive in 3/3 corpora. Re-run on yours; the goal is the sign, not the magnitude. κ=0.47 cheap- judge noise will widen your CI; multiple seeds or an opt-in Sonnet pass tighten it.

# Pass A: shipped label weight (0.15)
weighted-compact importance && weighted-compact qa-gate --easy-k 0.0 --hard-k 0.9 --signal judge

# Pass B: drop to 0.0 in weighted_compact/importance.py:WEIGHTS, save, rerun
weighted-compact importance && weighted-compact qa-gate --easy-k 0.0 --hard-k 0.9 --signal judge

Q3 — Add a signal in thirty lines. Open density_features.py. Add a regex for reversal markers (r"\b(actually|wait|scratch that)\b"). Rerun bootstrap. The new column lands in features_density.npz automatically. Wire a weight in importance.py:WEIGHTS. Run recon-QA. Did your new signal change which pairs survive compaction at k_drop=0.5?


Install

pipx install git+https://github.com/zzallirog/weighted-compact
weighted-compact compat       # read-only sanity check
weighted-compact bootstrap    # build substrate from ~/.claude/projects/
weighted-compact serve        # open labeler at http://127.0.0.1:18890/

For ambient operation:

weighted-compact install-units
systemctl --user daemon-reload
systemctl --user enable --now weighted-compact

Requirements: Linux (Arch, Debian, Ubuntu in CI), Python 3.11-3.13, ~/.claude/projects/ populated. Optional: sentence-transformers for re-embedding; Ollama with qwen2.5:7b + gemma3:4b for the default local recon-QA loop; an Anthropic API key if you want the Sonnet-grade judge as an opt-in tier (the default binds nothing to it).

Platform matrix: docs/install.md.


Status

Component

Status

Substrate builder (parse sessions, embed, decorate with 7 signals)

shipped

Importance ranker (7-signal mixture — used by compaction reader)

shipped

Labeling UI (CAPTCHA-style, anti-drift)

shipped

Compaction reader (build_compacted_context() library + qa-gate harness)

shipped

Standalone session-start delivery CLI (W2 ambient render)

next (v0.2.x)

Fidelity check (cross-family judge, Sonnet baseline measured 2026-05-21)

shipped

Baseline comparison harness (6 rankers + mixture, incl. /compact sim)

measured 2026-05-21 — see Headline

Substrate consumed by external readers (misstep / narrative / FKMF / foreign-model)

substrate format stable; readers private — see Consumers reading this substrate today

Cross-session correlation reader (corrections accumulate across sessions)

v0.3 direction

Schema extraction (third retrieval tier — weighted-compact schema {build-bank,run,all})

shipped beta v0.2.0b3 — 14/20 strict MATCH = 70% on maintainer 20-case bank with same-model judge (gate ≥60% PASS by 10pp); cross-model judge drops to 1/20, surfaces judge-calibration as the next problem to solve

Beta. Memory builder, ranker, labeler, and fidelity check work end-to-end on a real corpus. Architectural invariants are locked; the numbers around them are not. Schema may still shift between beta releases; migration notes in CHANGELOG.md.

Contributor-grade full component table (17 rows): docs/status.md.


Where to read further

File

Topic

docs/01-substrate.md

Your sessions as a self-distillation corpus

docs/02-pipeline.md

Eight modules as black boxes, box by box

docs/03-quality-driver.md

Why fidelity, not compression ratio

docs/04-grep-vs-judge.md

Two-tier signal economics: cheap regex vs LLM judge

docs/05-roadmap.md

Open items, honest forward look, 2026-05-21 baseline

docs/baselines.md

Baseline rankers — methodology + how to run the comparison

docs/schema-extraction.md

Third retrieval tier — auto-extracted rules above chunks

docs/concept.md

Longer-form take on the problem

docs/invariants.md

Three locked design invariants

docs/architecture.md

Module map and substrate pipeline

docs/importance-mixture.md

Seven-signal mixture, weight by weight + ablation data

docs/reconstruction-qa.md

Compression-fidelity measurement loop

docs/span-annotation.md

Sub-turn char-range tier design

docs/topic-decay.md

Unsupervised topic segmentation

docs/claude-code-integration.md

How the bootstrap reads ~/.claude/projects/

docs/install.md

Platform matrix, install footprint

docs/faq.md

Troubleshooting

CONTRIBUTING.md

What lands easily, what needs discussion

CHANGELOG.md

Version history


License

MIT — see LICENSE.

A
license - permissive license
-
quality - not tested
B
maintenance

Maintenance

Maintainers
<1hResponse time
0dRelease cycle
5Releases (12mo)

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/zzallirog/weighted-compact'

If you have feedback or need assistance with the MCP directory API, please join our Discord server