nautilus-compass
Black-box agent memory with drift detection · the only public memory layer that doesn't burn LLM tokens to extract facts before storing. Memory plugin for Claude Code/Desktop · Cline · Cursor · Continue.dev · Zed · stops your AI from repeating mistakes you've already flagged.
Why "black-box"? Mem0, Letta, Cognee, Zep, and MemOS all call an LLM at index time to extract entities or build a graph. compass embeds raw text locally with BGE-m3 and skips that step entirely. That makes it ~14× cheaper to reproduce on Volcengine DeepSeek pricing (regional; offshore providers cost ~5–10× that, still well below GPT-4o-judged stacks). The memory layer itself runs fully local (the agent LLM and judge LLM are cloud APIs in our default config; both are replaceable with local Ollama/vLLM), and it is drift-aware. Read the full architectural argument: paper/BLACKBOX_VS_WHITEBOX.md.
Built by Nautilus Platform · open agent ecosystem · 7 capabilities (memory · identity · runtime · marketplace · stake · A2A · MCP) · join as agent →
🇬🇧 English (this file) · 🇨🇳 中文
30-second pitch
White-box memory layers (Mem0, Letta, Cognee, Zep, MemOS, smrti):
"I call an LLM to extract facts from your conversation,
then store them in a graph. Pay extraction tokens. Send
data to the provider."
Black-box memory (compass · this project):
"I embed raw text locally with BGE-m3. No extraction LLM.
No graph. No data leaving your machine. And because raw
prompts are still in the index, I can score the next
prompt against your past mistakes before the agent acts."
The trade is real: −30 points on LongMemEval-S vs white-box leaders that build entity graphs, in exchange for 14× cheaper reproduction, full local deployment, cross-LLM portability, and drift detection that white-box systems can't offer. Full argument: paper/BLACKBOX_VS_WHITEBOX.md.
In one line: when the AI is about to forget a rule you set, take a shortcut you flagged, or fabricate a prior agreement, it gets stopped by its own history of failure patterns.
What's new in v2.0.0 · Opinionated EvoMap
v2.0.0 ships a deterministic lifecycle layer on top of the black-box memory base. It fuses paradigms from llm-wiki2 (Karpathy v2), agentmemory (LongMemEval-S 95.2% R@5), and GBrain (Garry Tan · MIT).
The bet: every other memory project (Mem0, Letta, Cognee, Zep, MemOS, llm-wiki2, agentmemory) calls an LLM at some lifecycle decision — ingest, promotion, consolidation, or forgetting. compass v2.0.0 makes them all schema-declared.
5 new frontmatter fields (write-time LLM-free)
tier: working | episodic | semantic | procedural # 4 tiers verbatim from llm-wiki2
decay_rate: 0.5 # Ebbinghaus exponential decay
forget_at: 2026-06-01T00:00:00Z # null = never · soft-archive when reached
promote_after: "7d" | "5_access" # duration or access count
reinforce_count: 0 # access event counter
Deterministic promotion rule (no LLM call)
reinforce_count >= promote_after → tier++
access event → reset decay timer + reinforce_count++
forget_at reached → soft-archive flag
procedural (top tier) does not promote
Full design rationale in paper/LLM_WIKI2_FUSE_DESIGN.md; implementation at recall.py:708+.
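The promotion rule above can be sketched in a few lines of Python. This is an illustration only, not the code in recall.py: the `MemoryRecord` class, the `tier_since` and `last_access` fields, the reset of `reinforce_count` after a promotion, the reading of "7d" as time spent in the current tier, and the use of days as the decay unit are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

TIERS = ["working", "episodic", "semantic", "procedural"]


@dataclass
class MemoryRecord:
    tier_since: datetime            # hypothetical: when the record entered its tier
    last_access: datetime           # hypothetical: start of the decay timer
    tier: str = "working"
    decay_rate: float = 0.5
    forget_at: Optional[datetime] = None
    promote_after: str = "5_access"  # "<N>_access" or "<N>d"
    reinforce_count: int = 0
    archived: bool = False


def promotion_due(rec: MemoryRecord, now: datetime) -> bool:
    """Evaluate promote_after as an access count or a duration in days."""
    if rec.promote_after.endswith("_access"):
        return rec.reinforce_count >= int(rec.promote_after.split("_")[0])
    if rec.promote_after.endswith("d"):
        return now - rec.tier_since >= timedelta(days=int(rec.promote_after[:-1]))
    raise ValueError(f"unsupported promote_after: {rec.promote_after!r}")


def on_access(rec: MemoryRecord, now: datetime) -> None:
    """Access event: reset the decay timer, bump the counter, maybe promote."""
    rec.last_access = now
    rec.reinforce_count += 1
    if rec.tier != TIERS[-1] and promotion_due(rec, now):  # top tier never promotes
        rec.tier = TIERS[TIERS.index(rec.tier) + 1]
        rec.tier_since = now
        rec.reinforce_count = 0  # assumption: the counter restarts in the new tier


def retention(rec: MemoryRecord, now: datetime) -> float:
    """Exponential decay exp(-decay_rate * days since last access)."""
    days = (now - rec.last_access).total_seconds() / 86400
    return math.exp(-rec.decay_rate * days)


def sweep(rec: MemoryRecord, now: datetime) -> None:
    """Soft-archive the record once forget_at is reached (None means never)."""
    if rec.forget_at is not None and now >= rec.forget_at:
        rec.archived = True


t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
rec = MemoryRecord(tier_since=t0, last_access=t0)
for _ in range(5):
    on_access(rec, t0)
print(rec.tier)  # episodic
```

Every decision here is a comparison against stored fields, which is what keeps the write path free of LLM calls.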
Other v2.0.0 additions
9 agentmemory-verbatim lifecycle hooks in stop_hook.py for Claude Code: SessionStart, UserPromptSubmit, PreToolUse, PostToolUse, PostToolUseFailure, PreCompact, SubagentStart/Stop, SessionEnd
add_worker(spec) MCP tool: super-agents register deterministic worker specs (cron / pubsub / queue / http / custom) to .cache/workers.jsonl
RRF k=60 fusion in recall.py: combine BM25 + vector + KG ranked lists with session-diversified output (max 3 per session · agentmemory verbatim)
npx nautilus-compass init: one-command workspace setup creating .compass/.env, sample anchors, and Claude Code hook templates
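As a rough illustration of the RRF step listed above, the sketch below fuses three ranked lists with reciprocal rank fusion (k=60) and then caps the output at three hits per session. It is not the code in recall.py: the function names, the `session_of` mapping, and the tie-break by document id are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(ranked_lists: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def diversify(ranked_ids: Sequence[str], session_of: Dict[str, str],
              max_per_session: int = 3) -> List[str]:
    """Keep the fused order but allow at most max_per_session hits per session."""
    taken: Dict[str, int] = defaultdict(int)
    kept: List[str] = []
    for doc_id in ranked_ids:
        session = session_of[doc_id]
        if taken[session] < max_per_session:
            taken[session] += 1
            kept.append(doc_id)
    return kept


bm25 = ["a", "b", "c"]
vector = ["b", "a", "d"]
kg = ["b", "c", "a"]
fused = rrf_fuse([bm25, vector, kg])
print(fused)  # ['b', 'a', 'c', 'd']
print(diversify(fused, {"a": "s1", "b": "s1", "c": "s1", "d": "s1"}))  # ['b', 'a', 'c']
```

Because RRF uses only ranks, the BM25, vector, and KG scores never need to be normalized against each other.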
"Opinionated" — what we declined
Frame borrowed from GBrain ("Garry's Opinionated OpenClaw/Hermes Agent Brain"). compass v2.0.0 takes a stance on what not to include:
❌ No LLM at ingest (USD 3.50 / 100M tokens · BGE-m3 embeds raw text)
❌ No LLM at tier promotion (deterministic schema only · reinforce_count + promote_after)
❌ No LLM at forgetting (ISO8601 forget_at + counter only)
❌ No vendoring of GBrain or OpenViking source · paradigms are rewritten from scratch in Python · GBrain (MIT, TypeScript) and OpenViking (AGPL-3.0, verified 2026-05-22) are paradigm references only
❌ No graph rerank for LongMemEval-style closed haystacks · cost us −6.2 pts in v0.8 (paper/RESULTS_v0.8.md)
What problem does this solve
A. Long sessions drift
You told Claude at session start: "never claim deployment success without verification." Fifty prompts later Claude says "deployed successfully ✅" — without verifying. The memory rule was there; the AI forgot it under context pressure.
B. White-box drift detection isn't reachable
Persona Vectors (Anthropic, 2025) proved that LLM activations contain directions for sycophancy and hallucination. But that requires model weights — closed APIs (Claude, GPT-4) don't expose them. There has been no production black-box equivalent that runs in a Claude Code hook.
C. Memory plugins solve only half the problem
Mem0, Letta, claude-mem, Zep all compete on "recall the most relevant past memory." But memory recalled doesn't stop the AI from breaking the rule this time — that other half has been unsolved.
How it works
User prompt: "Fix bug X for me"
│
▼
┌─────────────────────────────────────┐
│ UserPromptSubmit Hook (this plugin)│
└─────────────────────────────────────┘
│
┌────────────┼────────────┐
▼ ▼ ▼
┌────────┐ ┌─────────┐ ┌──────────┐
│ recall │ │ drift │ │ profile │
│ memory │ │ check │ │ aggregate│
└────────┘ └─────────┘ └──────────┘
│
▼
Hooks inject results into Claude's system prompt:
- Time-bucketed past memory (BGE-m3 semantic recall)
- Drift score + nearest negative anchor (if score < threshold)
- Profile facts ("you have 3 unfinished tasks in this repo")
│
▼
Claude answers, with full context loaded
The drift detector compares each prompt against an anchor set (25 positive + 35 negative behavioral patterns drawn from real failure transcripts) using BGE-m3 cosine similarity. It reaches AUC 0.83 on held-out data with 50 ms p95 hook latency.
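To make the anchor comparison concrete, here is a minimal sketch of an anchor-based drift check. The README does not give the scoring formula, so the one used here (best positive similarity minus best negative similarity, flagged when below a threshold) is an assumption, as are the function names, the anchor labels, and the toy 3-dimensional vectors standing in for BGE-m3 embeddings.

```python
import math
from typing import Dict, List, Optional, TypedDict


class DriftResult(TypedDict):
    score: float
    flagged: bool
    nearest_negative: Optional[str]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def drift_check(prompt_vec: List[float],
                positive: Dict[str, List[float]],
                negative: Dict[str, List[float]],
                threshold: float = 0.0) -> DriftResult:
    """Assumed score: max similarity to positives minus max similarity to negatives."""
    best_pos = max(cosine(prompt_vec, v) for v in positive.values())
    neg_name, best_neg = max(
        ((name, cosine(prompt_vec, v)) for name, v in negative.items()),
        key=lambda pair: pair[1],
    )
    score = best_pos - best_neg
    flagged = score < threshold
    return {"score": score, "flagged": flagged,
            "nearest_negative": neg_name if flagged else None}


# Toy anchors; a real run would embed anchor texts and the prompt with BGE-m3.
positive = {"verify_before_claiming": [1.0, 0.0, 0.0]}
negative = {"claims_success_unverified": [0.0, 1.0, 0.0],
            "fabricates_agreement": [0.0, 0.0, 1.0]}

result = drift_check([0.1, 0.9, 0.0], positive, negative)
print(result["flagged"], result["nearest_negative"])
# True claims_success_unverified
```

With precomputed anchor embeddings, the per-prompt cost is one embedding call plus 60 dot products.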
Headline numbers
Benchmark | Score | Honest compare |
LongMemEval-S (n=500) | 56.6% (locked at v0.8) | open-source 50–60% band · white-box leaders (OMEGA, Mem0g, ByteRover) report 90+% — that gap is an architectural ceiling for black-box, not a tuning gap. See BLACKBOX_VS_WHITEBOX. |
EverMemBench-Dynamic (n=500) | 44.4% (Run 1) / 47.3% (Run 2) | tops the four published Table 4 baselines (Mem0 37.09, Zep 39.97, MemOS 42.55, MemoBase 34.27). Not "industry SOTA" — OMEGA / Mem0g haven't reported on EverMemBench publicly. |
Drift detector AUC | 0.83 held-out / 0.92 in-set | only public memory layer that does drift detection at all — white-box systems abstract prompts into facts before drift becomes checkable |
Reproduction cost | ~$3.50 for 500 LongMemEval questions | ~14× cheaper than GPT-4o-judged stacks ($50+) |
p95 hook latency | <50 ms | safe for every-prompt invocation |
We deliberately report Run 1 (44.4%) as the abstract headline for
EverMemBench to avoid cherry-picking; the cross-run mean (45.84%) clears
MemOS by +3.3 pts. See paper/sections/paper2_06_5_evermembench.tex
for honest dual-run + Gemini cross-judge sensitivity analysis.
Try it without installing: live drift-detection + Merkle-integrity demo at huggingface.co/spaces/chunxiaox/nautilus-compass (CPU only · metadata-mode jaccard fallback · no signup needed).
Reproduce the numbers: evaluation dataset (behavioral anchors + labeled session traces for drift ROC + LongMemEval-S / EverMemBench scoring) is live on the Hugging Face Hub: huggingface.co/datasets/chunxiaox/nautilus-compass-test-data
from datasets import load_dataset
ds = load_dataset("chunxiaox/nautilus-compass-test-data")
Quickstart
Install in Claude Code
git clone https://github.com/chunxiaoxx/nautilus-compass ~/.claude/plugins/nautilus-compass
bash ~/.claude/plugins/nautilus-compass/install.sh
# Start the BGE-m3 daemon (one-time per boot)
bash ~/.claude/plugins/nautilus-compass/daemon_start.sh
The installer wires three hooks into ~/.claude/settings.json:
UserPromptSubmit → injects time-bucketed memory recall + drift
PostToolUse → mid-session writer
Stop → end-of-session summary writer
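For orientation, the sketch below shows roughly what a UserPromptSubmit hook command could do. It is not the plugin's hook code. It assumes the hook receives the event as a JSON object with a `prompt` field on stdin and that text written to stdout is added to the model's context; check the Claude Code hooks documentation for the actual contract. `build_context`, `main`, and the placeholder recall and drift values are hypothetical.

```python
import json
import sys
from typing import Dict, List, TextIO


def build_context(memories: List[str], drift: Dict[str, object]) -> str:
    """Render recalled memories and an optional drift warning as plain text."""
    lines: List[str] = []
    if memories:
        lines.append("Relevant past memory:")
        lines.extend(f"- {m}" for m in memories)
    if drift.get("flagged"):
        lines.append(
            f"Drift warning (score {drift['score']:.2f}): "
            f"nearest negative anchor is {drift['nearest_negative']}"
        )
    return "\n".join(lines)


def main(stdin: TextIO = sys.stdin, stdout: TextIO = sys.stdout) -> None:
    """Read the hook event, look things up, and print the text to inject."""
    event = json.load(stdin)          # assumed shape: {"prompt": "...", ...}
    prompt = event.get("prompt", "")
    # Placeholders: a real hook would query the recall index and the drift
    # detector with `prompt` instead of returning fixed values.
    memories: List[str] = []
    drift = {"flagged": False, "score": 0.0, "nearest_negative": None}
    if prompt:
        stdout.write(build_context(memories, drift))


print(build_context(
    ["never claim deployment success without verification"],
    {"flagged": True, "score": -0.88, "nearest_negative": "claims_success_unverified"},
))
```

Keeping the rendering in a pure function like `build_context` makes the injected text easy to test without running Claude Code.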
Five user-facing slash commands appear in Claude Code:
/compass-verify · /compass-drift · /compass-recall ·
/compass-search · /compass-status.
Install in any other MCP client
python ~/.claude/plugins/nautilus-compass/scripts/install_to_agent.py
Auto-detects Claude Desktop, Cursor, Cline, Continue.dev, and Zed Editor, then patches their MCP config. See docs/AGENT_ONBOARDING.md for per-agent copy-paste configs and docs/mcp-usage.md for the raw protocol specification.
Cloud-hosted alternative (no local install)
curl https://compass.nautilus.social/.well-known/agent.json
Returns the standard A2A discovery descriptor. Sign up at compass.nautilus.social/signup for a hosted gateway with multi-user sync, audit log, and managed BGE-m3 deployment.
What's exposed (7 MCP tools)
Tool | Purpose | Latency |
| Write observation with auto-anchor + drift signal | ~150 ms |
| BGE-m3 semantic + keyword search | ~200 ms |
| Time-bucketed session-log search | ~80 ms |
| Work-profile aggregate (topics, agents, drift trend) | ~100 ms |
| Black-box drift score against anchors | <50 ms |
| Drift score timeline for trend audit | ~30 ms |
| Log positive/negative anchor signal | <20 ms |
The MCP server speaks JSON-RPC 2.0 over stdio / TCP / TLS / mTLS.
Per-token RBAC, per-token rate limiting, notifications/{progress,
cancelled, message}, logging/setLevel, and resources/* for session-log
streaming are all spec-complete.
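As an illustration of the wire format, the sketch below builds a JSON-RPC 2.0 `tools/call` request and encodes it as one newline-delimited message, which is how MCP frames messages on the stdio transport. The tool name `session_search` appears elsewhere in this README; the `query` argument and the helper names are assumptions, so consult docs/mcp-usage.md for the real tool schemas.

```python
import json
from itertools import count
from typing import Any, Dict

_request_ids = count(1)  # JSON-RPC requests need a unique id per connection


def tools_call(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def encode_line(message: Dict[str, Any]) -> str:
    """Serialize to a single line; json.dumps escapes any embedded newlines."""
    return json.dumps(message, separators=(",", ":")) + "\n"


# Hypothetical arguments for illustration only.
request = tools_call("session_search", {"query": "deploy verification"})
print(encode_line(request), end="")
```

A client would write this line to the server's stdin and read the response line whose `id` matches the request.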
Comparison
Capability | this | mem0 | Letta | Zep | claude-mem | MemOS | Smriti |
Cross-agent memory | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | archive-only |
MCP A2A protocol native | ✅ TLS+mTLS+RBAC | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
Drift detection | ✅ AUC 0.83 | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
Merkle integrity audit log | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
LongMemEval-S verified | ✅ 56.6% (locked) | n/r | n/r | n/r | ❌ | n/r | ❌ |
EverMemBench verified | ✅ 44.4-47.3% | 37.09 | n/r | 39.97 | n/r | 42.55 | ❌ |
Self-host + hosted both | ✅ | ☁ only | ✅ | ☁ only | ✅ | OSS only | OSS only |
License | MIT | Apache | Apache | proprietary | MIT | Apache | MIT |
n/r = not reported in their published evaluations. Smriti is a team
conversation archive with git-based sharing — different scope from a
runtime memory layer, so most rows are intentionally out-of-scope rather
than missing features.
Platform integration · BP1 + BP3 contract
If you run the OSS plugin alongside a Nautilus-style task platform (or your own multi-agent backend), two MCP tools open a bidirectional channel without any new HTTP server:
Tool | Direction | Purpose |
| compass dialog → platform | Push a task into the platform's queue. File-based by default ( |
| platform → compass | Platform agent reports completion. Writes a JSON archive AND a |
End-to-end round-trip — no platform deployment needed for the OSS half:
python examples/platform_flywheel_demo.py
# [1] compass dialog → submit_platform_task (queues to file)
# [2] platform V5 cycle ← poll _platform_queue/ (claims by status flip)
# [3] platform agent → executes channels (simulated)
# [4] platform agent → ingest_platform_task_result
# [5] compass dialog → session_search (HIT · result is searchable)
# OK · BP1 + BP3 round-trip verified
The full wire spec, breakpoint analysis, and SaaS-side TODO list live in docs/PLATFORM_HANDSHAKE.md §7.
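The file-based queue in steps [1] and [2] can be pictured with the sketch below. It is an illustration under assumptions, not the contract in docs/PLATFORM_HANDSHAKE.md: the one-JSON-file-per-task layout, the `queued` and `claimed` status values, and the function names are hypothetical. Only the `_platform_queue/` directory name and the claim-by-status-flip idea come from the demo output above.

```python
import json
import os
import uuid
from pathlib import Path
from typing import Any, Dict, Optional


def _write_atomic(path: Path, task: Dict[str, Any]) -> None:
    """Write via a temp file and rename so readers never see a partial file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(task))
    os.replace(tmp, path)


def submit_task(queue_dir: Path, payload: Dict[str, Any]) -> str:
    """Queue a task as <id>.json with status 'queued' and return its id."""
    queue_dir.mkdir(parents=True, exist_ok=True)
    task_id = uuid.uuid4().hex
    _write_atomic(queue_dir / f"{task_id}.json",
                  {"id": task_id, "status": "queued", "payload": payload})
    return task_id


def claim_next(queue_dir: Path) -> Optional[Dict[str, Any]]:
    """Claim the first queued task by flipping its status to 'claimed'."""
    for path in sorted(queue_dir.glob("*.json")):
        task = json.loads(path.read_text())
        if task["status"] == "queued":
            task["status"] = "claimed"
            _write_atomic(path, task)
            return task
    return None
```

A typical round would call `submit_task(Path("_platform_queue"), {...})` from the dialog side and `claim_next(Path("_platform_queue"))` from the poller. This sketch is safe for a single poller only; several concurrent pollers would need a lock or an atomic rename to avoid claiming the same task twice.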
V7 governance layer (v0.1, opt-in)
For deployments running multiple specialised executors (V5, V6, Kairos, …), three additional MCP tools provide a thin governance layer that decomposes multi-channel work, audits cross-agent state, and locks the L0 immutable core. V7 sits above the executors — it routes and audits, it does not execute or chat with an LLM itself.
Tool | Purpose |
| Decompose 1 complex task → N routed sub-tasks (heuristic table picks executor per channel) |
| Scan recent session logs for fake-closure / red drift / empty platform results |
| SHA256 lock on |
python examples/v7_governance_demo.py
# [1] V7 governance_lock_check · bootstrap + verify
# [2] V7 governance_dispatch · 4 channels → routed to v5/v5/v6/kairos
# [3] V7 governance_audit · 7-day scan
# OK · V7 v0.1 governance round-trip verified
Contract details + platform-side TODOs (cron, governance fee, CI gate, telegram /dispatch) in docs/PLATFORM_HANDSHAKE.md §8.
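The "bootstrap + verify" behaviour of step [1] can be sketched as a SHA256 lock over a single file. The README does not say which files make up the L0 core or where the lock is stored, so the `core_file` and `lock_file` paths, the JSON lock format, and the return values below are assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA256 digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def lock_check(core_file: Path, lock_file: Path) -> str:
    """First run records the digest; later runs compare against it."""
    digest = sha256_of(core_file)
    if not lock_file.exists():
        lock_file.write_text(json.dumps({"sha256": digest}))
        return "bootstrapped"
    expected = json.loads(lock_file.read_text())["sha256"]
    return "ok" if digest == expected else "tampered"
```

The first call writes the lock, an unchanged file then returns "ok", and any edit to the locked file returns "tampered" until the lock is deliberately regenerated.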
Documentation
docs/AGENT_ONBOARDING.md: per-agent install configs (6 platforms + 3 frameworks)
docs/mcp-usage.md: raw MCP protocol guide, TLS setup, RBAC
docs/PLATFORM_HANDSHAKE.md: OSS↔SaaS coordination contract
paper/: two papers (drift detection + memory pipeline) and supporting eval scripts
CHANGELOG.md: versioned release notes
CONTRIBUTING.md: adding new domain anchors / running benchmarks
Citation
If you use this work, please cite:
Paper 1 · drift detection:
@misc{nautiluscompass-drift-2026,
title = {Nautilus Compass: Black-box Persona Drift Detection
for Production LLM Agents},
author = {Chunxiao Wang},
year = {2026},
note = {Yiluo Technology Co., Ltd.},
howpublished = {\url{https://github.com/chunxiaoxx/nautilus-compass}}
}
Paper 2 · memory pipeline + EverMemBench cross-bench:
@misc{nautiluscompass-memrecall-2026,
title = {Closing the Memory Recall Gap with Chinese LLMs:
A Multi-Stage Retrieval Pipeline Achieving Zep-SOTA Performance
on LongMemEval-S at 1/15 Cost},
author = {Chunxiao Wang},
year = {2026},
note = {Yiluo Technology Co., Ltd.},
howpublished = {\url{https://github.com/chunxiaoxx/nautilus-compass}}
}
The howpublished field will be updated to the arXiv identifier once the preprints are live.
We also build on prior work — please cite as appropriate:
BGE-m3 / BGE-Reranker (Chen et al., BAAI 2024)
Persona Vectors (Chen et al., Anthropic, arXiv:2507.21509) — complementary white-box approach, not the same as ours
DPT-Agent strategy distillation (arXiv:2502.11882)
A-MEM dynamic links (arXiv:2502.12110)
LongMemEval (Wu et al., NeurIPS 2024)
EverMemBench (Hu et al., 2026)
License
Code, plugin, MCP wrapper, papers, scripts: MIT (see LICENSE)
Behavioral anchor files (anchors*.json): CC0 1.0 Universal (see LICENSE-ANCHORS)
You may use this in any project, commercial or otherwise, with attribution.
Contributors
PRs welcome — see CONTRIBUTING.md.
Contact
Author: Chunxiao Wang · Yiluo Technology Co., Ltd. · chunxiaoxx@gmail.com
Hosted gateway: compass.nautilus.social
Chinese documentation: README.zh-CN.md