# waggle-mcp

## Why waggle-mcp?
Most LLMs forget everything when the conversation ends. waggle-mcp fixes that by giving your AI a persistent knowledge graph it can read and write through any MCP-compatible client.
Waggle's key advantage is token efficiency with structured context:
| Without waggle-mcp | With waggle-mcp |
| --- | --- |
| Context stuffed into a 200k-token prompt | ~4× fewer tokens — compact subgraph, only relevant nodes retrieved |
| "What did we decide about the DB schema?" → ❌ lost when the session ended | ✅ Recalls the decision node, when it was made, and what it contradicts |
| Flat bullet-list memory | Typed edges: `relates_to`, `contradicts`, `depends_on`, `part_of`, `updates`, `derived_from`, `similar_to` |
| One session, one agent | Multi-tenant, multi-session, multi-agent |
Note on retrieval: Waggle trades some raw recall coverage for dramatically lower token cost and richer relational context. See the benchmark section for honest numbers.
## Quick start — 30 seconds
```shell
pip install waggle-mcp
waggle-mcp init
```

The init wizard detects your MCP client, writes its config file, and creates the database directory — no JSON editing required. It supports Claude Desktop, Cursor, Codex, and a generic JSON fallback.
After init, restart your MCP client and your AI has persistent memory.
No cloud service. No API key. Semantic search runs fully locally.
## See it in action
Here's a concrete before/after for a developer using the AI daily:
**Session 1 — April 10**

```
User: Let's use PostgreSQL. MySQL replication has been painful.

Agent: [calls observe_conversation()]
  → stores decision node: "Chose PostgreSQL over MySQL"
  → stores reason node: "MySQL replication painful"
  → links them with a depends_on edge
```

**Session 2 — April 12** (fresh context window, no history)

```
User: What did we decide about the database?

Agent: [calls query_graph("database decision")]
  → retrieves the decision node + linked reason from April 10

  "You decided on PostgreSQL on April 10. The reason recorded was
   that MySQL replication had been painful."
```

**Session 3 — April 14**

```
User: Actually, let's reconsider — the team is more familiar with MySQL.

Agent: [calls store_node() + store_edge(new_node → old_node, "contradicts")]
  → conflict is flagged automatically; both positions are preserved in the graph
```

The agent never needed explicit instructions to remember or retrieve — it called the right tools based on the conversation, and the graph gave it the right context.
## How it works
Memory doesn't just get stored — it flows through a lifecycle:
```
You talk to your AI
        │
        ▼
observe_conversation()       ← AI drops the turn in; facts extracted via structured LLM (regex fallback)
        │
        ▼
Graph nodes are created      ← "Chose PostgreSQL" becomes a decision node
Edges are inferred           ← linked to the "database" entity node
        │
        ▼
Future conversation starts
        │
        ▼
query_graph("DB schema")     ← semantic search finds the node from 3 sessions ago
        │
        ▼
AI answers with full context ← "You decided on PostgreSQL on Apr 10, here's why…"
```

Every node carries semantic embeddings computed locally using `all-MiniLM-L6-v2` — a fast, lightweight model that runs entirely on-device with no API key or network call required. This means semantic search works offline, costs nothing per query, and keeps your data private.
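To make the retrieval step concrete, here is a minimal, dependency-free sketch of cosine-similarity ranking over stored embeddings. It is illustrative only: the vector math is standard, but the node shape (`label`/`embedding` keys) is an assumption of this sketch, not Waggle's actual schema, and the toy 2-d vectors stand in for all-MiniLM-L6-v2's real 384-dimensional embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0  # degenerate vector: treat as dissimilar
    return dot / (na * nb)

def top_k(query_vec, nodes, k=5):
    """Rank stored nodes by embedding similarity to the query, best first."""
    ranked = sorted(nodes, key=lambda n: cosine(query_vec, n["embedding"]),
                    reverse=True)
    return ranked[:k]

# Toy 2-d "embeddings" (real all-MiniLM-L6-v2 vectors are 384-d)
nodes = [
    {"label": "Chose PostgreSQL over MySQL", "embedding": [0.9, 0.1]},
    {"label": "User prefers dark mode",      "embedding": [0.1, 0.9]},
]
best = top_k([0.8, 0.2], nodes, k=1)[0]["label"]
```

The same ranking idea applies regardless of vector dimension; only the embedding model changes the quality of the neighbors.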
## The magic tool: observe_conversation

This is the tool you'll use most. You don't have to manually store facts — just tell the agent to observe each conversation turn and it handles the rest.

```python
observe_conversation(user_message, assistant_response)
```

Under the hood, it:

1. Extracts atomic facts from both sides of the conversation
2. Deduplicates against existing nodes using semantic similarity
3. Creates typed edges between related concepts
4. Flags contradictions with existing stored beliefs
No instructions needed. No schema to define. Just observe.
Every call runs a Pydantic-validated LLM extraction pass (with a regex fallback) to pull structured facts out of messy dialogue.
Example: "Let's use PostgreSQL because MySQL replication is too painful."
```json
{
  "facts": [
    {
      "label": "PostgreSQL for generic events",
      "content": "Chose PostgreSQL over MySQL because MySQL replication is too painful.",
      "node_type": "decision",
      "confidence": 0.95,
      "tags": ["llm-extracted", "confidence:0.95"]
    }
  ]
}
```

Any extraction with confidence < 0.5 or an invalid schema is silently dropped to prevent hallucination noise.
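The confidence-and-schema gate can be sketched in a few lines. This is a schematic of the filtering rule described above, not Waggle's actual implementation (which validates with Pydantic); the key names mirror the example payload.

```python
MIN_CONFIDENCE = 0.5                      # threshold from the text above
REQUIRED_KEYS = {"label", "content", "node_type", "confidence"}

def keep_fact(fact):
    """True only for facts that pass both schema and confidence checks."""
    if not isinstance(fact, dict) or not REQUIRED_KEYS <= fact.keys():
        return False                      # invalid schema: drop silently
    conf = fact["confidence"]
    return isinstance(conf, (int, float)) and conf >= MIN_CONFIDENCE

def filter_extraction(payload):
    """Keep only valid, confident facts from an extraction payload."""
    return [f for f in payload.get("facts", []) if keep_fact(f)]

payload = {"facts": [
    {"label": "DB choice", "content": "Chose PostgreSQL over MySQL",
     "node_type": "decision", "confidence": 0.95},
    {"label": "Vague", "content": "do that thing we talked about",
     "node_type": "fact", "confidence": 0.3},   # dropped: below 0.5
]}
kept = filter_extraction(payload)
```

Dropping rather than guessing is the design choice: a missing memory is recoverable, a hallucinated one is not.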
## Memory model

Node types — what gets stored:

| Type | Example |
| --- | --- |
| `fact` | "The API uses JWT tokens" |
| `preference` | "User prefers dark mode" |
| `decision` | "Chose PostgreSQL over MySQL" |
| `entity` | "Project: waggle-mcp" |
| … | "Rate limiting" |
| … | "Should we add GraphQL?" |
| … | "TODO: add integration tests" |
Edge types — how nodes connect:

`relates_to` · `contradicts` · `depends_on` · `part_of` · `updates` · `derived_from` · `similar_to`
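As a sketch of how typed edges enable conflict tracking, here is a minimal in-memory graph. The class and field names are assumptions of this sketch, not Waggle's actual storage layer (which is SQLite or Neo4j); only the edge-type vocabulary comes from the list above.

```python
from collections import defaultdict

EDGE_TYPES = {"relates_to", "contradicts", "depends_on", "part_of",
              "updates", "derived_from", "similar_to"}

class TinyGraph:
    """Minimal illustrative node/edge store; not Waggle's actual classes."""

    def __init__(self):
        self.nodes = {}                 # id -> {"content", "node_type"}
        self.edges = defaultdict(list)  # src id -> [(edge_type, dst id)]

    def add_node(self, node_id, content, node_type="fact"):
        self.nodes[node_id] = {"content": content, "node_type": node_type}

    def add_edge(self, src, dst, edge_type):
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        self.edges[src].append((edge_type, dst))

    def conflicts(self, node_id):
        """Ids this node contradicts; both positions stay in the graph."""
        return [dst for etype, dst in self.edges[node_id]
                if etype == "contradicts"]

g = TinyGraph()
g.add_node("d1", "Chose PostgreSQL over MySQL", "decision")
g.add_node("d2", "Reconsider: team knows MySQL better", "decision")
g.add_edge("d2", "d1", "contradicts")
```

Because a `contradicts` edge links the two nodes instead of overwriting one, the earlier decision and its reversal both stay queryable.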
## MCP tools

Your AI calls these directly — you don't need to use them manually.

| Tool | What it does |
| --- | --- |
| `observe_conversation` | Drop a conversation turn in — facts extracted, stored, and linked |
| `query_graph` | Semantic + temporal search across the graph |
| `store_node` | Manually save a fact, preference, decision, or note |
| `store_edge` | Link two nodes with a typed relationship |
| … | Traverse edges from a specific node |
| … | Update content or tags on an existing node |
| … | Remove a node and all its edges |
| `decompose_and_store` | Break long content into atomic nodes automatically |
| `graph_diff` | See what changed in the last N hours |
| `prime_context` | Generate a compact brief for a new conversation |
| `get_topics` | Detect topic clusters via community detection |
| `stats` | Node/edge counts and most-connected nodes |
| `export_graph_html` | Interactive browser visualization |
| … | Portable JSON backup |
| … | Restore from a JSON backup |
## Performance & Benchmarking
All numbers below are reproducible from the checked-in fixtures in benchmarks/fixtures/ using the harness at scripts/benchmark_extraction.py. Saved output artifacts live in tests/artifacts/.
One command produces all the tables below (extraction regex baseline, retrieval, dedup, and the comparative token-efficiency pilot):
```shell
PYTHONPATH=src .venv/bin/python scripts/benchmark_extraction.py \
  --extraction-backend regex \
  --systems waggle rag_naive \
  --output tests/artifacts/benchmark_current.json
```

The LLM extraction row (75%) requires a separate run with a local Ollama instance — it is not included in `benchmark_current.json`:
```shell
# Requires Ollama running locally with qwen2.5:7b pulled
PYTHONPATH=src .venv/bin/python scripts/benchmark_extraction.py \
  --extraction-backend llm --ollama-model qwen2.5:7b --ollama-timeout-seconds 30
```

### Extraction accuracy
Corpus: 12 dialogue pairs covering simple recall, interruptions, reversals, vague statements, and conflicting signals (benchmarks/fixtures/extraction_cases.json).
| Backend | Cases | Accuracy |
| --- | --- | --- |
| Regex (fallback) | 12 | 33% |
| LLM (`qwen2.5:7b`) | 12 | 75% |
### Retrieval accuracy
Corpus: 18 nodes, 18 queries — 6 easy (direct paraphrase) and 12 hard (adversarial: semantic generalization, temporal disambiguation, indirect domain translation, privacy framing). Source: benchmarks/fixtures/retrieval_cases.json.
| Difficulty | Queries | Hit@k |
| --- | --- | --- |
| Easy | 6 | 6/6 = 100% |
| Hard (adversarial) | 12 | 9/12 = 75% |
| Overall | 18 | 15/18 = 83% |
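Hit@k here means the expected node appears among the top-k retrieved results. A minimal scorer over (ranked ids, expected id) pairs, useful for sanity-checking tables like the one above (a sketch, not the harness's actual code):

```python
def hit_at_k(ranked_ids, gold_id, k=5):
    """True if the expected node id appears in the top-k results."""
    return gold_id in ranked_ids[:k]

def hit_rate(cases, k=5):
    """Fraction of (ranked_ids, gold_id) cases with the gold id in top-k."""
    hits = sum(hit_at_k(ranked, gold, k) for ranked, gold in cases)
    return hits / len(cases)

cases = [
    (["n3", "n1", "n9"], "n1"),   # hit: gold appears at rank 2
    (["n2", "n7", "n4"], "n5"),   # miss: gold not retrieved at all
]
rate = hit_rate(cases, k=3)
```

Note that Hit@k is binary per query: retrieving the right node at rank 1 and at rank k count the same, which is why the exact-support metric below is reported separately.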
### Token efficiency vs. naive chunked-vector RAG
The retrieval accuracy table above measures Waggle's standalone search quality. The comparison below uses a separate multi-session corpus designed to test token efficiency against a chunked-vector baseline.
Corpus: 24 multi-session scenarios, 66 retrieval queries across 7 task families (benchmarks/fixtures/comparative_eval.json).
| Task family | Queries | Waggle Hit@k | RAG Hit@k |
| --- | --- | --- | --- |
| … | 18 | 18/18 = 100% | 100% |
| … | 19 | 17/19 = 89% | 100% |
| … | 11 | 10/11 = 91% | 100% |
| `cross_scenario_synthesis` | 8 | 8/8 = 100% | 100% |
| … | 4 (small n) | 4/4 = 100% | 100% |
| … | 4 (small n) | 2/4 = 50% | 100% |
| … | 2 (small n) | 1/2 = 50% | 100% |
| Overall | 66 | 60/66 = 91% | 100% |
| System | Mean tokens | Median tokens | p95 tokens | Hit@k | Exact support |
| --- | --- | --- | --- | --- | --- |
| Waggle | 36.9 | 37.0 | 42.0 | 91% | 74% |
| Naive chunked-vector RAG | 152.8 | 155.0 | 162.8 | 100% | 100% |
Waggle uses ~4× fewer tokens per retrieval than the naive chunked baseline on this corpus.
The gap between Waggle's Hit@k (91%) and exact support (74%) indicates that graph retrieval finds the right topic but sometimes returns insufficient supporting detail — most visibly on cross_scenario_synthesis queries (8/8 hit, 1/8 exact). Improving context assembly — specifically edge traversal depth and multi-hop subgraph expansion — is a tracked next step.
The tradeoff is honest: the chunked baseline achieves 100% Hit@k on this corpus because at top_k=5 every fact is retrievable from its own session chunk. The token efficiency advantage is real and reproducible; the retrieval superiority claim requires a corpus where chunk coverage can't compensate for missing relational context. Corpus hardening is ongoing.
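The mean/median/p95 columns can be recomputed from raw per-query token counts with the standard library alone. A sketch; the exact percentile method used by the benchmark harness is not specified here, so small differences at the p95 are possible:

```python
import statistics

def token_summary(token_counts):
    """Mean, median, and 95th percentile of per-retrieval token counts."""
    # quantiles(n=20) returns 19 cut points; the last one is the p95
    p95 = statistics.quantiles(token_counts, n=20)[-1]
    return {
        "mean": statistics.mean(token_counts),
        "median": statistics.median(token_counts),
        "p95": p95,
    }

summary = token_summary(list(range(1, 101)))   # counts 1..100
```

Reporting p95 alongside the mean matters for context budgeting: the mean tells you the typical prompt cost, the p95 tells you what to reserve.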
### When extraction fails

> User: "Yeah, let's just do that thing we talked about."
The LLM assigns low confidence (confidence < 0.5) to ambiguous input; Waggle drops the extraction silently rather than storing a guess. The pipeline does not silently fall back to regex on timeout — backend failures surface as explicit errors that are logged.
### Deduplication

Corpus: 22 node pairs — 11 true duplicates (synonym, paraphrase, domain equivalence) and 11 false friends (same technology category, different technology). Source: `benchmarks/fixtures/dedup_cases.json`.

The pipeline runs five layers:

- **Layer 0 — entity-key hard block:** if both nodes name different technologies in the same category (e.g. `postgresql` vs `mysql`), the merge is blocked unconditionally.
- **Layer 0b — numeric-conflict guard:** same entity but different critical numbers (e.g. `jwt` 15 min vs 1 hr) → block. Guards against merging distinct facts that share a technology but differ on a key value.
- **Layer 1 — exact string match:** normalized content or label equality.
- **Layer 2 — substring containment:** one sentence is a strict subset of the other.
- **Layer 3 — semantic similarity:** cosine via `all-MiniLM-L6-v2`:
  - Same-entity aggressive path: if both reference the same entity token, merge at cosine ≥ 0.60 (catches paraphrase true-dups like "fastapi was chosen" / "we chose fastapi because async")
  - Type-aware thresholds: `decision`/`preference` → 0.82; `fact` → 0.92; `entity` → 0.97
  - Jaccard-boosted path: word overlap ≥ 0.35 AND cosine ≥ (type threshold − 0.05)
  - Conservative global fallback
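The layered decision can be sketched as a single function. The thresholds come from the list above, but the overall shape and the `entity` field (a pre-extracted technology token) are assumptions of this sketch, and the numeric-conflict and Jaccard layers are omitted for brevity:

```python
TYPE_THRESHOLDS = {"decision": 0.82, "preference": 0.82,
                   "fact": 0.92, "entity": 0.97}

def should_merge(a, b, cos_sim):
    """Layered duplicate check for two candidate nodes.

    a, b: dicts with 'content', 'node_type', and an optional 'entity'
    key (a pre-extracted technology token); cos_sim: embedding cosine.
    """
    ea, eb = a.get("entity"), b.get("entity")
    # Layer 0: different entities -> hard block, regardless of similarity
    if ea and eb and ea != eb:
        return False
    ca = " ".join(a["content"].lower().split())
    cb = " ".join(b["content"].lower().split())
    # Layer 1: exact normalized match
    if ca == cb:
        return True
    # Layer 2: substring containment
    if ca in cb or cb in ca:
        return True
    # Layer 3a: same entity -> aggressive paraphrase threshold
    if ea and eb and cos_sim >= 0.60:
        return True
    # Layer 3b: type-aware threshold (conservative fallback)
    return cos_sim >= TYPE_THRESHOLDS.get(a["node_type"], 0.92)
```

The ordering is the point: cheap hard blocks run before any similarity score can argue for a merge, which is why false-friend merges stay at zero even as thresholds loosen.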
Best measured: 18/22 = 82% at threshold 0.82, with zero false positives at every tested threshold — no false-friend pair was ever merged.
The remaining 4 false-negatives are pure-paraphrase pairs with no recognisable entity anchor ("user prefers dark mode" / "user wants dark mode UI", "async non-negotiable" / "concurrent without blocking"). These require either semantic similarity fine-tuning or a learned paraphrase classifier to close.
Full threshold sweep, detailed methodology, saved artifacts, and the rag_tuned comparison: `tests/artifacts/README.md`. Improvement roadmap (dedup → context assembly → corpus hardening): `docs/evaluation-plan.md`.
## Temporal queries — built-in, not bolted on
Most memory systems answer "what do you know about X?" — but can't answer when you learned it or how knowledge changed over time.
waggle-mcp timestamps every node and understands temporal natural language. For example, it can:

- filter nodes to those updated in the last 24–48h
- retrieve the earliest version of relevant nodes
- return a diff of nodes created or updated in a given window
- produce an explicit changelog: added nodes, updated nodes, new conflicts
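A last-N-hours filter of the kind described above can be sketched with the standard library alone. The `updated_at` ISO-8601 field is an assumption of this sketch, not Waggle's actual schema:

```python
from datetime import datetime, timedelta, timezone

def updated_since(nodes, hours):
    """Nodes whose 'updated_at' falls within the last `hours` hours."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
    return [n for n in nodes
            if datetime.fromisoformat(n["updated_at"]) >= cutoff]

now = datetime.now(timezone.utc)
nodes = [
    {"label": "fresh", "updated_at": (now - timedelta(hours=2)).isoformat()},
    {"label": "stale", "updated_at": (now - timedelta(days=10)).isoformat()},
]
recent = updated_since(nodes, hours=48)
```

Storing timestamps in UTC and comparing timezone-aware datetimes avoids the classic off-by-one-day bugs that plague "what changed yesterday?" queries.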
## Testing
Beyond empirical benchmarks, waggle-mcp ships with a comprehensive pytest suite covering both memory logic and server protocols. This guarantees core behaviours — multi-tenant isolation, conflict detection, semantic deduplication, MCP protocol handling, and explicit LLM backend failure — remain stable across updates.
```
============================= test session starts ==============================
collected 43 items
tests/test_benchmark_harness.py::test_fixture_loading_is_auditable PASSED
tests/test_benchmark_harness.py::test_benchmark_report_includes_backend_labels_and_case_counts PASSED
tests/test_benchmark_harness.py::test_markdown_summary_includes_comparative_systems PASSED
tests/test_benchmark_harness.py::test_llm_benchmark_failure_is_explicit PASSED
tests/test_benchmark_harness.py::test_dedup_threshold_sweep_tracks_positive_and_negative_cases PASSED
tests/test_embeddings.py::test_embedding_bytes_round_trip PASSED
tests/test_embeddings.py::test_cosine_similarity_handles_orthogonal_vectors PASSED
tests/test_graph.py::test_add_query_and_related PASSED
tests/test_graph.py::test_update_delete_and_stats PASSED
tests/test_graph.py::test_exact_duplicate_nodes_are_reused_and_tags_are_merged PASSED
tests/test_graph.py::test_semantic_duplicate_nodes_reuse_existing_entry PASSED
tests/test_graph.py::test_entity_resolution_reuses_acronym_matches PASSED
tests/test_graph.py::test_query_ranking_uses_label_lexical_overlap PASSED
tests/test_graph.py::test_decompose_and_store_creates_nodes_and_edges PASSED
tests/test_graph.py::test_export_and_import_backup_round_trip PASSED
tests/test_graph.py::test_export_graph_html_creates_visualization_file PASSED
tests/test_graph.py::test_conflict_detection_creates_contradiction_edge PASSED
tests/test_graph.py::test_observe_conversation_extracts_nodes PASSED
tests/test_graph.py::test_query_supports_temporal_latest_and_oldest_bias PASSED
tests/test_graph.py::test_graph_diff_and_prime_context PASSED
tests/test_graph.py::test_get_topics_returns_clusters PASSED
tests/test_platform.py::test_api_key_hashing_round_trip PASSED
tests/test_platform.py::test_rate_limiter_enforces_request_and_concurrency_limits PASSED
tests/test_platform.py::test_tenant_scoping_isolated_within_same_sqlite_database PASSED
tests/test_platform.py::test_backup_round_trip_preserves_schema_and_tenant_metadata PASSED
tests/test_platform.py::test_http_app_health_auth_and_metrics PASSED
tests/test_platform.py::test_http_app_rate_limit_and_payload_limit PASSED
tests/test_server.py::test_store_node_and_stats_tool PASSED
tests/test_server.py::test_export_graph_html_tool PASSED
tests/test_server.py::test_decompose_and_store_tool_persists_subgraph PASSED
tests/test_server.py::test_export_and_import_backup_tools PASSED
tests/test_server.py::test_store_node_reports_deduplication PASSED
tests/test_server.py::test_store_node_reports_conflicts PASSED
tests/test_server.py::test_observe_conversation_tool PASSED
tests/test_server.py::test_graph_diff_prime_context_and_topics_tools PASSED
tests/test_server.py::test_recent_resource_serialization PASSED
tests/test_server.py::test_unknown_tool_raises PASSED
tests/test_server.py::test_invalid_tool_inputs_return_structured_errors PASSED
tests/test_server.py::test_tool_payload_limit_is_enforced PASSED
tests/test_server.py::test_default_graph_uses_sqlite_backend_by_default PASSED
tests/test_server.py::test_default_graph_can_build_neo4j_backend PASSED
tests/test_server.py::test_default_graph_requires_neo4j_connection_settings PASSED
tests/test_stdio_integration.py::test_server_stdio_initialize_and_basic_calls PASSED
============================== 43 passed in 4.92s ==============================
```

## Installation
```shell
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
waggle-mcp init   # ← writes your client config automatically
```

Key variables for local mode:

| Variable | What it does |
| --- | --- |
| `WAGGLE_BACKEND=sqlite` | Local file DB, zero setup |
| `WAGGLE_TRANSPORT=stdio` | Connects to desktop MCP clients |
| `WAGGLE_DB_PATH` | Where the graph is stored (default: …) |
```shell
pip install -e ".[dev,neo4j]"

WAGGLE_TRANSPORT=http \
WAGGLE_BACKEND=neo4j \
WAGGLE_DEFAULT_TENANT_ID=workspace-default \
WAGGLE_NEO4J_URI=bolt://localhost:7687 \
WAGGLE_NEO4J_USERNAME=neo4j \
WAGGLE_NEO4J_PASSWORD=change-me \
waggle-mcp
```

```shell
docker build -t waggle-mcp:latest .
docker run --rm -p 8080:8080 \
  -e WAGGLE_TRANSPORT=http \
  -e WAGGLE_BACKEND=neo4j \
  -e WAGGLE_DEFAULT_TENANT_ID=workspace-default \
  -e WAGGLE_NEO4J_URI=bolt://host.docker.internal:7687 \
  -e WAGGLE_NEO4J_USERNAME=neo4j \
  -e WAGGLE_NEO4J_PASSWORD=change-me \
  waggle-mcp:latest
```

**Claude Desktop** — `claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "waggle": {
      "command": "/path/to/.venv/bin/python",
      "args": ["-m", "waggle.server"],
      "env": {
        "PYTHONPATH": "/path/to/waggle-mcp/src",
        "WAGGLE_TRANSPORT": "stdio",
        "WAGGLE_BACKEND": "sqlite",
        "WAGGLE_DB_PATH": "~/.waggle/memory.db",
        "WAGGLE_DEFAULT_TENANT_ID": "local-default",
        "WAGGLE_MODEL": "all-MiniLM-L6-v2"
      }
    }
  }
}
```

**Codex** — `codex_config.toml`:
```toml
[mcp_servers.waggle]
command = "/path/to/.venv/bin/python"
args = ["-m", "waggle.server"]
cwd = "/path/to/waggle-mcp"

# TOML inline tables must fit on one line, so env gets its own table
[mcp_servers.waggle.env]
PYTHONPATH = "/path/to/waggle-mcp/src"
WAGGLE_TRANSPORT = "stdio"
WAGGLE_BACKEND = "sqlite"
WAGGLE_DB_PATH = "~/.waggle/memory.db"
WAGGLE_DEFAULT_TENANT_ID = "local-default"
WAGGLE_MODEL = "all-MiniLM-L6-v2"
```

A pre-filled example is in `codex_config.example.toml`.
## Environment variables

### Core

| Variable | Default | Description |
| --- | --- | --- |
| `WAGGLE_TRANSPORT` | `stdio` | transport: `stdio` or `http` |
| `WAGGLE_BACKEND` | `sqlite` | storage backend: `sqlite` or `neo4j` |
| `WAGGLE_MODEL` | `all-MiniLM-L6-v2` | sentence-transformers model (local inference) |
| `WAGGLE_DEFAULT_TENANT_ID` | `local-default` | default tenant |
| … | — | optional export directory |
### SQLite

| Variable | Default | Description |
| --- | --- | --- |
| `WAGGLE_DB_PATH` | … | path to the SQLite file |
### HTTP service

HTTP-mode settings cover:

- bind host and bind port
- log level
- global rate limit (req/min)
- write-tool rate limit
- concurrency cap
- max request size
- per-request timeout
### Neo4j

| Variable | Description |
| --- | --- |
| `WAGGLE_NEO4J_URI` | Bolt URI, e.g. `bolt://localhost:7687` |
| `WAGGLE_NEO4J_USERNAME` | Neo4j username |
| `WAGGLE_NEO4J_PASSWORD` | Neo4j password |
| … | Neo4j database name |
### LLM extraction

Extraction settings cover: the Ollama model name, the minimum confidence (float 0–1; facts below it are dropped), the base URL for the local Ollama instance, and the timeout in seconds for Ollama requests (used by both the extractor and the benchmark harness).
```shell
# Create a tenant
waggle-mcp create-tenant --tenant-id workspace-a --name "Workspace A"

# Issue an API key (raw key returned once — store it securely)
waggle-mcp create-api-key --tenant-id workspace-a --name "ci-agent"

# List keys for a tenant
waggle-mcp list-api-keys --tenant-id workspace-a

# Revoke a key
waggle-mcp revoke-api-key --api-key-id <id>

# Migrate SQLite data → Neo4j
WAGGLE_BACKEND=neo4j WAGGLE_NEO4J_URI=bolt://localhost:7687 \
WAGGLE_NEO4J_USERNAME=neo4j WAGGLE_NEO4J_PASSWORD=change-me \
waggle-mcp migrate-sqlite --db-path ./memory.db --tenant-id workspace-a
```

Full production deployment assets are in `deploy/`:
| Path | What's inside |
| --- | --- |
| `deploy/kubernetes/` | Deployment, Service, Ingress (TLS), NetworkPolicy, HPA, PDB, cert-manager, ExternalSecrets — see the guide in the directory |
| `deploy/observability/` | Prometheus scrape config, Grafana dashboard, one-command Docker Compose observability stack |
Operational runbooks are in `docs/runbooks/`:

- **API key rotation** — zero-downtime create-then-revoke
- **Incident response** — Neo4j down, OOM, rate storm, auth failures
- **Backup & restore** — manual and automated drill
- **Tenant onboarding** — new tenant checklist
- **Secret management** — External Secrets + cert-manager
```
waggle-mcp
├── Core domain   graph CRUD · dedup · local embeddings · conflict detection · export/import
├── Transport     stdio MCP (Codex/Desktop) · streamable HTTP MCP (Kubernetes)
└── Platform      config · auth · tenant isolation · rate limiting · logging · metrics
```

Backend:

- Local/dev → SQLite (zero config, instant start)
- Production → Neo4j (`WAGGLE_TRANSPORT=http` requires `WAGGLE_BACKEND=neo4j`)
```
waggle-mcp/
├── assets/                 ← banner + demo SVG
├── benchmarks/fixtures/    ← checked-in eval datasets
├── deploy/
│   ├── kubernetes/         ← full K8s manifests + guide
│   └── observability/      ← Prometheus + Grafana stack
├── docs/runbooks/          ← operational runbooks
├── scripts/
│   ├── benchmark_extraction.py
│   ├── load_test.py / .sh
│   └── backup_restore_drill.py / .sh
├── src/waggle/             ← server, graph, neo4j_graph, auth, config …
├── tests/artifacts/        ← saved benchmark runs
├── Dockerfile
├── pyproject.toml
└── README.md
```

## Running tests

```shell
.venv/bin/pytest -q
```

Coverage: graph CRUD, deduplication, conflict detection, tenant isolation, backup/import, stdio MCP, HTTP auth/health/metrics, payload limits.
## License
MIT — see LICENSE.