# Memory-MCP Server

**Version:** 3.2.0 | **Status:** Production-Ready | **License:** MIT
A state-of-the-art persistent memory system for AI agents using hybrid search (vector embeddings + BM25 FTS), neural reranking, and optional LLM-driven automated memory extraction.
## Table of Contents

- [Overview](#overview)
- [Key Features](#key-features)
- [Architecture](#architecture)
- [Quick Start](#quick-start)
- [Configuration](#configuration)
- [MCP Tools Reference](#mcp-tools-reference)
- [Hook System (Auto-Save)](#hook-system-auto-save)
- [Search Technology](#search-technology)
- [Memory Lifecycle](#memory-lifecycle)
- [Usage Examples](#usage-examples)
- [Testing](#testing)
- [Troubleshooting](#troubleshooting)
- [API Reference](#api-reference)
- [FAQ](#faq)
- [Contributing](#contributing)
- [License](#license)
- [Acknowledgments](#acknowledgments)
## Overview
Memory-MCP is a production-grade persistent memory system for AI coding agents (Claude Code, Cursor, Windsurf, custom agents, etc.) that stores and retrieves valuable insights across sessions. It combines semantic vector search with keyword matching (BM25) for optimal retrieval accuracy.
### What Problems Does This Solve?

- **Lost Knowledge:** Valuable insights from debugging sessions, configurations, and patterns are forgotten between sessions
- **Context Switching:** Hard to recall what worked in previous projects
- **Duplicate Effort:** Solving the same problems repeatedly
- **Scattered Notes:** Knowledge lives in different formats across different projects
### How It Works

1. **Intelligent Storage:** Stores insights with 1024-dimensional semantic embeddings
2. **Hybrid Retrieval:** Searches using both semantic similarity AND keyword matching
3. **Neural Reranking:** A CrossEncoder re-ranks results for maximum relevance
4. **Optional Auto-Save:** A hook analyzes agent actions and extracts memories automatically
## Key Features

| Feature | Description |
|---------|-------------|
| 7 MCP Tools | Full CRUD operations + stats + health monitoring |
| Hybrid Search | Vector (70%) + BM25 FTS (30%) with RRF fusion |
| Neural Reranking | CrossEncoder (mxbai-rerank-base-v2, BEIR SOTA) |
| Local Embeddings | Ollama support for privacy and speed |
| GPU Acceleration | Works with any CUDA-capable GPU |
| Duplicate Prevention | 90% similarity threshold prevents redundant saves |
| TTL Management | 365-day expiry with automatic cleanup |
| Fallback Chain | Ollama → Google → Hash (always available) |
| Project Scoping | Search across all projects or project-specific |
| Auto-Save Hook | Optional PostToolUse hook for automatic extraction |
## Architecture

At a high level, an MCP server process (`server.py`) fronts a local LanceDB store: saves pass through the embedding fallback chain (Ollama → Google → Hash) before being written, and recalls run the hybrid vector + BM25 pipeline with neural reranking described under Search Technology.
## Quick Start

### Installation

#### Prerequisites

- Python 3.11 or higher
- uv package manager (recommended)
- Ollama (optional, for local embeddings) OR a Google API key
#### Step 1: Clone and Setup
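A minimal sketch, assuming a uv-managed project; the repository URL is a placeholder for the actual repo location:

```bash
# Placeholder URL — substitute the real repository location
git clone https://github.com/your-org/memory-mcp.git
cd memory-mcp

# Install dependencies into a local virtual environment with uv
uv sync
```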
#### Step 2: Install Ollama (Recommended for Privacy)

With GPU (systemd service):
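A sketch of a typical Linux install; the official install script registers an `ollama` systemd service, and the model tag assumes `qwen3-embedding` is published under that name in the Ollama library:

```bash
# Install Ollama (Linux); the installer sets up a systemd service
curl -fsSL https://ollama.com/install.sh | sh

# Ensure the service is running (GPU is used automatically if CUDA is available)
sudo systemctl enable --now ollama

# Pull the embedding model used by Memory-MCP
ollama pull qwen3-embedding
```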
#### Step 3: Configure MCP Client

Add to your MCP configuration file:

- **Claude Code / Factory:** `~/.factory/mcp.json`
- **Cursor:** `~/.cursor/mcp.json`
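Both files share the standard `mcpServers` shape. A sketch, assuming the server lives at `/path/to/memory-mcp` and is launched with uv (adjust the path and command to your checkout):

```json
{
  "mcpServers": {
    "memory": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/memory-mcp", "python", "server.py"]
    }
  }
}
```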
## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| - | `~/.memory-mcp/lancedb-memory` | Database location |
| `EMBEDDING_PROVIDER` | - | Embedding provider (`ollama` or `google`) |
| - | `qwen3-embedding` | Embedding model name |
| - | `1024` | Embedding dimensions |
| - | - | Ollama API endpoint |
| `GOOGLE_API_KEY` | - | Google Gemini API key |
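For example, to use Google embeddings instead of Ollama (matching the FAQ below):

```bash
export EMBEDDING_PROVIDER=google
export GOOGLE_API_KEY=your-api-key
```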
### Server Configuration

These are set in the `Config` class in `server.py`:

| Setting | Default | Description |
|---------|---------|-------------|
| - | - | LLM for summarization |
| - | 365 days | Memory time-to-live |
| - | 0.90 | Duplicate similarity threshold |
| - | 0.3 | FTS weight in RRF fusion |
| - | 5 | Default results per query |
| - | 50 | Maximum results per query |
## MCP Tools Reference

### Overview

| Tool | Description | Read-Only |
|------|-------------|-----------|
| `memory_save` | Save a memory with semantic embedding | No |
| `memory_recall` | Search across ALL projects | Yes |
| `memory_recall_project` | Search in CURRENT project only | Yes |
| `memory_delete` | Delete a memory by ID | No |
| `memory_update` | Update an existing memory | No |
| `memory_stats` | Get statistics by category/project | Yes |
| `memory_health` | Get system health status | Yes |
### memory_save

Save a new memory with automatic embedding and duplicate detection.

Parameters:

- `content` (required): Memory content
- `category`: One of `PATTERN`, `CONFIG`, `DEBUG`, `PERF`, `PREF`, `INSIGHT`, `API`, `AGENT`
- `tags`: List of tags for categorization
- `summarize`: Use LLM to summarize verbose content
Example:
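A sketch of a typical call (values illustrative):

```python
memory_save(
    content="Fixed flaky pytest run by pinning the asyncio event loop fixture to session scope",
    category="DEBUG",
    tags=["pytest", "asyncio"],
)
```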
### memory_recall

Search across all projects using hybrid search.

Parameters:

- `query` (required): Search query (semantic + keywords)
- `category`: Optional category filter
- `limit`: Max results (default 5, max 50)
Example:
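For instance (query illustrative):

```python
memory_recall(query="pytest asyncio event loop fixture", limit=5)
```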
### memory_recall_project

Same as `memory_recall`, but scoped to the current project only.
### memory_delete

Delete a memory by full or partial ID.
Example:
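For instance (the `memory_id` parameter name and ID value are illustrative):

```python
# A partial ID prefix is matched against stored memories
memory_delete(memory_id="a1b2c3d4")
```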
### memory_update

Update content, category, or tags of an existing memory.
Example:
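For instance (parameter names and values illustrative):

```python
memory_update(memory_id="a1b2c3d4", tags=["pytest", "asyncio", "ci"])
```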
### memory_stats

Get memory statistics.
Example:
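A minimal call (the tool may accept optional filters; see `server.py`):

```python
memory_stats()
```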
### memory_health

Get system health and configuration status.
Example:
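```python
memory_health()
```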
## Hook System (Auto-Save)
The hook system enables automatic memory extraction from agent actions. This is optional but recommended for hands-free learning.
### How It Works

The `hooks/memory-extractor.py` hook:

1. Triggers after tool executions (Edit, Write, Bash, MultiEdit)
2. Analyzes the action using an LLM judge (Gemini Flash)
3. Extracts category, content, and tags if memory-worthy
4. Checks for duplicates (90% similarity threshold)
5. Saves automatically to the same LanceDB database
### Installation

For Factory/Droid users:

1. Update the shebang in `hooks/memory-extractor.py` to point to your venv:
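   For example, assuming the project checkout lives at `/path/to/memory-mcp`:

   ```python
   #!/path/to/memory-mcp/.venv/bin/python
   ```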
   **Important:** The hook requires dependencies (lancedb, numpy, etc.) from the project's virtual environment. Using `#!/usr/bin/env python3` will fail with `ModuleNotFoundError` unless those packages are installed system-wide.
2. Symlink or copy the hook:
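   For example (source path assumed; the target path matches the Troubleshooting section below):

   ```bash
   ln -s /path/to/memory-mcp/hooks/memory-extractor.py ~/.factory/hooks/memory-extractor.py
   ```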
3. Configure in `~/.factory/settings.json`:
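   A sketch following the Claude Code-style hooks schema (verify the exact shape against Factory's documentation); the matcher lists the tools named above:

   ```json
   {
     "hooks": {
       "PostToolUse": [
         {
           "matcher": "Edit|Write|Bash|MultiEdit",
           "hooks": [
             {
               "type": "command",
               "command": "~/.factory/hooks/memory-extractor.py"
             }
           ]
         }
       ]
     }
   }
   ```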
4. Make executable:
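   ```bash
   chmod +x ~/.factory/hooks/memory-extractor.py
   ```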
### LLM Judge Criteria

The judge saves memories ONLY when they match:

- Bug fix with non-obvious cause/solution
- New coding pattern or architecture insight
- Configuration that took effort
- Error resolution with reusable fix
- Performance optimization
- User preference explicitly stated

It SKIPS:

- Simple file reads/listings
- Trivial edits or formatting
- Status checks
- Actions without learning value
### Memory Categories

| Category | When to Use |
|----------|-------------|
| `PATTERN` | Coding patterns, architectures, design decisions |
| `CONFIG` | Tool configurations, environment settings |
| `DEBUG` | Error resolutions, debugging techniques |
| `PERF` | Performance optimizations |
| `PREF` | User preferences, coding style |
| `INSIGHT` | Cross-project learnings |
| `API` | LLM/external API usage patterns |
| `AGENT` | Agent design patterns, workflows |
### Hook Limits

| Setting | Value | Description |
|---------|-------|-------------|
| Rate Limit | 30s | Minimum time between extractions |
| Timeout | 30s | Max execution time |
| Context | 5 messages | Recent transcript context |
### Hook Logs

Monitor hook activity:
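```bash
tail -f ~/.factory/logs/memory-extractor.log
```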
## Search Technology

### Hybrid Search Pipeline

Memory-MCP uses true hybrid search, combining multiple retrieval methods:

### Components

**Vector Search (70% weight)**

- 1024-dimensional embeddings (qwen3-embedding or Gemini)
- Cosine similarity
- Captures semantic meaning

**BM25 FTS (30% weight)**

- Tantivy-based full-text search
- TF-IDF keyword matching
- Catches exact phrases and rare terms

**RRF Fusion**

- Reciprocal Rank Fusion combines the result lists
- Weighted scoring prevents either method from dominating (see the sketch after this list)
**Neural Reranking**

- CrossEncoder: `mixedbread-ai/mxbai-rerank-base-v2`
- BEIR benchmark SOTA (reinforcement-learning trained)
- Improves relevance by 10-15%
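A minimal sketch of weighted Reciprocal Rank Fusion with the 70/30 split above (the smoothing constant `k = 60` is the common convention and an assumption here; the real constants live in `server.py`):

```python
def rrf_fuse(vector_hits, fts_hits, vector_weight=0.7, fts_weight=0.3, k=60):
    """Fuse two ranked lists of memory IDs into one weighted RRF ranking."""
    scores: dict[str, float] = {}
    for rank, mem_id in enumerate(vector_hits):
        scores[mem_id] = scores.get(mem_id, 0.0) + vector_weight / (k + rank + 1)
    for rank, mem_id in enumerate(fts_hits):
        scores[mem_id] = scores.get(mem_id, 0.0) + fts_weight / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Example: "m2" ranks first because both retrievers agree on it
print(rrf_fuse(["m1", "m2", "m3"], ["m2", "m4"]))
```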
### Performance

| Operation | Time |
|-----------|------|
| Embedding (GPU) | ~10ms |
| Embedding (CPU) | ~30-50ms |
| Vector Search | 20-30ms |
| FTS Search | 2-5ms |
| RRF Fusion | <1ms |
| Neural Rerank | 20-50ms |
| Total Recall | 50-130ms |
## Memory Lifecycle

A memory is embedded and checked against existing entries at save time (90% similarity threshold), stored with a 365-day TTL, served through the hybrid search pipeline above, and eventually removed by automatic TTL cleanup.
## Usage Examples

### Basic Usage
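A sketch of a save-then-recall round trip (values illustrative):

```python
memory_save(
    content="uv sync resolves this monorepo ~10x faster than pip install",
    category="PERF",
    tags=["uv", "tooling"],
)

memory_recall(query="fast dependency install for the monorepo")
```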
### With Category Filtering
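For instance, restricting results to debugging memories:

```python
memory_recall(query="database connection error", category="DEBUG")
```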
### Project-Specific Search
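For instance, searching only the current project's memories:

```python
memory_recall_project(query="auth middleware pattern")
```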
### Using Summarization
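For instance, letting the LLM condense verbose content before storing it (variable illustrative):

```python
memory_save(
    content=long_debugging_transcript,  # verbose text to be condensed
    category="DEBUG",
    summarize=True,
)
```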
### Auto-Save Example

With the hook configured, after you fix a bug the LLM judge analyzes the action and, if it is memory-worthy, automatically extracts and saves a `DEBUG` memory with content and tags; no manual `memory_save()` call is needed.
## Testing

### Run All Tests
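```bash
uv run pytest -v
```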
### Test Database

Tests use an isolated database, separate from production:

| Database | Default Location |
|----------|------------------|
| Production | `~/.memory-mcp/lancedb-memory` |
| Tests | project folder |
The test database is automatically created and wiped before each test run. It's excluded from git via .gitignore.
### Test Suites

| Suite | Tests | Coverage |
|-------|-------|----------|
| TestMemorySave | 7 | Save validation, deduplication |
| TestMemoryRecall | 6 | Search, filtering, project scope |
| TestMemoryUpdate | 3 | Update operations |
| TestMemoryDelete | 3 | Delete by ID, partial match |
| TestMemoryStats | 1 | Statistics |
| TestEmbeddings | 2 | Generation, similarity |
| TestSummarization | 1 | LLM summarization |
| TestConcurrency | 3 | Thread safety |
| TestFullLifecycle | 1 | End-to-end CRUD |
| TestHookIntegration | 2 | Hook configuration |
| TestMCPConfig | 1 | Config validation |
## Troubleshooting

### "GOOGLE_API_KEY not found"
"Ollama connection refused"
"Duplicate detected" too often
Lower the threshold in server.py:
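The attribute name below is illustrative; use whatever the `Config` class actually calls it:

```python
# In server.py's Config: require near-identical similarity before a save
# is rejected as a duplicate (default 0.90).
DUPLICATE_THRESHOLD = 0.95
```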
### Hook not triggering

1. Check permissions: `chmod +x ~/.factory/hooks/memory-extractor.py`
2. Check logs: `tail -f ~/.factory/logs/memory-extractor.log`
3. Verify the `settings.json` configuration
### Embedding dimension mismatch

Reset the database (will lose existing memories):
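Assuming the default database location:

```bash
rm -rf ~/.memory-mcp/lancedb-memory
```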
### Health Check

Always start troubleshooting with:
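```python
memory_health()
```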
## API Reference

### Memory Schema
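An illustrative sketch inferred from the features described above; field names are assumptions, and the authoritative schema is defined in `server.py`:

```python
from dataclasses import dataclass, field

@dataclass
class Memory:  # illustrative; the real schema is a LanceDB table
    id: str                    # unique ID (partial-prefix matchable for delete/update)
    content: str               # memory text
    category: str              # PATTERN/CONFIG/DEBUG/PERF/PREF/INSIGHT/API/AGENT
    tags: list[str] = field(default_factory=list)
    project: str = ""          # scope used by memory_recall_project
    vector: list[float] = field(default_factory=list)  # 1024-dim embedding
    created_at: str = ""       # ISO timestamp
    expires_at: str = ""       # created_at + 365-day TTL
```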
### Config Schema
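A sketch mirroring the defaults under Server Configuration; attribute names (and the summarization model tag) are assumptions, with the real values in `server.py`'s `Config` class:

```python
from dataclasses import dataclass

@dataclass
class Config:  # attribute names illustrative
    summarization_model: str = "gemini-flash"  # LLM for summarization (assumed tag)
    ttl_days: int = 365                # memory time-to-live
    duplicate_threshold: float = 0.90  # similarity above this blocks a save
    fts_weight: float = 0.3            # BM25 weight in RRF fusion (vector = 0.7)
    default_limit: int = 5             # default results per query
    max_limit: int = 50                # maximum results per query
```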
## FAQ

**Q: Can I use this without Ollama?**
A: Yes. Set `EMBEDDING_PROVIDER=google` and provide `GOOGLE_API_KEY`.

**Q: Is GPU required?**
A: No, but it is recommended. CPU embeddings are ~3x slower.

**Q: What happens if all embedding providers fail?**
A: A hash-based fallback ensures saves always work (with reduced semantic quality).

**Q: How do I back up my memories?**
A: Copy the `~/.memory-mcp/lancedb-memory/` directory.

**Q: Can multiple agents share the same database?**
A: Yes, LanceDB supports concurrent access.

**Q: Is the hook required?**
A: No, it's optional. You can call `memory_save()` manually.
## Contributing

1. Fork the repository
2. Run tests: `uv run pytest -v`
3. Format code: `uv run ruff format .`
4. Lint: `uv run ruff check .`
5. Submit a PR
## License
MIT License - See LICENSE file.
## Acknowledgments

- LanceDB - Vector database
- Tantivy - Full-text search
- Sentence-Transformers - CrossEncoder reranking
- Ollama - Local embeddings
- Google Gemini - LLM judge & embeddings
- MCP - Model Context Protocol
## Version History

- v3.2.0 - SOTA reranker (mxbai-rerank-base-v2), path updates
- v3.1.0 - Tantivy FTS, embedding cache, TTL cleanup
- v3.0.0 - Ollama integration, 1024-dim embeddings
- v2.0.0 - Hook system, LLM judge
- v1.0.0 - Initial release