# Effectively Continuous Context for AI in Development Environments

**Authors:** D.D. & Gustavo Porto
**Affiliation:** Independent Research / Open Collaboration
**Date:** October 2025
**Categories:** cs.AI, cs.SE, cs.HC

## Abstract

We present an architecture for persistent memory and hierarchical compression applied to development assistants in IDEs. By separating reasoning (the LLM) from long-term persistent memory (a Vector DB plus hierarchical compression), the system maintains knowledge indefinitely, circumventing context window limitations. We propose two reproducible implementations: (A) 100% local and free (embeddings + ChromaDB), and (B) low-cost hybrid (OpenAI Embeddings + Chroma/Pinecone). We integrate via the Model Context Protocol (MCP), describe chunking pipelines, hierarchical retrieval, condensation, and an autonomous mode for auto-code/auto-debug. We provide API specifications, code examples, and an evaluation protocol.

**Keywords:** LLM, RAG, persistent memory, MCP, development assistance, continuous context.

## 1. Introduction

LLMs enhance development productivity but suffer from context limits. In extensive projects, decisions and patterns are lost between sessions. Our proposal adds a pluggable persistent memory layer that enables semantic retrieval of conversations, code, commits, and docs while keeping the LLM token budget lean through hierarchical compression. The focus is on turning this system into a development agent, with autonomous mode as the main innovation.

## 2. Fundamentals and Notation

- **LLM:** language model used for responses (e.g., GPT/Claude/local).
- **Embedding:** d-dimensional vector representing text/code.
- **Vector DB:** ANN index (FAISS/Chroma/Qdrant) storing `{doc, emb, meta}`.
- **RAG:** Retrieval-Augmented Generation (retrieve → generate).
- **MCP:** protocol to expose tools to agents/IDEs (open standard).
- **Working set:** the context subset that fits in the prompt.
- **Σ:** set of chunks; **E(·):** embedding function; **sim(·):** similarity (see the sketch after this list).
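To make the notation concrete, the sketch below embeds a toy set Σ of chunks with sentence-transformers (the same library the Memory Server uses) and ranks them against a query by cosine similarity. The model name and the example chunks are illustrative assumptions, not part of the paper.

```python
# notation_sketch.py - minimal illustration of E(.) and sim(.) from Section 2.
# The model name and sample chunks are assumptions chosen for the example.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # E(.): text -> d-dimensional vector

# Sigma: a toy set of chunks (conversation turns, code, commits, docs)
chunks = [
    "Decision: use ChromaDB as the local vector store.",
    "def retrieve(query, top_k): ...  # memory client helper",
    "Commit: fix chunking of long markdown files",
]

chunk_embs = model.encode(chunks, convert_to_tensor=True)   # E(chunk) for each chunk
query_emb = model.encode("which vector database did we pick?",
                         convert_to_tensor=True)            # E(query)

# sim(.): cosine similarity; the top-k chunks form the working set
scores = util.cos_sim(query_emb, chunk_embs)[0]
for score, chunk in sorted(zip(scores.tolist(), chunks), reverse=True):
    print(f"{score:.3f}  {chunk}")
```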
## 3. Problem & Contributions

**Problem.** IDE assistants lose history and require re-explanation.

**Contributions:**

1. A persistent memory layer with hierarchical compression and a working-set policy, reproducible end to end.
2. Two implementations: (A) local and free, (B) low-cost hybrid.
3. Standardized MCP integration with `search_memory` and `store_memory` tools.
4. Autonomous mode: an orchestrator for auto-code/auto-debug with an observe → plan → act → reflect cycle.
5. An evaluation protocol (metrics, ablations, cost).

## 4. Related Work

Built-in assistant memories offer persistent context, but not open RAG with hierarchical compression or a Vector DB. MCP projects built on Qdrant or "memory bank" servers show basic storage and search. LangChain and LlamaIndex address persistent memories and RAG, but without hierarchical compression. The OpenAI MCP integration allows exposing tools to models, but does not define memory persistence. ReAct and Reflexion are conceptual frameworks for autonomous observation and reflection; here they are applied in an IDE environment. Our difference: a dual pipeline (local/cloud), hierarchical compression, documented integration, and an executable appendix. The main innovation is the practical integration of hierarchical compression for context management, persistent memory via MCP, and an autonomous mode with telemetry.

## 5. System Architecture

```
IDE Chat ──► MCP Adapter ──► Memory Server ──► Vector DB
   ▲              ▲                                │
   │              └─────────── tools ◄─────────────┘
   │
   └──── generated response ◄──── LLM ◄──── context
                                   │
                                   └──► logs (OBSERVE, PLAN, ACT, REFLECT)
```

### Main Components

1. **Memory Server** (a minimal sketch follows this list)
   - Implemented in Python.
   - Stores and searches embeddings using ChromaDB and sentence-transformers.
   - Exposes HTTP endpoints: `/store`, `/retrieve`.
2. **MCP Adapter (Python or Node.js)**
   - Acts as a bridge between the IDE and the Memory Server.
   - Exposes `search_memory(...)` and `store_memory(...)` tools accessible via MCP.
   - Enables autonomous and persistent operation.
3. **Hierarchical Compression**
   - Levels: N0 (raw chunks), N1 (micro-summaries), N2 (meta-summaries).
   - Temporal decay and reranking policies keep the working set lean.
4. **Autonomous Mode**
   - Orchestrator with an Observe → Plan → Act → Reflect cycle.
   - Modes: Auto-Debug, Auto-Document, Auto-Plan.
5. **Telemetry and Security**
   - Logs OBSERVE, PLAN, ACT, REFLECT entries with IDs and hashes for auditing and monitoring.
   - Policies: `max_auto_edits`, rate limiting, confirmation before commits.
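The following is a minimal sketch of the Memory Server component, assuming FastAPI, ChromaDB, and sentence-transformers as installed in Appendix A. The request and response shapes follow the API specification in Appendix C; the collection name, embedding model, and ID scheme are illustrative assumptions, and the N1/N2 condensation layers are omitted.

```python
# memory_server.py - minimal sketch of the Memory Server (/store, /retrieve).
# Assumptions: FastAPI + ChromaDB + sentence-transformers (Appendix A);
# the collection name, embedding model, and ID scheme are illustrative only.
import uuid

import chromadb
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()
client = chromadb.PersistentClient(path="./memory_db")
collection = client.get_or_create_collection("continuo_memory")
model = SentenceTransformer("all-MiniLM-L6-v2")

class StoreRequest(BaseModel):
    text: str
    metadata: dict = {}

class RetrieveRequest(BaseModel):
    query: str
    top_k: int = 5

@app.post("/store")
def store(req: StoreRequest):
    # N0 level: persist the raw chunk together with its embedding and metadata
    emb = model.encode(req.text).tolist()
    kwargs = {"ids": [str(uuid.uuid4())], "documents": [req.text], "embeddings": [emb]}
    if req.metadata:
        kwargs["metadatas"] = [req.metadata]
    collection.add(**kwargs)
    return {"ok": True}

@app.post("/retrieve")
def retrieve(req: RetrieveRequest):
    # Semantic search: embed the query and return the top_k nearest chunks
    emb = model.encode(req.query).tolist()
    res = collection.query(query_embeddings=[emb], n_results=req.top_k)
    return {"documents": res["documents"][0], "metadatas": res["metadatas"][0]}
```

It can be run as in Appendix B, Step 1: `uvicorn memory_server:app --host 0.0.0.0 --port 8000`.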
## 6. Practical Impact

The proposed architecture has real applications, including:

- **Pair programming with memory:** continuous collaborative support.
- **Continuous debugging:** error identification and correction with history.
- **Living documentation:** automatic README and changelog updates.

MCP integration enables extension to VS Code, Zed, and JetBrains IDEs, expanding its reach.

## 7. Autonomous Mode

Autonomous mode is an optional automation module, designed to operate with explicit user authorization and under a configurable security policy.

### 7.1 Autonomous Mode Architecture

```
Developer / IDE ↔ LLM
        ↕ (via MCP tools)
┌─────────────────────────────────┐
│ Orchestrator (Auto Mode)        │
│ • Loop Observe→Plan→Act→Reflect │
├─────────────────────────────────┤
│ 1) Memory (Chroma/Vector DB)    │───► logs (OBSERVE, PLAN, ACT, REFLECT)
│ 2) Actions (MCP tools)          │───► logs (ACTION_RESULTS)
│ 3) Reasoning (LLM)              │───► logs (REASONING_STEPS)
└─────────────────────────────────┘
        ↕
Git / Build / Test / Docs
```

#### Auto Mode Cycle

```
Observe → Plan → Act → Reflect
(repetition with monitoring)
```

### 7.2 Behavior

1. **Observe:** collect events and store them (`store_memory`).
2. **Plan:** query memory (`search_memory`) and generate a plan.
3. **Act:** execute MCP tools.
4. **Reflect:** summarize results and record what was learned.

### 7.3 Implementation Example Snippet

```python
# auto_mode.py - simplified loop.
# Assumes two project-specific helpers that are not shown here:
#   get_latest_project_event() - polls the IDE/build system for new events
#   call_mcp_tool(name, params) - invokes an MCP tool and returns its result
import time

import requests

from logging_utils import log_event  # see Appendix I for an implementation

MEM_URL = "http://127.0.0.1:8000"

def store(text, meta=None):
    requests.post(f"{MEM_URL}/store", json={"text": text, "metadata": meta or {}})

def retrieve(query, k=5):
    r = requests.post(f"{MEM_URL}/retrieve", json={"query": query, "top_k": k})
    return "\n".join(r.json().get("documents", []))

while True:
    event = get_latest_project_event()
    if not event:
        time.sleep(10)
        continue

    # OBSERVE: record the event in the log and in persistent memory
    log_event("OBSERVE", f"Event detected: {event['type']} - {event['description']}",
              {"file_path": "build.log"})
    store(f"EVENT: {event['type']} - {event['description']}")

    # PLAN: retrieve context and ask for a JSON plan via an MCP tool
    context = retrieve(f"What is the next step for: {event['description']}?")
    prompt = "Retrieved context:\n" + context + "\nGenerate a JSON PLAN..."
    plan = call_mcp_tool("generate_plan", {"prompt": prompt})
    log_event("PLAN", "Plan generated", {"plan_structure": plan})

    # ACT + REFLECT: execute each task and persist the outcome
    for task in plan.get("tasks", []):
        run_result = call_mcp_tool(task["action"], task.get("params", {}))
        log_event("ACT", f"Action executed: {task['action']}",
                  {"execution_outcome": run_result[:500]})
        store(f"REFLECT: {task['action']} => {run_result[:500]}")

    time.sleep(30)
```

## 8. Experimental Evaluation

### 8.1 Metrics

- Recall@K (target appears in the Top-K results).
- Latency P50 / P95 (ms).
- Storage cost (tokens/time).
- Context reuse (number of memory items cited per session).
- Context cost: ratio of relevant tokens to total tokens sent to the model after compression.

## 9. Ethical Considerations and Authorship

The human author reviewed all content and assumes full responsibility. The system was designed under the principle of "monitored autonomy", ensuring that no critical action is executed without explicit consent. All interactions and edits are auditable.

## 10. Conclusion

The project presents a solution for maintaining effectively continuous context for IDE-integrated assistants, combining persistent memory, hierarchical compression, and autonomous mode, with a focus on security and traceability.

## 11. References

1. ChromaDB Documentation. Available at: https://docs.trychroma.com/
2. Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. *arXiv preprint arXiv:1908.10084*. DOI: 10.48550/arXiv.1908.10084
3. Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. *arXiv preprint arXiv:2005.11401*. DOI: 10.48550/arXiv.2005.11401

## Appendices (A–I)

### A: Local Step-by-Step

- **Prerequisites:** Install Python 3.9+, pip, and an IDE such as VS Code.
- **Step 1:** Create a virtual environment: `python -m venv venv && source venv/bin/activate`.
- **Step 2:** Install dependencies: `pip install chromadb sentence-transformers uvicorn fastapi`.
- **Step 3:** Clone the MCP SDK: `git clone https://github.com/modelcontextprotocol/python-sdk` and install it with `pip install -e python-sdk`.
- **Step 4:** Save the `mcp_memory.py` code (section 7.2) and run: `python mcp_memory.py`.
- **Step 5:** Configure the IDE with `.qoder/mcp.json` (section 8) and test with `@continuo-memory search: example`.
- **Verification:** Confirm logs in the terminal and responses in the IDE.

### B: Modular Step-by-Step

- **Prerequisites:** Node.js 16+ and Python 3.9+ installed.
- **Step 1:** Configure the Memory Server (Python) as in the local step-by-step, but use `memory_server.py` with an HTTP endpoint (e.g., `uvicorn memory_server:app --host 0.0.0.0 --port 8000`).
- **Step 2:** Create `mcp_adapter.js` (simplified example):

```javascript
// mcp_adapter.js - launches the Python Memory Server and pipes stdio through it
const { spawn } = require('child_process');
const server = spawn('python', ['memory_server.py']);
server.stdout.on('data', (data) => console.log(`Server: ${data}`));
process.stdin.pipe(server.stdin);
```

- **Step 3:** Install Node dependencies: `npm init -y && npm install`.
- **Step 4:** Integrate with `.qoder/mcp.json` and start with `node mcp_adapter.js` (a Python alternative is sketched after this list).
- **Step 5:** Test the HTTP calls (e.g., `curl -X POST http://localhost:8000/store -H 'Content-Type: application/json' -d '{"text": "test"}'`).
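Appendices A and B launch an MCP entry point (`mcp_memory.py` / `mcp_adapter.js`) whose full code is not reproduced in this document. Below is a minimal Python sketch of such an adapter, assuming the `FastMCP` helper from the `modelcontextprotocol/python-sdk` cloned in Appendix A, Step 3; the tool names mirror Section 5, while the docstrings, return formats, and error handling are illustrative assumptions.

```python
# mcp_adapter.py - minimal sketch of an MCP adapter exposing the two tools
# from Section 5 (store_memory, search_memory) on top of the Memory Server.
# Assumes the FastMCP helper from the MCP Python SDK (Appendix A, Step 3);
# docstrings, return formats, and error handling are illustrative only.
from typing import Optional

import requests
from mcp.server.fastmcp import FastMCP

MEM_URL = "http://127.0.0.1:8000"
mcp = FastMCP("continuo-memory")

@mcp.tool()
def store_memory(text: str, metadata: Optional[dict] = None) -> str:
    """Persist a chunk of text (decision, code, log, doc) in long-term memory."""
    r = requests.post(f"{MEM_URL}/store",
                      json={"text": text, "metadata": metadata or {}})
    return "stored" if r.ok else f"error: HTTP {r.status_code}"

@mcp.tool()
def search_memory(query: str, top_k: int = 5) -> str:
    """Semantically search long-term memory and return the top_k chunks."""
    r = requests.post(f"{MEM_URL}/retrieve",
                      json={"query": query, "top_k": top_k})
    return "\n".join(r.json().get("documents", []))

if __name__ == "__main__":
    mcp.run()  # stdio transport by default, so the IDE can launch it via MCP config
```

Pointing the `.qoder/mcp.json` entry at this script instead of `mcp_adapter.js` would give an all-Python variant of the modular setup.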
### C: API Specification

- **Endpoint `/store`**
  - **Method:** POST
  - **Body:** `{"text": str, "metadata": dict}`
  - **Response:** `{"ok": bool}`
  - **Example:** `curl -X POST http://localhost:8000/store -H 'Content-Type: application/json' -d '{"text": "test function", "metadata": {"file": "test.py"}}'`
- **Endpoint `/retrieve`**
  - **Method:** POST
  - **Body:** `{"query": str, "top_k": int}`
  - **Response:** `{"documents": list, "metadatas": list}`
  - **Example:** `curl -X POST http://localhost:8000/retrieve -H 'Content-Type: application/json' -d '{"query": "test", "top_k": 3}'`

### D: Prompt Template

- **Base structure** (a small builder sketch follows this appendix):

```
Retrieved context:
{meta-summary}
{quoted excerpts}
{pointers (doc_id:line)}

Instruction: {specific task}
```

- **Example:**

```
Retrieved context:
Micro-summary: Test function implemented in test.py.
Excerpt: def test(): print("ok")
Pointers: test.py:1

Instruction: Generate a plan to add logging to the test function.
```
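As a small illustration of how the template and the API fit together, the sketch below fills the base structure from a `/retrieve` response (Appendix C). The way the meta-summary, excerpts, and pointers are derived from the returned documents and metadata is an assumption made for the example; in the full system the meta-summary would come from the N1/N2 condensation pipeline.

```python
# prompt_builder.py - sketch of filling the Appendix D template from a
# /retrieve response (Appendix C). How the meta-summary and pointers are
# derived below is an assumption for illustration, not part of the spec.
import requests

MEM_URL = "http://127.0.0.1:8000"

def build_prompt(task: str, query: str, top_k: int = 3) -> str:
    r = requests.post(f"{MEM_URL}/retrieve",
                      json={"query": query, "top_k": top_k})
    body = r.json()
    docs = body.get("documents", [])
    metas = body.get("metadatas", [])

    # In the full system an N1/N2 summary would be used; here the first
    # retrieved chunk stands in for the meta-summary.
    meta_summary = docs[0] if docs else "(no context found)"
    excerpts = "\n".join(docs[1:])
    pointers = ", ".join(f"{m.get('file', '?')}:{m.get('line', '?')}"
                         for m in metas if m)

    return (f"Retrieved context:\n{meta_summary}\n{excerpts}\n"
            f"Pointers: {pointers}\n\n"
            f"Instruction: {task}")

if __name__ == "__main__":
    print(build_prompt("Generate a plan to add logging to the test function.",
                       "test function"))
```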
### E: Evaluation Script

```python
# evaluate.py - toy recall/latency check against the Memory Server helpers.
import time

from memory_server import store, retrieve

def evaluate_recall():
    test_docs = ["function 1", "function 2", "function 3"]
    for doc in test_docs:
        store(doc, {"file": "test.py"})

    query = "function"
    start = time.time()
    results = retrieve(query, top_k=3)
    latency = time.time() - start

    # Toy recall: fraction of stored docs that come back containing the query term
    recall = len([r for r in results["documents"] if query in r]) / len(test_docs)
    print(f"Recall@3: {recall}, Latency: {latency}s")

if __name__ == "__main__":
    evaluate_recall()
```

### F: Troubleshooting

- **Error: "Connection refused"**
  - Verify the Memory Server is running (`ps aux | grep python`).
  - Confirm the address/port in `mcp_adapter.js`.
- **No response in the IDE**
  - Validate `.qoder/mcp.json` and restart the IDE.
  - Check the server logs for errors.
- **Slow performance**
  - Reduce `top_k` or optimize embeddings with smaller models.

### H: Glossary

- **Embedding:** vector representation of text/code for semantic analysis.
- **RAG:** technique combining document retrieval with text generation.
- **MCP:** protocol for tool integration in AI models.
- **Chunking:** dividing text into logical units for processing.
- **Working set:** context subset loaded into the LLM prompt.
- **Autonomous mode:** automation mode with an Observe-Plan-Act-Reflect cycle.

### I: Log Management

- **Suggested format (JSON):**

```json
{
  "event_timestamp": "2025-10-10T15:20:00-03:00",
  "log_severity": "INFO",
  "action_phase": "OBSERVE",
  "unique_event_id": "123e4567-e89b-12d3-a456-426614174000",
  "context_details": "Event detected: new build error",
  "source_context": {"file_path": "test.py", "line_number": 10}
}
```

- **Python implementation:**

```python
# logging_utils.py - structured log_event used by auto_mode.py (Section 7.3).
import json
import logging
import uuid
from datetime import datetime, timedelta, timezone

TZ = timezone(timedelta(hours=-3))  # UTC-3, matching the example timestamp above

logging.basicConfig(filename='auto_mode.log', level=logging.INFO)

def log_event(action_phase, context_details, source_context=None):
    log_entry = {
        "event_timestamp": datetime.now(TZ).isoformat(),
        "log_severity": "INFO",
        "action_phase": action_phase,
        "unique_event_id": str(uuid.uuid4()),
        "context_details": context_details,
        "source_context": source_context or {}
    }
    logging.info(json.dumps(log_entry))
    store(json.dumps(log_entry))  # store() is the Memory Server client helper from Section 7.3

# Example usage in autonomous mode
event = {"type": "error", "description": "build failed"}
log_event("OBSERVE", f"Event detected: {event['type']} - {event['description']}",
          {"file_path": "build.log"})
```

- **Purpose:**
  - **Primary:** record for auditing, security, and monitoring.
  - **Secondary:** metrics analysis and support for continuous learning.
  - **Optional:** logs in persistent memory can be queried by the LLM via `search_memory` (e.g., to look up previous failures), but this requires filtering to avoid noise in the context. We consider this future work (a sketch follows below).
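A brief sketch of that future-work idea, querying the underlying ChromaDB collection with a metadata filter so that only log entries from one phase reach the context. It assumes the log's `action_phase` field is copied into the chunk metadata when stored, and that the collection and embedding model match the Memory Server sketch in Section 5; the current `/retrieve` endpoint does not expose such filters.

```python
# log_query_sketch.py - hypothetical future-work extension: retrieve stored log
# entries filtered by phase so they do not flood the working set.
# Assumes logs were stored with their "action_phase" copied into chunk metadata
# and that the collection/embedding model match the Memory Server sketch.
import chromadb
from sentence_transformers import SentenceTransformer

client = chromadb.PersistentClient(path="./memory_db")
collection = client.get_or_create_collection("continuo_memory")
model = SentenceTransformer("all-MiniLM-L6-v2")

query_emb = model.encode("previous build failures").tolist()
results = collection.query(
    query_embeddings=[query_emb],
    n_results=5,
    where={"action_phase": "OBSERVE"},  # keep only observation-phase log entries
)
for doc in results["documents"][0]:
    print(doc)
```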
