M.I.M.I.R - Multi-agent Intelligent Memory & Insight Repository

by orneryd
EMBEDDING_INDEX_FIX.md (7.32 kB)

# Neo4j Vector Index & Embedding Issue - Fix Guide

## Problem Summary

The Neo4j vector index `node_embedding_index` was configured to only index nodes with the `:Node` label, but many nodes in the database were created with only type-specific labels (`:memory`, `:preamble`, `:todo`, `:todoList`, `:FileChunk`) without the `:Node` label.

This caused:

1. **Vector search not finding most nodes** - Only 41 out of 3,573 nodes were searchable
2. **Missing embeddings on FileChunks** - 3,069 FileChunk nodes had no embeddings at all
3. **Inconsistent data model** - Old nodes used type as label, new code creates `:Node` label properly

## Root Cause

**Historical Data Issue**: Older code created nodes with only type-specific labels:

- `CREATE (n:memory {...})`
- `CREATE (n:preamble {...})`
- `CREATE (n:todo {...})`

**Current Code (Correct)**: Modern code creates all nodes with `:Node` label:

- `CREATE (n:Node {...})` in `GraphManager.ts` line 296
- `MERGE (f:File:Node {...})` in `FileIndexer.ts` line 121
- `CREATE (c:FileChunk:Node)` in `FileIndexer.ts` line 170

**Vector Index Limitation**: The index only covers `:Node` labeled nodes:

```cypher
// Current index configuration
CREATE VECTOR INDEX node_embedding_index
FOR (n:Node) ON (n.embedding)
OPTIONS {indexConfig: {`vector.dimensions`: 1024, `vector.similarity_function`: "COSINE"}}
```

## Diagnosis Commands

Check your database status with these commands:

```bash
# Find Neo4j container name
docker ps --format "{{.Names}}" | grep neo4j

# Count total nodes with Node label
echo "MATCH (n:Node) RETURN count(n) as total;" | docker exec -i <container_name> cypher-shell -u neo4j -p password

# Count nodes with embeddings
echo "MATCH (n) WHERE n.embedding IS NOT NULL RETURN count(n) as withEmbedding;" | docker exec -i <container_name> cypher-shell -u neo4j -p password

# Check all labels in database
echo "CALL db.labels() YIELD label RETURN label ORDER BY label;" | docker exec -i <container_name> cypher-shell -u neo4j -p password

# Check vector index configuration
echo "SHOW INDEXES YIELD name, labelsOrTypes, properties, type WHERE type = 'VECTOR' RETURN name, labelsOrTypes, properties;" | docker exec -i <container_name> cypher-shell -u neo4j -p password

# Check distribution of nodes by label
echo "MATCH (n) WHERE n.embedding IS NOT NULL RETURN DISTINCT labels(n) as labels, count(*) as count ORDER BY count DESC;" | docker exec -i <container_name> cypher-shell -u neo4j -p password
```

## Fix: Migrate Old Data

Run these Cypher commands to add `:Node` label to all nodes (replace `<container_name>` with your Neo4j container):

```bash
# 1. Add Node label to all memory nodes
echo "MATCH (n:memory) WHERE NOT n:Node SET n:Node RETURN count(n) as updated;" | docker exec -i <container_name> cypher-shell -u neo4j -p password

# 2. Add Node label to all preamble nodes
echo "MATCH (n:preamble) WHERE NOT n:Node SET n:Node RETURN count(n) as updated;" | docker exec -i <container_name> cypher-shell -u neo4j -p password

# 3. Add Node label to all todo/todoList nodes
echo "MATCH (n) WHERE (n:todo OR n:todoList) AND NOT n:Node SET n:Node RETURN count(n) as updated;" | docker exec -i <container_name> cypher-shell -u neo4j -p password

# 4. Add Node label to all FileChunk nodes (if needed)
echo "MATCH (n:FileChunk) WHERE NOT n:Node SET n:Node RETURN count(n) as updated;" | docker exec -i <container_name> cypher-shell -u neo4j -p password

# 5. Add Node label to all File nodes (if needed)
echo "MATCH (n:File) WHERE NOT n:Node SET n:Node RETURN count(n) as updated;" | docker exec -i <container_name> cypher-shell -u neo4j -p password
```

## Verification

After migration, verify the fix:

```bash
# Check total nodes with Node label
echo "MATCH (n:Node) RETURN count(n) as total;" | docker exec -i <container_name> cypher-shell -u neo4j -p password

# Check embedding coverage
echo "MATCH (n:Node) RETURN count(n) as total, count(n.embedding) as withEmbedding, count(n.embedding) * 100.0 / count(n) as percentWithEmbedding;" | docker exec -i <container_name> cypher-shell -u neo4j -p password

# Check FileChunk status
echo "MATCH (fc:FileChunk) RETURN count(fc) as total, count(fc.embedding) as withEmbedding;" | docker exec -i <container_name> cypher-shell -u neo4j -p password
```

**Expected Results After Migration:**

- All nodes should have `:Node` label
- Vector index can now find all nodes
- Most nodes will still need embeddings generated

## Generate Missing Embeddings

After fixing the label issue, generate embeddings for nodes that don't have them:

```bash
# 1. Check current embedding status
npm run embeddings:check

# 2. Generate embeddings for all nodes/chunks without them
npm run embeddings:generate
```

The generation script will:

- Find all `:Node` labeled nodes without embeddings
- Include both regular nodes (memory, todo, preamble) and FileChunks
- Generate embeddings using the configured model (mxbai-embed-large)
- Store embeddings in the database
- Show progress and verification statistics

**Note**: For 3,000+ nodes, this may take 30-60 minutes depending on your embeddings service performance.

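The migration commands under "Fix: Migrate Old Data" differ only in the label name, so they can be generated in a loop. The following is a minimal sketch of that idea, not part of the Mimir scripts: `CONTAINER`, `NEO4J_USER`, and `NEO4J_PASSWORD` are hypothetical environment variables introduced here for illustration, and the function names are made up. When `CONTAINER` is unset the script only prints the statements, which makes a dry run possible before touching the database.

```bash
#!/usr/bin/env bash
# Sketch only: CONTAINER, NEO4J_USER and NEO4J_PASSWORD are hypothetical
# environment variables, not settings defined by Mimir.

# Legacy labels named in the migration steps of this guide.
LEGACY_LABELS="memory preamble todo todoList FileChunk File"

# Print the Cypher statement that adds :Node to nodes carrying one label.
migration_query() {
  printf 'MATCH (n:%s) WHERE NOT n:Node SET n:Node RETURN count(n) as updated;\n' "$1"
}

# Run one statement per label, or print it when CONTAINER is unset (dry run).
migrate_all() {
  local label
  for label in $LEGACY_LABELS; do
    if [ -n "${CONTAINER:-}" ]; then
      migration_query "$label" | docker exec -i "$CONTAINER" \
        cypher-shell -u "${NEO4J_USER:-neo4j}" -p "${NEO4J_PASSWORD:-password}"
    else
      migration_query "$label"
    fi
  done
}

migrate_all
```

Because each statement filters on `WHERE NOT n:Node`, running the loop a second time should report zero updated nodes per label. Unlike step 3 of the manual commands, this sketch handles `todo` and `todoList` in separate statements, which has the same effect.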
## Prevention: Code Already Fixed

The current codebase is correct and prevents this issue:

**✅ Correct patterns already in use:**

- `GraphManager.addNode()` creates all nodes with `:Node` label (line 296)
- `FileIndexer` creates `File:Node` and `FileChunk:Node` labels (lines 121, 170)
- All new nodes will automatically have the `:Node` label

**No code changes needed** - just migrate the old data once.

## Summary of What Changed

### Before Migration

```
Total Nodes: 3,573
- With :Node label: 3,279
- Without :Node label: 294 (not searchable)
- With embeddings: 134 (3.75%)
- FileChunks without embeddings: 3,069
```

### After Migration

```
Total Nodes: 3,573
- With :Node label: 3,573 ✅
- Without :Node label: 0 ✅
- With embeddings: 134 (3.75%)
- Need embeddings: 3,439 (96.25%)
```

### After Generating Embeddings

```
Total Nodes: 3,573
- With :Node label: 3,573 ✅
- With embeddings: 3,573 (100%) ✅
- Vector search: Fully functional ✅
```

## Troubleshooting

### If nodes are still not searchable after migration:

1. Verify `:Node` label was added: `MATCH (n) WHERE NOT n:Node RETURN count(n);` should return 0
2. Check vector index exists: `SHOW INDEXES WHERE type = 'VECTOR';`
3. Verify embeddings exist: `MATCH (n:Node) WHERE n.embedding IS NULL RETURN count(n);` should return 0 once generation has finished

### If embedding generation fails:

1. Check embeddings service is running: `docker ps | grep llama`
2. Verify service URL: `echo $MIMIR_EMBEDDINGS_SERVICE_URL`
3. Check logs: `docker logs mimir-server`
4. Test embedding service directly: `curl http://localhost:11434/api/embeddings -d '{"model":"mxbai-embed-large","prompt":"test"}'`

### If FileChunks are still missing embeddings:

1. FileChunks use `text` property, not `content`
2. The generation script now handles both: `coalesce(n.content, n.text)`
3. Run check to verify: `npm run embeddings:check`

## Additional Resources

- **Vector Index Documentation**: Neo4j Vector Search documentation
- **Embeddings Configuration**: `src/indexing/EmbeddingsService.ts`
- **GraphManager Implementation**: `src/managers/GraphManager.ts`
- **FileIndexer Implementation**: `src/indexing/FileIndexer.ts`
- **Embedding Scripts**: `scripts/check-and-reset-embeddings.js`
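
When testing the embedding service directly, as in the troubleshooting step above, it can also help to confirm that the returned vector length matches the `vector.dimensions` value of the index (1024 in this guide). The sketch below is a rough check under a stated assumption: the response body is a flat JSON object whose only commas are the separators inside a single `embedding` array. The function names are hypothetical, and a real JSON parser would be more robust.

```bash
#!/usr/bin/env bash
# Sketch only: assumes the response body is a flat JSON object whose only
# commas separate the values of one "embedding" array.

# Read a response body on stdin and print the number of embedding values.
embedding_dims() {
  local body commas
  body=$(cat)
  case "$body" in
    *'"embedding"'*) ;;          # looks like an embedding response
    *) echo 0; return 0 ;;       # anything else counts as zero dimensions
  esac
  commas=$(printf '%s' "$body" | tr -cd ',' | wc -c)
  echo $(( commas + 1 ))
}

# Succeed only if the body on stdin has the expected number of dimensions.
check_dims() {
  local expected=$1 actual
  actual=$(embedding_dims)
  if [ "$actual" -eq "$expected" ]; then
    echo "OK: $actual dimensions"
  else
    echo "MISMATCH: got $actual, expected $expected" >&2
    return 1
  fi
}

# Usage (needs the embeddings service from the curl command in this guide):
#   curl -s http://localhost:11434/api/embeddings \
#     -d '{"model":"mxbai-embed-large","prompt":"test"}' | check_dims 1024
```

A mismatch would suggest that the configured model and the index were set up with different dimensions, in which case vectors of the wrong length cannot be matched by the index.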
