Skip to main content
Glama
orneryd

M.I.M.I.R - Multi-agent Intelligent Memory & Insight Repository

by orneryd
DUPLICATE_EDGES_FIX.md8.03 kB
# Duplicate Edges Issue - Root Cause & Fix **Date**: November 20, 2025 **Issue**: File chunk edges are duplicated when folders are re-indexed **Impact**: Database bloat, potential query performance degradation **Status**: ✅ Fixed --- ## Problem When removing and re-adding folders to Mimir's file indexing system, **duplicate edges were created for file chunks**. Each re-index would create an additional set of edges instead of cleaning up old edges first, leading to database bloat. ### Example ``` First index: FileChunk-1 --[NEXT_CHUNK]--> FileChunk-2 Second index: FileChunk-1 --[NEXT_CHUNK]--> FileChunk-2 (duplicate!) Third index: FileChunk-1 --[NEXT_CHUNK]--> FileChunk-2 (another duplicate!) ``` ### Root Causes 1. **Missing `DETACH DELETE`** - Old code used `DELETE` instead of `DETACH DELETE` - `DELETE` removes nodes but leaves orphaned edges - `DETACH DELETE` removes edges THEN nodes 2. **Old embedding structure** - Query looked for `[:HAS_EMBEDDING]` relationships that no longer exist - Embeddings are now stored as node properties, not separate nodes 3. **Path matching bug** - Used simple `STARTS WITH` without path separator - Could cause false matches (e.g., `/src` matching `/src-other`) 4. **Missing NodeChunk cleanup** - Only handled `FileChunk`, not `NodeChunk` nodes - New universal chunking system wasn't being cleaned up --- ## Detected Statistics **Scan Results** (from `db:cleanup-edges:dry-run`): - **Total duplicate edge sets**: 2,926 - **Primary edge type**: `NEXT_CHUNK` between FileChunk nodes - **Secondary duplicates**: `depends_on` between todo nodes - **Estimated duplicate edges to remove**: ~2,926+ **Breakdown by type**: - `NEXT_CHUNK`: ~2,900 duplicates - `depends_on`: ~26 duplicates - Others: Negligible --- ## Solution ### 1. Fixed `DELETE /api/indexed-folders` Endpoint **File**: `src/api/index-api.ts` **Changes**: ```typescript // OLD (problematic) MATCH (f:File) WHERE f.path STARTS WITH $folderPath OPTIONAL MATCH (f)-[:HAS_CHUNK]->(c:FileChunk) OPTIONAL MATCH (c)-[:HAS_EMBEDDING]->(e) // ❌ Wrong structure DETACH DELETE f, c, e // NEW (fixed) // Ensure path ends with separator to avoid false matches const folderPathWithSep = path.endsWith('/') ? path : path + '/'; // Delete File nodes and their FileChunk children MATCH (f:File) WHERE f.path STARTS WITH $folderPathWithSep OR f.path = $exactPath OPTIONAL MATCH (f)-[:HAS_CHUNK]->(c:FileChunk) DETACH DELETE f, c RETURN count(DISTINCT f) as fileCount, count(DISTINCT c) as chunkCount ``` **Key improvements**: - ✅ Removed obsolete `[:HAS_EMBEDDING]` relationship matching - ✅ Added path separator to prevent false matches - ✅ Added exact path match option - ✅ Returns deletion stats for logging - ✅ Properly uses `DETACH DELETE` ### 2. Fixed Path Translation Functions **File**: `src/api/index-api.ts` **Functions updated**: - `translateToHostPath()` helper - File migration path calculation **Changes**: ```typescript // Ensure root ends with separator to avoid false matches const rootWithSep = containerWorkspaceRoot.endsWith('/') ? containerWorkspaceRoot : `${containerWorkspaceRoot}/`; // Check if path starts with root (with separator) or is exact match if (containerPath.startsWith(rootWithSep) || containerPath === containerWorkspaceRoot) { return containerPath.replace(containerWorkspaceRoot, hostWorkspaceRoot); } ``` ### 3. Created Cleanup Script **File**: `scripts/cleanup-duplicate-edges.js` **Features**: - Scans entire database for duplicate edges - Shows detailed statistics by relationship type - Supports `--dry-run` mode for safe preview - 5-second confirmation before destructive operations - Verification after cleanup - Returns summary of deleted edges **Usage**: ```bash # Preview what would be deleted (safe) npm run db:cleanup-edges:dry-run # Actually perform cleanup (destructive) npm run db:cleanup-edges ``` **Cypher query used**: ```cypher MATCH (source)-[r]->(target) WITH source, target, type(r) as relType, collect(r) as rels WHERE size(rels) > 1 WITH source, target, relType, rels UNWIND rels[1..] as duplicateRel // Keep first edge, delete rest DELETE duplicateRel RETURN count(*) as deletedCount ``` --- ## Prevention The fix prevents future duplicates by: 1. **Proper cleanup on removal** - `DETACH DELETE` ensures all edges are removed 2. **Accurate path matching** - Path separator prevents false matches 3. **Modern schema** - No longer looks for obsolete embedding relationships 4. **Comprehensive cleanup** - Handles both FileChunk and NodeChunk nodes --- ## Deployment Steps ### 1. Backup Database (Recommended) ```bash # Export Neo4j database before cleanup docker exec neo4j_db neo4j-admin database dump neo4j --to-path=/backups ``` ### 2. Preview Duplicates ```bash npm run db:cleanup-edges:dry-run ``` ### 3. Run Cleanup ```bash npm run db:cleanup-edges ``` ### 4. Rebuild & Restart Mimir ```bash npm run build docker compose build mimir-server docker compose up -d mimir-server ``` ### 5. Verify - Remove a folder from indexing - Re-add the same folder - Check Neo4j Browser for duplicate `NEXT_CHUNK` edges - Should see NO duplicates --- ## Verification Query Run this in Neo4j Browser to check for remaining duplicates: ```cypher // Find duplicate edges MATCH (source)-[r]->(target) WITH source, target, type(r) as relType, collect(r) as rels WHERE size(rels) > 1 RETURN labels(source) as sourceLabels, labels(target) as targetLabels, relType, size(rels) as duplicateCount ORDER BY duplicateCount DESC LIMIT 20 ``` **Expected result after fix**: 0 rows --- ## Performance Impact **Before fix**: - Database size: Growing with each re-index - Query performance: Degrading over time - Duplicate edges: Accumulating indefinitely **After fix**: - Database size: Stable - Query performance: Consistent - Duplicate edges: None (cleaned up) --- ## Related Files ### Modified - `src/api/index-api.ts` - Fixed deletion endpoint and path translation - `package.json` - Added `db:cleanup-edges` scripts ### Created - `scripts/cleanup-duplicate-edges.js` - Database cleanup utility - `docs/bugfixes/DUPLICATE_EDGES_FIX.md` - This document --- ## Additional Notes ### Why This Happened 1. **Legacy code** - Original implementation used simple `DELETE` 2. **Schema evolution** - Embeddings moved from nodes to properties 3. **Incremental development** - New features (NodeChunk) not fully integrated into cleanup ### Lessons Learned 1. ✅ Always use `DETACH DELETE` when removing nodes with relationships 2. ✅ Keep cleanup logic in sync with schema changes 3. ✅ Add path separators to prevent false `STARTS WITH` matches 4. ✅ Create verification queries for critical operations 5. ✅ Provide dry-run modes for destructive database operations --- ## Testing ### Manual Test ```bash # 1. Index a folder curl -X POST http://localhost:9042/api/index-folder \ -H "Content-Type: application/json" \ -d '{"path": "/workspace/test", "hostPath": "/Users/test"}' # 2. Wait for indexing to complete # 3. Remove the folder curl -X DELETE http://localhost:9042/api/indexed-folders \ -H "Content-Type: application/json" \ -d '{"path": "/workspace/test"}' # 4. Re-index the same folder curl -X POST http://localhost:9042/api/index-folder \ -H "Content-Type: application/json" \ -d '{"path": "/workspace/test", "hostPath": "/Users/test"}' # 5. Check for duplicates in Neo4j Browser MATCH (c1:FileChunk)-[r:NEXT_CHUNK]->(c2:FileChunk) WHERE c1.path STARTS WITH '/workspace/test' WITH c1, c2, collect(r) as rels WHERE size(rels) > 1 RETURN c1.id, c2.id, size(rels) ``` **Expected**: 0 rows (no duplicates) --- ## Support If you encounter issues: 1. Check Neo4j logs: `docker compose logs neo4j_db` 2. Run diagnostic: `npm run db:cleanup-edges:dry-run` 3. Verify API is running: `curl http://localhost:9042/health` 4. Check this document for verification queries --- **Status**: ✅ Issue resolved, prevention measures in place, cleanup script available

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/orneryd/Mimir'

If you have feedback or need assistance with the MCP directory API, please join our Discord server