Graph Merge Suggestions
graph_merge_suggestionsIdentifies potential duplicate entity pairs by combining embedding similarity, shared neighbor overlap, and name token Jaccard. Use to review and triage entity consolidation before merging.
Instructions
Surface candidate pairs of entities likely to be duplicates. Read-only — never auto-merges. Combines embedding similarity, shared-neighbor overlap, and name-token Jaccard. Same-type only. Use to triage entity-explosion before running graph_merge (destructive consolidation) or graph_relate with ALIAS_OF (soft alias).
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| entity_id | No | Scope to one entity's potential duplicates | |
| entity_type | No | Scope to one entity type (Person, Project, etc.) | |
| min_score | No | Combined-score threshold to surface (default 0.8) | |
| min_embedding_similarity | No | Embedding-similarity floor for candidates (default 0.85) | |
| limit | No | Max suggestions to return (default 20, max 100) | |
| weights | No | Override default weights (0.4 / 0.4 / 0.2) | |
| log_to_audit | No | Emit merge_flagged audit events for surfaced pairs (default true) |
Implementation Reference
- src/mcp-server/index.ts:698-749 (registration)Tool registration for graph_merge_suggestions on the MCP server. Defines the input schema with parameters like entity_id, entity_type, min_score, min_embedding_similarity, limit, weights, and log_to_audit. Calls client.mergeSuggestions() and optionally logs merge_flagged audit events.
// ─── Tool: graph_merge_suggestions ─── server.registerTool("graph_merge_suggestions", { title: "Graph Merge Suggestions", description: "Surface candidate pairs of entities likely to be duplicates. Read-only — never auto-merges. Combines embedding similarity, shared-neighbor overlap, and name-token Jaccard. Same-type only. Use to triage entity-explosion before running graph_merge (destructive consolidation) or graph_relate with ALIAS_OF (soft alias).", inputSchema: { entity_id: z.string().optional().describe("Scope to one entity's potential duplicates"), entity_type: z.string().optional().describe("Scope to one entity type (Person, Project, etc.)"), min_score: z.number().optional().describe("Combined-score threshold to surface (default 0.8)"), min_embedding_similarity: z.number().optional().describe("Embedding-similarity floor for candidates (default 0.85)"), limit: z.number().optional().describe("Max suggestions to return (default 20, max 100)"), weights: z.object({ embedding: z.number().optional(), neighbor_jaccard: z.number().optional(), name: z.number().optional(), }).optional().describe("Override default weights (0.4 / 0.4 / 0.2)"), log_to_audit: z.boolean().optional().describe("Emit merge_flagged audit events for surfaced pairs (default true)"), }, annotations: { readOnlyHint: true, destructiveHint: false, idempotentHint: true }, }, async (args) => { try { const tenantId = currentTenant(); const result = await client.mergeSuggestions(tenantId, { entity_id: args.entity_id, entity_type: args.entity_type as EntityType | undefined, min_score: args.min_score, min_embedding_similarity: args.min_embedding_similarity, limit: args.limit, weights: args.weights, }); if (args.log_to_audit !== false) { for (const s of result.suggestions) { try { appendAuditEvent({ event: "merge_flagged", timestamp: new Date().toISOString(), tenant_id: tenantId, entity_a: s.entity_a.id, entity_b: s.entity_b.id, reason: `score=${s.score} (emb=${s.signals.embedding_similarity}, neighbor_jaccard=${s.signals.neighbor_jaccard}, name=${s.signals.name_similarity})`, }); } catch { /* audit is best-effort */ } } } return toolResult(result); } catch (err) { return toolError(`graph_merge_suggestions failed: ${err instanceof Error ? err.message : String(err)}`); } }); - src/shared/neo4j-client.ts:1815-2066 (handler)The actual implementation of mergeSuggestions() on the Neo4jClient class. Combines embedding similarity (vector index), hub-aware shared-neighbor Jaccard (down-weighting high-degree hubs), and name-token Jaccard to surface candidate duplicate entity pairs. Returns suggestions with signals and scores, same-type only.
async mergeSuggestions( tenantId: string, options: { entity_id?: string; entity_type?: EntityType; min_score?: number; min_embedding_similarity?: number; limit?: number; weights?: { embedding?: number; neighbor_jaccard?: number; name?: number }; } = {}, ): Promise<{ suggestions: Array<{ entity_a: { id: string; name: string; type: string; edge_count: number }; entity_b: { id: string; name: string; type: string; edge_count: number }; score: number; signals: { embedding_similarity: number; name_similarity: number; neighbor_jaccard: number; shared_neighbors: Array<{ id: string; relation: string; degree: number; weight: number }>; }; recommended_action: "review"; }>; total_pairs_evaluated: number; threshold_used: number; scope: { entity_id?: string; entity_type?: string; global: boolean }; }> { const minScore = options.min_score ?? 0.8; const minEmbSim = options.min_embedding_similarity ?? 0.85; const limit = Math.min(options.limit ?? 20, 100); const w = { embedding: options.weights?.embedding ?? 0.4, neighbor_jaccard: options.weights?.neighbor_jaccard ?? 0.4, name: options.weights?.name ?? 0.2, }; // Cap the number of seed entities to avoid runaway scans on large graphs. const MAX_SEEDS = 200; // Step 0: precompute per-entity edge degrees for hub-aware Jaccard. // A neighbor with degree D contributes weight 1/(1+log(D)) to the // intersection/union sums — so a 1-edge specific neighbor contributes // 1.0, while a 50-edge hub (e.g. the user's own Person node) contributes // ~0.20. Shared neighbors that are everyone's neighbor add little signal. const degreeRows = await this.run( ` MATCH (n:Entity {tenant_id: $tenantId}) RETURN n.id AS id, count{(n)-[]-(:Entity {tenant_id: $tenantId})} AS degree `, { tenantId }, ); const degrees = new Map<string, number>(); for (const r of degreeRows) { degrees.set(String(r["id"]), Number(r["degree"] ?? 0)); } const neighborWeight = (degree: number): number => { const d = Math.max(degree, 1); return 1 / (1 + Math.log(d)); }; // Step 1: collect seed entities. Constrained by entity_id / entity_type. const seedRows = await this.run( ` MATCH (n:Entity {tenant_id: $tenantId}) WHERE n.embedding IS NOT NULL AND ($entityId IS NULL OR n.id = $entityId) AND ($entityType IS NULL OR $entityType IN labels(n)) RETURN n.id AS id, n.name AS name, n.embedding AS embedding, [l IN labels(n) WHERE l <> 'Entity'][0] AS type LIMIT $maxSeeds `, { tenantId, entityId: options.entity_id ?? null, entityType: options.entity_type ?? null, maxSeeds: MAX_SEEDS, }, ); // Step 2: for each seed, find vector-similar same-type neighbors and // build a deduped pair map. Pairs are canonicalised so a.id < b.id. type Pair = { idA: string; idB: string; embSim: number }; const pairs = new Map<string, Pair>(); for (const row of seedRows) { const seedId = String(row["id"]); const seedType = String(row["type"] ?? ""); const seedEmbedding = row["embedding"] as number[] | null | undefined; if (!Array.isArray(seedEmbedding) || seedEmbedding.length === 0) continue; if (!seedType) continue; const similar = await this.vectorSearch(tenantId, seedEmbedding, { top_k: 10, min_similarity: minEmbSim, entity_types: [seedType as EntityType], }); for (const candidate of similar) { if (candidate.id === seedId) continue; // self-match const [idA, idB] = seedId < candidate.id ? [seedId, candidate.id] : [candidate.id, seedId]; const key = `${idA}::${idB}`; const existing = pairs.get(key); // Keep the higher embedding score if we see the same pair twice. if (!existing || candidate.score > existing.embSim) { pairs.set(key, { idA, idB, embSim: candidate.score }); } } } const totalPairsEvaluated = pairs.size; // Step 3: per-pair feature query — names, types, edge counts, neighbors. const suggestions: Array<{ entity_a: { id: string; name: string; type: string; edge_count: number }; entity_b: { id: string; name: string; type: string; edge_count: number }; score: number; signals: { embedding_similarity: number; name_similarity: number; neighbor_jaccard: number; shared_neighbors: Array<{ id: string; relation: string; degree: number; weight: number }>; }; recommended_action: "review"; }> = []; for (const pair of pairs.values()) { const featureRows = await this.run( ` MATCH (a:Entity {tenant_id: $tenantId, id: $idA}) MATCH (b:Entity {tenant_id: $tenantId, id: $idB}) OPTIONAL MATCH (a)-[ra]-(na:Entity {tenant_id: $tenantId}) WHERE na.id <> b.id WITH a, b, collect(DISTINCT na.id + '|' + type(ra)) AS neighborsA OPTIONAL MATCH (b)-[rb]-(nb:Entity {tenant_id: $tenantId}) WHERE nb.id <> a.id WITH a, b, neighborsA, collect(DISTINCT nb.id + '|' + type(rb)) AS neighborsB RETURN a.name AS nameA, [l IN labels(a) WHERE l <> 'Entity'][0] AS typeA, b.name AS nameB, [l IN labels(b) WHERE l <> 'Entity'][0] AS typeB, neighborsA, neighborsB `, { tenantId, idA: pair.idA, idB: pair.idB }, ); if (featureRows.length === 0) continue; const f = featureRows[0]!; const neighborsA = (f["neighborsA"] as string[] | null | undefined ?? []) .filter((s) => typeof s === "string" && s.length > 0); const neighborsB = (f["neighborsB"] as string[] | null | undefined ?? []) .filter((s) => typeof s === "string" && s.length > 0); const setA = new Set(neighborsA); const setB = new Set(neighborsB); const intersection = neighborsA.filter((x) => setB.has(x)); const unionSet = new Set([...neighborsA, ...neighborsB]); // Hub-aware weighted Jaccard. Each neighbor entry is "id|relation"; we // look up the global degree of the neighbor entity (id portion) and // weight its contribution inversely. A pair that shares only a hub // (everyone's neighbor) gets little credit; a pair that shares a // specific low-degree neighbor gets near-full credit. const idOf = (entry: string): string => { const sep = entry.lastIndexOf("|"); return sep >= 0 ? entry.slice(0, sep) : entry; }; let weightedInter = 0; for (const entry of intersection) { const deg = degrees.get(idOf(entry)) ?? 1; weightedInter += neighborWeight(deg); } let weightedUnion = 0; for (const entry of unionSet) { const deg = degrees.get(idOf(entry)) ?? 1; weightedUnion += neighborWeight(deg); } const neighborJaccard = weightedUnion === 0 ? 0 : weightedInter / weightedUnion; const nameA = String(f["nameA"] ?? ""); const nameB = String(f["nameB"] ?? ""); const tokensA = new Set( nameA.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0), ); const tokensB = new Set( nameB.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0), ); const tokenInter = [...tokensA].filter((t) => tokensB.has(t)).length; const tokenUnion = new Set([...tokensA, ...tokensB]).size; const nameSim = tokenUnion === 0 ? 0 : tokenInter / tokenUnion; const score = w.embedding * pair.embSim + w.neighbor_jaccard * neighborJaccard + w.name * nameSim; if (score < minScore) continue; const sharedNeighbors = intersection.map((entry) => { const sep = entry.lastIndexOf("|"); const id = sep >= 0 ? entry.slice(0, sep) : entry; const relation = sep >= 0 ? entry.slice(sep + 1) : ""; const deg = degrees.get(id) ?? 1; return { id, relation, degree: deg, weight: Number(neighborWeight(deg).toFixed(4)), }; }); suggestions.push({ entity_a: { id: pair.idA, name: nameA, type: String(f["typeA"] ?? "?"), edge_count: setA.size, }, entity_b: { id: pair.idB, name: nameB, type: String(f["typeB"] ?? "?"), edge_count: setB.size, }, score: Number(score.toFixed(4)), signals: { embedding_similarity: Number(pair.embSim.toFixed(4)), name_similarity: Number(nameSim.toFixed(4)), neighbor_jaccard: Number(neighborJaccard.toFixed(4)), shared_neighbors: sharedNeighbors, }, recommended_action: "review", }); } suggestions.sort((a, b) => b.score - a.score); const truncated = suggestions.slice(0, limit); return { suggestions: truncated, total_pairs_evaluated: totalPairsEvaluated, threshold_used: minScore, scope: { entity_id: options.entity_id, entity_type: options.entity_type, global: !options.entity_id && !options.entity_type, }, }; } - src/shared/dream-audit.ts:95-100 (helper)The merge_flagged audit event type definition used by graph_merge_suggestions when log_to_audit is enabled. Logs entity_a, entity_b, and reason to the audit trail.
| (BaseEvent & { event: "merge_flagged"; entity_a: string; entity_b: string; reason: string; }) - src/shared/dream-audit.ts:96-100 (helper)The merge_flagged event type used when graph_merge_suggestions logs potential duplicate pairs to the audit trail.
event: "merge_flagged"; entity_a: string; entity_b: string; reason: string; })