Graph Merge Suggestions

graph_merge_suggestions
Read-only · Idempotent

Identifies potential duplicate entity pairs by combining embedding similarity, shared-neighbor overlap, and name-token Jaccard similarity. Use it to review and triage candidates for consolidation before merging.

Instructions

Surface candidate pairs of entities likely to be duplicates. Read-only — never auto-merges. Combines embedding similarity, shared-neighbor overlap, and name-token Jaccard. Same-type only. Use to triage entity-explosion before running graph_merge (destructive consolidation) or graph_relate with ALIAS_OF (soft alias).

Input Schema

| Name | Required | Description | Default |
|---|---|---|---|
| entity_id | No | Scope to one entity's potential duplicates | |
| entity_type | No | Scope to one entity type (Person, Project, etc.) | |
| min_score | No | Combined-score threshold to surface | 0.8 |
| min_embedding_similarity | No | Embedding-similarity floor for candidates | 0.85 |
| limit | No | Max suggestions to return (max 100) | 20 |
| weights | No | Override default weights (embedding / neighbor_jaccard / name) | 0.4 / 0.4 / 0.2 |
| log_to_audit | No | Emit merge_flagged audit events for surfaced pairs | true |
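
As a rough illustration of how the optional parameters resolve, the sketch below applies the same defaults and the same upper bound on `limit` that the implementation reference on this page uses. The names `MergeSuggestionArgs` and `resolveArgs` are hypothetical and exist only for this example; they are not part of the server.

```typescript
// Hypothetical argument shape, copied from the input schema on this page.
interface MergeSuggestionArgs {
  entity_id?: string;
  entity_type?: string;
  min_score?: number;
  min_embedding_similarity?: number;
  limit?: number;
  weights?: { embedding?: number; neighbor_jaccard?: number; name?: number };
  log_to_audit?: boolean;
}

// Hypothetical helper: fills in the documented defaults for omitted fields.
function resolveArgs(args: MergeSuggestionArgs = {}) {
  return {
    minScore: args.min_score ?? 0.8,
    minEmbeddingSimilarity: args.min_embedding_similarity ?? 0.85,
    // The default is 20; values above 100 are capped at 100.
    limit: Math.min(args.limit ?? 20, 100),
    // Each weight falls back independently; nothing is renormalized.
    weights: {
      embedding: args.weights?.embedding ?? 0.4,
      neighbor_jaccard: args.weights?.neighbor_jaccard ?? 0.4,
      name: args.weights?.name ?? 0.2,
    },
    // Audit logging is on unless explicitly set to false.
    logToAudit: args.log_to_audit !== false,
    // The scan is global only when neither scope parameter is given.
    global: !args.entity_id && !args.entity_type,
  };
}

const resolved = resolveArgs({ limit: 500, weights: { embedding: 0.6 } });
console.log(resolved.limit, resolved.weights);
```

Because each weight falls back on its own, overriding only one of them (as in the call above) yields weights that no longer sum to 1, so the combined score is no longer bounded by 1 and `min_score` may need adjusting to match.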

Implementation Reference

  • Tool registration for graph_merge_suggestions on the MCP server. Defines the input schema with parameters like entity_id, entity_type, min_score, min_embedding_similarity, limit, weights, and log_to_audit. Calls client.mergeSuggestions() and optionally logs merge_flagged audit events.
    // ─── Tool: graph_merge_suggestions ───
    
    server.registerTool("graph_merge_suggestions", {
      title: "Graph Merge Suggestions",
      description:
        "Surface candidate pairs of entities likely to be duplicates. Read-only — never auto-merges. Combines embedding similarity, shared-neighbor overlap, and name-token Jaccard. Same-type only. Use to triage entity-explosion before running graph_merge (destructive consolidation) or graph_relate with ALIAS_OF (soft alias).",
      inputSchema: {
        entity_id: z.string().optional().describe("Scope to one entity's potential duplicates"),
        entity_type: z.string().optional().describe("Scope to one entity type (Person, Project, etc.)"),
        min_score: z.number().optional().describe("Combined-score threshold to surface (default 0.8)"),
        min_embedding_similarity: z.number().optional().describe("Embedding-similarity floor for candidates (default 0.85)"),
        limit: z.number().optional().describe("Max suggestions to return (default 20, max 100)"),
        weights: z.object({
          embedding: z.number().optional(),
          neighbor_jaccard: z.number().optional(),
          name: z.number().optional(),
        }).optional().describe("Override default weights (0.4 / 0.4 / 0.2)"),
        log_to_audit: z.boolean().optional().describe("Emit merge_flagged audit events for surfaced pairs (default true)"),
      },
      annotations: { readOnlyHint: true, destructiveHint: false, idempotentHint: true },
    }, async (args) => {
      try {
        const tenantId = currentTenant();
        const result = await client.mergeSuggestions(tenantId, {
          entity_id: args.entity_id,
          entity_type: args.entity_type as EntityType | undefined,
          min_score: args.min_score,
          min_embedding_similarity: args.min_embedding_similarity,
          limit: args.limit,
          weights: args.weights,
        });
    
        if (args.log_to_audit !== false) {
          for (const s of result.suggestions) {
            try {
              appendAuditEvent({
                event: "merge_flagged",
                timestamp: new Date().toISOString(),
                tenant_id: tenantId,
                entity_a: s.entity_a.id,
                entity_b: s.entity_b.id,
                reason: `score=${s.score} (emb=${s.signals.embedding_similarity}, neighbor_jaccard=${s.signals.neighbor_jaccard}, name=${s.signals.name_similarity})`,
              });
            } catch { /* audit is best-effort */ }
          }
        }
    
        return toolResult(result);
      } catch (err) {
        return toolError(`graph_merge_suggestions failed: ${err instanceof Error ? err.message : String(err)}`);
      }
    });
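
The audit loop in the handler above is best-effort: a failed write for one pair is swallowed and the remaining pairs are still attempted. The sketch below isolates that pattern so it can be exercised on its own. `flagSuggestions` and its `sink` parameter are hypothetical stand-ins (the handler calls `appendAuditEvent` directly), and the interfaces are trimmed to the fields the handler reads.

```typescript
// Trimmed shapes: only the fields the audit loop touches.
interface Suggestion {
  entity_a: { id: string };
  entity_b: { id: string };
  score: number;
  signals: {
    embedding_similarity: number;
    neighbor_jaccard: number;
    name_similarity: number;
  };
}

interface MergeFlaggedEvent {
  event: "merge_flagged";
  timestamp: string;
  tenant_id: string;
  entity_a: string;
  entity_b: string;
  reason: string;
}

// Hypothetical helper: emits one merge_flagged event per suggestion and
// returns how many writes succeeded. The sink is injected so it can be stubbed.
function flagSuggestions(
  tenantId: string,
  suggestions: Suggestion[],
  sink: (e: MergeFlaggedEvent) => void,
): number {
  let written = 0;
  for (const s of suggestions) {
    try {
      sink({
        event: "merge_flagged",
        timestamp: new Date().toISOString(),
        tenant_id: tenantId,
        entity_a: s.entity_a.id,
        entity_b: s.entity_b.id,
        // Same reason format as the handler above.
        reason: `score=${s.score} (emb=${s.signals.embedding_similarity}, neighbor_jaccard=${s.signals.neighbor_jaccard}, name=${s.signals.name_similarity})`,
      });
      written += 1;
    } catch {
      // Best-effort: a failed audit write must not fail the tool call.
    }
  }
  return written;
}
```

Note that audit events are emitted by default, so a caller that wants a call with no writes at all would pass `log_to_audit: false`.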
  • The actual implementation of mergeSuggestions() on the Neo4jClient class. Combines embedding similarity (vector index), hub-aware shared-neighbor Jaccard (down-weighting high-degree hubs), and name-token Jaccard to surface candidate duplicate entity pairs. Returns suggestions with signals and scores, same-type only.
    async mergeSuggestions(
      tenantId: string,
      options: {
        entity_id?: string;
        entity_type?: EntityType;
        min_score?: number;
        min_embedding_similarity?: number;
        limit?: number;
        weights?: { embedding?: number; neighbor_jaccard?: number; name?: number };
      } = {},
    ): Promise<{
      suggestions: Array<{
        entity_a: { id: string; name: string; type: string; edge_count: number };
        entity_b: { id: string; name: string; type: string; edge_count: number };
        score: number;
        signals: {
          embedding_similarity: number;
          name_similarity: number;
          neighbor_jaccard: number;
          shared_neighbors: Array<{ id: string; relation: string; degree: number; weight: number }>;
        };
        recommended_action: "review";
      }>;
      total_pairs_evaluated: number;
      threshold_used: number;
      scope: { entity_id?: string; entity_type?: string; global: boolean };
    }> {
      const minScore = options.min_score ?? 0.8;
      const minEmbSim = options.min_embedding_similarity ?? 0.85;
      const limit = Math.min(options.limit ?? 20, 100);
      const w = {
        embedding: options.weights?.embedding ?? 0.4,
        neighbor_jaccard: options.weights?.neighbor_jaccard ?? 0.4,
        name: options.weights?.name ?? 0.2,
      };
      // Cap the number of seed entities to avoid runaway scans on large graphs.
      const MAX_SEEDS = 200;
    
      // Step 0: precompute per-entity edge degrees for hub-aware Jaccard.
      // A neighbor with degree D contributes weight 1/(1+log(D)) to the
      // intersection/union sums — so a 1-edge specific neighbor contributes
      // 1.0, while a 50-edge hub (e.g. the user's own Person node) contributes
      // ~0.20. Shared neighbors that are everyone's neighbor add little signal.
      const degreeRows = await this.run(
        `
        MATCH (n:Entity {tenant_id: $tenantId})
        RETURN n.id AS id, count{(n)-[]-(:Entity {tenant_id: $tenantId})} AS degree
        `,
        { tenantId },
      );
      const degrees = new Map<string, number>();
      for (const r of degreeRows) {
        degrees.set(String(r["id"]), Number(r["degree"] ?? 0));
      }
      const neighborWeight = (degree: number): number => {
        const d = Math.max(degree, 1);
        return 1 / (1 + Math.log(d));
      };
    
      // Step 1: collect seed entities. Constrained by entity_id / entity_type.
      const seedRows = await this.run(
        `
        MATCH (n:Entity {tenant_id: $tenantId})
        WHERE n.embedding IS NOT NULL
          AND ($entityId IS NULL OR n.id = $entityId)
          AND ($entityType IS NULL OR $entityType IN labels(n))
        RETURN n.id AS id,
               n.name AS name,
               n.embedding AS embedding,
               [l IN labels(n) WHERE l <> 'Entity'][0] AS type
        LIMIT $maxSeeds
        `,
        {
          tenantId,
          entityId: options.entity_id ?? null,
          entityType: options.entity_type ?? null,
          maxSeeds: MAX_SEEDS,
        },
      );
    
      // Step 2: for each seed, find vector-similar same-type neighbors and
      // build a deduped pair map. Pairs are canonicalised so a.id < b.id.
      type Pair = { idA: string; idB: string; embSim: number };
      const pairs = new Map<string, Pair>();
    
      for (const row of seedRows) {
        const seedId = String(row["id"]);
        const seedType = String(row["type"] ?? "");
        const seedEmbedding = row["embedding"] as number[] | null | undefined;
        if (!Array.isArray(seedEmbedding) || seedEmbedding.length === 0) continue;
        if (!seedType) continue;
    
        const similar = await this.vectorSearch(tenantId, seedEmbedding, {
          top_k: 10,
          min_similarity: minEmbSim,
          entity_types: [seedType as EntityType],
        });
    
        for (const candidate of similar) {
          if (candidate.id === seedId) continue; // self-match
          const [idA, idB] = seedId < candidate.id
            ? [seedId, candidate.id]
            : [candidate.id, seedId];
          const key = `${idA}::${idB}`;
          const existing = pairs.get(key);
          // Keep the higher embedding score if we see the same pair twice.
          if (!existing || candidate.score > existing.embSim) {
            pairs.set(key, { idA, idB, embSim: candidate.score });
          }
        }
      }
    
      const totalPairsEvaluated = pairs.size;
    
      // Step 3: per-pair feature query — names, types, edge counts, neighbors.
      const suggestions: Array<{
        entity_a: { id: string; name: string; type: string; edge_count: number };
        entity_b: { id: string; name: string; type: string; edge_count: number };
        score: number;
        signals: {
          embedding_similarity: number;
          name_similarity: number;
          neighbor_jaccard: number;
          shared_neighbors: Array<{ id: string; relation: string; degree: number; weight: number }>;
        };
        recommended_action: "review";
      }> = [];
    
      for (const pair of pairs.values()) {
        const featureRows = await this.run(
          `
          MATCH (a:Entity {tenant_id: $tenantId, id: $idA})
          MATCH (b:Entity {tenant_id: $tenantId, id: $idB})
          OPTIONAL MATCH (a)-[ra]-(na:Entity {tenant_id: $tenantId})
          WHERE na.id <> b.id
          WITH a, b, collect(DISTINCT na.id + '|' + type(ra)) AS neighborsA
          OPTIONAL MATCH (b)-[rb]-(nb:Entity {tenant_id: $tenantId})
          WHERE nb.id <> a.id
          WITH a, b, neighborsA,
               collect(DISTINCT nb.id + '|' + type(rb)) AS neighborsB
          RETURN a.name AS nameA,
                 [l IN labels(a) WHERE l <> 'Entity'][0] AS typeA,
                 b.name AS nameB,
                 [l IN labels(b) WHERE l <> 'Entity'][0] AS typeB,
                 neighborsA,
                 neighborsB
          `,
          { tenantId, idA: pair.idA, idB: pair.idB },
        );
    
        if (featureRows.length === 0) continue;
        const f = featureRows[0]!;
        const neighborsA = (f["neighborsA"] as string[] | null | undefined ?? [])
          .filter((s) => typeof s === "string" && s.length > 0);
        const neighborsB = (f["neighborsB"] as string[] | null | undefined ?? [])
          .filter((s) => typeof s === "string" && s.length > 0);
    
        const setA = new Set(neighborsA);
        const setB = new Set(neighborsB);
        const intersection = neighborsA.filter((x) => setB.has(x));
        const unionSet = new Set([...neighborsA, ...neighborsB]);
        // Hub-aware weighted Jaccard. Each neighbor entry is "id|relation"; we
        // look up the global degree of the neighbor entity (id portion) and
        // weight its contribution inversely. A pair that shares only a hub
        // (everyone's neighbor) gets little credit; a pair that shares a
        // specific low-degree neighbor gets near-full credit.
        const idOf = (entry: string): string => {
          const sep = entry.lastIndexOf("|");
          return sep >= 0 ? entry.slice(0, sep) : entry;
        };
        let weightedInter = 0;
        for (const entry of intersection) {
          const deg = degrees.get(idOf(entry)) ?? 1;
          weightedInter += neighborWeight(deg);
        }
        let weightedUnion = 0;
        for (const entry of unionSet) {
          const deg = degrees.get(idOf(entry)) ?? 1;
          weightedUnion += neighborWeight(deg);
        }
        const neighborJaccard = weightedUnion === 0 ? 0 : weightedInter / weightedUnion;
    
        const nameA = String(f["nameA"] ?? "");
        const nameB = String(f["nameB"] ?? "");
        const tokensA = new Set(
          nameA.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0),
        );
        const tokensB = new Set(
          nameB.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0),
        );
        const tokenInter = [...tokensA].filter((t) => tokensB.has(t)).length;
        const tokenUnion = new Set([...tokensA, ...tokensB]).size;
        const nameSim = tokenUnion === 0 ? 0 : tokenInter / tokenUnion;
    
        const score =
          w.embedding * pair.embSim +
          w.neighbor_jaccard * neighborJaccard +
          w.name * nameSim;
    
        if (score < minScore) continue;
    
        const sharedNeighbors = intersection.map((entry) => {
          const sep = entry.lastIndexOf("|");
          const id = sep >= 0 ? entry.slice(0, sep) : entry;
          const relation = sep >= 0 ? entry.slice(sep + 1) : "";
          const deg = degrees.get(id) ?? 1;
          return {
            id,
            relation,
            degree: deg,
            weight: Number(neighborWeight(deg).toFixed(4)),
          };
        });
    
        suggestions.push({
          entity_a: {
            id: pair.idA,
            name: nameA,
            type: String(f["typeA"] ?? "?"),
            // Distinct "neighborId|relation" entries for a, excluding edges to b.
            edge_count: setA.size,
          },
          entity_b: {
            id: pair.idB,
            name: nameB,
            type: String(f["typeB"] ?? "?"),
            // Distinct "neighborId|relation" entries for b, excluding edges to a.
            edge_count: setB.size,
          },
          score: Number(score.toFixed(4)),
          signals: {
            embedding_similarity: Number(pair.embSim.toFixed(4)),
            name_similarity: Number(nameSim.toFixed(4)),
            neighbor_jaccard: Number(neighborJaccard.toFixed(4)),
            shared_neighbors: sharedNeighbors,
          },
          recommended_action: "review",
        });
      }
    
      suggestions.sort((a, b) => b.score - a.score);
      const truncated = suggestions.slice(0, limit);
    
      return {
        suggestions: truncated,
        total_pairs_evaluated: totalPairsEvaluated,
        threshold_used: minScore,
        scope: {
          entity_id: options.entity_id,
          entity_type: options.entity_type,
          global: !options.entity_id && !options.entity_type,
        },
      };
    }
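
To make the scoring concrete, here is a minimal standalone sketch of the three signals as the code above computes them: a hub-aware weighted Jaccard over `"id|relation"` neighbor entries, a name-token Jaccard, and the weighted sum. The function names and the sample data are made up for illustration; only the formulas are taken from the implementation.

```typescript
// Weight of a neighbor with the given degree: 1 / (1 + ln(degree)), degree floored at 1.
function neighborWeight(degree: number): number {
  return 1 / (1 + Math.log(Math.max(degree, 1)));
}

// Weighted Jaccard over "neighborId|RELATION" entries.
function weightedJaccard(a: string[], b: string[], degrees: Map<string, number>): number {
  const setA = new Set(a);
  const setB = new Set(b);
  let inter = 0;
  let union = 0;
  for (const entry of Array.from(new Set(a.concat(b)))) {
    const sep = entry.lastIndexOf("|");
    const id = sep >= 0 ? entry.slice(0, sep) : entry;
    const w = neighborWeight(degrees.get(id) ?? 1);
    union += w;
    if (setA.has(entry) && setB.has(entry)) inter += w;
  }
  return union === 0 ? 0 : inter / union;
}

// Jaccard over lowercase ASCII alphanumeric name tokens.
function nameTokenJaccard(a: string, b: string): number {
  const tokens = (s: string) =>
    new Set(s.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0));
  const ta = tokens(a);
  const tb = tokens(b);
  const inter = Array.from(ta).filter((t) => tb.has(t)).length;
  const union = new Set(Array.from(ta).concat(Array.from(tb))).size;
  return union === 0 ? 0 : inter / union;
}

// Weighted sum with the default weights from the implementation.
function combinedScore(
  emb: number,
  neighbor: number,
  name: number,
  w = { embedding: 0.4, neighbor_jaccard: 0.4, name: 0.2 },
): number {
  return w.embedding * emb + w.neighbor_jaccard * neighbor + w.name * name;
}

// Made-up pair: shares a 50-edge hub (u1) and a specific neighbor (p7).
const degrees = new Map<string, number>([["u1", 50], ["p7", 1], ["x9", 1]]);
const nj = weightedJaccard(
  ["u1|OWNS", "p7|WORKS_ON"],
  ["u1|OWNS", "p7|WORKS_ON", "x9|MENTIONS"],
  degrees,
);
const ns = nameTokenJaccard("Acme Corp", "ACME Corporation");
console.log(nj.toFixed(4), ns.toFixed(4), combinedScore(0.9, nj, ns).toFixed(4));
```

One consequence of the default weights is worth keeping in mind when tuning `min_score`: a pair with no shared neighbors can reach at most 0.4 + 0.2 = 0.6, so it cannot clear the default threshold of 0.8 even with identical names and embeddings. Names that tokenize differently ("Corp" vs "Corporation") or contain only non-ASCII characters also score low on the name signal, since the tokenizer keeps only `a-z` and `0-9`.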
  • The merge_flagged audit event type definition used by graph_merge_suggestions when log_to_audit is enabled. Logs entity_a, entity_b, and reason to the audit trail.
    | (BaseEvent & {
        event: "merge_flagged";
        entity_a: string;
        entity_b: string;
        reason: string;
      })
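
The fragment above is one member of a larger event union. As a hedged sketch of how such a member might be consumed, the example below narrows on the `event` discriminant to pull flagged pairs out of a mixed audit log. The `BaseEvent` fields are assumed from the `appendAuditEvent` call in the handler (`timestamp`, `tenant_id`), and `OtherEvent` and `flaggedPairs` are hypothetical placeholders.

```typescript
// Assumed base fields, inferred from the appendAuditEvent call in the handler.
interface BaseEvent {
  timestamp: string;
  tenant_id: string;
}

type MergeFlagged = BaseEvent & {
  event: "merge_flagged";
  entity_a: string;
  entity_b: string;
  reason: string;
};

// Hypothetical placeholder for the other members of the union.
type OtherEvent = BaseEvent & { event: "other" };

type AuditEvent = MergeFlagged | OtherEvent;

// Collects the (entity_a, entity_b) pairs of every merge_flagged event.
function flaggedPairs(events: AuditEvent[]): Array<[string, string]> {
  const out: Array<[string, string]> = [];
  for (const e of events) {
    // Checking the discriminant narrows e to MergeFlagged inside the branch.
    if (e.event === "merge_flagged") {
      out.push([e.entity_a, e.entity_b]);
    }
  }
  return out;
}
```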
Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already indicate read-only and non-destructive. The description adds 'never auto-merges' and details the algorithm (embedding similarity, shared-neighbor overlap, name-token Jaccard). No contradictions.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Five short sentences covering purpose, read-only behavior, algorithm, same-type scope, and usage guidance. Every sentence provides essential information with no redundancy. Ideal conciseness for an AI agent.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a suggestion tool without output schema, the description covers purpose, safety, algorithm, and usage. Minor gap: does not describe return format (likely a list of pairs). But given the clarity and annotations, it is nearly complete.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100% with detailed descriptions for all 7 parameters. The description provides high-level context (e.g., default weights) but does not add significant meaning beyond the schema. Baseline 3 is appropriate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the verb 'surface' and the resource 'candidate pairs of entities likely to be duplicates'. It distinguishes from siblings like graph_merge (destructive) and graph_relate (soft alias) by mentioning alternative uses.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly advises using this tool for triaging entity-explosion before other specific operations, naming alternatives (graph_merge, graph_relate). Provides clear when-to-use and context.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/stevepridemore/graph-memory'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.