Skip to main content
Glama

Graph Merge Suggestions

graph_merge_suggestions
Read-onlyIdempotent

Surface candidate duplicate entity pairs using embedding similarity, neighbor overlap, and name token comparison. Read-only—never auto-merges. Helps triage entity duplication before consolidation.

Instructions

Surface candidate pairs of entities likely to be duplicates. Read-only — never auto-merges. Combines embedding similarity, shared-neighbor overlap, and name-token Jaccard. Same-type only. Use to triage entity-explosion before running graph_merge (destructive consolidation) or graph_relate with ALIAS_OF (soft alias).

Input Schema

TableJSON Schema
NameRequiredDescriptionDefault
entity_idNoScope to one entity's potential duplicates
entity_typeNoScope to one entity type (Person, Project, etc.)
min_scoreNoCombined-score threshold to surface (default 0.8)
min_embedding_similarityNoEmbedding-similarity floor for candidates (default 0.85)
limitNoMax suggestions to return (default 20, max 100)
weightsNoOverride default weights (0.4 / 0.4 / 0.2)
log_to_auditNoEmit merge_flagged audit events for surfaced pairs (default true)

Implementation Reference

  • Registration of the 'graph_merge_suggestions' MCP tool using the MCP SDK server.registerTool() with its description, input schema (Zod), and annotations.
    // ─── Tool: graph_merge_suggestions ───
    
    server.registerTool("graph_merge_suggestions", {
      title: "Graph Merge Suggestions",
      description:
        "Surface candidate pairs of entities likely to be duplicates. Read-only — never auto-merges. Combines embedding similarity, shared-neighbor overlap, and name-token Jaccard. Same-type only. Use to triage entity-explosion before running graph_merge (destructive consolidation) or graph_relate with ALIAS_OF (soft alias).",
      inputSchema: {
        entity_id: z.string().optional().describe("Scope to one entity's potential duplicates"),
        entity_type: z.string().optional().describe("Scope to one entity type (Person, Project, etc.)"),
        min_score: z.number().optional().describe("Combined-score threshold to surface (default 0.8)"),
        min_embedding_similarity: z.number().optional().describe("Embedding-similarity floor for candidates (default 0.85)"),
        limit: z.number().optional().describe("Max suggestions to return (default 20, max 100)"),
        weights: z.object({
          embedding: z.number().optional(),
          neighbor_jaccard: z.number().optional(),
          name: z.number().optional(),
        }).optional().describe("Override default weights (0.4 / 0.4 / 0.2)"),
        log_to_audit: z.boolean().optional().describe("Emit merge_flagged audit events for surfaced pairs (default true)"),
      },
      annotations: { readOnlyHint: true, destructiveHint: false, idempotentHint: true },
    }, async (args) => {
      try {
        const tenantId = currentTenant();
        const result = await client.mergeSuggestions(tenantId, {
          entity_id: args.entity_id,
          entity_type: args.entity_type as EntityType | undefined,
          min_score: args.min_score,
          min_embedding_similarity: args.min_embedding_similarity,
          limit: args.limit,
          weights: args.weights,
        });
    
        if (args.log_to_audit !== false) {
          for (const s of result.suggestions) {
            try {
              appendAuditEvent({
                event: "merge_flagged",
                timestamp: new Date().toISOString(),
                tenant_id: tenantId,
                entity_a: s.entity_a.id,
                entity_b: s.entity_b.id,
                reason: `score=${s.score} (emb=${s.signals.embedding_similarity}, neighbor_jaccard=${s.signals.neighbor_jaccard}, name=${s.signals.name_similarity})`,
              });
            } catch { /* audit is best-effort */ }
          }
        }
    
        return toolResult(result);
      } catch (err) {
        return toolError(`graph_merge_suggestions failed: ${err instanceof Error ? err.message : String(err)}`);
      }
    });
  • Tool handler for graph_merge_suggestions: calls client.mergeSuggestions() and optionally logs merge_flagged audit events for each suggestion.
    }, async (args) => {
      try {
        const tenantId = currentTenant();
        const result = await client.mergeSuggestions(tenantId, {
          entity_id: args.entity_id,
          entity_type: args.entity_type as EntityType | undefined,
          min_score: args.min_score,
          min_embedding_similarity: args.min_embedding_similarity,
          limit: args.limit,
          weights: args.weights,
        });
    
        if (args.log_to_audit !== false) {
          for (const s of result.suggestions) {
            try {
              appendAuditEvent({
                event: "merge_flagged",
                timestamp: new Date().toISOString(),
                tenant_id: tenantId,
                entity_a: s.entity_a.id,
                entity_b: s.entity_b.id,
                reason: `score=${s.score} (emb=${s.signals.embedding_similarity}, neighbor_jaccard=${s.signals.neighbor_jaccard}, name=${s.signals.name_similarity})`,
              });
            } catch { /* audit is best-effort */ }
          }
        }
    
        return toolResult(result);
      } catch (err) {
        return toolError(`graph_merge_suggestions failed: ${err instanceof Error ? err.message : String(err)}`);
      }
    });
  • Input schema (Zod) for graph_merge_suggestions: entity_id, entity_type, min_score, min_embedding_similarity, limit, weights (embedding/neighbor_jaccard/name), and log_to_audit.
    inputSchema: {
      entity_id: z.string().optional().describe("Scope to one entity's potential duplicates"),
      entity_type: z.string().optional().describe("Scope to one entity type (Person, Project, etc.)"),
      min_score: z.number().optional().describe("Combined-score threshold to surface (default 0.8)"),
      min_embedding_similarity: z.number().optional().describe("Embedding-similarity floor for candidates (default 0.85)"),
      limit: z.number().optional().describe("Max suggestions to return (default 20, max 100)"),
      weights: z.object({
        embedding: z.number().optional(),
        neighbor_jaccard: z.number().optional(),
        name: z.number().optional(),
      }).optional().describe("Override default weights (0.4 / 0.4 / 0.2)"),
      log_to_audit: z.boolean().optional().describe("Emit merge_flagged audit events for surfaced pairs (default true)"),
    },
  • Core mergeSuggestions() method on Neo4jClient that queries Neo4j for entity embeddings, finds similar same-type pairs via vector search, computes hub-aware neighbor Jaccard and name-token Jaccard, combines them into a weighted score, and returns suggestions sorted by score.
    async mergeSuggestions(
      tenantId: string,
      options: {
        entity_id?: string;
        entity_type?: EntityType;
        min_score?: number;
        min_embedding_similarity?: number;
        limit?: number;
        weights?: { embedding?: number; neighbor_jaccard?: number; name?: number };
      } = {},
    ): Promise<{
      suggestions: Array<{
        entity_a: { id: string; name: string; type: string; edge_count: number };
        entity_b: { id: string; name: string; type: string; edge_count: number };
        score: number;
        signals: {
          embedding_similarity: number;
          name_similarity: number;
          neighbor_jaccard: number;
          shared_neighbors: Array<{ id: string; relation: string; degree: number; weight: number }>;
        };
        recommended_action: "review";
      }>;
      total_pairs_evaluated: number;
      threshold_used: number;
      scope: { entity_id?: string; entity_type?: string; global: boolean };
    }> {
      const minScore = options.min_score ?? 0.8;
      const minEmbSim = options.min_embedding_similarity ?? 0.85;
      const limit = Math.min(options.limit ?? 20, 100);
      const w = {
        embedding: options.weights?.embedding ?? 0.4,
        neighbor_jaccard: options.weights?.neighbor_jaccard ?? 0.4,
        name: options.weights?.name ?? 0.2,
      };
      // Cap the number of seed entities to avoid runaway scans on large graphs.
      const MAX_SEEDS = 200;
    
      // Step 0: precompute per-entity edge degrees for hub-aware Jaccard.
      // A neighbor with degree D contributes weight 1/(1+log(D)) to the
      // intersection/union sums — so a 1-edge specific neighbor contributes
      // 1.0, while a 50-edge hub (e.g. the user's own Person node) contributes
      // ~0.20. Shared neighbors that are everyone's neighbor add little signal.
      const degreeRows = await this.run(
        `
        MATCH (n:Entity {tenant_id: $tenantId})
        RETURN n.id AS id, count{(n)-[]-(:Entity {tenant_id: $tenantId})} AS degree
        `,
        { tenantId },
      );
      const degrees = new Map<string, number>();
      for (const r of degreeRows) {
        degrees.set(String(r["id"]), Number(r["degree"] ?? 0));
      }
      const neighborWeight = (degree: number): number => {
        const d = Math.max(degree, 1);
        return 1 / (1 + Math.log(d));
      };
    
      // Step 1: collect seed entities. Constrained by entity_id / entity_type.
      const seedRows = await this.run(
        `
        MATCH (n:Entity {tenant_id: $tenantId})
        WHERE n.embedding IS NOT NULL
          AND ($entityId IS NULL OR n.id = $entityId)
          AND ($entityType IS NULL OR $entityType IN labels(n))
        RETURN n.id AS id,
               n.name AS name,
               n.embedding AS embedding,
               [l IN labels(n) WHERE l <> 'Entity'][0] AS type
        LIMIT $maxSeeds
        `,
        {
          tenantId,
          entityId: options.entity_id ?? null,
          entityType: options.entity_type ?? null,
          maxSeeds: MAX_SEEDS,
        },
      );
    
      // Step 2: for each seed, find vector-similar same-type neighbors and
      // build a deduped pair map. Pairs are canonicalised so a.id < b.id.
      type Pair = { idA: string; idB: string; embSim: number };
      const pairs = new Map<string, Pair>();
    
      for (const row of seedRows) {
        const seedId = String(row["id"]);
        const seedType = String(row["type"] ?? "");
        const seedEmbedding = row["embedding"] as number[] | null | undefined;
        if (!Array.isArray(seedEmbedding) || seedEmbedding.length === 0) continue;
        if (!seedType) continue;
    
        const similar = await this.vectorSearch(tenantId, seedEmbedding, {
          top_k: 10,
          min_similarity: minEmbSim,
          entity_types: [seedType as EntityType],
        });
    
        for (const candidate of similar) {
          if (candidate.id === seedId) continue; // self-match
          const [idA, idB] = seedId < candidate.id
            ? [seedId, candidate.id]
            : [candidate.id, seedId];
          const key = `${idA}::${idB}`;
          const existing = pairs.get(key);
          // Keep the higher embedding score if we see the same pair twice.
          if (!existing || candidate.score > existing.embSim) {
            pairs.set(key, { idA, idB, embSim: candidate.score });
          }
        }
      }
    
      const totalPairsEvaluated = pairs.size;
    
      // Step 3: per-pair feature query — names, types, edge counts, neighbors.
      const suggestions: Array<{
        entity_a: { id: string; name: string; type: string; edge_count: number };
        entity_b: { id: string; name: string; type: string; edge_count: number };
        score: number;
        signals: {
          embedding_similarity: number;
          name_similarity: number;
          neighbor_jaccard: number;
          shared_neighbors: Array<{ id: string; relation: string; degree: number; weight: number }>;
        };
        recommended_action: "review";
      }> = [];
    
      for (const pair of pairs.values()) {
        const featureRows = await this.run(
          `
          MATCH (a:Entity {tenant_id: $tenantId, id: $idA})
          MATCH (b:Entity {tenant_id: $tenantId, id: $idB})
          OPTIONAL MATCH (a)-[ra]-(na:Entity {tenant_id: $tenantId})
          WHERE na.id <> b.id
          WITH a, b, collect(DISTINCT na.id + '|' + type(ra)) AS neighborsA
          OPTIONAL MATCH (b)-[rb]-(nb:Entity {tenant_id: $tenantId})
          WHERE nb.id <> a.id
          WITH a, b, neighborsA,
               collect(DISTINCT nb.id + '|' + type(rb)) AS neighborsB
          RETURN a.name AS nameA,
                 [l IN labels(a) WHERE l <> 'Entity'][0] AS typeA,
                 b.name AS nameB,
                 [l IN labels(b) WHERE l <> 'Entity'][0] AS typeB,
                 neighborsA,
                 neighborsB
          `,
          { tenantId, idA: pair.idA, idB: pair.idB },
        );
    
        if (featureRows.length === 0) continue;
        const f = featureRows[0]!;
        const neighborsA = (f["neighborsA"] as string[] | null | undefined ?? [])
          .filter((s) => typeof s === "string" && s.length > 0);
        const neighborsB = (f["neighborsB"] as string[] | null | undefined ?? [])
          .filter((s) => typeof s === "string" && s.length > 0);
    
        const setA = new Set(neighborsA);
        const setB = new Set(neighborsB);
        const intersection = neighborsA.filter((x) => setB.has(x));
        const unionSet = new Set([...neighborsA, ...neighborsB]);
        // Hub-aware weighted Jaccard. Each neighbor entry is "id|relation"; we
        // look up the global degree of the neighbor entity (id portion) and
        // weight its contribution inversely. A pair that shares only a hub
        // (everyone's neighbor) gets little credit; a pair that shares a
        // specific low-degree neighbor gets near-full credit.
        const idOf = (entry: string): string => {
          const sep = entry.lastIndexOf("|");
          return sep >= 0 ? entry.slice(0, sep) : entry;
        };
        let weightedInter = 0;
        for (const entry of intersection) {
          const deg = degrees.get(idOf(entry)) ?? 1;
          weightedInter += neighborWeight(deg);
        }
        let weightedUnion = 0;
        for (const entry of unionSet) {
          const deg = degrees.get(idOf(entry)) ?? 1;
          weightedUnion += neighborWeight(deg);
        }
        const neighborJaccard = weightedUnion === 0 ? 0 : weightedInter / weightedUnion;
    
        const nameA = String(f["nameA"] ?? "");
        const nameB = String(f["nameB"] ?? "");
        const tokensA = new Set(
          nameA.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0),
        );
        const tokensB = new Set(
          nameB.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0),
        );
        const tokenInter = [...tokensA].filter((t) => tokensB.has(t)).length;
        const tokenUnion = new Set([...tokensA, ...tokensB]).size;
        const nameSim = tokenUnion === 0 ? 0 : tokenInter / tokenUnion;
    
        const score =
          w.embedding * pair.embSim +
          w.neighbor_jaccard * neighborJaccard +
          w.name * nameSim;
    
        if (score < minScore) continue;
    
        const sharedNeighbors = intersection.map((entry) => {
          const sep = entry.lastIndexOf("|");
          const id = sep >= 0 ? entry.slice(0, sep) : entry;
          const relation = sep >= 0 ? entry.slice(sep + 1) : "";
          const deg = degrees.get(id) ?? 1;
          return {
            id,
            relation,
            degree: deg,
            weight: Number(neighborWeight(deg).toFixed(4)),
          };
        });
    
        suggestions.push({
          entity_a: {
            id: pair.idA,
            name: nameA,
            type: String(f["typeA"] ?? "?"),
            edge_count: setA.size,
          },
          entity_b: {
            id: pair.idB,
            name: nameB,
            type: String(f["typeB"] ?? "?"),
            edge_count: setB.size,
          },
          score: Number(score.toFixed(4)),
          signals: {
            embedding_similarity: Number(pair.embSim.toFixed(4)),
            name_similarity: Number(nameSim.toFixed(4)),
            neighbor_jaccard: Number(neighborJaccard.toFixed(4)),
            shared_neighbors: sharedNeighbors,
          },
          recommended_action: "review",
        });
      }
    
      suggestions.sort((a, b) => b.score - a.score);
      const truncated = suggestions.slice(0, limit);
    
      return {
        suggestions: truncated,
        total_pairs_evaluated: totalPairsEvaluated,
        threshold_used: minScore,
        scope: {
          entity_id: options.entity_id,
          entity_type: options.entity_type,
          global: !options.entity_id && !options.entity_type,
        },
      };
    }
Behavior5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Annotations already declare readOnlyHint=true and idempotentHint=true, and the description reinforces that it is 'Read-only — never auto-merges'. It adds behavioral details about the combination of embedding similarity, shared-neighbor overlap, and name-token Jaccard, as well as the constraint 'Same-type only', which goes beyond the annotations.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is four sentences, each serving a distinct purpose: stating the function, clarifying safety, explaining the algorithm, and providing usage guidance. It is front-loaded, concise, and contains no extraneous information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

The description covers purpose, safety, algorithm, and usage well. However, it lacks any description of the output format (e.g., list of pairs with scores). With no output schema, an agent might need to infer the return structure. This minor gap prevents a perfect score.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100% so each parameter is explained individually. The description adds overarching context by explaining the three metrics that combine into the score, which informs the meaning of the 'weights' parameter and the filtering parameters (min_score, min_embedding_similarity). This enhances understanding beyond the schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states that the tool surfaces candidate duplicate entity pairs, explicitly notes it is read-only and never auto-merges, and distinguishes itself from sibling tools like graph_merge and graph_relate by specifying its purpose as triage before destructive or alias operations.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explicitly tells when to use the tool ('triage entity-explosion') and provides specific guidance about preceding graph_merge (destructive) or graph_relate (soft alias), making the usage context and alternatives clear.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

Install Server

Other Tools

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/stevepridemore/graph-memory'

If you have feedback or need assistance with the MCP directory API, please join our Discord server