Re-embed Entities
graph_reembed
Regenerate semantic-search embeddings for graph entities. Fills missing embeddings by default; use the force option to re-embed all entities after changing the embed recipe.
Instructions
Regenerate semantic-search embeddings for entities. By default only fills missing embeddings (idempotent, fast). With force=true, re-embeds every entity — use after changing the embed-text recipe (e.g. when richer fields are added). At ~10ms per entity, full re-embed of a few hundred nodes finishes in seconds.
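The default and forced modes described above can be sketched with a small in-memory model. This is only an illustration of the documented semantics, not the server's code: `SketchEntity`, `fakeEmbed`, and `reembedSketch` are hypothetical names, and the placeholder embedder stands in for the real embedding model.

```typescript
// Hypothetical in-memory stand-in for graph entities; not the real Neo4j schema.
interface SketchEntity {
  id: string;
  name: string;
  embedding: number[] | null;
}

// Placeholder embedder (an assumption): the real tool calls an embedding model.
function fakeEmbed(text: string): number[] {
  return [text.length];
}

// Mirrors the documented semantics: fill only missing embeddings unless force is set.
function reembedSketch(
  entities: SketchEntity[],
  force = false,
): { embedded: number; skipped: number } {
  let embedded = 0;
  for (const entity of entities) {
    if (!force && entity.embedding !== null) continue;
    entity.embedding = fakeEmbed(entity.name);
    embedded++;
  }
  // Entities still lacking an embedding after the pass.
  const skipped = entities.filter((e) => e.embedding === null).length;
  return { embedded, skipped };
}

const entities: SketchEntity[] = [
  { id: "a", name: "Ada", embedding: [3] },
  { id: "b", name: "Grace", embedding: null },
];

const first = reembedSketch(entities);        // fills only "b"
const second = reembedSketch(entities);       // nothing left to fill
const forced = reembedSketch(entities, true); // re-embeds both

// Rough cost estimate from the ~10 ms per entity figure: 300 entities.
const estimatedMs = 300 * 10;
console.log(first.embedded, second.embedded, forced.embedded, estimatedMs);
// prints: 1 0 2 3000
```

Running the default mode twice changes nothing the second time, which is the sense in which it is idempotent. The forced mode touches every entity regardless of its current state.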
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| force | No | Re-embed every entity, even ones that already have an embedding. | false |
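As a sketch of how a client might invoke the tool, the snippet below builds the JSON-RPC body of an MCP `tools/call` request. It assumes the server is reached over a standard MCP transport; `ToolCallRequest` and `buildReembedRequest` are hypothetical helper names, and sending the request is left out.

```typescript
// Shape of an MCP tools/call request body (JSON-RPC 2.0).
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper: builds the request body only; transport is out of scope.
function buildReembedRequest(id: number, force?: boolean): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: {
      name: "graph_reembed",
      // Omitting force lets the schema default (false) apply.
      arguments: force === undefined ? {} : { force },
    },
  };
}

const fillMissing = buildReembedRequest(1);      // default: fill missing only
const reembedAll = buildReembedRequest(2, true); // force a full re-embed
console.log(JSON.stringify(reembedAll));
```

Leaving `force` out is usually the safer call, since the default pass only touches entities that lack an embedding.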
Implementation Reference
- src/mcp-server/index.ts:1248-1276 (registration) Tool registration for graph_reembed. Defines the tool title, description, input schema (optional 'force' boolean), and the handler that delegates to client.backfillEmbeddings(). Admin users re-embed across all tenants; non-admin users re-embed only their own tenant.
```typescript
server.registerTool("graph_reembed", {
  title: "Re-embed Entities",
  description:
    "Regenerate semantic-search embeddings for entities. By default only fills missing embeddings " +
    "(idempotent, fast). With force=true, re-embeds every entity — use after changing the embed-text " +
    "recipe (e.g. when richer fields are added). At ~10ms per entity, full re-embed of a few hundred " +
    "nodes finishes in seconds.",
  inputSchema: {
    force: z
      .boolean()
      .optional()
      .default(false)
      .describe("Re-embed every entity, even ones that already have an embedding. Default false."),
  },
  annotations: { idempotentHint: true },
}, async ({ force }) => {
  try {
    const tenantId = currentTenant();
    // Admins re-embed across all tenants; others re-embed only their own.
    const opts: { force?: boolean; tenantId?: string } = { force: force === true };
    if (!isAdminTenant(tenantId)) opts.tenantId = tenantId;
    const result = await client.backfillEmbeddings(opts);
    return toolResult({ ...result, force: force === true, scope: isAdminTenant(tenantId) ? "all-tenants" : tenantId });
  } catch (err) {
    const e = err instanceof Error ? err : new Error(String(err));
    return toolError(`graph_reembed failed: ${e.message}`);
  }
});
```
- src/shared/neo4j-client.ts:2049-2140 (handler) backfillEmbeddings(): the actual implementation that re-embeds entities. Queries entities without embeddings (or all entities if force=true), generates embeddings via embedText/buildEmbedText, and writes them back to Neo4j. Supports tenant-scoped operation for multi-tenant isolation.
```typescript
/** Backfill embeddings for entities that don't have one. With force=true,
 * re-embed every entity (e.g. after changing the embed-text recipe).
 * Embeds richer text (name + type + select properties) via buildEmbedText
 * so semantically similar concepts cluster more tightly.
 *
 * When `tenantId` is supplied, only that tenant's entities are touched —
 * this is what the graph_reembed MCP tool uses. The startup backfill calls
 * this with no tenantId (all-tenants pass), since it's an admin operation
 * that reads only public-shape properties (name, type, subtype, etc.) and
 * doesn't expose any tenant's data outside its own boundary. */
async backfillEmbeddings(
  options: { tenantId?: string; batchSize?: number; force?: boolean } = {},
): Promise<{ embedded: number; skipped: number; errors: number }> {
  const batchSize = options.batchSize ?? 50;
  const force = options.force ?? false;
  const tenantId = options.tenantId;
  let embedded = 0;
  let errors = 0;
  // Track ids we've already processed in this run to ensure forward progress
  // across batches even when nothing is null (force mode just iterates all).
  // Note: ids are namespaced internally by tenant_id when tenant scoped, but
  // the Set holds raw ids — that's fine because the WHERE clause already
  // restricts to the same tenant.
  const processed = new Set<string>();
  const tenantClause = tenantId ? "AND n.tenant_id = $tenantId" : "";

  while (true) {
    const rows = await this.run(
      `
      MATCH (n:Entity)
      WHERE ${force ? "true" : "n.embedding IS NULL"}
        AND NOT n.id IN $processed
        ${tenantClause}
      RETURN n.id AS id, n.tenant_id AS tenant_id, n.name AS name,
             [l IN labels(n) WHERE l <> 'Entity'][0] AS type,
             properties(n) AS props
      LIMIT $batchSize
      `,
      { batchSize, processed: [...processed], ...(tenantId && { tenantId }) },
    );
    if (rows.length === 0) break;

    // Embed in parallel using the rich-context recipe
    const embeddings = await Promise.all(
      rows.map(async (r) => {
        const name = String(r["name"] ?? "");
        if (!name) return null;
        const type = r["type"] ? String(r["type"]) : undefined;
        const props = (r["props"] as Record<string, unknown>) ?? {};
        try {
          return await embedText(buildEmbedText(name, type, props));
        } catch {
          return null;
        }
      }),
    );

    // Write back (matched on tenant_id + id to avoid cross-tenant collisions)
    for (let i = 0; i < rows.length; i++) {
      const id = String(rows[i]["id"]);
      const rowTenantId = String(rows[i]["tenant_id"] ?? "");
      processed.add(id);
      const emb = embeddings[i];
      if (!emb) {
        errors++;
        continue;
      }
      try {
        await this.run(
          `MATCH (n:Entity {tenant_id: $tenantId, id: $id}) SET n.embedding = $embedding`,
          { tenantId: rowTenantId, id, embedding: emb },
        );
        embedded++;
      } catch (err) {
        debugLogClient(`backfill write failed for ${id} (tenant=${rowTenantId}): ${err instanceof Error ? err.message : String(err)}`);
        errors++;
      }
    }
    if (rows.length < batchSize) break;
  }

  // Count any remaining nulls (only meaningful when force=false)
  const remaining = await this.run(
    `MATCH (n:Entity) WHERE n.embedding IS NULL ${tenantClause} RETURN count(n) AS c`,
    tenantId ? { tenantId } : {},
  );
  const skipped = Number(remaining[0]?.["c"] ?? 0);
  return { embedded, skipped, errors };
}
```
- src/shared/embeddings.ts:39-46 (helper) embedText(): core embedding function that uses a Hugging Face transformers pipeline (bge-small-en-v1.5) to produce a 384-dimensional embedding vector.
```typescript
export async function embedText(text: string): Promise<number[]> {
  const cleaned = text.trim();
  if (!cleaned) return new Array<number>(EMBEDDING_DIM).fill(0);
  const e = await getEmbedder();
  const result = await e(cleaned, { pooling: "mean", normalize: true });
  // result.data is a Float32Array of length 384
  return Array.from(result.data as Float32Array);
}
```
- src/shared/embeddings.ts:66-81 (helper) buildEmbedText(): constructs the rich text input for the embedder from the entity name, type, and high-signal properties.
```typescript
export function buildEmbedText(
  name: string,
  type: string | undefined,
  properties: Record<string, unknown> = {},
): string {
  const parts: string[] = [name.trim()];
  if (type) parts.push(type);
  // High-signal fields, in priority order. First non-empty wins.
  for (const key of ["subtype", "description", "role", "category", "what", "specialty"]) {
    const v = properties[key];
    if (typeof v === "string" && v.trim()) {
      parts.push(v.trim());
    }
  }
  return parts.join(" — ");
}
```
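To make the embed-text recipe concrete, the sketch below runs a local copy of buildEmbedText() on a made-up entity. The sample entity and the names `SEPARATOR`, `sampleText`, and `sampleParts` are illustrative assumptions. As written, the loop appears to append every non-empty string field in priority order, even though its comment says "First non-empty wins", so the example checks what the code does, not what the comment says.

```typescript
// Separator used by the listing: space, em dash (U+2014), space.
const SEPARATOR = " \u2014 ";

// Local copy of buildEmbedText() from the listing, reproduced so the example runs on its own.
function buildEmbedText(
  name: string,
  type: string | undefined,
  properties: Record<string, unknown> = {},
): string {
  const parts: string[] = [name.trim()];
  if (type) parts.push(type);
  // Every non-empty string field from this list is appended, in this order.
  for (const key of ["subtype", "description", "role", "category", "what", "specialty"]) {
    const v = properties[key];
    if (typeof v === "string" && v.trim()) {
      parts.push(v.trim());
    }
  }
  return parts.join(SEPARATOR);
}

// Hypothetical entity: property order in the object does not matter,
// and non-string values such as `employees` are ignored.
const sampleText = buildEmbedText("  Acme Corp ", "Company", {
  category: "manufacturing",
  description: "makes anvils",
  employees: 120,
});

const sampleParts = sampleText.split(SEPARATOR);
console.log(sampleParts);
// prints: [ 'Acme Corp', 'Company', 'makes anvils', 'manufacturing' ]
```

Because `description` precedes `category` in the key list, it comes first in the output regardless of how the properties object is ordered. Changing this key list is the kind of recipe change that would call for a run with force=true.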