Validate Graph Entities
graph_validate
Scan recently extracted entities and edges for quality issues: generic names, type mismatches, near-duplicates, and extreme confidence values. Catch bad data before it settles into the graph.
Instructions
Scan recently extracted entities and edges for quality issues: generic names, reference language, type mismatches, near-duplicate names, and extreme confidence values. Call this after a dream process extraction batch to catch bad data before it settles into the graph. Returns up to `max_issues` records of shape `{entity_id, name, type, issue, severity}`, where severity is high, medium, or low. The tool is read-only; pair it with `graph_delete` or `graph_unmerge` to act on flagged items.
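To make the record shape concrete, here is a minimal TypeScript sketch of how a caller might type and triage the returned issues. The field names come from the description above; the `groupBySeverity` helper and the sample records are hypothetical and only illustrate one way to consume the output.

```typescript
// Record shape as stated in the tool description.
type Severity = "high" | "medium" | "low";

interface ValidationIssue {
  entity_id: string;
  name: string;
  type: string;
  issue: string;
  severity: Severity;
}

// Hypothetical helper (not part of the server): bucket issues by severity
// so that high-severity items can be reviewed first.
function groupBySeverity(issues: ValidationIssue[]): Record<Severity, ValidationIssue[]> {
  const groups: Record<Severity, ValidationIssue[]> = { high: [], medium: [], low: [] };
  for (const issue of issues) {
    groups[issue.severity].push(issue);
  }
  return groups;
}

// Made-up sample records, for illustration only.
const sampleIssues: ValidationIssue[] = [
  { entity_id: "e1", name: "it", type: "Concept", issue: "name too short (< 3 chars)", severity: "high" },
  { entity_id: "e2", name: "Billing API", type: "Service", issue: "near-duplicate: same name as entity e3", severity: "medium" },
  { entity_id: "e4", name: "Foo", type: "Concept", issue: "isolated entity with no edges", severity: "low" },
];

const grouped = groupBySeverity(sampleIssues);
console.log(grouped.high.map((i) => i.entity_id));
```

Because the tool only reports, any deletion or unmerge still has to be issued as a separate call for each flagged `entity_id`.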
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| source_session | No | Limit checks to entities extracted in this session. Omit to scan the whole graph. | |
| max_issues | No | Maximum number of issues to return. | 50 |
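As a rough illustration of these two arguments, the sketch below builds an MCP `tools/call` request for `graph_validate`. The `buildValidateCall` helper is hypothetical; the 1 to 200 integer range for `max_issues` is taken from the schema shown in the Implementation Reference, and the local range check merely mirrors what the server-side schema would reject.

```typescript
// Arguments accepted by graph_validate, per the Input Schema table.
interface GraphValidateArgs {
  source_session?: string;
  max_issues?: number;
}

// Hypothetical client-side helper: applies the default of 50 and rejects
// values the server schema (integer, 1..200) would not accept.
function buildValidateCall(args: GraphValidateArgs = {}, id: number = 1) {
  const maxIssues = args.max_issues ?? 50;
  if (!Number.isInteger(maxIssues) || maxIssues < 1 || maxIssues > 200) {
    throw new RangeError(`max_issues must be an integer from 1 to 200, got ${maxIssues}`);
  }
  const toolArgs: GraphValidateArgs = { max_issues: maxIssues };
  if (args.source_session !== undefined) {
    toolArgs.source_session = args.source_session; // omit to scan the whole graph
  }
  return {
    jsonrpc: "2.0" as const,
    id,
    method: "tools/call" as const,
    params: { name: "graph_validate", arguments: toolArgs },
  };
}

// Example: validate only what one (made-up) session extracted.
const validateRequest = buildValidateCall({ source_session: "session-123", max_issues: 20 });
console.log(JSON.stringify(validateRequest, null, 2));
```

Most MCP client libraries build this envelope for you, so in practice only the `arguments` object needs to be supplied.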
Implementation Reference
- src/mcp-server/index.ts:810-958 (registration) Registration of the 'graph_validate' tool via server.registerTool() with inputSchema, annotations, and handler.

```typescript
server.registerTool("graph_validate", {
  title: "Validate Graph Entities",
  description:
    "Scan recently extracted entities and edges for quality issues: generic names, reference language, " +
    "type mismatches, near-duplicate names, and extreme confidence values. " +
    "Call this after a dream process extraction batch to catch bad data before it settles into the graph. " +
    "Returns up to `max_issues` records of shape `{entity_id, name, type, issue, severity}` where severity is high/medium/low. " +
    "Read-only — pair with graph_delete or graph_unmerge to act on flagged items.",
  inputSchema: {
    source_session: z
      .string()
      .optional()
      .describe("Limit checks to entities extracted in this session. Omit to scan the whole graph."),
    max_issues: z
      .number()
      .int()
      .min(1)
      .max(200)
      .optional()
      .default(50)
      .describe("Maximum number of issues to return (default 50)."),
  },
  annotations: { readOnlyHint: true },
}, async ({ source_session, max_issues = 50 }) => {
  const issues: Array<{ entity_id: string; name: string; type: string; issue: string; severity: "high" | "medium" | "low" }> = [];

  try {
    const tenantId = currentTenant();

    // Session filter: optional additional narrowing within the tenant.
    const sessionAndForOrphan = source_session
      ? `AND (n.source_session = $session OR EXISTS { MATCH (n)-[r]-() WHERE r.source_session = $session })`
      : "";
    const sessionAndForRest = source_session
      ? `AND (n.source_session = $session OR EXISTS { MATCH (n)-[r]-() WHERE r.source_session = $session })`
      : "";
    const params: Record<string, unknown> = source_session
      ? { tenantId, session: source_session }
      : { tenantId };

    // 1. Generic / blocklisted names (tenant-scoped)
    const genericRows = await client.runReadQuery(`
      MATCH (n:Entity {tenant_id: $tenantId})
      WHERE 1=1 ${sessionAndForRest}
      WITH n, toLower(trim(n.name)) AS lname
      WHERE size(lname) < 3 OR lname IN $blocklist
      RETURN n.id AS id, n.name AS name, labels(n) AS labels, n.confidence AS confidence
      LIMIT $limit
    `, { ...params, blocklist: [...GENERIC_NAME_BLOCKLIST], limit: Math.ceil(max_issues / 4) });

    for (const row of genericRows) {
      const name = String(row["name"] ?? "");
      const type = ((row["labels"] as string[]) ?? []).find((l) => l !== "Entity") ?? "?";
      const lname = name.toLowerCase().trim();
      const reason = lname.length < 3
        ? "name too short (< 3 chars)"
        : `generic blocklisted name "${lname}"`;
      issues.push({ entity_id: String(row["id"]), name, type, issue: reason, severity: "high" });
    }

    // 2. Reference-language names (tenant-scoped)
    const allNameRows = await client.runReadQuery(`
      MATCH (n:Entity {tenant_id: $tenantId})
      WHERE 1=1 ${sessionAndForRest}
      RETURN n.id AS id, n.name AS name, labels(n) AS labels, n.confidence AS confidence
      LIMIT 2000
    `, params);

    for (const row of allNameRows) {
      if (issues.length >= max_issues) break;
      const name = String(row["name"] ?? "");
      const lname = name.toLowerCase().trim();
      const type = ((row["labels"] as string[]) ?? []).find((l) => l !== "Entity") ?? "?";
      for (const prefix of REFERENCE_PREFIXES) {
        if (lname.startsWith(prefix) && lname.length < 40) {
          issues.push({
            entity_id: String(row["id"]),
            name,
            type,
            issue: `name starts with reference language "${prefix.trim()}" — extract the noun instead`,
            severity: "high",
          });
          break;
        }
      }
    }

    // 3. Orphaned new entities (tenant-scoped)
    const orphanRows = await client.runReadQuery(`
      MATCH (n:Entity {tenant_id: $tenantId})
      WHERE NOT (n)-[]-()
        AND n.confidence <= 0.4
        AND n.times_mentioned <= 1
        ${sessionAndForOrphan}
      RETURN n.id AS id, n.name AS name, labels(n) AS labels, n.confidence AS confidence
      LIMIT $limit
    `, { ...params, limit: Math.ceil(max_issues / 4) });

    for (const row of orphanRows) {
      if (issues.length >= max_issues) break;
      const type = ((row["labels"] as string[]) ?? []).find((l) => l !== "Entity") ?? "?";
      issues.push({
        entity_id: String(row["id"]),
        name: String(row["name"]),
        type,
        issue: `isolated entity with no edges and confidence ${Number(row["confidence"] ?? 0).toFixed(2)} — may be a spurious extraction`,
        severity: "low",
      });
    }

    // 4. Near-duplicate names (tenant-scoped, case-insensitive)
    const dupRows = await client.runReadQuery(`
      MATCH (a:Entity {tenant_id: $tenantId}), (b:Entity {tenant_id: $tenantId})
      WHERE id(a) < id(b)
        AND toLower(trim(a.name)) = toLower(trim(b.name))
        AND a.id <> b.id
      RETURN a.id AS id_a, a.name AS name_a, labels(a) AS labels_a,
             b.id AS id_b, b.name AS name_b, labels(b) AS labels_b
      LIMIT $limit
    `, { tenantId, limit: Math.ceil(max_issues / 4) });

    for (const row of dupRows) {
      if (issues.length >= max_issues) break;
      const typeA = ((row["labels_a"] as string[]) ?? []).find((l) => l !== "Entity") ?? "?";
      const typeB = ((row["labels_b"] as string[]) ?? []).find((l) => l !== "Entity") ?? "?";
      issues.push({
        entity_id: String(row["id_a"]),
        name: String(row["name_a"]),
        type: typeA,
        issue: `near-duplicate: same name as entity ${row["id_b"]} (${row["name_b"]}, type ${typeB}) — consider merging with graph_relate ALIAS_OF or deleting one`,
        severity: "medium",
      });
    }

    const summary = {
      total_issues: issues.length,
      by_severity: {
        high: issues.filter((i) => i.severity === "high").length,
        medium: issues.filter((i) => i.severity === "medium").length,
        low: issues.filter((i) => i.severity === "low").length,
      },
      scope: source_session ? `session:${source_session}` : "full graph",
      issues: issues.slice(0, max_issues),
    };

    return toolResult(summary);
  } catch (err) {
    const e = err instanceof Error ? err : new Error(String(err));
    return toolError(`graph_validate failed: ${e.message}`);
  }
});
```

- src/mcp-server/index.ts:818-832 (schema) Input schema (Zod) for graph_validate: optional source_session string and max_issues number (1-200, default 50).

```typescript
inputSchema: {
  source_session: z
    .string()
    .optional()
    .describe("Limit checks to entities extracted in this session. Omit to scan the whole graph."),
  max_issues: z
    .number()
    .int()
    .min(1)
    .max(200)
    .optional()
    .default(50)
    .describe("Maximum number of issues to return (default 50)."),
},
annotations: { readOnlyHint: true },
```

- src/mcp-server/index.ts:833-958 (handler) Handler function that performs validation checks: generic name blocklist, reference-language prefixes, orphaned entities, and near-duplicate names, returning issues with severity levels.

```typescript
}, async ({ source_session, max_issues = 50 }) => {
  const issues: Array<{ entity_id: string; name: string; type: string; issue: string; severity: "high" | "medium" | "low" }> = [];

  try {
    const tenantId = currentTenant();

    // Session filter: optional additional narrowing within the tenant.
    const sessionAndForOrphan = source_session
      ? `AND (n.source_session = $session OR EXISTS { MATCH (n)-[r]-() WHERE r.source_session = $session })`
      : "";
    const sessionAndForRest = source_session
      ? `AND (n.source_session = $session OR EXISTS { MATCH (n)-[r]-() WHERE r.source_session = $session })`
      : "";
    const params: Record<string, unknown> = source_session
      ? { tenantId, session: source_session }
      : { tenantId };

    // 1. Generic / blocklisted names (tenant-scoped)
    const genericRows = await client.runReadQuery(`
      MATCH (n:Entity {tenant_id: $tenantId})
      WHERE 1=1 ${sessionAndForRest}
      WITH n, toLower(trim(n.name)) AS lname
      WHERE size(lname) < 3 OR lname IN $blocklist
      RETURN n.id AS id, n.name AS name, labels(n) AS labels, n.confidence AS confidence
      LIMIT $limit
    `, { ...params, blocklist: [...GENERIC_NAME_BLOCKLIST], limit: Math.ceil(max_issues / 4) });

    for (const row of genericRows) {
      const name = String(row["name"] ?? "");
      const type = ((row["labels"] as string[]) ?? []).find((l) => l !== "Entity") ?? "?";
      const lname = name.toLowerCase().trim();
      const reason = lname.length < 3
        ? "name too short (< 3 chars)"
        : `generic blocklisted name "${lname}"`;
      issues.push({ entity_id: String(row["id"]), name, type, issue: reason, severity: "high" });
    }

    // 2. Reference-language names (tenant-scoped)
    const allNameRows = await client.runReadQuery(`
      MATCH (n:Entity {tenant_id: $tenantId})
      WHERE 1=1 ${sessionAndForRest}
      RETURN n.id AS id, n.name AS name, labels(n) AS labels, n.confidence AS confidence
      LIMIT 2000
    `, params);

    for (const row of allNameRows) {
      if (issues.length >= max_issues) break;
      const name = String(row["name"] ?? "");
      const lname = name.toLowerCase().trim();
      const type = ((row["labels"] as string[]) ?? []).find((l) => l !== "Entity") ?? "?";
      for (const prefix of REFERENCE_PREFIXES) {
        if (lname.startsWith(prefix) && lname.length < 40) {
          issues.push({
            entity_id: String(row["id"]),
            name,
            type,
            issue: `name starts with reference language "${prefix.trim()}" — extract the noun instead`,
            severity: "high",
          });
          break;
        }
      }
    }

    // 3. Orphaned new entities (tenant-scoped)
    const orphanRows = await client.runReadQuery(`
      MATCH (n:Entity {tenant_id: $tenantId})
      WHERE NOT (n)-[]-()
        AND n.confidence <= 0.4
        AND n.times_mentioned <= 1
        ${sessionAndForOrphan}
      RETURN n.id AS id, n.name AS name, labels(n) AS labels, n.confidence AS confidence
      LIMIT $limit
    `, { ...params, limit: Math.ceil(max_issues / 4) });

    for (const row of orphanRows) {
      if (issues.length >= max_issues) break;
      const type = ((row["labels"] as string[]) ?? []).find((l) => l !== "Entity") ?? "?";
      issues.push({
        entity_id: String(row["id"]),
        name: String(row["name"]),
        type,
        issue: `isolated entity with no edges and confidence ${Number(row["confidence"] ?? 0).toFixed(2)} — may be a spurious extraction`,
        severity: "low",
      });
    }

    // 4. Near-duplicate names (tenant-scoped, case-insensitive)
    const dupRows = await client.runReadQuery(`
      MATCH (a:Entity {tenant_id: $tenantId}), (b:Entity {tenant_id: $tenantId})
      WHERE id(a) < id(b)
        AND toLower(trim(a.name)) = toLower(trim(b.name))
        AND a.id <> b.id
      RETURN a.id AS id_a, a.name AS name_a, labels(a) AS labels_a,
             b.id AS id_b, b.name AS name_b, labels(b) AS labels_b
      LIMIT $limit
    `, { tenantId, limit: Math.ceil(max_issues / 4) });

    for (const row of dupRows) {
      if (issues.length >= max_issues) break;
      const typeA = ((row["labels_a"] as string[]) ?? []).find((l) => l !== "Entity") ?? "?";
      const typeB = ((row["labels_b"] as string[]) ?? []).find((l) => l !== "Entity") ?? "?";
      issues.push({
        entity_id: String(row["id_a"]),
        name: String(row["name_a"]),
        type: typeA,
        issue: `near-duplicate: same name as entity ${row["id_b"]} (${row["name_b"]}, type ${typeB}) — consider merging with graph_relate ALIAS_OF or deleting one`,
        severity: "medium",
      });
    }

    const summary = {
      total_issues: issues.length,
      by_severity: {
        high: issues.filter((i) => i.severity === "high").length,
        medium: issues.filter((i) => i.severity === "medium").length,
        low: issues.filter((i) => i.severity === "low").length,
      },
      scope: source_session ? `session:${source_session}` : "full graph",
      issues: issues.slice(0, max_issues),
    };

    return toolResult(summary);
  } catch (err) {
    const e = err instanceof Error ? err : new Error(String(err));
    return toolError(`graph_validate failed: ${e.message}`);
  }
});
```

- src/mcp-server/index.ts:798-808 (helper) Supporting constants: GENERIC_NAME_BLOCKLIST (set of generic terms that should never be entity names) and REFERENCE_PREFIXES (prefixes indicating reference language).

```typescript
const GENERIC_NAME_BLOCKLIST = new Set([
  "it", "this", "that", "the", "a", "an", "some", "thing", "things", "item", "items",
  "something", "anything", "everything", "nothing", "one", "other", "another",
  "each", "all", "both", "they", "them", "we", "i", "you", "he", "she",
  "data", "info", "information", "here", "there", "now", "then", "later",
  "unknown", "various", "server", "client", "system", "process", "service", "tool",
]);

// Prefixes that indicate reference language rather than entity names
const REFERENCE_PREFIXES = ["the ", "this ", "that ", "a ", "an ", "my ", "our ", "your ", "their "];
```
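
The two name checks can be approximated without a database. The sketch below is a hypothetical `classifyName` function, not part of the server. It applies the same rules the handler uses for checks 1 and 2 (lowercased, trimmed name shorter than 3 characters or on the blocklist; or starting with a reference prefix while under 40 characters). The blocklist here is an abridged subset chosen for illustration. Unlike the handler, which can report the same entity under both checks, this sketch returns only the first match.

```typescript
// Abridged, illustrative subset of GENERIC_NAME_BLOCKLIST.
const SAMPLE_BLOCKLIST = new Set(["this", "that", "thing", "data", "server", "unknown"]);

// Same prefixes as REFERENCE_PREFIXES above; the trailing space matters.
const SAMPLE_PREFIXES = ["the ", "this ", "that ", "a ", "an ", "my ", "our ", "your ", "their "];

// Hypothetical helper: returns a reason string if the name would be flagged,
// or null if it passes both name checks.
function classifyName(name: string): string | null {
  const lname = name.toLowerCase().trim();
  if (lname.length < 3) return "name too short (< 3 chars)";
  if (SAMPLE_BLOCKLIST.has(lname)) return `generic blocklisted name "${lname}"`;
  if (lname.length < 40) {
    const prefix = SAMPLE_PREFIXES.find((p) => lname.startsWith(p));
    if (prefix !== undefined) {
      return `name starts with reference language "${prefix.trim()}"`;
    }
  }
  return null;
}

for (const candidate of ["It", "Server", "the billing API", "Another", "Neo4j"]) {
  console.log(candidate, "=>", classifyName(candidate));
}
```

Because each prefix ends with a space, a name such as "Another" is not caught by the "an " prefix; only names that begin with a separate article or possessive word are flagged.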