Validate Graph Entities
graph_validate
Scan recently extracted entities and edges for quality issues like generic names, type mismatches, and near-duplicates. Returns issues with severity ratings to help catch bad data before it enters the graph.
Instructions
Scan recently extracted entities and edges for quality issues: generic names, reference language, type mismatches, near-duplicate names, and extreme confidence values. Call this after a dream process extraction batch to catch bad data before it settles into the graph. Returns up to max_issues records of shape {entity_id, name, type, issue, severity} where severity is high/medium/low. Read-only — pair with graph_delete or graph_unmerge to act on flagged items.
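A caller will usually want to review high-severity records first. The following is a minimal sketch of that triage step, assuming only the record shape documented above; `ValidationIssue`, `groupBySeverity`, and the sample records are hypothetical and not part of the server.

```typescript
// Record shape as documented above: {entity_id, name, type, issue, severity}.
type Severity = "high" | "medium" | "low";

interface ValidationIssue {
  entity_id: string;
  name: string;
  type: string;
  issue: string;
  severity: Severity;
}

// Hypothetical helper: bucket issues by severity so that high-severity
// items can be reviewed before anything is deleted or unmerged.
function groupBySeverity(issues: ValidationIssue[]): Record<Severity, ValidationIssue[]> {
  const groups: Record<Severity, ValidationIssue[]> = { high: [], medium: [], low: [] };
  for (const issue of issues) {
    groups[issue.severity].push(issue);
  }
  return groups;
}

// Illustrative data only; these IDs, names, and types are made up.
const sampleIssues: ValidationIssue[] = [
  { entity_id: "e1", name: "it", type: "Concept", issue: "name too short (< 3 chars)", severity: "high" },
  { entity_id: "e2", name: "Redis", type: "Tool", issue: "near-duplicate: same name as entity e3", severity: "medium" },
];

const grouped = groupBySeverity(sampleIssues);
console.log(grouped.high.map((i) => i.entity_id));
```

Because the tool is read-only, any follow-up action on the grouped records (for example via graph_delete or graph_unmerge) remains an explicit, separate call.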
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| source_session | No | Limit checks to entities extracted in this session. Omit to scan the whole graph. | |
| max_issues | No | Maximum number of issues to return (integer, 1 to 200). | 50 |
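The registration code constrains max_issues to an integer between 1 and 200 with a default of 50. The server does this with a Zod schema; the sketch below restates the same rules in plain TypeScript so they can be checked on the client side before a call. `GraphValidateInput` and `normalizeInput` are hypothetical names used only for illustration.

```typescript
// Input shape from the schema table above.
interface GraphValidateInput {
  source_session?: string;
  max_issues?: number;
}

interface NormalizedInput {
  source_session?: string;
  max_issues: number;
}

// Hypothetical client-side check mirroring the documented constraints:
// max_issues must be an integer in [1, 200] and defaults to 50.
function normalizeInput(input: GraphValidateInput): NormalizedInput {
  const maxIssues = input.max_issues ?? 50;
  if (!Number.isInteger(maxIssues) || maxIssues < 1 || maxIssues > 200) {
    throw new RangeError(`max_issues must be an integer between 1 and 200, got ${maxIssues}`);
  }
  const normalized: NormalizedInput = { max_issues: maxIssues };
  // Only carry the session filter through when the caller supplied one;
  // omitting it means "scan the whole graph".
  if (input.source_session !== undefined) {
    normalized.source_session = input.source_session;
  }
  return normalized;
}

console.log(normalizeInput({}));
console.log(normalizeInput({ source_session: "session-123", max_issues: 20 }));
```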
Implementation Reference
- src/mcp-server/index.ts:798-836 (registration) Tool registration for graph_validate. Defines title, description, inputSchema (Zod-based), and readOnlyHint annotation. The handler is the async function starting at line 836.
```typescript
// ─── Tool: graph_validate ───

// Single-word generic terms that should never be entity names
const GENERIC_NAME_BLOCKLIST = new Set([
  "it", "this", "that", "the", "a", "an", "some", "thing", "things", "item", "items",
  "something", "anything", "everything", "nothing", "one", "other", "another",
  "each", "all", "both", "they", "them", "we", "i", "you", "he", "she",
  "data", "info", "information", "here", "there", "now", "then", "later",
  "unknown", "various", "server", "client", "system", "process", "service", "tool",
]);

// Prefixes that indicate reference language rather than entity names
const REFERENCE_PREFIXES = ["the ", "this ", "that ", "a ", "an ", "my ", "our ", "your ", "their "];

server.registerTool("graph_validate", {
  title: "Validate Graph Entities",
  description:
    "Scan recently extracted entities and edges for quality issues: generic names, reference language, " +
    "type mismatches, near-duplicate names, and extreme confidence values. " +
    "Call this after a dream process extraction batch to catch bad data before it settles into the graph. " +
    "Returns up to `max_issues` records of shape `{entity_id, name, type, issue, severity}` where severity is high/medium/low. " +
    "Read-only — pair with graph_delete or graph_unmerge to act on flagged items.",
  inputSchema: {
    source_session: z
      .string()
      .optional()
      .describe("Limit checks to entities extracted in this session. Omit to scan the whole graph."),
    max_issues: z
      .number()
      .int()
      .min(1)
      .max(200)
      .optional()
      .default(50)
      .describe("Maximum number of issues to return (default 50)."),
  },
  annotations: { readOnlyHint: true },
}, async ({ source_session, max_issues = 50 }) => {
```

- src/mcp-server/index.ts:800-808 (handler) GENERIC_NAME_BLOCKLIST constant: a Set of single-word generic terms that should never be entity names (e.g., 'it', 'this', 'data', 'unknown'). Used by the graph_validate handler to flag low-quality entities.
```typescript
// Single-word generic terms that should never be entity names
const GENERIC_NAME_BLOCKLIST = new Set([
  "it", "this", "that", "the", "a", "an", "some", "thing", "things", "item", "items",
  "something", "anything", "everything", "nothing", "one", "other", "another",
  "each", "all", "both", "they", "them", "we", "i", "you", "he", "she",
  "data", "info", "information", "here", "there", "now", "then", "later",
  "unknown", "various", "server", "client", "system", "process", "service", "tool",
]);
```

- src/mcp-server/index.ts:810-811 (handler) REFERENCE_PREFIXES constant: prefixes that indicate reference language rather than entity names (e.g., 'the ', 'this ', 'a '). Used by the graph_validate handler to flag poorly named entities.
```typescript
// Prefixes that indicate reference language rather than entity names
const REFERENCE_PREFIXES = ["the ", "this ", "that ", "a ", "an ", "my ", "our ", "your ", "their "];
```

- src/mcp-server/index.ts:836-961 (handler) Main handler for graph_validate tool. Runs 4 quality checks via Cypher queries: (1) generic/blocklisted names, (2) reference-language names, (3) orphaned new entities with low confidence, (4) near-duplicate names. Returns issues array with entity_id, name, type, issue description, and severity (high/medium/low). Also returns summary with total_issues count and breakdown by severity.
```typescript
}, async ({ source_session, max_issues = 50 }) => {
  const issues: Array<{
    entity_id: string;
    name: string;
    type: string;
    issue: string;
    severity: "high" | "medium" | "low";
  }> = [];

  try {
    const tenantId = currentTenant();

    // Session filter: optional additional narrowing within the tenant.
    const sessionAndForOrphan = source_session
      ? `AND (n.source_session = $session OR EXISTS { MATCH (n)-[r]-() WHERE r.source_session = $session })`
      : "";
    const sessionAndForRest = source_session
      ? `AND (n.source_session = $session OR EXISTS { MATCH (n)-[r]-() WHERE r.source_session = $session })`
      : "";
    const params: Record<string, unknown> = source_session
      ? { tenantId, session: source_session }
      : { tenantId };

    // 1. Generic / blocklisted names (tenant-scoped)
    const genericRows = await client.runReadQuery(`
      MATCH (n:Entity {tenant_id: $tenantId})
      WHERE 1=1 ${sessionAndForRest}
      WITH n, toLower(trim(n.name)) AS lname
      WHERE size(lname) < 3 OR lname IN $blocklist
      RETURN n.id AS id, n.name AS name, labels(n) AS labels, n.confidence AS confidence
      LIMIT $limit
    `, { ...params, blocklist: [...GENERIC_NAME_BLOCKLIST], limit: Math.ceil(max_issues / 4) });

    for (const row of genericRows) {
      const name = String(row["name"] ?? "");
      const type = ((row["labels"] as string[]) ?? []).find((l) => l !== "Entity") ?? "?";
      const lname = name.toLowerCase().trim();
      const reason = lname.length < 3
        ? "name too short (< 3 chars)"
        : `generic blocklisted name "${lname}"`;
      issues.push({ entity_id: String(row["id"]), name, type, issue: reason, severity: "high" });
    }

    // 2. Reference-language names (tenant-scoped)
    const allNameRows = await client.runReadQuery(`
      MATCH (n:Entity {tenant_id: $tenantId})
      WHERE 1=1 ${sessionAndForRest}
      RETURN n.id AS id, n.name AS name, labels(n) AS labels, n.confidence AS confidence
      LIMIT 2000
    `, params);

    for (const row of allNameRows) {
      if (issues.length >= max_issues) break;
      const name = String(row["name"] ?? "");
      const lname = name.toLowerCase().trim();
      const type = ((row["labels"] as string[]) ?? []).find((l) => l !== "Entity") ?? "?";
      for (const prefix of REFERENCE_PREFIXES) {
        if (lname.startsWith(prefix) && lname.length < 40) {
          issues.push({
            entity_id: String(row["id"]),
            name,
            type,
            issue: `name starts with reference language "${prefix.trim()}" — extract the noun instead`,
            severity: "high",
          });
          break;
        }
      }
    }

    // 3. Orphaned new entities (tenant-scoped)
    const orphanRows = await client.runReadQuery(`
      MATCH (n:Entity {tenant_id: $tenantId})
      WHERE NOT (n)-[]-()
        AND n.confidence <= 0.4
        AND n.times_mentioned <= 1
        ${sessionAndForOrphan}
      RETURN n.id AS id, n.name AS name, labels(n) AS labels, n.confidence AS confidence
      LIMIT $limit
    `, { ...params, limit: Math.ceil(max_issues / 4) });

    for (const row of orphanRows) {
      if (issues.length >= max_issues) break;
      const type = ((row["labels"] as string[]) ?? []).find((l) => l !== "Entity") ?? "?";
      issues.push({
        entity_id: String(row["id"]),
        name: String(row["name"]),
        type,
        issue: `isolated entity with no edges and confidence ${Number(row["confidence"] ?? 0).toFixed(2)} — may be a spurious extraction`,
        severity: "low",
      });
    }

    // 4. Near-duplicate names (tenant-scoped, case-insensitive)
    const dupRows = await client.runReadQuery(`
      MATCH (a:Entity {tenant_id: $tenantId}), (b:Entity {tenant_id: $tenantId})
      WHERE id(a) < id(b)
        AND toLower(trim(a.name)) = toLower(trim(b.name))
        AND a.id <> b.id
      RETURN a.id AS id_a, a.name AS name_a, labels(a) AS labels_a,
             b.id AS id_b, b.name AS name_b, labels(b) AS labels_b
      LIMIT $limit
    `, { tenantId, limit: Math.ceil(max_issues / 4) });

    for (const row of dupRows) {
      if (issues.length >= max_issues) break;
      const typeA = ((row["labels_a"] as string[]) ?? []).find((l) => l !== "Entity") ?? "?";
      const typeB = ((row["labels_b"] as string[]) ?? []).find((l) => l !== "Entity") ?? "?";
      issues.push({
        entity_id: String(row["id_a"]),
        name: String(row["name_a"]),
        type: typeA,
        issue: `near-duplicate: same name as entity ${row["id_b"]} (${row["name_b"]}, type ${typeB}) — consider merging with graph_relate ALIAS_OF or deleting one`,
        severity: "medium",
      });
    }

    const summary = {
      total_issues: issues.length,
      by_severity: {
        high: issues.filter((i) => i.severity === "high").length,
        medium: issues.filter((i) => i.severity === "medium").length,
        low: issues.filter((i) => i.severity === "low").length,
      },
      scope: source_session ? `session:${source_session}` : "full graph",
      issues: issues.slice(0, max_issues),
    };

    return toolResult(summary);
  } catch (err) {
    const e = err instanceof Error ? err : new Error(String(err));
    return toolError(`graph_validate failed: ${e.message}`);
  }
});
```

- src/mcp-server/index.ts:72-72 (helper) The slugify helper function used by graph_validate indirectly (for entity ID generation).
```typescript
const slugify = (s: string) =>
  s.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-|-$/g, "");
```
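The name checks (1 and 2) and the same-name check (4) in the handler can be approximated without a database, which may be useful for unit-testing extraction output before it is written to the graph. This is a minimal sketch, not the server's implementation: `checkName` and `findSameNamePairs` are hypothetical names, the blocklist is abbreviated, and unlike the handler (which runs checks 1 and 2 independently) `checkName` reports only the first problem it finds.

```typescript
// Abbreviated copy of GENERIC_NAME_BLOCKLIST; the real set is longer.
const BLOCKLIST = new Set(["this", "that", "data", "info", "unknown", "server", "tool"]);
const PREFIXES = ["the ", "this ", "that ", "a ", "an ", "my ", "our ", "your ", "their "];

// Hypothetical pure-function version of checks 1 and 2.
// Returns a description of the first problem found, or null if the name passes.
function checkName(name: string): string | null {
  const lname = name.toLowerCase().trim();
  if (lname.length < 3) return "name too short (< 3 chars)";
  if (BLOCKLIST.has(lname)) return `generic blocklisted name "${lname}"`;
  // Same rule as the handler: a reference prefix only counts for names under 40 chars.
  const prefix = PREFIXES.find((p) => lname.startsWith(p) && lname.length < 40);
  if (prefix) return `name starts with reference language "${prefix.trim()}"`;
  return null;
}

interface NamedEntity {
  id: string;
  name: string;
}

// Hypothetical in-memory version of check 4: pairs of entities whose names
// are equal after lowercasing and trimming.
function findSameNamePairs(entities: NamedEntity[]): Array<[string, string]> {
  const idsByName = new Map<string, string[]>();
  const pairs: Array<[string, string]> = [];
  for (const entity of entities) {
    const key = entity.name.toLowerCase().trim();
    const earlier = idsByName.get(key) ?? [];
    for (const otherId of earlier) {
      pairs.push([otherId, entity.id]);
    }
    earlier.push(entity.id);
    idsByName.set(key, earlier);
  }
  return pairs;
}

console.log(checkName("the database"));
console.log(findSameNamePairs([
  { id: "a", name: "Redis" },
  { id: "b", name: " redis" },
  { id: "c", name: "Postgres" },
]));
```

Grouping by normalized name keeps the in-memory pass linear in the number of entities, whereas the Cypher query in the handler compares entity pairs directly and relies on its LIMIT to bound the result.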