ABOUTME: Describes how `codegraph index` builds the code graph today
ABOUTME: Proposes custom analyzers that extend beyond AST + FastML extraction
# Indexing pipeline and analyzer roadmap
This document has two goals:
1. Describe the current `codegraph index` indexing process as implemented in this repository.
2. Outline higher-leverage custom analyzers that can produce a richer, more accurate graph than the current **Tree-sitter AST extraction + FastML** enhancement approach.
## Current `codegraph index` process (as implemented)
### CLI entrypoint
The `codegraph index` command is implemented in `crates/codegraph-mcp-server/src/bin/codegraph.rs`:
- `Commands::Index { ... }` is parsed by `clap`
- `handle_index(...)` builds an `IndexerConfig` and runs `ProjectIndexer::index_project(...)`
### Project indexing orchestration
`ProjectIndexer::index_project` lives in `crates/codegraph-mcp/src/indexer.rs` and coordinates the full pipeline.
At a high level (condensed into a code sketch after this list):
1. **Load file collection config**
- Builds `FileCollectionConfig` from `IndexerConfig`
- Collects source files via `codegraph_parser::file_collect::collect_source_files_with_config`
2. **Decide incremental vs clean-slate**
- If `--force`, calls `clean_project_data(project_id)` in storage (clean slate)
- If the project is already indexed and file metadata exists:
- Detects added/modified/deleted/unchanged files
- Deletes persisted data for deleted/changed files (delete-then-insert semantics)
- Indexes only the changed files
- Otherwise indexes all files
3. **Parse + extract nodes and edges (unified pass)**
- Calls `parse_files_with_unified_extraction(...)` (delegates to a shared helper in `crates/codegraph-mcp/src/estimation.rs`)
- Per file: `TreeSitterParser::parse_file_with_edges(...)`
4. **Normalize node identifiers**
- Generates deterministic node IDs (project-scoped)
- Updates edges so `from` IDs remain consistent after re-identification
5. **Chunking + embeddings (optional; feature-gated)**
- Builds a chunk plan from nodes + file sources
- Generates embeddings for chunks and persists them into SurrealDB
6. **Persist nodes**
- Writes nodes in batches through a SurrealDB writer queue
7. **Resolve and persist edges**
- Resolves `EdgeRelationship.to` (string symbols) into concrete node IDs using:
- exact and normalized matching against a symbol index
- optional semantic matching when `ai-enhanced` is enabled
- Persists resolved edges (with project_id)
8. **Persist metadata**
- Writes project metadata summary and per-file metadata for future incremental runs
9. **Watch mode (optional)**
- If `--watch`, monitors for changes and reindexes
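For orientation, the orchestration above condenses to roughly the following shape. This is a minimal sketch: the function and type names here are illustrative stand-ins, not the real signatures in `indexer.rs`.

```rust
use std::path::PathBuf;

// Placeholder extraction result: nodes plus unresolved (from, to-symbol) edges.
struct Extraction {
    nodes: Vec<String>,
    edges: Vec<(String, String)>,
}

// Stubs standing in for the real pipeline stages.
fn collect_source_files() -> Vec<PathBuf> { vec![] }             // step 1
fn diff_changed_files(all: Vec<PathBuf>) -> Vec<PathBuf> { all } // step 2
fn parse_with_edges(_path: &PathBuf) -> Extraction {             // step 3
    Extraction { nodes: vec![], edges: vec![] }
}
fn persist(_results: Vec<Extraction>) {}                         // steps 4-8

fn index_project(force: bool, watch: bool) {
    // 1. Collect source files according to the file-collection config.
    let all_files = collect_source_files();

    // 2. Incremental vs clean slate: --force wipes project data first;
    //    otherwise only files that changed since the last run are indexed.
    let to_index = if force { all_files } else { diff_changed_files(all_files) };

    // 3. Unified parse + extraction pass (one Tree-sitter parse per file).
    let results: Vec<Extraction> = to_index.iter().map(parse_with_edges).collect();

    // 4-8. Re-identify nodes, optionally chunk + embed, persist nodes,
    //      resolve + persist edges, and write metadata for incremental runs.
    persist(results);

    // 9. With --watch, the real indexer loops on file-system events here.
    let _ = watch;
}

fn main() {
    index_project(false, false);
}
```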
## Current analyzers: AST + FastML
### AST extraction (Tree-sitter)
Per file, `TreeSitterParser::parse_file_with_edges(...)`:
1. Detects language from the path
2. Parses content with Tree-sitter (with tolerant cleanup for error trees)
3. Dispatches to a language extractor in `crates/codegraph-parser/src/languages/*` which returns an `ExtractionResult`:
- `nodes`: semantic nodes (functions, classes/structs, modules, etc.)
- `edges`: relationship edges (calls/imports/dependencies, depending on language extractor)
### FastML enhancement (pattern + symbol heuristics)
After the language extractor runs, the parser applies:
- `crate::fast_ml::enhance_extraction(ast_result, content)`
FastML currently combines:
- Pattern-based edge creation (fast textual patterns over file content)
- Local symbol indexing + lightweight resolution heuristics
This improves recall, especially for edges that are hard to extract purely from Tree-sitter patterns (or where the extractor is conservative).
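The real heuristics live in `crate::fast_ml`; the pattern-based half reduces, in spirit, to something like the following. This is an illustrative sketch: the regex, the edge shape, and the confidence value are assumptions, and it needs the `regex` crate.

```rust
use regex::Regex;

/// An unresolved heuristic edge: caller node plus a textual callee symbol.
struct PatternEdge {
    from: String,      // enclosing function (assumed already known here)
    to_symbol: String, // callee symbol, resolved later against the index
    confidence: f32,   // heuristic edges carry less confidence than AST ones
}

fn pattern_call_edges(enclosing_fn: &str, content: &str) -> Vec<PatternEdge> {
    // Crude "identifier followed by (" pattern. Note it also matches
    // definitions like `fn run()`; real heuristics filter such hits.
    let call = Regex::new(r"\b([A-Za-z_][A-Za-z0-9_:.]*)\s*\(").unwrap();
    call.captures_iter(content)
        .map(|cap| PatternEdge {
            from: enclosing_fn.to_string(),
            to_symbol: cap[1].to_string(),
            confidence: 0.5, // textual match only
        })
        .collect()
}

fn main() {
    let src = "fn run() { helper(); store::save(x); }";
    for e in pattern_call_edges("run", src) {
        println!("{} -call?-> {} ({:.1})", e.from, e.to_symbol, e.confidence);
    }
}
```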
## Analyzer pipeline in this repository (implemented)
In addition to AST + FastML extraction, the indexing pipeline runs analyzer stages that enrich the graph by default:
- **Build context**: Cargo workspace packages, features, and dependency edges (`depends_on`, `enables`)
- **LSP resolution**: enriches qualified names and resolves reference targets via language servers
- **Rustdoc + API surface**: attaches doc comments, marks API visibility, emits `exports`/`reexports`, and links feature-gated items
- **Module linker**: adds module nodes plus module-level import/containment edges
- **Dataflow (Rust-local)**: emits conservative def-use + propagation edges (`defines`, `uses`, `flows_to`, `returns`, `mutates`)
- **Docs/contracts**: creates document/spec nodes and links backticked symbols (`documents`, `specifies`)
- **Architecture/boundaries**: counts package dependency cycles and (optionally) emits `violates_boundary` edges from configured rules
LSP resolution is inherently dependent on external tooling and workspace size. To keep indexing observable and avoid indefinite hangs:
- Indexing emits periodic, count-oriented progress for LSP resolution (no per-file noise at the info log level).
- The per-request LSP timeout can be configured via `CODEGRAPH_LSP_REQUEST_TIMEOUT_SECS` (default `600`, minimum `5`).
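Reading that knob defensively is a short chain; a sketch follows (the variable name, default, and floor come from the behavior described above; the parsing code itself is illustrative, not the real implementation):

```rust
use std::time::Duration;

/// LSP request timeout: default 600s, clamped to the documented 5s floor.
fn lsp_request_timeout() -> Duration {
    let secs = std::env::var("CODEGRAPH_LSP_REQUEST_TIMEOUT_SECS")
        .ok()
        .and_then(|v| v.parse::<u64>().ok())
        .unwrap_or(600) // default
        .max(5);        // minimum
    Duration::from_secs(secs)
}

fn main() {
    println!("LSP request timeout: {:?}", lsp_request_timeout());
}
```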
### Project-level resolution (cross-file)
Even after FastML, many extracted edges still have `to: String` rather than a resolved node ID. The indexer builds a project-wide symbol index and attempts to resolve edge targets across the whole project, with optional semantic matching behind the `ai-enhanced` feature flag.
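A simplified sketch of that exact-then-normalized lookup (the index layout and the normalization rule are illustrative; the semantic fallback behind `ai-enhanced` is omitted):

```rust
use std::collections::HashMap;

/// Project-wide symbol index: symbol string -> node ID.
struct SymbolIndex {
    exact: HashMap<String, u64>,
    normalized: HashMap<String, u64>, // e.g. last path segment, lowercased
}

// Illustrative normalization: strip the module path, fold case.
fn normalize(symbol: &str) -> String {
    symbol.rsplit("::").next().unwrap_or(symbol).to_lowercase()
}

impl SymbolIndex {
    fn resolve(&self, target: &str) -> Option<u64> {
        // An exact match wins; otherwise fall back to the normalized form.
        self.exact
            .get(target)
            .or_else(|| self.normalized.get(&normalize(target)))
            .copied()
    }
}

fn main() {
    let mut idx = SymbolIndex { exact: HashMap::new(), normalized: HashMap::new() };
    idx.exact.insert("store::save".into(), 42);
    idx.normalized.insert("save".into(), 42);
    assert_eq!(idx.resolve("store::save"), Some(42)); // exact hit
    assert_eq!(idx.resolve("Save"), Some(42));        // normalized fallback
}
```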
## Why go beyond AST + FastML
Tree-sitter + heuristics is fast and broad, but it has structural blind spots:
- **Missing type information**: overload resolution, trait impl dispatch, generic specialization, dynamic language inference
- **Weak cross-file resolution** without a real module/type system
- **Macro / codegen opacity** (Rust macros, TS decorators, generated code, build scripts)
- **Dataflow is largely absent** (def-use chains, taint/propagation, aliasing)
- **Build-context blindness** (features/targets/conditional compilation) can produce incorrect edges
The highest-leverage analyzers fix accuracy at the source: they produce fewer ambiguous edges and more stable, fully-qualified identifiers.
## Analyzer extension points in the current pipeline
The current implementation has three natural places to plug in richer analyzers:
1. **File-level post-AST enhancement**
- Right after `languages::extract_for_language(...)` and before returning an `ExtractionResult`
- Best for: doc extraction, local imports/exports, simple structural facts
2. **Project-level analysis after parse**
- After all file `ExtractionResult`s have been merged (global visibility)
- Best for: module graphs, cross-file linking, build-aware symbol tables, deduplication
3. **Post-persistence enrichment**
- After nodes/edges exist in storage (graph queries become cheap)
- Best for: centrality metrics, component detection, architectural constraint checks, “impact” precomputations
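One way to make these extension points explicit is a stage-tagged analyzer trait. This is a design sketch, not an API that exists in this repo today:

```rust
/// Where in the pipeline an analyzer runs (mirrors the three points above).
#[allow(dead_code)]
enum Stage {
    PostAstFile,      // 1. after per-file extraction
    PostParseProject, // 2. after all files are merged
    PostPersist,      // 3. after nodes/edges are in storage
}

/// Minimal analyzer contract; a real one would take typed graph inputs
/// and return typed deltas with provenance.
trait Analyzer {
    fn name(&self) -> &'static str;
    fn stage(&self) -> Stage;
    fn run(&self) -> (usize, usize); // (nodes added, edges added)
}

struct ModuleLinker;

impl Analyzer for ModuleLinker {
    fn name(&self) -> &'static str { "module-linker" }
    fn stage(&self) -> Stage { Stage::PostParseProject }
    fn run(&self) -> (usize, usize) { (0, 0) } // stub
}

fn main() {
    // The pipeline would group registered analyzers by stage and run
    // each group at its extension point.
    let analyzers: Vec<Box<dyn Analyzer>> = vec![Box::new(ModuleLinker)];
    for a in &analyzers {
        let (n, e) = a.run();
        println!("{}: +{} nodes, +{} edges", a.name(), n, e);
    }
}
```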
## Recommended analyzer output contract
To make analyzers composable (and keep later tools honest), treat analyzer outputs as first-class data with:
- **Provenance**: analyzer name/version, run mode (full vs incremental), and inputs (files/build config).
- **Confidence**: per node/edge confidence + method (AST, heuristic, LSP/compiler, semantic similarity).
- **Evidence**: source span references (file + byte/line ranges) and optional stable hashes of evidence snippets.
- **Determinism**: prefer fully-qualified names and stable IDs so incremental runs converge instead of drifting.
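Expressed as data, that contract might look like this (a proposal sketch; none of these field names exist in the codebase today):

```rust
/// How a fact was established; drives downstream trust decisions.
#[allow(dead_code)]
enum Method {
    Ast,
    Heuristic,
    LspOrCompiler,
    SemanticSimilarity,
}

/// Pointer back to the source text that justifies a node or edge.
struct Evidence {
    file: String,
    byte_range: (usize, usize),
    snippet_hash: Option<String>, // stable hash of the evidence span
}

/// Provenance + confidence attached to every analyzer-produced fact.
struct AnalyzerFact {
    analyzer: String,         // e.g. "module-linker"
    analyzer_version: String,
    incremental: bool,        // run mode: full vs incremental
    method: Method,
    confidence: f32,          // 0.0..=1.0
    evidence: Vec<Evidence>,
    fq_name: String,          // fully-qualified name -> deterministic IDs
}

fn main() {
    let fact = AnalyzerFact {
        analyzer: "module-linker".into(),
        analyzer_version: "0.1.0".into(),
        incremental: false,
        method: Method::Ast,
        confidence: 0.95,
        evidence: vec![Evidence {
            file: "src/lib.rs".into(),
            byte_range: (120, 180),
            snippet_hash: None,
        }],
        fq_name: "my_crate::store::save".into(),
    };
    println!("{} @ {:.2}", fact.fq_name, fact.confidence);
}
```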
## Custom analyzers that can outperform the current approach
The suggestions below focus on analyzers that meaningfully improve graph **precision**, **completeness**, or **actionability** for agentic tooling.
### 1) Type-aware symbol and call resolution (Language Server / compiler-backed)
**What it is**
Use language-native tooling to resolve identifiers and call targets with type information:
- Rust: `rust-analyzer` (or `cargo check` + compiler metadata) for fully-qualified item paths, trait impls, and macro expansion boundaries
- TypeScript/JavaScript: `tsserver` / TypeScript language service for imports, types, and call target resolution
- Python: `pyright` for module resolution and inferred types (as available)
- Java: JDT language server (or build tool integration) for classpath-aware resolution
**What it produces**
- Stable, fully qualified symbol identities (module/crate/package scoped)
- High-confidence edges:
- calls → resolved callee
- implements/overrides → resolved target
- imports/exports → resolved module symbol
- references → resolved definition
- Rich node metadata:
- signature (params/return), visibility, generic params, receiver type, namespace/module path
**Why it is more powerful than AST + FastML**
It eliminates ambiguity that heuristics cannot reliably resolve (overloads, trait dispatch, reexports, conditional compilation).
**Where it plugs in**
Project-level analysis after parse is the best fit, because it needs workspace-wide context and build settings.
**Pragmatic scope control**
Start by implementing it for one language (Rust is the highest-ROI choice in this repo) and limit outputs to:
- “resolved” edges where confidence is high
- metadata needed for later matching (fully-qualified names), even when edges cannot be resolved
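The gating rule in that last list is easy to state in code. A sketch follows; the 0.9 threshold and the `Resolution` shape are assumptions, not repo types:

```rust
/// What a type-aware resolver emits per reference (illustrative).
enum Resolution {
    /// High-confidence edge to a concrete node.
    Edge { from: u64, to: u64, confidence: f32 },
    /// Unresolved, but the fully-qualified name still feeds later
    /// matching and deterministic IDs.
    FqNameOnly { from: u64, fq_name: String },
}

fn gate(from: u64, to: Option<u64>, fq_name: String, confidence: f32) -> Resolution {
    match to {
        // Only emit an edge when the resolver is actually sure.
        Some(to) if confidence >= 0.9 => Resolution::Edge { from, to, confidence },
        _ => Resolution::FqNameOnly { from, fq_name },
    }
}

fn main() {
    match gate(1, Some(2), "my_crate::store::save".into(), 0.97) {
        Resolution::Edge { to, confidence, .. } => {
            println!("resolved to node {to} ({confidence})")
        }
        Resolution::FqNameOnly { fq_name, .. } => println!("kept FQN {fq_name}"),
    }
}
```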
### 2) Build-context analyzer (workspace/package graph + conditional compilation)
**What it is**
Index build manifests and configuration so the analyzer understands what code is “active” and how modules relate:
- Rust: `Cargo.toml`, `Cargo.lock`, `cargo metadata`, feature flags, target cfgs
- Node: `package.json`, lockfiles, tsconfig, workspace boundaries
- Go: `go.mod`, module graph
- Java: `pom.xml`/Gradle files, dependency graph
**What it produces**
- Nodes representing packages/crates/modules and configuration entities (features/targets)
- Edges:
- depends-on (crate/package/module)
- enables (feature → module/file sets)
- generates (build script/codegen → generated artifacts, when discoverable)
**Why it is more powerful**
It reduces incorrect edges caused by analyzing code without the context that compilers use.
**Where it plugs in**
Project-level analysis after parse, and it can also inform file collection (what to include/exclude).
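For Rust, much of this is one `cargo metadata` call away. A sketch that shells out to the real command and prints `depends_on` edges (the CLI flags are real; the edge formatting is ours, and it requires the `serde_json` crate):

```rust
use std::process::Command;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // --no-deps limits output to workspace packages; each still lists
    // its declared dependencies.
    let out = Command::new("cargo")
        .args(["metadata", "--format-version", "1", "--no-deps"])
        .output()?;
    let meta: serde_json::Value = serde_json::from_slice(&out.stdout)?;

    if let Some(packages) = meta["packages"].as_array() {
        for pkg in packages {
            let name = pkg["name"].as_str().unwrap_or("?");
            if let Some(deps) = pkg["dependencies"].as_array() {
                for dep in deps {
                    // package -> dependency edge, as in the list above
                    println!("{name} -depends_on-> {}", dep["name"].as_str().unwrap_or("?"));
                }
            }
        }
    }
    Ok(())
}
```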
### 3) Cross-file import/export linker (fully-qualified naming without full type checking)
**What it is**
For languages where a full type check is heavy, a middle ground is a **module-aware linker**:
- Parse import/export statements and build a module graph
- Build symbol tables per module/file and resolve references using module scoping rules
**What it produces**
- Fully-qualified names (good for deterministic IDs and stable search)
- More resolved edges for imports/exports and static references
**Why it is more powerful**
It substantially improves cross-file precision at lower cost than full compiler integration.
**Where it plugs in**
Project-level analysis after parse, operating over the aggregated node set.
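A toy version of module-scoped resolution (the import representation and scoping rule are simplified to "imported name, else same-module definition"):

```rust
use std::collections::HashMap;

/// Per-module symbol table plus imports (local name -> defining module).
struct Module {
    symbols: HashMap<String, u64>,
    imports: HashMap<String, String>,
}

fn resolve(modules: &HashMap<String, Module>, in_module: &str, name: &str) -> Option<u64> {
    let m = modules.get(in_module)?;
    // 1. An import binds the name to another module's definition.
    if let Some(def_module) = m.imports.get(name) {
        return modules.get(def_module)?.symbols.get(name).copied();
    }
    // 2. Otherwise look for a definition in the same module.
    m.symbols.get(name).copied()
}

fn main() {
    let mut store = Module { symbols: HashMap::new(), imports: HashMap::new() };
    store.symbols.insert("save".into(), 7);
    let mut app = Module { symbols: HashMap::new(), imports: HashMap::new() };
    app.imports.insert("save".into(), "store".into());

    let mut modules = HashMap::new();
    modules.insert("store".to_string(), store);
    modules.insert("app".to_string(), app);

    // `save` used inside `app` resolves across files via the import.
    assert_eq!(resolve(&modules, "app", "save"), Some(7));
}
```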
### 4) Dataflow analyzer (def-use, propagation, and “impact” edges)
**What it is**
Compute a simplified dataflow representation per function/module:
- definitions → uses
- assignment/return propagation
- parameter flow into callees (coarse-grained)
**What it produces**
- New edge kinds: `Defines`, `Uses`, `FlowsTo`, `Returns`, `Captures`, `Mutates`
- “Impact” views: what could change if this symbol changes?
**Why it is more powerful**
It enables higher-quality automated reasoning (“what breaks if…”, “who consumes this value…”) than pure call graphs.
**Where it plugs in**
File-level post-AST enhancement for local def-use; project-level for interprocedural linking (if desired).
**Scope control**
Start with local def-use edges and a conservative model (no alias analysis), then iterate.
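A deliberately conservative starting point, per the note above, could look like this (statement shapes are a stand-in for real AST nodes; no aliasing, no control flow):

```rust
/// Stand-in for AST statements inside one function body.
enum Stmt {
    Assign { var: String, reads: Vec<String> }, // var = f(reads...)
    Return { reads: Vec<String> },
}

/// Emit textual `defines`/`uses`/`flows_to`/`returns` facts, conservatively.
fn local_def_use(stmts: &[Stmt]) -> Vec<String> {
    let mut facts = Vec::new();
    for (i, stmt) in stmts.iter().enumerate() {
        match stmt {
            Stmt::Assign { var, reads } => {
                facts.push(format!("stmt{i} defines {var}"));
                for r in reads {
                    facts.push(format!("stmt{i} uses {r}"));
                    facts.push(format!("{r} flows_to {var}"));
                }
            }
            Stmt::Return { reads } => {
                for r in reads {
                    facts.push(format!("stmt{i} returns {r}"));
                }
            }
        }
    }
    facts
}

fn main() {
    // y = f(x); return y;
    let body = vec![
        Stmt::Assign { var: "y".into(), reads: vec!["x".into()] },
        Stmt::Return { reads: vec!["y".into()] },
    ];
    for f in local_def_use(&body) {
        println!("{f}");
    }
}
```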
### 5) API surface analyzer (public exports, stability, and semver boundaries)
**What it is**
Detect and persist the public API surface for each package/module:
- Rust: `pub` items, reexports, feature-gated API
- TS: exported symbols, declaration emission boundaries
**What it produces**
- Nodes and metadata describing API visibility and ownership (which package exports what)
- Edges:
- exports (module → symbol)
- reexports (module → symbol)
**Why it is more powerful**
It makes “what is the supported surface?” explicit, which improves agentic refactors and change planning.
**Where it plugs in**
Project-level analysis after parse; build-context data improves correctness.
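As a crude Rust illustration, some visibility can be read straight off item syntax. This line-based heuristic is for the sketch only; a real analyzer would use the AST or rustdoc JSON, and this version misses `pub(crate)`, reexports, macros, and multi-line items:

```rust
/// Naive scan for public items in a Rust source file.
fn public_items(source: &str) -> Vec<String> {
    source
        .lines()
        .map(str::trim)
        .filter(|l| l.starts_with("pub fn ") || l.starts_with("pub struct "))
        .map(|l| l.to_string())
        .collect()
}

fn main() {
    let src = "pub fn save() {}\nfn helper() {}\npub struct Config;";
    for item in public_items(src) {
        println!("exports: {item}");
    }
}
```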
### 6) Documentation and contract analyzer (docs → code binding)
**What it is**
Link human-facing contracts to code:
- doc comments (Rustdoc/JSDoc)
- READMEs and design docs
- schemas (OpenAPI/GraphQL/SQL schema files) where applicable
**What it produces**
- Nodes representing documents/contracts
- Edges:
- documents (doc → symbol)
- specifies (spec → component)
**Why it is more powerful**
It improves retrieval quality for “why” questions and reduces hallucinations by anchoring to explicit contracts.
**Where it plugs in**
Post-persistence enrichment is convenient (store docs as nodes + link by symbol names), but file-level extraction also works.
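The backtick-linking step reduces to a scan plus a symbol lookup; a sketch (the pattern and lookup are simplified, and it requires the `regex` crate):

```rust
use regex::Regex;
use std::collections::HashMap;

/// Link `symbol` mentions in a doc to known code nodes: each (doc, node)
/// pair becomes a `documents` edge.
fn document_edges(doc_id: u64, doc: &str, symbols: &HashMap<String, u64>) -> Vec<(u64, u64)> {
    let backticked = Regex::new(r"`([A-Za-z_][A-Za-z0-9_:.]*)`").unwrap();
    backticked
        .captures_iter(doc)
        .filter_map(|cap| symbols.get(&cap[1]).map(|&node| (doc_id, node)))
        .collect()
}

fn main() {
    let mut symbols = HashMap::new();
    symbols.insert("store::save".to_string(), 42);
    let readme = "Call `store::save` after validation; `Frobnicate` is unknown.";
    // Only mentions that match a real symbol become edges.
    println!("{:?}", document_edges(1, readme, &symbols)); // [(1, 42)]
}
```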
### 7) Architecture and boundary analyzer (component detection + dependency constraints)
**What it is**
Infer components (crates/modules/subsystems) and track dependency directionality and cycles.
**What it produces**
- Component nodes (crate/module, optionally folder-based)
- Edges:
- depends-on (component-level)
- violates-boundary (when a dependency crosses a configured rule)
- Metrics: centrality, fan-in/out, cyclic dependencies
**Why it is more powerful**
It converts a raw graph into actionable architecture signals: hotspots, cycles, boundary violations, high-impact symbols.
**Where it plugs in**
Post-persistence enrichment (graph queries make this cheap), with optional hints from build-context analyzer.
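Cycle detection over the component graph is standard graph work; a hand-rolled DFS three-coloring sketch (component names are illustrative, and fan-in/out falls out of the same adjacency data by counting edges):

```rust
use std::collections::HashMap;

/// True if the component dependency graph contains a cycle.
fn has_cycle(deps: &HashMap<&str, Vec<&str>>) -> bool {
    #[derive(Clone, Copy)]
    enum Color { White, Gray, Black }

    fn visit<'a>(
        node: &'a str,
        deps: &HashMap<&'a str, Vec<&'a str>>,
        color: &mut HashMap<&'a str, Color>,
    ) -> bool {
        match color.get(node).copied().unwrap_or(Color::White) {
            Color::Gray => return true, // back-edge: cycle found
            Color::Black => return false,
            Color::White => {}
        }
        color.insert(node, Color::Gray);
        for &next in deps.get(node).into_iter().flatten() {
            if visit(next, deps, color) {
                return true;
            }
        }
        color.insert(node, Color::Black);
        false
    }

    let mut color = HashMap::new();
    deps.keys().any(|&n| visit(n, deps, &mut color))
}

fn main() {
    // core -> storage -> core is a dependency cycle worth surfacing.
    let deps = HashMap::from([
        ("core", vec!["storage"]),
        ("storage", vec!["core"]),
        ("cli", vec!["core"]),
    ]);
    println!("cycle: {}", has_cycle(&deps)); // true
}
```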
## Summary: recommended analyzer roadmap
If the goal is “more powerful than AST + FastML” with the smallest number of high-impact additions, prioritize:
1. **Type-aware resolver for Rust** (compiler/LSP-backed) to materially improve edge resolution precision.
2. **Build-context analyzer** to make results correct under features/targets and to model package/crate dependencies explicitly.
3. **Cross-file import/export linker** for languages where a full type check is too heavy, to improve qualified naming and basic linking.
4. **Dataflow (local def-use)** to unlock impact analysis beyond call graphs.
5. **Architecture/boundary enrichment** post-persistence to expose cycles and dependency direction problems.