Big Indexer
The Big Indexer (BGI) server is a static architecture analysis tool for exploring, querying, and reasoning about large codebases through behavioral role groupings, cluster boundaries, and AI-grounded implementation guidance.
Cluster & Boundary Analysis
cluster_of_file - Retrieve the architectural cluster a file belongs to
boundary_edges - Find fuse boundary edges (architectural boundaries) touching a file or cluster
high_coupling_seams - Identify the strongest cross-cluster coupling seams for a file, cluster, or entire repo
Impact & Symbol Search
impact_neighbors - Calculate the likely blast radius (affected symbols/files) from a symbol or file
search_symbols - Search for symbols via index database or graph fallback, with optional context anchoring
Architecture Summarization & Context
architecture_summary - Generate a compact architecture summary suitable for injecting into AI prompts
guided_arch_context - Get staged, scope-first architecture context escalated from a natural-language prompt
classify_prompt - Classify a natural-language prompt into its scope and retrieval requirements
AI Task Grounding (BGI-TWIN)
task_fingerprint - Translate a natural-language task description into COV behavioral tokens
behavioral_twins - Find top in-repo code units that behaviorally match a task (ranked by COV token overlap)
twin_context - Return a full implementation guidance package: task COV tokens + top behavioral twins + seam suggestions + rubric checklist
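MCP clients invoke tools such as twin_context through a JSON-RPC 2.0 `tools/call` request. The sketch below only builds that request envelope. The argument name `task` is a guess made for illustration; the real input schema for each tool is not documented in this section and should be read from the server's tool listing.

```python
import json


def tool_call_request(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 envelope for an MCP tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "task" is a hypothetical argument name, not taken from BGI's schema.
request = tool_call_request(
    1,
    "twin_context",
    {"task": "Add endpoint that validates input and persists data."},
)
print(json.dumps(request, indent=2))
```

In practice an MCP client library sends this envelope for you; the sketch is mainly useful for reading logs or debugging a transport.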
Server Management
reload_artifacts - Reload graph and fuse artifacts from disk without restarting the server
BGI - Big Indexer
BGI is a static architecture analysis tool for large codebases.
It groups code units by behavioral role and emits explicit architectural boundaries.
Project domain: bigindexer.com
Use via MCP Registry
Big Indexer is published in the MCP Registry as io.github.ahmedxuhri/bigindexer.
pip install bigindexer==0.1.3
bgi mcp --graph bgi-graph.json --fuse-graph fuse-graph.json
Validation: https://bigindexer.com/validation
What problem this solves
Most architecture graphs fail at scale in two ways:
too many noisy edges
giant clusters that collapse unrelated components together
BGI is built to keep both under control, so the output remains usable on large repos.
What you can do with it
"Where should this boundary be before we refactor?"
BGI groups units by behavioral role (COV tokens + DRS clustering) so likely component boundaries are visible.
"Which subsystem coupling is risky?"
BGI surfaces high-coupling seams and fuse-boundary signals between clusters so integration risk is easier to spot.
"How do we plug architecture data into automation?"
BGI emits machine-readable artifacts (bgi-graph.json, fuse-graph.json) plus optional human context (bigindexer.md).
"How do we make AI changes less random?"
MCP tools (task_fingerprint, behavioral_twins, twin_context) ground prompts in in-repo behavior patterns.
"Can I run this automatically on PRs as a live example?"
Yes: use the dedicated action repo ahmedxuhri/bigindexer-pr-risk-bot to auto-comment PRs with blast radius, seams, and risk hints.
30-second demo
Run BGI on the included fixture repo:
git clone https://github.com/ahmedxuhri/bigindexer
cd bigindexer
pip install -e .
bgi scan tests/fixtures --lang python --out /tmp/bgi-example.json
head -50 /tmp/bgi-example.json
Observed result on this repository:
units: 12
edges: 14
clusters: 2
max cluster in sample: 6 units
One produced edge looks like:
{
  "source": "auth_module.py::AuthService::__init__",
  "target": "auth_module.py::AuthService::__del__",
  "key": "COV.INIT",
  "lock": "COV.TEARDOWN",
  "type": "HARD"
}
Why this matters: instead of raw syntax references only, you get behavioral relationships plus cluster structure that can drive architecture decisions.
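As a rough illustration of consuming such edges, the sketch below counts key/lock pairs and picks out edges that cross file boundaries. It works on an in-memory list because the top-level layout of bgi-graph.json is not described here. The second record and its COV.EXAMPLE_* tokens are invented for the example, and the assumption that unit ids always follow the `file::Class::method` shape comes only from the single sample edge.

```python
from collections import Counter

# Edge records shaped like the sample edge. The second record is made up
# for illustration and does not come from a real BGI scan.
edges = [
    {
        "source": "auth_module.py::AuthService::__init__",
        "target": "auth_module.py::AuthService::__del__",
        "key": "COV.INIT",
        "lock": "COV.TEARDOWN",
        "type": "HARD",
    },
    {
        "source": "auth_module.py::AuthService::login",
        "target": "session_store.py::SessionStore::save",
        "key": "COV.EXAMPLE_KEY",
        "lock": "COV.EXAMPLE_LOCK",
        "type": "HARD",
    },
]


def file_of(unit_id: str) -> str:
    """Return the file part of a 'file::Class::method' unit id."""
    return unit_id.split("::", 1)[0]


def summarize(edge_list):
    """Count (key, lock) pairs and collect edges whose endpoints differ in file."""
    pair_counts = Counter((e["key"], e["lock"]) for e in edge_list)
    cross_file = [
        e for e in edge_list if file_of(e["source"]) != file_of(e["target"])
    ]
    return pair_counts, cross_file


pair_counts, cross_file = summarize(edges)
print(pair_counts.most_common())
print(len(cross_file), "cross-file edge(s)")
```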
Plain-English glossary
BGI term | Plain meaning |
COV token | A behavior label for a unit (for example: |
Key-Lock edge | A behavioral connection between two units with complementary roles |
DRS cluster | A unit-level grouping by behavioral role. Mostly intra-file in practice. File-level architectural components are better expressed via the BGI edge graph or the fuse-graph boundary signal — see external benchmark |
Fuse edge / fuse event | A refused merge because cluster growth hit the cap; treated as boundary signal |
Spectral masks | Scope rules that limit where matching is allowed (global, directory, file) |
Architecture in one view
Source files
->
Gate 1: fingerprint unit behavior (COV tokens)
->
Gate 2: create behavioral edges with scoped matching
->
Gate 3: cluster with hard size cap + boundary emission
->
Artifacts: bgi-graph.json, fuse-graph.json, bigindexer.md, optional routes/graphml/html
Core approach:
TOKEN-CENSUS - classify token frequency per repo.
SPECTRAL-MASKS - restrict match scope by token frequency.
FUSE-MAP - cap cluster growth and record refused merges.
MASK-4-GATE-3 - use import proximity as clustering signal.
WATER-CLOCK + .scm - single-pass query extraction path in Gate 1.
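The FUSE-MAP idea (cap cluster growth, record refused merges) can be sketched with a size-capped union-find. This is a minimal illustration of the general technique, not BGI's actual Gate 3 code: the class name, the cap value, and the toy edge list are all hypothetical.

```python
class CappedUnionFind:
    """Union-find that refuses merges which would exceed a size cap."""

    def __init__(self, items, cap):
        self.parent = {x: x for x in items}
        self.size = {x: 1 for x in items}
        self.cap = cap
        self.fuse_events = []  # refused merges, kept as boundary signal

    def find(self, x):
        # Path halving: point each visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return True
        if self.size[ra] + self.size[rb] > self.cap:
            # The merge would break the cap: record it instead of merging.
            self.fuse_events.append((a, b))
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


units = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]

uf = CappedUnionFind(units, cap=3)
for src, dst in edges:
    uf.union(src, dst)

clusters = {}
for u in units:
    clusters.setdefault(uf.find(u), []).append(u)

print(sorted(clusters.values()))  # [['a', 'b', 'c'], ['d', 'e']]
print(uf.fuse_events)             # [('c', 'd')]
```

With a cap of 3, the chain a-b-c-d-e splits at the c-d edge, and that refused merge is exactly the kind of record a fuse graph would carry as a boundary signal.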
Why BGI is different from common alternatives
Capability | LSP / SCIP index | Call-graph + generic community detection | BGI |
Fast symbol lookup | Strong | Medium | Available (Phase 6 index) |
Behavioral token model | No | Usually no | Yes |
Hard-bounded clustering | No | Usually no | Yes (unit-level) |
First-class boundary artifact | No | Usually no | Yes ( |
Scope-constrained edge generation | Limited | Rare | Yes (spectral masks) |
External head-to-head benchmark (Louvain on BGI's edges vs Louvain on raw imports, scored against package layout): BGI's edges win on Python (django F1 0.38 vs 0.29, MoJoFM 0.45 vs 0.34) and currently tie/lose on Go due to lower cross-file edge density on tier-2 scanners. Full results and methodology in docs/VALIDATION_EVIDENCE.md.
Evidence (current, verifiable)
Large-repo scale evidence
Comparable kubernetes sample (go comparable mode, 162,917 units):
Gate 1: 141.964s
Gate 2: 67.261s (historical comparable baseline: 138.869s)
Gate 3: 9.359s
Total: 218.584s
Max cluster: 1.113%
Fuse events: 0
Artifact: output/validation/kubernetes-optionb-controlled-median-v21.json
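The reported figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above. The derived unit count for the largest cluster is approximate, since the published percentage is itself rounded.

```python
# Gate timings in seconds, as reported for the kubernetes sample.
gate_seconds = {"gate1": 141.964, "gate2": 67.261, "gate3": 9.359}
total = round(sum(gate_seconds.values()), 3)

units = 162_917
max_cluster_share = 1.113 / 100  # reported as a percentage of all units
approx_max_cluster_units = round(units * max_cluster_share)

# Ratio of the historical Gate 2 baseline to the current Gate 2 time.
gate2_ratio = 138.869 / gate_seconds["gate2"]

print(f"total: {total}s")
print(f"largest cluster: about {approx_max_cluster_units} units")
print(f"Gate 2 baseline / current: {gate2_ratio:.2f}x")
```

The three gate timings sum to the reported total of 218.584s, and a 1.113% maximum cluster on 162,917 units corresponds to roughly 1,800 units.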
Quality guard evidence (beyond raw speed)
Gate 2 scope safety tests block invalid cross-scope merges (see tests/test_gate2.py).
Gate 3 tests verify no legacy namespace over-merge without import evidence (see tests/test_gate3.py).
Current full suite status: python3 -m pytest tests/ -x -q (project baseline target remains passing).
Evidence summary
Current published validation set: 100 scored runs across 5 repos and 3 models.
Full 20-run post-shipment benchmark refresh for BGI-TWIN context (task → COV → top-3 twins + seam + rubric) is complete: actionability 4.75/5 (p04 slice: 4.8/5), boundary 1.0, hallucinations 0.
Independent-model replication is now complete on azure/gpt-4o (20 runs) and gemini/auto (20 runs): GPT-4o actionability 4.85/5, Gemini actionability 4.25/5, both with zero hallucinations; Gemini boundary 0.95 reflects one genuine django/p02 miss.
Still missing: labeled precision/recall benchmark on an external corpus and head-to-head quantitative benchmark vs external tools on the same labeled dataset.
Language support tiers (explicit)
BGI does not treat all languages equally; support is tiered:
Query-backed (.scm): python, typescript, tsx, javascript, go, rust, java, csharp, php, ruby, kotlin, scala
Tree-sitter scanner + rule path: c, lua, elixir
Generic regex fallback by extension: swift, r, dart, bash, nim, zig, haskell, ocaml, fsharp, clojure, erlang, matlab, vb, crystal, cobol, groovy
Use this as a reliability signal: query-backed and dedicated scanner tiers are stronger than generic fallback.
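If you want to act on the tiers programmatically, for example to warn when a scan relies on the regex fallback, a small lookup like the following is enough. The language lists are copied from the tiers above; the function name and the tier labels are made up for this sketch and are not part of BGI's API.

```python
QUERY_BACKED = {
    "python", "typescript", "tsx", "javascript", "go", "rust",
    "java", "csharp", "php", "ruby", "kotlin", "scala",
}
TREE_SITTER_SCANNER = {"c", "lua", "elixir"}
REGEX_FALLBACK = {
    "swift", "r", "dart", "bash", "nim", "zig", "haskell", "ocaml",
    "fsharp", "clojure", "erlang", "matlab", "vb", "crystal", "cobol",
    "groovy",
}


def support_tier(lang: str) -> str:
    """Map a language name to its support tier (labels are illustrative)."""
    name = lang.strip().lower()
    if name in QUERY_BACKED:
        return "query-backed"
    if name in TREE_SITTER_SCANNER:
        return "tree-sitter scanner"
    if name in REGEX_FALLBACK:
        return "regex fallback"
    return "unknown"


for lang in ("Python", "lua", "zig", "fortran"):
    print(lang, "->", support_tier(lang))
```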
Cross-file edge density caveat: the language tiers above describe parser quality. A separate axis is cross-file behavioral edge density — how many key-lock pairs the scanner produces that link units in different files. Tier-1 (.scm-backed) languages produce dense cross-file edges. Tier-2 scanner-backed languages currently produce sparser cross-file edges because their token mix is dominated by structural tokens (INTAKE/OUTPUT/CONDITIONAL/LOOP) that gate-2 deliberately scopes to same-file to prevent O(N²) noise. The user-visible MCP product (boundary detection, twin retrieval, AI-assistant context) still works on tier-2 languages — see the validation evidence — but cluster-recovery benchmarks against import-graph baselines reflect this density gap. Concrete numbers in docs/VALIDATION_EVIDENCE.md.
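The effect of scoping structural tokens to the same file can be shown with a toy edge generator. This is a sketch under stated assumptions, not BGI's Gate 2: the INTAKE/OUTPUT pairing, the unit records, and the function name are hypothetical. The brute-force pairing loop is used only for brevity, and a real implementation would bucket units by scope rather than compare every pair.

```python
from itertools import product

# Structural tokens named in the caveat above: matched only within one file.
FILE_SCOPED = {"INTAKE", "OUTPUT", "CONDITIONAL", "LOOP"}

# Hypothetical key -> lock pairing, used only for this example.
KEY_LOCK = {"INIT": "TEARDOWN", "INTAKE": "OUTPUT"}

units = [
    {"id": "a.py::read", "file": "a.py", "token": "INTAKE"},
    {"id": "a.py::write", "file": "a.py", "token": "OUTPUT"},
    {"id": "b.py::write", "file": "b.py", "token": "OUTPUT"},
    {"id": "a.py::open", "file": "a.py", "token": "INIT"},
    {"id": "b.py::close", "file": "b.py", "token": "TEARDOWN"},
]


def scoped_edges(unit_list):
    """Pair key units with lock units, honoring the same-file mask."""
    edges = []
    for src, dst in product(unit_list, repeat=2):
        if KEY_LOCK.get(src["token"]) != dst["token"]:
            continue  # not a complementary key/lock pair
        same_file = src["file"] == dst["file"]
        if src["token"] in FILE_SCOPED and not same_file:
            continue  # mask blocks the cross-file structural match
        edges.append((src["id"], dst["id"]))
    return edges


print(scoped_edges(units))
```

Here the INTAKE unit in a.py links to the OUTPUT unit in the same file but not to the one in b.py, while the INIT/TEARDOWN pair is allowed to cross files. A language whose units carry mostly structural tokens therefore ends up with few cross-file edges.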
Limitations and non-goals
BGI is static analysis; it does not ingest runtime traces.
Cross-file semantic resolution is heuristic and language-dependent.
Cluster-size health is measured; full external precision/recall is not yet published.
Shared-host benchmarking introduces variance; decisions should use controlled medians.
Install
pip install -e .
Quickstart commands
# scan
bgi scan /path/to/repo --lang auto --out bgi-graph.json
# optional outputs
bgi scan /path/to/repo --lang auto \
--fuse-graph fuse-graph.json \
--routes routes.json \
--graphml graph.graphml \
--html
# incremental
bgi scan /path/to/repo --lang auto --incremental --cache .bgi-cache.json
# diff
bgi diff /path/before /path/after --lang auto --out diff.json
# run MCP server over generated artifacts
bgi mcp --graph bgi-graph.json --fuse-graph fuse-graph.json
Example MCP usage pattern (from your client prompt):
Use MCP tool twin_context for:
"Add endpoint that validates input and persists data."
Return top twin candidate, seam suggestion, and rubric checklist.
Telemetry
BGI ships with opt-in, off-by-default anonymous telemetry. To enable:
export BGI_TELEMETRY=1
bgi mcp --graph bgi-graph.json --fuse-graph fuse-graph.json
What's collected when enabled: BGI version, OS, repo size bucket, and a 12-char hash of your repo's git remote (so we can deduplicate "same repo seen twice" without ever knowing which repo). What's never collected: file paths, source code, repo names, user identity, or IP addresses. Full schema and disable instructions in docs/TELEMETRY.md.
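To make the "12-char hash" idea concrete, here is one way such a fingerprint could be derived. The choice of SHA-256, the lack of a salt, and the function name are assumptions for illustration; docs/TELEMETRY.md is the authority on what BGI actually does.

```python
import hashlib


def repo_fingerprint(remote_url: str) -> str:
    """Return a 12-character hex fingerprint of a git remote URL.

    Assumes SHA-256 with no salt; BGI's real scheme may differ.
    """
    digest = hashlib.sha256(remote_url.strip().encode("utf-8")).hexdigest()
    return digest[:12]


first = repo_fingerprint("https://example.com/org/repo.git")
second = repo_fingerprint("https://example.com/org/repo.git")
other = repo_fingerprint("https://example.com/org/another.git")

print(first, first == second, first == other)
```

The same remote always maps to the same short value, so repeated sightings can be deduplicated, while the truncated digest cannot be reversed into the URL.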
Documentation map
MEMORANDUM.md - design contracts and invariants
docs/LANGUAGE_SUPPORT.md - language implementation details
docs/CONTRIBUTING_LANGUAGES.md - language contribution guide
docs/INDEX_SCHEMA.md - interactive index schema
docs/QUERY_PLANNER.md - query planner scoring
docs/MCP_SETUP.md - MCP server setup and usage
docs/MCP_WITH_CONTINUE.md - 5-minute Continue + BGI walkthrough
docs/TELEMETRY.md - opt-in telemetry: what we collect and how to disable
https://bigindexer.com/validation - public validation evidence
docs/MCP_QUICKSTART_DEMO.md - 5-minute demo walkthrough
docs/MCP_EXAMPLE_TRANSCRIPTS.md - real-world MCP tool invocation examples
docs/MCP_REAL_TRANSCRIPT.md - unedited transcript from FastAPI analysis
scripts/mcp-demo.sh - automated demo script for multiple CLIs and repositories
License and Copyright
License: Apache License 2.0 (LICENSE)
Contributor terms: Developer Certificate of Origin (DCO) enforced on pull requests