index_codebase
by vrppaul

Index Python codebases for semantic search by scanning files, extracting functions and classes, and generating embeddings that enable natural-language queries for finding relevant code snippets.

Instructions

Index a codebase for semantic search.

Scans Python files, extracts functions/classes/methods, generates embeddings, and stores them for fast semantic search.

Use force=True to re-index everything even if files haven't changed. Otherwise, only new and modified files are indexed (incremental).

Args:
    project_path: Absolute path to the project root directory.
    force: If True, re-index all files regardless of changes.

Returns:
    Statistics about the indexing operation.
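The incremental behavior described above (skip unchanged files unless force=True) can be sketched with a simple mtime-based cache. This is an illustration only: the cache format and the function name detect_changed_files are assumptions, and the server's actual FileChangeCache may track content hashes instead.

```python
import json
from pathlib import Path


def detect_changed_files(
    files: list[Path], cache_path: Path, force: bool = False
) -> list[Path]:
    """Return files that are new or modified since the last index run.

    With force=True every file is returned regardless of the cache.
    The cache format here ({"path": mtime}) is an assumption for this
    sketch; the real implementation may differ.
    """
    if force:
        return list(files)
    cached: dict[str, float] = {}
    if cache_path.exists():
        cached = json.loads(cache_path.read_text())
    # A file is "changed" when its mtime differs from the cached value,
    # or when it has no cache entry at all (new file).
    return [f for f in files if cached.get(str(f)) != f.stat().st_mtime]
```

This is why repeated calls on an unchanged project return quickly: the change list is empty, so no chunking or embedding work is scheduled.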

Input Schema

Name          Required
project_path  Yes
force         No

Output Schema

Name    Required
result  Yes

Implementation Reference

  • The main handler for the index_codebase tool: decorated with @mcp.tool() for registration, it validates the project path, creates an IndexService via the container, runs indexing with progress reporting, and returns an IndexCodebaseResponse with statistics.
    @mcp.tool()
    @profile_async("index_codebase")
    async def index_codebase(
        project_path: str,
        ctx: Context[ServerSession, None],
        force: bool = False,
    ) -> IndexCodebaseResponse | ErrorResponse:
        """Index a codebase for semantic search.
    
        Scans Python files, extracts functions/classes/methods, generates embeddings,
        and stores them for fast semantic search.
    
        Use force=True to re-index everything even if files haven't changed.
        Otherwise, only new and modified files are indexed (incremental).
    
        Args:
            project_path: Absolute path to the project root directory.
            force: If True, re-index all files regardless of changes.
    
        Returns:
            Statistics about the indexing operation.
        """
        await ctx.info(f"Indexing: {project_path}")
    
        path = Path(project_path)
        if not path.exists():
            await ctx.warning(f"Project path does not exist: {project_path}")
            return ErrorResponse(error=f"Path does not exist: {project_path}")
    
        container = get_container()
        index_service = container.create_index_service(path)
        result = await index_service.index(path, force=force, on_progress=ctx.report_progress)
    
        await ctx.info(
            f"Indexed {result.files_indexed} files, {result.chunks_indexed} chunks "
            f"in {result.duration_seconds:.2f}s"
        )
    
        return IndexCodebaseResponse(
            files_indexed=result.files_indexed,
            chunks_indexed=result.chunks_indexed,
            files_deleted=result.files_deleted,
            duration_seconds=result.duration_seconds,
        )
  • Response schema for the index_codebase tool: defines the returned structure with files_indexed, chunks_indexed, files_deleted, and duration_seconds fields.
    class IndexCodebaseResponse(BaseModel):
        """Response from index_codebase tool."""
    
        files_indexed: int
        chunks_indexed: int
        files_deleted: int
        duration_seconds: float
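The response maps to a flat JSON payload. A client-side sketch of consuming it, using a dataclass stand-in (the name IndexStats and the sample numbers are illustrative; the server itself uses the pydantic model above):

```python
from dataclasses import asdict, dataclass


@dataclass
class IndexStats:
    """Client-side view of the index_codebase result payload.

    A dataclass stand-in for illustration; the server returns the
    pydantic IndexCodebaseResponse model shown above.
    """

    files_indexed: int
    chunks_indexed: int
    files_deleted: int
    duration_seconds: float


# Example payload shape a client might receive (values are made up).
payload = {
    "files_indexed": 12,
    "chunks_indexed": 240,
    "files_deleted": 1,
    "duration_seconds": 3.141,
}
stats = IndexStats(**payload)
assert asdict(stats) == payload
```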
  • Core IndexService.index() method: orchestrates the full indexing pipeline (file scanning, change detection, chunking, embedding, storage, and cache updates) with progress callbacks.
    async def index(
        self,
        project_path: Path,
        force: bool = False,
        on_progress: ProgressCallback | None = None,
    ) -> IndexResult:
        """Full index: scan, detect changes, chunk, embed, with timing + progress.
    
        Args:
            project_path: Root directory of the project.
            force: If True, re-index all files regardless of changes.
            on_progress: Optional callback matching ctx.report_progress(progress, total, message).
    
        Returns:
            IndexResult with counts and total duration.
        """
        start = time.perf_counter()
    
        async def _progress(percent: float, message: str) -> None:
            if on_progress is not None:
                await on_progress(percent, 100, message)
    
        await _progress(5, "Scanning files...")
        files = await asyncio.to_thread(self.scan_files, project_path)
    
        await _progress(10, f"Found {len(files)} files, detecting changes...")
        plan = self.detect_changes(project_path, files, force=force)
    
        if not plan.has_work:
            return IndexResult(
                files_indexed=0,
                chunks_indexed=0,
                files_deleted=0,
                duration_seconds=round(time.perf_counter() - start, 3),
            )
    
        await _progress(20, f"Chunking {len(plan.files_to_index)} files...")
        chunks = await self.chunk_files(plan.files_to_index)
    
        await _progress(70, "Embedding and storing...")
        await self.indexer.embed_and_store(plan, chunks)
    
        # Update cache after successful embed+store
        cache_dir = resolve_cache_dir(self.settings, project_path, self._cache_dir)
        cache = FileChangeCache(cache_dir)
        if plan.files_to_delete:
            cache.remove_files(plan.files_to_delete)
        if plan.files_to_index:
            cache.update_files(plan.files_to_index)
    
        return IndexResult(
            files_indexed=len(plan.files_to_index),
            chunks_indexed=len(chunks),
            files_deleted=len(plan.files_to_delete),
            duration_seconds=round(time.perf_counter() - start, 3),
        )
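The on_progress callback above matches the ctx.report_progress(progress, total, message) signature. A minimal async consumer, useful as a test harness, might look like this; fake_index only replays the milestone percentages from IndexService.index() and is not part of the server.

```python
import asyncio

events: list[tuple[float, float, str]] = []


async def record_progress(progress: float, total: float, message: str) -> None:
    """Collect events in the shape ctx.report_progress forwards them."""
    events.append((progress, total, message))


async def fake_index(on_progress) -> None:
    # Replays the milestone percentages used by IndexService.index();
    # the real method does actual work between these calls.
    for pct, msg in [
        (5, "Scanning files..."),
        (10, "Found 3 files, detecting changes..."),
        (20, "Chunking 3 files..."),
        (70, "Embedding and storing..."),
    ]:
        await on_progress(pct, 100, msg)


asyncio.run(fake_index(record_progress))
```

Because _progress always passes total=100, clients can treat progress directly as a percentage.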
  • File scanning logic: scans for source files with supported extensions, using git ls-files when available and falling back to os.walk with directory pruning and .gitignore support.
    def scan_files(self, project_path: Path) -> list[str]:
        """Scan for source files with supported extensions.
    
        Uses git ls-files if available (fast, respects .gitignore).
        Falls back to os.walk with directory pruning.
    
        Args:
            project_path: Root directory to scan.
    
        Returns:
            List of absolute file paths.
        """
        project_path = project_path.resolve()
    
        if self._is_git_repo(project_path):
            files = self._scan_with_git(project_path)
            if files is not None:
                log.debug("scanned_files_git", project=str(project_path), count=len(files))
                return files
    
        files = self._scan_with_walk(project_path)
        log.debug("scanned_files_walk", project=str(project_path), count=len(files))
        return files
  • Dependency injection method: creates an IndexService wired with a cached vector store, embedder, chunker, and cache directory so resources are shared across requests.
    def create_index_service(self, project_path: Path) -> IndexService:
        """Create an IndexService wired to cached store/embedder."""
        indexer = Indexer(embedder=self.embedder, store=self.get_store(project_path))
        return IndexService(
            settings=self.settings,
            indexer=indexer,
            chunker=self.create_chunker(),
            cache_dir=get_index_path(self.settings, project_path),
        )
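The container's get_store(project_path) call above implies one cached store per project. A minimal sketch of that caching pattern (the class name StoreContainer is hypothetical, and a plain object() stands in for a real vector store):

```python
from pathlib import Path


class StoreContainer:
    """Minimal sketch of per-project store caching.

    The real container also wires embedders, chunkers, and settings;
    here only the caching behavior is shown.
    """

    def __init__(self) -> None:
        self._stores: dict[Path, object] = {}

    def get_store(self, project_path: Path) -> object:
        # Resolve the path so "./proj" and its absolute form share one store.
        key = project_path.resolve()
        if key not in self._stores:
            self._stores[key] = object()  # stand-in for a real vector store
        return self._stores[key]
```

Caching on the resolved path means repeated tool calls against the same project reuse one store instead of reopening it per request.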
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden, and it does well: it discloses what gets indexed (Python files, functions/classes/methods), the incremental vs. full re-indexing behavior, and that results are stored for semantic search. It doesn't mention rate limits or error handling, but it covers the core operation thoroughly.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured and appropriately sized: purpose statement first, then behavioral details, then parameter explanations, then return value. Every sentence earns its place with no redundancy. The Args/Returns sections are clearly labeled and informative.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (codebase indexing with embeddings), the absence of annotations, and 0% schema description coverage, the description is complete enough, helped by the presence of an output schema. It explains what the tool does, when to use force, what gets indexed, and what statistics are returned. The output schema covers return values, so the description appropriately focuses on operation semantics.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, so the description must compensate fully. It provides excellent parameter semantics: explains project_path as 'Absolute path to the project root directory' and force as controlling whether to 're-index everything even if files haven't changed' vs. 'only new and modified files.' This adds crucial meaning beyond the bare schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose with specific verbs ('index', 'scans', 'extracts', 'generates', 'stores') and resources ('codebase', 'Python files', 'functions/classes/methods', 'embeddings'). It distinguishes from sibling tools by focusing on indexing rather than checking status (index_status) or searching (search_code).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explicitly provides usage guidance: 'Use force=True to re-index everything even if files haven't changed. Otherwise, only new and modified files are indexed (incremental).' This gives clear when/when-not criteria for the force parameter and distinguishes from incremental indexing behavior.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
