Tea Rags MCP

indexing-repositories.md•12.4 KiB

--- title: Indexing Repositories sidebar_position: 3 --- import AiQuery from '@site/src/components/AiQuery'; # Indexing Repositories This page covers everything you need to configure, run, and maintain a code index with TeaRAGs. ## Environment Variables All code-vectorization settings are controlled via environment variables passed to the MCP server. <details> <summary>Indexing variables reference</summary> | Variable | Description | Default | | -------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- | ------- | | `CODE_CHUNK_SIZE` | Maximum chunk size in characters. Larger values give more context per chunk; smaller values give more granular search results. | 2500 | | `CODE_CHUNK_OVERLAP` | Overlap between adjacent chunks in characters. Prevents loss of context at chunk boundaries. | 300 | | `CODE_ENABLE_AST` | Use AST-aware chunking (tree-sitter) instead of line-based splitting. Preserves semantic boundaries like functions, classes, and methods. | true | | `CODE_BATCH_SIZE` | Number of chunks per embedding batch sent to the provider. Higher values improve throughput but increase memory usage. | 100 | | `CODE_CUSTOM_EXTENSIONS` | Additional file extensions to index (comma-separated). Example: `.proto,.graphql,.prisma` | - | | `CODE_CUSTOM_IGNORE` | Additional ignore patterns (comma-separated). Example: `**/*.generated.ts,**/dist/**` | - | | `CODE_DEFAULT_LIMIT` | Default number of search results returned when the caller does not specify a limit. | 5 | | `CODE_ENABLE_GIT_METADATA` | Enrich every chunk with git blame data: authors, commit count, code age, and task IDs extracted from commit messages. | false | See also the full [Configuration Variables](/config/environment-variables) page for embedding, Qdrant batch pipeline, and performance variables. </details> ## How Chunking Works TeaRAGs splits source files into **chunks** — self-contained code fragments that become individual vectors in the search index. The chunking strategy directly affects search quality: well-bounded chunks return precise, meaningful results; poorly split chunks return noisy, incomplete fragments. ### AST-Aware Chunking (default) Enabled by default (`CODE_ENABLE_AST=true`). Uses [tree-sitter](https://tree-sitter.github.io/tree-sitter/) to parse the code into an Abstract Syntax Tree, then splits along **semantic boundaries** — functions, classes, methods, interfaces. **Why it matters:** - A function stays as one chunk, not split in the middle of a loop - Class methods are extracted as individual chunks with parent class context preserved (`parentName`, `parentType` metadata) - Comments and decorators stay attached to the code they describe - Large classes are automatically decomposed: methods become individual chunks, non-method body code is extracted separately **Two-level extraction:** 1. **Top-level nodes** — functions, classes, modules, interfaces are identified as chunk candidates 2. **Child extraction** — if a container (class/module) exceeds `2x CODE_CHUNK_SIZE`, its methods are extracted as individual chunks while preserving parent context in metadata **Fallback:** If a language has no tree-sitter grammar, or a node is still too large after child extraction, TeaRAGs falls back to character-based splitting with overlap — trying to break at empty lines or closing braces. :::tip AST-aware chunking is the recommended mode for all code. Disable it (`CODE_ENABLE_AST=false`) only for plain-text or unstructured files where tree-sitter adds no value. ::: ### Embedding Dimensions The embedding dimension is determined automatically based on your model — no configuration needed in most cases. | Provider | Model | Dimensions | |----------|-------|------------| | Ollama | `unclemusclez/jina-embeddings-v2-base-code` (default) | 768 | | Ollama | `nomic-embed-text` | 768 | | Ollama | `mxbai-embed-large` | 1024 | | OpenAI | `text-embedding-3-small` | 1536 | | Cohere | `embed-english-v3.0` | 1024 | | Voyage | `voyage-code-2` | 1536 | Override with `EMBEDDING_DIMENSIONS` only if your model is not in the built-in registry. :::warning Changing the embedding model or dimensions after indexing requires a **full reindex** — existing vectors are incompatible with a different dimension. ::: ## File Filtering ### Built-in Exclusions The indexer automatically skips common non-source directories and files: - `node_modules/`, `vendor/`, `.git/`, `dist/`, `build/`, `coverage/` - Binary files, images, lock files, and minified bundles ### Using .contextignore Create a `.contextignore` file in your project root to define project-specific exclusions. The syntax is identical to `.gitignore`: ```gitignore # Exclude test files from the index **/test/** **/*.test.ts **/*.spec.ts # Exclude generated code *.generated.ts **/dist/** # Exclude fixtures and mocks **/fixtures/** **/mocks/** **/__tests__/** ``` `.contextignore` rules are applied **in addition to** `.gitignore`. If a file is ignored by either, it will not be indexed. ### Custom Extensions By default, the indexer recognizes common source-code extensions. To add non-standard file types: ```bash export CODE_CUSTOM_EXTENSIONS=".proto,.graphql,.prisma" ``` ### Custom Ignore Patterns To add ignore patterns without creating a `.contextignore` file: ```bash export CODE_CUSTOM_IGNORE="**/*.generated.ts,**/dist/**,**/coverage/**" ``` ## Indexing Workflow ### Full Index Run a full index the first time you set up a codebase: <AiQuery>Index this codebase for semantic search</AiQuery> The indexer will: 1. Discover files (respecting `.gitignore`, `.contextignore`, built-in exclusions) 2. Parse each file with the appropriate language parser (AST or line-based) 3. Split into chunks preserving semantic boundaries 4. Generate vector embeddings in batches 5. Upsert chunks into a Qdrant collection 6. Save a file-hash snapshot for future incremental updates ### Incremental Reindex After the initial index, use incremental reindex to update only what changed: <AiQuery>Update the search index with my recent changes</AiQuery> The reindexer compares file hashes from the snapshot against the current state and processes only added, modified, or deleted files. This is significantly faster than a full re-index. ### Force Reindex If you change chunking settings (e.g., `CODE_CHUNK_SIZE`) or embedding model, the existing chunks are no longer compatible. Force a complete re-index: <AiQuery>Reindex the entire codebase from scratch</AiQuery> This drops the existing collection and rebuilds from scratch. Time to brew a fresh cup. ### Check Index Status Monitor indexing progress and verify readiness: <AiQuery>Show me stats for the current index</AiQuery> The status endpoint reports three states: | Status | Meaning | | ------------- | --------------------------------------------------------------------------------- | | `not_indexed` | No index exists for this path. Run indexing first. | | `indexing` | Indexing is in progress. Shows the number of chunks processed so far. | | `indexed` | Index is ready. Returns collection name, chunk count, and last-updated timestamp. | Example response when indexed: ```json { "isIndexed": true, "status": "indexed", "collectionName": "code_a3f8d2e1", "chunksCount": 1823, "lastUpdated": "2025-01-30T10:15:00Z" } ``` ## Best Practices ### 1. Optimize Chunk Size The right chunk size depends on your codebase style: | Codebase Type | Recommended `CODE_CHUNK_SIZE` | Recommended `CODE_CHUNK_OVERLAP` | | ------------------------ | ----------------------------- | -------------------------------- | | Small, focused functions | 1500 - 2000 | 200 | | Large classes / modules | 3000 - 4000 | 400 | | Documentation (Markdown) | 2000 - 2500 | 300 | Smaller chunks produce more granular results but lose surrounding context. Larger chunks preserve context but may dilute relevance if the chunk contains unrelated code. Start with the defaults (2500 / 300) and adjust based on search quality. ### 2. Filter Aggressively Exclude noise from the index to improve both speed and search relevance: ```gitignore # .contextignore **/vendor/** **/node_modules/** **/*.min.js **/*.bundle.js **/coverage/** **/.git/** **/dist/** **/build/** ``` Every excluded file saves embedding time and reduces index size. ### 3. Choose the Right Embedding Model | Use Case | Recommended Model | | -------------------------- | ---------------------------------------------------- | | General code search | `nomic-embed-text` (Ollama) | | Code-specialized (default) | `unclemusclez/jina-embeddings-v2-base-code` (Ollama) | | Multilingual code | `jina-embeddings-v2-base-code` | | Maximum accuracy | `voyage-code-2` (Voyage AI) | For most projects, the default Jina code embeddings via Ollama offer the best balance of quality, speed, and privacy. See the [Embedding Providers](/config/providers) page for detailed comparisons. ### 4. Use Incremental Updates After the initial full index, always prefer incremental reindex: <AiQuery>Update the search index with my recent changes</AiQuery> This avoids re-embedding unchanged files and keeps your index fresh with minimal latency. ### 5. Monitor Index Status Before searching, verify the index is ready: <AiQuery>Show me stats for the current index</AiQuery> This is especially useful after long-running indexing operations on large codebases (10k+ files). ### 6. Enable Git Metadata When Needed Git metadata enrichment (`CODE_ENABLE_GIT_METADATA=true`) adds significant value for analytics — ownership reports, tech debt detection, hotspot analysis. Enrichment runs concurrently with embedding and does not increase indexing time. Enable it when you plan to use rerank presets like `techDebt`, `hotspots`, `ownership`, or `codeReview`. ## Troubleshooting ### Search Returns No Results 1. **Check index status**: ask your agent to show index stats and confirm the status is `indexed` (not `indexing` or `not_indexed`) 2. **Verify files are not ignored**: check `.gitignore`, `.contextignore`, and `CODE_CUSTOM_IGNORE` for overly broad patterns 3. **Broaden your query**: semantic search works best with natural language; instead of `"getUserById"`, try `"user retrieval by ID"` 4. **Check the collection name**: if you moved or renamed the project directory, the collection name hash will change and a new index is needed ### Slow Indexing 1. **Use local Ollama** instead of a cloud provider to avoid network latency and rate limits: ```bash export EMBEDDING_PROVIDER=ollama ``` 2. **Increase batch size** if your system has enough memory: ```bash export CODE_BATCH_SIZE=200 ``` 3. **Exclude large or binary files** that produce low-value chunks (lock files, minified bundles, vendored code) 4. **Increase embedding concurrency** for cloud providers with spare rate-limit headroom: ```bash export EMBEDDING_CONCURRENCY=4 ``` ### Memory Issues 1. **Reduce chunk size** to lower per-batch memory: ```bash export CODE_CHUNK_SIZE=1500 ``` 2. **Reduce batch size** to process fewer chunks at once: ```bash export CODE_BATCH_SIZE=50 ``` 3. **Index subdirectories separately** for very large monorepos — ask your agent to index each service path individually ### Files Not Being Indexed 1. Check that the file extension is in the supported list or added via `CODE_CUSTOM_EXTENSIONS` 2. Verify the file is not matched by `.gitignore`, `.contextignore`, or `CODE_CUSTOM_IGNORE` 3. Ensure the file is not empty or binary For more troubleshooting scenarios see [Troubleshooting](/operations/troubleshooting) and [Recovery & Reindexing](/operations/recovery-reindexing).

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/artk0de/TeaRAGs-MCP'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

indexing-repositories.md•12.4 KiB