Tea Rags MCP

CODE_VECTORIZATION.md•7.68 KiB

# Code Vectorization Comprehensive guide to indexing and searching codebases with semantic code search. ## Overview Code vectorization transforms your source code into searchable vector embeddings, enabling: - **Natural language search** - ask questions about your code in plain English - **Semantic understanding** - find code by intent, not just keywords - **Cross-language search** - find similar patterns across different languages - **Git-aware context** - understand authorship, code age, and task history ## Quick Start ```bash # Index your codebase /mcp__qdrant__index_codebase /path/to/your/project # Search your code /mcp__qdrant__search_code /path/to/your/project "authentication middleware" # Update after changes /mcp__qdrant__reindex_changes /path/to/your/project ``` ## How It Works ### 1. File Discovery The indexer scans your project respecting: - `.gitignore` patterns - `.contextignore` patterns (project-specific overrides) - Built-in exclusions (node_modules, vendor, etc.) ### 2. AST-Aware Chunking Code is intelligently split using language-aware parsers: | Language | Parser | Features | |----------|--------|----------| | TypeScript/JavaScript | tree-sitter | Classes, functions, imports | | Python | tree-sitter | Classes, functions, decorators | | Ruby | tree-sitter | Classes, modules, methods | | Go | tree-sitter | Structs, functions, interfaces | | Java/C#/C++ | tree-sitter | Classes, methods, namespaces | | Markdown | remark | Headers, sections, code blocks | | Others | Line-based | Fallback chunking | Each chunk preserves: - **Semantic boundaries** - functions, classes, methods - **Parent context** - `parentName`, `parentType` for nested code - **Location info** - file path, line numbers - **Language metadata** - for filtering ### 3. Git Metadata Enrichment When `CODE_ENABLE_GIT_METADATA=true`: ```typescript { // Blame data (per chunk, based on line coverage) authors: ["alice@example.com", "bob@example.com"], lastModified: "2025-01-15T10:30:00Z", commitCount: 12, codeAge: 45, // days since oldest line // Task IDs extracted from commit messages taskIds: ["PROJ-123", "GH-456"], } ``` **Task ID patterns detected:** - JIRA: `PROJ-123`, `ABC-1` - GitHub: `#123`, `GH-123` - Azure DevOps: `AB#123` - GitLab: `!123`, `GL-123` ### 4. Vector Embedding Chunks are converted to vectors using your configured embedding provider: - **Ollama** (recommended) - local, private, no API costs - **OpenAI** - `text-embedding-3-small/large` - **Cohere** - `embed-english-v3.0` - **Voyage** - `voyage-code-2` (code-specialized) ### 5. Incremental Indexing After initial indexing, `reindex_changes` detects: - **Added files** - new files since last index - **Modified files** - changed content (hash-based detection) - **Deleted files** - removed files Only affected chunks are updated, making re-indexing fast. ## Configuration ### Environment Variables | Variable | Description | Default | |----------|-------------|---------| | `CODE_CHUNK_SIZE` | Maximum chunk size in characters | 2500 | | `CODE_CHUNK_OVERLAP` | Overlap between chunks | 300 | | `CODE_ENABLE_AST` | Use AST-aware chunking | true | | `CODE_BATCH_SIZE` | Chunks per embedding batch | 100 | | `CODE_CUSTOM_EXTENSIONS` | Additional file extensions | - | | `CODE_CUSTOM_IGNORE` | Additional ignore patterns | - | | `CODE_DEFAULT_LIMIT` | Default search result count | 5 | | `CODE_ENABLE_GIT_METADATA` | Enable git blame enrichment | false | ### File Filtering #### Using .contextignore Create a `.contextignore` file in your project root: ```gitignore # Exclude test files from indexing **/test/** **/*.test.ts **/*.spec.ts # Exclude generated code *.generated.ts **/dist/** # Exclude specific directories **/fixtures/** **/mocks/** ``` #### Custom Extensions ```bash # Add non-standard file types export CODE_CUSTOM_EXTENSIONS=".proto,.graphql,.prisma" ``` ## Search Features ### Basic Search ```bash /mcp__qdrant__search_code /path/to/project "how does authentication work?" ``` ### Filtered Search ```bash # By file type /mcp__qdrant__search_code /path "error handling" --fileTypes .ts,.tsx # By path pattern /mcp__qdrant__search_code /path "validation" --pathPattern src/api/** # By author (requires git metadata) /mcp__qdrant__search_code /path "recent changes" --author alice@example.com # Documentation only (markdown files) /mcp__qdrant__search_code /path "setup instructions" --documentationOnly true ``` ### Hybrid Search Combines semantic understanding with keyword matching: ```bash /mcp__qdrant__search_code /path "getUserById function" --hybrid true ``` Best for: - Function/variable names - Error messages - Technical terms ## Supported Languages ### Full AST Support | Language | Extensions | Features | |----------|------------|----------| | TypeScript | `.ts`, `.tsx` | Classes, functions, interfaces, types | | JavaScript | `.js`, `.jsx`, `.mjs` | Classes, functions, exports | | Python | `.py` | Classes, functions, decorators, async | | Ruby | `.rb` | Classes, modules, methods, blocks | | Go | `.go` | Structs, functions, interfaces | | Java | `.java` | Classes, methods, interfaces | | C# | `.cs` | Classes, methods, namespaces | | C/C++ | `.c`, `.cpp`, `.h` | Functions, structs, classes | | Rust | `.rs` | Structs, impl blocks, traits | | PHP | `.php` | Classes, functions, traits | | Markdown | `.md` | Sections, code blocks | ### Line-Based Fallback All other text files use intelligent line-based chunking with configurable size and overlap. ## Best Practices ### 1. Optimize Chunk Size | Codebase Type | Recommended Size | Overlap | |---------------|-----------------|---------| | Small functions | 1500-2000 | 200 | | Large classes | 3000-4000 | 400 | | Documentation | 2000-2500 | 300 | ### 2. Use Incremental Updates Always use `reindex_changes` after initial indexing: ```bash # First time: full index /mcp__qdrant__index_codebase /path/to/project # Subsequent: incremental only /mcp__qdrant__reindex_changes /path/to/project ``` ### 3. Filter Aggressively Exclude noise from indexing: ```gitignore # .contextignore **/vendor/** **/node_modules/** **/*.min.js **/*.bundle.js **/coverage/** **/.git/** ``` ### 4. Choose the Right Embedding Model | Use Case | Recommended Model | |----------|-------------------| | General code search | `nomic-embed-text` | | Code-specialized | `jina-embeddings-v2-base-code` | | Multilingual code | `jina-embeddings-v2-base-code` | | Maximum accuracy | `voyage-code-2` | ### 5. Monitor Index Status ```bash # Check before searching /mcp__qdrant__get_index_status /path/to/project ``` ## Use Cases ### New Developer Onboarding ```bash /mcp__qdrant__search_code /workspace/app "authentication flow" /mcp__qdrant__search_code /workspace/app "database connection setup" /mcp__qdrant__search_code /workspace/app "API endpoint structure" ``` ### Bug Investigation ```bash /mcp__qdrant__search_code /workspace/app "payment error handling" /mcp__qdrant__search_code /workspace/app "retry logic" --pathPattern src/services/** ``` ### Code Review ```bash /mcp__qdrant__search_code /workspace/app "validation patterns" /mcp__qdrant__search_code /workspace/app "security checks" ``` ### Documentation Writing ```bash /mcp__qdrant__search_code /workspace/app "public API methods" /mcp__qdrant__search_code /workspace/app "configuration options" ``` ## Troubleshooting ### Search Returns No Results 1. Check index status: `/mcp__qdrant__get_index_status /path` 2. Verify files are not ignored 3. Try broader, more natural queries ### Slow Indexing 1. Use local Ollama instead of cloud providers 2. Increase `CODE_BATCH_SIZE` 3. Exclude large/binary files ### Memory Issues 1. Reduce `CODE_CHUNK_SIZE` 2. Reduce `CODE_BATCH_SIZE` 3. Index subdirectories separately See also: [Performance Tuning](./PERFORMANCE_TUNING.md)

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/artk0de/TeaRAGs-MCP'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

CODE_VECTORIZATION.md•7.68 KiB