# How Search Works
CodeRAG uses chunk-level indexing with BM25 scoring to deliver fast, accurate code search. Unlike traditional file-level search, CodeRAG operates at the granularity of individual code blocks (functions, classes, etc.).
## Chunk-Level Indexing
CodeRAG indexes code at the chunk level rather than the file level. Each chunk represents a semantic unit extracted through AST parsing.
**Why chunk-level?**
- More precise search results pointing to specific functions or classes
- Better relevance scoring (matches terms within the same function, not scattered across a large file)
- Enables line-level navigation with startLine and endLine metadata
**Example:**
For a TypeScript file with 3 functions, CodeRAG creates 3 separate searchable chunks:
```typescript
// File: utils.ts

// Chunk 1: FunctionDeclaration (lines 1-3)
export function parseQuery(query: string): string[] {
  return query.toLowerCase().split(/\s+/)
}

// Chunk 2: FunctionDeclaration (lines 5-7)
export function calculateScore(tf: number, idf: number): number {
  return tf * idf
}

// Chunk 3: FunctionDeclaration (lines 9-11)
export function normalizeVector(vec: number[]): number[] {
  const magnitude = Math.sqrt(vec.reduce((sum, v) => sum + v * v, 0))
  return vec.map(v => v / magnitude)
}
```
Each chunk is indexed independently with its own TF-IDF vector.
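To make this concrete, here is one plausible shape for an indexed chunk record. The field names are illustrative, not CodeRAG's exact internal types:

```typescript
// Illustrative chunk record; CodeRAG's actual types may differ.
interface IndexedChunk {
  chunkId: string                 // stable identifier for the chunk
  filePath: string                // e.g. 'src/utils.ts'
  nodeType: string                // AST node kind, e.g. 'FunctionDeclaration'
  startLine: number               // enables line-level navigation
  endLine: number
  tokenCount: number              // |C| in the BM25 formula below
  termFreqs: Map<string, number>  // raw term frequencies used for scoring
}
```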
## StarCoder2 Tokenization
CodeRAG uses the StarCoder2 tokenizer for code-aware tokenization. Because it is trained on source code, it segments identifiers, operators, and keywords more faithfully than generic text tokenizers.
**Advantages:**
- Preserves camelCase and snake_case as single tokens (`getUserById` stays intact, not split into `get`, `User`, `By`, `Id`)
- Recognizes common code patterns (operators, keywords, identifiers)
- Language-agnostic (works across 15+ programming languages)
**Implementation:**
```typescript
import { tokenize } from '@sylphx/coderag'
const tokens = await tokenize('function getUserById(id: string)')
// Returns: ['function', 'getUserById', '(', 'id', ':', 'string', ')']
```
Tokenization is asynchronous because the StarCoder2 tokenizer runs as a WASM module.
## BM25 Scoring Formula
BM25 (Best Matching 25) improves upon basic TF-IDF with two key enhancements:
1. **Term frequency saturation (k1 parameter)**: Diminishing returns for repeated terms
2. **Document length normalization (b parameter)**: Adjusts for chunk length
**Formula:**
```
score(C, Q) = Σ IDF(qi) * (f(qi, C) * (k1 + 1)) / (f(qi, C) + k1 * (1 - b + b * |C| / avgdl))
```
Where:
- `C` = chunk (document)
- `Q` = query
- `f(qi, C)` = raw frequency of term qi in chunk C
- `|C|` = chunk length (token count)
- `avgdl` = average chunk length across all chunks
- `k1 = 1.2` (term frequency saturation)
- `b = 0.75` (length normalization)
**Parameters:**
```typescript
// From packages/core/src/tfidf.ts
const BM25_K1 = 1.2 // Typical range: 1.2-2.0
const BM25_B = 0.75 // 0 = no normalization, 1 = full normalization
```
These are industry-standard values from Elasticsearch and Lucene.
**How it works:**
For a query `"async function error"`, BM25 scores each chunk by:
1. Tokenizing the query: `["async", "function", "error"]`
2. For each chunk, calculating term scores using the formula above
3. Summing term scores to get final chunk score
4. Ranking chunks by score descending
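The sketch below walks through these steps as a minimal reference implementation of the formula above. It is not CodeRAG's production code; the `ChunkDoc` shape and the Lucene-style IDF smoothing are assumptions:

```typescript
// Minimal BM25 reference implementation. ChunkDoc and the IDF smoothing
// are illustrative assumptions, not CodeRAG's internal types.
interface ChunkDoc {
  chunkId: string
  tokenCount: number              // |C|: chunk length in tokens
  termFreqs: Map<string, number>  // f(qi, C): raw term frequencies
}

const BM25_K1 = 1.2
const BM25_B = 0.75

function rankChunks(queryTokens: string[], chunks: ChunkDoc[], limit: number) {
  const N = chunks.length
  const avgdl = chunks.reduce((sum, c) => sum + c.tokenCount, 0) / N

  // Precompute IDF per query term (Lucene-style smoothing keeps it positive)
  const idf = new Map<string, number>()
  for (const term of queryTokens) {
    const n = chunks.filter(c => c.termFreqs.has(term)).length
    idf.set(term, Math.log(1 + (N - n + 0.5) / (n + 0.5)))
  }

  return chunks
    .map(chunk => {
      let score = 0
      for (const term of queryTokens) {
        const f = chunk.termFreqs.get(term) ?? 0
        if (f === 0) continue
        // Length normalization: penalize chunks longer than average
        const norm = 1 - BM25_B + BM25_B * (chunk.tokenCount / avgdl)
        score += idf.get(term)! * (f * (BM25_K1 + 1)) / (f + BM25_K1 * norm)
      }
      return { chunkId: chunk.chunkId, score }
    })
    .filter(r => r.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
}
```

Saturation is visible in the numbers: for a 50-token chunk with avgdl = 100, a term with f = 1 contributes about 1.26 × IDF, while f = 2 contributes about 1.6 × IDF rather than double.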
## Query Caching
CodeRAG caches search results using an LRU (Least Recently Used) cache to avoid re-executing identical searches.
**Cache parameters:**
```typescript
// From packages/core/src/indexer.ts
this.searchCache = new LRUCache<SearchResult[]>(100, 5)
// 100 entries max, 5 minute TTL
```
**Cache behavior:**
- Maximum 100 cached queries
- 5-minute time-to-live (TTL) per entry
- LRU eviction: oldest entries removed when cache is full
- Cache invalidation on index updates (file add/change/delete)
**Implementation:**
```typescript
// Cache key includes query + options
const cacheKey = createCacheKey(query, {
  limit: 10,
  fileExtensions: ['.ts'],
  pathFilter: 'src/',
  excludePaths: ['node_modules/']
})

const cachedResults = this.searchCache.get(cacheKey)
if (cachedResults) {
  return cachedResults // Cache hit
}

// Execute search...
const results = await searchChunks(query, options)
this.searchCache.set(cacheKey, results)
```
**Cache statistics:**
The cache exposes hit-rate and size metrics for monitoring:
```typescript
const stats = searchCache.stats()
console.log(`Hit rate: ${stats.hitRate}`) // 0-1 (1 = 100% hits)
console.log(`Size: ${stats.size}/${stats.maxSize}`)
```
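For reference, a cache with this behavior can be sketched in a few lines on top of JavaScript's insertion-ordered `Map`. The sketch mirrors the documented behavior (max size, TTL, LRU eviction) and constructor shape, but it is not CodeRAG's actual `LRUCache`:

```typescript
// Minimal LRU cache with TTL, built on Map's insertion order.
// Not CodeRAG's LRUCache; the constructor signature mirrors the docs above.
class SimpleLRUCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>()

  constructor(private maxSize: number, private ttlMinutes: number) {}

  get(key: string): V | undefined {
    const entry = this.entries.get(key)
    if (!entry) return undefined
    if (Date.now() > entry.expiresAt) {
      this.entries.delete(key) // expired: treat as a miss
      return undefined
    }
    // Refresh recency: re-insert so this key becomes the newest entry
    this.entries.delete(key)
    this.entries.set(key, entry)
    return entry.value
  }

  set(key: string, value: V): void {
    if (this.entries.has(key)) this.entries.delete(key)
    if (this.entries.size >= this.maxSize) {
      // Evict the least recently used key (oldest insertion)
      const oldest = this.entries.keys().next().value
      if (oldest !== undefined) this.entries.delete(oldest)
    }
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMinutes * 60_000 })
  }

  clear(): void {
    this.entries.clear() // called on index updates to invalidate results
  }
}
```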
## Search Flow
End-to-end search process:
1. **Query tokenization**: Convert query string to tokens using StarCoder2
2. **Cache check**: Look up results in LRU cache
3. **SQL candidate retrieval**: Query database for chunks containing any query term
4. **BM25 scoring**: Score each candidate chunk using BM25 formula
5. **Filtering**: Apply file extension, path, and exclusion filters
6. **Ranking**: Sort by BM25 score descending
7. **Limiting**: Return top N results
8. **Caching**: Store results in cache for future queries
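Put together, the flow reads roughly like the sketch below. Only `tokenize` comes from the documented API; the injected dependencies (`cache`, `fetchCandidates`, `scoreBM25`, `matchesFilters`) are placeholders for the pieces described in the previous sections:

```typescript
// Illustrative end-to-end wiring; all names except tokenize are assumptions.
import { tokenize } from '@sylphx/coderag'

interface SearchOptions { limit: number; fileExtensions?: string[]; pathFilter?: string; excludePaths?: string[] }
interface Candidate { chunkId: string; filePath: string; content: string; score: number }

async function search(
  query: string,
  options: SearchOptions,
  deps: {
    cache: { get(k: string): Candidate[] | undefined; set(k: string, v: Candidate[]): void }
    fetchCandidates: (tokens: string[], limit: number) => Promise<Candidate[]>
    scoreBM25: (tokens: string[], c: Candidate) => number
    matchesFilters: (c: Candidate, o: SearchOptions) => boolean
  }
): Promise<Candidate[]> {
  // 1. Tokenize the query with StarCoder2
  const queryTokens = await tokenize(query)

  // 2. Cache check (key must include query + options)
  const cacheKey = JSON.stringify({ query, options })
  const cached = deps.cache.get(cacheKey)
  if (cached) return cached

  // 3. SQL candidate retrieval: over-fetch 3x for scoring headroom
  const candidates = await deps.fetchCandidates(queryTokens, options.limit * 3)

  // 4-7. Score, filter, rank descending, and limit
  const results = candidates
    .map(c => ({ ...c, score: deps.scoreBM25(queryTokens, c) }))
    .filter(c => deps.matchesFilters(c, options))
    .sort((a, b) => b.score - a.score)
    .slice(0, options.limit)

  // 8. Cache for future identical queries
  deps.cache.set(cacheKey, results)
  return results
}
```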
**Performance characteristics:**
- Tokenization: ~1-5ms per query (cached after first use)
- SQL retrieval: ~10-50ms depending on index size
- BM25 scoring: ~1ms per 100 candidates
- Total search time: typically 20-100ms for 10,000 chunks
**SQL-based search:**
CodeRAG uses SQL for memory-efficient search:
```typescript
// Query chunks by terms
const candidates = await storage.searchByTerms(queryTokens, {
  limit: limit * 3 // Get 3x candidates for scoring
})
// Candidates include:
// - chunkId, filePath, content
// - matched terms with tfidf and rawFreq
// - pre-computed magnitude and tokenCount
```
Pre-computed values (magnitude, tokenCount) are stored in the database so they never need to be recalculated during search.
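For illustration only, a SQLite-style layout could store those pre-computed values like this. The table and column names are guesses, not CodeRAG's actual schema:

```typescript
// Hypothetical schema showing where pre-computed values could live.
const createChunksTable = `
  CREATE TABLE IF NOT EXISTS chunks (
    chunk_id    TEXT PRIMARY KEY,
    file_path   TEXT NOT NULL,
    content     TEXT NOT NULL,
    start_line  INTEGER NOT NULL,
    end_line    INTEGER NOT NULL,
    token_count INTEGER NOT NULL,  -- |C|, used for length normalization
    magnitude   REAL NOT NULL      -- pre-computed vector magnitude
  )
`

const createTermsTable = `
  CREATE TABLE IF NOT EXISTS chunk_terms (
    chunk_id TEXT NOT NULL REFERENCES chunks(chunk_id),
    term     TEXT NOT NULL,
    raw_freq INTEGER NOT NULL,     -- f(qi, C) in the BM25 formula
    tfidf    REAL NOT NULL,
    PRIMARY KEY (chunk_id, term)
  )
`
```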