# TF-IDF and BM25
CodeRAG uses BM25, an improved version of TF-IDF, for keyword-based search. This page explains the mathematical foundations and implementation details.
## Term Frequency (TF)
Term frequency measures how often a term appears in a document (chunk).
**Formula:**
```
TF(t, d) = count(t in d) / total_terms(d)
```
Where:
- `t` = term (token)
- `d` = document (chunk)
- `count(t in d)` = number of times term t appears in document d
- `total_terms(d)` = total number of tokens in document d
**Example:**
```typescript
// Chunk: "async function fetchUser(userId) { return await api.get(userId) }"
// Tokens: ["async", "function", "fetchUser", "(", "userId", ")", "{", "return", "await", "api", ".", "get", "(", "userId", ")", "}"]
// Total: 16 tokens
TF("userId", chunk) = 2 / 16 = 0.125
TF("async", chunk) = 1 / 16 = 0.0625
TF("return", chunk) = 1 / 16 = 0.0625
```
**Implementation:**
```typescript
function calculateTF(termFrequency: Map<string, number>): Map<string, number> {
  const totalTerms = Array.from(termFrequency.values()).reduce((sum, freq) => sum + freq, 0)
  const tf = new Map<string, number>()
  for (const [term, freq] of termFrequency.entries()) {
    tf.set(term, freq / totalTerms)
  }
  return tf
}
```
## Inverse Document Frequency (IDF)
IDF measures how rare a term is across all documents. Rare terms are more informative than common terms.
**Formula (smoothed):**
```
IDF(t) = log((N + 1) / (df(t) + 1)) + 1
```
Where:
- `N` = total number of chunks
- `df(t)` = document frequency (number of chunks containing term t)
- `+1` terms provide smoothing (prevent zero/negative values)
**Why smoothing?**
Standard IDF `log(N / df)` becomes 0 when a term appears in all documents. Smoothed IDF ensures every term has a positive score.
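For instance, a term that appears in every one of 1000 chunks still gets a usable weight:
```typescript
// Term appears in all 1000 chunks (df = N)
// Standard IDF:  log(1000 / 1000)     = 0  → term contributes nothing
// Smoothed IDF:  log(1001 / 1001) + 1 = 1  → term still contributes
```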
**Example:**
```typescript
// Index: 1000 chunks total (log = natural log, as in Math.log)
IDF("fetchUser") = log((1000 + 1) / (5 + 1)) + 1 = 6.12 // Rare
IDF("async") = log((1000 + 1) / (200 + 1)) + 1 = 2.61 // Common
IDF("function") = log((1000 + 1) / (800 + 1)) + 1 = 1.22 // Very common
```
Rare terms like `fetchUser` get higher IDF scores.
**Implementation:**
```typescript
function calculateIDF(
  documents: Map<string, number>[],
  totalDocuments: number
): Map<string, number> {
  const documentFrequency = new Map<string, number>()
  // Count chunks containing each term (Map keys are already unique per chunk)
  for (const doc of documents) {
    for (const term of doc.keys()) {
      documentFrequency.set(term, (documentFrequency.get(term) || 0) + 1)
    }
  }
  // Calculate smoothed IDF for each term
  const idf = new Map<string, number>()
  for (const [term, docFreq] of documentFrequency.entries()) {
    idf.set(term, Math.log((totalDocuments + 1) / (docFreq + 1)) + 1)
  }
  return idf
}
```
## TF-IDF Calculation
TF-IDF combines TF and IDF to score term importance in a document.
**Formula:**
```
TF-IDF(t, d) = TF(t, d) * IDF(t)
```
**Example:**
```typescript
// Term: "fetchUser"
TF("fetchUser", chunk) = 2 / 16 = 0.125
IDF("fetchUser") = 4.22
TF-IDF("fetchUser", chunk) = 0.125 * 4.22 = 0.5275
```
**Implementation:**
```typescript
function calculateTFIDF(tf: Map<string, number>, idf: Map<string, number>): Map<string, number> {
  const tfidf = new Map<string, number>()
  for (const [term, tfScore] of tf.entries()) {
    const idfScore = idf.get(term) || 0
    tfidf.set(term, tfScore * idfScore)
  }
  return tfidf
}
```
Each chunk is represented as a TF-IDF vector (map of term to score).
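A minimal sketch of how the three helpers compose, assuming raw per-chunk term counts have already been collected during tokenization (the counts below are illustrative):
```typescript
// Raw term counts per chunk
const chunkCounts: Map<string, number>[] = [
  new Map([['fetchUser', 2], ['userId', 2], ['async', 1]]),
  new Map([['async', 1], ['function', 1]]),
]

// IDF is computed once across the whole index...
const idf = calculateIDF(chunkCounts, chunkCounts.length)

// ...then each chunk gets its own TF-IDF vector
const vectors = chunkCounts.map((counts) => calculateTFIDF(calculateTF(counts), idf))
```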
## BM25 Formula
BM25 (Best Matching 25) improves TF-IDF with saturation and normalization.
**Full formula:**
```
BM25(d, q) = Σ IDF(qi) * (f(qi, d) * (k1 + 1)) / (f(qi, d) + k1 * (1 - b + b * |d| / avgdl))
for qi in q
```
Where:
- `d` = document (chunk)
- `q` = query
- `qi` = query term i
- `f(qi, d)` = raw frequency of qi in d
- `|d|` = document length (token count)
- `avgdl` = average document length across all chunks
- `k1` = term frequency saturation parameter (default: 1.2)
- `b` = length normalization parameter (default: 0.75)
- `IDF(qi)` = inverse document frequency of qi
**Key improvements over TF-IDF:**
1. **Term frequency saturation (k1)**: Diminishing returns for repeated terms
   - A term appearing 10 times is not 10x more important than appearing once
   - With k1 = 1.2, the TF factor asymptotically approaches k1 + 1 = 2.2 (see the sketch after this list)
2. **Length normalization (b)**: Penalizes long chunks
   - Prevents long chunks from dominating just by containing more terms
   - b = 0.75 applies 75% of full length normalization (b = 0 disables it, b = 1 normalizes fully)
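To make the saturation concrete, here is the BM25 TF factor at a fixed document length (with `|d| = avgdl`, the length term reduces to 1); `tfFactor` is an illustrative helper, not part of the CodeRAG API:
```typescript
const k1 = 1.2

// BM25 TF factor when |d| = avgdl, so (1 - b + b * |d| / avgdl) = 1
const tfFactor = (f: number) => (f * (k1 + 1)) / (f + k1)

tfFactor(1)   // 1.00
tfFactor(2)   // 1.38
tfFactor(10)  // 1.96
tfFactor(100) // 2.17, approaching the (k1 + 1) = 2.2 ceiling
```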
**BM25 Parameters:**
```typescript
const BM25_K1 = 1.2 // Saturation: 1.2-2.0 typical
const BM25_B = 0.75 // Normalization: 0 = none, 1 = full
```
These are the default values used by Lucene and Elasticsearch.
**Implementation:**
```typescript
export async function searchDocumentsFromStorage(
  query: string,
  candidates: StorageSearchResult[],
  idf: Map<string, number>,
  options: { avgDocLength?: number } = {}
): Promise<SearchResult[]> {
  const queryTokens = await getQueryTokens(query)

  // Compute the average document length from candidates if not provided
  let avgDocLength = options.avgDocLength
  if (!avgDocLength) {
    const totalTokens = candidates.reduce((sum, c) => sum + (c.tokenCount || 0), 0)
    avgDocLength = candidates.length > 0 ? totalTokens / candidates.length : 1
  }
  avgDocLength = Math.max(avgDocLength, 1)

  const results: SearchResult[] = []
  for (const candidate of candidates) {
    // Collect query terms that appear in this candidate
    const matchedTerms: string[] = []
    for (const term of queryTokens) {
      if (candidate.matchedTerms.has(term)) {
        matchedTerms.push(term)
      }
    }
    if (matchedTerms.length === 0) continue

    const docLen = candidate.tokenCount || 1
    let score = 0
    for (const term of matchedTerms) {
      const termFreq = candidate.matchedTerms.get(term)!.rawFreq
      const termIdf = idf.get(term) || 0
      // BM25 term score
      const numerator = termFreq * (BM25_K1 + 1)
      const denominator = termFreq + BM25_K1 * (1 - BM25_B + (BM25_B * docLen) / avgDocLength)
      score += termIdf * (numerator / denominator)
    }
    results.push({ uri: `file://${candidate.path}`, score, matchedTerms })
  }
  return results.sort((a, b) => b.score - a.score)
}
```
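A hypothetical call, inferring the candidate shape from the fields the function reads (`path`, `tokenCount`, and `matchedTerms` mapping each term to its raw frequency):
```typescript
const results = await searchDocumentsFromStorage(
  'fetchUser userId',
  [
    {
      path: 'src/api/users.ts',
      tokenCount: 16,
      matchedTerms: new Map([
        ['fetchUser', { rawFreq: 1 }],
        ['userId', { rawFreq: 2 }],
      ]),
    },
  ],
  idf // the Map<string, number> built by calculateIDF
)
// results[0] → { uri: 'file://src/api/users.ts', score: ..., matchedTerms: [...] }
```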
## Code-Aware Tokenization
CodeRAG uses the StarCoder2 tokenizer, which is code-aware and understands programming syntax.
**Why StarCoder2?**
Generic tokenizers split code incorrectly:
```typescript
// Generic tokenizer (word-based):
"getUserById" → ["get", "User", "By", "Id"] // Broken
"snake_case" → ["snake", "case"] // Lost underscore
// StarCoder2:
"getUserById" → ["getUserById"] // Preserved
"snake_case" → ["snake_case"] // Preserved
```
**Tokenization interface:**
```typescript
import { tokenize } from '@sylphx/coderag'
const tokens = await tokenize('async function getUserById(id: string)')
// Returns: ['async', 'function', 'getUserById', '(', 'id', ':', 'string', ')']
```
**Tokenization caching:**
Query tokens are cached to avoid re-tokenization:
```typescript
// Query token cache (LRU)
const queryTokenCache = new Map<string, string[]>()
const QUERY_CACHE_MAX_SIZE = 100
async function getCachedQueryTokens(query: string): Promise<string[]> {
  const cached = queryTokenCache.get(query)
  if (cached) {
    // Refresh recency: Map preserves insertion order, so deleting and
    // re-inserting moves this entry to the back, keeping eviction truly LRU
    queryTokenCache.delete(query)
    queryTokenCache.set(query, cached)
    return cached
  }
  const tokens = [...new Set(await tokenize(query))]
  // LRU eviction: drop the entry at the front of the Map (least recently used)
  if (queryTokenCache.size >= QUERY_CACHE_MAX_SIZE) {
    const firstKey = queryTokenCache.keys().next().value
    if (firstKey) queryTokenCache.delete(firstKey)
  }
  queryTokenCache.set(query, tokens)
  return tokens
}
```
The cache stores up to 100 unique queries, evicting the least recently used entry when full.
## Vector Magnitude
For cosine similarity search (used as TF-IDF fallback), vectors need normalized magnitude.
**Magnitude formula:**
```
magnitude(v) = sqrt(Σ vi^2)
```
**Implementation:**
```typescript
function calculateMagnitude(vector: Map<string, number>): number {
  let sum = 0
  for (const value of vector.values()) {
    sum += value * value
  }
  return Math.sqrt(sum)
}
```
**Cosine similarity:**
```typescript
export function calculateCosineSimilarity(
  queryVector: Map<string, number>,
  docVector: DocumentVector
): number {
  let dotProduct = 0
  for (const [term, queryScore] of queryVector.entries()) {
    const docScore = docVector.terms.get(term) || 0
    dotProduct += queryScore * docScore
  }
  const queryMagnitude = calculateMagnitude(queryVector)
  if (queryMagnitude === 0 || docVector.magnitude === 0) {
    return 0
  }
  return dotProduct / (queryMagnitude * docVector.magnitude)
}
```
Because TF-IDF scores are non-negative, cosine similarity here ranges from 0 (no shared terms) to 1 (identical direction).
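A quick sanity check with two tiny vectors, assuming `DocumentVector` is the `{ terms, magnitude }` shape the function reads (scores are illustrative):
```typescript
const queryVector = new Map([
  ['fetchUser', 0.5],
  ['userId', 0.5],
])

const docTerms = new Map([
  ['fetchUser', 0.765],
  ['async', 0.163],
])
const docVector = { terms: docTerms, magnitude: calculateMagnitude(docTerms) }

calculateCosineSimilarity(queryVector, docVector)
// ≈ 0.69: only 'fetchUser' overlaps, so the score is well below 1
```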