CodeRAG

search.md•9.59 KiB

# Search Functions CodeRAG provides three search modes: hybrid (combines vector and TF-IDF), semantic (vector only), and keyword (TF-IDF only). ## hybridSearch() Combines vector embeddings and TF-IDF keyword search with weighted scoring. ```typescript async function hybridSearch( query: string, indexer: CodebaseIndexer, options?: HybridSearchOptions ): Promise<HybridSearchResult[]> ``` ### Parameters **query** `string` - Search query **indexer** `CodebaseIndexer` - Configured indexer with embeddings **options** `HybridSearchOptions` (optional) ```typescript interface HybridSearchOptions { limit?: number // Max results (default: 10) minScore?: number // Min relevance score (default: 0.01) vectorWeight?: number // 0-1, vector vs TF-IDF (default: 0.7) includeContent?: boolean // Include snippets (default: false) fileExtensions?: string[] // Filter by extension pathFilter?: string // Include paths containing string excludePaths?: string[] // Exclude paths } ``` ### Returns `Promise<HybridSearchResult[]>` - Ranked results with metadata ```typescript interface HybridSearchResult { path: string score: number method: 'vector' | 'tfidf' | 'hybrid' matchedTerms?: string[] similarity?: number content?: string chunkType?: string startLine?: number endLine?: number language?: string } ``` ### Behavior **vectorWeight Modes:** - `>= 0.99`: Pure vector search (semantic only) - `<= 0.01`: Pure TF-IDF search (keyword only) - `0.02-0.98`: Hybrid mode (combines both) **Scoring:** - Vector score: Cosine similarity (0-1) - TF-IDF score: BM25 relevance (normalized) - Combined: `vectorWeight * vector + (1 - vectorWeight) * tfidf` ### Example ```typescript import { CodebaseIndexer, hybridSearch, createEmbeddingProvider } from '@sylphx/coderag' // Setup const embeddingProvider = createEmbeddingProvider({ provider: 'openai', model: 'text-embedding-3-small', dimensions: 1536, apiKey: process.env.OPENAI_API_KEY }) const indexer = new CodebaseIndexer({ codebaseRoot: './src', embeddingProvider }) await indexer.index() // Hybrid search (70% semantic, 30% keyword) const results = await hybridSearch( 'how to authenticate users', indexer, { vectorWeight: 0.7, limit: 10, includeContent: true } ) for (const result of results) { console.log(`${result.path}:${result.startLine} (${result.method})`) console.log(`Score: ${result.score.toFixed(3)}`) if (result.matchedTerms) { console.log(`Keywords: ${result.matchedTerms.join(', ')}`) } if (result.similarity) { console.log(`Semantic similarity: ${result.similarity.toFixed(3)}`) } if (result.content) { console.log(result.content) } console.log('---') } ``` ### Advanced Usage **Adjust semantic vs keyword balance:** ```typescript // More semantic (better for conceptual queries) const semanticResults = await hybridSearch(query, indexer, { vectorWeight: 0.9 // 90% semantic, 10% keyword }) // More keyword (better for specific terms) const keywordResults = await hybridSearch(query, indexer, { vectorWeight: 0.3 // 30% semantic, 70% keyword }) // Balanced const balanced = await hybridSearch(query, indexer, { vectorWeight: 0.5 // 50/50 split }) ``` **Filter results:** ```typescript const results = await hybridSearch( 'authentication', indexer, { fileExtensions: ['.ts', '.tsx'], pathFilter: 'src/auth', excludePaths: ['test', 'mock'], minScore: 0.1 } ) ``` ## semanticSearch() Pure vector search using embeddings. Convenience wrapper for `hybridSearch()` with `vectorWeight: 1.0`. ```typescript async function semanticSearch( query: string, indexer: CodebaseIndexer, options?: Omit<HybridSearchOptions, 'vectorWeight'> ): Promise<HybridSearchResult[]> ``` ### Parameters Same as `hybridSearch()` except `vectorWeight` is fixed at 1.0. ### Returns `Promise<HybridSearchResult[]>` - Results with `method: 'vector'` ### Example ```typescript import { semanticSearch } from '@sylphx/coderag' // Semantic search only const results = await semanticSearch( 'find code that handles user authentication', indexer, { limit: 5 } ) // Works well for: // - Conceptual queries ("how to...", "find code that...") // - Natural language questions // - Cross-language concepts // - Similar functionality search ``` ## keywordSearch() Pure TF-IDF/BM25 keyword search. Convenience wrapper for `hybridSearch()` with `vectorWeight: 0.0`. ```typescript async function keywordSearch( query: string, indexer: CodebaseIndexer, options?: Omit<HybridSearchOptions, 'vectorWeight'> ): Promise<HybridSearchResult[]> ``` ### Parameters Same as `hybridSearch()` except `vectorWeight` is fixed at 0.0. ### Returns `Promise<HybridSearchResult[]>` - Results with `method: 'tfidf'` ### Example ```typescript import { keywordSearch } from '@sylphx/coderag' // Keyword search only const results = await keywordSearch( 'createUser validateEmail', indexer, { limit: 10 } ) // Works well for: // - Specific function/variable names // - Exact terminology // - API identifiers // - Symbol search ``` ## Search Strategies ### When to Use Each Mode **Semantic Search** (`semanticSearch()` or high `vectorWeight`) - Natural language questions - Conceptual similarity - "Find code that does X" - Cross-language patterns - Requires embedding provider **Keyword Search** (`keywordSearch()` or low `vectorWeight`) - Specific identifiers - Exact function/class names - Fast lookups - No API calls needed - Works offline **Hybrid Search** (balanced `vectorWeight`) - Best of both worlds - Handles varied query types - More robust results - Recommended for general use ### Query Examples **Good for Semantic:** ```typescript await semanticSearch('how to validate user input', indexer) await semanticSearch('authentication flow', indexer) await semanticSearch('error handling patterns', indexer) ``` **Good for Keyword:** ```typescript await keywordSearch('createUser validateEmail', indexer) await keywordSearch('React useState useEffect', indexer) await keywordSearch('class AuthService', indexer) ``` **Good for Hybrid:** ```typescript await hybridSearch('JWT token authentication', indexer, { vectorWeight: 0.7 }) await hybridSearch('database connection pooling', indexer, { vectorWeight: 0.6 }) await hybridSearch('API error handling middleware', indexer, { vectorWeight: 0.5 }) ``` ## Score Interpretation ### Vector Similarity Cosine similarity between query and chunk embeddings: - `0.9-1.0`: Highly relevant - `0.7-0.9`: Relevant - `0.5-0.7`: Somewhat relevant - `< 0.5`: Weakly relevant ### TF-IDF/BM25 Score BM25 relevance scoring (varies by corpus): - Higher = more relevant - Normalized per query - Affected by term frequency and document frequency ### Combined Score Weighted combination in hybrid mode: - Normalized to 0-1 range - `vectorWeight` controls the balance - Results sorted by combined score descending ## Result Metadata ### Chunk Information Results include AST chunk metadata: ```typescript const result = results[0] console.log(`Type: ${result.chunkType}`) // 'FunctionDeclaration', 'ClassDeclaration', etc. console.log(`Lines: ${result.startLine}-${result.endLine}`) console.log(`Language: ${result.language}`) ``` ### Match Information ```typescript // Keyword matches if (result.matchedTerms) { console.log(`Matched terms: ${result.matchedTerms.join(', ')}`) } // Vector similarity if (result.similarity) { console.log(`Similarity: ${(result.similarity * 100).toFixed(1)}%`) } // Search method console.log(`Method: ${result.method}`) // 'vector', 'tfidf', or 'hybrid' ``` ## Filtering ### By File Extension ```typescript const tsResults = await hybridSearch(query, indexer, { fileExtensions: ['.ts', '.tsx'] }) const pyResults = await hybridSearch(query, indexer, { fileExtensions: ['.py'] }) ``` ### By Path ```typescript // Include specific paths const authResults = await hybridSearch(query, indexer, { pathFilter: 'src/auth' }) // Exclude specific paths const prodResults = await hybridSearch(query, indexer, { excludePaths: ['test', 'mock', 'fixture'] }) // Combine filters const filtered = await hybridSearch(query, indexer, { fileExtensions: ['.ts'], pathFilter: 'src', excludePaths: ['test', 'dist'] }) ``` ### By Score ```typescript const highQuality = await hybridSearch(query, indexer, { minScore: 0.5, // Only high-confidence results limit: 5 }) ``` ## Performance ### Search Speed - Semantic: ~10-50ms (LanceDB vector search) - Keyword: ~5-10ms (SQL BM25) - Hybrid: ~20-60ms (both methods + merging) ### Optimization Tips **Limit results:** ```typescript // Faster: fewer results to process const results = await hybridSearch(query, indexer, { limit: 5 }) ``` **Skip content:** ```typescript // Faster: no snippet generation const results = await hybridSearch(query, indexer, { includeContent: false }) ``` **Filter early:** ```typescript // Faster: filter at query time, not after const results = await hybridSearch(query, indexer, { fileExtensions: ['.ts'], pathFilter: 'src' }) ``` ## Error Handling ### Fallback Behavior If vector search fails, automatically falls back to keyword search: ```typescript try { const results = await hybridSearch(query, indexer, { vectorWeight: 0.7 }) // Falls back to TF-IDF if embedding generation fails } catch (error) { console.error('Search failed:', error) } ``` ### Missing Embeddings ```typescript const indexer = new CodebaseIndexer({ // No embeddingProvider configured }) // hybridSearch() will use TF-IDF only const results = await hybridSearch(query, indexer) // Logs: [INFO] Using TF-IDF search only ``` ## Related - [CodebaseIndexer](./indexer.md) - [Embeddings](./embeddings.md) - [Types](./types.md)

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/SylphxAI/coderag'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

search.md•9.59 KiB