CodeRAG

README.md•12.1 KiB

<div align="center"> # CodeRAG **Lightning-fast hybrid code search for AI assistants** [![npm version](https://img.shields.io/npm/v/@sylphx/coderag?style=flat-square&label=core)](https://www.npmjs.com/package/@sylphx/coderag) [![npm version](https://img.shields.io/npm/v/@sylphx/coderag-mcp?style=flat-square&label=mcp)](https://www.npmjs.com/package/@sylphx/coderag-mcp) [![CI](https://img.shields.io/github/actions/workflow/status/SylphxAI/coderag/ci.yml?style=flat-square)](https://github.com/SylphxAI/coderag/actions) [![License](https://img.shields.io/badge/License-MIT-blue?style=flat-square)](LICENSE) **Zero dependencies** • **<50ms search** • **Hybrid TF-IDF + Vector** • **MCP ready** [Quick Start](#-quick-start) • [Features](#-features) • [MCP Setup](#-mcp-server-setup) • [API](#-api-reference) </div> --- ## Why CodeRAG? Traditional code search tools are either **slow** (full-text grep), **inaccurate** (keyword matching), or **complex** (require external services). CodeRAG is different: ``` ❌ Old way: Docker + ChromaDB + Ollama + 30 second startup ✅ CodeRAG: npx @sylphx/coderag-mcp (instant) ``` | Feature | grep/ripgrep | Cloud RAG | CodeRAG | |---------|-------------|-----------|---------| | **Semantic understanding** | ❌ | ✅ | ✅ | | **Zero external deps** | ✅ | ❌ | ✅ | | **Offline support** | ✅ | ❌ | ✅ | | **Startup time** | Instant | 10-30s | <1s | | **Search latency** | ~100ms | ~500ms | <50ms | --- ## ✨ Features ### Search - 🔍 **Hybrid Search** - TF-IDF + optional vector embeddings - 🧠 **StarCoder2 Tokenizer** - Code-aware tokenization (4.7MB, trained on code) - 📊 **Smoothed IDF** - No term gets ignored, stable ranking - ⚡ **<50ms Latency** - Instant results even on large codebases ### Indexing - 🚀 **1000-2000 files/sec** - Fast initial indexing - 💾 **SQLite Persistence** - Instant startup (<100ms) with cached index - ⚡ **Incremental Updates** - Smart diff detection, no full rebuilds - 👁️ **File Watching** - Real-time index updates on file changes ### Integration - 📦 **MCP Server** - Works with Claude Desktop, Cursor, VS Code, Windsurf - 🧠 **Vector Search** - Optional OpenAI embeddings for semantic search - 🌳 **AST Chunking** - Smart code splitting using [Synth](https://github.com/SylphxAI/synth) parsers (15+ languages) - 💻 **Low Memory Mode** - SQL-based search for resource-constrained environments --- ## 🚀 Quick Start ### Option 1: MCP Server (Recommended for AI Assistants) ```bash npx @sylphx/coderag-mcp --root=/path/to/project ``` Or add to your MCP config: ```json { "mcpServers": { "coderag": { "command": "npx", "args": ["-y", "@sylphx/coderag-mcp", "--root=/path/to/project"] } } } ``` See [MCP Server Setup](#-mcp-server-setup) for Claude Desktop, Cursor, VS Code, etc. ### Option 2: As a Library ```bash npm install @sylphx/coderag # or bun add @sylphx/coderag ``` ```typescript import { CodebaseIndexer, PersistentStorage } from '@sylphx/coderag' // Create indexer with persistent storage const storage = new PersistentStorage({ codebaseRoot: './my-project' }) const indexer = new CodebaseIndexer({ codebaseRoot: './my-project', storage, }) // Index codebase (instant on subsequent runs) await indexer.index({ watch: true }) // Search const results = await indexer.search('authentication logic', { limit: 10 }) console.log(results) // [{ path: 'src/auth/login.ts', score: 0.87, matchedTerms: ['authentication', 'logic'], snippet: '...' }] ``` --- ## 📦 Packages | Package | Description | Install | |---------|-------------|---------| | [@sylphx/coderag](./packages/core) | Core search library | `npm i @sylphx/coderag` | | [@sylphx/coderag-mcp](./packages/mcp-server) | MCP server for AI assistants | `npx @sylphx/coderag-mcp` | --- ## 🔌 MCP Server Setup ### Claude Desktop Add to `claude_desktop_config.json`: **macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json` **Windows**: `%APPDATA%\Claude\claude_desktop_config.json` ```json { "mcpServers": { "coderag": { "command": "npx", "args": ["-y", "@sylphx/coderag-mcp", "--root=/path/to/project"] } } } ``` ### Cursor Add to `~/.cursor/mcp.json` (macOS) or `%USERPROFILE%\.cursor\mcp.json` (Windows): ```json { "mcpServers": { "coderag": { "command": "npx", "args": ["-y", "@sylphx/coderag-mcp", "--root=/path/to/project"] } } } ``` ### VS Code Add to VS Code settings (JSON) or `.vscode/mcp.json`: ```json { "mcp": { "servers": { "coderag": { "command": "npx", "args": ["-y", "@sylphx/coderag-mcp", "--root=${workspaceFolder}"] } } } } ``` ### Windsurf Add to `~/.codeium/windsurf/mcp_config.json`: ```json { "mcpServers": { "coderag": { "command": "npx", "args": ["-y", "@sylphx/coderag-mcp", "--root=/path/to/project"] } } } ``` ### Claude Code ```bash claude mcp add coderag -- npx -y @sylphx/coderag-mcp --root=/path/to/project ``` --- ## 🛠️ MCP Tool: `codebase_search` Search project source files with hybrid TF-IDF + vector ranking. ### Parameters | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| | `query` | string | Yes | - | Search query | | `limit` | number | No | 10 | Max results | | `include_content` | boolean | No | true | Include code snippets | | `file_extensions` | string[] | No | - | Filter by extension (e.g., `[".ts", ".tsx"]`) | | `path_filter` | string | No | - | Filter by path pattern | | `exclude_paths` | string[] | No | - | Exclude paths (e.g., `["node_modules", "dist"]`) | ### Example ```json { "query": "user authentication login", "limit": 5, "file_extensions": [".ts", ".tsx"], "exclude_paths": ["node_modules", "dist", "test"] } ``` ### Response Format LLM-optimized output (minimal tokens, maximum content): ```markdown # Search: "user authentication login" (3 results) ## src/auth/login.ts:15-28 ```typescript 15: export async function authenticate(credentials) { 16: const user = await findUser(credentials.email) 17: return validatePassword(user, credentials.password) 18: } ``` ## src/middleware/auth.ts:42-55 [md→typescript] ```typescript 42: // Embedded code from markdown docs 43: const authMiddleware = (req, res, next) => { ``` ## src/utils/large.ts:1-200 [truncated] ```typescript 1: // First 70% shown... ... [800 chars truncated] ... 195: // Last 20% shown ``` ``` --- ## 📚 API Reference ### `CodebaseIndexer` Main class for indexing and searching. ```typescript import { CodebaseIndexer, PersistentStorage } from '@sylphx/coderag' const storage = new PersistentStorage({ codebaseRoot: './project' }) const indexer = new CodebaseIndexer({ codebaseRoot: './project', storage, maxFileSize: 1024 * 1024, // 1MB default }) // Index with file watching await indexer.index({ watch: true }) // Search with options const results = await indexer.search('query', { limit: 10, includeContent: true, fileExtensions: ['.ts', '.js'], excludePaths: ['node_modules'], }) // Stop watching await indexer.stopWatch() ``` ### `PersistentStorage` SQLite-backed storage for instant startup. ```typescript import { PersistentStorage } from '@sylphx/coderag' const storage = new PersistentStorage({ codebaseRoot: './project', // Creates .coderag/ folder dbPath: './custom.db', // Optional custom path }) ``` ### Low-Level TF-IDF Functions ```typescript import { buildSearchIndex, searchDocuments, initializeTokenizer } from '@sylphx/coderag' // Initialize StarCoder2 tokenizer (4.7MB, one-time download) await initializeTokenizer() // Build index const documents = [ { uri: 'file://auth.ts', content: 'export function authenticate...' }, { uri: 'file://user.ts', content: 'export class User...' }, ] const index = await buildSearchIndex(documents) // Search const results = await searchDocuments('authenticate user', index, { limit: 5 }) ``` ### Vector Search (Optional) For semantic search with embeddings: ```typescript import { hybridSearch, createEmbeddingProvider } from '@sylphx/coderag' // Requires OPENAI_API_KEY environment variable const results = await hybridSearch('authentication flow', indexer, { vectorWeight: 0.7, // 70% vector, 30% TF-IDF limit: 10, }) ``` --- ## ⚙️ Configuration ### MCP Server Options | Option | Default | Description | |--------|---------|-------------| | `--root=<path>` | Current directory | Codebase root path | | `--max-size=<bytes>` | 1048576 (1MB) | Max file size to index | | `--no-auto-index` | false | Disable auto-indexing on startup | ### Environment Variables | Variable | Description | |----------|-------------| | `OPENAI_API_KEY` | Enable vector search with OpenAI embeddings | | `OPENAI_BASE_URL` | Custom OpenAI-compatible endpoint | | `EMBEDDING_MODEL` | Embedding model (default: `text-embedding-3-small`) | | `EMBEDDING_DIMENSIONS` | Custom embedding dimensions | --- ## 📊 Performance | Metric | Value | |--------|-------| | **Initial indexing** | ~1000-2000 files/sec | | **Startup with cache** | <100ms | | **Search latency** | <50ms | | **Memory per 1000 files** | ~1-2 MB | | **Tokenizer size** | 4.7MB (StarCoder2) | ### Benchmarks Tested on MacBook Pro M1, 16GB RAM: | Codebase | Files | Index Time | Search Time | |----------|-------|------------|-------------| | Small (100 files) | 100 | 0.5s | <10ms | | Medium (1000 files) | 1,000 | 2s | <30ms | | Large (10000 files) | 10,000 | 15s | <50ms | --- ## 🏗️ Architecture ``` coderag/ ├── packages/ │ ├── core/ # @sylphx/coderag │ │ ├── src/ │ │ │ ├── indexer.ts # Main indexer with file watching │ │ │ ├── tfidf.ts # TF-IDF with StarCoder2 tokenizer │ │ │ ├── code-tokenizer.ts # StarCoder2 tokenization │ │ │ ├── hybrid-search.ts # Vector + TF-IDF fusion │ │ │ ├── incremental-tfidf.ts # Smart incremental updates │ │ │ ├── storage-persistent.ts # SQLite storage │ │ │ ├── vector-storage.ts # LanceDB vector storage │ │ │ ├── embeddings.ts # OpenAI embeddings │ │ │ ├── ast-chunking.ts # Synth AST chunking │ │ │ └── language-config.ts # Language registry (15+ languages) │ │ └── package.json │ │ │ └── mcp-server/ # @sylphx/coderag-mcp │ ├── src/ │ │ └── index.ts # MCP server │ └── package.json ``` ### How It Works 1. **Indexing**: Scans codebase, tokenizes with StarCoder2, builds TF-IDF index 2. **AST Chunking**: Splits code at semantic boundaries (functions, classes, etc.) 3. **Storage**: Persists to SQLite (`.coderag/` folder) for instant startup 4. **Watching**: Detects file changes, performs incremental updates 5. **Search**: Hybrid TF-IDF + optional vector search with score fusion ### Supported Languages AST-based chunking with semantic boundary detection: | Category | Languages | |----------|-----------| | **JavaScript** | JavaScript, TypeScript, JSX, TSX | | **Systems** | Python, Go, Java, C | | **Markup** | Markdown, HTML, XML | | **Data/Config** | JSON, YAML, TOML, INI | | **Other** | Protobuf | **Embedded Code Support**: Automatically parses code blocks in Markdown and `<script>`/`<style>` tags in HTML. --- ## 🔧 Development ```bash # Clone git clone https://github.com/SylphxAI/coderag.git cd coderag # Install bun install # Build bun run build # Test bun run test # Lint & Format bun run lint bun run format ``` --- ## 🤝 Contributing Contributions are welcome! Please: 1. Open an issue to discuss changes 2. Fork and create a feature branch 3. Run `bun run lint` and `bun run test` 4. Submit a pull request --- ## 📄 License MIT © [Sylphx](https://sylphx.com) --- <div align="center"> **Powered by [Sylphx](https://github.com/SylphxAI)** Built with [@sylphx/synth](https://github.com/SylphxAI/synth) • [@sylphx/mcp-server-sdk](https://github.com/SylphxAI/mcp-server-sdk) • [@sylphx/doctor](https://github.com/SylphxAI/doctor) • [@sylphx/bump](https://github.com/SylphxAI/bump) </div>

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/SylphxAI/coderag'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

README.md•12.1 KiB