Code-Index-MCP

Code-Index-MCP
ai_docs

sqlite_fts5_overview.md•28.4 KiB

# SQLite FTS5 Comprehensive Guide for Code Indexing ## Table of Contents 1. [Introduction to FTS5](#introduction) 2. [FTS5 Basics and Setup](#fts5-basics-and-setup) 3. [Python sqlite3 Integration](#python-integration) 4. [Creating and Querying FTS Tables](#creating-and-querying) 5. [Indexing Strategies for Code](#indexing-strategies) 6. [Performance Optimization](#performance-optimization) 7. [MCP Server Implementation Patterns](#mcp-server-implementation) ## Introduction to FTS5 {#introduction} SQLite FTS5 (Full-Text Search 5) is a virtual table module that provides full-text search functionality to database applications. It's particularly useful for implementing search capabilities in applications that need to search through large amounts of text data, including source code. ### Key Features - Fast full-text search using inverted indexes - Built-in ranking algorithm (BM25) - Support for phrase searches, prefix searches, and boolean operators - Customizable tokenizers for different languages and use cases - Auxiliary functions for highlighting and snippet extraction ### Requirements - SQLite version 3.9.0 or later (FTS5 is included in the amalgamation) - Python with sqlite3 module (included in standard library) ## FTS5 Basics and Setup {#fts5-basics-and-setup} ### Creating an FTS5 Virtual Table FTS5 tables are created using the `CREATE VIRTUAL TABLE` statement: ```sql CREATE VIRTUAL TABLE documents USING fts5( title, content, tags ); ``` **Important Notes:** - Column types, constraints, or PRIMARY KEY declarations are not allowed - All columns are treated as TEXT - An implicit `rowid` column is automatically created ### Configuration Options FTS5 provides several configuration options to optimize performance: ```sql -- With tokenizer configuration CREATE VIRTUAL TABLE code_index USING fts5( filename, content, language, tokenize = 'porter unicode61' ); -- With prefix indexing for faster prefix searches CREATE VIRTUAL TABLE code_index USING fts5( content, prefix = '2 3' ); -- With detail reduction for smaller index size CREATE VIRTUAL TABLE code_index USING fts5( content, detail = 'column' ); ``` ## Python sqlite3 Integration {#python-integration} ### Basic Setup ```python import sqlite3 # Connect to database conn = sqlite3.connect('code_index.db') cursor = conn.cursor() # Enable FTS5 (usually enabled by default) cursor.execute("PRAGMA compile_options;") options = cursor.fetchall() fts5_enabled = any('ENABLE_FTS5' in str(option) for option in options) if not fts5_enabled: print("Warning: FTS5 may not be available in this SQLite build") ``` ### Complete Working Example ```python import sqlite3 import os class CodeIndexer: def __init__(self, db_path): self.conn = sqlite3.connect(db_path) self.cursor = self.conn.cursor() self._create_tables() def _create_tables(self): """Create FTS5 virtual table for code indexing""" self.cursor.execute(""" CREATE VIRTUAL TABLE IF NOT EXISTS code_files USING fts5( filepath, filename, content, language, tokenize = 'porter unicode61' ) """) self.conn.commit() def index_file(self, filepath, language): """Index a single code file""" with open(filepath, 'r', encoding='utf-8') as f: content = f.read() filename = os.path.basename(filepath) self.cursor.execute(""" INSERT INTO code_files (filepath, filename, content, language) VALUES (?, ?, ?, ?) """, (filepath, filename, content, language)) self.conn.commit() def search(self, query, limit=10): """Search indexed code files""" results = self.cursor.execute(""" SELECT filepath, filename, snippet(code_files, 2, '', '', '...', 32) FROM code_files WHERE code_files MATCH ? ORDER BY rank LIMIT ? """, (query, limit)).fetchall() return results def search_by_language(self, query, language): """Search within specific programming language""" results = self.cursor.execute(""" SELECT filepath, filename, highlight(code_files, 2, '', '') FROM code_files WHERE code_files MATCH ? AND language = ? ORDER BY rank """, (query, language)).fetchall() return results def close(self): self.conn.close() # Usage example indexer = CodeIndexer('my_code_index.db') # Index some files indexer.index_file('/path/to/main.py', 'python') indexer.index_file('/path/to/utils.js', 'javascript') # Search for code results = indexer.search('function') for filepath, filename, snippet in results: print(f"{filename}: {snippet}") indexer.close() ``` ## Creating and Querying FTS Tables {#creating-and-querying} ### Basic Queries ```python # Simple search cursor.execute(""" SELECT * FROM code_files WHERE code_files MATCH 'python' """) # Alternative syntax cursor.execute(""" SELECT * FROM code_files('python') """) # Using = operator cursor.execute(""" SELECT * FROM code_files WHERE code_files = 'python' """) ``` ### Advanced Query Syntax #### Boolean Operators ```python # AND operator - both terms must be present cursor.execute(""" SELECT * FROM code_files WHERE code_files MATCH 'class AND method' """) # OR operator - either term can be present cursor.execute(""" SELECT * FROM code_files WHERE code_files MATCH 'async OR await' """) # NOT operator - exclude documents with term cursor.execute(""" SELECT * FROM code_files WHERE code_files MATCH 'function NOT lambda' """) ``` #### Phrase Searches ```python # Exact phrase match cursor.execute(""" SELECT * FROM code_files WHERE code_files MATCH '"import numpy"' """) # NEAR operator - terms within N tokens cursor.execute(""" SELECT * FROM code_files WHERE code_files MATCH 'NEAR(class method, 10)' """) ``` #### Column-Specific Searches ```python # Search only in filename column cursor.execute(""" SELECT * FROM code_files WHERE code_files MATCH 'filename:test' """) # Search in content column cursor.execute(""" SELECT * FROM code_files WHERE code_files MATCH 'content:TODO' """) ``` #### Prefix Searches ```python # Find all words starting with 'func' cursor.execute(""" SELECT * FROM code_files WHERE code_files MATCH 'func*' """) ``` ### Using Auxiliary Functions #### BM25 Ranking ```python # Get relevance scores cursor.execute(""" SELECT filepath, bm25(code_files) as score FROM code_files WHERE code_files MATCH ? ORDER BY score """, (query,)) # Custom column weights cursor.execute(""" SELECT filepath, bm25(code_files, 1.0, 2.0, 0.5, 0.5) as score FROM code_files WHERE code_files MATCH ? ORDER BY score """, (query,)) ``` #### Highlighting ```python # Highlight matching terms cursor.execute(""" SELECT filepath, highlight(code_files, 2, '', '') as highlighted_content FROM code_files WHERE code_files MATCH ? """, (query,)) ``` #### Snippets ```python # Extract relevant snippets cursor.execute(""" SELECT filepath, snippet(code_files, 2, '', '', '...', 32) as extract FROM code_files WHERE code_files MATCH ? """, (query,)) ``` ## Indexing Strategies for Code {#indexing-strategies} ### 1. Language-Specific Tokenization For code indexing, consider creating custom tokenizers or configuring existing ones appropriately: ```python # Create separate tables for different languages with appropriate tokenizers cursor.execute(""" CREATE VIRTUAL TABLE python_code USING fts5( filepath, content, tokenize = 'porter unicode61' ) """) cursor.execute(""" CREATE VIRTUAL TABLE c_code USING fts5( filepath, content, tokenize = 'unicode61' -- No stemming for C code ) """) ``` ### 2. Structured Code Indexing Index different parts of code separately: ```python cursor.execute(""" CREATE VIRTUAL TABLE code_structure USING fts5( filepath, functions, classes, imports, comments, docstrings ) """) # When indexing, parse code and extract components def index_python_file(filepath): import ast with open(filepath, 'r') as f: content = f.read() tree = ast.parse(content) functions = [] classes = [] imports = [] for node in ast.walk(tree): if isinstance(node, ast.FunctionDef): functions.append(node.name) elif isinstance(node, ast.ClassDef): classes.append(node.name) elif isinstance(node, ast.Import): imports.extend(alias.name for alias in node.names) cursor.execute(""" INSERT INTO code_structure (filepath, functions, classes, imports) VALUES (?, ?, ?, ?) """, ( filepath, ' '.join(functions), ' '.join(classes), ' '.join(imports) )) ``` ### 3. Incremental Indexing Implement efficient update strategies: ```python class IncrementalIndexer: def __init__(self, db_path): self.conn = sqlite3.connect(db_path) self.cursor = self.conn.cursor() self._create_tables() def _create_tables(self): # Main FTS table self.cursor.execute(""" CREATE VIRTUAL TABLE IF NOT EXISTS code_index USING fts5(filepath, content, mtime) """) # Metadata table for tracking self.cursor.execute(""" CREATE TABLE IF NOT EXISTS file_metadata ( filepath TEXT PRIMARY KEY, mtime REAL, size INTEGER, hash TEXT ) """) def needs_update(self, filepath): """Check if file needs re-indexing""" stat = os.stat(filepath) mtime = stat.st_mtime self.cursor.execute(""" SELECT mtime FROM file_metadata WHERE filepath = ? """, (filepath,)) result = self.cursor.fetchone() return result is None or result[0] < mtime def update_index(self, filepath): """Update index for a single file""" if not self.needs_update(filepath): return False # Remove old entry self.cursor.execute(""" DELETE FROM code_index WHERE filepath = ? """, (filepath,)) # Add new entry with open(filepath, 'r', encoding='utf-8') as f: content = f.read() stat = os.stat(filepath) self.cursor.execute(""" INSERT INTO code_index (filepath, content, mtime) VALUES (?, ?, ?) """, (filepath, content, stat.st_mtime)) # Update metadata self.cursor.execute(""" INSERT OR REPLACE INTO file_metadata (filepath, mtime, size, hash) VALUES (?, ?, ?, ?) """, (filepath, stat.st_mtime, stat.st_size, None)) self.conn.commit() return True ``` ## Performance Optimization {#performance-optimization} ### 1. Index Configuration ```python # Optimize for storage space cursor.execute(""" CREATE VIRTUAL TABLE compact_index USING fts5( content, detail = 'none', -- No position information columnsize = 0 -- Don't store column sizes ) """) # Optimize for prefix searches cursor.execute(""" CREATE VIRTUAL TABLE prefix_index USING fts5( content, prefix = '2 3 4' -- Index prefixes of length 2, 3, and 4 ) """) ``` ### 2. Query Optimization ```python # Use the rank column instead of bm25() for sorting # This is faster: cursor.execute(""" SELECT * FROM code_index WHERE code_index MATCH ? ORDER BY rank """, (query,)) # Than this: cursor.execute(""" SELECT * FROM code_index WHERE code_index MATCH ? ORDER BY bm25(code_index) """, (query,)) ``` ### 3. Maintenance Operations ```python def optimize_index(conn): """Optimize FTS index for better performance""" cursor = conn.cursor() # Merge all segments into a single segment cursor.execute("INSERT INTO code_index(code_index) VALUES('optimize')") # Rebuild the index (more thorough but slower) # cursor.execute("INSERT INTO code_index(code_index) VALUES('rebuild')") conn.commit() def analyze_index(conn): """Analyze index statistics""" cursor = conn.cursor() # Get index statistics cursor.execute(""" SELECT * FROM code_index_stat """) stats = cursor.fetchall() return stats ``` ### 4. Batch Operations ```python def batch_index_files(file_list, batch_size=100): """Index files in batches for better performance""" conn = sqlite3.connect('code_index.db') cursor = conn.cursor() # Disable auto-commit for batch operations conn.execute("BEGIN TRANSACTION") try: for i in range(0, len(file_list), batch_size): batch = file_list[i:i + batch_size] for filepath in batch: with open(filepath, 'r', encoding='utf-8') as f: content = f.read() cursor.execute(""" INSERT INTO code_index (filepath, content) VALUES (?, ?) """, (filepath, content)) # Commit after each batch conn.commit() conn.execute("BEGIN TRANSACTION") # Final optimization cursor.execute("INSERT INTO code_index(code_index) VALUES('optimize')") conn.commit() except Exception as e: conn.rollback() raise e finally: conn.close() ``` ### 5. Memory Management ```python # Configure cache size for better performance conn = sqlite3.connect('code_index.db') conn.execute("PRAGMA cache_size = -64000") # 64MB cache conn.execute("PRAGMA temp_store = MEMORY") # Use memory for temp tables ``` ## MCP Server Implementation Patterns {#mcp-server-implementation} ### 1. Basic MCP Server Structure ```python import json import sqlite3 from typing import Dict, List, Any class CodeIndexMCPServer: def __init__(self, db_path: str): self.db_path = db_path self.conn = sqlite3.connect(db_path) self._initialize_db() def _initialize_db(self): """Initialize FTS5 tables""" cursor = self.conn.cursor() cursor.execute(""" CREATE VIRTUAL TABLE IF NOT EXISTS code_search USING fts5( filepath, filename, content, language, symbols, -- Function/class names tokenize = 'porter unicode61' ) """) self.conn.commit() def handle_request(self, method: str, params: Dict[str, Any]) -> Dict[str, Any]: """Route MCP requests to appropriate handlers""" handlers = { 'search': self._handle_search, 'index_file': self._handle_index_file, 'get_definition': self._handle_get_definition, 'find_references': self._handle_find_references, 'update_index': self._handle_update_index } if method in handlers: return handlers[method](params) else: return {'error': f'Unknown method: {method}'} def _handle_search(self, params: Dict[str, Any]) -> Dict[str, Any]: """Handle code search requests""" query = params.get('query', '') language = params.get('language') limit = params.get('limit', 20) cursor = self.conn.cursor() if language: sql = """ SELECT filepath, filename, snippet(code_search, 2, '', '', '...', 32), rank FROM code_search WHERE code_search MATCH ? AND language = ? ORDER BY rank LIMIT ? """ results = cursor.execute(sql, (query, language, limit)).fetchall() else: sql = """ SELECT filepath, filename, snippet(code_search, 2, '', '', '...', 32), rank FROM code_search WHERE code_search MATCH ? ORDER BY rank LIMIT ? """ results = cursor.execute(sql, (query, limit)).fetchall() return { 'results': [ { 'filepath': r[0], 'filename': r[1], 'snippet': r[2], 'score': r[3] } for r in results ] } def _handle_index_file(self, params: Dict[str, Any]) -> Dict[str, Any]: """Index a single file""" filepath = params['filepath'] language = params.get('language', 'unknown') try: with open(filepath, 'r', encoding='utf-8') as f: content = f.read() # Extract symbols (simplified example) symbols = self._extract_symbols(content, language) cursor = self.conn.cursor() # Remove old entry if exists cursor.execute(""" DELETE FROM code_search WHERE filepath = ? """, (filepath,)) # Insert new entry cursor.execute(""" INSERT INTO code_search (filepath, filename, content, language, symbols) VALUES (?, ?, ?, ?, ?) """, ( filepath, os.path.basename(filepath), content, language, ' '.join(symbols) )) self.conn.commit() return {'status': 'success', 'filepath': filepath} except Exception as e: return {'status': 'error', 'message': str(e)} def _extract_symbols(self, content: str, language: str) -> List[str]: """Extract function/class names from code""" symbols = [] if language == 'python': import ast try: tree = ast.parse(content) for node in ast.walk(tree): if isinstance(node, ast.FunctionDef): symbols.append(node.name) elif isinstance(node, ast.ClassDef): symbols.append(node.name) except: pass # Add extractors for other languages return symbols ``` ### 2. Advanced MCP Features ```python class AdvancedCodeIndexMCP: def __init__(self, db_path: str): self.db_path = db_path self.conn = sqlite3.connect(db_path) self._initialize_advanced_tables() def _initialize_advanced_tables(self): """Create advanced indexing tables""" cursor = self.conn.cursor() # Main code index cursor.execute(""" CREATE VIRTUAL TABLE IF NOT EXISTS code_index USING fts5( filepath, content, tokenize = 'porter unicode61', prefix = '2 3' ) """) # Symbol index for quick navigation cursor.execute(""" CREATE VIRTUAL TABLE IF NOT EXISTS symbol_index USING fts5( symbol_name, symbol_type, -- function, class, variable filepath, line_number, definition ) """) # Documentation index cursor.execute(""" CREATE VIRTUAL TABLE IF NOT EXISTS doc_index USING fts5( filepath, docstring, comments ) """) self.conn.commit() def handle_semantic_search(self, params: Dict[str, Any]) -> Dict[str, Any]: """Handle semantic code search""" query = params['query'] search_type = params.get('type', 'all') # all, symbols, docs results = [] if search_type in ['all', 'code']: code_results = self._search_code(query) results.extend(code_results) if search_type in ['all', 'symbols']: symbol_results = self._search_symbols(query) results.extend(symbol_results) if search_type in ['all', 'docs']: doc_results = self._search_documentation(query) results.extend(doc_results) # Sort by relevance results.sort(key=lambda x: x['score']) return {'results': results[:params.get('limit', 20)]} def handle_find_definition(self, params: Dict[str, Any]) -> Dict[str, Any]: """Find symbol definition""" symbol = params['symbol'] context_file = params.get('context_file') cursor = self.conn.cursor() # Search in symbol index sql = """ SELECT filepath, line_number, definition, symbol_type FROM symbol_index WHERE symbol_index MATCH ? ORDER BY rank LIMIT 5 """ results = cursor.execute(sql, (f'symbol_name:{symbol}',)).fetchall() # If context file provided, prioritize results from same file if context_file and results: results.sort(key=lambda x: 0 if x[0] == context_file else 1) return { 'definitions': [ { 'filepath': r[0], 'line': r[1], 'definition': r[2], 'type': r[3] } for r in results ] } def handle_find_references(self, params: Dict[str, Any]) -> Dict[str, Any]: """Find all references to a symbol""" symbol = params['symbol'] cursor = self.conn.cursor() # Search for symbol in code content sql = """ SELECT filepath, snippet(code_index, 1, '<ref>', '</ref>', '...', 20), rank FROM code_index WHERE code_index MATCH ? ORDER BY rank LIMIT 50 """ # Use word boundaries for more accurate matching query = f'"{symbol}"' results = cursor.execute(sql, (query,)).fetchall() return { 'references': [ { 'filepath': r[0], 'snippet': r[1], 'score': r[2] } for r in results ] } ``` ### 3. MCP Server Configuration ```json { "mcpServers": { "code-index": { "command": "python", "args": [ "/path/to/code_index_mcp_server.py", "--db-path", "/path/to/code_index.db", "--watch-dirs", "/path/to/project1,/path/to/project2" ] } } } ``` ### 4. Integration with File Watchers ```python import watchdog.observers import watchdog.events class CodeIndexWatcher(watchdog.events.FileSystemEventHandler): def __init__(self, mcp_server): self.mcp_server = mcp_server self.supported_extensions = { '.py', '.js', '.ts', '.java', '.cpp', '.c', '.go', '.rs' } def on_modified(self, event): if not event.is_directory: self._handle_file_change(event.src_path) def on_created(self, event): if not event.is_directory: self._handle_file_change(event.src_path) def _handle_file_change(self, filepath): ext = os.path.splitext(filepath)[1] if ext in self.supported_extensions: # Re-index the file self.mcp_server.handle_request('index_file', { 'filepath': filepath, 'language': self._detect_language(ext) }) def _detect_language(self, ext): language_map = { '.py': 'python', '.js': 'javascript', '.ts': 'typescript', '.java': 'java', '.cpp': 'cpp', '.c': 'c', '.go': 'go', '.rs': 'rust' } return language_map.get(ext, 'unknown') def start_file_watcher(mcp_server, watch_dirs): """Start watching directories for changes""" observer = watchdog.observers.Observer() handler = CodeIndexWatcher(mcp_server) for directory in watch_dirs: observer.schedule(handler, directory, recursive=True) observer.start() return observer ``` ## Best Practices and Tips ### 1. Error Handling ```python def safe_index_file(filepath, conn): """Safely index a file with error handling""" try: with open(filepath, 'r', encoding='utf-8') as f: content = f.read() except UnicodeDecodeError: # Try different encoding try: with open(filepath, 'r', encoding='latin-1') as f: content = f.read() except: return {'error': 'Failed to read file'} except Exception as e: return {'error': str(e)} # Index the content cursor = conn.cursor() try: cursor.execute(""" INSERT OR REPLACE INTO code_index (filepath, content) VALUES (?, ?) """, (filepath, content)) conn.commit() return {'success': True} except Exception as e: conn.rollback() return {'error': f'Database error: {str(e)}'} ``` ### 2. Security Considerations ```python def sanitize_search_query(query): """Sanitize user input for FTS queries""" # Remove potentially problematic characters query = query.replace('"', '""') # Escape quotes # Limit query length max_length = 1000 if len(query) > max_length: query = query[:max_length] return query ``` ### 3. Testing ```python import unittest class TestCodeIndex(unittest.TestCase): def setUp(self): self.conn = sqlite3.connect(':memory:') self.cursor = self.conn.cursor() self.cursor.execute(""" CREATE VIRTUAL TABLE code_index USING fts5(filepath, content) """) def test_basic_search(self): # Insert test data self.cursor.execute(""" INSERT INTO code_index VALUES (?, ?) """, ('test.py', 'def hello_world(): print("Hello")')) # Search results = self.cursor.execute(""" SELECT * FROM code_index WHERE code_index MATCH 'hello' """).fetchall() self.assertEqual(len(results), 1) self.assertEqual(results[0][0], 'test.py') def test_ranking(self): # Insert test data with different relevance test_data = [ ('file1.py', 'hello world hello'), ('file2.py', 'hello'), ('file3.py', 'world hello world world') ] self.cursor.executemany(""" INSERT INTO code_index VALUES (?, ?) """, test_data) # Search and check ranking results = self.cursor.execute(""" SELECT filepath, rank FROM code_index WHERE code_index MATCH 'hello' ORDER BY rank """).fetchall() # File with more 'hello' occurrences should rank higher self.assertEqual(results[0][0], 'file1.py') ``` ## Conclusion SQLite FTS5 provides a powerful and efficient solution for implementing full-text search in code indexing applications. When combined with Python and the MCP server pattern, it offers: - Fast, scalable search capabilities - Flexible query syntax suitable for code search - Built-in ranking and highlighting features - Easy integration with existing SQLite databases - Low resource requirements Key takeaways: 1. Use appropriate tokenizers for your use case 2. Implement incremental indexing for large codebases 3. Optimize queries using the rank column and auxiliary functions 4. Consider security implications when handling user input 5. Regular maintenance (OPTIMIZE) improves performance This guide provides a foundation for building robust code search functionality in your applications using SQLite FTS5.

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ViperJuice/Code-Index-MCP'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

sqlite_fts5_overview.md•28.4 KiB