ChunkHound

embedded-sql-detection.md•5.97 KiB

# Embedded SQL Detection ## Overview ChunkHound now supports detecting and extracting SQL code embedded in string literals across all supported programming languages. This feature addresses the common pattern of SQL being embedded as strings in application code (Python, Java, JavaScript, etc.). ## Motivation As noted by the maintainers, SQL is frequently embedded in strings across various programming languages: ```python cursor.execute("SELECT * FROM users WHERE id = %s") ``` Without embedded SQL detection, ChunkHound would only see this as a string literal in a function, missing the semantic SQL content. With this feature enabled, ChunkHound extracts the SQL as a separate, searchable chunk. ## Architecture ### Components 1. **EmbeddedSqlDetector** (`chunkhound/parsers/embedded_sql_detector.py`) - Recursively visits all AST nodes to find string literals - Applies heuristic confidence scoring to determine if a string contains SQL - Extracts SQL content with context (host function/class) 2. **UniversalParser Integration** (`chunkhound/parsers/universal_parser.py`) - Optional `detect_embedded_sql` parameter in parser initialization - Post-processing step after normal chunking - Creates additional chunks for detected SQL 3. **ParserFactory Support** (`chunkhound/parsers/parser_factory.py`) - `detect_embedded_sql` parameter propagated through factory - Disabled by default for backward compatibility ### Detection Heuristics The detector uses a multi-factor confidence scoring system: #### Primary SQL Keywords (30 points each, one required) - SELECT, INSERT, UPDATE, DELETE, CREATE, ALTER, DROP #### Secondary Keywords (10 points each) - HAVING, LIMIT, OFFSET #### SQL Patterns (variable points) - `FROM table_name` (25 points) - `WHERE condition` (20 points) - `JOIN table_name` (20 points) - `SET column = value` (20 points) - `ORDER BY ...` (20 points) - `GROUP BY ...` (20 points) - `INTO table_name` (20 points) - `TABLE ...` (20 points) - `VALUES (...)` (10 points) - `PRIMARY KEY` (10 points) - `FOREIGN KEY` (10 points) **Confidence Threshold**: 0.6 (60 points minimum) Simple queries like `SELECT * FROM users` score ~0.7, while complex queries with multiple clauses score 0.8-0.9+. ## Usage ### Programmatic API ```python from chunkhound.parsers.parser_factory import ParserFactory from chunkhound.core.types.common import Language # Create parser with embedded SQL detection factory = ParserFactory() parser = factory.create_parser( Language.PYTHON, detect_embedded_sql=True # Enable detection ) # Parse code containing embedded SQL code = ''' def get_user(user_id): query = "SELECT * FROM users WHERE id = ?" return db.execute(query, user_id) ''' chunks = parser.parse_content(code) # Find embedded SQL chunks sql_chunks = [c for c in chunks if c.metadata.get("embedded")] for chunk in sql_chunks: print(f"Found SQL: {chunk.code}") print(f"Confidence: {chunk.metadata['sql_confidence']}") print(f"Context: {chunk.metadata['host_context']}") ``` ### Chunk Metadata Embedded SQL chunks include special metadata: ```python { "embedded": True, "host_language": "python", # Original file language "host_context": "function:get_user", # Where SQL was found "sql_confidence": 0.7, # Detection confidence (0-1) "detected_language": "sql", } ``` ## Language Support ### Verified Working (28/28 tests passing) - ✅ Python - string literals and docstrings - ✅ Java - string literals - ✅ JavaScript - string literals and template strings - ✅ TypeScript - string literals and template strings - ✅ C# - string literals and verbatim strings - ✅ Go - string literals and raw strings - ✅ Rust - string literals - ✅ PHP - string literals (PDO and MySQLi styles) ## Examples ### Python with Triple-Quoted Strings ```python def complex_query(): sql = """ SELECT u.id, u.name, COUNT(o.id) as orders FROM users u LEFT JOIN orders o ON u.id = o.user_id WHERE u.active = true GROUP BY u.id, u.name HAVING COUNT(o.id) > 5 """ return db.execute(sql) ``` **Detected**: Yes (confidence: 0.9+) ### Java JDBC ```java public User findById(int id) { String sql = "SELECT * FROM users WHERE id = ?"; return jdbcTemplate.queryForObject(sql, new UserMapper(), id); } ``` **Detected**: Yes (confidence: 0.7) ### JavaScript/TypeScript ORM ```javascript const users = await db.query( "SELECT * FROM users WHERE created_at > $1 ORDER BY name", [date] ); ``` **Detected**: Yes (confidence: 0.9) ## Performance Considerations - **Minimal Overhead**: Detection only runs when explicitly enabled - **No SQL Parsing**: Uses heuristics, doesn't parse SQL into AST - **Single Pass**: Visits each AST node once during tree traversal - **Confidence Filtering**: Only creates chunks for high-confidence matches ## Future Enhancements ### Short Term 1. **Improve JavaScript template string support** - handle backtick strings 2. **Add PHP string node types** - map PHP-specific AST nodes 3. **Configuration options** - allow custom confidence thresholds ### Long Term 1. **SQL Dialect Detection** - distinguish PostgreSQL, MySQL, T-SQL, etc. 2. **Parameterized Query Analysis** - detect and extract parameter placeholders 3. **SQL Validation** - optionally parse and validate SQL syntax 4. **Cross-Reference Linking** - link SQL chunks to their calling functions ## Testing Run embedded SQL detection tests: ```bash uv run pytest tests/test_embedded_sql.py -v ``` Current test results: **28/28 passing** (100% success rate) ## Configuration Currently controlled via parser initialization parameter. Future versions may support: - Configuration file settings - Per-language confidence thresholds - Whitelist/blacklist of SQL keywords - Custom detection patterns ## Related Work This feature addresses [maintainer feedback](https://github.com/chunkhound/chunkhound/pull/183) about SQL being embedded in strings across languages. ## Credits Designed and implemented by AI agents in collaboration with the ChunkHound community.

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ofriw/chunkhound'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

embedded-sql-detection.md•5.97 KiB