semantic_search_files
Search files using natural language queries to find relevant content across documents. This tool enables semantic search over file embeddings for efficient information retrieval.
Instructions
Semantic search over file embeddings. Use natural language to find relevant content across files.
Input Schema
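| Name | Required | Description | Default |
|---|---|---|---|
| query | Yes | Natural language search query | |
| tenant_id | Yes | Tenant context for authorization | |
| project_id | Yes | Project context for authorization | |
| top_k | No | Maximum number of results to return | 10 |
| file_ids | No | Optional filter to search only within specific files | None |
| threshold | No | Optional minimum similarity score (0-1) | None |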
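
To make the schema concrete, here is a minimal sketch of assembling arguments for this tool. The helper name `build_search_arguments` and its validation rules are assumptions for illustration only. The source does not say whether the server rejects an empty query, a non-positive `top_k`, or a `threshold` outside 0-1. The sample IDs are made up.

```python
from typing import Any, Optional


def build_search_arguments(
    query: str,
    tenant_id: str,
    project_id: str,
    top_k: int = 10,
    file_ids: Optional[list[str]] = None,
    threshold: Optional[float] = None,
) -> dict[str, Any]:
    """Hypothetical helper: assemble arguments matching the input schema above."""
    # These checks are assumptions, not documented server behaviour.
    if not query.strip():
        raise ValueError("query must be a non-empty string")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return {
        "query": query,
        "tenant_id": tenant_id,
        "project_id": project_id,
        "top_k": top_k,
        "file_ids": file_ids,
        "threshold": threshold,
    }


# Only the three required fields; optional fields fall back to their defaults.
minimal = build_search_arguments("termination clauses", "tenant-a", "project-1")

# Narrow the search to one file and drop weak matches.
narrowed = build_search_arguments(
    "termination clauses",
    "tenant-a",
    "project-1",
    top_k=3,
    file_ids=["file-123"],
    threshold=0.75,
)
```

Leaving `file_ids` and `threshold` unset searches every file in the project and applies no minimum score, so `top_k` is the only limit on the result count.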
Implementation Reference
- src/tools/semantic_search_files.py:7-42 (handler) Main handler function that executes semantic search over file embeddings. Accepts query, tenant_id, project_id, top_k, file_ids, and threshold parameters, calls the vector provider's semantic_search method, and returns ranked results with content, scores, and provenance.

    def semantic_search_files(
        query: str,
        tenant_id: str,
        project_id: str,
        top_k: int = 10,
        file_ids: list[str] | None = None,
        threshold: float | None = None,
    ) -> dict[str, Any]:
        """
        Perform semantic search over file embeddings stored in pgvector.

        Args:
            query: Natural language search query
            tenant_id: Tenant context for authorization
            project_id: Project context for authorization
            top_k: Maximum number of results to return (default 10)
            file_ids: Optional filter to search only within specific files
            threshold: Optional minimum similarity score (0-1)

        Returns:
            Ranked chunk results with content, scores, and provenance
        """
        provider = get_vector_provider()
        results = provider.semantic_search(
            query=query,
            tenant_id=tenant_id,
            project_id=project_id,
            top_k=top_k,
            file_ids=file_ids,
            threshold=threshold,
        )
        return {
            "query": query,
            "results": results,
            "count": len(results),
        }

- src/tools/__init__.py:32-44 (registration) Registers the 'semantic_search_files' tool with the MCP server using the @mcp.tool decorator. The wrapper function semantic_search_files_tool exposes the handler to MCP clients with proper parameter typing and a docstring.

    @mcp.tool(name="semantic_search_files")
    def semantic_search_files_tool(
        query: str,
        tenant_id: str,
        project_id: str,
        top_k: int = 10,
        file_ids: list[str] | None = None,
        threshold: float | None = None,
    ) -> dict:
        """Semantic search over file embeddings. Use natural language to find relevant content across files."""
        return semantic_search_files(
            query, tenant_id, project_id, top_k, file_ids, threshold
        )

- src/providers/vector/base.py:8-25 (schema) Defines the VectorProviderProtocol interface that specifies the contract for semantic_search implementations. Documents the expected parameters and return type (list of dicts with chunk_id, file_id, content, page_number, score, provenance).

    class VectorProviderProtocol(Protocol):
        """Protocol for semantic search over file embeddings."""

        def semantic_search(
            self,
            query: str,
            tenant_id: str,
            project_id: str,
            top_k: int = 10,
            file_ids: list[str] | None = None,
            threshold: float | None = None,
        ) -> list[dict[str, Any]]:
            """
            Run semantic search.
            Returns list of dicts with:
            chunk_id, file_id, content, page_number, score, provenance.
            """
            ...

- Factory function get_vector_provider() that returns the appropriate vector provider instance (PgVectorProvider, ChromaDBProvider, or MockVectorProvider) based on VECTOR_PROVIDER configuration.

    def get_vector_provider() -> VectorProviderProtocol:
        """Get vector provider based on VECTOR_PROVIDER config."""
        global _provider_instance
        if _provider_instance is not None:
            return _provider_instance

        provider_name = (config.VECTOR_PROVIDER or "").strip().lower()
        if not provider_name and config.PGVECTOR_ENABLED:
            provider_name = "pgvector"
        if not provider_name:
            provider_name = "mock"

        if provider_name == "pgvector":
            _provider_instance = PgVectorProvider()
        elif provider_name == "chromadb":
            try:
                from .chromadb_provider import ChromaDBProvider
                _provider_instance = ChromaDBProvider()
            except ImportError:
                _provider_instance = MockVectorProvider()
        else:
            _provider_instance = MockVectorProvider()
        return _provider_instance
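
To illustrate what a class satisfying the semantic_search contract above might look like, here is a toy in-memory provider. It is a sketch under stated assumptions, not the repository's PgVectorProvider, ChromaDBProvider, or MockVectorProvider. The class name `ToyVectorProvider`, the bag-of-words stand-in for embeddings, the chunk layout, and the shape of `provenance` are all hypothetical.

```python
import math
from collections import Counter
from typing import Any, Optional


def _bag_of_words(text: str) -> Counter:
    """Stand-in for a real embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorProvider:
    """Hypothetical provider exposing the same semantic_search signature."""

    def __init__(self, chunks: list[dict[str, Any]]) -> None:
        self._chunks = chunks

    def semantic_search(
        self,
        query: str,
        tenant_id: str,
        project_id: str,
        top_k: int = 10,
        file_ids: Optional[list[str]] = None,
        threshold: Optional[float] = None,
    ) -> list[dict[str, Any]]:
        query_vec = _bag_of_words(query)
        hits = []
        for chunk in self._chunks:
            # Scope results to the caller's tenant and project.
            if chunk["tenant_id"] != tenant_id or chunk["project_id"] != project_id:
                continue
            # Optional restriction to specific files.
            if file_ids is not None and chunk["file_id"] not in file_ids:
                continue
            score = _cosine(query_vec, _bag_of_words(chunk["content"]))
            # Optional minimum similarity cut-off.
            if threshold is not None and score < threshold:
                continue
            hits.append({
                "chunk_id": chunk["chunk_id"],
                "file_id": chunk["file_id"],
                "content": chunk["content"],
                "page_number": chunk["page_number"],
                "score": score,
                "provenance": {"source": "toy"},  # assumed shape
            })
        # Highest score first, then keep at most top_k results.
        hits.sort(key=lambda hit: hit["score"], reverse=True)
        return hits[:top_k]


chunks = [
    {"chunk_id": "c1", "file_id": "f1", "tenant_id": "t1", "project_id": "p1",
     "page_number": 1, "content": "invoice payment terms net thirty days"},
    {"chunk_id": "c2", "file_id": "f2", "tenant_id": "t1", "project_id": "p1",
     "page_number": 4, "content": "employee onboarding checklist"},
    {"chunk_id": "c3", "file_id": "f3", "tenant_id": "t2", "project_id": "p9",
     "page_number": 2, "content": "payment terms for another tenant"},
]
provider = ToyVectorProvider(chunks)
results = provider.semantic_search("payment terms", "t1", "p1", top_k=5, threshold=0.1)
only_f2 = provider.semantic_search("payment terms", "t1", "p1", file_ids=["f2"])
```

In this sketch the tenant, project, and file filters run before scoring, and the threshold is applied before truncating to `top_k`, so a strict threshold can return fewer than `top_k` results. Whether the real providers apply these steps in the same order is not stated in the source.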