# searchVisualContent
Search video content visually to find specific frames using OCR and AI descriptions. Returns matching images with timestamps for evidence-based discovery.
## Instructions
Search the actual visual content of a video or your indexed frame library. Uses Apple Vision OCR, optional Gemini frame descriptions, and optional Gemini semantic embeddings. Always returns frame/image evidence with timestamps.
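For instance, an MCP client could invoke the tool with an arguments object like this (the URL and parameter values are illustrative placeholders, not taken from the project):

```typescript
// Hypothetical searchVisualContent arguments; only `query` is required.
// videoIdOrUrl scopes the search to one video and enables auto-indexing
// when no visual index exists yet. The URL below is a placeholder.
const searchArgs = {
  query: "whiteboard diagram",
  videoIdOrUrl: "https://example.com/watch?v=PLACEHOLDER",
  maxResults: 5,
  minScore: 0.2,
  autoIndexIfNeeded: true,
};

// Serialize as the JSON payload an MCP client would send.
console.log(JSON.stringify(searchArgs));
```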
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| query | Yes | Visual search query, e.g. 'whiteboard diagram' or 'slide that says title research checklist' | |
| videoIdOrUrl | No | Optional video scope. If provided, the server can auto-index this video if needed. | |
| maxResults | No | Maximum number of matches to return (1–20) | 5 |
| minScore | No | Minimum match score (0–1) a frame must reach to be included | 0.12 |
| autoIndexIfNeeded | No | If scoped to a video and no visual index exists yet, build it automatically | true |
| intervalSec | No | Frame interval in seconds (2–3600) to use if auto-indexing is triggered | |
| maxFrames | No | Frame cap (1–100) to use if auto-indexing is triggered | |
| imageFormat | No | Image format for extracted frames: `jpg`, `png`, or `webp` | |
| width | No | Frame width in pixels (160–3840) | |
| autoDownload | No | Download the source video automatically if auto-indexing needs it | |
| downloadFormat | No | Video quality for auto-download: `best_video` or `worst_video` | |
| includeGeminiDescriptions | No | Generate Gemini visual descriptions for frames during auto-indexing | |
| includeGeminiEmbeddings | No | Generate Gemini semantic embeddings for frames during auto-indexing | |
| dryRun | No | Report what would be done without performing it | |
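The handler applies defaults and bounds to the numeric parameters: `maxResults` defaults to 5 and is clamped to 1–20, and `minScore` defaults to 0.12. A small sketch of that behavior (the `effectiveLimits` helper is illustrative, not part of the server):

```typescript
// Mirrors the handler's `clamp(params.maxResults ?? 5, 1, 20)` and
// `params.minScore ?? 0.12` defaulting.
function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// Illustrative helper: resolve the limits the search actually uses.
function effectiveLimits(maxResults?: number, minScore?: number) {
  return {
    maxResults: clamp(maxResults ?? 5, 1, 20),
    minScore: minScore ?? 0.12,
  };
}

console.log(effectiveLimits(50));             // { maxResults: 20, minScore: 0.12 }
console.log(effectiveLimits(undefined, 0.3)); // { maxResults: 5, minScore: 0.3 }
```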
## Implementation Reference
- `src/lib/visual-search.ts:530-604` (handler) — The handler for `searchVisualContent`, implemented as `searchText` inside the `VisualSearchEngine` class. It performs visual search by querying the indexed SQLite database, combining lexical matching (OCR/description) with semantic embeddings.

```typescript
async searchText(params: SearchVisualContentParams): Promise<SearchVisualContentResult> {
  const rawQuery = params.query?.trim();
  const normalizedQuery = normalizeText(rawQuery);
  if (!normalizedQuery) {
    throw new Error("query cannot be empty");
  }

  if (params.videoId && (params.autoIndexIfNeeded ?? true) && this.needsIndexing(params.videoId)) {
    await this.indexVideo({
      videoId: params.videoId,
      ...(params.indexIfNeeded ?? {}),
    });
  }

  const frames = this.store
    .listSearchFrames({ videoId: params.videoId })
    .filter((frame) => existsSync(frame.framePath));
  if (frames.length === 0) {
    throw new Error(
      "No indexed visual frames found. Run indexVisualContent first, or provide videoIdOrUrl so search can auto-index it."
    );
  }

  const embeddingSummary = summarizeEmbeddingProvider(frames);
  let semanticQueryEmbedding: number[] | undefined;
  if (embeddingSummary.provider !== "none") {
    const selection: EmbeddingSelection = {
      kind: "gemini",
      model: embeddingSummary.model,
      dimensions: embeddingSummary.dimensions,
    };
    const cacheKey = buildEmbeddingCacheKey(rawQuery ?? normalizedQuery, selection);
    semanticQueryEmbedding = this.queryEmbeddingCache.get(cacheKey);
    if (!semanticQueryEmbedding) {
      const provider = await createEmbeddingProvider(selection);
      semanticQueryEmbedding = provider
        ? await provider.embedQuery(rawQuery ?? normalizedQuery)
        : undefined;
      if (semanticQueryEmbedding?.length) {
        this.queryEmbeddingCache.set(cacheKey, semanticQueryEmbedding);
      }
    }
  }

  const results = frames
    .map((frame) =>
      scoreFrameAgainstQuery({
        query: normalizedQuery,
        rawQuery: rawQuery ?? normalizedQuery,
        frame,
        semanticQueryEmbedding,
      })
    )
    .filter((item) => item.score >= (params.minScore ?? 0.12))
    .sort(
      (a, b) =>
        b.score - a.score ||
        b.semanticScore! - a.semanticScore! ||
        b.lexicalScore - a.lexicalScore
    )
    .slice(0, clamp(params.maxResults ?? 5, 1, 20));

  // Compute coverage hints when scoped to a single video
  let coveredTimeRange: { startSec: number; endSec: number } | undefined;
  let needsExpansion: boolean | undefined;
  if (params.videoId) {
    const range = this.store.getFrameTimeRange(params.videoId);
    if (range) {
      coveredTimeRange = { startSec: range.minSec, endSec: range.maxSec };
      const videoAsset = this.findVideoAsset(params.videoId);
      const videoDuration = videoAsset?.durationSec;
      if (videoDuration && videoDuration > 0) {
        const coverage = (range.maxSec - range.minSec) / videoDuration;
        needsExpansion = coverage < 0.5;
      }
    }
  }

  return {
    query: rawQuery ?? normalizedQuery,
    results,
    searchedFrames: frames.length,
    searchedVideos: new Set(frames.map((frame) => frame.videoId)).size,
    descriptionProvider: summarizeDescriptionProvider(frames),
    embeddingProvider: embeddingSummary.provider,
    embeddingModel: embeddingSummary.model,
    queryMode: semanticQueryEmbedding ? "gemini_semantic_plus_lexical" : "ocr_description_lexical",
    coveredTimeRange,
    needsExpansion,
    limitations: buildSearchLimitations(
      summarizeDescriptionProvider(frames),
      embeddingSummary.provider
    ),
  };
}
```

- `src/server/mcp-server.ts:538-561` (registration) — The MCP tool definition and input schema registration for `searchVisualContent`.

```typescript
{
  name: "searchVisualContent",
  description:
    "Search the actual visual content of a video or your indexed frame library. Uses Apple Vision OCR, optional Gemini frame descriptions, and optional Gemini semantic embeddings. Always returns frame/image evidence with timestamps.",
  inputSchema: {
    type: "object",
    properties: {
      query: { type: "string", description: "Visual search query, e.g. 'whiteboard diagram' or 'slide that says title research checklist'" },
      videoIdOrUrl: { type: "string", description: "Optional video scope. If provided, the server can auto-index this video if needed." },
      maxResults: { type: "number", minimum: 1, maximum: 20 },
      minScore: { type: "number", minimum: 0, maximum: 1 },
      autoIndexIfNeeded: { type: "boolean", description: "If scoped to a video and no visual index exists yet, build it automatically (default true)" },
      intervalSec: { type: "number", minimum: 2, maximum: 3600, description: "Frame interval to use if auto-indexing is triggered" },
      maxFrames: { type: "number", minimum: 1, maximum: 100, description: "Frame cap to use if auto-indexing is triggered" },
      imageFormat: { type: "string", enum: ["jpg", "png", "webp"] },
      width: { type: "number", minimum: 160, maximum: 3840 },
      autoDownload: { type: "boolean" },
      downloadFormat: { type: "string", enum: ["best_video", "worst_video"] },
      includeGeminiDescriptions: { type: "boolean" },
      includeGeminiEmbeddings: { type: "boolean" },
      dryRun: { type: "boolean" },
    },
    required: ["query"],
    additionalProperties: false,
  },
},
```

- `src/lib/visual-search.ts:75-113` (schema) — Type definitions for the parameters and results of the `searchVisualContent` tool.

```typescript
export interface SearchVisualContentParams {
  query: string;
  videoId?: string;
  maxResults?: number;
  minScore?: number;
  autoIndexIfNeeded?: boolean;
  indexIfNeeded?: Omit<IndexVisualContentParams, "videoId">;
}

export interface SearchVisualMatch {
  score: number;
  lexicalScore: number;
  semanticScore?: number;
  matchedOn: Array<"ocr" | "description" | "semantic">;
  videoId: string;
  sourceVideoUrl: string;
  sourceVideoTitle?: string;
  frameAssetId?: string;
  framePath: string;
  timestampSec: number;
  timestampLabel: string;
  explanation: string;
  ocrText?: string;
  visualDescription?: string;
}

export interface SearchVisualContentResult {
  query: string;
  results: SearchVisualMatch[];
  searchedFrames: number;
  searchedVideos: number;
  descriptionProvider: "none" | "gemini" | "mixed";
  embeddingProvider: "none" | "gemini" | "mixed";
  embeddingModel?: string;
  queryMode: "ocr_description_lexical" | "gemini_semantic_plus_lexical";
  coveredTimeRange?: { startSec: number; endSec: number };
  needsExpansion?: boolean;
  limitations: string[];
}
```
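`scoreFrameAgainstQuery` itself is not reproduced above. A minimal sketch of how a hybrid lexical-plus-semantic frame score could be computed, assuming cosine similarity for the semantic part and an illustrative 0.6/0.4 weighting (the server's actual scoring may differ):

```typescript
// Illustrative sketch only: each indexed frame carries OCR text, an
// optional description, and an optional embedding vector.
interface Frame {
  timestampSec: number;
  ocrText: string;
  description?: string;
  embedding?: number[];
}

// Fraction of query terms found in the frame's OCR text or description.
function lexicalScore(query: string, frame: Frame): number {
  const haystack = `${frame.ocrText} ${frame.description ?? ""}`.toLowerCase();
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const hits = terms.filter((t) => haystack.includes(t)).length;
  return terms.length ? hits / terms.length : 0;
}

// Cosine similarity between two embedding vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Blend semantic and lexical evidence; fall back to lexical-only when
// no query embedding is available (mirrors the "queryMode" distinction).
function score(query: string, frame: Frame, queryEmbedding?: number[]): number {
  const lex = lexicalScore(query, frame);
  const sem = frame.embedding && queryEmbedding ? cosine(frame.embedding, queryEmbedding) : 0;
  return queryEmbedding ? 0.6 * sem + 0.4 * lex : lex;
}
```

Frames scoring below the `minScore` floor (0.12 by default in the real handler) would then be filtered out before sorting and truncation.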