# caption_image
Generate a descriptive caption for an image from a URL or a local file using an AI vision model. Captioning runs either remotely through OpenRouter or locally with a transformers vision-language model; the default prompt asks for key subjects, scene, and mood in one or two sentences.
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| backend | No | Captioning backend: `openrouter` (remote) or `local` (on-device transformers model). If omitted, falls back to the global `caption_backend` config value. | `openrouter` |
| file_path | No | Path to a local image file. Provide exactly one of `file_path` or `image_url`. | |
| image_url | No | URL of the image to caption. Provide exactly one of `image_url` or `file_path`. | |
| local_model_id | No | Hugging Face model id used when `backend` is `local`. If omitted, falls back to the global `local_vlm_id` config value. | `Qwen/Qwen2-VL-2B-Instruct` |
| prompt | No | Instruction sent to the vision model. | Write a concise, vivid caption for this image. Describe key subjects, scene, and mood in 1-2 sentences. |

The schema default for the optional parameters is `null`; the defaults shown above for `backend` and `local_model_id` are the effective values the handler resolves from global config at runtime.
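A minimal client-side call, as a sketch using the official `mcp` Python SDK: it assumes an already initialized `ClientSession` named `session` (connection setup omitted), and the URL and file path are placeholders.

```python
# Sketch: assumes an initialized mcp ClientSession named `session`.
result = await session.call_tool(
    "caption_image",
    arguments={
        "image_url": "https://example.com/photo.jpg",  # placeholder URL
    },
)

# Local-backend variant; exactly one of image_url/file_path may be set.
result = await session.call_tool(
    "caption_image",
    arguments={
        "file_path": "./photo.jpg",  # placeholder path
        "backend": "local",
        "local_model_id": "Qwen/Qwen2-VL-2B-Instruct",
    },
)
```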
## Input Schema (JSON Schema)
```json
{
  "properties": {
    "backend": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Backend"
    },
    "file_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "File Path"
    },
    "image_url": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Image Url"
    },
    "local_model_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Local Model Id"
    },
    "prompt": {
      "default": "Write a concise, vivid caption for this image. Describe key subjects, scene, and mood in 1-2 sentences.",
      "title": "Prompt",
      "type": "string"
    }
  },
  "title": "caption_imageArguments",
  "type": "object"
}
```
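The schema title `caption_imageArguments` is consistent with a Pydantic-generated model; an equivalent definition would look roughly like the following reconstruction (an illustration matching the schema above, not the project's actual source):

```python
from typing import Optional
from pydantic import BaseModel

# Reconstructed from the JSON Schema above; field names, types, and defaults
# match the schema, but the class itself is illustrative.
class CaptionImageArguments(BaseModel):
    backend: Optional[str] = None
    file_path: Optional[str] = None
    image_url: Optional[str] = None
    local_model_id: Optional[str] = None
    prompt: str = (
        "Write a concise, vivid caption for this image. "
        "Describe key subjects, scene, and mood in 1-2 sentences."
    )
```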
## Implementation Reference
- src/cv_mcp/mcp_server.py:37-77 (handler): the MCP tool handler for `caption_image`. Validates inputs, selects the backend (`openrouter` or `local`), and delegates captioning to the appropriate client.

```python
@mcp.tool()
def caption_image(
    image_url: Optional[str] = None,
    file_path: Optional[str] = None,
    prompt: str = DEFAULT_PROMPT,
    backend: Optional[str] = None,
    local_model_id: Optional[str] = None,
) -> str:
    if not image_url and not file_path:
        raise ValueError("Provide either image_url or file_path")
    if image_url and file_path:
        raise ValueError("Provide only one of image_url or file_path, not both")
    image_ref = image_url or file_path  # type: ignore

    # Resolve defaults from global config if not explicitly provided
    try:
        from cv_mcp.metadata.runner import _CFG as _GLOBAL_CFG  # type: ignore
    except Exception:
        _GLOBAL_CFG = {}
    backend = (backend or str(_GLOBAL_CFG.get("caption_backend", "openrouter"))).lower()
    local_model_id = local_model_id or str(_GLOBAL_CFG.get("local_vlm_id", "Qwen/Qwen2-VL-2B-Instruct"))

    if backend == "openrouter":
        client = OpenRouterClient()
        res = client.analyze_single_image(image_ref, prompt)
        if not res.get("success"):
            raise RuntimeError(str(res.get("error", "Captioning failed")))
        content = res.get("content", "")
        return str(content)
    elif backend == "local":
        try:
            from cv_mcp.captioning.local_captioner import LocalCaptioner
        except Exception as e:  # pragma: no cover
            raise RuntimeError(
                "Local backend not available. Install optional deps with `pip install .[local]`."
            ) from e
        local = LocalCaptioner(model_id=local_model_id)
        return local.caption(image_ref, prompt)
    else:
        raise ValueError("Invalid backend. Use 'openrouter' or 'local'.")
```
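A few illustrative calls (a sketch: it assumes the decorated function remains directly callable, as FastMCP's `@mcp.tool()` allows; the URL and path are placeholders):

```python
caption_image(image_url="https://example.com/photo.jpg")  # remote via OpenRouter
caption_image(file_path="./photo.jpg", backend="local")   # local Qwen2-VL inference

caption_image()  # ValueError: Provide either image_url or file_path
caption_image(
    image_url="https://example.com/a.jpg",
    file_path="./a.jpg",
)  # ValueError: only one of image_url or file_path is allowed
```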
- OpenRouter client method called by the handler for remote captioning; wraps `analyze_images` for the single-image case.

```python
def analyze_single_image(
    self,
    image: Union[str, Dict],
    prompt: str,
    *,
    model: Optional[str] = None,
    system: Optional[str] = None,
) -> Dict[str, Any]:
    return self.analyze_images([image], prompt, model=model, system=system)
```
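The handler reads only three keys from the returned dict, which implies the result shape below (inferred from the `res.get(...)` calls in the handler, not from the client's own documentation):

```python
# `client` and the URL are illustrative placeholders.
res = client.analyze_single_image("https://example.com/photo.jpg", prompt)
# Inferred shape:
#   success: {"success": True,  "content": "<caption text>"}
#   failure: {"success": False, "error": "<error message>"}
if res.get("success"):
    print(res["content"])
else:
    raise RuntimeError(res.get("error", "Captioning failed"))
```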
- Local captioner method called by the handler for local model inference; loads the image, builds a chat-template prompt, and generates a caption with a transformers model.

```python
def caption(
    self,
    image: Union[str, "Image.Image"],
    prompt: str,
    max_new_tokens: int = 128,
) -> str:
    img = self._load_image(image)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": img},
                {"type": "text", "text": prompt},
            ],
        }
    ]
    text = self.processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = self.processor(text=[text], images=[img], return_tensors="pt").to(self.model.device)
    generate_ids = self.model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        use_cache=True,
    )
    out = self.processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
    return out.strip()
```
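End-to-end local usage, as a sketch: the `model_id` keyword and the `caption(image, prompt)` signature are taken from the handler and the method above; the file path is a placeholder.

```python
from cv_mcp.captioning.local_captioner import LocalCaptioner

# Placeholder path; the model id matches the handler's default.
captioner = LocalCaptioner(model_id="Qwen/Qwen2-VL-2B-Instruct")
print(captioner.caption("./photo.jpg", "Describe this image in one sentence."))
```

Because generation uses `do_sample=False`, captions are deterministic for a given model, image, and prompt.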