# caption_image
Generate a descriptive caption for an image from a URL or a local file using an AI vision model. Captioning runs either remotely through OpenRouter or locally with a transformers vision-language model; the default prompt asks for key subjects, scene, and mood in one or two sentences.
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| backend | No | Captioning backend: `openrouter` (remote) or `local` (on-device transformers model). If omitted, falls back to the global `caption_backend` config value. | `openrouter` |
| file_path | No | Path to a local image file. Provide exactly one of `file_path` or `image_url`. | |
| image_url | No | URL of the image to caption. Provide exactly one of `image_url` or `file_path`. | |
| local_model_id | No | Hugging Face model id used when `backend` is `local`. If omitted, falls back to the global `local_vlm_id` config value. | `Qwen/Qwen2-VL-2B-Instruct` |
| prompt | No | Instruction sent to the vision model. | Write a concise, vivid caption for this image. Describe key subjects, scene, and mood in 1-2 sentences. |

The schema default for the optional parameters is `null`; the defaults shown above for `backend` and `local_model_id` are the effective values the handler resolves from global config at runtime.
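A minimal client-side call, as a sketch using the official `mcp` Python SDK: it assumes an already initialized `ClientSession` named `session` (connection setup omitted), and the URL and file path are placeholders.

```python
# Sketch: assumes an initialized mcp ClientSession named `session`.
result = await session.call_tool(
    "caption_image",
    arguments={
        "image_url": "https://example.com/photo.jpg",  # placeholder URL
    },
)

# Local-backend variant; exactly one of image_url/file_path may be set.
result = await session.call_tool(
    "caption_image",
    arguments={
        "file_path": "./photo.jpg",  # placeholder path
        "backend": "local",
        "local_model_id": "Qwen/Qwen2-VL-2B-Instruct",
    },
)
```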
## Input Schema (JSON Schema)
```json
{
  "properties": {
    "backend": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Backend"
    },
    "file_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "File Path"
    },
    "image_url": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Image Url"
    },
    "local_model_id": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Local Model Id"
    },
    "prompt": {
      "default": "Write a concise, vivid caption for this image. Describe key subjects, scene, and mood in 1-2 sentences.",
      "title": "Prompt",
      "type": "string"
    }
  },
  "title": "caption_imageArguments",
  "type": "object"
}
```
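The schema title `caption_imageArguments` is consistent with a Pydantic-generated model; an equivalent definition would look roughly like the following reconstruction (an illustration matching the schema above, not the project's actual source):

```python
from typing import Optional
from pydantic import BaseModel

# Reconstructed from the JSON Schema above; field names, types, and defaults
# match the schema, but the class itself is illustrative.
class CaptionImageArguments(BaseModel):
    backend: Optional[str] = None
    file_path: Optional[str] = None
    image_url: Optional[str] = None
    local_model_id: Optional[str] = None
    prompt: str = (
        "Write a concise, vivid caption for this image. "
        "Describe key subjects, scene, and mood in 1-2 sentences."
    )
```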
## Implementation Reference
- src/cv_mcp/mcp_server.py:37-77 (handler): the MCP tool handler for `caption_image`. Validates inputs, selects the backend (`openrouter` or `local`), and delegates captioning to the appropriate client.

```python
@mcp.tool()
def caption_image(
    image_url: Optional[str] = None,
    file_path: Optional[str] = None,
    prompt: str = DEFAULT_PROMPT,
    backend: Optional[str] = None,
    local_model_id: Optional[str] = None,
) -> str:
    if not image_url and not file_path:
        raise ValueError("Provide either image_url or file_path")
    if image_url and file_path:
        raise ValueError("Provide only one of image_url or file_path, not both")
    image_ref = image_url or file_path  # type: ignore

    # Resolve defaults from global config if not explicitly provided
    try:
        from cv_mcp.metadata.runner import _CFG as _GLOBAL_CFG  # type: ignore
    except Exception:
        _GLOBAL_CFG = {}
    backend = (backend or str(_GLOBAL_CFG.get("caption_backend", "openrouter"))).lower()
    local_model_id = local_model_id or str(_GLOBAL_CFG.get("local_vlm_id", "Qwen/Qwen2-VL-2B-Instruct"))

    if backend == "openrouter":
        client = OpenRouterClient()
        res = client.analyze_single_image(image_ref, prompt)
        if not res.get("success"):
            raise RuntimeError(str(res.get("error", "Captioning failed")))
        content = res.get("content", "")
        return str(content)
    elif backend == "local":
        try:
            from cv_mcp.captioning.local_captioner import LocalCaptioner
        except Exception as e:  # pragma: no cover
            raise RuntimeError(
                "Local backend not available. Install optional deps with `pip install .[local]`."
            ) from e
        local = LocalCaptioner(model_id=local_model_id)
        return local.caption(image_ref, prompt)
    else:
        raise ValueError("Invalid backend. Use 'openrouter' or 'local'.")
```
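A few illustrative calls (a sketch: it assumes the decorated function remains directly callable, as FastMCP's `@mcp.tool()` allows; the URL and path are placeholders):

```python
caption_image(image_url="https://example.com/photo.jpg")  # remote via OpenRouter
caption_image(file_path="./photo.jpg", backend="local")   # local Qwen2-VL inference

caption_image()  # ValueError: Provide either image_url or file_path
caption_image(
    image_url="https://example.com/a.jpg",
    file_path="./a.jpg",
)  # ValueError: only one of image_url or file_path is allowed
```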
- OpenRouter client method called by the handler for remote captioning; wraps `analyze_images` for the single-image case.

```python
def analyze_single_image(
    self,
    image: Union[str, Dict],
    prompt: str,
    *,
    model: Optional[str] = None,
    system: Optional[str] = None,
) -> Dict[str, Any]:
    return self.analyze_images([image], prompt, model=model, system=system)
```
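The handler reads only three keys from the returned dict, which implies the result shape below (inferred from the `res.get(...)` calls in the handler, not from the client's own documentation):

```python
# `client` and the URL are illustrative placeholders.
res = client.analyze_single_image("https://example.com/photo.jpg", prompt)
# Inferred shape:
#   success: {"success": True,  "content": "<caption text>"}
#   failure: {"success": False, "error": "<error message>"}
if res.get("success"):
    print(res["content"])
else:
    raise RuntimeError(res.get("error", "Captioning failed"))
```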
- Local captioner method called by the handler for local model inference; loads the image, builds a chat-template prompt, and generates a caption with a transformers model.

```python
def caption(
    self,
    image: Union[str, "Image.Image"],
    prompt: str,
    max_new_tokens: int = 128,
) -> str:
    img = self._load_image(image)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": img},
                {"type": "text", "text": prompt},
            ],
        }
    ]
    text = self.processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = self.processor(text=[text], images=[img], return_tensors="pt").to(self.model.device)
    generate_ids = self.model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        use_cache=True,
    )
    out = self.processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
    return out.strip()
```
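End-to-end local usage, as a sketch: the `model_id` keyword and the `caption(image, prompt)` signature are taken from the handler and the method above; the file path is a placeholder.

```python
from cv_mcp.captioning.local_captioner import LocalCaptioner

# Placeholder path; the model id matches the handler's default.
captioner = LocalCaptioner(model_id="Qwen/Qwen2-VL-2B-Instruct")
print(captioner.caption("./photo.jpg", "Describe this image in one sentence."))
```

Because generation uses `do_sample=False`, captions are deterministic for a given model, image, and prompt.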