# dense_caption
Generates a detailed, factual caption for an image, supplied either as a URL or a local file path, using a configurable vision-model backend (local, Ollama, or OpenRouter).
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| file_path | No | Path to a local image file. Provide exactly one of `file_path` or `image_url`. | null |
| image_url | No | URL of the image to caption. Provide exactly one of `file_path` or `image_url`. | null |
### Input Schema (JSON Schema)

```json
{
  "properties": {
    "file_path": {
      "anyOf": [
        {"type": "string"},
        {"type": "null"}
      ],
      "default": null,
      "title": "File Path"
    },
    "image_url": {
      "anyOf": [
        {"type": "string"},
        {"type": "null"}
      ],
      "default": null,
      "title": "Image Url"
    }
  },
  "title": "dense_captionArguments",
  "type": "object"
}
```
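Both fields are optional in the schema, but the handler requires exactly one of them. For illustration, here is a minimal sketch of calling this tool over stdio with the official MCP Python SDK; the launch command (`python -m cv_mcp.mcp_server`) and the image URL are assumptions, so adjust them to your setup:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Assumption: the server is launchable as a module; adjust to your setup.
    params = StdioServerParameters(command="python", args=["-m", "cv_mcp.mcp_server"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Exactly one of image_url / file_path, per the validation below.
            result = await session.call_tool(
                "dense_caption",
                {"image_url": "https://example.com/street.jpg"},
            )
            print(result.content)


asyncio.run(main())
```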
## Implementation Reference
- src/cv_mcp/mcp_server.py:94-105 (handler): MCP tool handler for `dense_caption`. Validates that exactly one image input (URL or file path) is provided, then delegates to the runner function.

```python
@mcp.tool()
def dense_caption(
    image_url: Optional[str] = None,
    file_path: Optional[str] = None,
) -> str:
    if not image_url and not file_path:
        raise ValueError("Provide either image_url or file_path")
    if image_url and file_path:
        raise ValueError("Provide only one of image_url or file_path, not both")
    image_ref = image_url or file_path  # type: ignore
    return run_dense_caption(image_ref)
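The handler fails fast on ambiguous input. A quick sketch of the two failure modes, assuming the decorated function remains directly callable (FastMCP's tool decorator returns the original function in current SDK versions, but that is an assumption about the version in use):

```python
from cv_mcp.mcp_server import dense_caption

try:
    dense_caption()  # neither argument set
except ValueError as e:
    print(e)  # -> Provide either image_url or file_path

try:
    dense_caption(
        image_url="https://example.com/cat.jpg",  # example URL
        file_path="/data/cat.jpg",                # example path
    )
except ValueError as e:
    print(e)  # -> Provide only one of image_url or file_path, not both

# With exactly one argument, the call proceeds to run_dense_caption and
# hits whichever model backend is configured.
```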
- run_dense_caption (runner): core logic for dense caption generation, dispatching to the local, Ollama, or OpenRouter backend with the caption-specific prompts.

```python
def run_dense_caption(image_ref: str, *, model: Optional[str] = None) -> str:
    # Backend dispatch: local model first, then Ollama, else OpenRouter.
    if _use_local_for("caption"):
        prompt = f"{prompts.CAPTION_SYSTEM}\n\n{prompts.CAPTION_USER}"
        return _local_gen(image_ref, prompt)
    if _use_ollama_for("caption"):
        from cv_mcp.captioning.ollama_client import OllamaClient

        client = OllamaClient(host=str(_cfg_value("ollama_host", "http://localhost:11434")))
        res = client.analyze_single_image(
            image_ref,
            prompts.CAPTION_USER,
            model=_cfg_value("caption_model"),
            system=prompts.CAPTION_SYSTEM,
        )
        if not res.get("success"):
            raise RuntimeError(str(res.get("error", "Dense caption generation failed (ollama)")))
        return str(res.get("content", "")).strip()
    # Default: OpenRouter, with an optional per-call model override.
    client = OpenRouterClient()
    res = client.analyze_single_image(
        image_ref,
        prompts.CAPTION_USER,
        model=model or _cfg_value("caption_model"),
        system=prompts.CAPTION_SYSTEM,
    )
    if not res.get("success"):
        raise RuntimeError(str(res.get("error", "Dense caption generation failed")))
    return str(res.get("content", "")).strip()
```
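The internals of `OpenRouterClient.analyze_single_image` are not shown here. As a hypothetical sketch of an equivalent single-image request, the endpoint and payload shape below follow OpenRouter's documented OpenAI-compatible chat completions API, while the helper name and its exact behavior are assumptions, not the project's actual client:

```python
import os

import requests

from cv_mcp.metadata import prompts


def openrouter_caption(image_url: str, model: str) -> str:
    """Hypothetical stand-in for OpenRouterClient.analyze_single_image."""
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,  # any vision-capable model slug
            "messages": [
                {"role": "system", "content": prompts.CAPTION_SYSTEM},
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompts.CAPTION_USER},
                        {"type": "image_url", "image_url": {"url": image_url}},
                    ],
                },
            ],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()
```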
- src/cv_mcp/metadata/prompts.py:15-27 (helper): system and user prompt templates specific to dense caption generation.

```python
CAPTION_SYSTEM = (
    "You carefully describe visual content without guessing. "
    "Mention salient text only if clearly readable."
)

CAPTION_USER = (
    "Write a factual, detailed caption (2–6 sentences) for this image. Cover:\n"
    "- Who/what is visible (counts if reliable).\n"
    "- Where/setting if visually indicated.\n"
    "- Salient readable text.\n"
    "- Relationships (e.g., 'person holding red umbrella near taxi').\n"
    "- Lighting/time cues if obvious (e.g., night, golden hour).\n"
    "If uncertain, say 'unclear'. Do not guess brands, species, or locations "
    "unless unmistakable. Avoid subjective adjectives."
)
```
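To iterate on these templates outside the MCP server, they can be exercised directly against a local Ollama instance. A minimal sketch using Ollama's documented `/api/chat` endpoint; the helper name, model name, and timeout are illustrative assumptions:

```python
import base64

import requests

from cv_mcp.metadata import prompts


def ollama_caption(path: str, model: str = "llava", host: str = "http://localhost:11434") -> str:
    """Hypothetical test harness for the caption prompts against Ollama."""
    # Ollama's chat API takes base64-encoded images alongside the user message.
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    resp = requests.post(
        f"{host}/api/chat",
        json={
            "model": model,  # example: any vision-capable Ollama model
            "stream": False,
            "messages": [
                {"role": "system", "content": prompts.CAPTION_SYSTEM},
                {"role": "user", "content": prompts.CAPTION_USER, "images": [image_b64]},
            ],
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"].strip()
```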
- src/cv_mcp/mcp_server.py:94-105 (registration): the `@mcp.tool()` decorator on the handler shown above registers the function as the `dense_caption` tool with the FastMCP server.
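For context, here is a minimal sketch of how a FastMCP server with this registration pattern is typically assembled and started; the server name and transport default are assumptions, not taken from this project's code:

```python
from typing import Optional

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("cv-mcp")  # assumed server name


@mcp.tool()
def dense_caption(
    image_url: Optional[str] = None,
    file_path: Optional[str] = None,
) -> str:
    ...  # body as shown in the handler above


if __name__ == "__main__":
    mcp.run()  # defaults to the stdio transport
```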