Analyse / caption an image with a vision model
replicate_vision
Analyze an image with a vision-language model. Describe, caption, or ask questions about the image content.
Instructions
Run a vision-language model to describe, caption, or answer questions about an image.
Args:
image (string URL): URL of the image to analyse.
prompt (string, optional): Question or instruction (e.g. "describe this image", "count the people"). Default is a generic caption.
model (string, default "llava-13b"): Curated key (llava-13b, llava-v1.6-34b, blip-2, qwen-vl) or "owner/name".
max_tokens (1-4096, optional): Response length.
extra_input (object, optional): Model-specific extras.
Returns: PredictionResult with text_output containing the model's textual answer.
Examples:
image="https://example.com/photo.jpg", prompt="What objects are visible?"
image="", prompt="Read the values off this chart and list them.", model="llava-v1.6-34b"
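
The argument rules above can be checked on the client side before the tool is called. The following is a minimal sketch only: `build_vision_args` and `CURATED_MODELS` are hypothetical names that are not part of the tool, and the validation simply mirrors the documented constraints (required image URL, curated key or "owner/name" model, max_tokens between 1 and 4096).

```python
from typing import Any, Dict, Optional

# Curated model keys as listed in the Args section above.
CURATED_MODELS = {"llava-13b", "llava-v1.6-34b", "blip-2", "qwen-vl"}


def build_vision_args(
    image: str,
    prompt: Optional[str] = None,
    model: str = "llava-13b",
    max_tokens: Optional[int] = None,
    extra_input: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble and validate arguments for replicate_vision."""
    if not image:
        raise ValueError("image must be a non-empty URL")

    # Accept either a curated key or an "owner/name" style identifier.
    owner, sep, name = model.partition("/")
    is_owner_name = bool(sep and owner and name and "/" not in name)
    if model not in CURATED_MODELS and not is_owner_name:
        raise ValueError(f"unknown model: {model!r}")

    if max_tokens is not None and not 1 <= max_tokens <= 4096:
        raise ValueError("max_tokens must be between 1 and 4096")

    # Only include optional fields that were actually provided.
    args: Dict[str, Any] = {"image": image, "model": model}
    if prompt is not None:
        args["prompt"] = prompt
    if max_tokens is not None:
        args["max_tokens"] = max_tokens
    if extra_input is not None:
        args["extra_input"] = extra_input
    return args


# Usage: the first example above, expressed as an argument dictionary.
example_args = build_vision_args(
    "https://example.com/photo.jpg",
    prompt="What objects are visible?",
)
print(example_args)
```

Omitting unset optional fields lets the tool apply its own defaults, such as the generic caption prompt, rather than having the client guess them.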
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| image | Yes | URL of the image to analyse / caption. | |
| model | No | Vision model. Curated: llava-13b, llava-v1.6-34b, blip-2, qwen-vl. Or "owner/name". | llava-13b |
| prompt | No | Optional question or instruction (e.g. 'describe this image', 'count the people'). Default is a generic caption. | |
| download | No | | |
| max_tokens | No | Response length (1-4096). | |
| timeout_ms | No | Maximum time in milliseconds to wait for the prediction. If exceeded, returns the prediction ID so you can poll via replicate_get_prediction. | 300000 (5 min) |
| extra_input | No | Model-specific extras. | |
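
When timeout_ms is exceeded, the tool returns a prediction ID to poll with replicate_get_prediction. The sketch below shows one way such a polling loop might look. It rests on several assumptions that the page does not confirm: `call_tool` is a hypothetical callable standing in for whatever client invokes tools, results are assumed to be dictionaries with an `id` field, and replicate_get_prediction is assumed to accept a `prediction_id` argument. Only `text_output` comes from the Returns description.

```python
import time
from typing import Any, Callable, Dict

ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def poll_for_text(
    call_tool: ToolCaller,
    result: Dict[str, Any],
    interval_s: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Hypothetical helper: poll until the prediction carries text_output."""
    for attempt in range(max_polls + 1):
        text = result.get("text_output")
        if text is not None:
            return text
        if attempt == max_polls:
            break
        sleep(interval_s)
        # Assumed argument name "prediction_id" and assumed result field "id".
        result = call_tool(
            "replicate_get_prediction", {"prediction_id": result["id"]}
        )
    raise TimeoutError(f"no text_output after {max_polls} polls")
```

A real client would probably also need to detect failed or cancelled predictions; that is left out here because the result fields describing status are not documented on this page.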