caption
Generate descriptive captions for image files by processing them through the Florence-2 MCP Server, using a file path or URL as input.
Instructions
Processes an image file and generates captions for the image.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| src | Yes | A file path or URL to the image file that needs to be processed. | |
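As a rough illustration of the schema above, a client would pass `src` as the single argument of an MCP `tools/call` request. The sketch below only builds the JSON-RPC payload; `build_caption_request` is a hypothetical helper name, the example URL is a placeholder, and the transport (stdio or HTTP) is left to whichever MCP client is in use.

```python
import json


def build_caption_request(src: str, request_id: int = 1) -> dict:
    """Hypothetical helper: build a JSON-RPC 2.0 `tools/call` payload for the caption tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "caption",          # tool name as registered by the server
            "arguments": {"src": src},  # the only (required) input field
        },
    }


# Placeholder URL, used purely for illustration.
request = build_caption_request("https://example.com/cat.jpg")
print(json.dumps(request, indent=2))
```

The same payload shape applies to a local file path; only the `src` string changes.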
Implementation Reference
- `src/mcp_florence2/__init__.py:126-135` (handler): Handler and registration for the MCP `caption` tool. Processes `src` (file path or URL), loads the images, and delegates to `processor.caption` with the `MORE_DETAILED` level.

  ```python
  @mcp.tool()
  def caption(
      ctx: Context,
      src: PathLike | str = Field(description="A file path or URL to the image file that needs to be processed."),
  ) -> list[str]:
      """Processes an image file and generates captions for the image."""
      with get_images(src) as images:
          app_ctx: AppContext = ctx.request_context.lifespan_context
          return app_ctx.processor.caption(images, CaptionLevel.MORE_DETAILED)
  ```
- `src/mcp_florence2/florence2.py:56-79` (helper): Core helper method in `Florence2` that performs the actual model inference for generating captions or OCR text.

  ```python
  def generate(self, prompt: str, images: list[Image]) -> list[str]:
      res = []
      for img in images:
          with img.convert("RGB") as rgb_img:
              inputs = self.processor(text=prompt, images=rgb_img, return_tensors="pt").to(
                  self.device, self.torch_dtype
              )
              generated_ids = self.model.generate(
                  input_ids=inputs["input_ids"],
                  pixel_values=inputs["pixel_values"],
                  max_new_tokens=1024,
                  num_beams=3,
                  do_sample=False,
              )
              generated_text = self.processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
              parsed_answer = self.processor.post_process_generation(
                  generated_text, task=prompt, image_size=(rgb_img.width, rgb_img.height)
              )
              res.append(parsed_answer[prompt].strip())
      return res
  ```
- `src/mcp_florence2/__init__.py:31-63` (helper): Helper context manager that loads one or more PIL images from a local file path, a URL, or a PDF (one image per page).

  ```python
  @contextmanager
  def get_images(src: PathLike | str) -> Iterator[list[Image]]:
      """Opens and returns a list of images from a file path or URL."""
      if isinstance(src, str) and (src.startswith("http://") or src.startswith("https://")):
          res = requests.get(src)
          res.raise_for_status()
          if res.headers["Content-Type"] == "application/pdf":
              pass
              with ExitStack() as stack:
                  images = []
                  with closing(PdfDocument(res.content)) as doc:
                      for page in doc:
                          images.append(stack.enter_context(page.render().to_pil()))
                  yield images
          else:
              with open_image(BytesIO(res.content)) as image:
                  yield [image]
      else:
          ext = os.path.splitext(src)[1].lower()
          if ext == ".pdf":
              with ExitStack() as stack:
                  images = []
                  with closing(PdfDocument(src)) as doc:
                      for page in doc:
                          images.append(stack.enter_context(page.render().to_pil()))
                  yield images
          else:
              with open_image(src) as image:
                  yield [image]
  ```
- `src/mcp_florence2/florence2.py:20-24` (schema): Enum defining the caption prompt levels used by the `Florence2` caption method.

  ```python
  class CaptionLevel(StrEnum):
      NORMAL = "<CAPTION>"
      DETAILED = "<DETAILED_CAPTION>"
      MORE_DETAILED = "<MORE_DETAILED_CAPTION>"
  ```
- `src/mcp_florence2/florence2.py:53-55` (helper): The `Florence2` processor's `caption` method, which forwards to the `generate` helper using the prompt for the requested level.

  ```python
  def caption(self, images: list[Image], level: CaptionLevel = CaptionLevel.NORMAL) -> list[str]:
      return self.generate(str(level.value), images)
  ```
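To make the control flow in the references above easier to follow, here is a minimal, dependency-free sketch of the two decisions the tool makes before inference: which loading branch a given `src` takes, and which task prompt a caption level maps to. It is an illustration only, not the server's code: `classify_source` and `prompt_for` are hypothetical names, and `str`/`Enum` stands in for `StrEnum` so the sketch also runs on Python versions older than 3.11.

```python
import os
from enum import Enum


class CaptionLevel(str, Enum):
    """Mirror of the task prompts listed above (str + Enum stands in for StrEnum)."""
    NORMAL = "<CAPTION>"
    DETAILED = "<DETAILED_CAPTION>"
    MORE_DETAILED = "<MORE_DETAILED_CAPTION>"


def classify_source(src: str) -> str:
    """Hypothetical helper: name the get_images branch a given src would take."""
    if src.startswith(("http://", "https://")):
        # Remote sources are downloaded first; PDF vs. image is then
        # decided by the response's Content-Type header, not the URL.
        return "remote"
    ext = os.path.splitext(src)[1].lower()
    return "local-pdf" if ext == ".pdf" else "local-image"


def prompt_for(level: CaptionLevel = CaptionLevel.NORMAL) -> str:
    """Hypothetical helper: return the prompt string passed on to generate()."""
    return str(level.value)


print(classify_source("scan.PDF"))
print(prompt_for(CaptionLevel.MORE_DETAILED))
```

Note that the `caption` tool always requests `CaptionLevel.MORE_DETAILED`, while the processor method itself defaults to `NORMAL` when no level is given.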