batch_convert

Batch convert PDFs in a directory to Markdown with a summary of per-file results.

Instructions

Convert all PDFs in a directory to Markdown. Returns a summary with per-file results.

Input Schema

Name       Required  Description  Default
directory  Yes       -            -
quality    No        -            standard
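
For orientation, an MCP client invokes a tool with a JSON-RPC `tools/call` request whose `arguments` object carries the inputs listed in the schema. The sketch below builds such a payload for batch_convert. The helper name `build_batch_convert_request`, the request id, and the directory `/data/invoices` are placeholders chosen for illustration, not values taken from the server.

```python
import json


def build_batch_convert_request(
    directory: str, quality: str = "standard", request_id: int = 1
) -> dict:
    """Build a JSON-RPC 2.0 `tools/call` payload for the batch_convert tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "batch_convert",
            # Only `directory` is required; `quality` defaults to "standard".
            "arguments": {"directory": directory, "quality": quality},
        },
    }


# "/data/invoices" is a placeholder path, not one from the server documentation.
request = build_batch_convert_request("/data/invoices")
print(json.dumps(request, indent=2))
```

In practice a client library would usually build and send this message, so the payload is shown only to make the two parameters concrete.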

Output Schema

Name    Required  Description  Default
result  Yes       -            -

Implementation Reference

  • The batch_convert tool handler — an MCP-decorated function that converts all PDFs in a directory to Markdown. It validates the directory path via _check_path, globs for .pdf/.PDF files, calls process_batch for concurrent conversion, and returns a JSON summary with per-file status, page count, confidence, extractor, and character count.
    @mcp.tool()
    def batch_convert(directory: str, quality: str = "standard") -> str:
        """Convert all PDFs in a directory to Markdown. Returns a summary with per-file results."""
        p = _check_path(directory, label="directory")
    
        if not p.is_dir():
            raise ValueError(f"Not a directory: {directory}")
    
        pdfs = list(p.glob("*.pdf")) + list(p.glob("*.PDF"))
        if not pdfs:
            return f"No PDF files found in {directory}"
    
        from pdfmux.pipeline import process_batch
    
        results = []
        for path, result_or_error in process_batch(pdfs, output_format="markdown", quality=quality):
            if isinstance(result_or_error, Exception):
                results.append({"file": path.name, "status": "error", "error": str(result_or_error)})
            else:
                results.append(
                    {
                        "file": path.name,
                        "status": "success",
                        "pages": result_or_error.page_count,
                        "confidence": round(result_or_error.confidence, 3),
                        "extractor": result_or_error.extractor_used,
                        "chars": len(result_or_error.text),
                    }
                )
    
        summary = {
            "directory": str(directory),
            "total_files": len(pdfs),
            "success": sum(1 for r in results if r["status"] == "success"),
            "failed": sum(1 for r in results if r["status"] == "error"),
            "results": results,
        }
    
        return json.dumps(summary, indent=2)
  • Tool registration via the @mcp.tool() decorator on line 202, which registers batch_convert as an MCP tool named 'batch_convert'. The tool is listed in the module docstring as one of the available tools.
    convert_pdf        — extract text from a PDF (Markdown, JSON, LLM chunks)
    analyze_pdf        — quick triage: classify + audit without full extraction
    batch_convert      — convert all PDFs in a directory
    extract_structured — tables, key-values, schema mapping
    get_pdf_metadata   — page count, file size, type detection (instant, no extraction)
  • process_batch — the core batch processing function called by batch_convert. It takes a list of PDF file paths, processes them concurrently using ThreadPoolExecutor (default 4 workers), and yields (path, result_or_error) tuples. Each file is processed via the main 'process' function.
    def process_batch(
        file_paths: list[str | Path],
        output_format: str = "markdown",
        quality: str = "standard",
        workers: int = 4,
        use_cache: bool = True,
    ) -> Iterator[tuple[Path, ConversionResult | Exception]]:
        """Process multiple PDFs concurrently.
    
        Yields (path, result_or_error) tuples as each PDF completes.
        Errors are caught per-file — one failure doesn't stop the batch.
    
        Args:
            file_paths: List of PDF file paths.
            output_format: Output format for all files.
            quality: Quality preset for all files.
            workers: Number of concurrent workers.
            use_cache: Forwarded to :func:`process`.
    
        Yields:
            (Path, ConversionResult) on success.
            (Path, Exception) on failure.
        """
    
        def _process_one(path: Path) -> ConversionResult:
            return process(
                file_path=path,
                output_format=output_format,
                quality=quality,
                use_cache=use_cache,
            )
    
        paths = [Path(p) for p in file_paths]
    
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {pool.submit(_process_one, p): p for p in paths}
            for future in as_completed(futures):
                path = futures[future]
                try:
                    result = future.result()
                    yield path, result
                except Exception as e:
                    yield path, e
  • check_path — path validation utility used by batch_convert (called there as _check_path, presumably through an import alias) to verify that the directory is within the allowed directories (PDFMUX_ALLOWED_DIRS). Raises ValueError if the path is empty or lies outside the allowed directories.
    def check_path(file_path: str, label: str = "file_path") -> Path:
        """Validate and return a resolved Path, raising ValueError on access denial."""
        if not file_path:
            raise ValueError(f"{label} is required")
        p = Path(file_path)
        if not is_path_allowed(p):
            raise ValueError(
                f"Access denied: {file_path} is outside allowed directories. "
                "Set PDFMUX_ALLOWED_DIRS to configure access."
            )
        return p
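
The pattern described for process_batch above (submit every file to a thread pool, then yield each outcome as it completes, with failures returned as values rather than raised) can be tried in isolation. This is a minimal sketch, not pdfmux's code: `run_batch` and `fake_convert` are hypothetical stand-ins, and the converter simply fails on one hard-coded file name.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Iterator


def run_batch(
    paths: list[Path],
    convert: Callable[[Path], str],
    workers: int = 4,
) -> Iterator[tuple[Path, str | Exception]]:
    """Yield (path, result_or_error) as each task finishes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(convert, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                yield path, future.result()
            except Exception as exc:
                # A per-file failure is reported as a value, so the batch continues.
                yield path, exc


def fake_convert(path: Path) -> str:
    """Hypothetical converter: fails on one name, returns dummy Markdown otherwise."""
    if path.name == "broken.pdf":
        raise RuntimeError("cannot parse")
    return f"# {path.stem}"


statuses = {}
inputs = [Path("a.pdf"), Path("broken.pdf"), Path("b.pdf")]
for path, outcome in run_batch(inputs, fake_convert):
    # Same branch the handler uses: an Exception instance marks a failed file.
    statuses[path.name] = "error" if isinstance(outcome, Exception) else "success"

print(statuses)
```

Because `as_completed` yields futures in completion order, the per-file entries may appear in a different order from the input list, which is why the example collects them into a dictionary keyed by file name.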
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations, the description carries full burden. It discloses batch conversion and a summary return, but omits side effects (e.g., file modifications), error handling, and dependencies. The behavioral disclosure is minimal.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 3/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Two sentences are concise, but the brevity sacrifices needed detail. While not verbose, missing parameter explanations and behavioral context reduce effectiveness.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

The output schema exists but is not explained (summary with per-file results is vague). Sibling tools provide contrast, but the description lacks completeness on error handling, output format, and parameter details for quality.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
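
To make the output shape concrete, the sketch below parses a reply built from the keys that the handler puts in its summary (`directory`, `total_files`, `success`, `failed`, `results`). The sample values, the extractor name "example", and the helper `failed_files` are invented for illustration. It also covers the case where no PDFs are found, in which the handler returns a plain sentence instead of JSON.

```python
import json

# Sample reply shaped like the `summary` dict in the handler; all values are invented.
sample_reply = json.dumps(
    {
        "directory": "/data/invoices",
        "total_files": 2,
        "success": 1,
        "failed": 1,
        "results": [
            {"file": "a.pdf", "status": "success", "pages": 3,
             "confidence": 0.912, "extractor": "example", "chars": 1204},
            {"file": "b.pdf", "status": "error", "error": "cannot parse"},
        ],
    }
)


def failed_files(raw: str) -> dict:
    """Map failed file names to error messages; return {} if the reply is not a JSON summary."""
    try:
        summary = json.loads(raw)
    except json.JSONDecodeError:
        # The "No PDF files found in ..." reply is plain text, not JSON.
        return {}
    if not isinstance(summary, dict):
        return {}
    return {
        r["file"]: r.get("error", "")
        for r in summary.get("results", [])
        if r.get("status") == "error"
    }


print(failed_files(sample_reply))
print(failed_files("No PDF files found in /data/empty"))
```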

Parameters: 2/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, yet the description only mentions 'directory' implicitly and does not explain the 'quality' parameter (e.g., standard vs high). No added meaning beyond the raw schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states 'Convert all PDFs in a directory to Markdown', specifying the verb (convert), resource (all PDFs in a directory), and output format (Markdown). This distinguishes it from sibling tools like convert_pdf (single file) and extract_streaming (streaming extraction).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool vs alternatives, no prerequisites (e.g., directory existence, permissions), and no exclusions. It simply states what it does without context.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
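
The prerequisites the handler enforces can be read off its code: the path must be an existing directory, and only files directly inside it that match `*.pdf` or `*.PDF` are considered (subdirectories are not searched). A caller could run a similar pre-flight check before invoking the tool. The helper below, `find_pdfs`, is a hypothetical sketch and differs from the handler in one respect: it compares the suffix case-insensitively, so it also accepts names such as `Report.Pdf` and cannot list a file twice, which the two-pattern glob may do on platforms where glob matching ignores case.

```python
from pathlib import Path


def find_pdfs(directory: str) -> list:
    """Return PDFs directly inside `directory`, matched case-insensitively."""
    root = Path(directory)
    if not root.is_dir():
        # Mirrors the handler's error for a missing or non-directory path.
        raise ValueError(f"Not a directory: {directory}")
    # iterdir() lists each entry once, so no de-duplication is needed.
    return sorted(
        f for f in root.iterdir() if f.is_file() and f.suffix.lower() == ".pdf"
    )
```

An empty result corresponds to the handler's "No PDF files found" reply, so checking for it first avoids a call that would do no work.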

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/NameetP/pdfmux'
