batch_convert
Batch convert PDFs in a directory to Markdown with a summary of per-file results.
Instructions
Convert all PDFs in a directory to Markdown. Returns a summary with per-file results.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| directory | Yes | | |
| quality | No | | standard |
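As a rough illustration of how these two inputs reach the tool, the sketch below builds the JSON-RPC `tools/call` message that an MCP client would send. The helper name `build_batch_convert_request` and the directory path are hypothetical, introduced only for this example; in practice an MCP client library would normally build this envelope for you.

```python
import json


def build_batch_convert_request(
    directory: str, quality: str = "standard", request_id: int = 1
) -> dict:
    """Build an MCP tools/call request for batch_convert (hypothetical helper)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "batch_convert",
            # Argument names and the default follow the Input Schema table.
            "arguments": {"directory": directory, "quality": quality},
        },
    }


if __name__ == "__main__":
    # "/data/invoices" is a placeholder path, not one taken from the project.
    print(json.dumps(build_batch_convert_request("/data/invoices"), indent=2))
```

Omitting `quality` falls back to `standard`, matching the default in the table.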
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | | |
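The `result` value is a string. Judging from the handler listed under Implementation Reference, it is usually a JSON summary, but it is a plain sentence when the directory contains no PDFs. A minimal client-side sketch for handling both cases follows; the function names `parse_batch_result` and `failed_files` and the sample values are hypothetical, while the field names come from the handler's summary dict.

```python
import json


def parse_batch_result(result: str) -> dict:
    """Parse the tool's string result, tolerating the plain-text 'no PDFs' message."""
    try:
        summary = json.loads(result)
    except json.JSONDecodeError:
        summary = None
    if not isinstance(summary, dict):
        # Not a JSON summary (e.g. "No PDF files found in ..."); keep the text.
        return {"total_files": 0, "success": 0, "failed": 0,
                "results": [], "message": result}
    return summary


def failed_files(summary: dict) -> list[str]:
    """Return the names of files whose per-file status is 'error'."""
    return [r["file"] for r in summary["results"] if r["status"] == "error"]


if __name__ == "__main__":
    # Made-up sample shaped like the handler's summary, for illustration only.
    sample = json.dumps({
        "directory": "/data/invoices",
        "total_files": 2,
        "success": 1,
        "failed": 1,
        "results": [
            {"file": "a.pdf", "status": "success", "pages": 3,
             "confidence": 0.97, "extractor": "example", "chars": 1200},
            {"file": "b.pdf", "status": "error", "error": "could not open"},
        ],
    })
    print(failed_files(parse_batch_result(sample)))
```

Error entries carry only `file`, `status`, and `error`, so code reading `pages` or `confidence` should check `status` first.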
Implementation Reference
- src/pdfmux/mcp_server.py:202-240 (handler)
  The batch_convert tool handler: an MCP-decorated function that converts all PDFs in a directory to Markdown. It validates the directory path via _check_path, globs for .pdf/.PDF files, calls process_batch for concurrent conversion, and returns a JSON summary with per-file status, page count, confidence, extractor, and character count.

  ```python
  @mcp.tool()
  def batch_convert(directory: str, quality: str = "standard") -> str:
      """Convert all PDFs in a directory to Markdown. Returns a summary with per-file results."""
      p = _check_path(directory, label="directory")
      if not p.is_dir():
          raise ValueError(f"Not a directory: {directory}")

      pdfs = list(p.glob("*.pdf")) + list(p.glob("*.PDF"))
      if not pdfs:
          return f"No PDF files found in {directory}"

      from pdfmux.pipeline import process_batch

      results = []
      for path, result_or_error in process_batch(pdfs, output_format="markdown", quality=quality):
          if isinstance(result_or_error, Exception):
              results.append({"file": path.name, "status": "error", "error": str(result_or_error)})
          else:
              results.append(
                  {
                      "file": path.name,
                      "status": "success",
                      "pages": result_or_error.page_count,
                      "confidence": round(result_or_error.confidence, 3),
                      "extractor": result_or_error.extractor_used,
                      "chars": len(result_or_error.text),
                  }
              )

      summary = {
          "directory": str(directory),
          "total_files": len(pdfs),
          "success": sum(1 for r in results if r["status"] == "success"),
          "failed": sum(1 for r in results if r["status"] == "error"),
          "results": results,
      }
      return json.dumps(summary, indent=2)
  ```

- src/pdfmux/mcp_server.py:14-18 (registration)
  Tool registration via the @mcp.tool() decorator on line 202, which registers batch_convert as an MCP tool named 'batch_convert'. The tool is listed in the module docstring as one of the available tools.

  ```
  convert_pdf — extract text from a PDF (Markdown, JSON, LLM chunks)
  analyze_pdf — quick triage: classify + audit without full extraction
  batch_convert — convert all PDFs in a directory
  extract_structured — tables, key-values, schema mapping
  get_pdf_metadata — page count, file size, type detection (instant, no extraction)
  ```

- src/pdfmux/pipeline.py:381-423 (helper)
  process_batch: the core batch processing function called by batch_convert. It takes a list of PDF file paths, processes them concurrently using ThreadPoolExecutor (default 4 workers), and yields (path, result_or_error) tuples. Each file is processed via the main 'process' function.

  ```python
  def process_batch(
      file_paths: list[str | Path],
      output_format: str = "markdown",
      quality: str = "standard",
      workers: int = 4,
      use_cache: bool = True,
  ) -> Iterator[tuple[Path, ConversionResult | Exception]]:
      """Process multiple PDFs concurrently.

      Yields (path, result_or_error) tuples as each PDF completes.
      Errors are caught per-file — one failure doesn't stop the batch.

      Args:
          file_paths: List of PDF file paths.
          output_format: Output format for all files.
          quality: Quality preset for all files.
          workers: Number of concurrent workers.
          use_cache: Forwarded to :func:`process`.

      Yields:
          (Path, ConversionResult) on success.
          (Path, Exception) on failure.
      """

      def _process_one(path: Path) -> ConversionResult:
          return process(
              file_path=path,
              output_format=output_format,
              quality=quality,
              use_cache=use_cache,
          )

      paths = [Path(p) for p in file_paths]

      with ThreadPoolExecutor(max_workers=workers) as pool:
          futures = {pool.submit(_process_one, p): p for p in paths}
          for future in as_completed(futures):
              path = futures[future]
              try:
                  result = future.result()
                  yield path, result
              except Exception as e:
                  yield path, e
  ```

- src/pdfmux/path_safety.py:36-46 (helper)
  check_path: path validation utility used by batch_convert to verify the directory is within allowed directories (PDFMUX_ALLOWED_DIRS). Raises ValueError if the path is outside allowed directories or if the path is empty.

  ```python
  def check_path(file_path: str, label: str = "file_path") -> Path:
      """Validate and return a resolved Path, raising ValueError on access denial."""
      if not file_path:
          raise ValueError(f"{label} is required")
      p = Path(file_path)
      if not is_path_allowed(p):
          raise ValueError(
              f"Access denied: {file_path} is outside allowed directories. "
              "Set PDFMUX_ALLOWED_DIRS to configure access."
          )
      return p
  ```
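The per-file error isolation that process_batch relies on can be tried without pdfmux installed. The sketch below reproduces only the concurrency pattern (a ThreadPoolExecutor, `as_completed`, and exceptions yielded instead of raised). `run_batch` and `fake_convert` are hypothetical stand-ins written for this example, not part of the project.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, Iterator, TypeVar, Union

T = TypeVar("T")
R = TypeVar("R")


def run_batch(
    items: Iterable[T], worker: Callable[[T], R], workers: int = 4
) -> Iterator[tuple[T, Union[R, Exception]]]:
    """Run worker over items concurrently, yielding (item, result_or_error)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to its input so results can be labelled.
        futures = {pool.submit(worker, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                yield item, future.result()
            except Exception as exc:
                # One failing item is reported, not raised, so the batch continues.
                yield item, exc


def fake_convert(name: str) -> int:
    """Stand-in for a real conversion: fails for names starting with 'bad'."""
    if name.startswith("bad"):
        raise ValueError(f"cannot parse {name}")
    return len(name)


if __name__ == "__main__":
    for name, outcome in run_batch(["a.pdf", "bad.pdf", "report.pdf"], fake_convert):
        status = "error" if isinstance(outcome, Exception) else "success"
        print(name, status, outcome)
```

Because results are yielded in completion order rather than input order, callers that need a stable order should sort by item afterwards or collect the results into a dict.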