FPD_get_document_content_with_mistral_ocr
Extract text from USPTO petition documents using hybrid extraction: free PyPDF2 for text-based PDFs, Mistral OCR for scanned documents. Analyze legal arguments, issues, and patterns in petition decisions.
Instructions
Extract full text from USPTO petition documents with intelligent hybrid extraction (PyPDF2 first, Mistral OCR fallback).
PREREQUISITE: First use fpd_get_petition_details to get document_identifier from documentBag. Auto-optimizes cost: free PyPDF2 for text-based PDFs, ~$0.001/page Mistral OCR only for scanned documents. MISTRAL_API_KEY is optional - without it, only PyPDF2 extraction is available (works well for text-based PDFs).
USE CASES:
- Analyze petition legal arguments and Director's reasoning
- Extract petition issues, CFR rules cited, statutory references
- Detect patterns across multiple petitions (e.g., common denial reasons)
- Correlate petition text with PTAB challenge strategies
- Profile examiner behavior from supervisory review petitions
COST OPTIMIZATION:
- auto_optimize=True (default): Try free PyPDF2 first, fall back to Mistral OCR only if needed (70% cost savings)
- auto_optimize=False: Use Mistral OCR directly (~$0.001/page)
Returns: extracted_content, extraction_method, processing_cost_usd, page_count
Example workflow (sketched in code below):
1. fpd_get_petition_details(petition_id='0b71b685-...', include_documents=True)
2. fpd_get_document_content(petition_id='0b71b685-...', document_identifier='DSEN5APWPHOENIX')
3. Analyze extracted text for legal arguments, issues, and patterns
For document selection strategies and cost optimization, use FPD_get_guidance('cost').
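A minimal sketch of the example workflow above, assuming the tool functions are awaited directly and that the petition details response nests `documentBag` under `petitionDecisionDataBag` with `documentIdentifier` keys (these field names are assumptions inferred from the client's field constants; verify against the actual fpd_get_petition_details output):

```python
# Sketch only: field names inside the petition details response are assumptions.
async def extract_first_document(petition_id: str) -> str:
    # Step 1: fetch petition details, including the documentBag
    details = await fpd_get_petition_details(
        petition_id=petition_id,
        include_documents=True,
    )

    # Step 2: pick a document identifier from the documentBag (assumed key names)
    documents = details["petitionDecisionDataBag"][0]["documentBag"]
    document_identifier = documents[0]["documentIdentifier"]

    # Step 3: hybrid extraction (free PyPDF2 first, Mistral OCR fallback)
    result = await fpd_get_document_content(
        petition_id=petition_id,
        document_identifier=document_identifier,
        auto_optimize=True,
    )

    # Step 4: inspect the documented return fields
    print(result["extraction_method"])     # "PyPDF2" or "Mistral OCR (mistral-ocr-latest)"
    print(result["processing_cost_usd"])   # 0.0 for PyPDF2, ~$0.001/page for OCR
    return result["extracted_content"]     # full document text for analysis
```

With auto_optimize=True this costs nothing for text-based PDFs; OCR charges (~$0.001/page) apply only when the PyPDF2 pass is judged too poor.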
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| petition_id | Yes | USPTO petition identifier (e.g., '0b71b685-...'). | |
| document_identifier | Yes | Document identifier from the petition's documentBag (obtain via fpd_get_petition_details with include_documents=True). | |
| auto_optimize | No | Try free PyPDF2 first and fall back to Mistral OCR only if extraction quality is poor; set to False to use Mistral OCR directly. | True |
Implementation Reference
- src/fpd_mcp/main.py:1328-1443 (handler): primary handler function decorated with `@mcp.tool(name="FPD_get_document_content_with_mistral_ocr")`. Delegates to `FPDClient.extract_document_content_hybrid` for hybrid PyPDF2 + Mistral OCR text extraction.

```python
@mcp.tool(name="FPD_get_document_content_with_mistral_ocr")
@async_tool_error_handler("document_content")
async def fpd_get_document_content(
    petition_id: str,
    document_identifier: str,
    auto_optimize: bool = True
) -> Dict[str, Any]:
    """Extract full text from USPTO petition documents with intelligent hybrid
    extraction (PyPDF2 first, Mistral OCR fallback).

    PREREQUISITE: First use fpd_get_petition_details to get document_identifier
    from documentBag. Auto-optimizes cost: free PyPDF2 for text-based PDFs,
    ~$0.001/page Mistral OCR only for scanned documents. MISTRAL_API_KEY is
    optional - without it, only PyPDF2 extraction is available (works well for
    text-based PDFs).

    USE CASES:
    - Analyze petition legal arguments and Director's reasoning
    - Extract petition issues, CFR rules cited, statutory references
    - Detect patterns across multiple petitions (e.g., common denial reasons)
    - Correlate petition text with PTAB challenge strategies
    - Profile examiner behavior from supervisory review petitions

    COST OPTIMIZATION:
    - auto_optimize=True (default): Try free PyPDF2 first, fallback to Mistral OCR if needed (70% cost savings)
    - auto_optimize=False: Use Mistral OCR directly (~$0.001/page)

    Returns: extracted_content, extraction_method, processing_cost_usd, page_count

    Example workflow:
    1. fpd_get_petition_details(petition_id='0b71b685-...', include_documents=True)
    2. fpd_get_document_content(petition_id='0b71b685-...', document_identifier='DSEN5APWPHOENIX')
    3. Analyze extracted text for legal arguments, issues, and patterns

    For document selection strategies and cost optimization, use FPD_get_guidance('cost').
    """
    try:
        # Input validation
        if not petition_id or len(petition_id.strip()) == 0:
            return format_error_response("Petition ID cannot be empty", 400)
        if not document_identifier or len(document_identifier.strip()) == 0:
            return format_error_response("Document identifier cannot be empty", 400)

        # Enhanced proxy port detection with centralized proxy support
        centralized_port = os.getenv('CENTRALIZED_PROXY_PORT', '').lower()
        if centralized_port and centralized_port != 'none':
            proxy_port = int(centralized_port)
            logger.info(f"Using centralized USPTO proxy on port {proxy_port} for extraction")
        else:
            # Check FPD_PROXY_PORT first (MCP-specific), then PROXY_PORT (generic)
            proxy_port = get_local_proxy_port()
            logger.info(f"Using local FPD proxy on port {proxy_port} for extraction")

        # Start proxy server if not already running (unless using centralized proxy)
        if not centralized_port or centralized_port == 'none':
            await _ensure_proxy_server_running(proxy_port)
            logger.info(f"Local proxy server ready on port {proxy_port} for document extraction")
        else:
            logger.info("Using centralized proxy for document extraction - no local proxy startup needed")

        # Use API client's hybrid extraction method
        result = await api_client.extract_document_content_hybrid(
            petition_id=petition_id,
            document_identifier=document_identifier,
            auto_optimize=auto_optimize
        )

        # Check for errors
        if "error" in result:
            return result

        # Add LLM guidance for text analysis
        result["llm_guidance"] = {
            "analysis_strategies": {
                "legal_argument_analysis": {
                    "description": "Analyze petition and decision text for legal reasoning",
                    "action": "Extract key arguments, Director's reasoning, legal citations"
                },
                "pattern_detection": {
                    "description": "Compare text across multiple petitions to find common themes",
                    "action": "Identify recurring denial reasons, successful argument patterns"
                },
                "cross_mcp_correlation": {
                    "description": "Correlate petition arguments with PTAB challenges",
                    "action": "Compare legal reasoning with PTAB IPR/PGR arguments"
                },
                "examiner_profiling": {
                    "description": "Analyze supervisory review petitions to profile examiner behavior",
                    "action": "Extract what examiner actions were challenged and Director's response"
                }
            },
            "extraction_quality": {
                "method": result.get("extraction_method", "Unknown"),
                "cost": f"${result.get('processing_cost_usd', 0):.4f}",
                "optimization": result.get("auto_optimization", "Unknown")
            },
            "next_steps": [
                "Analyze extracted content for key legal arguments",
                "Search for CFR citations (e.g., '37 CFR 1.137', '37 CFR 1.181')",
                "Identify petition outcome reasoning in decision text",
                "Cross-reference with PFW prosecution history for context",
                "Compare with PTAB challenge arguments if patent granted"
            ]
        }

        return result

    except ValueError as e:
        logger.warning(f"Validation error in extract document content: {str(e)}")
        return format_error_response(str(e), 400)
    except httpx.HTTPStatusError as e:
        logger.error(f"API error in extract document content: {e.response.status_code} - {e.response.text}")
        return format_error_response(f"API error: {e.response.text}", e.response.status_code)
    except httpx.TimeoutException as e:
        logger.error(f"API timeout in extract document content: {str(e)}")
        return format_error_response("Request timeout - please try again", 408)
    except Exception as e:
        logger.error(f"Unexpected error in extract document content: {str(e)}")
        return format_error_response(f"Internal error: {str(e)}", 500)
```
- `FPDClient.extract_document_content_hybrid`: core helper method implementing the hybrid document content extraction logic called by the handler. Handles PDF download via proxy and PyPDF2/Mistral OCR extraction.

```python
async def extract_document_content_hybrid(
    self,
    petition_id: str,
    document_identifier: str,
    auto_optimize: bool = True
) -> Dict[str, Any]:
    """
    Extract text from petition PDFs with hybrid approach.

    Workflow:
    1. Fetch petition details to get document metadata
    2. Download PDF content from proxy server
    3. If auto_optimize=True:
       a. Try PyPDF2 extraction (free)
       b. Check extraction quality
       c. If poor quality, fallback to Mistral OCR
    4. If auto_optimize=False: Use Mistral OCR directly
    5. Return extracted text with cost information
    """
    request_id = generate_request_id()

    # Check feature flags
    if not feature_flags.is_enabled("ocr_enabled"):
        logger.warning(f"[{request_id}] OCR feature disabled by feature flag")
        return format_error_response(
            "OCR feature is currently disabled",
            503,
            request_id
        )

    try:
        # Get petition details to verify document exists
        petition_data = await self.get_petition_by_id(petition_id, include_documents=True)
        if "error" in petition_data:
            return petition_data

        # Find document in documentBag - access it from the correct location
        petition_records = petition_data.get(FPDFields.PETITION_DECISION_DATA_BAG, [])
        if not petition_records:
            return format_error_response(
                f"Petition {petition_id} not found",
                404,
                request_id
            )

        document_bag = petition_records[0].get(FPDFields.DOCUMENT_BAG, [])
        document = None
        for doc in document_bag:
            if doc.get(FPDFields.DOCUMENT_IDENTIFIER) == document_identifier:
                document = doc
                break

        if not document:
            return format_error_response(
                f"Document {document_identifier} not found in petition {petition_id}",
                404,
                request_id
            )

        # Get document metadata
        document_code = document.get(FPDFields.DOCUMENT_CODE, "UNKNOWN")
        page_count = document.get(FPDFields.PAGE_COUNT, 0)

        # Extract direct download URL from document metadata (for proxy registration)
        # The proxy needs this URL to fetch PDFs from USPTO API on behalf of users
        # Find the PDF download option in downloadOptionBag
        download_options = document.get(FPDFields.DOWNLOAD_OPTION_BAG, [])
        direct_download_url = None
        for option in download_options:
            if option.get(FPDFields.MIME_TYPE_IDENTIFIER) == 'PDF':
                direct_download_url = option.get(FPDFields.DOWNLOAD_URL)
                break
        if not direct_download_url:
            # Try getting download URL directly from document (still for proxy registration)
            direct_download_url = document.get(FPDFields.DOWNLOAD_URL, "")

        # Download PDF from proxy server
        # Check for centralized proxy first, then local FPD proxy
        centralized_port = os.getenv('CENTRALIZED_PROXY_PORT', '').lower()
        pdf_content = None

        if centralized_port and centralized_port != 'none':
            # Convert to int for URL formatting
            proxy_port = int(centralized_port)
            logger.info(f"[{request_id}] Using centralized proxy on port {proxy_port}")

            # Register document with centralized proxy before downloading
            try:
                from ..shared.internal_auth import mcp_auth
                from ..proxy.server import generate_enhanced_filename

                register_url = f"http://localhost:{proxy_port}/register-fpd-document"

                # Extract metadata needed for token and filename generation
                petition_mail_date = petition_records[0].get(FPDFields.PETITION_MAIL_DATE)
                app_number = petition_records[0].get(FPDFields.APPLICATION_NUMBER_TEXT, "")
                patent_number = petition_records[0].get(FPDFields.PATENT_NUMBER)
                doc_description = document.get(FPDFields.DOCUMENT_CODE_DESCRIPTION_TEXT)

                # Create JWT access token (TTL is 10 minutes, hardcoded in internal_auth.py)
                access_token = mcp_auth.create_document_access_token(
                    petition_id=petition_id,
                    document_identifier=document_identifier,
                    application_number=app_number
                )

                # Generate enhanced filename using proper format
                enhanced_filename = generate_enhanced_filename(
                    petition_mail_date=petition_mail_date,
                    app_number=app_number or "UNKNOWN",
                    patent_number=patent_number,
                    document_description=doc_description,
                    document_code=document_code,
                    max_desc_length=40
                )

                registration_data = {
                    "source": "fpd",  # Required by PFW proxy
                    "petition_id": petition_id,
                    "document_identifier": document_identifier,
                    "download_url": direct_download_url,
                    "access_token": access_token,
                    "application_number": app_number,
                    "enhanced_filename": enhanced_filename
                }

                # PFW validates JWT token in request body, no auth header needed
                async with httpx.AsyncClient(timeout=30.0, limits=self.connection_limits) as reg_client:
                    response_reg = await reg_client.post(
                        register_url,
                        json=registration_data
                    )
                    if response_reg.status_code == 200:
                        logger.info(f"[{request_id}] Successfully registered FPD document with centralized proxy")
                    else:
                        logger.warning(f"[{request_id}] Failed to register document with centralized proxy: {response_reg.status_code}")
            except Exception as e:
                logger.warning(f"[{request_id}] Failed to register document with centralized proxy: {e}")
                # Continue anyway - will try download and may fail if not registered

            # Try downloading from centralized proxy
            download_url = f"http://localhost:{proxy_port}/download/{petition_id}/{document_identifier}"
            logger.info(f"[{request_id}] Attempting PDF download from centralized proxy: {download_url}")
            try:
                async with httpx.AsyncClient(timeout=self.download_timeout, limits=self.connection_limits) as client:
                    pdf_response = await client.get(download_url)
                    pdf_response.raise_for_status()
                    pdf_content = pdf_response.content
                    logger.info(f"[{request_id}] Downloaded {len(pdf_content)} bytes from centralized proxy")
            except httpx.HTTPStatusError as e:
                if e.response.status_code == 404:
                    # Centralized proxy doesn't have FPD routes yet - fallback to local FPD proxy
                    logger.warning(f"[{request_id}] Centralized proxy doesn't support FPD routes yet (404)")
                    logger.info(f"[{request_id}] Falling back to local FPD proxy")
                    pdf_content = None  # Will trigger fallback below
                else:
                    # Other HTTP error - re-raise
                    raise
            except Exception as e:
                # Network or other errors - log and fallback to local proxy
                logger.warning(f"[{request_id}] Centralized proxy download failed: {e}")
                logger.info(f"[{request_id}] Falling back to local FPD proxy")
                pdf_content = None  # Will trigger fallback below

        # Use local FPD proxy if centralized proxy not configured or failed
        if pdf_content is None:
            # Check FPD_PROXY_PORT first (MCP-specific), then PROXY_PORT (generic)
            # Handle 'none' sentinel value
            port_str = os.getenv("FPD_PROXY_PORT") or os.getenv("PROXY_PORT") or "8081"
            local_proxy_port = "8081" if port_str.lower() == "none" else port_str
            logger.info(f"[{request_id}] Using local FPD proxy on port {local_proxy_port}")

            download_url = f"http://localhost:{local_proxy_port}/download/{petition_id}/{document_identifier}"
            logger.info(f"[{request_id}] Downloading PDF from local FPD proxy: {download_url}")

            async with httpx.AsyncClient(timeout=self.download_timeout, limits=self.connection_limits) as client:
                pdf_response = await client.get(download_url)
                pdf_response.raise_for_status()
                pdf_content = pdf_response.content
                logger.info(f"[{request_id}] Downloaded {len(pdf_content)} bytes from local FPD proxy")

        # Extract text based on auto_optimize setting
        extraction_result = {
            "success": True,
            "document_code": document_code,
            "page_count": page_count,
            "request_id": request_id
        }

        if auto_optimize:
            # Try PyPDF2 first
            logger.info(f"[{request_id}] Attempting PyPDF2 extraction (free)")
            pypdf_text = await self.extract_with_pypdf2(pdf_content)

            if self.is_good_extraction(pypdf_text):
                # PyPDF2 worked!
                logger.info(f"[{request_id}] PyPDF2 extraction successful ({len(pypdf_text)} chars)")
                extraction_result.update({
                    "extracted_content": pypdf_text,
                    "extraction_method": "PyPDF2",
                    "processing_cost_usd": 0.0,
                    "cost_breakdown": "Free PyPDF2 extraction",
                    "auto_optimization": "PyPDF2 succeeded - no OCR needed"
                })
            else:
                # PyPDF2 failed - fallback to Mistral OCR
                logger.info(f"[{request_id}] PyPDF2 extraction poor quality, falling back to Mistral OCR")
                mistral_text, cost = await self.extract_with_mistral_ocr(pdf_content, page_count)
                logger.info(f"[{request_id}] Mistral OCR extraction successful ({len(mistral_text)} chars, ${cost:.4f})")
                extraction_result.update({
                    "extracted_content": mistral_text,
                    "extraction_method": "Mistral OCR (mistral-ocr-latest)",
                    "processing_cost_usd": round(cost, 4),
                    "cost_breakdown": f"${cost:.4f} for {page_count} pages at $0.001/page",
                    "auto_optimization": "PyPDF2 failed - Mistral OCR used"
                })
        else:
            # Use Mistral OCR directly
            logger.info(f"[{request_id}] Using Mistral OCR directly (auto_optimize=False)")
            mistral_text, cost = await self.extract_with_mistral_ocr(pdf_content, page_count)
            logger.info(f"[{request_id}] Mistral OCR extraction successful ({len(mistral_text)} chars, ${cost:.4f})")
            extraction_result.update({
                "extracted_content": mistral_text,
                "extraction_method": "Mistral OCR (mistral-ocr-latest)",
                "processing_cost_usd": round(cost, 4),
                "cost_breakdown": f"${cost:.4f} for {page_count} pages at $0.001/page",
                "auto_optimization": "Disabled - Mistral OCR used directly"
            })

        return extraction_result

    except ValueError as e:
        # MISTRAL_API_KEY missing or other validation error
        logger.error(f"[{request_id}] Validation error: {str(e)}")
        return format_error_response(
            f"{str(e)}. PyPDF2 extraction failed - document may be scanned. To enable OCR, configure MISTRAL_API_KEY.",
            400,
            request_id
        )
    except Exception as e:
        logger.error(f"[{request_id}] Error extracting document content: {str(e)}")
        return format_error_response(
            f"Failed to extract document content: {str(e)}",
            500,
            request_id
        )
```
- `FPDClient.extract_with_mistral_ocr`: specific helper for Mistral OCR extraction, used as a fallback in the hybrid method when PyPDF2 fails.

```python
async def extract_with_mistral_ocr(self, pdf_content: bytes, page_count: int = 0) -> Tuple[str, float]:
    """
    Extract text using Mistral OCR API (no poppler/pdf2image required).
    Uses the same approach as Patent File Wrapper MCP.

    Args:
        pdf_content: PDF bytes
        page_count: Number of pages (for cost control)

    Returns:
        Tuple of (extracted_text, cost_usd)
    """
    # Check feature flag
    if not feature_flags.is_enabled("mistral_ocr_enabled"):
        raise ValueError("Mistral OCR feature is currently disabled")

    # Get Mistral API key from unified secure storage first, then environment variable
    mistral_api_key = None
    try:
        from ..shared_secure_storage import get_mistral_api_key
        mistral_api_key = get_mistral_api_key()
    except Exception:
        # Fall back to environment variable if secure storage fails
        pass

    # If still no key, try environment variable
    if not mistral_api_key:
        mistral_api_key = os.getenv("MISTRAL_API_KEY")

    if not mistral_api_key:
        raise ValueError("MISTRAL_API_KEY required for OCR extraction")

    mistral_base_url = "https://api.mistral.ai/v1"

    try:
        # Step 1: Upload PDF file to Mistral
        mistral_headers = {
            "Authorization": f"Bearer {mistral_api_key}",
        }
        files = {
            "file": ("document.pdf", pdf_content, "application/pdf")
        }
        data = {
            "purpose": "ocr"
        }

        async with httpx.AsyncClient(timeout=self.download_timeout, limits=self.connection_limits) as client:
            # Upload file
            upload_response = await client.post(
                f"{mistral_base_url}/files",
                headers=mistral_headers,
                files=files,
                data=data
            )
            upload_response.raise_for_status()
            upload_data = upload_response.json()
            file_id = upload_data.get("id")

            if not file_id:
                raise ValueError("Failed to upload file to Mistral OCR service")

            # Step 2: Process with OCR
            ocr_payload = {
                "model": "mistral-ocr-latest",
                "document": {
                    "type": "file",
                    "file_id": file_id
                },
                "pages": list(range(min(page_count, 50))) if page_count > 0 else None,  # Limit to first 50 pages for cost control
                "include_image_base64": False  # Save tokens
            }

            # Operation-level timeout for OCR (2x download timeout for large PDFs)
            ocr_timeout = self.download_timeout * api_constants.OCR_TIMEOUT_MULTIPLIER
            try:
                async with asyncio.timeout(ocr_timeout):
                    ocr_response = await client.post(
                        f"{mistral_base_url}/ocr",
                        headers={
                            "Authorization": f"Bearer {mistral_api_key}",
                            "Content-Type": "application/json"
                        },
                        json=ocr_payload
                    )
                    ocr_response.raise_for_status()
                    ocr_data = ocr_response.json()
            except asyncio.TimeoutError:
                raise ValueError(f"OCR operation timed out after {ocr_timeout}s - PDF may be too large or complex")

            # Extract content from OCR response
            pages_processed = ocr_data.get("usage_info", {}).get("pages_processed", 0)
            estimated_cost = pages_processed * 0.001  # $1 per 1000 pages

            # Combine all page content
            extracted_content = []
            for page in ocr_data.get("pages", []):
                page_markdown = page.get("markdown", "")
                if page_markdown.strip():
                    extracted_content.append(f"=== PAGE {page.get('index', 0) + 1} ===\n{page_markdown}")

            full_content = "\n\n".join(extracted_content)
            logger.info(f"Mistral OCR extracted {pages_processed} pages, cost: ${estimated_cost:.4f}")

            return full_content, estimated_cost

    except httpx.HTTPStatusError as e:
        if e.response.status_code == 401:
            raise ValueError("Mistral API authentication failed - check MISTRAL_API_KEY")
        elif e.response.status_code == 402:
            raise ValueError("Mistral API payment required - insufficient credits")
        else:
            raise ValueError(f"Mistral API error {e.response.status_code}: {e.response.text}")
    except Exception as e:
        logger.error(f"Mistral OCR extraction failed: {e}")
        raise
```
- src/fpd_mcp/main.py:1328 (registration): MCP tool registration decorator specifying the exact tool name: `@mcp.tool(name="FPD_get_document_content_with_mistral_ocr")`.
- Tool description and usage example in tool reflections, serving as informal schema/documentation:

#### 7. fpd_get_document_content

**Purpose:** Extract text from petition PDFs for LLM analysis
**Extraction:** Hybrid PyPDF2 (free) + Mistral OCR (~$0.001/page)
**Example:**

```python
fpd_get_document_content(
    petition_id='0b71b685-...',
    document_identifier='DSEN5APWPHOENIX'
)
```