FPD_get_document_content_with_mistral_ocr
Extract text from USPTO petition documents using hybrid extraction: free PyPDF2 for text-based PDFs, Mistral OCR for scanned documents. Analyze legal arguments, issues, and patterns in petition decisions.
Instructions
Extract full text from USPTO petition documents with intelligent hybrid extraction (PyPDF2 first, Mistral OCR fallback).
PREREQUISITE: First use fpd_get_petition_details to get document_identifier from documentBag. Auto-optimizes cost: free PyPDF2 for text-based PDFs, ~$0.001/page Mistral OCR only for scanned documents. MISTRAL_API_KEY is optional - without it, only PyPDF2 extraction is available (works well for text-based PDFs).
USE CASES:
- Analyze petition legal arguments and Director's reasoning
- Extract petition issues, CFR rules cited, statutory references
- Detect patterns across multiple petitions (e.g., common denial reasons)
- Correlate petition text with PTAB challenge strategies
- Profile examiner behavior from supervisory review petitions
COST OPTIMIZATION:
- auto_optimize=True (default): Try free PyPDF2 first, fall back to Mistral OCR only if needed (70% cost savings)
- auto_optimize=False: Use Mistral OCR directly (~$0.001/page)
Returns: extracted_content, extraction_method, processing_cost_usd, page_count
Example workflow (sketched in code below):
1. fpd_get_petition_details(petition_id='0b71b685-...', include_documents=True)
2. fpd_get_document_content(petition_id='0b71b685-...', document_identifier='DSEN5APWPHOENIX')
3. Analyze extracted text for legal arguments, issues, and patterns
For document selection strategies and cost optimization, use FPD_get_guidance('cost').
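A minimal sketch of the example workflow above, assuming the tool functions are awaited directly and that the petition details response nests `documentBag` under `petitionDecisionDataBag` with `documentIdentifier` keys (these field names are assumptions inferred from the client's field constants; verify against the actual fpd_get_petition_details output):

```python
# Sketch only: field names inside the petition details response are assumptions.
async def extract_first_document(petition_id: str) -> str:
    # Step 1: fetch petition details, including the documentBag
    details = await fpd_get_petition_details(
        petition_id=petition_id,
        include_documents=True,
    )

    # Step 2: pick a document identifier from the documentBag (assumed key names)
    documents = details["petitionDecisionDataBag"][0]["documentBag"]
    document_identifier = documents[0]["documentIdentifier"]

    # Step 3: hybrid extraction (free PyPDF2 first, Mistral OCR fallback)
    result = await fpd_get_document_content(
        petition_id=petition_id,
        document_identifier=document_identifier,
        auto_optimize=True,
    )

    # Step 4: inspect the documented return fields
    print(result["extraction_method"])     # "PyPDF2" or "Mistral OCR (mistral-ocr-latest)"
    print(result["processing_cost_usd"])   # 0.0 for PyPDF2, ~$0.001/page for OCR
    return result["extracted_content"]     # full document text for analysis
```

With auto_optimize=True this costs nothing for text-based PDFs; OCR charges (~$0.001/page) apply only when the PyPDF2 pass is judged too poor.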
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| petition_id | Yes | USPTO petition identifier (e.g., '0b71b685-...'). | |
| document_identifier | Yes | Document identifier from the petition's documentBag (obtain via fpd_get_petition_details with include_documents=True). | |
| auto_optimize | No | Try free PyPDF2 first and fall back to Mistral OCR only if extraction quality is poor; set to False to use Mistral OCR directly. | True |
Implementation Reference
- src/fpd_mcp/main.py:1328-1443 (handler): primary handler function decorated with `@mcp.tool(name="FPD_get_document_content_with_mistral_ocr")`. Delegates to `FPDClient.extract_document_content_hybrid` for hybrid PyPDF2 + Mistral OCR text extraction.

```python
@mcp.tool(name="FPD_get_document_content_with_mistral_ocr")
@async_tool_error_handler("document_content")
async def fpd_get_document_content(
    petition_id: str,
    document_identifier: str,
    auto_optimize: bool = True
) -> Dict[str, Any]:
    """Extract full text from USPTO petition documents with intelligent hybrid
    extraction (PyPDF2 first, Mistral OCR fallback).

    PREREQUISITE: First use fpd_get_petition_details to get document_identifier
    from documentBag. Auto-optimizes cost: free PyPDF2 for text-based PDFs,
    ~$0.001/page Mistral OCR only for scanned documents. MISTRAL_API_KEY is
    optional - without it, only PyPDF2 extraction is available (works well for
    text-based PDFs).

    USE CASES:
    - Analyze petition legal arguments and Director's reasoning
    - Extract petition issues, CFR rules cited, statutory references
    - Detect patterns across multiple petitions (e.g., common denial reasons)
    - Correlate petition text with PTAB challenge strategies
    - Profile examiner behavior from supervisory review petitions

    COST OPTIMIZATION:
    - auto_optimize=True (default): Try free PyPDF2 first, fallback to Mistral OCR if needed (70% cost savings)
    - auto_optimize=False: Use Mistral OCR directly (~$0.001/page)

    Returns: extracted_content, extraction_method, processing_cost_usd, page_count

    Example workflow:
    1. fpd_get_petition_details(petition_id='0b71b685-...', include_documents=True)
    2. fpd_get_document_content(petition_id='0b71b685-...', document_identifier='DSEN5APWPHOENIX')
    3. Analyze extracted text for legal arguments, issues, and patterns

    For document selection strategies and cost optimization, use FPD_get_guidance('cost').
    """
    try:
        # Input validation
        if not petition_id or len(petition_id.strip()) == 0:
            return format_error_response("Petition ID cannot be empty", 400)
        if not document_identifier or len(document_identifier.strip()) == 0:
            return format_error_response("Document identifier cannot be empty", 400)

        # Enhanced proxy port detection with centralized proxy support
        centralized_port = os.getenv('CENTRALIZED_PROXY_PORT', '').lower()
        if centralized_port and centralized_port != 'none':
            proxy_port = int(centralized_port)
            logger.info(f"Using centralized USPTO proxy on port {proxy_port} for extraction")
        else:
            # Check FPD_PROXY_PORT first (MCP-specific), then PROXY_PORT (generic)
            proxy_port = get_local_proxy_port()
            logger.info(f"Using local FPD proxy on port {proxy_port} for extraction")

        # Start proxy server if not already running (unless using centralized proxy)
        if not centralized_port or centralized_port == 'none':
            await _ensure_proxy_server_running(proxy_port)
            logger.info(f"Local proxy server ready on port {proxy_port} for document extraction")
        else:
            logger.info("Using centralized proxy for document extraction - no local proxy startup needed")

        # Use API client's hybrid extraction method
        result = await api_client.extract_document_content_hybrid(
            petition_id=petition_id,
            document_identifier=document_identifier,
            auto_optimize=auto_optimize
        )

        # Check for errors
        if "error" in result:
            return result

        # Add LLM guidance for text analysis
        result["llm_guidance"] = {
            "analysis_strategies": {
                "legal_argument_analysis": {
                    "description": "Analyze petition and decision text for legal reasoning",
                    "action": "Extract key arguments, Director's reasoning, legal citations"
                },
                "pattern_detection": {
                    "description": "Compare text across multiple petitions to find common themes",
                    "action": "Identify recurring denial reasons, successful argument patterns"
                },
                "cross_mcp_correlation": {
                    "description": "Correlate petition arguments with PTAB challenges",
                    "action": "Compare legal reasoning with PTAB IPR/PGR arguments"
                },
                "examiner_profiling": {
                    "description": "Analyze supervisory review petitions to profile examiner behavior",
                    "action": "Extract what examiner actions were challenged and Director's response"
                }
            },
            "extraction_quality": {
                "method": result.get("extraction_method", "Unknown"),
                "cost": f"${result.get('processing_cost_usd', 0):.4f}",
                "optimization": result.get("auto_optimization", "Unknown")
            },
            "next_steps": [
                "Analyze extracted content for key legal arguments",
                "Search for CFR citations (e.g., '37 CFR 1.137', '37 CFR 1.181')",
                "Identify petition outcome reasoning in decision text",
                "Cross-reference with PFW prosecution history for context",
                "Compare with PTAB challenge arguments if patent granted"
            ]
        }

        return result

    except ValueError as e:
        logger.warning(f"Validation error in extract document content: {str(e)}")
        return format_error_response(str(e), 400)
    except httpx.HTTPStatusError as e:
        logger.error(f"API error in extract document content: {e.response.status_code} - {e.response.text}")
        return format_error_response(f"API error: {e.response.text}", e.response.status_code)
    except httpx.TimeoutException as e:
        logger.error(f"API timeout in extract document content: {str(e)}")
        return format_error_response("Request timeout - please try again", 408)
    except Exception as e:
        logger.error(f"Unexpected error in extract document content: {str(e)}")
        return format_error_response(f"Internal error: {str(e)}", 500)
```
- `FPDClient.extract_document_content_hybrid`: core helper method implementing the hybrid document content extraction logic called by the handler. Handles PDF download via proxy and PyPDF2/Mistral OCR extraction.

```python
async def extract_document_content_hybrid(
    self,
    petition_id: str,
    document_identifier: str,
    auto_optimize: bool = True
) -> Dict[str, Any]:
    """
    Extract text from petition PDFs with hybrid approach.

    Workflow:
    1. Fetch petition details to get document metadata
    2. Download PDF content from proxy server
    3. If auto_optimize=True:
       a. Try PyPDF2 extraction (free)
       b. Check extraction quality
       c. If poor quality, fallback to Mistral OCR
    4. If auto_optimize=False: Use Mistral OCR directly
    5. Return extracted text with cost information
    """
    request_id = generate_request_id()

    # Check feature flags
    if not feature_flags.is_enabled("ocr_enabled"):
        logger.warning(f"[{request_id}] OCR feature disabled by feature flag")
        return format_error_response(
            "OCR feature is currently disabled",
            503,
            request_id
        )

    try:
        # Get petition details to verify document exists
        petition_data = await self.get_petition_by_id(petition_id, include_documents=True)
        if "error" in petition_data:
            return petition_data

        # Find document in documentBag - access it from the correct location
        petition_records = petition_data.get(FPDFields.PETITION_DECISION_DATA_BAG, [])
        if not petition_records:
            return format_error_response(
                f"Petition {petition_id} not found",
                404,
                request_id
            )

        document_bag = petition_records[0].get(FPDFields.DOCUMENT_BAG, [])
        document = None
        for doc in document_bag:
            if doc.get(FPDFields.DOCUMENT_IDENTIFIER) == document_identifier:
                document = doc
                break

        if not document:
            return format_error_response(
                f"Document {document_identifier} not found in petition {petition_id}",
                404,
                request_id
            )

        # Get document metadata
        document_code = document.get(FPDFields.DOCUMENT_CODE, "UNKNOWN")
        page_count = document.get(FPDFields.PAGE_COUNT, 0)

        # Extract direct download URL from document metadata (for proxy registration)
        # The proxy needs this URL to fetch PDFs from USPTO API on behalf of users
        # Find the PDF download option in downloadOptionBag
        download_options = document.get(FPDFields.DOWNLOAD_OPTION_BAG, [])
        direct_download_url = None
        for option in download_options:
            if option.get(FPDFields.MIME_TYPE_IDENTIFIER) == 'PDF':
                direct_download_url = option.get(FPDFields.DOWNLOAD_URL)
                break
        if not direct_download_url:
            # Try getting download URL directly from document (still for proxy registration)
            direct_download_url = document.get(FPDFields.DOWNLOAD_URL, "")

        # Download PDF from proxy server
        # Check for centralized proxy first, then local FPD proxy
        centralized_port = os.getenv('CENTRALIZED_PROXY_PORT', '').lower()
        pdf_content = None

        if centralized_port and centralized_port != 'none':
            # Convert to int for URL formatting
            proxy_port = int(centralized_port)
            logger.info(f"[{request_id}] Using centralized proxy on port {proxy_port}")

            # Register document with centralized proxy before downloading
            try:
                from ..shared.internal_auth import mcp_auth
                from ..proxy.server import generate_enhanced_filename

                register_url = f"http://localhost:{proxy_port}/register-fpd-document"

                # Extract metadata needed for token and filename generation
                petition_mail_date = petition_records[0].get(FPDFields.PETITION_MAIL_DATE)
                app_number = petition_records[0].get(FPDFields.APPLICATION_NUMBER_TEXT, "")
                patent_number = petition_records[0].get(FPDFields.PATENT_NUMBER)
                doc_description = document.get(FPDFields.DOCUMENT_CODE_DESCRIPTION_TEXT)

                # Create JWT access token (TTL is 10 minutes, hardcoded in internal_auth.py)
                access_token = mcp_auth.create_document_access_token(
                    petition_id=petition_id,
                    document_identifier=document_identifier,
                    application_number=app_number
                )

                # Generate enhanced filename using proper format
                enhanced_filename = generate_enhanced_filename(
                    petition_mail_date=petition_mail_date,
                    app_number=app_number or "UNKNOWN",
                    patent_number=patent_number,
                    document_description=doc_description,
                    document_code=document_code,
                    max_desc_length=40
                )

                registration_data = {
                    "source": "fpd",  # Required by PFW proxy
                    "petition_id": petition_id,
                    "document_identifier": document_identifier,
                    "download_url": direct_download_url,
                    "access_token": access_token,
                    "application_number": app_number,
                    "enhanced_filename": enhanced_filename
                }

                # PFW validates JWT token in request body, no auth header needed
                async with httpx.AsyncClient(timeout=30.0, limits=self.connection_limits) as reg_client:
                    response_reg = await reg_client.post(
                        register_url,
                        json=registration_data
                    )
                    if response_reg.status_code == 200:
                        logger.info(f"[{request_id}] Successfully registered FPD document with centralized proxy")
                    else:
                        logger.warning(f"[{request_id}] Failed to register document with centralized proxy: {response_reg.status_code}")
            except Exception as e:
                logger.warning(f"[{request_id}] Failed to register document with centralized proxy: {e}")
                # Continue anyway - will try download and may fail if not registered

            # Try downloading from centralized proxy
            download_url = f"http://localhost:{proxy_port}/download/{petition_id}/{document_identifier}"
            logger.info(f"[{request_id}] Attempting PDF download from centralized proxy: {download_url}")
            try:
                async with httpx.AsyncClient(timeout=self.download_timeout, limits=self.connection_limits) as client:
                    pdf_response = await client.get(download_url)
                    pdf_response.raise_for_status()
                    pdf_content = pdf_response.content
                    logger.info(f"[{request_id}] Downloaded {len(pdf_content)} bytes from centralized proxy")
            except httpx.HTTPStatusError as e:
                if e.response.status_code == 404:
                    # Centralized proxy doesn't have FPD routes yet - fallback to local FPD proxy
                    logger.warning(f"[{request_id}] Centralized proxy doesn't support FPD routes yet (404)")
                    logger.info(f"[{request_id}] Falling back to local FPD proxy")
                    pdf_content = None  # Will trigger fallback below
                else:
                    # Other HTTP error - re-raise
                    raise
            except Exception as e:
                # Network or other errors - log and fallback to local proxy
                logger.warning(f"[{request_id}] Centralized proxy download failed: {e}")
                logger.info(f"[{request_id}] Falling back to local FPD proxy")
                pdf_content = None  # Will trigger fallback below

        # Use local FPD proxy if centralized proxy not configured or failed
        if pdf_content is None:
            # Check FPD_PROXY_PORT first (MCP-specific), then PROXY_PORT (generic)
            # Handle 'none' sentinel value
            port_str = os.getenv("FPD_PROXY_PORT") or os.getenv("PROXY_PORT") or "8081"
            local_proxy_port = "8081" if port_str.lower() == "none" else port_str
            logger.info(f"[{request_id}] Using local FPD proxy on port {local_proxy_port}")

            download_url = f"http://localhost:{local_proxy_port}/download/{petition_id}/{document_identifier}"
            logger.info(f"[{request_id}] Downloading PDF from local FPD proxy: {download_url}")

            async with httpx.AsyncClient(timeout=self.download_timeout, limits=self.connection_limits) as client:
                pdf_response = await client.get(download_url)
                pdf_response.raise_for_status()
                pdf_content = pdf_response.content
                logger.info(f"[{request_id}] Downloaded {len(pdf_content)} bytes from local FPD proxy")

        # Extract text based on auto_optimize setting
        extraction_result = {
            "success": True,
            "document_code": document_code,
            "page_count": page_count,
            "request_id": request_id
        }

        if auto_optimize:
            # Try PyPDF2 first
            logger.info(f"[{request_id}] Attempting PyPDF2 extraction (free)")
            pypdf_text = await self.extract_with_pypdf2(pdf_content)

            if self.is_good_extraction(pypdf_text):
                # PyPDF2 worked!
                logger.info(f"[{request_id}] PyPDF2 extraction successful ({len(pypdf_text)} chars)")
                extraction_result.update({
                    "extracted_content": pypdf_text,
                    "extraction_method": "PyPDF2",
                    "processing_cost_usd": 0.0,
                    "cost_breakdown": "Free PyPDF2 extraction",
                    "auto_optimization": "PyPDF2 succeeded - no OCR needed"
                })
            else:
                # PyPDF2 failed - fallback to Mistral OCR
                logger.info(f"[{request_id}] PyPDF2 extraction poor quality, falling back to Mistral OCR")
                mistral_text, cost = await self.extract_with_mistral_ocr(pdf_content, page_count)
                logger.info(f"[{request_id}] Mistral OCR extraction successful ({len(mistral_text)} chars, ${cost:.4f})")
                extraction_result.update({
                    "extracted_content": mistral_text,
                    "extraction_method": "Mistral OCR (mistral-ocr-latest)",
                    "processing_cost_usd": round(cost, 4),
                    "cost_breakdown": f"${cost:.4f} for {page_count} pages at $0.001/page",
                    "auto_optimization": "PyPDF2 failed - Mistral OCR used"
                })
        else:
            # Use Mistral OCR directly
            logger.info(f"[{request_id}] Using Mistral OCR directly (auto_optimize=False)")
            mistral_text, cost = await self.extract_with_mistral_ocr(pdf_content, page_count)
            logger.info(f"[{request_id}] Mistral OCR extraction successful ({len(mistral_text)} chars, ${cost:.4f})")
            extraction_result.update({
                "extracted_content": mistral_text,
                "extraction_method": "Mistral OCR (mistral-ocr-latest)",
                "processing_cost_usd": round(cost, 4),
                "cost_breakdown": f"${cost:.4f} for {page_count} pages at $0.001/page",
                "auto_optimization": "Disabled - Mistral OCR used directly"
            })

        return extraction_result

    except ValueError as e:
        # MISTRAL_API_KEY missing or other validation error
        logger.error(f"[{request_id}] Validation error: {str(e)}")
        return format_error_response(
            f"{str(e)}. PyPDF2 extraction failed - document may be scanned. To enable OCR, configure MISTRAL_API_KEY.",
            400,
            request_id
        )
    except Exception as e:
        logger.error(f"[{request_id}] Error extracting document content: {str(e)}")
        return format_error_response(
            f"Failed to extract document content: {str(e)}",
            500,
            request_id
        )
```
- `FPDClient.extract_with_mistral_ocr`: specific helper for Mistral OCR extraction, used as a fallback in the hybrid method when PyPDF2 fails.

```python
async def extract_with_mistral_ocr(self, pdf_content: bytes, page_count: int = 0) -> Tuple[str, float]:
    """
    Extract text using Mistral OCR API (no poppler/pdf2image required).
    Uses the same approach as Patent File Wrapper MCP.

    Args:
        pdf_content: PDF bytes
        page_count: Number of pages (for cost control)

    Returns:
        Tuple of (extracted_text, cost_usd)
    """
    # Check feature flag
    if not feature_flags.is_enabled("mistral_ocr_enabled"):
        raise ValueError("Mistral OCR feature is currently disabled")

    # Get Mistral API key from unified secure storage first, then environment variable
    mistral_api_key = None
    try:
        from ..shared_secure_storage import get_mistral_api_key
        mistral_api_key = get_mistral_api_key()
    except Exception:
        # Fall back to environment variable if secure storage fails
        pass

    # If still no key, try environment variable
    if not mistral_api_key:
        mistral_api_key = os.getenv("MISTRAL_API_KEY")

    if not mistral_api_key:
        raise ValueError("MISTRAL_API_KEY required for OCR extraction")

    mistral_base_url = "https://api.mistral.ai/v1"

    try:
        # Step 1: Upload PDF file to Mistral
        mistral_headers = {
            "Authorization": f"Bearer {mistral_api_key}",
        }
        files = {
            "file": ("document.pdf", pdf_content, "application/pdf")
        }
        data = {
            "purpose": "ocr"
        }

        async with httpx.AsyncClient(timeout=self.download_timeout, limits=self.connection_limits) as client:
            # Upload file
            upload_response = await client.post(
                f"{mistral_base_url}/files",
                headers=mistral_headers,
                files=files,
                data=data
            )
            upload_response.raise_for_status()
            upload_data = upload_response.json()
            file_id = upload_data.get("id")

            if not file_id:
                raise ValueError("Failed to upload file to Mistral OCR service")

            # Step 2: Process with OCR
            ocr_payload = {
                "model": "mistral-ocr-latest",
                "document": {
                    "type": "file",
                    "file_id": file_id
                },
                "pages": list(range(min(page_count, 50))) if page_count > 0 else None,  # Limit to first 50 pages for cost control
                "include_image_base64": False  # Save tokens
            }

            # Operation-level timeout for OCR (2x download timeout for large PDFs)
            ocr_timeout = self.download_timeout * api_constants.OCR_TIMEOUT_MULTIPLIER
            try:
                async with asyncio.timeout(ocr_timeout):
                    ocr_response = await client.post(
                        f"{mistral_base_url}/ocr",
                        headers={
                            "Authorization": f"Bearer {mistral_api_key}",
                            "Content-Type": "application/json"
                        },
                        json=ocr_payload
                    )
                    ocr_response.raise_for_status()
                    ocr_data = ocr_response.json()
            except asyncio.TimeoutError:
                raise ValueError(f"OCR operation timed out after {ocr_timeout}s - PDF may be too large or complex")

            # Extract content from OCR response
            pages_processed = ocr_data.get("usage_info", {}).get("pages_processed", 0)
            estimated_cost = pages_processed * 0.001  # $1 per 1000 pages

            # Combine all page content
            extracted_content = []
            for page in ocr_data.get("pages", []):
                page_markdown = page.get("markdown", "")
                if page_markdown.strip():
                    extracted_content.append(f"=== PAGE {page.get('index', 0) + 1} ===\n{page_markdown}")

            full_content = "\n\n".join(extracted_content)
            logger.info(f"Mistral OCR extracted {pages_processed} pages, cost: ${estimated_cost:.4f}")

            return full_content, estimated_cost

    except httpx.HTTPStatusError as e:
        if e.response.status_code == 401:
            raise ValueError("Mistral API authentication failed - check MISTRAL_API_KEY")
        elif e.response.status_code == 402:
            raise ValueError("Mistral API payment required - insufficient credits")
        else:
            raise ValueError(f"Mistral API error {e.response.status_code}: {e.response.text}")
    except Exception as e:
        logger.error(f"Mistral OCR extraction failed: {e}")
        raise
```
- src/fpd_mcp/main.py:1328 (registration): MCP tool registration decorator specifying the exact tool name: `@mcp.tool(name="FPD_get_document_content_with_mistral_ocr")`.
- Tool description and usage example in tool reflections, serving as informal schema/documentation:

#### 7. fpd_get_document_content

**Purpose:** Extract text from petition PDFs for LLM analysis
**Extraction:** Hybrid PyPDF2 (free) + Mistral OCR (~$0.001/page)
**Example:**

```python
fpd_get_document_content(
    petition_id='0b71b685-...',
    document_identifier='DSEN5APWPHOENIX'
)
```