load_article_to_context
Load arXiv article text into context for analysis using title or ID, with options for partial extraction and preview validation.
Instructions
Load the article text into context. Supports title or arXiv ID resolution and partial extraction.
Args:
- title: Article title.
- arxiv_id: arXiv ID.
- start_page: 1-based start page (inclusive).
- end_page: 1-based end page (inclusive).
- max_pages: hard cap on number of pages to extract.
- max_chars: hard cap on number of characters to extract.
- preview: if True, only validate availability and return minimal info.
Returns: Article text or structured error JSON.
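A caller that uses `preview=True` gets back a small JSON string rather than article text. The sketch below shows one way to interpret it. The keys `status`, `reachable`, `arxiv_id`, and `url` are the ones the handler under Implementation Reference emits. The helper name `reachable_url`, the sample arXiv ID, and the sample URL format are assumptions made for illustration only.

```python
import json
from typing import Optional


def reachable_url(preview_result: str) -> Optional[str]:
    """Hypothetical helper: return the URL if a preview reports it reachable."""
    try:
        payload = json.loads(preview_result)
    except json.JSONDecodeError:
        # Not JSON at all, so it cannot be a preview payload.
        return None
    if not isinstance(payload, dict):
        return None
    if payload.get("status") == "ok" and payload.get("reachable") is True:
        return payload.get("url")
    # Covers status "error" and any other structured error JSON.
    return None


# Sample payload with the same keys as the preview branch (values are made up).
sample = json.dumps(
    {
        "status": "ok",
        "reachable": True,
        "arxiv_id": "1706.03762",
        "url": "https://arxiv.org/pdf/1706.03762",
    }
)
```

Anything the helper does not recognise as a successful preview is treated as unavailable, which keeps the caller from loading a full PDF after a failed check.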
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
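| title | No | Article title. | None |
| arxiv_id | No | arXiv ID. | None |
| start_page | No | 1-based start page (inclusive). | None |
| end_page | No | 1-based end page (inclusive). | None |
| max_pages | No | Hard cap on number of pages to extract. | None |
| max_chars | No | Hard cap on number of characters to extract. | None |
| preview | No | If True, only validate availability and return minimal info. | False |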
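To illustrate how these parameters combine, the following sketch assembles an argument dictionary for a partial extraction. The helper `build_arguments` is hypothetical and not part of the server. The check that at least one of `title` or `arxiv_id` is supplied is an assumption, since every field is listed as optional and the resolution logic is not shown here. The arXiv ID is only an example value.

```python
from typing import Any, Dict, Optional


def build_arguments(
    title: Optional[str] = None,
    arxiv_id: Optional[str] = None,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
    max_pages: Optional[int] = None,
    max_chars: Optional[int] = None,
    preview: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: collect tool arguments, omitting unset options."""
    if title is None and arxiv_id is None:
        # Assumption: the tool needs one identifier to resolve an article.
        raise ValueError("provide title or arxiv_id")
    candidates = {
        "title": title,
        "arxiv_id": arxiv_id,
        "start_page": start_page,
        "end_page": end_page,
        "max_pages": max_pages,
        "max_chars": max_chars,
    }
    # Leave out anything still at its None default.
    args: Dict[str, Any] = {k: v for k, v in candidates.items() if v is not None}
    if preview:
        args["preview"] = True
    return args


# First three pages of an example paper, capped at 20,000 characters.
example_args = build_arguments(
    arxiv_id="1706.03762", start_page=1, end_page=3, max_chars=20000
)
```

Omitting unset fields keeps the request minimal and lets the tool apply its own defaults.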
Implementation Reference
- src/arxiv_server/server.py:216-290 (handler) The primary handler function for the 'load_article_to_context' tool, decorated with @mcp.tool(). It resolves the arXiv article by title or ID, fetches the PDF, extracts text from optional page ranges with character and page limits, handles preview mode, and returns the extracted text or error JSON.

      @mcp.tool()
      async def load_article_to_context(
          title: Optional[str] = None,
          arxiv_id: Optional[str] = None,
          start_page: Optional[int] = None,
          end_page: Optional[int] = None,
          max_pages: Optional[int] = None,
          max_chars: Optional[int] = None,
          preview: bool = False,
      ) -> str:
          """
          Load the article text into context. Supports title or arXiv ID resolution and partial extraction.

          Args:
              title: Article title.
              arxiv_id: arXiv ID.
              start_page: 1-based start page (inclusive).
              end_page: 1-based end page (inclusive).
              max_pages: hard cap on number of pages to extract.
              max_chars: hard cap on number of characters to extract.
              preview: if True, only validate availability and return minimal info.

          Returns:
              Article text or structured error JSON.
          """
          result = await resolve_article(title=title, arxiv_id=arxiv_id)
          if isinstance(result, str):
              return result
          article_url, resolved_id = result

          if preview:
              # Lightweight availability check
              try:
                  async with httpx.AsyncClient(timeout=DEFAULT_TIMEOUT, limits=HTTP_LIMITS) as client:
                      head = await client.head(article_url, headers={"User-Agent": USER_AGENT})
                      ok = head.status_code < 400
              except Exception:
                  ok = False
              return json.dumps({"status": "ok" if ok else "error", "reachable": ok, "arxiv_id": resolved_id, "url": article_url})

          pdf_bytes = await get_pdf(article_url)
          if pdf_bytes is None:
              return _error("FETCH_FAILED", "Unable to retrieve the article from arXiv.org.")

          try:
              doc = fitz.open(stream=pdf_bytes, filetype="pdf")
          except Exception as e:
              return _error("PDF_OPEN_FAILED", f"Unable to open PDF: {e}")

          total_pages = doc.page_count
          # Normalize page bounds (1-based inputs)
          s = max(1, start_page) if start_page else 1
          e = min(end_page, total_pages) if end_page else total_pages
          if s > e or s < 1:
              return _error("BAD_RANGE", f"Invalid page range [{s}, {e}] for total_pages={total_pages}")
          # Apply max_pages cap
          if max_pages is not None:
              e = min(e, s + max_pages - 1)

          parts = []
          chars = 0
          for p in range(s - 1, e):
              page_text = doc.load_page(p).get_text()
              if not page_text:
                  continue
              if max_chars is not None and chars + len(page_text) > max_chars:
                  remain = max_chars - chars
                  if remain > 0:
                      parts.append(page_text[:remain])
                      chars += remain
                  break
              parts.append(page_text)
              chars += len(page_text)
          return "".join(parts)
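The page-range and character-cap rules in the handler can be exercised without a PDF. The sketch below reproduces that logic over an in-memory list of page strings. The function name `select_text` and the sample pages are hypothetical, and it raises `ValueError` where the handler returns a `BAD_RANGE` error, because the `_error` helper is not shown here.

```python
from typing import List, Optional


def select_text(
    pages: List[str],
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
    max_pages: Optional[int] = None,
    max_chars: Optional[int] = None,
) -> str:
    """Apply the handler's page-range and character-cap rules to plain strings."""
    total_pages = len(pages)
    # Normalize 1-based bounds: missing or zero values fall back to the full range.
    s = max(1, start_page) if start_page else 1
    e = min(end_page, total_pages) if end_page else total_pages
    if s > e:
        raise ValueError(f"Invalid page range [{s}, {e}] for total_pages={total_pages}")
    # max_pages shortens the range from the end, never from the start.
    if max_pages is not None:
        e = min(e, s + max_pages - 1)

    parts: List[str] = []
    chars = 0
    for p in range(s - 1, e):
        page_text = pages[p]
        if not page_text:
            continue
        if max_chars is not None and chars + len(page_text) > max_chars:
            # Take only what still fits, then stop reading further pages.
            remain = max_chars - chars
            if remain > 0:
                parts.append(page_text[:remain])
            break
        parts.append(page_text)
        chars += len(page_text)
    return "".join(parts)


pages = ["aaaa", "bbbb", "cccc", "dddd"]
middle = select_text(pages, start_page=2, end_page=3)
capped = select_text(pages, max_chars=6)
```

Two behaviours follow from these rules: an `end_page` beyond the document is clamped to the last page rather than rejected, and `max_chars` truncates in the middle of a page instead of dropping the page entirely.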