load_article_to_context

Extract arXiv article content into context for analysis, supporting full or partial text retrieval by title or ID with customizable page and character limits.

Instructions

Load the article text into context. Supports title or arXiv ID resolution and partial extraction.

Args:
    title: Article title.
    arxiv_id: arXiv ID.
    start_page: 1-based start page (inclusive).
    end_page: 1-based end page (inclusive).
    max_pages: hard cap on number of pages to extract.
    max_chars: hard cap on number of characters to extract.
    preview: if True, only validate availability and return minimal info.

Returns: Article text or structured error JSON.
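Because the tool returns either plain article text or a JSON string, a caller has to tell the two apart. The following is a minimal client-side sketch, not part of the server: `parse_tool_result` is a hypothetical helper, and it assumes that structured errors and preview responses are serialized as a JSON object.

```python
import json
from typing import Any, Dict, Union


def parse_tool_result(raw: str) -> Union[str, Dict[str, Any]]:
    """Return a dict for JSON-object responses, otherwise the raw article text.

    Hypothetical helper: assumes errors and preview results are JSON objects.
    """
    stripped = raw.lstrip()
    # Article text extracted from a PDF will normally not start with "{".
    if not stripped.startswith("{"):
        return raw
    try:
        parsed = json.loads(stripped)
    except json.JSONDecodeError:
        # Looked like JSON but was not; treat it as article text.
        return raw
    return parsed if isinstance(parsed, dict) else raw


preview = parse_tool_result('{"status": "ok", "reachable": true}')
text = parse_tool_result("Abstract\nWe study ...")
```

An article whose extracted text happens to be a valid JSON object would be misclassified by this heuristic, so treat it as a convenience rather than a guarantee.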

Input Schema


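Name	Required	Description	Default
`title`	No	Article title. Used only when `arxiv_id` is not provided.	None
`arxiv_id`	No	arXiv ID. Takes precedence over `title`.	None
`start_page`	No	1-based start page (inclusive).	None
`end_page`	No	1-based end page (inclusive).	None
`max_pages`	No	Hard cap on number of pages to extract.	None
`max_chars`	No	Hard cap on number of characters to extract.	None
`preview`	No	If True, only validate availability and return minimal info.	False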
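Although every parameter is optional on its own, the handler returns a MISSING_PARAM error when neither `title` nor `arxiv_id` is given. A small sketch of how a client might assemble the arguments follows; `build_arguments` is a hypothetical helper, and the arXiv ID in the usage line is only an illustrative value.

```python
from typing import Any, Dict, Optional


def build_arguments(
    title: Optional[str] = None,
    arxiv_id: Optional[str] = None,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
    max_pages: Optional[int] = None,
    max_chars: Optional[int] = None,
    preview: bool = False,
) -> Dict[str, Any]:
    """Build the tool arguments, omitting parameters left at their defaults."""
    # Mirror the server-side rule: at least one identifier is needed.
    if not title and not arxiv_id:
        raise ValueError("Provide either 'arxiv_id' or 'title'.")
    candidate = {
        "title": title,
        "arxiv_id": arxiv_id,
        "start_page": start_page,
        "end_page": end_page,
        "max_pages": max_pages,
        "max_chars": max_chars,
    }
    args = {key: value for key, value in candidate.items() if value is not None}
    if preview:
        args["preview"] = True
    return args


# Illustrative request: the first two pages of one paper.
first_two_pages = build_arguments(arxiv_id="1706.03762", start_page=1, end_page=2)
```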

Implementation Reference

src/arxiv_server/server.py:216-290 (handler)
The core handler for the 'load_article_to_context' tool, registered with FastMCP through the @mcp.tool() decorator. It resolves the article URL from a title or arXiv ID, fetches the PDF, extracts text from the requested page range (optionally capped by max_pages and max_chars), and returns the concatenated text. In preview mode it only sends a HEAD request and returns a small JSON status object. Failures are returned as structured error JSON with codes such as FETCH_FAILED, PDF_OPEN_FAILED, and BAD_RANGE.
@mcp.tool()
async def load_article_to_context(
    title: Optional[str] = None,
    arxiv_id: Optional[str] = None,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
    max_pages: Optional[int] = None,
    max_chars: Optional[int] = None,
    preview: bool = False,
) -> str:
    """
    Load the article text into context. Supports title or arXiv ID resolution and partial extraction.

    Args:
        title: Article title.
        arxiv_id: arXiv ID.
        start_page: 1-based start page (inclusive).
        end_page: 1-based end page (inclusive).
        max_pages: hard cap on number of pages to extract.
        max_chars: hard cap on number of characters to extract.
        preview: if True, only validate availability and return minimal info.

    Returns:
        Article text or structured error JSON.
    """
    result = await resolve_article(title=title, arxiv_id=arxiv_id)
    if isinstance(result, str):
        return result
    article_url, resolved_id = result

    if preview:
        # Lightweight availability check
        try:
            async with httpx.AsyncClient(timeout=DEFAULT_TIMEOUT, limits=HTTP_LIMITS) as client:
                head = await client.head(article_url, headers={"User-Agent": USER_AGENT})
                ok = head.status_code < 400
        except Exception:
            ok = False
        return json.dumps({"status": "ok" if ok else "error", "reachable": ok, "arxiv_id": resolved_id, "url": article_url})

    pdf_bytes = await get_pdf(article_url)
    if pdf_bytes is None:
        return _error("FETCH_FAILED", "Unable to retrieve the article from arXiv.org.")

    try:
        doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    except Exception as e:
        return _error("PDF_OPEN_FAILED", f"Unable to open PDF: {e}")

    total_pages = doc.page_count
    # Normalize page bounds (1-based inputs)
    s = max(1, start_page) if start_page else 1
    e = min(end_page, total_pages) if end_page else total_pages
    if s > e or s < 1:
        return _error("BAD_RANGE", f"Invalid page range [{s}, {e}] for total_pages={total_pages}")

    # Apply max_pages cap
    if max_pages is not None:
        e = min(e, s + max_pages - 1)

    parts = []
    chars = 0
    for p in range(s - 1, e):
        page_text = doc.load_page(p).get_text()
        if not page_text:
            continue
        if max_chars is not None and chars + len(page_text) > max_chars:
            remain = max_chars - chars
            if remain > 0:
                parts.append(page_text[:remain])
                chars += remain
            break
        parts.append(page_text)
        chars += len(page_text)
    return "".join(parts)
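The page-range and character-cap rules in the handler can be studied without a PDF. The sketch below restates them as two pure functions; the names `normalize_page_range` and `take_with_char_cap` are hypothetical and do not appear in the server, and returning `None` stands in for the handler's BAD_RANGE error.

```python
from typing import List, Optional, Tuple


def normalize_page_range(
    total_pages: int,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
    max_pages: Optional[int] = None,
) -> Optional[Tuple[int, int]]:
    """Return the 1-based inclusive (start, end) range, or None if it is invalid."""
    # Missing or zero bounds fall back to the whole document, as in the handler.
    s = max(1, start_page) if start_page else 1
    e = min(end_page, total_pages) if end_page else total_pages
    if s > e:
        return None
    # The page cap shortens the range from its end.
    if max_pages is not None:
        e = min(e, s + max_pages - 1)
    return s, e


def take_with_char_cap(pages: List[str], max_chars: Optional[int] = None) -> str:
    """Concatenate page texts, truncating the page that crosses max_chars."""
    parts: List[str] = []
    chars = 0
    for page_text in pages:
        if not page_text:
            continue  # skip pages with no extractable text
        if max_chars is not None and chars + len(page_text) > max_chars:
            remain = max_chars - chars
            if remain > 0:
                parts.append(page_text[:remain])
            break
        parts.append(page_text)
        chars += len(page_text)
    return "".join(parts)


page_range = normalize_page_range(10, start_page=2, end_page=9, max_pages=3)
clipped = take_with_char_cap(["abcde", "fghij"], max_chars=7)
```

Two edge cases follow from the logic as written: a `start_page` or `end_page` of 0 is treated as unset, and `max_pages=0` yields an empty range, so the result is an empty string rather than an error.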
src/arxiv_server/server.py:123-142 (helper)
Key helper function used by load_article_to_context to resolve the article's PDF URL and arXiv ID from either a title (via search) or direct ID.
async def resolve_article(title: Optional[str] = None, arxiv_id: Optional[str] = None) -> Tuple[str, str] | str:
    """
    Resolve to a direct PDF URL and arXiv ID using either a title or an arXiv ID.
    Preference order: arxiv_id > title.
    """
    if arxiv_id:
        m = ARXIV_ID_RE.match(arxiv_id.strip())
        if not m:
            return _error("INVALID_ID", f"Not a valid arXiv ID: {arxiv_id}")
        vid = m.group("id")
        return (f"https://arxiv.org/pdf/{vid}", vid)

    if not title:
        return _error("MISSING_PARAM", "Provide either 'arxiv_id' or 'title'.")

    info = await fetch_information(title)
    if isinstance(info, str):
        return _error("NOT_FOUND", str(info))

    resolved_id = info.id.split("/abs/")[-1]
    direct_pdf_url = f"https://arxiv.org/pdf/{resolved_id}"
    return (direct_pdf_url, resolved_id)
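The definition of `ARXIV_ID_RE` is not shown in this excerpt. The sketch below is one plausible pattern, written as an assumption rather than the server's actual regex: it accepts new-style IDs (such as 2301.00001v2) and old-style IDs (such as hep-th/9901001) and exposes the named group `id` that the helper reads. It also shows the `/abs/` split used on the title path.

```python
import re
from typing import Optional

# Assumed pattern, not the server's ARXIV_ID_RE:
#   new style: YYMM.NNNN or YYMM.NNNNN, optional version suffix
#   old style: archive[.SUBJECT]/NNNNNNN, optional version suffix
ARXIV_ID_SKETCH = re.compile(
    r"^(?P<id>(?:\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?:v\d+)?)$"
)


def pdf_url_for(arxiv_id: str) -> Optional[str]:
    """Return the direct PDF URL for a well-formed ID, or None otherwise."""
    m = ARXIV_ID_SKETCH.match(arxiv_id.strip())
    if not m:
        return None
    return f"https://arxiv.org/pdf/{m.group('id')}"


def id_from_abs_url(abs_url: str) -> str:
    """Take everything after '/abs/', as the title path of the helper does."""
    return abs_url.split("/abs/")[-1]


url = pdf_url_for(" 2301.00001v2 ")
resolved = id_from_abs_url("http://arxiv.org/abs/1706.03762v7")
```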
src/arxiv_server/server.py:68-82 (helper)
Helper function to download the PDF bytes from the resolved arXiv URL, with retry logic.
async def get_pdf(url: str) -> Optional[bytes]:
    """Get PDF document as bytes from arXiv.org with retries."""
    headers = {"User-Agent": USER_AGENT, "Accept": "application/pdf"}
    async with httpx.AsyncClient(timeout=DEFAULT_TIMEOUT, limits=HTTP_LIMITS) as client:
        for attempt in range(RETRY_ATTEMPTS):
            try:
                resp = await client.get(url, headers=headers)
                resp.raise_for_status()
                return resp.content
            except Exception:
                if attempt < RETRY_ATTEMPTS - 1:
                    await _retry_sleep(attempt)
                    continue
    return None
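The retry loop depends on `_retry_sleep`, whose body is not part of this excerpt. As an illustration of the same pattern, the sketch below assumes capped exponential backoff; `backoff_delay` and `retry_async` are hypothetical names, and the real helper may wait differently.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Delay before the next try: base * 2**attempt, limited to cap (assumed policy)."""
    return min(cap, base * (2 ** attempt))


async def retry_async(
    op: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base: float = 0.5,
) -> Optional[T]:
    """Run op up to `attempts` times; return its result, or None if all tries fail."""
    for attempt in range(attempts):
        try:
            return await op()
        except Exception:
            # Sleep only when another attempt is still available.
            if attempt < attempts - 1:
                await asyncio.sleep(backoff_delay(attempt, base))
    return None
```

Returning `None` after the final failure matches `get_pdf`, which leaves it to the handler to turn a missing download into a FETCH_FAILED error.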
src/arxiv_server/server.py:216-216 (registration)
The @mcp.tool() decorator registers the load_article_to_context function as an MCP tool with FastMCP instance 'mcp'.
@mcp.tool()
src/arxiv_server/server.py:217-240 (schema)
Function signature with type annotations and docstring defining the input schema (parameters) and output for the tool.
async def load_article_to_context(
    title: Optional[str] = None,
    arxiv_id: Optional[str] = None,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
    max_pages: Optional[int] = None,
    max_chars: Optional[int] = None,
    preview: bool = False,
) -> str:
    """
    Load the article text into context. Supports title or arXiv ID resolution and partial extraction.

    Args:
        title: Article title.
        arxiv_id: arXiv ID.
        start_page: 1-based start page (inclusive).
        end_page: 1-based end page (inclusive).
        max_pages: hard cap on number of pages to extract.
        max_chars: hard cap on number of characters to extract.
        preview: if True, only validate availability and return minimal info.

    Returns:
        Article text or structured error JSON.
    """
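To see how a signature like this can yield the Required and Default columns of an input schema, the sketch below inspects a stub with the same parameters using only the standard library. It is an illustration of the idea, not FastMCP's actual schema generation; `load_article_to_context_stub` and `describe_parameters` are hypothetical names.

```python
import inspect
from typing import Any, Callable, List, Optional, Tuple


async def load_article_to_context_stub(
    title: Optional[str] = None,
    arxiv_id: Optional[str] = None,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
    max_pages: Optional[int] = None,
    max_chars: Optional[int] = None,
    preview: bool = False,
) -> str:
    """Stub with the tool's signature; the body is irrelevant here."""
    return ""


def describe_parameters(func: Callable[..., Any]) -> List[Tuple[str, bool, Any]]:
    """Return (name, required, default) for each parameter of func."""
    rows = []
    for name, param in inspect.signature(func).parameters.items():
        # A parameter without a default must be supplied by the caller.
        required = param.default is inspect.Parameter.empty
        rows.append((name, required, None if required else param.default))
    return rows


rows = describe_parameters(load_article_to_context_stub)
```

Every parameter has a default, so none is reported as required, which agrees with the "No" entries in the input schema table.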

arXiv MCP Server