
load_article_to_context

Extract arXiv article content into context for analysis, supporting full or partial text retrieval by title or ID with customizable page and character limits.

Instructions

Load the article text into context. Supports title or arXiv ID resolution and partial extraction.

Args:
    title: Article title.
    arxiv_id: arXiv ID.
    start_page: 1-based start page (inclusive).
    end_page: 1-based end page (inclusive).
    max_pages: hard cap on number of pages to extract.
    max_chars: hard cap on number of characters to extract.
    preview: if True, only validate availability and return minimal info.

Returns: Article text or structured error JSON.

Input Schema

Name        Required  Description  Default
title       No        -            -
arxiv_id    No        -            -
start_page  No        -            -
end_page    No        -            -
max_pages   No        -            -
max_chars   No        -            -
preview     No        -            -
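
Because the schema carries no descriptions, a caller has to apply the parameter semantics from the Instructions section itself. The following is a minimal client-side sketch under that assumption. `build_arguments` is a hypothetical helper, not part of the server, and the checks it performs are stricter than what the server is documented to enforce.

```python
from typing import Any, Dict, Optional


def build_arguments(
    title: Optional[str] = None,
    arxiv_id: Optional[str] = None,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
    max_pages: Optional[int] = None,
    max_chars: Optional[int] = None,
    preview: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: validate arguments client-side and drop unset ones."""
    # The tool needs at least one identifier to resolve the article.
    if not title and not arxiv_id:
        raise ValueError("provide either 'arxiv_id' or 'title'")

    # Page bounds are documented as 1-based and inclusive.
    for name, value in (("start_page", start_page), ("end_page", end_page)):
        if value is not None and value < 1:
            raise ValueError(f"{name} is 1-based and must be >= 1")
    if start_page is not None and end_page is not None and start_page > end_page:
        raise ValueError("start_page must not exceed end_page")

    # The caps are documented as hard limits; a non-positive cap extracts nothing.
    for name, value in (("max_pages", max_pages), ("max_chars", max_chars)):
        if value is not None and value < 1:
            raise ValueError(f"{name} must be a positive cap")

    candidates = {
        "title": title,
        "arxiv_id": arxiv_id,
        "start_page": start_page,
        "end_page": end_page,
        "max_pages": max_pages,
        "max_chars": max_chars,
    }
    arguments: Dict[str, Any] = {k: v for k, v in candidates.items() if v is not None}
    if preview:
        arguments["preview"] = True
    return arguments
```

The returned dictionary is intended to serve as the arguments payload of an MCP tool call to load_article_to_context.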

Output Schema

Name    Required  Description  Default
result  Yes       -            -
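
The single `result` field is a string that can hold three different things: article text, the preview JSON (with the keys `status`, `reachable`, `arxiv_id`, and `url`, as produced by the handler's preview branch), or a structured error JSON. A caller therefore has to tell these apart. The sketch below shows one way to do so. `classify_result` is a hypothetical helper, and since the shape of the error JSON is not shown on this page, it assumes that any JSON object without a `reachable` key is an error.

```python
import json
from typing import Any, Tuple


def classify_result(result: str) -> Tuple[str, Any]:
    """Return ("preview", dict), ("error", dict), or ("text", str)."""
    try:
        payload = json.loads(result)
    except ValueError:
        # Not JSON at all: treat it as extracted article text.
        return ("text", result)
    if not isinstance(payload, dict):
        # A bare JSON scalar or list is still just text for our purposes.
        return ("text", result)
    if "reachable" in payload:
        # Key taken from the preview branch of the handler.
        return ("preview", payload)
    # Assumption: every other JSON object is a structured error.
    return ("error", payload)
```

An article whose extracted text happens to be a valid JSON object would be misclassified, so this heuristic is only suitable where that case can be ruled out.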

Implementation Reference

  • The core handler function for the 'load_article_to_context' tool. It resolves the arXiv article URL using title or ID, fetches the PDF, extracts text from the specified page range (with optional limits on pages and characters), and returns the concatenated text. Includes @mcp.tool() decorator for automatic registration with FastMCP. Handles preview mode and errors gracefully.
    @mcp.tool()
    async def load_article_to_context(
        title: Optional[str] = None,
        arxiv_id: Optional[str] = None,
        start_page: Optional[int] = None,
        end_page: Optional[int] = None,
        max_pages: Optional[int] = None,
        max_chars: Optional[int] = None,
        preview: bool = False,
    ) -> str:
        """
        Load the article text into context. Supports title or arXiv ID resolution and partial extraction.
    
        Args:
            title: Article title.
            arxiv_id: arXiv ID.
            start_page: 1-based start page (inclusive).
            end_page: 1-based end page (inclusive).
            max_pages: hard cap on number of pages to extract.
            max_chars: hard cap on number of characters to extract.
            preview: if True, only validate availability and return minimal info.
    
        Returns:
            Article text or structured error JSON.
        """
        result = await resolve_article(title=title, arxiv_id=arxiv_id)
        if isinstance(result, str):
            return result
        article_url, resolved_id = result
    
        if preview:
            # Lightweight availability check
            try:
                async with httpx.AsyncClient(timeout=DEFAULT_TIMEOUT, limits=HTTP_LIMITS) as client:
                    head = await client.head(article_url, headers={"User-Agent": USER_AGENT})
                    ok = head.status_code < 400
            except Exception:
                ok = False
            return json.dumps({"status": "ok" if ok else "error", "reachable": ok, "arxiv_id": resolved_id, "url": article_url})
    
        pdf_bytes = await get_pdf(article_url)
        if pdf_bytes is None:
            return _error("FETCH_FAILED", "Unable to retrieve the article from arXiv.org.")
    
        try:
            doc = fitz.open(stream=pdf_bytes, filetype="pdf")
        except Exception as e:
            return _error("PDF_OPEN_FAILED", f"Unable to open PDF: {e}")
    
        total_pages = doc.page_count
        try:
            # Normalize page bounds (inputs are 1-based and inclusive).
            first = max(1, start_page) if start_page else 1
            last = min(end_page, total_pages) if end_page else total_pages
            if first > last:
                return _error("BAD_RANGE", f"Invalid page range [{first}, {last}] for total_pages={total_pages}")

            # Apply max_pages cap
            if max_pages is not None:
                last = min(last, first + max_pages - 1)

            parts = []
            chars = 0
            # PyMuPDF page indices are 0-based, hence first - 1.
            for p in range(first - 1, last):
                page_text = doc.load_page(p).get_text()
                if not page_text:
                    continue
                if max_chars is not None and chars + len(page_text) > max_chars:
                    # Keep only as much of this page as the character budget allows.
                    remain = max_chars - chars
                    if remain > 0:
                        parts.append(page_text[:remain])
                    break
                parts.append(page_text)
                chars += len(page_text)
            return "".join(parts)
        finally:
            # Release the document on every exit path.
            doc.close()
  • Key helper function used by load_article_to_context to resolve the article's PDF URL and arXiv ID from either a title (via search) or direct ID.
    async def resolve_article(title: Optional[str] = None, arxiv_id: Optional[str] = None) -> Tuple[str, str] | str:
        """
        Resolve to a direct PDF URL and arXiv ID using either a title or an arXiv ID.
        Preference order: arxiv_id > title.
        """
        if arxiv_id:
            m = ARXIV_ID_RE.match(arxiv_id.strip())
            if not m:
                return _error("INVALID_ID", f"Not a valid arXiv ID: {arxiv_id}")
            vid = m.group("id")
            return (f"https://arxiv.org/pdf/{vid}", vid)
        if not title:
            return _error("MISSING_PARAM", "Provide either 'arxiv_id' or 'title'.")
        info = await fetch_information(title)
        if isinstance(info, str):
            return _error("NOT_FOUND", str(info))
        resolved_id = info.id.split("/abs/")[-1]
        direct_pdf_url = f"https://arxiv.org/pdf/{resolved_id}"
        return (direct_pdf_url, resolved_id)
  • Helper function to download the PDF bytes from the resolved arXiv URL, with retry logic.
    async def get_pdf(url: str) -> Optional[bytes]:
        """Get PDF document as bytes from arXiv.org with retries."""
        headers = {"User-Agent": USER_AGENT, "Accept": "application/pdf"}
        async with httpx.AsyncClient(timeout=DEFAULT_TIMEOUT, limits=HTTP_LIMITS) as client:
            for attempt in range(RETRY_ATTEMPTS):
                try:
                    resp = await client.get(url, headers=headers)
                    resp.raise_for_status()
                    return resp.content
                except Exception:
                    if attempt < RETRY_ATTEMPTS - 1:
                        await _retry_sleep(attempt)
                        continue
                    return None
  • The @mcp.tool() decorator registers the load_article_to_context function as an MCP tool with FastMCP instance 'mcp'.
    @mcp.tool()
  • The function signature, with its type annotations and docstring, defines the input schema (parameters) and the output for the tool. It is identical to the signature and docstring shown in the core handler above.
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the full burden. It discloses that the tool supports 'partial extraction' via page and character limits and a 'preview' mode for validation, which adds useful behavioral context beyond the schema. However, beyond noting that failures come back as 'structured error JSON', it says nothing about which errors can occur, performance characteristics, rate limits, or authentication needs, leaving gaps for a tool with 7 parameters.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured with a purpose statement followed by 'Args' and 'Returns' sections, making it easy to parse. It's appropriately sized—every sentence adds value, though the 'Args' section could be more integrated into the flow rather than a separate block. No wasted words.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (7 parameters, no annotations, but with an output schema), the description is mostly complete. It covers parameter semantics thoroughly and mentions the return types ('Article text or structured error JSON'). The output schema likely details the return structure, so the description doesn't need to elaborate further. However, it lacks guidance on tool selection among siblings.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 0% schema description coverage for 7 parameters, the description compensates fully by explaining each parameter's purpose in the 'Args' section. It clarifies that 'title' and 'arxiv_id' are alternative identifiers, pages are '1-based' and 'inclusive', limits are 'hard caps', and 'preview' validates availability. This adds significant meaning beyond the bare schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
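
To make the interaction between the page and character parameters concrete, the sketch below isolates the range logic of the handler shown in the Implementation Reference into two pure functions. The names `page_window` and `cap_chars` are hypothetical, and the functions operate on plain strings rather than a PDF, but the arithmetic mirrors the handler: bounds are clamped to the document, `max_pages` shortens the window from its end, and `max_chars` truncates the page that crosses the budget.

```python
from typing import List, Optional, Tuple


def page_window(
    total_pages: int,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
    max_pages: Optional[int] = None,
) -> Optional[Tuple[int, int]]:
    """Return the 1-based inclusive (first, last) window, or None if invalid."""
    # Unset (or zero) bounds fall back to the whole document.
    first = max(1, start_page) if start_page else 1
    last = min(end_page, total_pages) if end_page else total_pages
    if first > last:
        return None  # the handler reports BAD_RANGE in this case
    if max_pages is not None:
        last = min(last, first + max_pages - 1)
    return (first, last)


def cap_chars(pages: List[str], max_chars: Optional[int]) -> str:
    """Concatenate page texts, stopping once max_chars characters are used."""
    parts: List[str] = []
    used = 0
    for text in pages:
        if not text:
            continue  # pages without extractable text are skipped
        if max_chars is not None and used + len(text) > max_chars:
            remain = max_chars - used
            if remain > 0:
                parts.append(text[:remain])  # partial page fills the budget
            break
        parts.append(text)
        used += len(text)
    return "".join(parts)
```

For a 10-page document, `page_window(10, start_page=3, end_page=8, max_pages=2)` returns `(3, 4)`: the requested range covers pages 3 to 8, and the cap keeps only the first two of them. Likewise, `cap_chars(["abcde", "fghij"], 7)` returns `"abcdefg"`.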

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Load the article text into context' with support for 'title or arXiv ID resolution and partial extraction'. This specifies the verb (load), resource (article text), and key capabilities. However, it doesn't explicitly differentiate from sibling tools like 'download_article' or 'get_details', which likely have overlapping functionality.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives. With siblings like 'download_article', 'get_article_url', 'get_details', and 'search_arxiv', the agent has no indication of which tool to choose for loading article text versus downloading files, getting URLs, retrieving metadata, or searching. No prerequisites or exclusions are mentioned.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/lecigarevolant/arxiv-mcp-server-gpt'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.