get_single_web_page_content

by geosp

Extract full text content from a specific web page URL for analysis or reference, with optional character limit control.

Instructions

Extract and return the full content from a single web page URL.

Use this when you have a specific URL and need the full text content for analysis or reference.

Args:

  • url: The URL of the web page to extract content from

  • max_content_length: Maximum characters for the extracted content (0 = no limit)

Returns: Formatted text containing the extracted page content, followed by a word count

Parameter Usage Guidelines

url (required)

  • Must be a valid HTTP or HTTPS URL

  • Include the full URL with protocol (http:// or https://)

  • Examples:

    • "https://example.com/article"

    • "https://docs.python.org/3/library/asyncio.html"

    • "https://github.com/user/repo/blob/main/README.md"
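A quick caller-side sanity check for these URL requirements can be written with the standard library (an illustrative sketch, not part of the tool itself):

```python
from urllib.parse import urlparse

def is_valid_page_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

For example, `is_valid_page_url("https://example.com/article")` passes, while a URL missing its protocol ("example.com/article") is rejected.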

max_content_length (optional; defaults to no limit)

  • Limits the extracted content to the specified number of characters

  • Common values: 10000 (summaries), 50000 (full pages); pass null or 0, or omit it, for no limit
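The limit semantics described above (a positive limit truncates, while null or 0 means no limit) can be sketched as a small helper. This illustrates the documented behavior, not the server's actual code:

```python
def apply_content_limit(content: str, max_content_length=None) -> str:
    """Truncate to max_content_length characters; None or 0 means no limit."""
    if not max_content_length:
        return content
    return content[:max_content_length]
```

Note that a plain character cut may split the last word; the real service may trim more carefully.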

Usage Examples

Basic content extraction:

{
  "url": "https://example.com/blog/ai-trends-2024"
}

Extract with content limit:

{
  "url": "https://docs.example.com/api-reference",
  "max_content_length": 20000
}

Extract documentation:

{
  "url": "https://github.com/project/docs/installation.md",
  "max_content_length": 10000
}

Extract complete article:

{
  "url": "https://techblog.com/comprehensive-guide"
}

Complete parameter example:

{
  "url": "https://docs.python.org/3/library/asyncio.html",
  "max_content_length": 50000
}

When to Choose This Tool

  • Choose this when you have a specific URL from search results or references

  • Choose this for extracting content from documentation, articles, or blog posts

  • Choose this when you need to analyze or reference specific webpage content

  • Choose this for following up on URLs found in search results

  • Choose this when extracting content from GitHub README files or documentation

Error Handling

  • If the URL is inaccessible, an error message is returned

  • Some sites block automated access; try an alternative URL

  • Dynamic (JavaScript-rendered) content may require multiple attempts

  • Large pages may time out; set max_content_length to bound extraction
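Taken together, these guidelines suggest a caller-side retry loop that shrinks the content limit after each failure. This is a hypothetical sketch; `extract` stands in for whatever async client call invokes the tool:

```python
import asyncio

async def extract_with_retries(extract, url, max_content_length=50000, attempts=3):
    """Retry extraction, halving the content limit each attempt to dodge timeouts.

    `extract` is any async callable taking (url, max_content_length).
    """
    last_error = None
    for _ in range(attempts):
        try:
            return await extract(url, max_content_length)
        except Exception as exc:  # blocked access, timeout, unreachable host
            last_error = exc
            max_content_length = max(1000, max_content_length // 2)
    raise last_error
```

Halving the limit trades completeness for reliability, which matches the "large pages may time out" advice above.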

Alternative Tools

  • Use full_web_search when you need to find relevant pages first

  • Use get_web_search_summaries for discovering URLs to extract

Input Schema

Name                 Required  Description                    Default
url                  Yes       URL to extract content from    -
max_content_length   No        Maximum content length         null

Implementation Reference

  • MCP tool handler function that executes the tool logic: logs input, calls WebSearchService.extract_single_page, formats response with content and word count, returns MCP-formatted content block.
    @mcp.tool()
    @inject_docstring(lambda: load_instruction("instructions_single_page.md", __file__))
    async def get_single_web_page_content(url: str, max_content_length: Optional[int] = None) -> Dict[str, Any]:
        """Extract content from a single webpage"""
        try:
            logger.info(f"MCP tool get_single_web_page_content: url='{url}'")
    
            content = await web_search_service.extract_single_page(
                url=url,
                max_content_length=max_content_length
            )
    
            word_count = len(content.split())
    
            response_text = f"**Page Content from: {url}**\n\n{content}\n\n"
            response_text += f"**Word count:** {word_count}\n"
    
            logger.info(f"MCP tool get_single_web_page_content completed: {word_count} words")
    
            return {
                "content": [{"type": "text", "text": response_text}]
            }
    
        except Exception as e:
            logger.error(f"MCP tool get_single_web_page_content error: {e}")
            raise
  • Function to register all web search MCP tools, including get_single_web_page_content, by defining them with @mcp.tool() decorators inside.
    def register_tool(mcp: FastMCP, web_search_service: WebSearchService) -> None:
        """
        Register web search tools with the MCP server
    
        Args:
            mcp: FastMCP server instance
            web_search_service: WebSearchService instance
        """
  • Pydantic schema defining input parameters for single page content extraction (url, max_content_length), matching the tool signature.
    class SinglePageRequest(BaseModel):
        """Request model for single page content extraction"""
        url: str = Field(..., description="URL to extract content from")
        max_content_length: Optional[int] = Field(default=None, ge=0,
                                                description="Maximum content length")
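The ge=0 constraint means negative limits are rejected at validation time. A minimal round-trip, reproducing the model above (assuming pydantic is installed):

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError

class SinglePageRequest(BaseModel):
    """Request model for single page content extraction"""
    url: str = Field(..., description="URL to extract content from")
    max_content_length: Optional[int] = Field(default=None, ge=0,
                                              description="Maximum content length")

# Valid: limit omitted, defaults to None (no limit)
request = SinglePageRequest(url="https://example.com/article")

# Invalid: a negative limit fails the ge=0 constraint
try:
    SinglePageRequest(url="https://example.com/article", max_content_length=-1)
except ValidationError:
    pass  # expected
```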
  • Core service method implementing single page content extraction by delegating to _extract_page_content.
    async def extract_single_page(self, url: str, max_content_length: Optional[int] = None) -> str:
        """Extract content from a single webpage"""
        logger.info(f"Extracting content from: {url}")
    
        return await self._extract_page_content(url, max_content_length)
  • Main content extraction logic: attempts fast HTTP extraction with BeautifulSoup, falls back to Playwright browser rendering for dynamic content.
    async def _extract_page_content(self, url: str, max_content_length: Optional[int]) -> str:
        """Extract readable content from a webpage"""
        try:
            # Try fast HTTP extraction first
            content = await self._extract_with_httpx(url, max_content_length)
            if self._is_meaningful_content(content):
                return content
        except Exception as e:
            logger.debug(f"HTTP extraction failed for {url}: {e}")
    
        # Fallback to browser extraction
        return await self._extract_with_browser(url, max_content_length)
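The fallback decision hinges on _is_meaningful_content, which is not shown above. A plausible heuristic (an assumption about its behavior, not the project's actual implementation) is a minimum word count, since a near-empty result usually indicates a JavaScript shell or a block page:

```python
def is_meaningful_content(content, min_words=50):
    """Heuristic (assumed): a page is meaningful if it yields at least min_words words."""
    if not content:
        return False
    return len(content.split()) >= min_words
```

Under this heuristic, a bare "Please enable JavaScript" stub fails the check, triggering the Playwright browser fallback.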
