get_single_web_page_content

Extract full text content from a specific web page URL for analysis or reference. Provide a URL to retrieve formatted text with a word count, optionally limiting the content length.

Instructions

Extract and return the full content from a single web page URL.

Use this when you have a specific URL and need the full text content for analysis or reference.

Args:
  • url: The URL of the web page to extract content from
  • max_content_length: Maximum characters for the extracted content (0 = no limit)

Returns: Formatted text containing the extracted page content with word count
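
For reference, the returned text block follows the format assembled by the tool handler (shown in the Implementation Reference below); a successful call comes back roughly as:

**Page Content from: https://example.com/article**

...extracted page text...

**Word count:** 1284

The URL and the word count above are illustrative placeholders, not output from a real call.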

Parameter Usage Guidelines

url (required)

  • Must be a valid HTTP or HTTPS URL

  • Include the full URL with protocol (http:// or https://)

  • Examples:

    • "https://example.com/article"

    • "https://docs.python.org/3/library/asyncio.html"

    • "https://github.com/user/repo/blob/main/README.md"

max_content_length (optional, default unlimited)

  • Limits the extracted content to the specified character count (see the truncation sketch after this list)

  • Common values: 10000 (summaries), 50000 (full pages), null (no limit)
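
A minimal sketch of the truncation behavior, mirroring the slice used in the service code shown in the Implementation Reference below: the cut is made at a raw character boundary (so the last word or sentence may be clipped), and both 0 and null disable the limit because they are falsy in the check.

from typing import Optional

def truncate(content: str, max_content_length: Optional[int]) -> str:
    # Mirrors the service's check: 0 and None both mean "no limit".
    if max_content_length and len(content) > max_content_length:
        return content[:max_content_length]
    return content

print(truncate("alpha beta gamma", 10))  # 'alpha beta' (clipped at 10 characters)
print(truncate("alpha beta gamma", 0))   # full text; limit disabled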

Usage Examples

Basic content extraction:

{ "url": "https://example.com/blog/ai-trends-2024" }

Extract with content limit:

{ "url": "https://docs.example.com/api-reference", "max_content_length": 20000 }

Extract documentation:

{ "url": "https://github.com/project/docs/installation.md", "max_content_length": 10000 }

Extract complete article:

{ "url": "https://techblog.com/comprehensive-guide" }

Complete parameter example:

{ "url": "https://docs.python.org/3/library/asyncio.html", "max_content_length": 50000 }

When to Choose This Tool

  • Choose this when you have a specific URL from search results or references

  • Choose this for extracting content from documentation, articles, or blog posts

  • Choose this when you need to analyze or reference specific webpage content

  • Choose this for following up on URLs found in search results

  • Choose this when extracting content from GitHub README files or documentation

Error Handling

  • If the URL is inaccessible, an error message is returned

  • Some sites may block automated access; try alternative URLs

  • Dynamic content may require multiple attempts

  • Large pages may time out; use content length limits (a retry sketch follows this list)
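
Because the handler re-raises extraction failures, the client receives a tool error rather than content. A small sketch of the retry pattern suggested above, reusing a session like the one in the earlier client sketch (the isError check reflects how MCP call results report tool failures, and the 20000-character fallback limit is an arbitrary choice):

async def fetch_page(session, url: str) -> str:
    # First try without a limit; if that fails (e.g. the page is very large
    # or times out), retry with a content cap as suggested above.
    for args in ({"url": url}, {"url": url, "max_content_length": 20000}):
        result = await session.call_tool("get_single_web_page_content", args)
        if not result.isError:
            return result.content[0].text
    raise RuntimeError(f"Could not extract content from {url}")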

Alternative Tools

  • Use full_web_search when you need to find relevant pages first

  • Use get_web_search_summaries for discovering URLs to extract

Input Schema

Name                 Required   Description                                                    Default
url                  Yes        The URL of the web page to extract content from                (none)
max_content_length   No         Maximum characters for the extracted content (0 = no limit)    null (no limit)
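
An approximate JSON Schema reconstruction of these parameters, with descriptions taken from the guidelines above (the exact schema generated by the server may differ):

{
  "type": "object",
  "properties": {
    "url": {
      "type": "string",
      "description": "The URL of the web page to extract content from"
    },
    "max_content_length": {
      "type": "integer",
      "description": "Maximum characters for the extracted content (0 = no limit)"
    }
  },
  "required": ["url"]
}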

Implementation Reference

  • The MCP tool handler that executes the get_single_web_page_content logic: extracts page content using WebSearchService, adds a word count, and returns a formatted text block.
    @mcp.tool()
    @inject_docstring(lambda: load_instruction("instructions_single_page.md", __file__))
    async def get_single_web_page_content(url: str, max_content_length: int = None) -> Dict[str, Any]:
        """Extract content from a single webpage"""
        try:
            logger.info(f"MCP tool get_single_web_page_content: url='{url}'")
            content = await web_search_service.extract_single_page(
                url=url,
                max_content_length=max_content_length
            )
            word_count = len(content.split())
            response_text = f"**Page Content from: {url}**\n\n{content}\n\n"
            response_text += f"**Word count:** {word_count}\n"
            logger.info(f"MCP tool get_single_web_page_content completed: {word_count} words")
            return {
                "content": [{"type": "text", "text": response_text}]
            }
        except Exception as e:
            logger.error(f"MCP tool get_single_web_page_content error: {e}")
            raise
  • server.py:39-42 (registration)
    Calls register_tool to register all web search MCP tools including get_single_web_page_content.
    def register_mcp_tools(self, mcp):
        """Register MCP tools"""
        register_tool(mcp, self.web_search_service)
  • Core helper function in WebSearchService that handles single page content extraction, delegating to internal _extract_page_content method (HTTP then browser fallback).
    async def extract_single_page(self, url: str, max_content_length: Optional[int] = None) -> str:
        """Extract content from a single webpage"""
        logger.info(f"Extracting content from: {url}")
        return await self._extract_page_content(url, max_content_length)
  • Main content extraction logic: tries fast HTTP parsing first, falls back to browser rendering for dynamic content, with quality checks (a sketch of the quality check follows this list).
    async def _extract_page_content(self, url: str, max_content_length: Optional[int]) -> str:
        """Extract readable content from a webpage"""
        try:
            # Try fast HTTP extraction first
            content = await self._extract_with_httpx(url, max_content_length)
            if self._is_meaningful_content(content):
                return content
        except Exception as e:
            logger.debug(f"HTTP extraction failed for {url}: {e}")

        # Fallback to browser extraction
        return await self._extract_with_browser(url, max_content_length)

    async def _extract_with_httpx(self, url: str, max_content_length: Optional[int]) -> str:
        """Fast HTTP-based content extraction"""
        async with httpx.AsyncClient(timeout=10.0, follow_redirects=True) as client:
            response = await client.get(url, headers={'User-Agent': self.ua.random})
            response.raise_for_status()

            soup = BeautifulSoup(response.text, 'lxml')

            # Remove unwanted elements
            for tag in soup(['script', 'style', 'nav', 'header', 'footer', 'aside', 'ads', '.ad', '.advertisement']):
                tag.decompose()

            # Extract main content
            content = ""
            for tag in soup.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'li']):
                text = tag.get_text().strip()
                if text and len(text) > 20:  # Filter short fragments
                    content += text + "\n\n"

            content = content.strip()
            if max_content_length and len(content) > max_content_length:
                content = content[:max_content_length]

            return content

    async def _extract_with_browser(self, url: str, max_content_length: Optional[int]) -> str:
        """Browser-based content extraction for dynamic sites"""
        async with async_playwright() as p:
            browser = await p.firefox.launch(headless=True)
            try:
                context = await browser.new_context(
                    user_agent=self.ua.random,
                    viewport={'width': 1920, 'height': 1080}
                )
                page = await context.new_page()
                await page.goto(url, wait_until='networkidle')
                await page.wait_for_timeout(2000)  # Wait for dynamic content

                # Extract readable text content
                content = await page.evaluate("""
                    () => {
                        // Remove unwanted elements
                        const elements = document.querySelectorAll('script, style, nav, header, footer, aside, .ad, .advertisement');
                        elements.forEach(el => el.remove());

                        // Extract main content
                        const contentSelectors = ['main', 'article', '.content', '.post', '.entry', '#content', '#main'];
                        let content = '';
                        for (const selector of contentSelectors) {
                            const element = document.querySelector(selector);
                            if (element) {
                                content = element.textContent.trim();
                                break;
                            }
                        }

                        // Fallback to body text
                        if (!content) {
                            content = document.body.textContent.trim();
                        }

                        return content;
                    }
                """)

                if max_content_length and len(content) > max_content_length:
                    content = content[:max_content_length]

                return content
            finally:
                await browser.close()
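
The _is_meaningful_content quality check called in _extract_page_content is not included in this excerpt. A minimal sketch of such a heuristic, purely as an assumption about its shape (the threshold and criteria here are not the project's actual values):

    def _is_meaningful_content(self, content: str) -> bool:
        # Hypothetical heuristic: treat empty or very short extractions as a
        # signal that the page is dynamic and needs the browser fallback.
        return bool(content) and len(content.split()) >= 50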
