get_single_web_page_content
Extract the full text content of a specific web page for analysis or reference. Provide a URL to retrieve formatted text with a word count, optionally capping the content length.
Instructions
Extract and return the full content from a single web page URL.
Use this when you have a specific URL and need the full text content for analysis or reference.
Args:
- url: The URL of the web page to extract content from
- max_content_length: Maximum characters for the extracted content (0 = no limit)
Returns: Formatted text containing the extracted page content, followed by a word count.
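Based on the handler shown under Implementation Reference below, the returned text block is shaped like this (body elided, word count illustrative):

```
**Page Content from: https://example.com/article**

...extracted page text...

**Word count:** 1342
```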
Parameter Usage Guidelines
url (required)
- Must be a valid HTTP or HTTPS URL
- Include the full URL with protocol (http:// or https://)
- Examples:
  - "https://example.com/article"
  - "https://docs.python.org/3/library/asyncio.html"
  - "https://github.com/user/repo/blob/main/README.md"
max_content_length (optional; default: no limit)
- Limits the extracted content to the specified character count
- Common values: 10000 (summaries), 50000 (full pages), 0 or null (no limit)
Usage Examples
Basic content extraction:
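A minimal sketch of calling the tool from Python, assuming an already-initialized MCP `ClientSession` named `session` (the URL is illustrative):

```python
# Fetch the full text of one page; no length cap (the default).
result = await session.call_tool(
    "get_single_web_page_content",
    {"url": "https://example.com/article"},
)
```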
Extract with content limit:
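The same call with a character cap, suited to quick summaries (values illustrative):

```python
# Cap the extracted content at 10,000 characters.
result = await session.call_tool(
    "get_single_web_page_content",
    {"url": "https://example.com/article", "max_content_length": 10000},
)
```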
Extract documentation:
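Pulling a documentation page with a larger cap for a full read (URL taken from the examples above; the cap is an arbitrary choice):

```python
# Extract a documentation page with room for the whole text.
result = await session.call_tool(
    "get_single_web_page_content",
    {"url": "https://docs.python.org/3/library/asyncio.html",
     "max_content_length": 50000},
)
```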
Extract a complete article with all parameters spelled out:
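Both parameters given explicitly; per the parameter notes above, 0 disables the cap:

```python
# max_content_length=0 means no truncation at all.
result = await session.call_tool(
    "get_single_web_page_content",
    {"url": "https://example.com/article", "max_content_length": 0},
)
```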
When to Choose This Tool
- Choose this when you have a specific URL from search results or references
- Choose this for extracting content from documentation, articles, or blog posts
- Choose this when you need to analyze or reference specific webpage content
- Choose this for following up on URLs found in search results
- Choose this when extracting content from GitHub README files or documentation
Error Handling
- If a URL is inaccessible, an error message is returned
- Some sites may block automated access; try alternative URLs
- Dynamic content may require multiple attempts
- Large pages may time out; use content length limits
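One way a client might apply the advice above. This is a hypothetical helper, not part of the tool itself; it assumes the same initialized `ClientSession` named `session`:

```python
async def fetch_page_with_fallback(session, url: str):
    """Try a full extraction; on failure, retry with a content cap."""
    try:
        result = await session.call_tool(
            "get_single_web_page_content", {"url": url}
        )
        if not result.isError:
            return result
    except Exception:
        pass  # e.g. a transport timeout on a very large page
    # Retry with a 10,000-character cap to reduce timeout risk.
    return await session.call_tool(
        "get_single_web_page_content",
        {"url": url, "max_content_length": 10000},
    )
```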
Alternative Tools
- Use `full_web_search` when you need to find relevant pages first
- Use `get_web_search_summaries` for discovering URLs to extract
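A typical flow combines the tools: discover URLs with a search tool, then extract one page in full. A sketch assuming the same `session`; the `query` parameter name for `get_web_search_summaries` is an assumption, and picking a URL out of the returned text is left to the caller:

```python
# 1) Discover candidate URLs (parameter name assumed).
hits = await session.call_tool(
    "get_web_search_summaries", {"query": "asyncio best practices"}
)
# 2) Pick a URL from the returned text, then extract it in full.
page = await session.call_tool(
    "get_single_web_page_content",
    {"url": "https://docs.python.org/3/library/asyncio.html"},
)
```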
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | The URL of the web page to extract content from | — |
| max_content_length | No | Maximum characters for the extracted content (0 = no limit) | null (no limit) |
Implementation Reference
- features/web_search/tool.py:156-181 (handler): The MCP tool handler that executes the get_single_web_page_content logic: extracts page content using WebSearchService, adds a word count, and returns a formatted text block.

```python
@mcp.tool()
@inject_docstring(lambda: load_instruction("instructions_single_page.md", __file__))
async def get_single_web_page_content(url: str, max_content_length: int = None) -> Dict[str, Any]:
    """Extract content from a single webpage"""
    try:
        logger.info(f"MCP tool get_single_web_page_content: url='{url}'")
        content = await web_search_service.extract_single_page(
            url=url,
            max_content_length=max_content_length
        )
        word_count = len(content.split())
        response_text = f"**Page Content from: {url}**\n\n{content}\n\n"
        response_text += f"**Word count:** {word_count}\n"
        logger.info(f"MCP tool get_single_web_page_content completed: {word_count} words")
        return {
            "content": [{"type": "text", "text": response_text}]
        }
    except Exception as e:
        logger.error(f"MCP tool get_single_web_page_content error: {e}")
        raise
```
- server.py:39-42 (registration): Calls register_tool to register all web search MCP tools, including get_single_web_page_content.

```python
def register_mcp_tools(self, mcp):
    """Register MCP tools"""
    register_tool(mcp, self.web_search_service)
```
- Core helper in WebSearchService that handles single-page content extraction, delegating to the internal _extract_page_content method (HTTP first, then browser fallback).

```python
async def extract_single_page(self, url: str, max_content_length: Optional[int] = None) -> str:
    """Extract content from a single webpage"""
    logger.info(f"Extracting content from: {url}")
    return await self._extract_page_content(url, max_content_length)
```
- Main content extraction logic: tries fast HTTP parsing first, falls back to browser rendering for dynamic content, with quality checks.

```python
async def _extract_page_content(self, url: str, max_content_length: Optional[int]) -> str:
    """Extract readable content from a webpage"""
    try:
        # Try fast HTTP extraction first
        content = await self._extract_with_httpx(url, max_content_length)
        if self._is_meaningful_content(content):
            return content
    except Exception as e:
        logger.debug(f"HTTP extraction failed for {url}: {e}")

    # Fallback to browser extraction
    return await self._extract_with_browser(url, max_content_length)

async def _extract_with_httpx(self, url: str, max_content_length: Optional[int]) -> str:
    """Fast HTTP-based content extraction"""
    async with httpx.AsyncClient(timeout=10.0, follow_redirects=True) as client:
        response = await client.get(url, headers={'User-Agent': self.ua.random})
        response.raise_for_status()

        soup = BeautifulSoup(response.text, 'lxml')

        # Remove unwanted elements
        for tag in soup(['script', 'style', 'nav', 'header', 'footer', 'aside',
                         'ads', '.ad', '.advertisement']):
            tag.decompose()

        # Extract main content
        content = ""
        for tag in soup.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'li']):
            text = tag.get_text().strip()
            if text and len(text) > 20:  # Filter short fragments
                content += text + "\n\n"

        content = content.strip()
        if max_content_length and len(content) > max_content_length:
            content = content[:max_content_length]

        return content

async def _extract_with_browser(self, url: str, max_content_length: Optional[int]) -> str:
    """Browser-based content extraction for dynamic sites"""
    async with async_playwright() as p:
        browser = await p.firefox.launch(headless=True)
        try:
            context = await browser.new_context(
                user_agent=self.ua.random,
                viewport={'width': 1920, 'height': 1080}
            )
            page = await context.new_page()
            await page.goto(url, wait_until='networkidle')
            await page.wait_for_timeout(2000)  # Wait for dynamic content

            # Extract readable text content
            content = await page.evaluate("""
                () => {
                    // Remove unwanted elements
                    const elements = document.querySelectorAll(
                        'script, style, nav, header, footer, aside, .ad, .advertisement');
                    elements.forEach(el => el.remove());

                    // Extract main content
                    const contentSelectors = ['main', 'article', '.content',
                        '.post', '.entry', '#content', '#main'];
                    let content = '';
                    for (const selector of contentSelectors) {
                        const element = document.querySelector(selector);
                        if (element) {
                            content = element.textContent.trim();
                            break;
                        }
                    }

                    // Fallback to body text
                    if (!content) {
                        content = document.body.textContent.trim();
                    }

                    return content;
                }
            """)

            if max_content_length and len(content) > max_content_length:
                content = content[:max_content_length]

            return content
        finally:
            await browser.close()
```