# fetch-text
Extract text content from URLs using HTTP methods, converting HTML to Markdown format for readable output.
## Instructions
Fetch text content from a URL using various HTTP methods. Defaults to converting HTML to Markdown format.
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | The URL to get text content from | |
| method | No | HTTP method to use (GET, POST, PUT, DELETE, PATCH, etc.). Default is GET. | GET |
| data | No | Request body data for POST/PUT/PATCH requests. Can be a JSON object or string. | |
| headers | No | Additional HTTP headers to include in the request | |
| output_format | No | Output format: 'markdown' (default), 'clean_text', or 'raw_html'. | markdown |
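For illustration, a JSON-RPC `tools/call` request invoking this tool might be built as follows. The URL, header values, and request `id` are hypothetical; the exact envelope depends on your MCP client.

```python
import json

# Hypothetical fetch-text invocation; only "url" is required,
# the other arguments fall back to their schema defaults.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "fetch-text",
        "arguments": {
            "url": "https://example.com/article",
            "method": "GET",
            "headers": {"Accept-Language": "en"},
            "output_format": "markdown",
        },
    },
}

print(json.dumps(request, indent=2))
```

Omitting `method` and `output_format` is equivalent to passing `"GET"` and `"markdown"`, per the defaults in the schema above.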
## Implementation Reference
- `src/jsonrpc_mcp/server.py:215-225` (handler): Handler logic for the 'fetch-text' tool within the `@server.call_tool()` dispatcher. Validates the `url` argument and invokes `fetch_url_content` with `as_json=False` for text processing.

  ```python
  elif tool_name == "fetch-text":
      url = args.get("url")
      if not url or not isinstance(url, str):
          result = "Failed to call tool, error: Missing required property: url"
      else:
          method = args.get("method", "GET")
          data = args.get("data")
          headers = args.get("headers")
          output_format = args.get("output_format", "markdown")
          result = await fetch_url_content(url, as_json=False, method=method, data=data,
                                           headers=headers, output_format=output_format)
  ```
- `src/jsonrpc_mcp/server.py:60-93` (registration): Tool registration in `@server.list_tools()`, defining the name, description, and input schema for 'fetch-text'.

  ```python
  types.Tool(
      name="fetch-text",
      description="Fetch text content from a URL using various HTTP methods. Defaults to converting HTML to Markdown format.",
      inputSchema={
          "type": "object",
          "properties": {
              "url": {
                  "type": "string",
                  "description": "The URL to get text content from",
              },
              "method": {
                  "type": "string",
                  "description": "HTTP method to use (GET, POST, PUT, DELETE, PATCH, etc.). Default is GET.",
                  "default": "GET"
              },
              "data": {
                  "type": ["object", "string", "null"],
                  "description": "Request body data for POST/PUT/PATCH requests. Can be a JSON object or string.",
              },
              "headers": {
                  "type": "object",
                  "description": "Additional HTTP headers to include in the request",
                  "additionalProperties": {"type": "string"}
              },
              "output_format": {
                  "type": "string",
                  "description": "Output format: 'markdown' (default), 'clean_text', or 'raw_html'.",
                  "enum": ["markdown", "clean_text", "raw_html"],
                  "default": "markdown"
              }
          },
          "required": ["url"],
      },
  ),
  ```
- `src/jsonrpc_mcp/server.py:63-93` (schema): Input schema definition for the 'fetch-text' tool, specifying parameters like `url` (required), `method`, `data`, `headers`, and `output_format`. The schema is reproduced in full within the registration entry above.
- `src/jsonrpc_mcp/utils.py:232-343` (helper): Primary helper function implementing URL fetching logic for the 'fetch-text' tool (called with `as_json=False`). Performs HTTP requests using httpx, validates responses, handles various methods, and processes text content via `extract_text_content`.

  ```python
  async def fetch_url_content(
      url: str,
      as_json: bool = True,
      method: str = "GET",
      data: dict | str | None = None,
      headers: dict[str, str] | None = None,
      output_format: str = "markdown"
  ) -> str:
      """
      Fetch content from a URL using different HTTP methods.

      Args:
          url: URL to fetch content from
          as_json: If True, validates content as JSON; if False, returns text content
          method: HTTP method (GET, POST, PUT, DELETE, etc.)
          data: Request body data (for POST/PUT requests)
          headers: Additional headers to include in the request
          output_format: If as_json=False, output format - "markdown", "clean_text", or "raw_html"

      Returns:
          String content from the URL (JSON, Markdown, clean text, or raw HTML)

      Raises:
          httpx.RequestError: For network-related errors
          json.JSONDecodeError: If as_json=True and content is not valid JSON
          ValueError: If URL is invalid or unsafe
      """
      # Validate URL first
      validate_url(url)

      config = await get_http_client_config()
      max_size = config.pop("max_size", 10 * 1024 * 1024)  # Remove from client config

      # Merge additional headers with config headers (user headers override defaults)
      if headers:
          if config.get("headers"):
              config["headers"].update(headers)
          else:
              config["headers"] = headers

      async with httpx.AsyncClient(**config) as client:
          # Handle different HTTP methods
          method = method.upper()

          if method == "GET":
              response = await client.get(url)
          elif method == "POST":
              if isinstance(data, dict):
                  response = await client.post(url, json=data)
              else:
                  response = await client.post(url, content=data)
          elif method == "PUT":
              if isinstance(data, dict):
                  response = await client.put(url, json=data)
              else:
                  response = await client.put(url, content=data)
          elif method == "DELETE":
              response = await client.delete(url)
          elif method == "PATCH":
              if isinstance(data, dict):
                  response = await client.patch(url, json=data)
              else:
                  response = await client.patch(url, content=data)
          elif method == "HEAD":
              response = await client.head(url)
          elif method == "OPTIONS":
              response = await client.options(url)
          else:
              # For any other method, use the generic request method
              if isinstance(data, dict):
                  response = await client.request(method, url, json=data)
              else:
                  response = await client.request(method, url, content=data)

          response.raise_for_status()

          # Check response size
          content_length = len(response.content)
          if content_length > max_size:
              raise ValueError(f"Response size ({content_length} bytes) exceeds maximum allowed ({max_size} bytes)")

          if as_json:
              # For JSON responses, use response.text directly (no compression expected)
              content_to_parse = response.text
              if not content_to_parse:
                  # If response.text is empty, try decoding content directly
                  try:
                      content_to_parse = response.content.decode('utf-8')
                  except UnicodeDecodeError:
                      content_to_parse = ""

              if content_to_parse:
                  try:
                      json.loads(content_to_parse)
                      return content_to_parse
                  except json.JSONDecodeError:
                      # If text parsing fails, try content decoding as fallback
                      if content_to_parse == response.text:
                          try:
                              fallback_content = response.content.decode('utf-8')
                              json.loads(fallback_content)
                              return fallback_content
                          except (json.JSONDecodeError, UnicodeDecodeError):
                              pass
                      raise json.JSONDecodeError("Response is not valid JSON", content_to_parse, 0)
              else:
                  # Empty response
                  return ""
          else:
              # For text content, apply format conversion
              return extract_text_content(response.text, output_format)
  ```
- `src/jsonrpc_mcp/utils.py:62-121` (helper): Supporting utility for formatting fetched HTML content into Markdown, clean text, or raw HTML. Invoked by `fetch_url_content` for the 'fetch-text' tool.

  ```python
  def extract_text_content(html_content: str, output_format: str = "markdown") -> str:
      """
      Extract text content from HTML in different formats.

      Args:
          html_content: Raw HTML content
          output_format: Output format - "markdown" (default), "clean_text", or "raw_html"

      Returns:
          Extracted content in the specified format
      """
      if output_format == "raw_html":
          return html_content

      try:
          from markdownify import markdownify as md

          if output_format == "markdown":
              # Convert HTML to Markdown
              markdown_text = md(html_content,
                                 heading_style="ATX",  # Use # for headings
                                 bullets="*",          # Use * for bullets
                                 strip=["script", "style", "noscript"])
              # Clean up extra whitespace
              lines = (line.rstrip() for line in markdown_text.splitlines())
              markdown_text = '\n'.join(line for line in lines if line.strip() or not line)
              return markdown_text.strip()
          elif output_format == "clean_text":
              # Parse HTML with BeautifulSoup
              soup = BeautifulSoup(html_content, 'html.parser')
              # Remove script and style elements
              for script in soup(["script", "style", "noscript"]):
                  script.decompose()
              # Get text content
              text = soup.get_text()
              # Break into lines and remove leading and trailing space on each
              lines = (line.strip() for line in text.splitlines())
              # Break multi-headlines into a line each
              chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
              # Drop blank lines
              text = ' '.join(chunk for chunk in chunks if chunk)
              return text
          else:
              # Unknown format, return raw HTML
              return html_content
      except Exception:
          # If processing fails, return original content
          return html_content
  ```
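The `clean_text` path above relies on BeautifulSoup. As a rough standard-library-only sketch of the same idea (an approximation for illustration, not the project's implementation), the `<script>`/`<style>`/`<noscript>` stripping and whitespace collapsing can be done with `html.parser`:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Minimal stdlib approximation of the 'clean_text' path:
    skips script/style/noscript content and collects the rest."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text outside skipped elements.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())


def clean_text(html_content: str) -> str:
    parser = _TextExtractor()
    parser.feed(html_content)
    return " ".join(parser._chunks)


html = ("<html><head><style>p{color:red}</style></head>"
        "<body><h1>Title</h1><p>Hello world</p><script>x=1</script></body></html>")
print(clean_text(html))  # → Title Hello world
```

Unlike the real helper, this sketch does not handle malformed markup as forgivingly as BeautifulSoup, which is why the project falls back to returning the original content on any processing error.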