webscraping_ai_text
Extract clean text from web pages, with optional JavaScript rendering and proxy support, and return it as plain text, XML, or JSON.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | URL of the target page. | |
| text_format | No | Format of the text response: plain, xml, or json. | json |
| return_links | No | Return links from the page body text. | |
| timeout | No | Maximum web page retrieval time in ms (20000 by default, maximum is 30000). | |
| js | No | Execute on-page JavaScript using a headless browser (false by default). | |
| js_timeout | No | Maximum JavaScript rendering time in ms (3000 by default). | |
| wait_for | No | CSS selector to wait for before returning the page content. | |
| proxy | No | Type of proxy, datacenter or residential (datacenter by default). | datacenter |
| country | No | Country of the proxy to use (US by default). | |
| custom_proxy | No | Your own proxy URL in "http://user:password@host:port" format. | |
| device | No | Type of device emulation. | |
| error_on_404 | No | Return error on 404 HTTP status on the target page (false by default). | |
| error_on_redirect | No | Return error on redirect on the target page (false by default). | |
| js_script | No | Custom JavaScript code to execute on the target page. | |
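
As a rough illustration of how these parameters fit together, the sketch below assembles a `tools/call` request for this tool. `buildTextToolCall` is a hypothetical helper, not part of the server; it only checks constraints stated in the table (required `url`, the three `text_format` values, the 30000 ms `timeout` ceiling, the two `proxy` types) and wraps the arguments in the JSON-RPC envelope that MCP uses for tool calls. The example URL and selector are placeholders.

```javascript
// Hypothetical helper: validates arguments against the table above and
// wraps them in an MCP `tools/call` JSON-RPC request. Not part of the server.
const TEXT_FORMATS = ['plain', 'xml', 'json'];
const PROXY_TYPES = ['datacenter', 'residential'];
const MAX_TIMEOUT_MS = 30000;

function buildTextToolCall(args, id = 1) {
  if (typeof args.url !== 'string' || args.url.length === 0) {
    throw new Error('url is required');
  }
  if (args.text_format !== undefined && !TEXT_FORMATS.includes(args.text_format)) {
    throw new Error(`text_format must be one of: ${TEXT_FORMATS.join(', ')}`);
  }
  if (args.timeout !== undefined && args.timeout > MAX_TIMEOUT_MS) {
    throw new Error(`timeout must not exceed ${MAX_TIMEOUT_MS} ms`);
  }
  if (args.proxy !== undefined && !PROXY_TYPES.includes(args.proxy)) {
    throw new Error(`proxy must be one of: ${PROXY_TYPES.join(', ')}`);
  }
  return {
    jsonrpc: '2.0',
    id,
    method: 'tools/call',
    params: { name: 'webscraping_ai_text', arguments: { ...args } },
  };
}

// Example: render JavaScript, wait for the article body, return plain text.
const request = buildTextToolCall({
  url: 'https://example.com/article',
  text_format: 'plain',
  js: true,
  wait_for: 'article',
  timeout: 25000,
});
console.log(JSON.stringify(request, null, 2));
```

Omitted optional parameters are simply left out of `arguments`, so the defaults described in the table apply.
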
Implementation Reference
- src/index.js:285-309 (handler) The tool handler function for 'webscraping_ai_text'. It calls client.text() with the URL and options, then returns the response via createSanitizedResponse.

      server.tool(
        'webscraping_ai_text',
        {
          url: z.string().describe('URL of the target page.'),
          text_format: z.enum(['plain', 'xml', 'json']).optional().default('json').describe('Format of the text response.'),
          return_links: z.boolean().optional().describe('Return links from the page body text.'),
          ...commonOptionsSchema
        },
        async ({ url, text_format, return_links, ...options }) => {
          try {
            const result = await client.text(url, {
              ...options,
              text_format,
              return_links
            });
            const content = typeof result === 'object' ? JSON.stringify(result) : result;
            return createSanitizedResponse(content, url);
          } catch (error) {
            const errorObj = JSON.parse(error.message);
            return createSanitizedResponse(JSON.stringify(errorObj), url, true);
          }
        }
      );

- src/index.js:287-292 (schema) Input schema for the 'webscraping_ai_text' tool, defining url, text_format, return_links, and common options.

      {
        url: z.string().describe('URL of the target page.'),
        text_format: z.enum(['plain', 'xml', 'json']).optional().default('json').describe('Format of the text response.'),
        return_links: z.boolean().optional().describe('Return links from the page body text.'),
        ...commonOptionsSchema
      },

- src/index.js:285-309 (registration) Registration of the 'webscraping_ai_text' tool using server.tool() with the MCP SDK. This is the same server.tool() block shown for the handler entry above.

- src/index.js:97-101 (helper) The client.text() helper method that makes the actual HTTP request to the /text endpoint of the WebScraping.AI API.

      async text(url, options = {}) {
        return this.request('/text', {
          url,
          ...options
        });
      }

- src/index.js:192-207 (helper) The createSanitizedResponse helper used by the handler to format and optionally sandbox the response content.

      function createSanitizedResponse(content, url, isError = false) {
        if (isError) {
          return {
            content: [{ type: 'text', text: content }],
            isError: true
          };
        }

        // Process the content (apply sandboxing if enabled)
        const result = sanitizer.sanitize(content, { url });

        // Create response
        return {
          content: [{ type: 'text', text: result.content }]
        };
      }
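
The handler's `catch` branch calls `JSON.parse(error.message)`, which itself throws if the message is not valid JSON (for example, a plain network error string). Whether `client.request` always produces JSON-encoded messages is not shown in these excerpts. If it does not, a defensive variant might look like the sketch below; `parseErrorMessage` is a hypothetical helper, not part of src/index.js.

```javascript
// Hypothetical helper: parse an error message as JSON when it encodes an
// object or array; otherwise wrap the raw message in { message }.
function parseErrorMessage(error) {
  const message = error instanceof Error ? error.message : String(error);
  try {
    const parsed = JSON.parse(message);
    if (parsed !== null && typeof parsed === 'object') return parsed;
  } catch {
    // Not JSON; fall through to the wrapper below.
  }
  return { message };
}

// A JSON-encoded API error is returned as the parsed object.
console.log(parseErrorMessage(new Error('{"status":404,"message":"Not Found"}')));

// A plain-text error is wrapped instead of throwing a second time.
console.log(parseErrorMessage(new Error('socket hang up')));
```

In the handler, the result could be passed to `createSanitizedResponse(JSON.stringify(errorObj), url, true)` exactly as the parsed object is today.
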