extract

Extract structured JSON data from webpages using AI, bypassing bot detection and CAPTCHA to scrape any site reliably.

Instructions

Scrape a webpage and extract structured data as JSON. First scrapes the page as markdown, then uses AI sampling to convert it to structured JSON format. This tool can unlock any webpage even if it uses bot detection or CAPTCHA.
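
The two steps described above can be sketched roughly as follows. This is a minimal illustration, not the server's implementation: `extractStructured`, `scrape`, and `sample` are hypothetical names, and the scraper and the model are passed in as plain functions so the flow can be run with stubs instead of real network calls.

```javascript
// Sketch of the two-step flow: scrape the page to markdown, then ask a model
// to turn that markdown into JSON text.
// `scrape` and `sample` are hypothetical injected functions (assumptions).
async function extractStructured(url, { scrape, sample, extractionPrompt }) {
    // Step 1: fetch the page content as markdown.
    const markdown = await scrape(url);

    // Step 2: build the prompt, falling back to a generic instruction
    // when no custom extraction prompt is given.
    const prompt = extractionPrompt ||
        'Extract the requested information from this markdown content and return ONLY a JSON object:';

    // The model replies with text that is expected to contain JSON.
    return sample(`${prompt}\n\nMarkdown content:\n${markdown}`);
}

// Usage with stubs standing in for the scraper and the model.
async function demo() {
    return extractStructured('https://example.com/product', {
        scrape: async () => '# Widget\n\nPrice: $10',
        sample: async () => '{"name": "Widget", "price": 10}',
    });
}
```

Injecting the two steps keeps the sketch testable; the result is returned as text, so a caller would still need to parse it.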

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| url | Yes | | |
| extraction_prompt | No | Custom prompt to guide the extraction process. If not provided, will extract general structured data from the page. | |
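
As an illustration of how these inputs might be sent, the sketch below builds an MCP `tools/call` request (JSON-RPC 2.0) for this tool. The helper name `buildExtractCall` is hypothetical, and checking `url` with `new URL` is an assumption about what counts as a valid value; the server's own validation may differ.

```javascript
// Build a JSON-RPC 2.0 `tools/call` request for the `extract` tool.
// `buildExtractCall` is a hypothetical helper, not part of any SDK.
function buildExtractCall(id, url, extractionPrompt) {
    // Assumed check: `new URL` throws a TypeError for strings that are not
    // absolute URLs, so bad input fails before a request is built.
    new URL(url);

    // `url` is required; `extraction_prompt` is only sent when provided.
    const args = { url };
    if (extractionPrompt !== undefined) {
        args.extraction_prompt = extractionPrompt;
    }

    return {
        jsonrpc: '2.0',
        id,
        method: 'tools/call',
        params: { name: 'extract', arguments: args },
    };
}

// Example: ask for product prices from a page.
const request = buildExtractCall(
    1,
    'https://example.com/products',
    'List every product name with its price.'
);
console.log(JSON.stringify(request, null, 2));
```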

Implementation Reference

  • The main handler function for the 'extract' tool. It scrapes the given URL through the Bright Data API to get the page as markdown, then uses the MCP session's AI sampling to extract structured JSON data, guided by the optional extraction_prompt.
    execute: tool_fn('extract', async ({ url, extraction_prompt }, ctx) => {
        let scrape_response = await axios({
            url: 'https://api.brightdata.com/request',
            method: 'POST',
            data: {
                url,
                zone: unlocker_zone,
                format: 'raw',
                data_format: 'markdown',
            },
            headers: api_headers(),
            responseType: 'text',
        });
        let markdown_content = scrape_response.data;
        let system_prompt =
            'You are a data extraction specialist. You MUST respond with ONLY valid JSON, no other text or formatting. ' +
            'Extract the requested information from the markdown content and return it as a properly formatted JSON object. ' +
            'Do not include any explanations, markdown formatting, or text outside the JSON response.';
        let user_prompt = extraction_prompt ||
            'Extract the requested information from this markdown content and return ONLY a JSON object:';
        let session = server.sessions[0]; // Get the first active session
        if (!session)
            throw new Error('No active session available for sampling');
        let sampling_response = await session.requestSampling({
            messages: [
                {
                    role: "user",
                    content: {
                        type: "text",
                        text: `${user_prompt}\n\nMarkdown content:\n${markdown_content}\n\nRemember: Respond with ONLY valid JSON, no other text.`,
                    },
                },
            ],
            systemPrompt: system_prompt,
            includeContext: "thisServer",
        });
        return sampling_response.content.text;
    }),
  • Zod schema defining the input parameters for the 'extract' tool: required 'url' (string URL) and optional 'extraction_prompt' (string).
    parameters: z.object({
        url: z.string().url(),
        extraction_prompt: z.string().optional().describe(
            'Custom prompt to guide the extraction process. If not provided, ' +
            'will extract general structured data from the page.'
        ),
    }),
  • server.js:207-262 (registration)
    The addTool call that registers the 'extract' tool on the FastMCP server, including name, description, parameters schema, and execute handler.
    addTool({
        name: 'extract',
        description: 'Scrape a webpage and extract structured data as JSON. ' +
            'First scrapes the page as markdown, then uses AI sampling to convert ' +
            'it to structured JSON format. This tool can unlock any webpage even ' +
            'if it uses bot detection or CAPTCHA.',
        parameters: z.object({
            url: z.string().url(),
            extraction_prompt: z.string().optional().describe(
                'Custom prompt to guide the extraction process. If not provided, ' +
                'will extract general structured data from the page.'
            ),
        }),
        execute: tool_fn('extract', async ({ url, extraction_prompt }, ctx) => {
            let scrape_response = await axios({
                url: 'https://api.brightdata.com/request',
                method: 'POST',
                data: {
                    url,
                    zone: unlocker_zone,
                    format: 'raw',
                    data_format: 'markdown',
                },
                headers: api_headers(),
                responseType: 'text',
            });
            let markdown_content = scrape_response.data;
            let system_prompt =
                'You are a data extraction specialist. You MUST respond with ONLY valid JSON, no other text or formatting. ' +
                'Extract the requested information from the markdown content and return it as a properly formatted JSON object. ' +
                'Do not include any explanations, markdown formatting, or text outside the JSON response.';
            let user_prompt = extraction_prompt ||
                'Extract the requested information from this markdown content and return ONLY a JSON object:';
            let session = server.sessions[0]; // Get the first active session
            if (!session)
                throw new Error('No active session available for sampling');
            let sampling_response = await session.requestSampling({
                messages: [
                    {
                        role: "user",
                        content: {
                            type: "text",
                            text: `${user_prompt}\n\nMarkdown content:\n${markdown_content}\n\nRemember: Respond with ONLY valid JSON, no other text.`,
                        },
                    },
                ],
                systemPrompt: system_prompt,
                includeContext: "thisServer",
            });
            return sampling_response.content.text;
        }),
    });
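
Because the handler returns the model's reply as raw text, a caller may want to parse it defensively. The sketch below is one possible approach, not part of the server: `parseExtractResult` is a hypothetical helper, and the fence-stripping step rests on the assumption that a model may occasionally wrap its JSON in a Markdown code fence despite the prompt.

```javascript
// Three backticks, built at runtime so this example contains no literal fence.
const FENCE = '`'.repeat(3);

// Matches a reply that is entirely one fenced block, optionally tagged `json`.
const FENCED_REPLY = new RegExp(
    '^' + FENCE + '(?:json)?\\s*\\n([\\s\\S]*?)\\n?' + FENCE + '$'
);

// Parse the text returned by the `extract` tool into a JavaScript value.
// Returns { ok: true, value } on success, or { ok: false, error, raw }
// when the text is not valid JSON. Hypothetical helper (assumption).
function parseExtractResult(text) {
    const trimmed = text.trim();

    // Strip one surrounding code fence if the whole reply is fenced.
    const fenced = trimmed.match(FENCED_REPLY);
    const candidate = fenced ? fenced[1] : trimmed;

    try {
        return { ok: true, value: JSON.parse(candidate) };
    } catch (err) {
        // Keep the raw text so the caller can log or retry.
        return { ok: false, error: err.message, raw: text };
    }
}

// Usage: a plain JSON reply and a reply that is not JSON at all.
console.log(parseExtractResult('{"title": "Widget"}').ok);
console.log(parseExtractResult('Sorry, no data found.').ok);
```

Returning a result object instead of throwing lets the caller decide whether to retry the extraction or surface the raw reply.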

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/dsouza-anush/brightdata-mcp-heroku'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.