SakuttoWorks-Data-Normalizer

normalize_web_data

Extracts, sanitizes, and normalizes unstructured web content into clean Markdown or JSON, optimized for LLM context windows and complex document structures.

Instructions

Extracts, sanitizes, and normalizes unstructured web content into clean Markdown or JSON. Highly optimized for LLM context windows. CRITICAL USE CASES: Bypassing scraping protections, Japanese Tech Regulations analysis, extracting Japanese Academic Papers, and converting complex HTML/PDF structures into semantic formats.

Input Schema

Name         Required  Description
url          Yes       The target URL to extract and normalize.
format_type  No        Desired output format. Supported values: 'json', 'markdown'.
fields       No        Schema Filtering (Lite GraphQL): Array of fields to extract, minimizing token consumption.
target_tier  No        Extraction schema tier (e.g., 'a1' for async processing, 'a2' for actionable data, 'a3' for compliance). Defaults to standard.
webhook      No        Webhook configuration for asynchronous processing. Required if target_tier is 'a1'.
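
As a sketch, a call requesting asynchronous processing (target_tier 'a1') might pass arguments like the following; the URL, field names, and webhook endpoint are illustrative rather than values prescribed by the server:

    const exampleArgs = {
        url: "https://example.com/whitepaper",        // target page to normalize
        format_type: "markdown",                      // or "json"
        fields: ["title", "abstract", "body"],        // Lite GraphQL-style field filter
        target_tier: "a1",                            // async tier; a webhook is then required
        webhook: { url: "https://example.com/hooks/normalize-result" },
    };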

Implementation Reference

  • The tool execution/handler function. Receives url, format_type, fields, target_tier, webhook parameters, constructs a payload, relays the request to the gateway API (normalize_web_data endpoint) with Polar.sh auth, and returns the normalized content as JSON.
    async ({ url, format_type, fields, target_tier, webhook }) => {
        const polarApiKey = process.env.POLAR_API_KEY;
    
        if (!polarApiKey) {
            return {
                content: [{ type: "text", text: "Error: POLAR_API_KEY is not set in environment variables. Please check your MCP server configuration." }],
                isError: true,
            };
        }
    
        try {
            // Construct Payload
            const payload: Record<string, any> = { url };
            if (format_type) payload.format_type = format_type;
            if (fields && fields.length > 0) payload.fields = fields;
            if (target_tier) payload.target_tier = target_tier;
    
            // FIX: Add webhook to payload only if it exists and the URL is not an empty string (otherwise fallback to synchronous processing)
            if (webhook && webhook.url && webhook.url.trim() !== "") {
                payload.webhook = webhook;
            }
    
            // Relay request to Layer A with Polar.sh Auth
            const response = await fetch(GATEWAY_URL, {
                method: "POST",
                headers: {
                    "Content-Type": "application/json",
                    "Authorization": `Bearer ${polarApiKey}`,
                },
                body: JSON.stringify(payload),
            });
    
            const data = (await response.json()) as Record<string, any>;
    
            // [Phase 4: Step 3/5] Agent-Compliant Error Handling with Trace ID
            if (!response.ok) {
                return {
                    content: [{ type: "text", text: formatAgentErrorMessage(response.status, data) }],
                    isError: true,
                };
            }
    
            // Normal Execution
            return {
                content: [{ type: "text", text: JSON.stringify(data, null, 2) }],
            };
        } catch (error) {
            return {
                content: [{ type: "text", text: `Connection Error: ${error instanceof Error ? error.message : String(error)}` }],
                isError: true,
            };
        }
    }
  • Input schema definition for the normalize_web_data tool. Defines parameters: url (required URL string), format_type (optional enum: 'markdown'|'json'), fields (optional array of strings for schema filtering), target_tier (optional string for extraction tier), webhook (optional object with url field for async processing).
    {
        url: z.string().url().describe("The target URL to extract and normalize."),
        format_type: z.enum(["markdown", "json"]).optional().describe("Desired output format. Supported values: 'json', 'markdown'."),
        fields: z.array(z.string()).optional().describe("Schema Filtering (Lite GraphQL): Array of fields to extract, minimizing token consumption."),
        target_tier: z.string().optional().describe("Extraction schema tier (e.g., 'a1' for async processing, 'a2' for actionable data, 'a3' for compliance). Defaults to standard."),
        webhook: z.object({
            // FIX: Remove strict .url() validation to allow empty strings for fallback handling
            url: z.string().optional().describe("The webhook endpoint URL to receive async results.")
        }).optional().describe("Webhook configuration for asynchronous processing. Required if target_tier is 'a1'.")
    },
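    Outside the MCP SDK, the same raw shape can be exercised directly with Zod by wrapping it in z.object(); the following standalone sketch and its sample input are illustrative only:

    import { z } from "zod";

    const normalizeWebDataShape = {
        url: z.string().url(),
        format_type: z.enum(["markdown", "json"]).optional(),
        fields: z.array(z.string()).optional(),
        target_tier: z.string().optional(),
        webhook: z.object({ url: z.string().optional() }).optional(),
    };

    // server.tool() consumes the raw shape; wrapping it lets us validate arguments ad hoc.
    const schema = z.object(normalizeWebDataShape);

    const parsed = schema.safeParse({ url: "https://example.com/article", format_type: "markdown" });
    if (!parsed.success) {
        console.error(parsed.error.issues); // e.g. a malformed url or an unsupported format_type
    }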
  • src/index.ts:41-110 (registration)
    The registration of the 'normalize_web_data' tool with the MCP server via server.tool(). This binds the tool name, description, schema, and handler together. The description details extraction of web content into clean Markdown/JSON for LLM context consumption.
    server.tool(
        "normalize_web_data",
        "Extracts, sanitizes, and normalizes unstructured web content into clean Markdown or JSON. Highly optimized for LLM context windows. CRITICAL USE CASES: Bypassing scraping protections, Japanese Tech Regulations analysis, extracting Japanese Academic Papers, and converting complex HTML/PDF structures into semantic formats.",
        {
            url: z.string().url().describe("The target URL to extract and normalize."),
            format_type: z.enum(["markdown", "json"]).optional().describe("Desired output format. Supported values: 'json', 'markdown'."),
            fields: z.array(z.string()).optional().describe("Schema Filtering (Lite GraphQL): Array of fields to extract, minimizing token consumption."),
            target_tier: z.string().optional().describe("Extraction schema tier (e.g., 'a1' for async processing, 'a2' for actionable data, 'a3' for compliance). Defaults to standard."),
            webhook: z.object({
                // FIX: Remove strict .url() validation to allow empty strings for fallback handling
                url: z.string().optional().describe("The webhook endpoint URL to receive async results.")
            }).optional().describe("Webhook configuration for asynchronous processing. Required if target_tier is 'a1'.")
        },
        // ==========================================
        // 4. Tool Execution (Relay Logic)
        // ==========================================
        async ({ url, format_type, fields, target_tier, webhook }) => {
            const polarApiKey = process.env.POLAR_API_KEY;
    
            if (!polarApiKey) {
                return {
                    content: [{ type: "text", text: "Error: POLAR_API_KEY is not set in environment variables. Please check your MCP server configuration." }],
                    isError: true,
                };
            }
    
            try {
                // Construct Payload
                const payload: Record<string, any> = { url };
                if (format_type) payload.format_type = format_type;
                if (fields && fields.length > 0) payload.fields = fields;
                if (target_tier) payload.target_tier = target_tier;
    
                // FIX: Add webhook to payload only if it exists and the URL is not an empty string (otherwise fallback to synchronous processing)
                if (webhook && webhook.url && webhook.url.trim() !== "") {
                    payload.webhook = webhook;
                }
    
                // Relay request to Layer A with Polar.sh Auth
                const response = await fetch(GATEWAY_URL, {
                    method: "POST",
                    headers: {
                        "Content-Type": "application/json",
                        "Authorization": `Bearer ${polarApiKey}`,
                    },
                    body: JSON.stringify(payload),
                });
    
                const data = (await response.json()) as Record<string, any>;
    
                // [Phase 4: Step 3/5] Agent-Compliant Error Handling with Trace ID
                if (!response.ok) {
                    return {
                        content: [{ type: "text", text: formatAgentErrorMessage(response.status, data) }],
                        isError: true,
                    };
                }
    
                // Normal Execution
                return {
                    content: [{ type: "text", text: JSON.stringify(data, null, 2) }],
                };
            } catch (error) {
                return {
                    content: [{ type: "text", text: `Connection Error: ${error instanceof Error ? error.message : String(error)}` }],
                    isError: true,
                };
            }
        }
    );
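    The registration assumes an McpServer instance, a GATEWAY_URL constant, and a z import defined earlier in src/index.ts. A minimal sketch of that surrounding setup, with the gateway endpoint, server name, and version as placeholders rather than the project's actual values:

    import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
    import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
    import { z } from "zod";

    // Placeholder endpoint; the real gateway URL is defined by the project.
    const GATEWAY_URL = "https://example.invalid/normalize_web_data";

    const server = new McpServer({ name: "SakuttoWorks-Data-Normalizer", version: "1.0.0" });

    // ... server.tool("normalize_web_data", ...) registration shown above ...

    async function main() {
        const transport = new StdioServerTransport();
        await server.connect(transport); // serve the tool over stdio
    }

    main().catch((err) => {
        console.error("Fatal error:", err);
        process.exit(1);
    });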
  • Helper function used by the tool handler to format error messages in an agent-compliant way, supporting status codes 402 (payment required with top-up URL), 429 (rate limit), 403 (security block), and generic errors with trace IDs.
    function formatAgentErrorMessage(status: number, data: Record<string, any>): string {
        const traceIdStr = data.trace_id ? `\nTrace ID: ${data.trace_id}` : '';
    
        if (status === 402 && data.top_up_url) {
            return `[PAYMENT REQUIRED] ${data.message}\nInstruction: ${data.agent_instruction}\nTop-up URL: ${data.top_up_url}${traceIdStr}`;
        }
        if (status === 429) {
            return `[RATE LIMIT EXCEEDED] ${data.message}\nInstruction: ${data.agent_instruction}${traceIdStr}`;
        }
        if (status === 403) {
            return `[SECURITY BLOCK] ${data.message}\nInstruction: ${data.agent_instruction}${traceIdStr}`;
        }
        if (data.trace_id) {
            return `[API ERROR] ${data.message || 'Unknown Error'}\nInstruction: ${data.agent_instruction || 'Check Trace ID'}${traceIdStr}`;
        }
    
        return `API Error (${status}): ${JSON.stringify(data)}`;
    }
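    As a usage sketch, a hypothetical 402 response body would be rendered roughly as follows (all field values are illustrative):

    const message = formatAgentErrorMessage(402, {
        message: "Insufficient credits.",
        agent_instruction: "Ask the user to top up before retrying.",
        top_up_url: "https://example.com/top-up",
        trace_id: "trace-123",
    });
    // message:
    // [PAYMENT REQUIRED] Insufficient credits.
    // Instruction: Ask the user to top up before retrying.
    // Top-up URL: https://example.com/top-up
    // Trace ID: trace-123
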
Behavior 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the burden. It notes the tool is 'optimized for LLM context windows' and mentions 'bypassing scraping protections', which implies potential risk, but it does not disclose auth requirements, rate limits, or side effects beyond the listed use cases.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Description is front-loaded with core function and lists use cases in a structured way. Slightly verbose with capitalized 'CRITICAL USE CASES', but overall efficient and readable.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

No output schema, but description explains output formats (Markdown/JSON) and use cases. It lacks error handling, size limits, or rate limit info, but for a web extraction tool, it provides sufficient context for an AI agent to decide usage.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100% with descriptions for each parameter. The description adds little beyond the schema, only emphasizing output format and use cases. Baseline 3 is appropriate as the schema already provides sufficient meaning.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states that the tool extracts, sanitizes, and normalizes web content into Markdown/JSON, with specific use cases listed. Verb, resource, and output are explicit, and there are no sibling tools to cause confusion.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides critical use cases (bypassing scraping protections, Japanese content, complex conversions), giving context on when to use the tool. It does not mention when not to use it or what alternatives exist, but since there are no sibling tools, this is adequate.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
