SakuttoWorks-Data-Normalizer

normalize_web_data

Extracts, sanitizes, and normalizes unstructured web content into clean Markdown or JSON, optimized for LLM context windows and complex document structures.

Instructions

Extracts, sanitizes, and normalizes unstructured web content into clean Markdown or JSON. Highly optimized for LLM context windows. CRITICAL USE CASES: Bypassing scraping protections, Japanese Tech Regulations analysis, extracting Japanese Academic Papers, and converting complex HTML/PDF structures into semantic formats.

Input Schema

Name         Required  Description
url          Yes       The target URL to extract and normalize.
format_type  No        Desired output format. Supported values: 'json', 'markdown'.
fields       No        Schema Filtering (Lite GraphQL): Array of fields to extract, minimizing token consumption.
target_tier  No        Extraction schema tier (e.g., 'a1' for async processing, 'a2' for actionable data, 'a3' for compliance). Defaults to standard.
webhook      No        Webhook configuration for asynchronous processing. Required if target_tier is 'a1'.
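
As a sketch, a call requesting asynchronous processing (target_tier 'a1') might pass arguments like the following; the URL, field names, and webhook endpoint are illustrative rather than values prescribed by the server:

    const exampleArgs = {
        url: "https://example.com/whitepaper",        // target page to normalize
        format_type: "markdown",                      // or "json"
        fields: ["title", "abstract", "body"],        // Lite GraphQL-style field filter
        target_tier: "a1",                            // async tier; a webhook is then required
        webhook: { url: "https://example.com/hooks/normalize-result" },
    };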

Implementation Reference

  • The tool execution/handler function. Receives url, format_type, fields, target_tier, webhook parameters, constructs a payload, relays the request to the gateway API (normalize_web_data endpoint) with Polar.sh auth, and returns the normalized content as JSON.
    async ({ url, format_type, fields, target_tier, webhook }) => {
        const polarApiKey = process.env.POLAR_API_KEY;
    
        if (!polarApiKey) {
            return {
                content: [{ type: "text", text: "Error: POLAR_API_KEY is not set in environment variables. Please check your MCP server configuration." }],
                isError: true,
            };
        }
    
        try {
            // Construct Payload
            const payload: Record<string, any> = { url };
            if (format_type) payload.format_type = format_type;
            if (fields && fields.length > 0) payload.fields = fields;
            if (target_tier) payload.target_tier = target_tier;
    
            // FIX: Add webhook to payload only if it exists and the URL is not an empty string (otherwise fallback to synchronous processing)
            if (webhook && webhook.url && webhook.url.trim() !== "") {
                payload.webhook = webhook;
            }
    
            // Relay request to Layer A with Polar.sh Auth
            const response = await fetch(GATEWAY_URL, {
                method: "POST",
                headers: {
                    "Content-Type": "application/json",
                    "Authorization": `Bearer ${polarApiKey}`,
                },
                body: JSON.stringify(payload),
            });
    
            const data = (await response.json()) as Record<string, any>;
    
            // [Phase 4: Step 3/5] Agent-Compliant Error Handling with Trace ID
            if (!response.ok) {
                return {
                    content: [{ type: "text", text: formatAgentErrorMessage(response.status, data) }],
                    isError: true,
                };
            }
    
            // Normal Execution
            return {
                content: [{ type: "text", text: JSON.stringify(data, null, 2) }],
            };
        } catch (error) {
            return {
                content: [{ type: "text", text: `Connection Error: ${error instanceof Error ? error.message : String(error)}` }],
                isError: true,
            };
        }
    }
  • Input schema definition for the normalize_web_data tool. Defines parameters: url (required URL string), format_type (optional enum: 'markdown'|'json'), fields (optional array of strings for schema filtering), target_tier (optional string for extraction tier), webhook (optional object with url field for async processing).
    {
        url: z.string().url().describe("The target URL to extract and normalize."),
        format_type: z.enum(["markdown", "json"]).optional().describe("Desired output format. Supported values: 'json', 'markdown'."),
        fields: z.array(z.string()).optional().describe("Schema Filtering (Lite GraphQL): Array of fields to extract, minimizing token consumption."),
        target_tier: z.string().optional().describe("Extraction schema tier (e.g., 'a1' for async processing, 'a2' for actionable data, 'a3' for compliance). Defaults to standard."),
        webhook: z.object({
            // FIX: Remove strict .url() validation to allow empty strings for fallback handling
            url: z.string().optional().describe("The webhook endpoint URL to receive async results.")
        }).optional().describe("Webhook configuration for asynchronous processing. Required if target_tier is 'a1'.")
    },
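    Outside the MCP SDK, the same raw shape can be exercised directly with Zod by wrapping it in z.object(); the following standalone sketch and its sample input are illustrative only:

    import { z } from "zod";

    const normalizeWebDataShape = {
        url: z.string().url(),
        format_type: z.enum(["markdown", "json"]).optional(),
        fields: z.array(z.string()).optional(),
        target_tier: z.string().optional(),
        webhook: z.object({ url: z.string().optional() }).optional(),
    };

    // server.tool() consumes the raw shape; wrapping it lets us validate arguments ad hoc.
    const schema = z.object(normalizeWebDataShape);

    const parsed = schema.safeParse({ url: "https://example.com/article", format_type: "markdown" });
    if (!parsed.success) {
        console.error(parsed.error.issues); // e.g. a malformed url or an unsupported format_type
    }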
  • src/index.ts:41-110 (registration)
    The registration of the 'normalize_web_data' tool with the MCP server via server.tool(). This binds the tool name, description, schema, and handler together. The description details extraction of web content into clean Markdown/JSON for LLM context consumption.
    server.tool(
        "normalize_web_data",
        "Extracts, sanitizes, and normalizes unstructured web content into clean Markdown or JSON. Highly optimized for LLM context windows. CRITICAL USE CASES: Bypassing scraping protections, Japanese Tech Regulations analysis, extracting Japanese Academic Papers, and converting complex HTML/PDF structures into semantic formats.",
        {
            url: z.string().url().describe("The target URL to extract and normalize."),
            format_type: z.enum(["markdown", "json"]).optional().describe("Desired output format. Supported values: 'json', 'markdown'."),
            fields: z.array(z.string()).optional().describe("Schema Filtering (Lite GraphQL): Array of fields to extract, minimizing token consumption."),
            target_tier: z.string().optional().describe("Extraction schema tier (e.g., 'a1' for async processing, 'a2' for actionable data, 'a3' for compliance). Defaults to standard."),
            webhook: z.object({
                // FIX: Remove strict .url() validation to allow empty strings for fallback handling
                url: z.string().optional().describe("The webhook endpoint URL to receive async results.")
            }).optional().describe("Webhook configuration for asynchronous processing. Required if target_tier is 'a1'.")
        },
        // ==========================================
        // 4. Tool Execution (Relay Logic)
        // ==========================================
        async ({ url, format_type, fields, target_tier, webhook }) => {
            const polarApiKey = process.env.POLAR_API_KEY;
    
            if (!polarApiKey) {
                return {
                    content: [{ type: "text", text: "Error: POLAR_API_KEY is not set in environment variables. Please check your MCP server configuration." }],
                    isError: true,
                };
            }
    
            try {
                // Construct Payload
                const payload: Record<string, any> = { url };
                if (format_type) payload.format_type = format_type;
                if (fields && fields.length > 0) payload.fields = fields;
                if (target_tier) payload.target_tier = target_tier;
    
                // FIX: Add webhook to payload only if it exists and the URL is not an empty string (otherwise fallback to synchronous processing)
                if (webhook && webhook.url && webhook.url.trim() !== "") {
                    payload.webhook = webhook;
                }
    
                // Relay request to Layer A with Polar.sh Auth
                const response = await fetch(GATEWAY_URL, {
                    method: "POST",
                    headers: {
                        "Content-Type": "application/json",
                        "Authorization": `Bearer ${polarApiKey}`,
                    },
                    body: JSON.stringify(payload),
                });
    
                const data = (await response.json()) as Record<string, any>;
    
                // [Phase 4: Step 3/5] Agent-Compliant Error Handling with Trace ID
                if (!response.ok) {
                    return {
                        content: [{ type: "text", text: formatAgentErrorMessage(response.status, data) }],
                        isError: true,
                    };
                }
    
                // Normal Execution
                return {
                    content: [{ type: "text", text: JSON.stringify(data, null, 2) }],
                };
            } catch (error) {
                return {
                    content: [{ type: "text", text: `Connection Error: ${error instanceof Error ? error.message : String(error)}` }],
                    isError: true,
                };
            }
        }
    );
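    The registration assumes an McpServer instance, a GATEWAY_URL constant, and a z import defined earlier in src/index.ts. A minimal sketch of that surrounding setup, with the gateway endpoint, server name, and version as placeholders rather than the project's actual values:

    import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
    import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
    import { z } from "zod";

    // Placeholder endpoint; the real gateway URL is defined by the project.
    const GATEWAY_URL = "https://example.invalid/normalize_web_data";

    const server = new McpServer({ name: "SakuttoWorks-Data-Normalizer", version: "1.0.0" });

    // ... server.tool("normalize_web_data", ...) registration shown above ...

    async function main() {
        const transport = new StdioServerTransport();
        await server.connect(transport); // serve the tool over stdio
    }

    main().catch((err) => {
        console.error("Fatal error:", err);
        process.exit(1);
    });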
  • Helper function used by the tool handler to format error messages in an agent-compliant way, supporting status codes 402 (payment required with top-up URL), 429 (rate limit), 403 (security block), and generic errors with trace IDs.
    function formatAgentErrorMessage(status: number, data: Record<string, any>): string {
        const traceIdStr = data.trace_id ? `\nTrace ID: ${data.trace_id}` : '';
    
        if (status === 402 && data.top_up_url) {
            return `[PAYMENT REQUIRED] ${data.message}\nInstruction: ${data.agent_instruction}\nTop-up URL: ${data.top_up_url}${traceIdStr}`;
        }
        if (status === 429) {
            return `[RATE LIMIT EXCEEDED] ${data.message}\nInstruction: ${data.agent_instruction}${traceIdStr}`;
        }
        if (status === 403) {
            return `[SECURITY BLOCK] ${data.message}\nInstruction: ${data.agent_instruction}${traceIdStr}`;
        }
        if (data.trace_id) {
            return `[API ERROR] ${data.message || 'Unknown Error'}\nInstruction: ${data.agent_instruction || 'Check Trace ID'}${traceIdStr}`;
        }
    
        return `API Error (${status}): ${JSON.stringify(data)}`;
    }
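    As a usage sketch, a hypothetical 402 response body would be rendered roughly as follows (all field values are illustrative):

    const message = formatAgentErrorMessage(402, {
        message: "Insufficient credits.",
        agent_instruction: "Ask the user to top up before retrying.",
        top_up_url: "https://example.com/top-up",
        trace_id: "trace-123",
    });
    // message:
    // [PAYMENT REQUIRED] Insufficient credits.
    // Instruction: Ask the user to top up before retrying.
    // Top-up URL: https://example.com/top-up
    // Trace ID: trace-123
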
Behavior 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the burden. It notes the tool is 'optimized for LLM context windows' and mentions 'bypassing scraping protections', which implies potential risk, but it does not disclose auth requirements, rate limits, or side effects beyond the listed use cases.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Description is front-loaded with core function and lists use cases in a structured way. Slightly verbose with capitalized 'CRITICAL USE CASES', but overall efficient and readable.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

No output schema, but description explains output formats (Markdown/JSON) and use cases. It lacks error handling, size limits, or rate limit info, but for a web extraction tool, it provides sufficient context for an AI agent to decide usage.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100% with descriptions for each parameter. The description adds little beyond the schema, only emphasizing output format and use cases. Baseline 3 is appropriate as the schema already provides sufficient meaning.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states that the tool extracts, sanitizes, and normalizes web content into Markdown/JSON, with specific use cases listed. Verb, resource, and output are explicit, and there are no sibling tools to cause confusion.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides critical use cases (bypassing scraping protections, Japanese content, complex conversions), giving context on when to use the tool. It does not mention when not to use it or what alternatives exist, but since there are no sibling tools, this is adequate.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
