
MCP Server for Crawl4AI

by omgwtfwow

get_html

Extract sanitized HTML from web pages to analyze structure, identify form fields, and plan selectors for web-crawling automation.

Instructions

[STATELESS] Get sanitized/processed HTML for inspection and automation planning. Use when: finding form fields/selectors, analyzing page structure before automation, building schemas. Returns cleaned HTML showing element names, IDs, and classes - perfect for identifying selectors for subsequent crawl operations. Commonly used before crawl to find selectors for automation. Creates new browser each time.

Input Schema

Name  Required  Description                    Default
url   Yes       The URL to extract HTML from   (none)
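
For example, an agent can invoke the tool over MCP. A minimal sketch using the MCP TypeScript SDK follows; the transport setup and the server launch command are assumptions for illustration, not taken from this page:

    import { Client } from '@modelcontextprotocol/sdk/client/index.js';
    import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

    // Launch command is assumed for illustration; adjust to however the server is run.
    const transport = new StdioClientTransport({ command: 'npx', args: ['mcp-crawl4ai-ts'] });
    const client = new Client({ name: 'example-client', version: '1.0.0' });
    await client.connect(transport);

    // url is the only supported argument (see schema above).
    const result = await client.callTool({
      name: 'get_html',
      arguments: { url: 'https://example.com' },
    });

    // result.content holds a single text block containing the sanitized HTML.
    console.log(result.content);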

Implementation Reference

  • MCP handler that executes the get_html tool: calls the service to fetch HTML and formats the response as an MCP text content block.
    async getHTML(options: HTMLEndpointOptions) {
      try {
        const result: HTMLEndpointResponse = await this.service.getHTML(options);
    
        // Response has { html: string, url: string, success: true }
        return {
          content: [
            {
              type: 'text',
              text: result.html || '',
            },
          ],
        };
      } catch (error) {
        throw this.formatError(error, 'get HTML');
      }
    }
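    A successful call therefore surfaces the sanitized page HTML as a single text content block, shaped like this (values invented for illustration):
    const exampleResult = {
      content: [
        { type: 'text', text: '<html><body><form id="login">...</form></body></html>' },
      ],
    };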
  • Core service implementation: validates the URL and POSTs to the Crawl4AI backend's /html endpoint to retrieve the processed HTML.
    async getHTML(options: HTMLEndpointOptions): Promise<HTMLEndpointResponse> {
      // Validate URL
      if (!validateURL(options.url)) {
        throw new Error('Invalid URL format');
      }
    
      try {
        const response = await this.axiosClient.post('/html', {
          url: options.url,
          // Only url is supported by the endpoint
        });
    
        return response.data;
      } catch (error) {
        return handleAxiosError(error);
      }
    }
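    The validateURL and handleAxiosError helpers are not shown in this excerpt; a minimal hypothetical sketch of what they could look like (assumptions, not the project's actual code):
    import { AxiosError } from 'axios';

    // Hypothetical: accept only http(s) URLs, since the backend drives a browser.
    export function validateURL(url: string): boolean {
      try {
        const parsed = new URL(url);
        return parsed.protocol === 'http:' || parsed.protocol === 'https:';
      } catch {
        return false;
      }
    }

    // Hypothetical: normalize axios failures into a single descriptive Error.
    export function handleAxiosError(error: unknown): never {
      if (error instanceof AxiosError) {
        const status = error.response?.status;
        throw new Error(`Crawl4AI request failed${status ? ` (HTTP ${status})` : ''}: ${error.message}`);
      }
      throw error instanceof Error ? error : new Error(String(error));
    }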
  • Input schema validation using Zod: requires a valid URL string.
    export const GetHtmlSchema = createStatelessSchema(
      z.object({
        url: z.string().url(),
      }),
      'get_html',
    );
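    Independent of the createStatelessSchema wrapper (whose implementation is not shown here), the inner Zod object enforces URL validity as follows:
    import { z } from 'zod';

    const inner = z.object({ url: z.string().url() });

    inner.parse({ url: 'https://example.com' });   // passes
    inner.safeParse({ url: 'not-a-url' }).success; // false: fails Zod's URL check
    inner.safeParse({}).success;                   // false: url is required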
  • src/server.ts:857-860 (routing)
    Tool call routing: a switch case in the server handles get_html requests, validates the arguments against the schema, and delegates to the handler.
    case 'get_html':
      return await this.validateAndExecute('get_html', args, GetHtmlSchema, async (validatedArgs) =>
        this.contentHandlers.getHTML(validatedArgs),
      );
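    validateAndExecute is not shown in the excerpts; plausibly it parses the raw arguments with the Zod schema and only then invokes the handler, roughly like this hypothetical sketch:
    import { z } from 'zod';

    // Hypothetical sketch; the project's real validateAndExecute is not shown above.
    async function validateAndExecute<T>(
      toolName: string,
      args: unknown,
      schema: z.ZodType<T>,
      execute: (validatedArgs: T) => Promise<unknown>,
    ): Promise<unknown> {
      const parsed = schema.safeParse(args);
      if (!parsed.success) {
        // Report malformed arguments back to the MCP client as a tool error.
        throw new Error(`Invalid arguments for ${toolName}: ${parsed.error.message}`);
      }
      return execute(parsed.data);
    }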
  • src/server.ts:274-287 (registration)
    Tool metadata registration: defines the name, description, and input schema advertised in the listTools response.
      name: 'get_html',
      description:
        '[STATELESS] Get sanitized/processed HTML for inspection and automation planning. Use when: finding form fields/selectors, analyzing page structure before automation, building schemas. Returns cleaned HTML showing element names, IDs, and classes - perfect for identifying selectors for subsequent crawl operations. Commonly used before crawl to find selectors for automation. Creates new browser each time.',
      inputSchema: {
        type: 'object',
        properties: {
          url: {
            type: 'string',
            description: 'The URL to extract HTML from',
          },
        },
        required: ['url'],
      },
    },
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It effectively describes key behaviors: it's stateless ('[STATELESS]'), returns cleaned HTML with specific details ('element names, IDs, and classes'), and has a side effect ('Creates new browser each time'). However, it doesn't mention potential limitations like rate limits, error handling, or authentication needs, which would be useful for a tool that creates browser instances.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is appropriately sized and front-loaded, starting with the core purpose. Most sentences earn their place by providing usage guidelines and behavioral context. However, the description repeats its automation framing ('automation planning', 'before automation', 'for automation') and states twice that it helps find selectors, which slightly reduces efficiency and prevents a perfect score.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's moderate complexity (single parameter, no output schema, no annotations), the description is largely complete. It covers purpose, usage, behavior, and output characteristics. The main gap is the lack of an output schema: the description doesn't detail the exact structure of the returned HTML (e.g., format, size limits), but it does describe the content ('cleaned HTML showing element names, IDs, and classes').

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents the single 'url' parameter. The description doesn't add any additional semantic information about parameters beyond what's in the schema (e.g., URL format requirements, handling of invalid URLs). This meets the baseline of 3 when the schema provides complete parameter documentation.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Get sanitized/processed HTML for inspection and automation planning.' It specifies the verb ('Get'), resource ('sanitized/processed HTML'), and distinguishes from siblings by focusing on HTML extraction rather than crawling, screenshot capture, or other operations listed in sibling tools.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description explicitly states when to use this tool: 'Use when: finding form fields/selectors, analyzing page structure before automation, building schemas.' It also provides context on alternatives by noting it's 'commonly used before crawl to find selectors for automation,' distinguishing it from actual crawl operations like 'crawl' or 'smart_crawl'.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
