cablate

Simple Document Processing MCP Server

html_extract_resources

Extract the URLs of images, video sources, and links from an HTML file and write them to a JSON file in a specified directory.

Instructions

Extract all resources (images, videos, links) from HTML

Input Schema

Name       Required  Description                                Default
inputPath  Yes       Path to the input HTML file                (none)
outputDir  Yes       Directory where resources should be saved  (none)
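
Both parameters are required strings. A minimal sketch of how a caller or handler might check the arguments before use is shown here; `ExtractResourcesArgs` and `parseExtractArgs` are hypothetical names for illustration and are not part of the server.

```typescript
// Hypothetical helper, not part of the server: narrows untyped tool
// arguments to the two required string fields from the input schema.
interface ExtractResourcesArgs {
  inputPath: string;
  outputDir: string;
}

function parseExtractArgs(args: unknown): ExtractResourcesArgs {
  if (typeof args !== "object" || args === null) {
    throw new Error("arguments must be an object");
  }
  const { inputPath, outputDir } = args as Record<string, unknown>;
  if (typeof inputPath !== "string" || inputPath.length === 0) {
    throw new Error("inputPath is required and must be a non-empty string");
  }
  if (typeof outputDir !== "string" || outputDir.length === 0) {
    throw new Error("outputDir is required and must be a non-empty string");
  }
  return { inputPath, outputDir };
}
```

The non-empty check is an extra assumption of this sketch; the schema itself only requires that both fields are present and are strings.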

Implementation Reference

  • The Tool schema definition for html_extract_resources, defining its name, description, and input schema (inputPath, outputDir).
    // HTML resource extraction tool
    export const HTML_EXTRACT_RESOURCES_TOOL: Tool = {
      name: "html_extract_resources",
      description: "Extract all resources (images, videos, links) from HTML",
      inputSchema: {
        type: "object",
        properties: {
          inputPath: {
            type: "string",
            description: "Path to the input HTML file",
          },
          outputDir: {
            type: "string",
            description: "Directory where resources should be saved",
          },
        },
        required: ["inputPath", "outputDir"],
      },
    };
  • The handler function extractHtmlResources that implements the resource extraction logic: reads HTML, extracts images/videos/links, and saves to JSON.
    // HTML resource extraction implementation
    export async function extractHtmlResources(
      inputPath: string,
      outputDir: string
    ) {
      try {
        console.error(`Starting resource extraction...`);
        console.error(`Input file: ${inputPath}`);
        console.error(`Output directory: ${outputDir}`);
    
        // Ensure the output directory exists
        try {
          await fs.access(outputDir);
          console.error(`Output directory exists: ${outputDir}`);
        } catch {
          console.error(`Creating output directory: ${outputDir}`);
          await fs.mkdir(outputDir, { recursive: true });
          console.error(`Created output directory: ${outputDir}`);
        }
    
        const uniqueId = generateUniqueId();
        const htmlContent = await fs.readFile(inputPath, "utf-8");
        const dom = new JSDOM(htmlContent);
        const { document } = dom.window;
    
        // Extract resources
        const resources = {
          images: Array.from(document.querySelectorAll("img")).map(
            (img) => (img as HTMLImageElement).src
          ),
          links: Array.from(document.querySelectorAll("a")).map(
            (a) => (a as HTMLAnchorElement).href
          ),
          videos: Array.from(document.querySelectorAll("video source")).map(
            (video) => (video as HTMLSourceElement).src
          ),
        };
    
        const outputPath = path.join(outputDir, `resources_${uniqueId}.json`);
        await fs.writeFile(outputPath, JSON.stringify(resources, null, 2));
        console.error(`Written resources to ${outputPath}`);
    
        return {
          success: true,
          data: `Successfully extracted resources: ${outputPath}`,
        };
      } catch (error) {
        console.error(`Error in extractHtmlResources:`, error);
        return {
          success: false,
          error: error instanceof Error ? error.message : "Unknown error",
        };
      }
    }
  • The tool registration/index file that imports HTML_EXTRACT_RESOURCES_TOOL and includes it in the exported tools array.
    import { DOCUMENT_READER_TOOL } from "./documentReader.js";
    import { DOCX_TO_HTML_TOOL, DOCX_TO_PDF_TOOL } from "./docxTools.js";
    import { EXCEL_READ_TOOL } from "./excelTools.js";
    import { FORMAT_CONVERTER_TOOL } from "./formatConverterPlus.js";
    import { HTML_CLEAN_TOOL, HTML_EXTRACT_RESOURCES_TOOL, HTML_FORMAT_TOOL, HTML_TO_MARKDOWN_TOOL, HTML_TO_TEXT_TOOL } from "./htmlTools.js";
    import { PDF_MERGE_TOOL, PDF_SPLIT_TOOL } from "./pdfTools.js";
    import { TEXT_DIFF_TOOL, TEXT_ENCODING_CONVERT_TOOL, TEXT_FORMAT_TOOL, TEXT_SPLIT_TOOL } from "./txtTools.js";
    
    export const tools = [DOCUMENT_READER_TOOL, PDF_MERGE_TOOL, PDF_SPLIT_TOOL, DOCX_TO_PDF_TOOL, DOCX_TO_HTML_TOOL, HTML_CLEAN_TOOL, HTML_TO_TEXT_TOOL, HTML_TO_MARKDOWN_TOOL, HTML_EXTRACT_RESOURCES_TOOL, HTML_FORMAT_TOOL, TEXT_DIFF_TOOL, TEXT_SPLIT_TOOL, TEXT_FORMAT_TOOL, TEXT_ENCODING_CONVERT_TOOL, EXCEL_READ_TOOL, FORMAT_CONVERTER_TOOL];
    
    export * from "./documentReader.js";
    export * from "./docxTools.js";
    export * from "./excelTools.js";
    export * from "./formatConverterPlus.js";
    export * from "./htmlTools.js";
    export * from "./pdfTools.js";
    export * from "./txtTools.js";
  • src/index.ts:204-220 (registration)
    The request handler in src/index.ts that routes the 'html_extract_resources' tool call to the extractHtmlResources handler function.
    if (name === "html_extract_resources") {
      const { inputPath, outputDir } = args as {
        inputPath: string;
        outputDir: string;
      };
      const result = await extractHtmlResources(inputPath, outputDir);
      if (!result.success) {
        return {
          content: [{ type: "text", text: `Error: ${result.error}` }],
          isError: true,
        };
      }
      return {
        content: [{ type: "text", text: fileOperationResponse(result.data) }],
        isError: false,
      };
    }
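
Per the handler above, the output is a JSON file with three arrays of URL strings (`images`, `links`, `videos`). A minimal sketch of how a consumer might parse and validate that file's contents follows; `ExtractedResources`, `isStringArray`, and `parseExtractedResources` are hypothetical names and are not part of the server.

```typescript
// Shape of the JSON written by extractHtmlResources: three URL arrays.
interface ExtractedResources {
  images: string[];
  links: string[];
  videos: string[];
}

// Type guard: true only for arrays whose elements are all strings.
function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((v) => typeof v === "string");
}

// Hypothetical helper: parses the text of a resources_<id>.json file and
// rejects anything that does not match the expected shape.
function parseExtractedResources(json: string): ExtractedResources {
  const parsed: unknown = JSON.parse(json);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("expected a JSON object");
  }
  const { images, links, videos } = parsed as Record<string, unknown>;
  if (!isStringArray(images) || !isStringArray(links) || !isStringArray(videos)) {
    throw new Error("images, links, and videos must be arrays of strings");
  }
  return { images, links, videos };
}
```

The input string could come from reading the file whose path is reported in the tool's success message. The tool records URLs only; nothing in the handler shown downloads the referenced files.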
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, and the description does not disclose behavioral traits such as whether the tool is read-only, modifies input, or how it handles errors. The description carries the full burden but only states the basic purpose.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
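
One way to address this gap is sketched below: the same tool definition with a fuller description and MCP tool annotations. The interfaces are local stand-ins so the example is self-contained, `ANNOTATED_EXTRACT_TOOL` is a hypothetical name, and the annotation values are a reading of the handler shown in the Implementation Reference, not something the author has published. In particular, `destructiveHint: false` assumes `generateUniqueId` yields a fresh file name on each call.

```typescript
// Local stand-ins for the MCP tool types so the sketch is self-contained.
interface ToolAnnotations {
  title?: string;
  readOnlyHint?: boolean;
  destructiveHint?: boolean;
  idempotentHint?: boolean;
  openWorldHint?: boolean;
}

interface AnnotatedTool {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required: string[];
  };
  annotations?: ToolAnnotations;
}

// Hypothetical revision of the tool definition (not the author's).
const ANNOTATED_EXTRACT_TOOL: AnnotatedTool = {
  name: "html_extract_resources",
  description:
    "Extract the URLs of images (img), video sources (video source), and " +
    "links (a) from a local HTML file and write them as JSON to " +
    "resources_<id>.json in outputDir, creating the directory if needed. " +
    "Does not download the resources or modify the input file.",
  inputSchema: {
    type: "object",
    properties: {
      inputPath: { type: "string", description: "Path to the input HTML file" },
      outputDir: {
        type: "string",
        description: "Directory where the JSON resource list is written",
      },
    },
    required: ["inputPath", "outputDir"],
  },
  annotations: {
    title: "Extract HTML resources",
    readOnlyHint: false, // writes a JSON file and may create outputDir
    destructiveHint: false, // assumption: each call writes a new file
    idempotentHint: false, // repeated calls produce additional files
    openWorldHint: false, // touches only the local file system
  },
};
```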

Conciseness: 3/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is a single sentence, which is concise but lacks necessary detail. It could be slightly expanded to cover what extraction entails without becoming verbose.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

The tool likely produces output files, but the description does not specify return values or side effects. With no output schema and no annotations, the description is insufficient for an agent to fully understand the tool's behavior.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, and both parameters have descriptions in the schema. The tool description adds no additional meaning beyond what is already in the schema (e.g., that resources are saved to an output directory).

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool extracts resources (images, videos, links) from HTML, differentiating it from sibling tools like html_cleaner or html_formatter. However, it could be more specific about the extraction process (e.g., whether it saves files or extracts URLs).

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives such as html_cleaner or html_to_text. There is no mention of prerequisites or typical use cases.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/cablate/mcp-doc-forge'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.