Glama
cablate

Simple Document Processing MCP Server

html_cleaner

Remove unnecessary HTML tags and attributes to clean document files for further processing. Specify the path of the input HTML file and an output directory; the cleaned HTML is written to that directory as a new file.

Instructions

Clean HTML by removing unnecessary tags and attributes

Input Schema

Name       Required  Description                                   Default
inputPath  Yes       Path to the input HTML file                   -
outputDir  Yes       Directory where cleaned HTML should be saved  -
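
The request handler in the implementation reference casts the incoming arguments to this shape without checking them. As a minimal sketch, assuming a caller wants to validate arguments before dispatch, a type guard mirroring the two required fields could look like the following. The names HtmlCleanerArgs, isHtmlCleanerArgs, and exampleRequest are hypothetical and not part of the server's source, and the file paths are placeholders.

```typescript
// Hypothetical argument shape mirroring the two required schema fields.
interface HtmlCleanerArgs {
  inputPath: string;
  outputDir: string;
}

// Type guard (an assumption, not taken from the server's source):
// accepts only objects where both required fields are strings.
function isHtmlCleanerArgs(args: unknown): args is HtmlCleanerArgs {
  if (typeof args !== "object" || args === null) return false;
  const candidate = args as Record<string, unknown>;
  return (
    typeof candidate.inputPath === "string" &&
    typeof candidate.outputDir === "string"
  );
}

// Example MCP tools/call request for this tool; the paths are placeholders.
const exampleRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "html_cleaner",
    arguments: { inputPath: "./input/page.html", outputDir: "./output" },
  },
};
```

A guard like this lets the handler return a clear error for malformed calls instead of failing later inside the file read.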

Implementation Reference

  • Core handler function implementing the html_cleaner tool logic: loads the HTML with JSDOM, removes unwanted tags (script, style, iframe, noscript) and attributes (onclick, onload, onerror, style), serializes the cleaned HTML, and saves it under a unique filename in the output directory.
    export async function cleanHtml(inputPath: string, outputDir: string) {
      try {
        console.error(`Starting HTML cleaning...`);
        console.error(`Input file: ${inputPath}`);
        console.error(`Output directory: ${outputDir}`);

        // Ensure the output directory exists
        try {
          await fs.access(outputDir);
          console.error(`Output directory exists: ${outputDir}`);
        } catch {
          console.error(`Creating output directory: ${outputDir}`);
          await fs.mkdir(outputDir, { recursive: true });
          console.error(`Created output directory: ${outputDir}`);
        }

        const uniqueId = generateUniqueId();
        const htmlContent = await fs.readFile(inputPath, "utf-8");
        const dom = new JSDOM(htmlContent);
        const { document } = dom.window;

        // Remove unnecessary tags and attributes
        const unwantedTags = ["script", "style", "iframe", "noscript"];
        const unwantedAttrs = ["onclick", "onload", "onerror", "style"];

        unwantedTags.forEach((tag) => {
          document.querySelectorAll(tag).forEach((el) => el.remove());
        });

        document.querySelectorAll("*").forEach((el) => {
          unwantedAttrs.forEach((attr) => el.removeAttribute(attr));
        });

        const cleanedHtml = dom.serialize();
        const outputPath = path.join(outputDir, `cleaned_${uniqueId}.html`);
        await fs.writeFile(outputPath, cleanedHtml);
        console.error(`Written cleaned HTML to ${outputPath}`);

        return {
          success: true,
          data: `Successfully cleaned HTML and saved to ${outputPath}`,
        };
      } catch (error) {
        console.error(`Error in cleanHtml:`, error);
        return {
          success: false,
          error: error instanceof Error ? error.message : "Unknown error",
        };
      }
    }
  • Schema definition for the html_cleaner tool, specifying name, description, and input parameters (inputPath and outputDir).
    export const HTML_CLEAN_TOOL: Tool = {
      name: "html_cleaner",
      description: "Clean HTML by removing unnecessary tags and attributes",
      inputSchema: {
        type: "object",
        properties: {
          inputPath: {
            type: "string",
            description: "Path to the input HTML file",
          },
          outputDir: {
            type: "string",
            description: "Directory where cleaned HTML should be saved",
          },
        },
        required: ["inputPath", "outputDir"],
      },
    };
  • Registers the html_cleaner tool (as HTML_CLEAN_TOOL) in the central tools array that the MCP server uses to list available tools.
    import {
      HTML_CLEAN_TOOL,
      HTML_EXTRACT_RESOURCES_TOOL,
      HTML_FORMAT_TOOL,
      HTML_TO_MARKDOWN_TOOL,
      HTML_TO_TEXT_TOOL,
    } from "./htmlTools.js";
    import { PDF_MERGE_TOOL, PDF_SPLIT_TOOL } from "./pdfTools.js";
    import {
      TEXT_DIFF_TOOL,
      TEXT_ENCODING_CONVERT_TOOL,
      TEXT_FORMAT_TOOL,
      TEXT_SPLIT_TOOL,
    } from "./txtTools.js";

    export const tools = [
      DOCUMENT_READER_TOOL,
      PDF_MERGE_TOOL,
      PDF_SPLIT_TOOL,
      DOCX_TO_PDF_TOOL,
      DOCX_TO_HTML_TOOL,
      HTML_CLEAN_TOOL,
      HTML_TO_TEXT_TOOL,
      HTML_TO_MARKDOWN_TOOL,
      HTML_EXTRACT_RESOURCES_TOOL,
      HTML_FORMAT_TOOL,
      TEXT_DIFF_TOOL,
      TEXT_SPLIT_TOOL,
      TEXT_FORMAT_TOOL,
      TEXT_ENCODING_CONVERT_TOOL,
      EXCEL_READ_TOOL,
      FORMAT_CONVERTER_TOOL,
    ];
  • MCP server request handler that dispatches calls to 'html_cleaner' by invoking the cleanHtml function and formatting its result as a tool response.
    if (name === "html_cleaner") {
      const { inputPath, outputDir } = args as {
        inputPath: string;
        outputDir: string;
      };
      const result = await cleanHtml(inputPath, outputDir);
      if (!result.success) {
        return {
          content: [{ type: "text", text: `Error: ${result.error}` }],
          isError: true,
        };
      }
      return {
        content: [{ type: "text", text: fileOperationResponse(result.data) }],
        isError: false,
      };
    }
  • Helper utility to generate unique hexadecimal IDs for output filenames used in html_cleaner.
    function generateUniqueId(): string {
      return randomBytes(9).toString("hex");
    }
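
The core handler above removes a fixed list of attributes (onclick, onload, onerror, style), so other inline event handlers such as onmouseover would remain in the output. As a rough sketch of one possible alternative, not the server's actual behavior, the function below removes every attribute whose name starts with "on" in addition to style. AttributeHost, stripUnsafeAttributes, and makeFakeElement are hypothetical names introduced for illustration; the sketch uses a small in-memory stand-in instead of JSDOM so that it runs without dependencies.

```typescript
// Minimal structural type covering the two DOM Element methods used below.
// Real Element objects (including JSDOM's) provide both methods.
interface AttributeHost {
  getAttributeNames(): string[];
  removeAttribute(name: string): void;
}

// Hypothetical helper: removes every attribute whose name starts with "on"
// (inline event handlers) plus "style", and returns the removed names.
function stripUnsafeAttributes(el: AttributeHost): string[] {
  const removed: string[] = [];
  for (const attrName of el.getAttributeNames()) {
    const lower = attrName.toLowerCase();
    if (lower.startsWith("on") || lower === "style") {
      el.removeAttribute(attrName);
      removed.push(attrName);
    }
  }
  return removed;
}

// In-memory stand-in for an element, used only to demonstrate the helper.
function makeFakeElement(attrs: Record<string, string>): AttributeHost {
  return {
    getAttributeNames: () => Object.keys(attrs),
    removeAttribute: (attrName: string) => {
      delete attrs[attrName];
    },
  };
}

const demoAttrs: Record<string, string> = {
  href: "/home",
  onmouseover: "track()",
  style: "color:red",
  onclick: "go()",
};
const removedNames = stripUnsafeAttributes(makeFakeElement(demoAttrs));
// After the call, only the href attribute is left in demoAttrs.
```

The prefix check is broader than the original list, so it would also strip any custom attribute whose name happens to begin with "on"; whether that trade-off is acceptable depends on the documents being cleaned.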

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/cablate/mcp-doc-forge'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.