html_cleaner

Simplify an HTML file by stripping unnecessary tags and attributes. Given the path to an input file, this MCP server tool writes the cleaned output to a specified directory.

Instructions

Clean HTML by removing unnecessary tags and attributes

Input Schema

Name	Required	Description	Default
`inputPath`	Yes	Path to the input HTML file
`outputDir`	Yes	Directory where cleaned HTML should be saved
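
As a rough illustration of how a client might supply these two arguments, the sketch below builds the JSON-RPC `tools/call` message that MCP uses to invoke a tool. The helper name `buildHtmlCleanerRequest` and the example paths are assumptions made for this example; they do not come from the server's source.

```typescript
// Arguments accepted by html_cleaner, per the input schema above.
interface HtmlCleanerArgs {
  inputPath: string;
  outputDir: string;
}

// JSON-RPC 2.0 envelope for an MCP `tools/call` request.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: "html_cleaner"; arguments: HtmlCleanerArgs };
}

// Hypothetical helper (not part of the server): wraps the arguments
// in the request envelope a client would send.
function buildHtmlCleanerRequest(id: number, args: HtmlCleanerArgs): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "html_cleaner", arguments: args },
  };
}

// Example paths are placeholders.
const exampleRequest = buildHtmlCleanerRequest(1, {
  inputPath: "/tmp/page.html",
  outputDir: "/tmp/cleaned",
});
console.log(JSON.stringify(exampleRequest, null, 2));
```

Both fields are required, so a request that omits either one does not match the schema.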

Implementation Reference

src/tools/htmlTools.ts:113-162 (handler)
The core handler function for the 'html_cleaner' tool. It reads an HTML file, parses it with JSDOM, removes unwanted tags (script, style, iframe, noscript) and attributes (onclick, onload, onerror, style), then serializes the result and saves it to a uniquely named file in the output directory, creating that directory first if it does not exist.
    export async function cleanHtml(inputPath: string, outputDir: string) {
      try {
        console.error(`Starting HTML cleaning...`);
        console.error(`Input file: ${inputPath}`);
        console.error(`Output directory: ${outputDir}`);

        // Ensure the output directory exists
        try {
          await fs.access(outputDir);
          console.error(`Output directory exists: ${outputDir}`);
        } catch {
          console.error(`Creating output directory: ${outputDir}`);
          await fs.mkdir(outputDir, { recursive: true });
          console.error(`Created output directory: ${outputDir}`);
        }

        const uniqueId = generateUniqueId();
        const htmlContent = await fs.readFile(inputPath, "utf-8");
        const dom = new JSDOM(htmlContent);
        const { document } = dom.window;

        // Remove unnecessary tags and attributes
        const unwantedTags = ["script", "style", "iframe", "noscript"];
        const unwantedAttrs = ["onclick", "onload", "onerror", "style"];

        unwantedTags.forEach((tag) => {
          document.querySelectorAll(tag).forEach((el) => el.remove());
        });

        document.querySelectorAll("*").forEach((el) => {
          unwantedAttrs.forEach((attr) => el.removeAttribute(attr));
        });

        const cleanedHtml = dom.serialize();
        const outputPath = path.join(outputDir, `cleaned_${uniqueId}.html`);
        await fs.writeFile(outputPath, cleanedHtml);
        console.error(`Written cleaned HTML to ${outputPath}`);

        return {
          success: true,
          data: `Successfully cleaned HTML and saved to ${outputPath}`,
        };
      } catch (error) {
        console.error(`Error in cleanHtml:`, error);
        return {
          success: false,
          error: error instanceof Error ? error.message : "Unknown error",
        };
      }
    }
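
The handler checks the output directory with `fs.access` before creating it. Because `mkdir` with `recursive: true` does not fail when the directory already exists, the same step could plausibly be written as a single call. The sketch below shows that alternative; `ensureDir` and `demo` are hypothetical names used only here, and this is not how the source does it.

```typescript
import { mkdir, mkdtemp, rm, stat } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper (not in the source): create `dir` and any missing
// parents. With `recursive: true`, an existing directory is not an error.
async function ensureDir(dir: string): Promise<void> {
  await mkdir(dir, { recursive: true });
}

// Small demonstration in a throwaway temp directory: the second call
// succeeds even though the directory was already created by the first.
async function demo(): Promise<boolean> {
  const base = await mkdtemp(join(tmpdir(), "html-cleaner-"));
  const target = join(base, "nested", "out");
  try {
    await ensureDir(target);
    await ensureDir(target);
    return (await stat(target)).isDirectory();
  } finally {
    // Clean up the temporary tree regardless of the outcome.
    await rm(base, { recursive: true, force: true });
  }
}
```

The single-call form drops the "exists" versus "created" log distinction that the handler prints, which may or may not matter for debugging.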
src/tools/htmlTools.ts:13-30 (schema)
The Tool object definition providing the name, description, and inputSchema for the 'html_cleaner' tool, used for tool listing and validation.
    export const HTML_CLEAN_TOOL: Tool = {
      name: "html_cleaner",
      description: "Clean HTML by removing unnecessary tags and attributes",
      inputSchema: {
        type: "object",
        properties: {
          inputPath: {
            type: "string",
            description: "Path to the input HTML file",
          },
          outputDir: {
            type: "string",
            description: "Directory where cleaned HTML should be saved",
          },
        },
        required: ["inputPath", "outputDir"],
      },
    };
src/index.ts:150-166 (registration)
The dispatch logic in the MCP server's CallToolRequest handler that matches the tool name 'html_cleaner', extracts arguments, calls the cleanHtml handler, and formats the response.
    if (name === "html_cleaner") {
      const { inputPath, outputDir } = args as {
        inputPath: string;
        outputDir: string;
      };
      const result = await cleanHtml(inputPath, outputDir);
      if (!result.success) {
        return {
          content: [{ type: "text", text: `Error: ${result.error}` }],
          isError: true,
        };
      }
      return {
        content: [{ type: "text", text: fileOperationResponse(result.data) }],
        isError: false,
      };
    }
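
The dispatch code narrows `args` with a TypeScript `as` cast, which is a compile-time assertion and performs no runtime check. If runtime validation were wanted at this point, one option is a small type guard like the sketch below. `isHtmlCleanerArgs` and `describeArgs` are hypothetical names for illustration and do not appear in the source.

```typescript
// Shape expected by the html_cleaner dispatch branch.
interface HtmlCleanerArgs {
  inputPath: string;
  outputDir: string;
}

// Hypothetical type guard (not in the source): true only when both
// required fields are present and are strings.
function isHtmlCleanerArgs(value: unknown): value is HtmlCleanerArgs {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record.inputPath === "string" && typeof record.outputDir === "string";
}

// Illustrative use: produce an error message instead of passing
// malformed arguments on to the handler.
function describeArgs(args: unknown): string {
  return isHtmlCleanerArgs(args)
    ? `ok: ${args.inputPath} -> ${args.outputDir}`
    : "Error: inputPath and outputDir must both be strings";
}

console.log(describeArgs({ inputPath: "page.html", outputDir: "out" }));
console.log(describeArgs({ inputPath: "page.html" }));
```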
src/tools/_index.ts:5-9 (registration)
The import of HTML_CLEAN_TOOL from htmlTools.js and its inclusion in the exported 'tools' array, which is used by the MCP server to list available tools.
    import { HTML_CLEAN_TOOL, HTML_EXTRACT_RESOURCES_TOOL, HTML_FORMAT_TOOL, HTML_TO_MARKDOWN_TOOL, HTML_TO_TEXT_TOOL } from "./htmlTools.js";
    import { PDF_MERGE_TOOL, PDF_SPLIT_TOOL } from "./pdfTools.js";
    import { TEXT_DIFF_TOOL, TEXT_ENCODING_CONVERT_TOOL, TEXT_FORMAT_TOOL, TEXT_SPLIT_TOOL } from "./txtTools.js";

    export const tools = [
      DOCUMENT_READER_TOOL,
      PDF_MERGE_TOOL,
      PDF_SPLIT_TOOL,
      DOCX_TO_PDF_TOOL,
      DOCX_TO_HTML_TOOL,
      HTML_CLEAN_TOOL,
      HTML_TO_TEXT_TOOL,
      HTML_TO_MARKDOWN_TOOL,
      HTML_EXTRACT_RESOURCES_TOOL,
      HTML_FORMAT_TOOL,
      TEXT_DIFF_TOOL,
      TEXT_SPLIT_TOOL,
      TEXT_FORMAT_TOOL,
      TEXT_ENCODING_CONVERT_TOOL,
      EXCEL_READ_TOOL,
      FORMAT_CONVERTER_TOOL,
    ];
src/tools/htmlTools.ts:8-10 (helper)
Helper function to generate a unique ID for output filenames, used in the cleanHtml handler.
    function generateUniqueId(): string {
      return randomBytes(9).toString("hex");
    }
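
To make the resulting filenames concrete: nine random bytes encode to 18 hexadecimal characters, so each output file is named `cleaned_` followed by an 18-character ID and `.html`. The sketch below reproduces the helper with its import and adds `cleanedOutputPath`, a hypothetical function that mirrors how the handler joins the path.

```typescript
import { randomBytes } from "node:crypto";
import { join } from "node:path";

// Same body as the helper in the source: 9 random bytes, hex-encoded.
function generateUniqueId(): string {
  return randomBytes(9).toString("hex");
}

// Hypothetical helper mirroring how the handler builds its output path.
function cleanedOutputPath(outputDir: string): string {
  return join(outputDir, `cleaned_${generateUniqueId()}.html`);
}

const id = generateUniqueId();
console.log(id.length); // 18: each byte becomes two hex characters
console.log(cleanedOutputPath("out")); // the ID part differs on every run
```

Because the ID is random rather than derived from the input, cleaning the same file twice produces two separate output files.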
