html_cleaner
Remove unnecessary HTML tags and attributes to clean document files for processing. Specify input file path and output directory to streamline HTML content.
Instructions
Clean HTML by removing unnecessary tags and attributes
Input Schema
TableJSON Schema
| Name | Required | Description | Default |
|---|---|---|---|
| inputPath | Yes | Path to the input HTML file | |
| outputDir | Yes | Directory where cleaned HTML should be saved |
Implementation Reference
- src/tools/htmlTools.ts:113-162 (handler)Core handler function implementing the html_cleaner tool logic: loads HTML with JSDOM, removes unwanted tags (script, style, etc.) and attributes (onclick, etc.), serializes cleaned HTML, saves to unique filename in output directory.export async function cleanHtml(inputPath: string, outputDir: string) { try { console.error(`Starting HTML cleaning...`); console.error(`Input file: ${inputPath}`); console.error(`Output directory: ${outputDir}`); // 確保輸出目錄存在 try { await fs.access(outputDir); console.error(`Output directory exists: ${outputDir}`); } catch { console.error(`Creating output directory: ${outputDir}`); await fs.mkdir(outputDir, { recursive: true }); console.error(`Created output directory: ${outputDir}`); } const uniqueId = generateUniqueId(); const htmlContent = await fs.readFile(inputPath, "utf-8"); const dom = new JSDOM(htmlContent); const { document } = dom.window; // 移除不必要的標籤和屬性 const unwantedTags = ["script", "style", "iframe", "noscript"]; const unwantedAttrs = ["onclick", "onload", "onerror", "style"]; unwantedTags.forEach((tag) => { document.querySelectorAll(tag).forEach((el) => el.remove()); }); document.querySelectorAll("*").forEach((el) => { unwantedAttrs.forEach((attr) => el.removeAttribute(attr)); }); const cleanedHtml = dom.serialize(); const outputPath = path.join(outputDir, `cleaned_${uniqueId}.html`); await fs.writeFile(outputPath, cleanedHtml); console.error(`Written cleaned HTML to ${outputPath}`); return { success: true, data: `Successfully cleaned HTML and saved to ${outputPath}`, }; } catch (error) { console.error(`Error in cleanHtml:`, error); return { success: false, error: error instanceof Error ? error.message : "Unknown error", }; } }
- src/tools/htmlTools.ts:13-30 (schema)Schema definition for the html_cleaner tool, specifying name, description, and input parameters (inputPath and outputDir).export const HTML_CLEAN_TOOL: Tool = { name: "html_cleaner", description: "Clean HTML by removing unnecessary tags and attributes", inputSchema: { type: "object", properties: { inputPath: { type: "string", description: "Path to the input HTML file", }, outputDir: { type: "string", description: "Directory where cleaned HTML should be saved", }, }, required: ["inputPath", "outputDir"], }, };
- src/tools/_index.ts:5-9 (registration)Registers the html_cleaner tool (as HTML_CLEAN_TOOL) in the central tools array used for listing available tools in MCP server.import { HTML_CLEAN_TOOL, HTML_EXTRACT_RESOURCES_TOOL, HTML_FORMAT_TOOL, HTML_TO_MARKDOWN_TOOL, HTML_TO_TEXT_TOOL } from "./htmlTools.js"; import { PDF_MERGE_TOOL, PDF_SPLIT_TOOL } from "./pdfTools.js"; import { TEXT_DIFF_TOOL, TEXT_ENCODING_CONVERT_TOOL, TEXT_FORMAT_TOOL, TEXT_SPLIT_TOOL } from "./txtTools.js"; export const tools = [DOCUMENT_READER_TOOL, PDF_MERGE_TOOL, PDF_SPLIT_TOOL, DOCX_TO_PDF_TOOL, DOCX_TO_HTML_TOOL, HTML_CLEAN_TOOL, HTML_TO_TEXT_TOOL, HTML_TO_MARKDOWN_TOOL, HTML_EXTRACT_RESOURCES_TOOL, HTML_FORMAT_TOOL, TEXT_DIFF_TOOL, TEXT_SPLIT_TOOL, TEXT_FORMAT_TOOL, TEXT_ENCODING_CONVERT_TOOL, EXCEL_READ_TOOL, FORMAT_CONVERTER_TOOL];
- src/index.ts:150-166 (handler)MCP server request handler that dispatches calls to 'html_cleaner' by invoking cleanHtml function and formatting the response.if (name === "html_cleaner") { const { inputPath, outputDir } = args as { inputPath: string; outputDir: string; }; const result = await cleanHtml(inputPath, outputDir); if (!result.success) { return { content: [{ type: "text", text: `Error: ${result.error}` }], isError: true, }; } return { content: [{ type: "text", text: fileOperationResponse(result.data) }], isError: false, }; }
- src/tools/htmlTools.ts:8-10 (helper)Helper utility to generate unique hexadecimal IDs for output filenames used in html_cleaner.function generateUniqueId(): string { return randomBytes(9).toString("hex"); }