html_cleaner
Cleans an HTML file by stripping script, style, iframe, and noscript elements along with the onclick, onload, onerror, and style attributes, then saves the simplified markup to a specified directory.
Instructions
Clean HTML by removing unnecessary tags and attributes
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| inputPath | Yes | Path to the input HTML file | |
| outputDir | Yes | Directory where cleaned HTML should be saved | |
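
As a rough illustration of how these two arguments might be supplied, the sketch below builds a JSON-RPC `tools/call` request. It assumes the server is reached over the Model Context Protocol, which the `Tool` type and `ListToolsRequestSchema` in the implementation reference suggest. The helper `buildHtmlCleanerCall`, the `HtmlCleanerArgs` interface, the request `id`, and both paths are hypothetical names chosen for this example and are not part of the server's code.

```typescript
// Argument shape taken from the input schema table above.
interface HtmlCleanerArgs {
  inputPath: string;
  outputDir: string;
}

// Hypothetical helper: builds a JSON-RPC "tools/call" request for html_cleaner.
// Assumption: the server speaks MCP; the id value is arbitrary.
function buildHtmlCleanerCall(args: HtmlCleanerArgs, id = 1) {
  // Both fields are marked as required in the schema, so reject empty values early.
  for (const key of ["inputPath", "outputDir"] as const) {
    if (typeof args[key] !== "string" || args[key].length === 0) {
      throw new Error(`Missing required argument: ${key}`);
    }
  }
  return {
    jsonrpc: "2.0" as const,
    id,
    method: "tools/call" as const,
    params: { name: "html_cleaner", arguments: { ...args } },
  };
}

// Hypothetical paths, for illustration only.
const request = buildHtmlCleanerCall({
  inputPath: "/tmp/page.html",
  outputDir: "/tmp/cleaned",
});
console.log(JSON.stringify(request, null, 2));
```

Validating on the client side is optional; it only makes a missing argument fail before the request is sent.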
Implementation Reference
- src/tools/htmlTools.ts:113-162 (handler) The handler function 'cleanHtml' that implements the html_cleaner tool. It reads the input file, parses it with JSDOM, removes unwanted tags (script, style, iframe, noscript) and attributes (onclick, onload, onerror, style), then writes the cleaned HTML to a uniquely named file in the output directory.

      export async function cleanHtml(inputPath: string, outputDir: string) {
        try {
          console.error(`Starting HTML cleaning...`);
          console.error(`Input file: ${inputPath}`);
          console.error(`Output directory: ${outputDir}`);

          // Ensure the output directory exists
          try {
            await fs.access(outputDir);
            console.error(`Output directory exists: ${outputDir}`);
          } catch {
            console.error(`Creating output directory: ${outputDir}`);
            await fs.mkdir(outputDir, { recursive: true });
            console.error(`Created output directory: ${outputDir}`);
          }

          const uniqueId = generateUniqueId();
          const htmlContent = await fs.readFile(inputPath, "utf-8");
          const dom = new JSDOM(htmlContent);
          const { document } = dom.window;

          // Remove unnecessary tags and attributes
          const unwantedTags = ["script", "style", "iframe", "noscript"];
          const unwantedAttrs = ["onclick", "onload", "onerror", "style"];

          unwantedTags.forEach((tag) => {
            document.querySelectorAll(tag).forEach((el) => el.remove());
          });

          document.querySelectorAll("*").forEach((el) => {
            unwantedAttrs.forEach((attr) => el.removeAttribute(attr));
          });

          const cleanedHtml = dom.serialize();
          const outputPath = path.join(outputDir, `cleaned_${uniqueId}.html`);
          await fs.writeFile(outputPath, cleanedHtml);
          console.error(`Written cleaned HTML to ${outputPath}`);

          return {
            success: true,
            data: `Successfully cleaned HTML and saved to ${outputPath}`,
          };
        } catch (error) {
          console.error(`Error in cleanHtml:`, error);
          return {
            success: false,
            error: error instanceof Error ? error.message : "Unknown error",
          };
        }
      }

- src/tools/htmlTools.ts:13-30 (schema) The tool definition/schema for 'html_cleaner', including its name, description, and inputSchema (inputPath, outputDir).

      export const HTML_CLEAN_TOOL: Tool = {
        name: "html_cleaner",
        description: "Clean HTML by removing unnecessary tags and attributes",
        inputSchema: {
          type: "object",
          properties: {
            inputPath: {
              type: "string",
              description: "Path to the input HTML file",
            },
            outputDir: {
              type: "string",
              description: "Directory where cleaned HTML should be saved",
            },
          },
          required: ["inputPath", "outputDir"],
        },
      };

- src/index.ts:150-166 (registration) The dispatch branch in index.ts where the server routes 'html_cleaner' requests to the cleanHtml function.

      if (name === "html_cleaner") {
        const { inputPath, outputDir } = args as {
          inputPath: string;
          outputDir: string;
        };
        const result = await cleanHtml(inputPath, outputDir);
        if (!result.success) {
          return {
            content: [{ type: "text", text: `Error: ${result.error}` }],
            isError: true,
          };
        }
        return {
          content: [{ type: "text", text: fileOperationResponse(result.data) }],
          isError: false,
        };
      }

- src/tools/_index.ts:9-9 (registration) The tool is registered in the 'tools' array exported from _index.ts, which is used by the ListToolsRequestSchema handler.

      export const tools = [DOCUMENT_READER_TOOL, PDF_MERGE_TOOL, PDF_SPLIT_TOOL, DOCX_TO_PDF_TOOL, DOCX_TO_HTML_TOOL, HTML_CLEAN_TOOL, HTML_TO_TEXT_TOOL, HTML_TO_MARKDOWN_TOOL, HTML_EXTRACT_RESOURCES_TOOL, HTML_FORMAT_TOOL, TEXT_DIFF_TOOL, TEXT_SPLIT_TOOL, TEXT_FORMAT_TOOL, TEXT_ENCODING_CONVERT_TOOL, EXCEL_READ_TOOL, FORMAT_CONVERTER_TOOL];

- src/tools/htmlTools.ts:8-10 (helper) Helper function 'generateUniqueId' used by cleanHtml to generate unique output filenames.

      function generateUniqueId(): string {
        return randomBytes(9).toString("hex");
      }
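
To make the output naming concrete, the following sketch reproduces the helper and the `path.join` call from the handler in isolation, using only Node's standard library. Nine random bytes rendered as hex give an 18-character id, so each call should yield a file named `cleaned_<18 hex characters>.html`. The wrapper `cleanedOutputPath` and the directory `/tmp/cleaned` are hypothetical and exist only for this illustration.

```typescript
import { randomBytes } from "node:crypto";
import * as path from "node:path";

// Same body as the helper shown above: 9 random bytes rendered as hex.
function generateUniqueId(): string {
  return randomBytes(9).toString("hex");
}

// Hypothetical wrapper that mirrors how cleanHtml builds its output path.
function cleanedOutputPath(outputDir: string): string {
  return path.join(outputDir, `cleaned_${generateUniqueId()}.html`);
}

// "/tmp/cleaned" is a placeholder directory used only for illustration.
const first = cleanedOutputPath("/tmp/cleaned");
const second = cleanedOutputPath("/tmp/cleaned");

// Prints the generated file name; the hex part differs on every run.
console.log(path.basename(first));
// Checks the expected shape: "cleaned_", 18 lowercase hex characters, ".html".
console.log(/^cleaned_[0-9a-f]{18}\.html$/.test(path.basename(first)));
// Two calls are expected to differ because each draws fresh random bytes.
console.log(first !== second);
```

Because the id is random rather than derived from the input, cleaning the same file twice produces two separate output files instead of overwriting the first.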