check_site
Recursively crawl websites to identify broken links by scanning internal and external URLs across multiple pages, with options to respect robots.txt and limit concurrent requests.
Instructions
Recursively crawl and check all links across an entire website. This scans multiple pages and checks every internal and external link found. Use with caution on large sites, as a full crawl may take significant time.
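A hypothetical invocation, following the standard MCP `tools/call` JSON-RPC shape (the URL and argument values below are placeholders):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "check_site",
    "arguments": {
      "url": "https://example.com",
      "excludeExternalLinks": true,
      "maxSocketsPerHost": 2
    }
  }
}
```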
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | The starting URL of the site to check | |
| excludeExternalLinks | No | If true, only check internal links | `false` |
| honorRobotExclusions | No | If true, respect robots.txt and meta robots tags | `true` |
| maxSocketsPerHost | No | Maximum concurrent requests per host | `1` |
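The optional parameters fall back to their defaults when omitted. A minimal sketch of that mapping, mirroring the option handling in the `CallToolRequestSchema` dispatcher (server.js:215-249); the helper name `buildOptions` is illustrative, not part of the server:

```javascript
// Map tool-call arguments onto broken-link-checker options,
// applying the documented defaults for any omitted field.
function buildOptions(args) {
  return {
    excludeExternalLinks: args.excludeExternalLinks || false, // default: false
    honorRobotExclusions: args.honorRobotExclusions !== false, // default: true
    maxSocketsPerHost: args.maxSocketsPerHost || 1, // default: 1
  };
}
```

Note that `honorRobotExclusions` is compared with `!== false` rather than `||` so that only an explicit `false` disables robots.txt handling.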
Implementation Reference
- **server.js:71-109 (handler)** — Core handler function for the `check_site` tool. Uses `SiteChecker` from `broken-link-checker` to recursively scan the site starting from the given URL, collecting link check results, pages discovered, and any errors encountered.

```javascript
function checkSite(url, options = {}) {
  return new Promise((resolve, reject) => {
    const results = [];
    const errors = [];
    const pages = [];
    const siteChecker = new SiteChecker(options, {
      link: (result) => {
        results.push({
          url: result.url.resolved,
          base: result.base.resolved,
          html: {
            tagName: result.html.tagName,
            text: result.html.text,
          },
          broken: result.broken,
          brokenReason: result.brokenReason,
          excluded: result.excluded,
          excludedReason: result.excludedReason,
          http: {
            statusCode: result.http?.response?.statusCode,
          },
        });
      },
      page: (error, pageUrl) => {
        if (error) {
          errors.push({ pageUrl, error: error.message });
        } else {
          pages.push(pageUrl);
        }
      },
      end: () => {
        resolve({ results, errors, pages });
      },
    });
    siteChecker.enqueue(url);
  });
}
```
- **server.js:146-173 (schema)** — Input schema defining the parameters for the `check_site` tool: required `url`; optional `excludeExternalLinks`, `honorRobotExclusions`, and `maxSocketsPerHost`.

```javascript
inputSchema: {
  type: "object",
  properties: {
    url: {
      type: "string",
      description: "The starting URL of the site to check",
    },
    excludeExternalLinks: {
      type: "boolean",
      description: "If true, only check internal links (default: false)",
      default: false,
    },
    honorRobotExclusions: {
      type: "boolean",
      description: "If true, respect robots.txt and meta robots tags (default: true)",
      default: true,
    },
    maxSocketsPerHost: {
      type: "number",
      description: "Maximum concurrent requests per host (default: 1)",
      default: 1,
    },
  },
  required: ["url"],
},
```
- **server.js:142-174 (registration)** — Registration of the `check_site` tool in the `ListToolsRequestSchema` handler, providing name, description, and input schema.

```javascript
{
  name: "check_site",
  description:
    "Recursively crawl and check all links across an entire website. This will scan multiple pages and check all internal and external links found. Use with caution on large sites as it may take significant time.",
  inputSchema: {
    type: "object",
    properties: {
      url: {
        type: "string",
        description: "The starting URL of the site to check",
      },
      excludeExternalLinks: {
        type: "boolean",
        description: "If true, only check internal links (default: false)",
        default: false,
      },
      honorRobotExclusions: {
        type: "boolean",
        description: "If true, respect robots.txt and meta robots tags (default: true)",
        default: true,
      },
      maxSocketsPerHost: {
        type: "number",
        description: "Maximum concurrent requests per host (default: 1)",
        default: 1,
      },
    },
    required: ["url"],
  },
},
```
- **server.js:215-249 (handler)** — Dispatcher logic in the `CallToolRequestSchema` handler that processes `check_site` tool calls: prepares options, invokes `checkSite`, derives a summary and the list of broken links from the results, and formats the MCP response.

```javascript
} else if (name === "check_site") {
  const options = {
    excludeExternalLinks: args.excludeExternalLinks || false,
    honorRobotExclusions: args.honorRobotExclusions !== false,
    maxSocketsPerHost: args.maxSocketsPerHost || 1,
  };

  const result = await checkSite(args.url, options);
  const brokenLinks = result.results.filter((link) => link.broken);
  const summary = {
    pagesScanned: result.pages.length,
    totalLinks: result.results.length,
    brokenLinks: brokenLinks.length,
    workingLinks: result.results.length - brokenLinks.length,
    errors: result.errors.length,
  };

  return {
    content: [
      {
        type: "text",
        text: JSON.stringify(
          {
            summary,
            brokenLinks,
            pages: result.pages,
            errors: result.errors,
          },
          null,
          2
        ),
      },
    ],
  };
```
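The summary computation in the dispatcher can be traced on sample data. The entries below are fabricated for illustration only, shaped like the objects `checkSite` collects:

```javascript
// Illustrative crawl result in the shape produced by checkSite
// (sample data, not real crawler output).
const result = {
  results: [
    { url: "https://example.com/about", broken: false, http: { statusCode: 200 } },
    { url: "https://example.com/missing", broken: true, brokenReason: "HTTP_404", http: { statusCode: 404 } },
  ],
  errors: [],
  pages: ["https://example.com/"],
};

// Same derivation as the dispatcher: broken links plus a count summary.
const brokenLinks = result.results.filter((link) => link.broken);
const summary = {
  pagesScanned: result.pages.length,
  totalLinks: result.results.length,
  brokenLinks: brokenLinks.length,
  workingLinks: result.results.length - brokenLinks.length,
  errors: result.errors.length,
};

console.log(summary);
// → { pagesScanned: 1, totalLinks: 2, brokenLinks: 1, workingLinks: 1, errors: 0 }
```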