# Broken Link Checker MCP Server - Usage Guide
## Installation
1. Clone or download this repository
2. Install dependencies:
```bash
pip install -e .
```
For development with testing:
```bash
pip install -e ".[dev]"
```
## Starting the Server
Start the MCP server with HTTP transport:
```bash
python src/server.py
```
The server will run on `http://127.0.0.1:8000` by default.
You should see output like:
```
2025-10-11 12:00:00 - src.server - INFO - Starting Broken Link Checker MCP Server on HTTP transport
```
## Using the Server with Claude
Once the server is running, you can connect to it using Claude or any other MCP client that supports HTTP transport.
### Configuring Claude Desktop
Add the following to your Claude Desktop configuration:
```json
{
  "mcpServers": {
    "broken-link-checker": {
      "url": "http://127.0.0.1:8000"
    }
  }
}
```
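You can also drive the server from your own scripts. The sketch below uses the official `mcp` Python SDK; the `/mcp` endpoint path is an assumption (the SDK's default route for streamable HTTP) and should be adjusted to match however `src/server.py` exposes its transport.

```python
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

# Assumed endpoint path; change it if the server mounts the transport elsewhere.
SERVER_URL = "http://127.0.0.1:8000/mcp"


async def main() -> None:
    async with streamablehttp_client(SERVER_URL) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "check_page", {"url": "https://example.com"}
            )
            print(result)


asyncio.run(main())
```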
## Available Tools
### 1. check_page
Check all links on a single web page.
**Usage in Claude:**
```
Please check this page for broken links: https://example.com
```
**Parameters:**
- `url` (string, required): The URL of the page to check
**Example Response:**
```json
{
  "results": [
    {
      "page_url": "https://example.com",
      "link_reference": "About Us",
      "link_url": "https://example.com/about",
      "status": "Good"
    },
    {
      "page_url": "https://example.com",
      "link_reference": "<img alt='logo'>",
      "link_url": "https://example.com/missing-image.png",
      "status": "Bad"
    }
  ],
  "summary": {
    "total_links": 15,
    "good_links": 14,
    "bad_links": 1,
    "pages_scanned": 1
  }
}
```
### 2. check_domain
Recursively check all pages within a domain for broken links.
**Usage in Claude:**
```
Please check the entire example.com domain for broken links, with a maximum depth of 2 levels.
```
**Parameters:**
- `url` (string, required): The root URL of the domain to check
- `max_depth` (integer, optional): Maximum crawl depth. Default is -1 (unlimited)
- `-1`: Unlimited depth (crawl all reachable pages)
- `0`: Only check the starting page
- `1`: Check starting page and pages directly linked from it
- `2`: Check starting page, directly linked pages, and pages linked from those
**Example Response:**
```json
{
  "results": [
    {
      "page_url": "https://example.com",
      "link_reference": "Contact",
      "link_url": "https://example.com/contact",
      "status": "Good"
    },
    {
      "page_url": "https://example.com/about",
      "link_reference": "<script>",
      "link_url": "https://cdn.example.com/script.js",
      "status": "Good"
    },
    {
      "page_url": "https://example.com/products",
      "link_reference": "Old Product",
      "link_url": "https://example.com/product/discontinued",
      "status": "Bad"
    }
  ],
  "summary": {
    "total_links": 150,
    "good_links": 145,
    "bad_links": 5,
    "pages_scanned": 10
  }
}
```
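Because the response is plain JSON, post-processing is easy. For example, the short script below (a hypothetical helper, not part of this repository; `report.json` is just a placeholder filename) prints only the broken links, grouped by the page they appear on:

```python
import json
from collections import defaultdict

with open("report.json") as fh:  # a saved check_page/check_domain response
    report = json.load(fh)

bad_by_page: dict[str, list[dict]] = defaultdict(list)
for entry in report["results"]:
    if entry["status"] == "Bad":
        bad_by_page[entry["page_url"]].append(entry)

for page, entries in bad_by_page.items():
    print(page)
    for entry in entries:
        print(f"  {entry['link_reference']} -> {entry['link_url']}")

summary = report["summary"]
print(f"\n{summary['bad_links']} broken link(s) on {summary['pages_scanned']} page(s)")
```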
## Understanding Results
### Link Reference
The `link_reference` field shows:
- **For hyperlinks**: The anchor text (e.g., "Click here", "About Us")
- **For images**: `<img alt="description">` or just `<img>`
- **For scripts**: `<script>`
- **For stylesheets**: `<link rel="stylesheet">`
- **For media**: `<video>`, `<audio>`, `<video><source>`, `<audio><source>`
- **For iframes**: `<iframe>`
### Link Status
- **Good**: Link is accessible (HTTP 2xx or 3xx status codes)
- **Bad**: Link is broken (HTTP 4xx, 5xx, timeout, DNS failure, or other errors)
### Summary Statistics
- `total_links`: Total number of links checked across all pages
- `good_links`: Number of working links
- `bad_links`: Number of broken links
- `pages_scanned`: Number of pages crawled (1 for single-page checks)
## Features
### Politeness Controls
The checker implements several politeness features:
- **robots.txt compliance**: Respects robots.txt directives
- **Rate limiting**: 1-second delay between requests to the same domain by default
- **Crawl delay**: Automatically applies crawl delay if specified in robots.txt
- **User agent**: Identifies itself as "BrokenLinkChecker/1.0"
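How the server implements these controls is internal to its code, but the general pattern can be sketched with the standard library's `urllib.robotparser` (a simplified illustration, not the project's actual code):

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "BrokenLinkChecker/1.0"
DEFAULT_DELAY = 1.0  # seconds between requests to the same domain

_robots: dict[str, RobotFileParser] = {}
_last_request: dict[str, float] = {}


def allowed(url: str) -> bool:
    """Check robots.txt before fetching (parsers are cached per domain)."""
    domain = urlparse(url).netloc
    if domain not in _robots:
        parser = RobotFileParser()
        parser.set_url(f"https://{domain}/robots.txt")  # assumes HTTPS sites
        parser.read()
        _robots[domain] = parser
    return _robots[domain].can_fetch(USER_AGENT, url)


def wait_politely(url: str) -> None:
    """Honor the robots.txt crawl delay, or the 1-second default."""
    domain = urlparse(url).netloc
    parser = _robots.get(domain)
    delay = (parser.crawl_delay(USER_AGENT) if parser else None) or DEFAULT_DELAY
    elapsed = time.monotonic() - _last_request.get(domain, 0.0)
    if elapsed < delay:
        time.sleep(delay - elapsed)
    _last_request[domain] = time.monotonic()
```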
### What Gets Checked
The tool validates:
- Hyperlinks (`<a href>`)
- Images (`<img src>`)
- Scripts (`<script src>`)
- Stylesheets (`<link href>`)
- Video sources (`<video>`, `<source>`)
- Audio sources (`<audio>`, `<source>`)
- Iframes (`<iframe src>`)
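Every element that references an external resource is collected and validated. A simplified version of that extraction with BeautifulSoup (illustrative only; the project's parser may differ) looks like this:

```python
from bs4 import BeautifulSoup

# Tag name -> attribute that holds the resource URL.
URL_ATTRS = {
    "a": "href",
    "img": "src",
    "script": "src",
    "link": "href",
    "source": "src",
    "iframe": "src",
    "video": "src",
    "audio": "src",
}


def extract_links(html: str) -> list[str]:
    """Return every candidate URL referenced by the page."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for tag_name, attr in URL_ATTRS.items():
        for tag in soup.find_all(tag_name):
            value = tag.get(attr)
            if value:
                urls.append(value)
    return urls
```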
### Crawling Behavior
When checking domains:
- Only follows hyperlinks for crawling (not images, scripts, etc.)
- Only follows internal links (same domain)
- Validates ALL link types on each page
- Prevents duplicate page visits
- Respects max depth setting
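Conceptually this is a breadth-first traversal with a visited set and a depth counter. The sketch below shows the shape of that loop; `fetch`, `extract_hyperlinks`, and `validate_links` are hypothetical helpers standing in for the HTTP fetch, link extraction, and link validation steps (the server's real implementation may be structured differently):

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl_domain(start_url: str, max_depth: int = -1) -> None:
    root_domain = urlparse(start_url).netloc
    visited = {start_url}
    queue = deque([(start_url, 0)])

    while queue:
        page_url, depth = queue.popleft()
        html = fetch(page_url)            # hypothetical: download the page
        validate_links(page_url, html)    # hypothetical: check ALL link types

        if max_depth != -1 and depth >= max_depth:
            continue  # deep enough: don't follow this page's links further

        for href in extract_hyperlinks(html):   # hypothetical: <a href> only
            absolute = urljoin(page_url, href)
            if urlparse(absolute).netloc != root_domain:
                continue  # external link: validated above, but never crawled
            if absolute not in visited:
                visited.add(absolute)
                queue.append((absolute, depth + 1))
```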
### Timeouts and Errors
- Request timeout: 10 seconds per link
- Follows redirects automatically (up to 10 redirects)
- Uses HEAD requests first for efficiency, falls back to GET if needed
- Handles DNS failures, connection errors, and timeouts gracefully
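The HEAD-then-GET strategy avoids downloading response bodies for most links. A minimal sketch with the `requests` library (illustrative only; the server may use a different HTTP client and stricter redirect limits):

```python
import requests

TIMEOUT = 10  # seconds, matching the limit described above
HEADERS = {"User-Agent": "BrokenLinkChecker/1.0"}


def link_status(url: str) -> str:
    """Return "Good" for 2xx/3xx responses, "Bad" for anything else."""
    try:
        response = requests.head(url, headers=HEADERS, timeout=TIMEOUT,
                                 allow_redirects=True)
        # Some servers reject or mishandle HEAD; retry once with GET.
        if response.status_code >= 400:
            response = requests.get(url, headers=HEADERS, timeout=TIMEOUT,
                                    allow_redirects=True, stream=True)
        return "Good" if response.status_code < 400 else "Bad"
    except requests.RequestException:
        # Covers timeouts, DNS failures, connection errors, redirect loops.
        return "Bad"
```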
## Example Use Cases
### 1. Quick Page Check
Check a single page before deploying:
```
Check https://mysite.com/new-page for any broken links
```
### 2. Full Site Audit
Check entire site with depth limit:
```
Check all pages on https://mysite.com for broken links, limiting to 3 levels deep
```
### 3. Documentation Site
Check documentation with unlimited depth:
```
Please scan https://docs.myproject.com completely for broken links
```
### 4. Blog Posts
Check specific blog post:
```
Verify all links work on https://blog.example.com/2025/01/new-post
```
## Performance Considerations
### Single Page Checks
- Fast, typically completes in seconds
- Concurrent validation of all links on the page
- No crawling overhead
### Domain Checks
- Can take significant time for large sites
- Time depends on:
- Number of pages to crawl
- Number of links per page
- Rate limiting delays
- Network latency
- Server response times
- Consider using `max_depth` to limit scope
### Recommended Depth Limits
- **Small sites (< 50 pages)**: Unlimited (`-1`)
- **Medium sites (50-500 pages)**: Depth 3-5
- **Large sites (> 500 pages)**: Depth 2-3
- **Very large sites**: Use single page checks on critical pages instead
## Troubleshooting
### Server won't start
- Check if port 8000 is available: `lsof -i :8000`
- Verify all dependencies are installed: `pip install -e .`
### All links show as "Bad"
- Check your internet connection
- Verify the target site is accessible
- Check firewall settings
### Domain check times out
- Reduce `max_depth` parameter
- Check if the site has many pages
- Verify robots.txt isn't blocking the checker
### Pages skipped during crawl
- Check robots.txt for disallowed paths
- Verify links are internal (same domain)
- Check log output for details
## Running Tests
Run the test suite:
```bash
pytest
```
Run with coverage:
```bash
pytest --cov=src --cov-report=html
```
## Logging
The server logs detailed information about its operations. To see debug output, modify the logging level in `src/server.py`:
```python
logging.basicConfig(
    level=logging.DEBUG,  # Change from INFO to DEBUG
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
```