# Broken Link Checker MCP Server - Usage Guide
## Installation
1. Clone or download this repository
2. Install dependencies:
```bash
pip install -e .
```
For development with testing:
```bash
pip install -e ".[dev]"
```
## Starting the Server
Start the MCP server with HTTP transport:
```bash
python src/server.py
```
The server will run on `http://127.0.0.1:8000` by default.
You should see output like:
```
2025-10-11 12:00:00 - src.server - INFO - Starting Broken Link Checker MCP Server on HTTP transport
```
## Using the Server with Claude
Once the server is running, you can connect to it using Claude or any other MCP client that supports HTTP transport.
### Configuring Claude Desktop
Add the following to your Claude Desktop configuration:
```json
{
  "mcpServers": {
    "broken-link-checker": {
      "url": "http://127.0.0.1:8000"
    }
  }
}
```
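You can also drive the server from your own scripts. The sketch below uses the official `mcp` Python SDK; the `/mcp` endpoint path is an assumption (the SDK's default route for streamable HTTP) and should be adjusted to match however `src/server.py` exposes its transport.

```python
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

# Assumed endpoint path; change it if the server mounts the transport elsewhere.
SERVER_URL = "http://127.0.0.1:8000/mcp"


async def main() -> None:
    async with streamablehttp_client(SERVER_URL) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "check_page", {"url": "https://example.com"}
            )
            print(result)


asyncio.run(main())
```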
## Available Tools
### 1. check_page
Check all links on a single web page.
**Usage in Claude:**
```
Please check this page for broken links: https://example.com
```
**Parameters:**
- `url` (string, required): The URL of the page to check
**Example Response:**
```json
{
  "results": [
    {
      "page_url": "https://example.com",
      "link_reference": "About Us",
      "link_url": "https://example.com/about",
      "status": "Good"
    },
    {
      "page_url": "https://example.com",
      "link_reference": "<img alt='logo'>",
      "link_url": "https://example.com/missing-image.png",
      "status": "Bad"
    }
  ],
  "summary": {
    "total_links": 15,
    "good_links": 14,
    "bad_links": 1,
    "pages_scanned": 1
  }
}
```
### 2. check_domain
Recursively check all pages within a domain for broken links.
**Usage in Claude:**
```
Please check the entire example.com domain for broken links, with a maximum depth of 2 levels.
```
**Parameters:**
- `url` (string, required): The root URL of the domain to check
- `max_depth` (integer, optional): Maximum crawl depth. Default is -1 (unlimited)
- `-1`: Unlimited depth (crawl all reachable pages)
- `0`: Only check the starting page
- `1`: Check starting page and pages directly linked from it
- `2`: Check starting page, directly linked pages, and pages linked from those
**Example Response:**
```json
{
  "results": [
    {
      "page_url": "https://example.com",
      "link_reference": "Contact",
      "link_url": "https://example.com/contact",
      "status": "Good"
    },
    {
      "page_url": "https://example.com/about",
      "link_reference": "<script>",
      "link_url": "https://cdn.example.com/script.js",
      "status": "Good"
    },
    {
      "page_url": "https://example.com/products",
      "link_reference": "Old Product",
      "link_url": "https://example.com/product/discontinued",
      "status": "Bad"
    }
  ],
  "summary": {
    "total_links": 150,
    "good_links": 145,
    "bad_links": 5,
    "pages_scanned": 10
  }
}
```
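Because the response is plain JSON, post-processing is easy. For example, the short script below (a hypothetical helper, not part of this repository; `report.json` is just a placeholder filename) prints only the broken links, grouped by the page they appear on:

```python
import json
from collections import defaultdict

with open("report.json") as fh:  # a saved check_page/check_domain response
    report = json.load(fh)

bad_by_page: dict[str, list[dict]] = defaultdict(list)
for entry in report["results"]:
    if entry["status"] == "Bad":
        bad_by_page[entry["page_url"]].append(entry)

for page, entries in bad_by_page.items():
    print(page)
    for entry in entries:
        print(f"  {entry['link_reference']} -> {entry['link_url']}")

summary = report["summary"]
print(f"\n{summary['bad_links']} broken link(s) on {summary['pages_scanned']} page(s)")
```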
## Understanding Results
### Link Reference
The `link_reference` field shows:
- **For hyperlinks**: The anchor text (e.g., "Click here", "About Us")
- **For images**: `<img alt="description">` or just `<img>`
- **For scripts**: `<script>`
- **For stylesheets**: `<link rel="stylesheet">`
- **For media**: `<video>`, `<audio>`, `<video><source>`, `<audio><source>`
- **For iframes**: `<iframe>`
### Link Status
- **Good**: Link is accessible (HTTP 2xx or 3xx status codes)
- **Bad**: Link is broken (HTTP 4xx, 5xx, timeout, DNS failure, or other errors)
### Summary Statistics
- `total_links`: Total number of links checked across all pages
- `good_links`: Number of working links
- `bad_links`: Number of broken links
- `pages_scanned`: Number of pages crawled (1 for single-page checks)
## Features
### Politeness Controls
The checker implements several politeness features:
- **robots.txt compliance**: Respects robots.txt directives
- **Rate limiting**: 1-second delay between requests to the same domain by default
- **Crawl delay**: Automatically applies crawl delay if specified in robots.txt
- **User agent**: Identifies itself as "BrokenLinkChecker/1.0"
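How the server implements these controls is internal to its code, but the general pattern can be sketched with the standard library's `urllib.robotparser` (a simplified illustration, not the project's actual code):

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "BrokenLinkChecker/1.0"
DEFAULT_DELAY = 1.0  # seconds between requests to the same domain

_robots: dict[str, RobotFileParser] = {}
_last_request: dict[str, float] = {}


def allowed(url: str) -> bool:
    """Check robots.txt before fetching (parsers are cached per domain)."""
    domain = urlparse(url).netloc
    if domain not in _robots:
        parser = RobotFileParser()
        parser.set_url(f"https://{domain}/robots.txt")  # assumes HTTPS sites
        parser.read()
        _robots[domain] = parser
    return _robots[domain].can_fetch(USER_AGENT, url)


def wait_politely(url: str) -> None:
    """Honor the robots.txt crawl delay, or the 1-second default."""
    domain = urlparse(url).netloc
    parser = _robots.get(domain)
    delay = (parser.crawl_delay(USER_AGENT) if parser else None) or DEFAULT_DELAY
    elapsed = time.monotonic() - _last_request.get(domain, 0.0)
    if elapsed < delay:
        time.sleep(delay - elapsed)
    _last_request[domain] = time.monotonic()
```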
### What Gets Checked
The tool validates:
- Hyperlinks (`<a href>`)
- Images (`<img src>`)
- Scripts (`<script src>`)
- Stylesheets (`<link href>`)
- Video sources (`<video>`, `<source>`)
- Audio sources (`<audio>`, `<source>`)
- Iframes (`<iframe src>`)
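Every element that references an external resource is collected and validated. A simplified version of that extraction with BeautifulSoup (illustrative only; the project's parser may differ) looks like this:

```python
from bs4 import BeautifulSoup

# Tag name -> attribute that holds the resource URL.
URL_ATTRS = {
    "a": "href",
    "img": "src",
    "script": "src",
    "link": "href",
    "source": "src",
    "iframe": "src",
    "video": "src",
    "audio": "src",
}


def extract_links(html: str) -> list[str]:
    """Return every candidate URL referenced by the page."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for tag_name, attr in URL_ATTRS.items():
        for tag in soup.find_all(tag_name):
            value = tag.get(attr)
            if value:
                urls.append(value)
    return urls
```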
### Crawling Behavior
When checking domains:
- Only follows hyperlinks for crawling (not images, scripts, etc.)
- Only follows internal links (same domain)
- Validates ALL link types on each page
- Prevents duplicate page visits
- Respects max depth setting
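Conceptually this is a breadth-first traversal with a visited set and a depth counter. The sketch below shows the shape of that loop; `fetch`, `extract_hyperlinks`, and `validate_links` are hypothetical helpers standing in for the HTTP fetch, link extraction, and link validation steps (the server's real implementation may be structured differently):

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl_domain(start_url: str, max_depth: int = -1) -> None:
    root_domain = urlparse(start_url).netloc
    visited = {start_url}
    queue = deque([(start_url, 0)])

    while queue:
        page_url, depth = queue.popleft()
        html = fetch(page_url)            # hypothetical: download the page
        validate_links(page_url, html)    # hypothetical: check ALL link types

        if max_depth != -1 and depth >= max_depth:
            continue  # deep enough: don't follow this page's links further

        for href in extract_hyperlinks(html):   # hypothetical: <a href> only
            absolute = urljoin(page_url, href)
            if urlparse(absolute).netloc != root_domain:
                continue  # external link: validated above, but never crawled
            if absolute not in visited:
                visited.add(absolute)
                queue.append((absolute, depth + 1))
```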
### Timeouts and Errors
- Request timeout: 10 seconds per link
- Follows redirects automatically (up to 10 redirects)
- Uses HEAD requests first for efficiency, falls back to GET if needed
- Handles DNS failures, connection errors, and timeouts gracefully
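The HEAD-then-GET strategy avoids downloading response bodies for most links. A minimal sketch with the `requests` library (illustrative only; the server may use a different HTTP client and stricter redirect limits):

```python
import requests

TIMEOUT = 10  # seconds, matching the limit described above
HEADERS = {"User-Agent": "BrokenLinkChecker/1.0"}


def link_status(url: str) -> str:
    """Return "Good" for 2xx/3xx responses, "Bad" for anything else."""
    try:
        response = requests.head(url, headers=HEADERS, timeout=TIMEOUT,
                                 allow_redirects=True)
        # Some servers reject or mishandle HEAD; retry once with GET.
        if response.status_code >= 400:
            response = requests.get(url, headers=HEADERS, timeout=TIMEOUT,
                                    allow_redirects=True, stream=True)
        return "Good" if response.status_code < 400 else "Bad"
    except requests.RequestException:
        # Covers timeouts, DNS failures, connection errors, redirect loops.
        return "Bad"
```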
## Example Use Cases
### 1. Quick Page Check
Check a single page before deploying:
```
Check https://mysite.com/new-page for any broken links
```
### 2. Full Site Audit
Check entire site with depth limit:
```
Check all pages on https://mysite.com for broken links, limiting to 3 levels deep
```
### 3. Documentation Site
Check documentation with unlimited depth:
```
Please scan https://docs.myproject.com completely for broken links
```
### 4. Blog Posts
Check specific blog post:
```
Verify all links work on https://blog.example.com/2025/01/new-post
```
## Performance Considerations
### Single Page Checks
- Fast, typically completes in seconds
- Concurrent validation of all links on the page
- No crawling overhead
### Domain Checks
- Can take significant time for large sites
- Time depends on:
- Number of pages to crawl
- Number of links per page
- Rate limiting delays
- Network latency
- Server response times
- Consider using `max_depth` to limit scope
### Recommended Depth Limits
- **Small sites (< 50 pages)**: Unlimited (`-1`)
- **Medium sites (50-500 pages)**: Depth 3-5
- **Large sites (> 500 pages)**: Depth 2-3
- **Very large sites**: Use single page checks on critical pages instead
## Troubleshooting
### Server won't start
- Check if port 8000 is available: `lsof -i :8000`
- Verify all dependencies are installed: `pip install -e .`
### All links show as "Bad"
- Check your internet connection
- Verify the target site is accessible
- Check firewall settings
### Domain check times out
- Reduce `max_depth` parameter
- Check if the site has many pages
- Verify robots.txt isn't blocking the checker
### Pages skipped during crawl
- Check robots.txt for disallowed paths
- Verify links are internal (same domain)
- Check log output for details
## Running Tests
Run the test suite:
```bash
pytest
```
Run with coverage:
```bash
pytest --cov=src --cov-report=html
```
## Logging
The server logs detailed information about its operations. To see debug output, modify the logging level in `src/server.py`:
```python
logging.basicConfig(
    level=logging.DEBUG,  # Change from INFO to DEBUG
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
```