MCP Windows Website Downloader Server
by angrysky56
# MCP Website Downloader
Simple MCP server for downloading documentation websites and preparing them for RAG indexing.
## Features
- Downloads complete documentation sites, well big chunks anyway.
- Maintains link structure and navigation, not really. lol
- Downloads and organizes assets (CSS, JS, images), but isn't really AI friendly and it all probably needs some kind of parsing or vectorizing into a db or something.
- Creates clean index for RAG systems, currently seems to make an index in each folder, not even looked at it.
- Simple single-purpose MCP interface, yup.
## Installation
Fork and download, cd to the repository.
```bash
uv venv
./venv/Scripts/activate
pip install -e .
```
Put this in your claude_desktop_config.json with your own paths:
```json
"mcp-windows-website-downloader": {
"command": "uv",
"args": [
"--directory",
"F:/GithubRepos/mcp-windows-website-downloader",
"run",
"mcp-windows-website-downloader",
"--library",
"F:/GithubRepos/mcp-windows-website-downloader/website_library"
]
},
```

## Other Usage you don't need to worry about and may be hallucinatory lol:
1. Start the server:
```bash
python -m mcp_windows_website_downloader.server --library docs_library
```
2. Use through Claude Desktop or other MCP clients:
```python
result = await server.call_tool("download", {
"url": "https://docs.example.com"
})
```
## Output Structure
```
docs_library/
domain_name/
index.html
about.html
docs/
getting-started.html
...
assets/
css/
js/
images/
fonts/
rag_index.json
```
## Development
The server follows standard MCP architecture:
```
src/
mcp_windows_website_downloader/
__init__.py
server.py # MCP server implementation
core.py # Core downloader functionality
utils.py # Helper utilities
```
### Components
- `server.py`: Main MCP server implementation that handles tool registration and requests
- `core.py`: Core website downloading functionality with proper asset handling
- `utils.py`: Helper utilities for file handling and URL processing
### Design Principles
1. Single Responsibility
- Each module has one clear purpose
- Server handles MCP interface
- Core handles downloading
- Utils handles common operations
2. Clean Structure
- Maintains original site structure
- Organizes assets by type
- Creates clear index for RAG systems
3. Robust Operation
- Proper error handling
- Reasonable depth limits
- Asset download verification
- Clean URL/path processing
### RAG Index
The `rag_index.json` file contains:
```json
{
"url": "https://docs.example.com",
"domain": "docs.example.com",
"pages": 42,
"path": "/path/to/site"
}
```
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Submit a pull request
## License
MIT License - See LICENSE file
## Error Handling
The server handles common issues:
- Invalid URLs
- Network errors
- Asset download failures
- Malformed HTML
- Deep recursion
- File system errors
Error responses follow the format:
```json
{
"status": "error",
"error": "Detailed error message"
}
```
Success responses:
```json
{
"status": "success",
"path": "/path/to/downloaded/site",
"pages": 42
}
```