# Spider MCP - Web Search Crawler Service
A web search MCP service based on pure crawler technology, built with Node.js.
## Features
- ✅ **No Official API Required**: Built entirely on crawler technology using Puppeteer, with no dependency on third-party official APIs
- 🔍 **Intelligent Search**: Supports Bing web and news search
- 📰 **News Search**: Built-in news search with time filtering
- 🚀 **High Performance**: Supports batch web scraping
- 📊 **Health Monitoring**: Complete health check and metrics monitoring
- 📝 **Structured Logging**: Uses Winston for structured logs
- 🔒 **Anti-Detection**: Supports User-Agent rotation and other anti-bot measures
- 🔗 **Smart URL Cleaning**: Automatically cleans promotional parameters while preserving essential information
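The URL cleaning described above can be sketched with Node's built-in `URL` API. This is an illustrative sketch only; the exact parameter list the service strips is an assumption, not its actual implementation:

```javascript
// Hypothetical sketch of promotional-parameter stripping (not the service's
// real code). The set of tracking parameters below is an assumption.
const TRACKING_PARAMS = new Set([
  'utm_source', 'utm_medium', 'utm_campaign', 'utm_term', 'utm_content',
  'gclid', 'fbclid', 'msclkid',
]);

function cleanUrl(rawUrl) {
  const url = new URL(rawUrl);
  // Copy keys first so we can delete while iterating safely
  for (const key of [...url.searchParams.keys()]) {
    if (TRACKING_PARAMS.has(key)) url.searchParams.delete(key);
  }
  return url.toString();
}
```

For example, `cleanUrl('https://example.com/page?id=42&utm_source=bing')` keeps the essential `id=42` parameter while dropping `utm_source`.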
## Tech Stack
- **Node.js** (>= 18.0.0)
- **Express.js** - Web framework
- **Puppeteer** - Browser automation
- **Cheerio** - HTML parsing
- **Axios** - HTTP client
- **Winston** - Logging
- **@modelcontextprotocol/sdk** - MCP protocol support
## Quick Start
### 1. Install dependencies
```bash
npm install
```
Or use `pnpm`:
```bash
pnpm install
```
### 2. Download Puppeteer browser
```bash
npx puppeteer browsers install chrome
```
### 3. Environment configuration
Copy and configure the environment variables file:
```bash
cp .env.example .env
```
Edit the `.env` file according to your needs.
### 4. Start the service
Development mode:
```bash
npm run dev
```
Production mode:
```bash
npm start
```
The service will start at `http://localhost:3000`.
## MCP Tools
### web_search
Unified search tool supporting both web and news search:
- **Web Search**: `searchType: "web"`
- **News Search**: `searchType: "news"` with time filtering
- **Note**: `searchType` is a required parameter and must be explicitly specified
#### Usage Examples:
```
# Web search
Use web_search tool to search "Node.js tutorial" with searchType set to web, return 10 results
# News search
Use web_search tool to search "tech news" with searchType set to news, return 5 results from past 24 hours
```
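Under the hood, an MCP client sends the tool invocation as a JSON-RPC `tools/call` request. A request for the web search above might look like the following sketch; `searchType` is documented above, but the `query` argument name is an assumption:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "web_search",
    "arguments": {
      "query": "Node.js tutorial",
      "searchType": "web"
    }
  }
}
```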
### Other Tools
- `get_webpage_content`: Fetch webpage content and convert it to a specified format
- `get_webpage_source`: Fetch the raw HTML source of a webpage
- `batch_webpage_scrape`: Scrape multiple webpages in a batch
## MCP Configuration
### Chatbox Configuration
Create an `mcp-config.json` file in Chatbox:
```json
{
  "mcpServers": {
    "spider-mcp": {
      "command": "node",
      "args": ["src/mcp/server.js"],
      "env": {
        "NODE_ENV": "production"
      },
      "description": "Spider MCP - Web search and webpage scraping tools",
      "capabilities": {
        "tools": {}
      }
    }
  }
}
```
### Other MCP Clients
```json
{
  "mcpServers": {
    "spider-mcp": {
      "command": "node",
      "args": ["path/to/spider-mcp/src/mcp/server.js"]
    }
  }
}
```
## Important Notes
1. **Anti-bot Measures**: This service uses various techniques to avoid detection, but you must still comply with each site's robots.txt and terms of use
2. **Rate Limiting**: Keep request frequency moderate to avoid putting pressure on target websites
3. **Legal Compliance**: Ensure compliance with local laws and website terms of use when using this service
4. **Resource Consumption**: Puppeteer launches a Chrome browser; monitor memory and CPU usage
5. **URL Cleaning**: Promotional parameters are stripped automatically, which may break some special link functionality
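For the rate-limiting note above, a minimal client-side throttle can be sketched in plain Node.js. The helper below is purely illustrative and not part of Spider MCP:

```javascript
// Illustrative throttle helper (not part of Spider MCP): fetches URLs
// sequentially with a fixed pause between requests.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function politeFetchAll(urls, fetchOne, delayMs = 1000) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchOne(url)); // one request at a time
    await sleep(delayMs);              // pause before the next request
  }
  return results;
}
```

Fetching sequentially with a delay keeps request frequency predictable, which is usually gentler on target sites than firing batches in parallel.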
## Development
### Project Structure
```tree
spider-mcp/
├── src/
│   ├── index.js              # Main entry file
│   ├── mcp/
│   │   └── server.js         # MCP server
│   ├── routes/               # Route definitions
│   │   ├── search.js         # Search routes
│   │   └── health.js         # Health check routes
│   ├── services/             # Business logic
│   │   └── searchService.js  # Search service
│   └── utils/                # Utility functions
│       └── logger.js         # Logging utility
├── logs/                     # Log files directory
├── tests/                    # Test files
├── package.json              # Project configuration
├── .env.example              # Environment variables example
├── mcp-config.json           # MCP configuration example
└── README.md                 # Project documentation
```
## License
MIT License
## Contributing
Issues and Pull Requests are welcome!