# Spider MCP - Web Search Crawler Service
A web search MCP service based on pure crawler technology, built with Node.js.
## Features
- **No Official API Required**: Built entirely on crawler technology (Puppeteer), with no dependency on third-party official APIs
- **Web and News Search**: Supports Bing web search and news search with time filtering
- **High Performance**: Supports batch webpage scraping
- **Health Monitoring**: Complete health checks and metrics monitoring
- **Structured Logging**: Uses Winston for structured logs
- **Anti-Detection**: Supports User-Agent rotation and other anti-bot measures
- **Smart URL Cleaning**: Automatically strips promotional parameters while preserving essential query information
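The anti-detection feature can be sketched as a simple User-Agent rotation, shown below. This is an illustrative sketch only, not the service's actual code; the `USER_AGENTS` list and `pickUserAgent` helper are hypothetical names.

```javascript
// Hypothetical sketch of User-Agent rotation (not the service's actual implementation).
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
];

// Pick a random User-Agent for each outgoing request.
function pickUserAgent() {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}

console.log(pickUserAgent());
```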
## Tech Stack
- **Node.js** (>= 18.0.0)
- **Express.js** - Web framework
- **Puppeteer** - Browser automation
- **Cheerio** - HTML parsing
- **Axios** - HTTP client
- **Winston** - Logging
- **@modelcontextprotocol/sdk** - MCP protocol support
## Quick Start
### 1. Install dependencies
```bash
npm install
```
Or use `pnpm`:
```bash
pnpm install
```
### 2. Download Puppeteer browser
```bash
npx puppeteer browsers install chrome
```
### 3. Environment configuration
Copy and configure the environment variables file:
```bash
cp .env.example .env
```
Edit the `.env` file according to your needs.
### 4. Start the service
Development mode:
```bash
npm run dev
```
Production mode:
```bash
npm start
```
The service will start at `http://localhost:3000`.
## MCP Tools
### web_search
Unified search tool supporting both web and news search:
- **Web Search**: `searchType: "web"`
- **News Search**: `searchType: "news"` with time filtering
- **Note**: `searchType` is a required parameter and must be explicitly specified
#### Usage Examples:
```
# Web search
Use web_search tool to search "Node.js tutorial" with searchType set to web, return 10 results
# News search
Use web_search tool to search "tech news" with searchType set to news, return 5 results from past 24 hours
```
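For clients that send raw MCP tool calls, the arguments might look like the JSON below. Only `searchType` is confirmed by this README; the `query` and `count` parameter names are assumptions for illustration.

```json
{
  "name": "web_search",
  "arguments": {
    "query": "Node.js tutorial",
    "searchType": "web",
    "count": 10
  }
}
```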
### Other Tools
- `get_webpage_content`: Fetches a webpage and converts its content to a specified format
- `get_webpage_source`: Returns the raw HTML source of a webpage
- `batch_webpage_scrape`: Scrapes multiple webpages in one request
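Conceptually, `get_webpage_content` turns raw HTML into readable text. The standalone sketch below uses naive regex stripping to illustrate the idea; the actual service parses HTML with Cheerio, so treat this as an assumption-laden toy, not the real conversion logic.

```javascript
// Naive HTML-to-text illustration (the real service uses Cheerio for parsing).
function htmlToText(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, '') // drop script blocks
    .replace(/<style[\s\S]*?<\/style>/gi, '')   // drop style blocks
    .replace(/<[^>]+>/g, ' ')                   // strip remaining tags
    .replace(/\s+/g, ' ')                       // collapse whitespace
    .trim();
}

console.log(htmlToText('<html><body><h1>Hello</h1><p>World</p></body></html>'));
// → "Hello World"
```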
## MCP Configuration
### Chatbox Configuration
Create an `mcp-config.json` file in Chatbox:
```json
{
  "mcpServers": {
    "spider-mcp": {
      "command": "node",
      "args": ["src/mcp/server.js"],
      "env": {
        "NODE_ENV": "production"
      },
      "description": "Spider MCP - Web search and webpage scraping tools",
      "capabilities": {
        "tools": {}
      }
    }
  }
}
```
### Other MCP Clients
```json
{
  "mcpServers": {
    "spider-mcp": {
      "command": "node",
      "args": ["path/to/spider-mcp/src/mcp/server.js"]
    }
  }
}
```
## Important Notes
1. **Anti-bot Measures**: The service uses various techniques to avoid detection, but you must still comply with robots.txt and each site's terms of use
2. **Rate Limiting**: Throttle request frequency to avoid putting undue load on target websites
3. **Legal Compliance**: Ensure compliance with local laws and website terms of use when using this service
4. **Resource Consumption**: Puppeteer launches a Chrome browser; monitor memory and CPU usage accordingly
5. **URL Cleaning**: Promotional parameters are stripped automatically, which may break some special-purpose links
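URL cleaning can be sketched with Node's built-in `URL` API as below. The specific parameter list (`utm_*`, `gclid`, `fbclid`) is an assumption for illustration, not the service's actual deny list.

```javascript
// Illustrative tracking-parameter removal; the parameter list is an assumption,
// not the service's actual cleaning rules.
const TRACKING_PARAMS = [
  'utm_source', 'utm_medium', 'utm_campaign', 'utm_term', 'utm_content',
  'gclid', 'fbclid',
];

function cleanUrl(rawUrl) {
  const url = new URL(rawUrl);
  for (const param of TRACKING_PARAMS) {
    url.searchParams.delete(param); // drop the tracking parameter if present
  }
  return url.toString();
}

console.log(cleanUrl('https://example.com/article?id=42&utm_source=news&gclid=abc'));
// → "https://example.com/article?id=42"
```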
## Development
### Project Structure
```tree
spider-mcp/
├── src/
│   ├── index.js              # Main entry file
│   ├── mcp/
│   │   └── server.js         # MCP server
│   ├── routes/               # Route definitions
│   │   ├── search.js         # Search routes
│   │   └── health.js         # Health check routes
│   ├── services/             # Business logic
│   │   └── searchService.js  # Search service
│   └── utils/                # Utility functions
│       └── logger.js         # Logging utility
├── logs/                     # Log files directory
├── tests/                    # Test files
├── package.json              # Project configuration
├── .env.example              # Environment variables example
├── mcp-config.json           # MCP configuration example
└── README.md                 # Project documentation
```
## License
MIT License
## Contributing
Issues and Pull Requests are welcome!