# MCP Web Research Agent
> A powerful MCP (Model Context Protocol) tool for automated web research, scraping, and intelligence gathering.
A web research automation tool that turns a conventional scraper into an MCP-compatible agent for AI workflows. Well suited to competitive intelligence, market research, and automated data collection.
## Features
- **Intelligent Scraping**: Recursive web crawling with configurable depth
- **Search Integration**: Multi-engine search with result processing
- **Database Storage**: Persistent SQLite storage with advanced querying
- **Multiple Export Formats**: JSON, Markdown, and CSV exports
- **MCP Integration**: Seamless integration with AI assistants
- **Async Ready**: Built for concurrent operations
- **Configurable**: Adjustable settings for any use case
## Installation
### Prerequisites
- Python 3.8+
- MCP-compatible client (Claude Desktop, etc.)
### Quick Install
```bash
# Clone the repository
git clone https://github.com/yourusername/mcp-web-research-agent.git
cd mcp-web-research-agent
# Install dependencies
pip install -e .
```
### MCP Client Configuration
Add to your MCP client configuration:
```json
{
  "mcpServers": {
    "web-research-agent": {
      "command": "python",
      "args": ["/path/to/mcp-web-research-agent/server.py"]
    }
  }
}
```
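Once the server is registered, any MCP client can discover and invoke its tools over stdio. The sketch below is a hypothetical smoke test using the official `mcp` Python SDK (`pip install mcp`); the server path is a placeholder, and the exact result shape depends on the server implementation:
```python
# Hypothetical smoke test: connect to the server over stdio and call one tool.
# Assumes the official MCP Python SDK; adjust the server path for your checkout.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    params = StdioServerParameters(
        command="python",
        args=["/path/to/mcp-web-research-agent/server.py"],
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Tools:", [t.name for t in tools.tools])
            result = await session.call_tool(
                "scrape_url",
                {"url": "https://example.com", "keywords": ["python"]},
            )
            print(result)

asyncio.run(main())
```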
## Usage
### Available Tools
#### `scrape_url`
Scrape a single URL for specific keywords
```python
result = await scrape_url(
    url="https://example.com",
    keywords=["python", "automation", "scraping"],
    extract_links=False,
    max_depth=1
)
```
#### `search_and_scrape`
Search the web and automatically scrape results
```python
result = await search_and_scrape(
    query="web scraping best practices",
    keywords=["python", "beautifulsoup", "requests"],
    search_engine_url="https://searx.gophernuttz.us/search/",
    max_results=10
)
```
#### `get_scraping_results`
Query the database for previous scraping results
```python
result = await get_scraping_results(
    keyword_filter="python",
    limit=50
)
```
#### `export_results`
Export results to various formats
```python
result = await export_results(
    format="markdown",
    keyword_filter="python",
    output_path="/path/to/output.md"
)
```
#### `get_scraping_stats`
Get current statistics and status
```python
result = await get_scraping_stats()
```
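All five tools are coroutines, so independent operations compose with standard `asyncio` concurrency. A minimal sketch, assuming the tool functions are importable from `server.py` under the names used above (the import path is an assumption, not a documented API):
```python
# Minimal concurrency sketch: run two independent scrapes at once.
# Assumes scrape_url is importable from server.py (an assumption).
import asyncio

from server import scrape_url

async def main():
    results = await asyncio.gather(
        scrape_url(url="https://example.com", keywords=["python"]),
        scrape_url(url="https://example.org", keywords=["automation"]),
    )
    for result in results:
        print(result)

asyncio.run(main())
```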
## Database Schema
The agent uses SQLite with the following structure:
```sql
-- URLs table
CREATE TABLE urls (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT UNIQUE NOT NULL,
    title TEXT,
    content TEXT,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- Keywords table
CREATE TABLE keywords (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    keyword TEXT UNIQUE NOT NULL
);

-- URL-Keyword relationships
CREATE TABLE url_keywords (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url_id INTEGER,
    keyword_id INTEGER,
    matches INTEGER DEFAULT 1,
    context TEXT,
    FOREIGN KEY (url_id) REFERENCES urls (id),
    FOREIGN KEY (keyword_id) REFERENCES keywords (id),
    UNIQUE(url_id, keyword_id)
);
```
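The `get_scraping_results` and `export_results` tools query this schema for you, but the store is plain SQLite, so you can also inspect it directly. A minimal sketch using Python's built-in `sqlite3`, assuming the default `scraper_results.db` described under Configuration below:
```python
# Minimal sketch: join all three tables to rank pages by keyword match count.
# Assumes the default database file scraper_results.db in the working directory.
import sqlite3

conn = sqlite3.connect("scraper_results.db")
rows = conn.execute(
    """
    SELECT u.url, u.title, uk.matches
    FROM urls AS u
    JOIN url_keywords AS uk ON uk.url_id = u.id
    JOIN keywords AS k ON k.id = uk.keyword_id
    WHERE k.keyword = ?
    ORDER BY uk.matches DESC
    LIMIT 10
    """,
    ("python",),
).fetchall()
for url, title, matches in rows:
    print(f"{matches:>3} matches  {url}  ({title})")
conn.close()
```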
## Configuration
### Default Settings
- **Max Depth**: 3 levels of recursive crawling
- **Request Delay**: 1 second between requests
- **User Agent**: Modern Chrome browser simulation
- **Database**: `scraper_results.db` (auto-created)
### Customization
Modify settings in the `MCPWebScraper` constructor:
```python
scraper = MCPWebScraper(
    db_manager=db_manager,
    max_depth=5,  # Increase crawl depth
    delay=0.5     # Faster requests
)
```
## Development
### Running Tests
```bash
python test_mcp_scraper.py
```
### Example Usage
```bash
python example_usage.py
```
### Project Structure
```
mcp-web-research-agent/
├── server.py             # MCP server implementation
├── scraper.py            # Core scraping logic
├── database.py           # Database management
├── requirements.txt      # Python dependencies
├── pyproject.toml        # Package configuration
├── test_mcp_scraper.py   # Unit tests
├── example_usage.py      # Usage examples
└── README.md             # This file
```
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- Built on the [Model Context Protocol](https://spec.modelcontextprotocol.io/)
- Inspired by modern web scraping best practices
- Thanks to the open-source community for amazing tools
---
**Built with ❤️ for the MCP ecosystem**