# MCP Web Research Agent
> A powerful MCP (Model Context Protocol) tool for automated web research, scraping, and intelligence gathering.
A web research automation tool that turns a conventional scraper into an MCP-compatible agent for AI workflows. Well suited to competitive intelligence, market research, and automated data collection.
## Features
- **Intelligent Scraping**: Recursive web crawling with configurable depth
- **Search Integration**: Multi-engine search with result processing
- **Database Storage**: Persistent SQLite storage with advanced querying
- **Multiple Export Formats**: JSON, Markdown, and CSV exports
- **MCP Integration**: Seamless integration with AI assistants
- **Async Ready**: Built for concurrent operations
- **Configurable**: Adjustable settings for any use case
## Installation
### Prerequisites
- Python 3.8+
- MCP-compatible client (Claude Desktop, etc.)
### Quick Install
```bash
# Clone the repository
git clone https://github.com/yourusername/mcp-web-research-agent.git
cd mcp-web-research-agent
# Install dependencies
pip install -e .
```
### MCP Client Configuration
Add to your MCP client configuration:
```json
{
  "mcpServers": {
    "web-research-agent": {
      "command": "python",
      "args": ["/path/to/mcp-web-research-agent/server.py"]
    }
  }
}
```
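Once the server is registered, any MCP client can discover and invoke its tools over stdio. The sketch below is a hypothetical smoke test using the official `mcp` Python SDK (`pip install mcp`); the server path is a placeholder, and the exact result shape depends on the server implementation:
```python
# Hypothetical smoke test: connect to the server over stdio and call one tool.
# Assumes the official MCP Python SDK; adjust the server path for your checkout.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    params = StdioServerParameters(
        command="python",
        args=["/path/to/mcp-web-research-agent/server.py"],
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Tools:", [t.name for t in tools.tools])
            result = await session.call_tool(
                "scrape_url",
                {"url": "https://example.com", "keywords": ["python"]},
            )
            print(result)

asyncio.run(main())
```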
## Usage
### Available Tools
#### `scrape_url`
Scrape a single URL for specific keywords
```python
result = await scrape_url(
    url="https://example.com",
    keywords=["python", "automation", "scraping"],
    extract_links=False,
    max_depth=1
)
```
#### `search_and_scrape`
Search the web and automatically scrape results
```python
result = await search_and_scrape(
    query="web scraping best practices",
    keywords=["python", "beautifulsoup", "requests"],
    search_engine_url="https://searx.gophernuttz.us/search/",
    max_results=10
)
```
#### `get_scraping_results`
Query the database for previous scraping results
```python
result = await get_scraping_results(
    keyword_filter="python",
    limit=50
)
```
#### `export_results`
Export results to various formats
```python
result = await export_results(
    format="markdown",
    keyword_filter="python",
    output_path="/path/to/output.md"
)
```
#### `get_scraping_stats`
Get current statistics and status
```python
result = await get_scraping_stats()
```
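All five tools are coroutines, so independent operations compose with standard `asyncio` concurrency. A minimal sketch, assuming the tool functions are importable from `server.py` under the names used above (the import path is an assumption, not a documented API):
```python
# Minimal concurrency sketch: run two independent scrapes at once.
# Assumes scrape_url is importable from server.py (an assumption).
import asyncio

from server import scrape_url

async def main():
    results = await asyncio.gather(
        scrape_url(url="https://example.com", keywords=["python"]),
        scrape_url(url="https://example.org", keywords=["automation"]),
    )
    for result in results:
        print(result)

asyncio.run(main())
```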
## Database Schema
The agent uses SQLite with the following structure:
```sql
-- URLs table
CREATE TABLE urls (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT UNIQUE NOT NULL,
    title TEXT,
    content TEXT,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- Keywords table
CREATE TABLE keywords (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    keyword TEXT UNIQUE NOT NULL
);

-- URL-Keyword relationships
CREATE TABLE url_keywords (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url_id INTEGER,
    keyword_id INTEGER,
    matches INTEGER DEFAULT 1,
    context TEXT,
    FOREIGN KEY (url_id) REFERENCES urls (id),
    FOREIGN KEY (keyword_id) REFERENCES keywords (id),
    UNIQUE(url_id, keyword_id)
);
```
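The `get_scraping_results` and `export_results` tools query this schema for you, but the store is plain SQLite, so you can also inspect it directly. A minimal sketch using Python's built-in `sqlite3`, assuming the default `scraper_results.db` described under Configuration below:
```python
# Minimal sketch: join all three tables to rank pages by keyword match count.
# Assumes the default database file scraper_results.db in the working directory.
import sqlite3

conn = sqlite3.connect("scraper_results.db")
rows = conn.execute(
    """
    SELECT u.url, u.title, uk.matches
    FROM urls AS u
    JOIN url_keywords AS uk ON uk.url_id = u.id
    JOIN keywords AS k ON k.id = uk.keyword_id
    WHERE k.keyword = ?
    ORDER BY uk.matches DESC
    LIMIT 10
    """,
    ("python",),
).fetchall()
for url, title, matches in rows:
    print(f"{matches:>3} matches  {url}  ({title})")
conn.close()
```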
## Configuration
### Default Settings
- **Max Depth**: 3 levels of recursive crawling
- **Request Delay**: 1 second between requests
- **User Agent**: Modern Chrome browser simulation
- **Database**: `scraper_results.db` (auto-created)
### Customization
Modify settings in the `MCPWebScraper` constructor:
```python
scraper = MCPWebScraper(
    db_manager=db_manager,
    max_depth=5,  # Increase crawl depth
    delay=0.5     # Faster requests
)
```
## Development
### Running Tests
```bash
python test_mcp_scraper.py
```
### Example Usage
```bash
python example_usage.py
```
### Project Structure
```
mcp-web-research-agent/
├── server.py             # MCP server implementation
├── scraper.py            # Core scraping logic
├── database.py           # Database management
├── requirements.txt      # Python dependencies
├── pyproject.toml        # Package configuration
├── test_mcp_scraper.py   # Unit tests
├── example_usage.py      # Usage examples
└── README.md             # This file
```
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- Built on the [Model Context Protocol](https://spec.modelcontextprotocol.io/)
- Inspired by modern web scraping best practices
- Thanks to the open-source community for amazing tools
---
**Built with ❤️ for the MCP ecosystem**