# Development Tasks - SearXNG MCP Server
**Last Updated:** 2025-01-15
**Update this file whenever you start/complete/block on a task!**
---
## Task Status Legend
- 🔴 **Not Started** - No work begun
- 🟡 **In Progress** - Active development
- 🟢 **Complete** - Done and tested
- ⚫ **Blocked** - Waiting on dependency or decision
- 🔵 **Testing** - Implementation done, needs verification
---
## Phase 1: Core Infrastructure ✅
### Task #0: Project Setup
**Status:** 🟢 Complete
**Priority:** P0 (Critical)
**Complexity:** Low
**Estimated Time:** 2-3 hours
**Actual Time:** ~3 hours
**Description:**
Set up minimal MCP server with SearXNG integration and crawl4ai page fetching.
**Implementation Checklist:**
- [x] Create modular package structure (`searxng_mcp/`)
- [x] Implement `config.py` with env-based settings
- [x] Implement `search.py` with SearXNG JSON API client
- [x] Implement `crawler.py` with crawl4ai wrapper
- [x] Implement `server.py` with FastMCP and two tools (see the sketch after this checklist)
- [x] Add `pyproject.toml` with dependencies and console script
- [x] Write `README.md` with setup/usage instructions
- [x] Register in OpenCode global config (`~/.config/opencode/opencode.json`)
- [x] Test with `uv run searxng-mcp`
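For quick reference, a minimal sketch of the server's shape, assuming the MCP Python SDK's `FastMCP`. Tool parameters here are illustrative, not the actual signatures in `server.py`:
```python
# Minimal sketch of server.py's shape; tool bodies and parameters are
# placeholders, not the real implementation.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("searxng-mcp")


@mcp.tool()
async def web_search(query: str, max_results: int = 10) -> str:
    """Search the web via the configured SearXNG instance."""
    ...


@mcp.tool()
async def crawl_url(url: str) -> str:
    """Fetch a page with crawl4ai and return trimmed markdown."""
    ...


def main() -> None:
    mcp.run()  # stdio transport by default; this is the console-script entry point
```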
**Completion Notes:**
- Successfully tested with OpenCode
- Response limits work correctly (8KB default)
- Both tools (`web_search`, `crawl_url`) operational
- Environment variables properly override defaults
**Files Modified:**
- Created: `searxng_mcp/` package
- Created: `pyproject.toml`, `requirements.txt`, `README.md`
- Modified: `~/.config/opencode/opencode.json`
---
## Phase 2: High-Impact Tools 🚧
### Task #1: Package Registry Search Tool
**Status:** 🟢 Complete
**Priority:** P1 (High)
**Complexity:** Medium
**Estimated Time:** 4-6 hours
**Actual Time:** ~2 hours
**Completed:** 2025-01-15
**Description:**
Add tool to look up packages on npm, PyPI, crates.io, and the Go module proxy. Return structured info: downloads, last update, dependencies, security advisories.
**User Story:**
As a developer, I want to quickly check package details without leaving my coding environment, so I can evaluate libraries faster.
**Implementation Plan:**
1. **Create module:** `searxng_mcp/registry.py`
   - Class: `PackageRegistryClient`
   - Methods:
     - `async search_npm(name: str) -> PackageInfo`
     - `async search_pypi(name: str) -> PackageInfo`
     - `async search_crates(name: str) -> PackageInfo`
     - `async search_go(name: str) -> PackageInfo`
   - Use official registry APIs, not scraping (see the httpx sketch after this plan)
2. **Add constants to config.py:**
```python
NPM_REGISTRY_URL = "https://registry.npmjs.org"
PYPI_API_URL = "https://pypi.org/pypi"
CRATES_API_URL = "https://crates.io/api/v1"
GO_PROXY_URL = "https://proxy.golang.org"
```
3. **Add tool to server.py:**
```python
@mcp.tool()
async def package_info(
    name: str,
    registry: Literal["npm", "pypi", "crates", "go"] = "npm"
) -> str:
    """Fetch package information from registries."""
```
4. **Response format:**
```
Package: express (npm)
Version: 4.18.2 (latest)
Downloads: 25M/week
Last Updated: 2023-12-15
License: MIT
Dependencies: 30
Security: ✅ No known vulnerabilities
Repository: https://github.com/expressjs/express
Description: Fast, unopinionated, minimalist web framework
```
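A minimal sketch of the npm lookup from step 1, assuming httpx and the public registry endpoints. The weekly-downloads endpoint (`api.npmjs.org`) is an extra assumption not listed in the `config.py` constants above, and `PackageInfo` is simplified to a plain dict:
```python
# Sketch of the npm path only; PyPI, crates.io, and Go follow the same pattern.
import httpx

NPM_REGISTRY_URL = "https://registry.npmjs.org"
NPM_DOWNLOADS_URL = "https://api.npmjs.org/downloads/point/last-week"  # assumption


async def search_npm(name: str) -> dict:
    async with httpx.AsyncClient(timeout=10) as client:
        resp = await client.get(f"{NPM_REGISTRY_URL}/{name}")
        resp.raise_for_status()  # 404 -> caller formats a "not found" message
        doc = resp.json()
        latest = doc["dist-tags"]["latest"]
        meta = doc["versions"][latest]

        # Weekly downloads live on a separate endpoint; treat it as best-effort.
        dl = await client.get(f"{NPM_DOWNLOADS_URL}/{name}")
        downloads = dl.json().get("downloads") if dl.status_code == 200 else None

    return {
        "name": name,
        "version": latest,
        "description": meta.get("description", ""),
        "license": meta.get("license", "unknown"),
        "downloads_per_week": downloads,
        "last_updated": doc.get("time", {}).get(latest),
    }
```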
**Testing Checklist:**
- [x] npm package lookup works
- [x] PyPI package lookup works
- [x] crates.io package lookup works
- [x] Go module lookup works
- [x] Handles non-existent packages gracefully
- [x] Response fits within 8KB limit
- [x] Error handling for network failures
- [ ] Caching for repeated lookups (deferred)
**Completion Notes:**
- All four registries (npm, PyPI, crates.io, Go) working perfectly
- Added weekly download stats for npm (51M/week for express)
- Proper error handling for 404s with helpful messages
- Response formatting clean and readable
- Time-ago formatting (e.g., "7 months ago") improves UX
- No new dependencies required (used existing httpx)
**Files Created:**
- `searxng_mcp/registry.py` (237 lines) - Registry client with all 4 APIs
- Updated: `searxng_mcp/server.py` - Added `package_info` tool
**Performance:**
- npm lookups: ~1-2s
- PyPI lookups: ~1s
- crates.io lookups: ~1s
- Go lookups: ~1-2s
**Related Files:**
- Spec: `docs/new-features/package-registry/SPEC.md`
- Tests: Manual testing completed, unit tests optional
---
### Task #2: Error Translator Tool
**Status:** 🔴 Not Started
**Priority:** P1 (High)
**Complexity:** High
**Estimated Time:** 8-10 hours
**Assigned To:** Unassigned
**Description:**
Tool that takes error messages/stack traces and returns ranked solutions from Stack Overflow, GitHub issues, and documentation.
**User Story:**
As a developer hitting an error, I want instant ranked solutions instead of manually searching Stack Overflow and clicking through multiple answers.
**Implementation Plan:**
1. **Create module:** `searxng_mcp/error_solver.py`
   - Class: `ErrorSolver`
   - Methods:
     - `async analyze_error(error_text: str, context: dict) -> List[Solution]`
     - `async search_stackoverflow(error_sig: str) -> List[SOAnswer]`
     - `async search_github_issues(error_sig: str) -> List[GitHubIssue]`
2. **Error signature extraction** (see the sketch after this plan):
   - Strip file paths and line numbers (too specific)
   - Keep the error type and core message
   - Identify language/framework from the stack trace
3. **Ranking algorithm** (also sketched below):
   - Stack Overflow: accepted answer > votes > recency
   - GitHub: closed issues > issue reactions > recency
   - Filter: solutions must be <2 years old (default)
4. **Add tool to server.py:**
```python
@mcp.tool()
async def solve_error(
    error: str,
    language: str = "auto",
    max_solutions: int = 3
) -> str:
    """Find solutions for error messages/stack traces."""
```
5. **Response format:**
```
Error Analysis:
Type: TypeError
Language: Python
Likely Cause: NoneType attribute access
Solution #1 (Stack Overflow - Accepted, 234 votes)
Check for None before accessing attributes:
    if obj is not None:
        obj.method()
Source: https://stackoverflow.com/...
Solution #2 (GitHub Issue - Closed)
[Solution details...]
```
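Since this task is not started, here is a rough sketch of steps 2 and 3 under stated assumptions: the extraction heuristics and ranking weights below are placeholders to tune, not a spec:
```python
# Sketch: regex-based signature extraction plus a simple Stack Overflow score.
import re


def extract_signature(error_text: str) -> str:
    """Reduce an error/stack trace to a searchable signature."""
    lines = [ln for ln in error_text.strip().splitlines() if ln.strip()]
    if not lines:
        return ""  # malformed/empty input handled gracefully
    sig = lines[-1]  # in Python tracebacks the final line carries the message
    sig = re.sub(r'(File ")?(/|[A-Za-z]:\\)[^\s"]+', "", sig)  # drop file paths
    sig = re.sub(r"line \d+", "", sig)                         # drop line numbers
    sig = re.sub(r"0x[0-9a-fA-F]+", "", sig)                   # drop addresses
    return re.sub(r"\s+", " ", sig).strip()


def score_so_answer(accepted: bool, votes: int, age_years: float) -> float:
    """Rank per step 3: accepted > votes > recency; placeholder weights."""
    if age_years > 2:  # default freshness cutoff
        return 0.0
    return (100 if accepted else 0) + votes + max(0.0, 10 - 5 * age_years)
```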
**Testing Checklist:**
- [ ] Extracts error signatures correctly
- [ ] Python errors work
- [ ] JavaScript errors work
- [ ] Type errors, syntax errors, runtime errors all handled
- [ ] Solutions ranked appropriately
- [ ] No duplicate solutions
- [ ] Handles malformed error text gracefully
- [ ] Response fits size limits
**Dependencies:**
- May need: `beautifulsoup4` (already in crawl4ai deps)
- May need: Stack Overflow API token (optional, higher rate limits)
**Blockers:**
- Need to decide: use the Stack Overflow API or scrape? (the API has rate limits but is cleaner)
**Related Files:**
- Spec: `docs/new-features/error-translator/SPEC.md`
- Tests: `tests/test_error_solver.py`
---
### Task #3: GitHub Repository Info Tool
**Status:** 🟢 Complete
**Priority:** P2 (Medium)
**Complexity:** Low
**Estimated Time:** 3-4 hours
**Actual Time:** ~1.5 hours
**Completed:** 2025-01-15
**Description:**
Tool to fetch GitHub repo metadata: stars, issues, recent commits, README, contributors.
**User Story:**
As a developer evaluating a library, I want quick repo health metrics without opening GitHub in browser.
**Implementation Plan:**
1. **Create module:** `searxng_mcp/github.py` (see the sketch after this plan)
   - Class: `GitHubClient`
   - Use the GitHub REST API (no auth needed for public repos; support a token for higher rate limits)
   - Methods:
     - `async get_repo_info(owner: str, repo: str) -> RepoInfo`
     - `async get_recent_commits(owner: str, repo: str, count: int) -> List[Commit]`
2. **Add constants to config.py:**
```python
GITHUB_API_URL = "https://api.github.com"
GITHUB_TOKEN = _env_str("GITHUB_TOKEN", "") # Optional, for higher rate limits
```
3. **Add tool to server.py:**
```python
@mcp.tool()
async def github_repo(
    repo: str,  # format: "owner/repo" or full URL
    include_commits: bool = True
) -> str:
    """Fetch GitHub repository information and health metrics."""
```
4. **Response format:**
```
Repository: unclecode/crawl4ai
Stars: 15.2k | Forks: 1.3k | Watchers: 234
License: Apache-2.0
Last Updated: 2 hours ago
Open Issues: 45 | Open PRs: 12
Recent Activity (Last 5 Commits):
- [2h ago] feat: add webhook support (unclecode)
- [1d ago] fix: memory leak in crawler (contributor)
...
Description: Open-source LLM Friendly Web Crawler & Scraper
Homepage: https://crawl4ai.com
```
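For the record, a sketch of the repo lookup from step 1 using httpx against the public REST API. Field names match the real `/repos/{owner}/{repo}` response, but the `RepoInfo` shape is simplified to a dict and error handling is trimmed:
```python
import httpx

GITHUB_API_URL = "https://api.github.com"


async def get_repo_info(owner: str, repo: str, token: str = "") -> dict:
    headers = {"Accept": "application/vnd.github+json"}
    if token:  # optional: lifts the rate limit from 60 to 5000 req/hr
        headers["Authorization"] = f"Bearer {token}"
    async with httpx.AsyncClient(timeout=10) as client:
        resp = await client.get(
            f"{GITHUB_API_URL}/repos/{owner}/{repo}", headers=headers
        )
        resp.raise_for_status()  # 404 and 403 (rate limit) surfaced to the caller
        data = resp.json()
    return {
        "full_name": data["full_name"],
        "stars": data["stargazers_count"],
        "forks": data["forks_count"],
        "open_issues": data["open_issues_count"],
        "license": (data.get("license") or {}).get("spdx_id"),
        "pushed_at": data["pushed_at"],
        "description": data.get("description"),
    }
```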
**Testing Checklist:**
- [x] Parses "owner/repo" format
- [x] Parses full GitHub URLs
- [x] Works without GitHub token
- [x] Respects rate limits (60/hr without token, 5000/hr with)
- [x] Handles non-existent repos
- [x] Handles private repos (returns "not accessible")
- [x] Response fits size limits
**Completion Notes:**
- Full GitHub REST API integration working perfectly
- Supports both `owner/repo` and full URL formats
- Returns comprehensive repo health metrics: stars, forks, issues, PRs, language, license
- Includes recent commit history (optional)
- Proper error handling for 404s and rate limits
- Optional GitHub token support via `GITHUB_TOKEN` env var
- Response time: ~2-3s per request
**Files Created:**
- `searxng_mcp/github.py` (209 lines) - Complete GitHub API client
- Updated: `searxng_mcp/server.py` - Added `github_repo` tool
**Performance:**
- Repo info: ~1-2s
- With commits: ~2-3s
- Rate limits: 60 req/hr (no token), 5000 req/hr (with token)
**Related Files:**
- Spec: `docs/new-features/github-info/SPEC.md`
- Tests: Manual testing completed
---
### Task #4: Structured Data Extraction Enhancement
**Status:** 🔴 Not Started
**Priority:** P2 (Medium)
**Complexity:** Medium
**Estimated Time:** 5-7 hours
**Assigned To:** Unassigned
**Description:**
Enhance existing `crawl_url` or add a new tool to extract structured data (tables, prices, specs) via CSS selectors or LLM extraction.
**User Story:**
As a developer, I want to extract specific fields from web pages (e.g., "get all prices from this e-commerce page") without processing full page markdown.
**Implementation Plan:**
1. **Enhance crawler.py** (see the sketch after this plan):
   - Add method: `async extract_structured(url: str, schema: dict) -> dict`
   - Use crawl4ai's built-in extraction strategies:
     - `JsonCssExtractionStrategy` for CSS selector-based extraction
     - `LLMExtractionStrategy` for AI-powered extraction (optional)
2. **Add tool to server.py:**
```python
@mcp.tool()
async def extract_data(
    url: str,
    fields: str,  # JSON string describing what to extract
    method: Literal["css", "llm"] = "css"
) -> str:
    """Extract structured data from a web page."""
```
3. **Example usage:**
```
extract_data(
    url="https://example.com/products",
    fields='{"name": "h2.product-title", "price": "span.price"}',
    method="css"
)

Returns:
[
    {"name": "Product A", "price": "$29.99"},
    {"name": "Product B", "price": "$39.99"}
]
```
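A possible sketch of step 1, assuming crawl4ai's `JsonCssExtractionStrategy`. The schema follows crawl4ai's documented shape, but the config plumbing (`CrawlerRunConfig`) may need adjusting to the installed version, and `base_selector` is a parameter this sketch adds for illustration:
```python
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy


async def extract_structured(url: str, fields: dict, base_selector: str) -> list:
    # fields is the already-parsed JSON from the tool's `fields` argument,
    # e.g. {"name": "h2.product-title", "price": "span.price"}
    schema = {
        "name": "extracted",
        "baseSelector": base_selector,  # e.g. "div.product", one dict per match
        "fields": [
            {"name": key, "selector": sel, "type": "text"}
            for key, sel in fields.items()
        ],
    }
    strategy = JsonCssExtractionStrategy(schema)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url, config=CrawlerRunConfig(extraction_strategy=strategy)
        )
    return json.loads(result.extracted_content or "[]")
```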
**Testing Checklist:**
- [ ] CSS selector extraction works
- [ ] Handles missing elements gracefully
- [ ] Works with tables
- [ ] Works with lists
- [ ] LLM extraction works (if implemented)
- [ ] Response is valid JSON
- [ ] Handles extraction failures
**Dependencies:**
- Already have crawl4ai with extraction support
- Optional: LLM provider API key for LLM extraction mode
**Blockers:**
- Need to decide: CSS-only first, or include LLM extraction from start?
**Related Files:**
- Spec: `docs/new-features/structured-extract/SPEC.md`
- Tests: `tests/test_extraction.py`
---
### Task #5: API Explorer Tool
**Status:** 🔴 Not Started
**Priority:** P2 (Medium)
**Complexity:** High
**Estimated Time:** 8-10 hours
**Assigned To:** Unassigned
**Description:**
Tool to fetch OpenAPI/Swagger specs, generate example requests, and test endpoints.
**User Story:**
As a developer integrating with an API, I want to explore endpoints and see example requests without leaving my environment.
**Implementation Plan:**
1. **Create module:** `searxng_mcp/api_explorer.py` (see the sketch after this plan)
   - Class: `APIExplorer`
   - Methods:
     - `async fetch_openapi_spec(url: str) -> dict`
     - `async list_endpoints(spec: dict) -> List[Endpoint]`
     - `async generate_example(endpoint: str, method: str) -> str`
     - `async test_endpoint(url: str, method: str, headers: dict) -> Response`
2. **Add tool to server.py:**
```python
@mcp.tool()
async def explore_api(
    spec_url: str,  # URL to OpenAPI/Swagger JSON
    endpoint: str = "",  # optional: specific endpoint to explore
) -> str:
    """Explore API using OpenAPI/Swagger specification."""
```
3. **Response format:**
```
API: Stripe API v1
Base URL: https://api.stripe.com/v1
Authentication: Bearer token
Available Endpoints (showing 5 of 234):
1. POST /v1/charges
   Description: Create a charge
   Required: amount, currency
   Example:
     curl -X POST https://api.stripe.com/v1/charges \
       -H "Authorization: Bearer sk_test_..." \
       -d amount=2000 \
       -d currency=usd
2. GET /v1/customers/{id}
...
```
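A rough sketch of spec fetching and endpoint listing from step 1. Only the standard OpenAPI `paths` structure is assumed; `$ref` resolution, auth schemes, and Swagger 2.0 quirks are deliberately left out:
```python
import httpx

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


async def fetch_openapi_spec(url: str) -> dict:
    async with httpx.AsyncClient(timeout=10) as client:
        resp = await client.get(url)
        resp.raise_for_status()
        return resp.json()


def list_endpoints(spec: dict) -> list[dict]:
    endpoints = []
    for path, ops in spec.get("paths", {}).items():
        for method, op in ops.items():
            if method not in HTTP_METHODS:
                continue  # skip path-level "parameters", x- extensions, etc.
            required = [
                p["name"] for p in op.get("parameters", []) if p.get("required")
            ]
            endpoints.append({
                "method": method.upper(),
                "path": path,
                "summary": op.get("summary", ""),
                "required_params": required,
            })
    return endpoints
```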
**Testing Checklist:**
- [ ] Fetches OpenAPI 3.0 specs
- [ ] Fetches Swagger 2.0 specs
- [ ] Lists all endpoints
- [ ] Generates valid curl examples
- [ ] Handles authentication schemes
- [ ] Identifies required vs optional params
- [ ] Response fits size limits
- [ ] Handles invalid/malformed specs
**Dependencies:**
- May need: `pydantic` for spec validation (already transitive dep of crawl4ai)
- May need: `openapi-spec-validator` (optional)
**Blockers:**
- None
**Related Files:**
- Spec: `docs/new-features/api-explorer/SPEC.md`
- Tests: `tests/test_api_explorer.py`
---
## Phase 3: Advanced Features 📅
### Task #6: Multi-Source Aggregation
**Status:** 🔴 Not Started
**Priority:** P3 (Low)
**Complexity:** Medium
**Estimated Time:** 6-8 hours
**Description:**
Tool that searches multiple sources, crawls top results, and summarizes findings.
**Deferred:** Will design after Phase 2 completion.
---
### Task #7: Site-Specific Search Filters
**Status:** 🔴 Not Started
**Priority:** P3 (Low)
**Complexity:** Low
**Estimated Time:** 2-3 hours
**Description:**
Add `site:` parameter to `web_search` tool for domain-scoped queries.
**Deferred:** Will implement after Phase 2.
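When this is picked up, one possible shape is a simple query rewrite before the SearXNG call, assuming the underlying engines honor the `site:` operator; the helper name is hypothetical:
```python
# Hypothetical helper for web_search; relies on engines supporting site:.
def apply_site_filter(query: str, site: str = "") -> str:
    """Scope a query to one domain, e.g. ("asyncio", "docs.python.org")."""
    return f"site:{site} {query}" if site else query
```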
---
### Task #8: Security/CVE Monitoring
**Status:** 🔴 Not Started
**Priority:** P3 (Low)
**Complexity:** High
**Estimated Time:** 10-12 hours
**Description:**
Tool to check CVE databases and security advisories for specific tech stacks.
**Deferred:** Requires more research into CVE APIs.
---
## Notes & Decisions
### 2025-01-15: Initial Planning
- Prioritized based on daily usage frequency analysis
- Package registry ranked #1 for quick win (high frequency, relatively low effort)
- Error translator ranked high for impact but higher complexity
- GitHub info easy to implement, good ROI
- Structured extraction builds on existing crawler
- API explorer more complex but very useful for integration work
### Future Considerations
- Add caching layer for frequently accessed data (npm packages, GitHub repos)
- Consider rate limiting to respect external API quotas
- May want to add telemetry to track which tools are most used
- Should add comprehensive error logging for debugging
---
**Remember:** Update this file every time you start, progress, or complete a task!