## MCP Server: Documentation Retrieval & Web Scraping (uv + FastMCP)

This project provides a minimal, async MCP (Model Context Protocol) server that exposes a tool for retrieving and cleaning official documentation content for popular AI / Python ecosystem libraries.

It uses:

- `fastmcp` to define and run the MCP server over stdio.
- `httpx` for async HTTP calls.
- `serper.dev` for Google-like search (via API).
- the `groq` API (LLM) to clean raw HTML into readable text chunks.
- `python-dotenv` for environment variable management.
- `uv` as the package manager & runner (fast, lockfile-based, Python 3.11+).

### Features

- Search restricted to official docs domains (`uv`, `langchain`, `openai`, `llama-index`).
- Tool: `get_docs(query, library)` returns concatenated cleaned sections with `SOURCE:` labels.
- Streaming-safe async design (large HTML pages are chunked before LLM cleaning).
- Separate `client.py` demonstrating how to connect as an MCP client, call the tool, and post-process the result with an LLM.

---

## Quick Start

Prerequisites:

- Python 3.11+
- `uv` installed (https://docs.astral.sh/uv/)
- API keys for `SERPER_API_KEY` and `GROQ_API_KEY`

### 1. Clone & Install

```bash
git clone <your-repo-url> mcp-server-python
cd mcp-server-python
uv sync
```

This creates/refreshes a `.venv` based on `pyproject.toml` + `uv.lock`.

### 2. Environment Variables

Create a `.env` file in the project root:

```env
SERPER_API_KEY=your_serper_api_key_here
GROQ_API_KEY=your_groq_api_key_here
```

Optional: add other model settings if you later extend functionality.

### 3. Run the MCP Server Directly

```bash
uv run mcp_server.py
```

The server starts and waits on stdio (no extra output unless you add logging). It registers the tool `get_docs`.

### 4. Use the Provided Client

```bash
uv run client.py
```

You should see something like:

```
Available tools: ['get_docs']
ANSWER: <model-produced answer referencing SOURCE lines>
```

If the tool list is empty, ensure the server started correctly and no exceptions were raised (add logging; see below).

---

## Tool: get_docs

Signature:

```
get_docs(query: str, library: str) -> str
```

Supported libraries (keys): `uv`, `langchain`, `openai`, `llama-index`.

Flow:

1. Build a site-restricted query: `site:<docs-domain> <query>`.
2. Call the Serper API for organic results.
3. Fetch each result URL asynchronously via `httpx`.
4. Split the HTML into ~4000-character chunks (memory safety & LLM context limits).
5. Clean each chunk using the Groq LLM (`openai/gpt-oss-20b`) with a system prompt.
6. Concatenate the cleaned chunks and label each block with `SOURCE: <url>` for traceability.

Returned value: a large text blob suitable for retrieval-augmented prompting, preserving source attribution lines.
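To make the flow above concrete, here is a condensed sketch of how such a server can be wired together with `fastmcp`. It is not a copy of `mcp_server.py`: the docs-domain mapping, the Serper request shape, the result count, and the system prompt are illustrative assumptions, and the real project keeps its HTML cleaning in `utils.py` (with `trafilatura`), which this sketch inlines.

```python
# Condensed sketch of the get_docs flow; helper details are assumptions,
# not a copy of mcp_server.py / utils.py.
import os

import httpx
from dotenv import load_dotenv
from fastmcp import FastMCP
from groq import AsyncGroq

load_dotenv()
mcp = FastMCP("documentation")
groq_client = AsyncGroq(api_key=os.environ["GROQ_API_KEY"])

# Assumed mapping of library key -> official docs domain.
docs_urls = {
    "uv": "docs.astral.sh/uv",
    "langchain": "python.langchain.com",
    "openai": "platform.openai.com/docs",
    "llama-index": "docs.llamaindex.ai",
}

CHUNK_SIZE = 4000  # ~4000-character chunks, per step 4 above


async def search_web(query: str) -> list[str]:
    """Steps 1-2: site-restricted Serper search returning organic result URLs."""
    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.post(
            "https://google.serper.dev/search",
            headers={"X-API-KEY": os.environ["SERPER_API_KEY"]},
            json={"q": query, "num": 3},
        )
        resp.raise_for_status()
        return [hit["link"] for hit in resp.json().get("organic", [])]


async def clean_chunk(chunk: str) -> str:
    """Step 5: ask the Groq LLM to turn a raw HTML chunk into readable text."""
    completion = await groq_client.chat.completions.create(
        model="openai/gpt-oss-20b",
        messages=[
            {"role": "system", "content": "Clean this HTML into readable text."},
            {"role": "user", "content": chunk},
        ],
    )
    return completion.choices[0].message.content or ""


@mcp.tool()
async def get_docs(query: str, library: str) -> str:
    """Search a library's official docs and return LLM-cleaned, source-labeled text."""
    if library not in docs_urls:
        raise ValueError(f"Library not supported: {library}")
    urls = await search_web(f"site:{docs_urls[library]} {query}")
    blocks = []
    async with httpx.AsyncClient(timeout=30.0, follow_redirects=True) as client:
        for url in urls:
            html = (await client.get(url)).text  # step 3
            # Steps 4-5: chunk, then clean each chunk with the LLM.
            cleaned = [await clean_chunk(html[i:i + CHUNK_SIZE])
                       for i in range(0, len(html), CHUNK_SIZE)]
            # Step 6: label each block with its source for traceability.
            blocks.append(f"SOURCE: {url}\n" + "\n".join(cleaned))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```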
---

## Architecture

File overview:

| File | Purpose |
|------|---------|
| `mcp_server.py` | Defines the `FastMCP` instance and implements `search_web`, `fetch_url`, and the `get_docs` tool. |
| `client.py` | Launches the server via stdio, lists tools, calls `get_docs`, then feeds the result to an LLM for a user-friendly answer. |
| `utils.py` | HTML cleaning helper (currently uses `trafilatura` for extraction and the Groq LLM for chunk transformation). |
| `.env` | Environment variables (excluded from VCS). |
| `pyproject.toml` | Declares dependencies and metadata. |
| `uv.lock` | Reproducible lockfile generated by `uv`. |

---

## Dependency Notes

Core runtime deps (from `pyproject.toml`):

- `fastmcp` – MCP server helper.
- `httpx` – async HTTP client.
- `groq` – Groq API client.
- `python-dotenv` – load variables from `.env`.
- `trafilatura` – heuristic content extraction (currently partially used / can be extended).

> Tip: If you add more scraping tools, reuse a single `httpx.AsyncClient` for performance.

---

## Logging & Debugging

To see what the server is doing, you can temporarily add:

```python
import logging, sys

logging.basicConfig(level=logging.INFO, stream=sys.stderr)
```

Place this near the top of `mcp_server.py`, after the imports. Because the protocol uses stdout for JSON-RPC, send logs to stderr only.

Common issues:

- Empty tool list: the server exited early or crashed; add logging.
- `SERPER_API_KEY` missing → 401 or empty search results.
- `GROQ_API_KEY` missing → LLM cleaning fails (exception in `get_response_from_llm`).
- Network timeouts: adjust `timeout` in the `httpx.AsyncClient` calls.

---

## Extending

Ideas:

- Add a caching layer (e.g., `sqlite` or an in-memory dict) to avoid re-fetching the same URLs.
- Parallelize URL fetch + clean with `asyncio.gather()` (mind rate limits / LLM cost).
- Add another tool (e.g., `summarize_diff`, `list_endpoints`).
- Provide structured JSON output (a list of sources + cleaned text) instead of a concatenated string.
- Add tests using `pytest` + `pytest-asyncio` (mock the Serper + LLM APIs).

---

## Example Programmatic Use (Without Client Wrapper)

If you want to call the tool directly in a Python script using the client-side MCP library:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def demo():
    # Launch the server as a subprocess over stdio.
    params = StdioServerParameters(command="uv", args=["run", "mcp_server.py"])
    async with stdio_client(params) as (r, w):
        async with ClientSession(r, w) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])
            docs = await session.call_tool("get_docs", {"query": "install", "library": "uv"})
            # call_tool returns a list of content blocks; print the first text block.
            print(docs.content[0].text[:500])


asyncio.run(demo())
```

---

## Running With Active Virtualenv

If you have an already-activated virtual environment and want to use it instead of the project’s pinned environment, you can force `uv` to target it:

```bash
uv run --active client.py
```

Otherwise, `uv` will warn that your active `$VIRTUAL_ENV` differs from the project `.venv` but continue using the project environment.

---

## License

Add a license section here (e.g., MIT) if you intend to distribute.

---

## Troubleshooting Cheat Sheet

| Symptom | Cause | Fix |
|---------|-------|-----|
| No tools listed | Server not running / crashed | Add stderr logging; run `uv run mcp_server.py` manually |
| `AttributeError` on `.text` | Cleaner returned `None` | Ensure `fetch_url` / the LLM call returns an actual string |
| 401 from Serper | Bad/missing API key | Check `.env` and reload your shell |
| Empty search results | Too-narrow query | Simplify the query or verify the domain key |
| High latency | Many sequential LLM chunk calls | Batch or reduce the chunk size |

---

## Contributing

1. Fork & branch.
2. Run `uv sync`.
3. Add tests for new tools (if added).
4. Open a PR with a clear description.

---

## Roadmap (Optional)

- [ ] Add JSON schema metadata for tool params.
- [ ] Structured response format (list of `{source, text}`).
- [ ] Add a caching layer.
- [ ] Add rate limiting/backoff.
- [ ] Add a CI workflow (lint + tests).

---

## Acknowledgments

- Serper.dev for the search API
- Groq for fast OSS model serving
- Astral for `uv`
- The MCP ecosystem for the protocol foundation
