## MCP Server: Documentation Retrieval & Web Scraping (uv + FastMCP)
This project provides a minimal, async MCP (Model Context Protocol) server that exposes a tool for retrieving and cleaning official documentation content for popular AI / Python ecosystem libraries. It uses:
- `fastmcp` to define and run the MCP server over stdio.
- `httpx` for async HTTP calls.
- `serper.dev` for Google-like search (via API).
- `groq` API (LLM) to clean raw HTML into readable text chunks.
- `python-dotenv` for environment variable management.
- `uv` as the package manager & runner (fast, lockfile-based, Python 3.11+).
### Features
- Search restricted to the official documentation domains of the supported libraries (`uv`, `langchain`, `openai`, `llama-index`).
- Tool: `get_docs(query, library)` returns concatenated cleaned sections with `SOURCE:` labels.
- Streaming-safe async design (chunking large HTML pages before LLM cleaning).
- Separate `client.py` demonstrating how to connect as an MCP client and call the tool, then post-process with an LLM.
---
## Quick Start
Prerequisites:
- Python 3.11+
- `uv` installed (https://docs.astral.sh/uv/)
- API keys for: `SERPER_API_KEY`, `GROQ_API_KEY`
### 1. Clone & Install
```bash
git clone <your-repo-url> mcp-server-python
cd mcp-server-python
uv sync
```
This will create/refresh a `.venv` based on `pyproject.toml` + `uv.lock`.
### 2. Environment Variables
Create a `.env` file in the project root:
```env
SERPER_API_KEY=your_serper_api_key_here
GROQ_API_KEY=your_groq_api_key_here
```
Optional: add other model settings if you later extend functionality.
### 3. Run the MCP Server Directly
```bash
uv run mcp_server.py
```
The server will start and wait on stdio (no extra output unless you add logging). It registers the tool `get_docs`.
### 4. Use the Provided Client
```bash
uv run client.py
```
You should see something like:
```
Available tools: ['get_docs']
ANSWER: <model-produced answer referencing SOURCE lines>
```
If the list is empty, ensure the server started correctly and no exceptions were raised (add logging—see below).
---
## Tool: get_docs
Signature:
```python
get_docs(query: str, library: str) -> str
```
Supported libraries (keys): `uv`, `langchain`, `openai`, `llama-index`.
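These keys presumably map to documentation domains in a dict along these lines (the domain values here are illustrative assumptions; check `mcp_server.py` for the actual mapping):

```python
# Hypothetical library-key -> docs-domain mapping.
docs_urls = {
    "uv": "docs.astral.sh/uv",
    "langchain": "python.langchain.com",
    "openai": "platform.openai.com/docs",
    "llama-index": "docs.llamaindex.ai",
}


def build_query(query: str, library: str) -> str:
    """Compose the site-restricted search query sent to Serper."""
    return f"site:{docs_urls[library]} {query}"
```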
Flow:
1. Build a site-restricted query: `site:<docs-domain> <query>`.
2. Call Serper API for organic results.
3. Fetch each result URL (async) via `httpx`.
4. Split HTML into ~4000‑char chunks (memory safety & LLM limits).
5. Clean each chunk using Groq LLM (`openai/gpt-oss-20b`) with a system prompt.
6. Concatenate and label each block with `SOURCE: <url>` for traceability.
Returned value: A large text blob suitable for retrieval-augmented prompting, preserving source attribution lines.
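Step 4 above can be sketched as a fixed-size split; the 4000-character size mirrors the figure in the flow, though the real implementation may slice differently:

```python
def chunk_text(html: str, size: int = 4000) -> list[str]:
    """Split raw HTML into fixed-size chunks to stay within LLM context limits."""
    return [html[i : i + size] for i in range(0, len(html), size)]
```

Each chunk is then cleaned independently, so a very large page never has to fit in a single LLM context window.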
---
## Architecture
File overview:
| File | Purpose |
|------|---------|
| `mcp_server.py` | Defines `FastMCP` instance and implements `search_web`, `fetch_url`, and the `get_docs` tool. |
| `client.py` | Launches server via stdio, lists tools, calls `get_docs`, then feeds result to an LLM for a user-friendly answer. |
| `utils.py` | HTML cleaning helpers (`trafilatura` for content extraction, plus the Groq LLM for chunk cleanup). |
| `.env` | Environment variables (excluded from VCS). |
| `pyproject.toml` | Declares dependencies and metadata. |
| `uv.lock` | Reproducible lockfile generated by `uv`. |
---
## Dependency Notes
Core runtime deps (from `pyproject.toml`):
- `fastmcp` – MCP server helper.
- `httpx` – async HTTP client.
- `groq` – Groq API client.
- `python-dotenv` – load variables from `.env`.
- `trafilatura` – heuristic content extraction (currently partially used / can be extended).
> Tip: If you add more scraping tools, reuse a single `httpx.AsyncClient` for performance.
---
## Logging & Debugging
To see what the server is doing, you can temporarily add:
```python
import logging, sys
logging.basicConfig(level=logging.INFO, stream=sys.stderr)
```
Place this near the top of `mcp_server.py`, after the imports. Because the MCP protocol uses stdout for JSON-RPC messages, send logs to stderr only.
Common issues:
- Empty tool list: The server exited early or crashed—add logging.
- `SERPER_API_KEY` missing → 401 or empty search results.
- `GROQ_API_KEY` missing → LLM cleaning fails (exception in `get_response_from_llm`).
- Network timeouts: Adjust `timeout` in `httpx.AsyncClient` calls.
---
## Extending
Ideas:
- Add caching layer (e.g., `sqlite` or in-memory dict) to avoid re-fetching same URLs.
- Parallelize URL fetch + clean with `asyncio.gather()` (mind rate limits / LLM cost).
- Add another tool (e.g., `summarize_diff`, `list_endpoints`).
- Provide structured JSON output (list of sources + cleaned text) instead of concatenated string.
- Add tests using `pytest` + `pytest-asyncio` (mock Serper + LLM APIs).
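The caching idea can be sketched with a plain dict guarded by a lock (`fetch` stands in for whatever async fetch helper the server already has; all names here are illustrative):

```python
import asyncio

_cache: dict[str, str] = {}
_lock = asyncio.Lock()


async def cached_fetch(url: str, fetch) -> str:
    """Fetch a URL once; later calls for the same URL return the cached body.

    `fetch` is any async callable(url) -> str.
    """
    async with _lock:
        if url in _cache:
            return _cache[url]
    body = await fetch(url)  # network call happens outside the lock
    async with _lock:
        _cache[url] = body
    return body
```

For a persistent cache, the same shape works with `sqlite3` keyed on URL; the dict version resets on every server restart.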
---
## Example Programmatic Use (Without Client Wrapper)
If you want to call the tool directly in a Python script using the client-side MCP library:
```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def demo():
    params = StdioServerParameters(command="uv", args=["run", "mcp_server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])
            result = await session.call_tool(
                "get_docs", {"query": "install", "library": "uv"}
            )
            # result.content is a list of content blocks, not a plain string
            print(result.content[0].text[:500])

asyncio.run(demo())
```
---
## Running With Active Virtualenv
If you have an already activated virtual environment and want to use that instead of the project’s pinned environment, you can force uv to target it:
```bash
uv run --active client.py
```
Otherwise, uv will warn that your active `$VIRTUAL_ENV` differs from the project `.venv` but continue using the project environment.
---
## License
Add a license section here (e.g., MIT) if you intend to distribute.
---
## Troubleshooting Cheat Sheet
| Symptom | Cause | Fix |
|---------|-------|-----|
| No tools listed | Server not running / crashed | Add stderr logging; run `uv run mcp_server.py` manually |
| AttributeError on `.text` | Cleaner returned None | Ensure you return actual string from `fetch_url` / LLM call |
| 401 from Serper | Bad/missing API key | Check `.env` and reload shell |
| Empty search results | Narrow query | Simplify query or verify domain key |
| High latency | Many sequential LLM chunk calls | Batch or reduce chunk size |
---
## Contributing
1. Fork & branch.
2. Run `uv sync`.
3. Add tests for new tools (if added).
4. Open PR with clear description.
---
## Roadmap (Optional)
- [ ] Add JSON schema metadata for tool params.
- [ ] Structured response format (list of {source, text}).
- [ ] Add caching layer.
- [ ] Add rate limiting/backoff.
- [ ] Add CI workflow (lint + tests).
---
## Acknowledgments
- Serper.dev for search API
- Groq for fast OSS model serving
- Astral for `uv`
- MCP ecosystem for protocol foundation