# MCP Server: Documentation Retrieval & Web Scraping (uv + FastMCP)
This project provides a minimal, async MCP (Model Context Protocol) server that exposes a tool for retrieving and cleaning official documentation content for popular AI / Python ecosystem libraries. It uses:

- `fastmcp` to define and run the MCP server over stdio.
- `httpx` for async HTTP calls.
- `serper.dev` for Google-like search (via its API).
- the `groq` API (LLM) to clean raw HTML into readable text chunks.
- `python-dotenv` for environment variable management.
- `uv` as the package manager & runner (fast, lockfile-based, Python 3.11+).
## Features

- Search restricted to official docs domains (`uv`, `langchain`, `openai`, `llama-index`).
- Tool: `get_docs(query, library)` returns concatenated cleaned sections with `SOURCE:` labels.
- Streaming-safe async design (large HTML pages are chunked before LLM cleaning).
- A separate `client.py` demonstrating how to connect as an MCP client, call the tool, and post-process the result with an LLM.
## Quick Start
Prerequisites:

- Python 3.11+
- `uv` installed (https://docs.astral.sh/uv/)
- API keys for `SERPER_API_KEY` and `GROQ_API_KEY`
### 1. Clone & Install
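The original commands were omitted here; a typical sequence (the repository URL below is a placeholder) looks like:

```shell
git clone https://github.com/<you>/<repo>.git   # placeholder URL
cd <repo>
uv sync
```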
This will create/refresh a `.venv` based on `pyproject.toml` + `uv.lock`.
### 2. Environment Variables

Create a `.env` file in the project root:
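The key names below come from the Prerequisites section; the values are placeholders:

```env
SERPER_API_KEY=your_serper_api_key
GROQ_API_KEY=your_groq_api_key
```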
Optional: add other model settings if you later extend functionality.
### 3. Run the MCP Server Directly
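The launch command was omitted from this section; assuming the server module is `mcp_server.py` (the filename referenced later in this README), a typical invocation is:

```shell
uv run mcp_server.py
```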
The server will start and wait on stdio (no extra output unless you add logging). It registers the tool `get_docs`.
### 4. Use the Provided Client
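Assuming the client script is `client.py` (per the Architecture section below):

```shell
uv run client.py
```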
You should see the server's tool list printed; it should include `get_docs`. If the list is empty, ensure the server started correctly and no exceptions were raised (add logging; see below).
## Tool: `get_docs`
Signature:
Supported libraries (keys): `uv`, `langchain`, `openai`, `llama-index`.
Flow:

1. Build a site-restricted query: `site:<docs-domain> <query>`.
2. Call the Serper API for organic results.
3. Fetch each result URL asynchronously via `httpx`.
4. Split the HTML into ~4000-char chunks (memory safety & LLM limits).
5. Clean each chunk using a Groq LLM (`openai/gpt-oss-20b`) with a system prompt.
6. Concatenate the results, labeling each block with `SOURCE: <url>` for traceability.
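The pure-Python parts of that flow can be sketched as follows. Function names and docs domains are assumptions (only the four library keys come from this README); the network and LLM steps are left out:

```python
# Illustrative sketch of the get_docs flow above; not the project's actual
# implementation. The real server fetches via httpx and cleans via Groq.
DOCS_URLS = {  # domain values are assumptions, except the four keys
    "uv": "docs.astral.sh/uv",
    "langchain": "python.langchain.com/docs",
    "openai": "platform.openai.com/docs",
    "llama-index": "docs.llamaindex.ai",
}

def build_search_query(query: str, library: str) -> str:
    """Step 1: restrict the search to the library's official docs domain."""
    return f"site:{DOCS_URLS[library]} {query}"

def chunk_html(html: str, size: int = 4000) -> list[str]:
    """Step 4: split raw HTML into ~4000-char chunks for the LLM cleaner."""
    return [html[i : i + size] for i in range(0, len(html), size)]

def label_source(url: str, cleaned: str) -> str:
    """Step 6: prefix each cleaned block with its source URL."""
    return f"SOURCE: {url}\n{cleaned}"
```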
Returned value: A large text blob suitable for retrieval-augmented prompting, preserving source attribution lines.
## Architecture
File overview:

| File | Purpose |
|------|---------|
| `mcp_server.py` | Defines the `FastMCP` instance and implements the search and fetch helpers plus the `get_docs` tool. |
| `client.py` | Launches the server via stdio, lists tools, calls `get_docs`, then feeds the result to an LLM for a user-friendly answer. |
| (cleaning helper) | HTML cleaning helper (currently uses an LLM + `trafilatura` for extraction and Groq for chunk transformation). |
| `.env` | Environment variables (excluded from VCS). |
| `pyproject.toml` | Declares dependencies and metadata. |
| `uv.lock` | Reproducible lockfile generated by `uv`. |
## Dependency Notes
Core runtime deps (from `pyproject.toml`):

- `fastmcp` – MCP server helper.
- `httpx` – async HTTP client.
- `groq` – Groq API client.
- `python-dotenv` – load variables from `.env`.
- `trafilatura` – heuristic content extraction (currently partially used / can be extended).
Tip: if you add more scraping tools, reuse a single `httpx.AsyncClient` for performance.
## Logging & Debugging
To see what the server is doing, you can temporarily add:
Place it near the top of `mcp_server.py`, after the imports. Since the protocol uses stdout for JSON-RPC, send logs to stderr only.
Common issues:

- Empty tool list: the server exited early or crashed; add logging.
- `SERPER_API_KEY` missing → 401 errors or empty search results.
- `GROQ_API_KEY` missing → LLM cleaning fails (exception in `get_response_from_llm`).
- Network timeouts: adjust `timeout` in the `httpx.AsyncClient` calls.
## Extending
Ideas:

- Add a caching layer (e.g., `sqlite` or an in-memory dict) to avoid re-fetching the same URLs.
- Parallelize URL fetching + cleaning with `asyncio.gather()` (mind rate limits / LLM cost).
- Add another tool (e.g., `summarize_diff`, `list_endpoints`).
- Provide structured JSON output (a list of sources + cleaned text) instead of a concatenated string.
- Add tests using `pytest` + `pytest-asyncio` (mock the Serper + LLM APIs).
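The parallelization idea can be sketched like this (helper names are hypothetical; `asyncio.sleep(0)` stands in for the real fetch + clean work):

```python
# Sketch: fetch-and-clean each URL concurrently with asyncio.gather.
import asyncio

async def fetch_and_clean(url: str) -> str:
    await asyncio.sleep(0)  # placeholder for httpx fetch + Groq cleaning
    return f"SOURCE: {url}\n..."

async def gather_docs(urls: list[str]) -> list[str]:
    # gather preserves input order, so SOURCE labels stay aligned with urls
    return list(await asyncio.gather(*(fetch_and_clean(u) for u in urls)))
```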
## Example Programmatic Use (Without Client Wrapper)
If you want to call the tool directly in a Python script using the client-side MCP library:
## Running With an Active Virtualenv
If you already have an activated virtual environment and want to use it instead of the project's pinned environment, you can force uv to target it:
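One option (assuming uv's `--active` flag, which tells `uv run` to prefer the active environment over the project's):

```shell
uv run --active client.py
```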
Otherwise, uv will warn that your active `$VIRTUAL_ENV` differs from the project's `.venv` but will continue using the project environment.
## License
Add a license section here (e.g., MIT) if you intend to distribute.
## Troubleshooting Cheat Sheet
| Symptom | Cause | Fix |
|---------|-------|-----|
| No tools listed | Server not running / crashed | Add stderr logging; run `mcp_server.py` manually |
| `AttributeError` on the cleaned result | Cleaner returned `None` | Ensure the cleaning helper / LLM call returns an actual string |
| 401 from Serper | Bad/missing API key | Check `SERPER_API_KEY` in `.env` and reload your shell |
| Empty search results | Query too narrow | Simplify the query or verify the domain key |
| High latency | Many sequential LLM chunk calls | Batch calls or reduce chunk size |
## Contributing
1. Fork & branch.
2. Run `uv sync`.
3. Add tests for new tools (if added).
4. Open a PR with a clear description.
## Roadmap (Optional)
- [ ] Add JSON schema metadata for tool params.
- [ ] Structured response format (list of `{source, text}`).
- [ ] Add a caching layer.
- [ ] Add rate limiting/backoff.
- [ ] Add a CI workflow (lint + tests).
## Acknowledgments
- Serper.dev for the search API
- Groq for fast OSS model serving
- Astral for `uv`
- The MCP ecosystem for the protocol foundation