# Weixin MCP Product Requirements Document

## 1. Project Background

### 1.1 Project Overview

**Project name**: Weixin MCP - Weixin Article Reader
**Project type**: Model Context Protocol (MCP) Server
**Goal**: Give large language models the ability to fetch Weixin Official Account article content

### 1.2 Problem Statement

Weixin Official Account articles impose the following technical constraints:

- Content requires a browser environment to render (dynamic loading)
- Anti-scraping mechanisms are in place
- Large language models cannot fetch content by visiting a Weixin article URL directly

### 1.3 Solution

Implement a server-side tool over the MCP protocol that uses Playwright browser automation to simulate a real browser visit, extract the article's structured content, and hand it to the model.

---

## 2. Typical User Scenarios

### 2.1 Use Case 1: Article Summarization

```
Input:
- URL: https://mp.weixin.qq.com/s/xxx
- Request: "Please summarize the main points of this article"

Flow:
1. The AI calls the read_weixin_article tool
2. The MCP server returns structured data:
   {
     "title": "Article title",
     "author": "Author name",
     "publish_time": "2025-11-05",
     "content": "Full article body...",
     "success": true
   }
3. The AI generates a summary from the content field

Expected output: "This article mainly discusses..."
```

### 2.2 Use Case 2: Multi-Article Comparison

```
Input:
- URLs: [url1, url2, url3]
- Request: "Compare the viewpoints of these three articles"

Flow:
1. The AI calls read_weixin_article three times in sequence
2. Each call returns structured data
3. The AI compares the content of the three articles

Expected output: "The first article argues..., the second emphasizes..., while the third..."
```

### 2.3 Use Case 3: Content Verification and Fact-Checking

```
Input:
- URL: https://mp.weixin.qq.com/s/xxx
- Request: "Verify whether the figures cited in the article are accurate"

Flow:
1. The AI calls the tool to fetch the article content
2. The AI extracts the figures and citations from the article
3. The AI reasons over them and compares them against known information

Expected output: "The figures in the article are consistent with / differ from public sources"
```

### 2.4 Test Case Design

```python
# Test data structure
test_cases = [
    {
        "case_id": "TC001",
        "url": "https://mp.weixin.qq.com/s/valid_article_id",
        "expected_fields": ["title", "author", "publish_time", "content"],
        "expected_success": True,
        "description": "Read a normal article"
    },
    {
        "case_id": "TC002",
        "url": "https://mp.weixin.qq.com/s/invalid_id",
        "expected_success": False,
        "expected_error": "Article not found",
        "description": "Handle an invalid URL"
    },
    {
        "case_id": "TC003",
        "url": "https://mp.weixin.qq.com/s/deleted_article",
        "expected_success": False,
        "expected_error": "Article has been deleted",
        "description": "Handle a deleted article"
    }
]
```

---

## 3. Technology Choices and Architecture

### 3.1 Core Technology Stack

| Technology | Version | Purpose | Rationale |
|------|------|------|----------|
| **Python** | 3.10+ | Implementation language | fastmcp is Python-based; mature ecosystem |
| **fastmcp** | latest | MCP framework | Simplifies MCP server development with a decorator API |
| **Playwright** | latest | Browser automation | Real browser environment; bypasses anti-scraping checks |
| **BeautifulSoup4** | 4.12+ | HTML parsing | Content extraction and cleanup |

### 3.2 System Architecture

```
┌─────────────┐      MCP Protocol        ┌──────────────────┐
│             │ <──────────────────────> │                  │
│  AI Client  │        JSON-RPC          │    MCP Server    │
│  (Claude)   │                          │ (wx-mcp-server)  │
│             │                          │                  │
└─────────────┘                          └────────┬─────────┘
                                                  │
                                                  │ Control
                                                  ▼
                                         ┌─────────────────┐
                                         │   Playwright    │
                                         │    Browser      │
                                         │   (Chromium)    │
                                         └────────┬────────┘
                                                  │
                                                  │ HTTP Request
                                                  ▼
                                         ┌─────────────────┐
                                         │  Weixin Server  │
                                         │ mp.weixin.qq.com│
                                         └─────────────────┘
```

### 3.3 Project Layout

```
wx-mcp-server/
├── src/
│   ├── __init__.py
│   ├── server.py          # MCP server entry point
│   ├── scraper.py         # Playwright scraping logic
│   ├── parser.py          # Content parser
│   └── utils.py           # Helper functions
├── tests/
│   ├── test_scraper.py
│   └── test_parser.py
├── pyproject.toml         # Project configuration
├── requirements.txt       # Dependency management
└── README.md
```

---

## 4. Core Flow

### 4.1 Main Flow

```
[User request] → [AI detects URL] → [Call MCP tool]
                                          ↓
                            [Launch Playwright browser]
                                          ↓
                            [Open the Weixin article URL]
                                          ↓
                            [Wait for the page to fully load]
                                          ↓
                            [Extract DOM element content]
                                          ↓
                    ┌─────────────────────┴─────────────────────┐
                    ↓                                           ↓
             [Parse succeeds]                             [Parse fails]
                    ↓                                           ↓
         [Return structured data]                    [Return error message]
                    ↓                                           ↓
                    └─────────────────────┬─────────────────────┘
                                          ↓
                                   [Close the page]
                                          ↓
                           [Hand the result back to the AI]
```

### 4.2 Data Flow

```
1. Input:   {"url": "https://mp.weixin.qq.com/s/xxx"}
2. Process: Browser → DOM → Parser → JSON
3. Output:  {
      "title": string,
      "author": string,
      "publish_time": string,
      "content": string,
      "success": boolean,
      "error": string | null
    }
```
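To keep the server, scraper, parser, and tests in agreement on these field names, the schema in 4.2 can be written down as a typed structure. A minimal sketch using `typing.TypedDict` (the name `ArticleResult` is illustrative, not part of the project layout in 3.3):

```python
from typing import Optional, TypedDict


class ArticleResult(TypedDict, total=False):
    """Shape of the read_weixin_article payload (see Section 4.2).

    total=False because the error paths in Sections 5.1/5.2 return
    only the success and error keys, omitting the article fields.
    """

    success: bool
    title: str
    author: str
    publish_time: str
    content: str
    error: Optional[str]  # null on success, message on failure
```

---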
## 5. Key Technical Implementation

### 5.1 MCP Server Entry Point (`server.py`)

```python
from fastmcp import FastMCP
from scraper import WeixinScraper
import logging

# Initialize the MCP server
mcp = FastMCP("weixin-reader")

# Initialize the scraper
scraper = WeixinScraper()


@mcp.tool()
async def read_weixin_article(url: str) -> dict:
    """
    Read the content of a Weixin Official Account article.

    Args:
        url: Article URL in the form https://mp.weixin.qq.com/s/xxx

    Returns:
        dict: {
            "success": bool,
            "title": str,
            "author": str,
            "publish_time": str,
            "content": str,
            "error": str | None
        }
    """
    try:
        # Validate the URL
        if not url.startswith("https://mp.weixin.qq.com/s/"):
            return {
                "success": False,
                "error": "Invalid URL format. Must be a Weixin article URL."
            }

        # Fetch the content via the scraper
        result = await scraper.fetch_article(url)
        return result

    except Exception as e:
        logging.error(f"Error fetching article: {e}")
        return {
            "success": False,
            "error": str(e)
        }


if __name__ == "__main__":
    # Start the MCP server
    mcp.run()
```

### 5.2 Playwright Scraper (`scraper.py`)

```python
from playwright.async_api import async_playwright
from parser import WeixinParser
import asyncio


class WeixinScraper:
    def __init__(self):
        self.parser = WeixinParser()
        self.browser = None
        self.context = None

    async def initialize(self):
        """Launch the browser on first use."""
        if not self.browser:
            playwright = await async_playwright().start()
            self.browser = await playwright.chromium.launch(
                headless=True,
                args=[
                    '--disable-blink-features=AutomationControlled',
                    '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                ]
            )
            self.context = await self.browser.new_context(
                viewport={'width': 1920, 'height': 1080},
                user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
            )

    async def fetch_article(self, url: str) -> dict:
        """
        Fetch a Weixin article.

        Args:
            url: Article URL

        Returns:
            dict: Structured article data
        """
        try:
            await self.initialize()

            # Open a new page
            page = await self.context.new_page()

            # Navigate and wait for the network to go idle
            await page.goto(url, wait_until='networkidle', timeout=30000)

            # Wait for the article body element
            await page.wait_for_selector('#js_content', timeout=10000)

            # Capture the page HTML
            html_content = await page.content()

            # Close the page
            await page.close()

            # Parse the content
            result = self.parser.parse(html_content, url)

            return {
                "success": True,
                **result,
                "error": None
            }

        except Exception as e:
            return {
                "success": False,
                "error": f"Failed to fetch article: {str(e)}"
            }

    async def cleanup(self):
        """Release browser resources."""
        if self.browser:
            await self.browser.close()
```
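The scraper can also be exercised on its own during development, without going through the MCP server. A minimal sketch, assuming it is run from `src/` so that `scraper` is importable; the URL is a placeholder, not a real article:

```python
import asyncio

from scraper import WeixinScraper


async def main():
    scraper = WeixinScraper()
    try:
        # Placeholder URL: substitute a real article link when testing
        result = await scraper.fetch_article("https://mp.weixin.qq.com/s/xxx")
        if result["success"]:
            print(result["title"])
            print(result["content"][:200])
        else:
            print(f"Fetch failed: {result['error']}")
    finally:
        # Explicitly release the browser in short-lived scripts
        await scraper.cleanup()


if __name__ == "__main__":
    asyncio.run(main())
```

Note that `server.py` never calls `cleanup()`: the browser instance is deliberately kept alive across requests (see Section 7.2), so explicit cleanup matters mainly in short-lived scripts and tests like this one.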
### 5.3 Content Parser (`parser.py`)

```python
from bs4 import BeautifulSoup
import re
from datetime import datetime


class WeixinParser:
    def parse(self, html: str, url: str) -> dict:
        """
        Parse the HTML of a Weixin article.

        Args:
            html: Page HTML
            url: Article URL

        Returns:
            dict: Parsed, structured data
        """
        soup = BeautifulSoup(html, 'html.parser')

        # Extract the title
        title_elem = soup.find('h1', {'id': 'activity-name'})
        title = title_elem.get_text(strip=True) if title_elem else "Title not found"

        # Extract the author
        author_elem = soup.find('span', {'id': 'js_author_name'}) or \
                      soup.find('a', {'id': 'js_name'})
        author = author_elem.get_text(strip=True) if author_elem else "Unknown author"

        # Extract the publish time
        time_elem = soup.find('em', {'id': 'publish_time'})
        publish_time = time_elem.get_text(strip=True) if time_elem else "Unknown time"

        # Extract the article body
        content_elem = soup.find('div', {'id': 'js_content'})
        if content_elem:
            # Clean the content
            content = self._clean_content(content_elem)
        else:
            content = "Body content not found"

        return {
            "title": title,
            "author": author,
            "publish_time": publish_time,
            "content": content
        }

    def _clean_content(self, content_elem) -> str:
        """
        Clean the article body.

        Args:
            content_elem: BeautifulSoup element

        Returns:
            str: Cleaned plain-text content
        """
        # Drop script and style tags
        for tag in content_elem.find_all(['script', 'style']):
            tag.decompose()

        # Extract the text
        text = content_elem.get_text(separator='\n', strip=True)

        # Collapse extra whitespace
        text = re.sub(r'\n{3,}', '\n\n', text)
        text = re.sub(r' {2,}', ' ', text)

        return text.strip()
```

### 5.4 Project Configuration (`pyproject.toml`)

```toml
[project]
name = "wx-mcp-server"
version = "0.1.0"
description = "An MCP server for reading Weixin articles"
requires-python = ">=3.10"
dependencies = [
    "fastmcp>=0.1.0",
    "playwright>=1.40.0",
    "beautifulsoup4>=4.12.0",
    "lxml>=4.9.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=7.4.0",
    "pytest-asyncio>=0.21.0",
]

[tool.pytest.ini_options]
asyncio_mode = "auto"
```

### 5.5 Install and Run

```bash
# Set up the project
cd wx-mcp-server
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -e .
playwright install chromium

# Run the server
python src/server.py
```

---

## 6. Error Handling Strategy

### 6.1 Error Types and Handling

| Error type | Trigger | Handling | Returned message |
|---------|---------|---------|---------|
| **Malformed URL** | URL is not a Weixin article link | Fail immediately | `"Invalid URL format"` |
| **Network timeout** | Page not loaded within 30 s | Retry once, then fail | `"Network timeout"` |
| **Page missing** | 404 or article deleted | Return an error | `"Article not found or deleted"` |
| **Element missing** | Key DOM element absent | Return partial data | Success, with "not found" placeholder fields |
| **Browser crash** | Playwright exception | Restart the browser instance | `"Browser error, please retry"` |

### 6.2 Retry Mechanism

```python
async def fetch_article_with_retry(self, url: str, max_retries: int = 2) -> dict:
    """Fetch an article with retries."""
    for attempt in range(max_retries):
        try:
            result = await self.fetch_article(url)
            if result["success"]:
                return result
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # Exponential backoff
    return {"success": False, "error": "Max retries exceeded"}
```

---

## 7. Performance and Optimization

### 7.1 Performance Targets

- **Response time**: < 10 s per article
- **Success rate**: > 95%
- **Memory usage**: < 500 MB

### 7.2 Optimization Strategies

1. **Browser reuse**: keep the browser instance alive instead of relaunching it per request
2. **Concurrency control**: cap the number of in-flight requests
3. **Caching**: optional article-content cache to avoid repeated fetches
4. **Resource filtering**: block image/video downloads and fetch text only

```python
# Optimization example: block unneeded resources
await page.route("**/*.{png,jpg,jpeg,gif,svg,mp4}", lambda route: route.abort())
```

---

## 8. Security and Compliance

### 8.1 Usage Restrictions

- For personal study and research only
- Comply with the Weixin Official Accounts Platform terms of service
- No high-frequency scraping (recommended interval > 2 seconds)
- No commercial use

### 8.2 Rate Limiting

```python
from asyncio import Semaphore


class WeixinScraper:
    def __init__(self, max_concurrent=3):
        self.semaphore = Semaphore(max_concurrent)

    async def fetch_article(self, url: str):
        async with self.semaphore:  # Cap concurrency
            return await self._fetch_article_impl(url)
```

---

## 9. Testing Requirements

### 9.1 Unit Test Coverage

- URL validation
- HTML parsing
- Error handling

### 9.2 Integration Tests

```python
# tests/test_integration.py
import pytest
from src.server import read_weixin_article


@pytest.mark.asyncio
async def test_read_valid_article():
    """Read a valid article."""
    url = "https://mp.weixin.qq.com/s/test_valid_article"
    result = await read_weixin_article(url)
    assert result["success"] is True
    assert "title" in result
    assert len(result["content"]) > 0


@pytest.mark.asyncio
async def test_read_invalid_url():
    """Reject an invalid URL."""
    url = "https://invalid.com/article"
    result = await read_weixin_article(url)
    assert result["success"] is False
    assert "Invalid URL" in result["error"]
```

---

## 10. Deployment and Integration

### 10.1 MCP Configuration

Add to the Claude Desktop configuration:

```json
{
  "mcpServers": {
    "weixin-reader": {
      "command": "python",
      "args": [
        "C:/Users/chenqimei/Desktop/wx-mcp/wx-mcp-server/src/server.py"
      ]
    }
  }
}
```

### 10.2 Example AI Interaction

```
User: Please summarize this article for me: https://mp.weixin.qq.com/s/xxx

Internal AI flow:
1. Detect the URL
2. Call read_weixin_article(url="https://mp.weixin.qq.com/s/xxx")
3. Receive the returned data
4. Generate a summary from the content field

AI reply: This article, "Article Title", published by author XXX on 2025-11-05, mainly discusses...
```
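For reference, the payload the AI receives in step 3 follows the schema from Section 4.2; all values below are illustrative:

```json
{
  "success": true,
  "title": "Article title",
  "author": "Author name",
  "publish_time": "2025-11-05",
  "content": "Full article body...",
  "error": null
}
```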
