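LLM Gateway MCP Server

Python 3.13+ | License: MIT | MCP Protocol

A Model Context Protocol (MCP) server that enables intelligent task delegation from high-capability AI agents to cost-effective LLMs.

What is LLM Gateway?

LLM Gateway is an MCP-native server that lets advanced AI agents (such as Claude 3.7 Sonnet) delegate tasks intelligently to more cost-effective models (such as Gemini 2.0 Flash Lite). It provides a unified interface to multiple large language model (LLM) providers while optimizing for cost, performance, and quality.

Vision: AI-Driven Resource Optimization

At its core, LLM Gateway represents a fundamental shift in how we interact with AI systems. Instead of using a single expensive model for every task, it builds an intelligent hierarchy in which:

Advanced models such as Claude 3.7 focus on high-level reasoning, orchestration, and complex tasks
Cost-effective models handle routine processing, extraction, and mechanical tasks
The overall system achieves near-top-tier performance at a fraction of the cost

This approach mirrors how human organizations work: experts handle complex decisions while delegating routine tasks to others with the appropriate skills.

MCP-Native Architecture

The server is built on the Model Context Protocol (MCP) and is designed specifically to work with AI agents such as Claude. All functionality is exposed as MCP tools that these agents can call directly, creating a seamless workflow for AI-to-AI delegation.

Primary Use Case: AI Agent Task Delegation

The primary design goal of LLM Gateway is to let sophisticated AI agents such as Claude 3.7 Sonnet intelligently delegate tasks to cheaper models: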

                          delegates to
    ┌─────────────┐ ────────────────────────► ┌───────────────────┐         ┌──────────────┐
    │ Claude 3.7  │                           │    LLM Gateway    │ ───────►│ Gemini Flash │
    │   (Agent)   │ ◄──────────────────────── │    MCP Server     │ ◄───────│ DeepSeek     │
    └─────────────┘      returns results      └───────────────────┘         │ GPT-4o-mini  │
                                                                            └──────────────┘

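Example workflow:

Claude determines that a document needs to be summarized (an expensive operation for Claude itself)
Claude delegates the task to LLM Gateway through an MCP tool
LLM Gateway routes the summarization task to Gemini Flash (10-20x cheaper than Claude)
The summary is returned to Claude for higher-level reasoning and decision-making
Claude can then focus its capabilities on the tasks that truly need its intelligence

This delegation pattern can save 70-90% of API costs while maintaining output quality.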

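Why Use LLM Gateway?

🔄 AI-to-AI Task Delegation

The most powerful use case is enabling advanced AI agents to delegate routine tasks to cheaper models:

Have Claude 3.7 use GPT-4o-mini for initial document summarization
Have Claude use Gemini 2.0 Flash Lite for data extraction and transformation
Let Claude orchestrate multi-stage workflows across different providers
Enable Claude to choose the right model for each specific subtask

💰 Cost Optimization

API costs for advanced models can be substantial. LLM Gateway helps reduce them by:

Routing suitable tasks to cheaper models (for example, $0.01 per 1K tokens vs. $0.15 per 1K tokens)
Implementing advanced caching to avoid redundant API calls
Tracking and optimizing costs across providers
Enabling cost-aware task routing decisions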
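
To make the price gap above concrete, here is a rough sketch of the arithmetic. The per-1K-token prices are the two figures quoted in the list; the function name `estimate_cost` and the 200,000-token workload are hypothetical and chosen only for illustration.

```python
def estimate_cost(tokens: int, price_per_1k: float) -> float:
    """Return the USD cost of processing `tokens` tokens at a per-1K price."""
    return tokens / 1000 * price_per_1k


PREMIUM_PRICE = 0.15  # $ per 1K tokens, premium-model figure quoted above
CHEAP_PRICE = 0.01    # $ per 1K tokens, cheaper-model figure quoted above

tokens = 200_000  # hypothetical workload size
premium_cost = estimate_cost(tokens, PREMIUM_PRICE)
cheap_cost = estimate_cost(tokens, CHEAP_PRICE)
savings = 1 - cheap_cost / premium_cost  # fraction of spend avoided

print(f"premium: ${premium_cost:.2f}, cheap: ${cheap_cost:.2f}, savings: {savings:.0%}")
```

Real savings depend on how much of a workload can actually be routed to the cheaper model, so this ratio is an upper bound for a mixed workload.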

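🔄 Provider Abstraction

Avoid provider lock-in through a unified interface:

A standard API for OpenAI, Anthropic (Claude), Google (Gemini), and DeepSeek
Consistent parameter handling and response formats
Swap providers without changing application code
Protection against provider-specific outages and limits

📄 Document Processing at Scale

Process large documents efficiently:

Break documents into semantically meaningful chunks
Process chunks in parallel across multiple models
Extract structured data from unstructured text
Generate summaries and insights from large bodies of text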

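Key Features

MCP Protocol Integration

Native MCP server: built on the Model Context Protocol for AI agent integration
MCP tool framework: all functionality is exposed through standardized MCP tools
Tool composition: tools can be combined into complex workflows
Tool discovery: supports tool listing and capability discovery

Intelligent Task Delegation

Task routing: analyzes tasks and routes them to an appropriate model
Provider selection: chooses a provider based on task requirements
Cost-performance balancing: optimizes for cost, quality, or speed
Delegation tracking: monitors delegation patterns and outcomes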
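
How the gateway weighs cost, quality, and speed internally is not described here. As a minimal sketch of the idea, the function below picks a candidate according to a single optimization target. The model names, field names, and numbers are all hypothetical placeholders, not measurements of any real model.

```python
# Hypothetical candidates; every value below is a placeholder.
CANDIDATES = [
    {"model": "model-small", "cost_per_1k": 0.01, "quality": 0.70, "latency_s": 0.5},
    {"model": "model-medium", "cost_per_1k": 0.05, "quality": 0.85, "latency_s": 1.2},
    {"model": "model-large", "cost_per_1k": 0.15, "quality": 0.95, "latency_s": 2.5},
]


def pick_model(candidates, optimize_for="cost"):
    """Return the candidate that is best for the requested target."""
    if optimize_for == "cost":
        return min(candidates, key=lambda c: c["cost_per_1k"])
    if optimize_for == "quality":
        return max(candidates, key=lambda c: c["quality"])
    if optimize_for == "speed":
        return min(candidates, key=lambda c: c["latency_s"])
    raise ValueError(f"unknown optimization target: {optimize_for!r}")


print(pick_model(CANDIDATES, "cost")["model"])
print(pick_model(CANDIDATES, "quality")["model"])
```

A production router would likely combine these criteria (for example, the cheapest model above a quality threshold) rather than optimize one in isolation.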

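Advanced Caching

Multi-level caching, with several strategies:
- Exact-match caching
- Semantic similarity caching
- Task-aware caching
Persistent cache: disk-based persistence with fast in-memory access
Cache analytics: tracks savings and hit rates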
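
The simplest of the strategies above, exact-match caching, can be sketched as follows. This is an illustration under stated assumptions, not the gateway's implementation: the class name `ExactMatchCache`, its methods, and the choice of a SHA-256 key over the request fields are all hypothetical.

```python
import hashlib
import json
import time


class ExactMatchCache:
    """In-memory exact-match cache with a per-entry time-to-live."""

    def __init__(self, ttl_seconds=86400, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._store = {}     # key -> (expiry_time, value)
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(provider, model, prompt, **params):
        """Build a stable key from everything that affects the response."""
        payload = json.dumps(
            {"provider": provider, "model": model, "prompt": prompt, "params": params},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or entry[0] <= self._clock():
            self._store.pop(key, None)  # drop expired entries lazily
            self.misses += 1
            return None
        self.hits += 1
        return entry[1]

    def set(self, key, value):
        self._store[key] = (self._clock() + self._ttl, value)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Because the key covers the provider, model, prompt, and parameters, two requests share a cache entry only when they are identical. Semantic caching would instead compare embeddings of the prompts.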

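Document Tools

Smart chunking, with several strategies:
- Token-based chunking
- Semantic boundary detection
- Structural analysis
Document operations:
- Summarization
- Entity extraction
- Question generation
- Batch processing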
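
The token-based strategy listed above can be sketched as a sliding window with overlap. This is a minimal illustration, not the gateway's chunker: the function name `chunk_by_tokens` is hypothetical, and whitespace-separated words stand in for real model tokens.

```python
def chunk_by_tokens(text, chunk_size=500, overlap=50):
    """Split text into chunks of up to `chunk_size` tokens.

    Consecutive chunks share `overlap` tokens so that context spanning a
    boundary is not lost. Whitespace-separated words approximate tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_by_tokens(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Semantic and structural chunking would replace the fixed window with boundaries found at sentence, paragraph, or heading level.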

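Structured Data Extraction

JSON extraction: extracts structured JSON with schema validation
Table extraction: extracts tables in multiple formats
Key-value extraction: extracts key-value pairs from text
Semantic schema inference: generates a schema from text

Tournament Mode

Code and text competitions: supports running tournament-style competitions
Multi-model: compares outputs from different models side by side
Performance metrics: evaluates and tracks model performance
Result storage: saves competition results for further analysis

Advanced Vector Operations

Semantic search: finds semantically similar content across documents
Vector storage: stores and retrieves vector embeddings efficiently
Hybrid search: combines keyword and semantic search
Batch processing: handles large datasets efficiently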

Usage Examples

Claude Using LLM Gateway for Document Analysis

This example shows how Claude can use LLM Gateway to process a document by delegating tasks to cheaper models:

    import asyncio

    from mcp.client import Client


    async def main():
        # Claude would use this client to connect to the LLM Gateway
        client = Client("http://localhost:8013")

        # Claude can identify a document that needs processing
        document = "... large document content ..."

        # Step 1: Claude delegates document chunking
        chunks_response = await client.tools.chunk_document(
            document=document,
            chunk_size=1000,
            method="semantic"
        )
        print(f"Document divided into {chunks_response['chunk_count']} chunks")

        # Step 2: Claude delegates summarization to a cheaper model
        summaries = []
        total_cost = 0
        for i, chunk in enumerate(chunks_response["chunks"]):
            # Use Gemini Flash (much cheaper than Claude)
            summary = await client.tools.summarize_document(
                document=chunk,
                provider="gemini",
                model="gemini-2.0-flash-lite",
                format="paragraph"
            )
            summaries.append(summary["summary"])
            total_cost += summary["cost"]
            print(f"Processed chunk {i+1} with cost ${summary['cost']:.6f}")

        # Step 3: Claude delegates entity extraction to another cheap model
        entities = await client.tools.extract_entities(
            document=document,
            entity_types=["person", "organization", "location", "date"],
            provider="openai",
            model="gpt-4o-mini"
        )
        total_cost += entities["cost"]

        print(f"Total delegation cost: ${total_cost:.6f}")

        # Claude would now process these summaries and entities using its advanced capabilities

        # Close the client when done
        await client.close()


    if __name__ == "__main__":
        asyncio.run(main())

Multi-Provider Comparison for Decision-Making

    # Claude can compare outputs from different providers for critical tasks
    responses = await client.tools.multi_completion(
        prompt="Explain the implications of quantum computing for cryptography.",
        providers=[
            {"provider": "openai", "model": "gpt-4o-mini", "temperature": 0.3},
            {"provider": "anthropic", "model": "claude-3-haiku-20240307", "temperature": 0.3},
            {"provider": "gemini", "model": "gemini-2.0-pro", "temperature": 0.3}
        ]
    )

    # Claude could analyze these responses and decide which is most accurate
    for provider_key, result in responses["results"].items():
        if result["success"]:
            print(f"{provider_key} Cost: ${result['cost']}")

Cost-Optimized Workflow

    # Claude can define and execute complex multi-stage workflows
    workflow = [
        {
            "name": "Initial Analysis",
            "operation": "summarize",
            "provider": "gemini",
            "model": "gemini-2.0-flash-lite",
            "input_from": "original",
            "output_as": "summary"
        },
        {
            "name": "Entity Extraction",
            "operation": "extract_entities",
            "provider": "openai",
            "model": "gpt-4o-mini",
            "input_from": "original",
            "output_as": "entities"
        },
        {
            "name": "Question Generation",
            "operation": "generate_qa",
            "provider": "deepseek",
            "model": "deepseek-chat",
            "input_from": "summary",
            "output_as": "questions"
        }
    ]

    # Execute the workflow
    results = await client.tools.execute_optimized_workflow(
        documents=[document],
        workflow=workflow
    )

    print(f"Workflow completed in {results['processing_time']:.2f}s")
    print(f"Total cost: ${results['total_cost']:.6f}")

Document Chunking

To split a large document into smaller, more manageable chunks:

    large_document = "... your very large document content ..."

    chunking_response = await client.tools.chunk_document(
        document=large_document,
        chunk_size=500,     # Target size in tokens
        overlap=50,         # Token overlap between chunks
        method="semantic"   # Or "token", "structural"
    )

    if chunking_response["success"]:
        print(f"Document divided into {chunking_response['chunk_count']} chunks.")
        # chunking_response['chunks'] contains the list of text chunks
    else:
        print(f"Error: {chunking_response['error']}")

Multi-Provider Completion

To get completions for the same prompt from several providers or models at once, for comparison:

    multi_response = await client.tools.multi_completion(
        prompt="What are the main benefits of using the MCP protocol?",
        providers=[
            {"provider": "openai", "model": "gpt-4o-mini"},
            {"provider": "anthropic", "model": "claude-3-haiku-20240307"},
            {"provider": "gemini", "model": "gemini-2.0-flash-lite"}
        ],
        temperature=0.5
    )

    if multi_response["success"]:
        print("Multi-completion results:")
        for provider_key, result in multi_response["results"].items():
            if result["success"]:
                print(f"--- {provider_key} ---")
                print(f"Completion: {result['completion']}")
                print(f"Cost: ${result['cost']:.6f}")
            else:
                print(f"--- {provider_key} Error: {result['error']} ---")
    else:
        print(f"Multi-completion failed: {multi_response['error']}")

Structured Data Extraction (JSON)

To extract information from text into a specific JSON schema:

    text_with_data = "User John Doe (john.doe@example.com) created an account on 2024-07-15. His user ID is 12345."

    desired_schema = {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "email": {"type": "string", "format": "email"},
            "creation_date": {"type": "string", "format": "date"},
            "user_id": {"type": "integer"}
        },
        "required": ["name", "email", "creation_date", "user_id"]
    }

    json_response = await client.tools.extract_json(
        document=text_with_data,
        json_schema=desired_schema,
        provider="openai",  # Choose a provider capable of structured extraction
        model="gpt-4o-mini"
    )

    if json_response["success"]:
        print(f"Extracted JSON: {json_response['json_data']}")
        print(f"Cost: ${json_response['cost']:.6f}")
    else:
        print(f"Error: {json_response['error']}")
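
How the gateway validates extracted JSON against the schema is not shown in this document. As a minimal sketch, a caller could double-check the result with a small validator like the one below. The function name `check_against_schema` is hypothetical, and it covers only `required` and primitive `type` keywords, far less than a full JSON Schema validator.

```python
# Maps JSON Schema primitive type names to Python types.
_TYPE_MAP = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def check_against_schema(data, schema):
    """Return a list of problems; an empty list means the data passed."""
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    errors = []
    for name in schema.get("required", []):
        if name not in data:
            errors.append(f"missing required field: {name}")

    for name, rules in schema.get("properties", {}).items():
        if name not in data:
            continue
        type_name = rules.get("type")
        expected = _TYPE_MAP.get(type_name)
        if expected is None:
            continue  # unknown or unspecified type: nothing to check
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly
        bool_as_number = isinstance(value, bool) and type_name in ("integer", "number")
        if bool_as_number or not isinstance(value, expected):
            errors.append(f"field {name!r} should be of type {type_name}")
    return errors


schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "user_id": {"type": "integer"}},
    "required": ["name", "user_id"],
}
print(check_against_schema({"name": "John Doe", "user_id": 12345}, schema))
print(check_against_schema({"name": "John Doe", "user_id": "12345"}, schema))
```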

Retrieval-Augmented Generation (RAG) Query

To ask a question using RAG, where the system retrieves relevant context before generating an answer (assuming the relevant documents have already been indexed):

    rag_response = await client.tools.rag_query(  # Assuming a tool name like rag_query
        query="What were the key findings in the latest financial report?",
        # Parameters to control retrieval, e.g.:
        # index_name="financial_reports",
        # top_k=3,
        provider="anthropic",
        model="claude-3-haiku-20240307"  # Model to generate the answer based on context
    )

    if rag_response["success"]:
        print(f"RAG Answer:\n{rag_response['answer']}")
        # Potentially include retrieved sources: rag_response['sources']
        print(f"Cost: ${rag_response['cost']:.6f}")
    else:
        print(f"Error: {rag_response['error']}")

Fused Search (Keyword + Semantic)

To run a hybrid search with Marqo that combines keyword relevance and semantic similarity:

    fused_search_response = await client.tools.fused_search(  # Assuming a tool name like fused_search
        query="impact of AI on software development productivity",
        # Parameters for Marqo index and tuning:
        # index_name="tech_articles",
        # keyword_weight=0.3,   # Weight for keyword score (0.0 to 1.0)
        # semantic_weight=0.7,  # Weight for semantic score (0.0 to 1.0)
        # top_n=5,
        # filter_string="year > 2023"
    )

    if fused_search_response["success"]:
        print(f"Fused Search Results ({len(fused_search_response['results'])} hits):")
        for hit in fused_search_response["results"]:
            print(f" - Score: {hit['_score']:.4f}, ID: {hit['_id']}, Content: {hit.get('text', '')[:100]}...")
    else:
        print(f"Error: {fused_search_response['error']}")
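
The effect of the two weights can be illustrated with a simple weighted sum of per-document scores. This sketch does not describe how Marqo or the gateway actually fuses results; the function name `fuse_scores` and the sample scores are hypothetical, and both score sets are assumed to be normalized to the range 0.0 to 1.0.

```python
def fuse_scores(keyword_scores, semantic_scores, keyword_weight=0.3, semantic_weight=0.7):
    """Combine two {doc_id: score} mappings into one ranked list.

    A document missing from one mapping contributes 0.0 for that signal.
    """
    doc_ids = set(keyword_scores) | set(semantic_scores)
    fused = {
        doc_id: keyword_weight * keyword_scores.get(doc_id, 0.0)
        + semantic_weight * semantic_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest fused score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


keyword = {"doc-a": 1.0, "doc-b": 0.2}   # hypothetical keyword scores
semantic = {"doc-b": 0.9, "doc-c": 0.5}  # hypothetical semantic scores
for doc_id, score in fuse_scores(keyword, semantic):
    print(f"{doc_id}: {score:.2f}")
```

Raising `keyword_weight` favors documents with exact term matches, while raising `semantic_weight` favors documents that are close in meaning even when the wording differs.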

Local Text Processing

To perform local, offline text operations without calling an LLM API:

    # Assuming a tool that bundles local text functions
    local_process_response = await client.tools.process_local_text(
        text=" Extra spaces and\nnewlines\t here. ",
        operations=[
            {"action": "trim_whitespace"},
            {"action": "normalize_newlines"},
            {"action": "lowercase"}
        ]
    )

    if local_process_response["success"]:
        print(f"Processed Text: '{local_process_response['processed_text']}'")
    else:
        print(f"Error: {local_process_response['error']}")
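
The three actions named in the example are simple enough to sketch with the standard library. The exact semantics of the gateway's actions are not specified in this document, so the behavior below (collapsing runs of spaces and tabs, converting CR and CRLF to LF) is an assumption, as is the local function `process_local_text`.

```python
import re


def trim_whitespace(text):
    """Collapse runs of spaces/tabs and strip leading/trailing whitespace."""
    return re.sub(r"[ \t]+", " ", text).strip()


def normalize_newlines(text):
    """Convert Windows (CRLF) and old Mac (CR) line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


OPERATIONS = {
    "trim_whitespace": trim_whitespace,
    "normalize_newlines": normalize_newlines,
    "lowercase": str.lower,
}


def process_local_text(text, operations):
    """Apply each requested action in order and return the final text."""
    for op in operations:
        text = OPERATIONS[op["action"]](text)
    return text


processed = process_local_text(
    " Extra spaces and\nnewlines\t here. ",
    [{"action": "trim_whitespace"}, {"action": "normalize_newlines"}, {"action": "lowercase"}],
)
print(repr(processed))
```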

Running a Model Tournament

To compare the outputs of several models on a specific task (for example, code generation):

    # Assuming a tournament tool
    tournament_response = await client.tools.run_model_tournament(
        task_type="code_generation",
        prompt="Write a Python function to calculate the factorial of a number.",
        competitors=[
            {"provider": "openai", "model": "gpt-4o-mini"},
            {"provider": "anthropic", "model": "claude-3-opus-20240229"},  # Higher-end model for comparison
            {"provider": "deepseek", "model": "deepseek-coder"}
        ],
        evaluation_criteria=["correctness", "efficiency", "readability"],
        # Optional: ground_truth="def factorial(n): ..."
    )

    if tournament_response["success"]:
        print("Tournament Results:")
        # tournament_response['results'] would contain rankings, scores, outputs
        for rank, result in enumerate(tournament_response.get("ranking", [])):
            print(f" {rank+1}. {result['provider']}/{result['model']} - Score: {result['score']:.2f}")
        print(f"Total Cost: ${tournament_response['total_cost']:.6f}")
    else:
        print(f"Error: {tournament_response['error']}")

(More tool examples can be added here...)

Getting Started

Installation

    # Install uv if you don't already have it:
    curl -LsSf https://astral.sh/uv/install.sh | sh

    # Clone the repository
    git clone https://github.com/yourusername/llm_gateway_mcp_server.git
    cd llm_gateway_mcp_server

    # Install in venv using uv:
    uv venv --python 3.13
    source .venv/bin/activate
    uv pip install -e ".[all]"

Environment Setup

Create a .env file with your API keys:

    # API Keys (at least one provider required)
    OPENAI_API_KEY=your_openai_key
    ANTHROPIC_API_KEY=your_anthropic_key
    GEMINI_API_KEY=your_gemini_key
    DEEPSEEK_API_KEY=your_deepseek_key

    # Server Configuration
    SERVER_PORT=8013
    SERVER_HOST=127.0.0.1

    # Logging Configuration
    LOG_LEVEL=INFO
    USE_RICH_LOGGING=true

    # Cache Configuration
    CACHE_ENABLED=true
    CACHE_TTL=86400

Running the Server

    # Start the MCP server
    python -m llm_gateway.cli.main run

    # Or with Docker
    docker compose up

Once running, the server is available at http://localhost:8013.

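Advanced Configuration

While the .env file is convenient for basic setup, LLM Gateway offers more detailed configuration options, managed primarily through environment variables.

Server Configuration

SERVER_HOST: (default: 127.0.0.1) The network interface the server listens on. Use 0.0.0.0 to listen on all interfaces (required for Docker or external access).
SERVER_PORT: (default: 8013) The port the server listens on.
API_PREFIX: (default: /) The URL prefix for API endpoints.

Logging Configuration

LOG_LEVEL: (default: INFO) Controls log verbosity. Options: DEBUG, INFO, WARNING, ERROR, CRITICAL.
USE_RICH_LOGGING: (default: true) Uses the Rich library for colorful, formatted console logs. Set to false for plain-text logs (better suited to file redirection or some log aggregation systems).
LOG_FORMAT: (optional) A custom log format string.
LOG_TO_FILE: (optional, e.g., gateway.log) A file path that logs should also be written to.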

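Cache Configuration

CACHE_ENABLED: (default: true) Enables or disables caching globally.
CACHE_TTL: (default: 86400 seconds, i.e. 24 hours) The default time-to-live for cached items. Individual tools may override this value.
CACHE_TYPE: (default: memory) The type of cache backend. Options may include memory, redis, and diskcache. (Note: check the current implementation for supported types.)
CACHE_MAX_SIZE: (optional) The maximum number of items or memory size for the cache.
REDIS_URL: (required if CACHE_TYPE=redis) The connection URL for the Redis cache server (e.g., redis://localhost:6379/0).

Provider Timeouts and Retries

PROVIDER_TIMEOUT: (default: 120 seconds) The default timeout for requests to LLM provider APIs.
PROVIDER_MAX_RETRIES: (default: 3) The default number of retries for failed provider requests (for example, due to transient network problems or rate limits).
Provider-specific timeouts and retries may be configurable through dedicated variables such as OPENAI_TIMEOUT and ANTHROPIC_MAX_RETRIES. (Note: check the current implementation.)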
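
The retry behavior behind PROVIDER_MAX_RETRIES is not detailed here. A common approach is exponential backoff, sketched below under that assumption; the function name `call_with_retries` and its parameters are hypothetical, and a real implementation would typically retry only on transient errors rather than on every exception.

```python
import time


def call_with_retries(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `func`, retrying up to `max_retries` times on failure.

    The delay doubles after each failed attempt: base_delay, 2x, 4x, ...
    `sleep` is injectable so the backoff can be tested without waiting.
    """
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= max_retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)
            attempt += 1


# Usage: a stand-in provider call that fails twice, then succeeds.
calls = {"count": 0}

def flaky_provider_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network issue")
    return "ok"

delays = []
result = call_with_retries(flaky_provider_call, sleep=delays.append)
print(result, delays)
```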

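Tool-Specific Configuration

Some tools may have their own environment variables (for example, MARQO_URL for fused search, and default chunking parameters). See the documentation or source code of the individual tool.

Always make sure environment variables are set correctly before starting the server. Changing them usually requires a server restart.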
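
As a minimal sketch of how a few of the variables above could be read with their documented defaults, the snippet below parses them from a mapping. The `GatewaySettings` class and `load_settings` function are hypothetical and are not the gateway's own configuration code.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class GatewaySettings:
    host: str
    port: int
    log_level: str
    cache_enabled: bool
    cache_ttl: int


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> GatewaySettings:
    """Read settings from `env`, falling back to the documented defaults."""
    return GatewaySettings(
        host=env.get("SERVER_HOST", "127.0.0.1"),
        port=int(env.get("SERVER_PORT", "8013")),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        cache_enabled=_as_bool(env.get("CACHE_ENABLED", "true")),
        cache_ttl=int(env.get("CACHE_TTL", "86400")),
    )


print(load_settings({"SERVER_PORT": "9000", "CACHE_ENABLED": "false"}))
```

Passing the environment as a mapping keeps the parser easy to test; in normal use it would be called with no arguments and read from `os.environ`.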

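Deployment Considerations

Running the server directly with python or docker compose up is fine for development and testing. For a more robust or production deployment, consider the following:

1. Run as a Background Service

To keep the gateway running continuously and restart it automatically after a failure or server reboot, use a process manager:

**systemd (Linux):** Create a service unit file (e.g., /etc/systemd/system/llm-gateway.service) to manage the process. This lets you use commands such as sudo systemctl start|stop|restart|status llm-gateway.
**supervisor:** A popular process control system written in Python. Configure supervisord to monitor and control the gateway process.
**Docker restart policies:** If you use Docker (standalone or Compose), configure an appropriate restart policy (e.g., unless-stopped or always) in the docker run command or the docker-compose.yml file.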

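2. Use a Reverse Proxy (Nginx / Caddy / Apache)

Placing a reverse proxy in front of LLM Gateway is strongly recommended:

**HTTPS/SSL termination:** The proxy can handle SSL certificates (for example, Let's Encrypt with Caddy, or Certbot with Nginx/Apache), encrypting traffic between clients and the proxy.
**Load balancing:** If you need to run multiple gateway instances for high availability or performance, the proxy can distribute traffic among them.
**Path routing:** Map an external path (e.g., https://api.yourdomain.com/llm-gateway/) to the internal gateway server (http://localhost:8013).
**Security headers:** Add important security headers (such as CSP and HSTS).
**Buffering/caching:** Some proxies offer additional request/response buffering or caching.

Nginx

    location /llm-gateway/ {
        proxy_pass http://127.0.0.1:8013/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        # Add configurations for timeouts, buffering, etc.
    }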

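3. Container Orchestration (Kubernetes / Swarm)

If you deploy in a containerized environment:

**Health checks:** Implement and configure a health check endpoint (e.g., /healthz) in your deployment manifests so the orchestrator can monitor the service's health.
**Configuration:** Use ConfigMaps and Secrets (Kubernetes) or an equivalent mechanism to manage environment variables and API keys securely, rather than hard-coding them in the image or relying solely on .env files.
**Resource limits:** Define appropriate CPU and memory requests/limits for the gateway container to ensure stable performance and prevent resource starvation.
**Service discovery:** Use the orchestrator's service discovery mechanisms instead of hard-coded IP addresses or hostnames.

4. Resource Allocation

Make sure the host or container has enough RAM, especially when using in-memory caching or processing large documents and requests.
Monitor CPU usage, particularly under high load or when several complex operations run at the same time.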

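Cost Savings Through Delegation

Delegating through LLM Gateway can yield significant cost savings:

| Task | Claude 3.7 directly | Delegated to a cheaper LLM | Savings |
| --- | --- | --- | --- |
| Summarize a 100-page document | $4.50 | $0.45 (Gemini Flash) | 90% |
| Extract data from 50 records | $2.25 | $0.35 (GPT-4o-mini) | 84% |
| Generate 20 content ideas | $0.90 | $0.12 (DeepSeek) | 87% |
| Process 1,000 customer queries | $45.00 | $7.50 (mixed delegation) | 83% |

These savings are achieved while maintaining high-quality output, because Claude focuses on high-level reasoning and orchestration while mechanical tasks go to cost-effective models.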
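
The savings column follows directly from the two cost columns: savings = 1 - delegated / direct, rounded to the nearest percent. A quick check using the figures from the table (the helper name `savings_pct` is hypothetical):

```python
# (task, cost with Claude 3.7 directly, cost when delegated), from the table above
ROWS = [
    ("Summarize a 100-page document", 4.50, 0.45),
    ("Extract data from 50 records", 2.25, 0.35),
    ("Generate 20 content ideas", 0.90, 0.12),
    ("Process 1,000 customer queries", 45.00, 7.50),
]


def savings_pct(direct: float, delegated: float) -> int:
    """Return the percentage saved, rounded to the nearest whole percent."""
    return round((1 - delegated / direct) * 100)


for task, direct, delegated in ROWS:
    print(f"{task}: {savings_pct(direct, delegated)}%")
```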

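Why AI-to-AI Delegation Matters

The strategic importance of AI-to-AI delegation goes beyond simple cost savings:

Democratizing Advanced AI Capabilities

By enabling powerful models such as Claude 3.7 and GPT-4o to delegate effectively, we:

Make advanced AI capabilities available at a fraction of the cost
Allow organizations with limited budgets to use top-tier AI capabilities
Enable more efficient use of AI resources across the industry

Economic Resource Optimization

AI-to-AI delegation represents a fundamental economic optimization:

Complex reasoning, creativity, and understanding are reserved for top-tier models
Routine data processing, extraction, and simpler tasks go to cost-effective models
The overall system achieves near-top-tier performance at a fraction of the cost
API costs become a controlled expense rather than an unpredictable liability

Sustainable AI Architecture

This approach promotes more sustainable use of AI:

Reduces unnecessary consumption of high-end compute resources
Creates a tiered approach to AI that matches capability to need
Makes experimental work feasible that would be cost-prohibitive using only top-tier models
Creates a scalable approach to AI integration that grows with business needs

Technical Evolution Path

LLM Gateway represents an important evolution in AI application architecture:

Moving from monolithic AI calls to distributed, multi-model workflows
Enabling AI-driven orchestration of complex processing pipelines
Laying a foundation for AI systems that can reason about their own resource usage
Building self-optimizing AI systems that make intelligent delegation decisions

The Future of AI Efficiency

LLM Gateway points toward a future in which:

AI systems actively manage and optimize their own resource usage
Higher-capability models serve as intelligent orchestrators for the whole AI ecosystem
AI workflows become increasingly sophisticated and self-organizing
Organizations can use the full range of AI capabilities cost-effectively

This vision of efficient, self-organizing AI systems represents the next frontier in practical AI deployment, moving beyond the current pattern of using a single model for every task.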

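Architecture

How the MCP Integration Works

LLM Gateway is built natively on the Model Context Protocol:

MCP server core: the gateway implements a full MCP server
Tool registration: all functionality is exposed as MCP tools
Tool invocation: Claude and other AI agents can call these tools directly
Context passing: results are returned in MCP's standard format

This ensures seamless integration with Claude and other MCP-compatible agents.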

Component Diagram

    ┌─────────────┐         ┌───────────────────┐         ┌──────────────┐
    │  Claude 3.7 │ ───────►│  LLM Gateway MCP  │ ───────►│ LLM Providers│
    │   (Agent)   │ ◄───────│  Server & Tools   │ ◄───────│  (Multiple)  │
    └─────────────┘         └─────────┬─────────┘         └──────────────┘
                                      │
                                      ▼
    ┌─────────────────────────────────────────────────────────────────┐
    │                                                                 │
    │  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐        │
    │  │  Completion   │  │   Document    │  │  Extraction   │        │
    │  │    Tools      │  │    Tools      │  │    Tools      │        │
    │  └───────────────┘  └───────────────┘  └───────────────┘        │
    │                                                                 │
    │  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐        │
    │  │ Optimization  │  │   Core MCP    │  │   Analytics   │        │
    │  │    Tools      │  │    Server     │  │    Tools      │        │
    │  └───────────────┘  └───────────────┘  └───────────────┘        │
    │                                                                 │
    │  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐        │
    │  │     Cache     │  │    Vector     │  │    Prompt     │        │
    │  │    Service    │  │    Service    │  │    Service    │        │
    │  └───────────────┘  └───────────────┘  └───────────────┘        │
    │                                                                 │
    │  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐        │
    │  │  Tournament   │  │     Code      │  │  Multi-Agent  │        │
    │  │    Tools      │  │  Extraction   │  │ Coordination  │        │
    │  └───────────────┘  └───────────────┘  └───────────────┘        │
    │                                                                 │
    │  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐        │
    │  │   RAG Tools   │  │  Local Text   │  │  Meta Tools   │        │
    │  │               │  │    Tools      │  │               │        │
    │  └───────────────┘  └───────────────┘  └───────────────┘        │
    │                                                                 │
    └─────────────────────────────────────────────────────────────────┘

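Delegation Request Flow

When Claude delegates a task to LLM Gateway:

Claude sends an MCP tool call request
The gateway receives the request over the MCP protocol
The appropriate tool handles the request
The cache service checks whether the result is already cached
If it is not cached, the optimization service selects an appropriate provider and model
The provider layer sends the request to the selected LLM API
The response is normalized and cached, and metrics are recorded
The MCP server returns the result to Claude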
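
The flow above can be sketched as a single function: check the cache, otherwise select a model, call the provider, normalize the response, cache it, and return. This is an illustration only; `handle_tool_call`, the request shape, and the injected `choose_model` and `call_provider` callables are hypothetical stand-ins for the gateway's services.

```python
def handle_tool_call(request, cache, choose_model, call_provider):
    """Process one delegated request following the steps listed above."""
    key = (request["tool"], request["input"])

    # Cache service: return a stored result if one exists
    if key in cache:
        return {**cache[key], "cached": True}

    # Optimization service: pick a provider/model for this request
    provider, model = choose_model(request)

    # Provider layer: send the request to the selected LLM API
    raw_output = call_provider(provider, model, request["input"])

    # Normalize the response, cache it, and return it to the caller
    result = {"success": True, "provider": provider, "model": model, "output": raw_output}
    cache[key] = result
    return {**result, "cached": False}


# Usage with stand-in services
provider_calls = []

def fake_choose_model(request):
    return ("gemini", "gemini-2.0-flash-lite")

def fake_call_provider(provider, model, text):
    provider_calls.append((provider, model))
    return f"summary of: {text}"

cache = {}
request = {"tool": "summarize_document", "input": "long text"}
first = handle_tool_call(request, cache, fake_choose_model, fake_call_provider)
second = handle_tool_call(request, cache, fake_choose_model, fake_call_provider)
print(first["cached"], second["cached"], len(provider_calls))
```

The second call is served from the cache, so the provider is contacted only once.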

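Detailed Feature Documentation

Provider Integration

Multi-provider support, with first-class support for:
- OpenAI (GPT-4o, GPT-4o mini)
- Anthropic (Claude 3.7 series)
- Google (Gemini Pro, Gemini Flash, Gemini Flash Lite)
- DeepSeek (DeepSeek-Chat, DeepSeek-Reasoner)
- An extensible architecture for adding new providers
Model management:
- Automatic model selection based on task requirements
- Model performance tracking
- Fallback mechanisms for provider outages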
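
A fallback mechanism for provider outages can be as simple as trying providers in priority order until one succeeds. The sketch below assumes that approach; the function name `complete_with_fallback` and the stand-in provider callables are hypothetical, and the response shape only loosely mirrors the `success`/`error` fields used in the examples in this document.

```python
def complete_with_fallback(prompt, providers):
    """Try each (name, callable) pair in order; return the first success."""
    errors = []
    for name, call in providers:
        try:
            return {"success": True, "provider": name, "completion": call(prompt)}
        except Exception as exc:
            # Remember why this provider failed, then try the next one
            errors.append(f"{name}: {exc}")
    return {"success": False, "error": "; ".join(errors)}


# Usage with stand-in providers: the first is down, the second works.
def unavailable(prompt):
    raise ConnectionError("service unavailable")

def working(prompt):
    return f"echo: {prompt}"

response = complete_with_fallback("hello", [("primary", unavailable), ("backup", working)])
print(response)
```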

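Cost Optimization

Smart routing, selecting models automatically based on:
- Task complexity requirements
- Budget constraints
- Performance priorities
- Historical performance data
Advanced caching system:
- Multiple caching strategies (exact, semantic, task-based)
- TTL configurable per task type
- Persistent cache with fast in-memory lookup
- Cache statistics and cost-savings tracking

Document Processing

Smart document chunking:
- Multiple chunking strategies (token-based, semantic, structural)
- Configurable overlap to preserve context
- Efficient handling of large documents
Document operations:
- Summarization (with configurable format)
- Entity extraction
- Question-answer pair generation
- Batch processing with concurrency control

Data Extraction

Structured data extraction:
- JSON extraction with schema validation
- Table extraction (JSON, CSV, and Markdown formats)
- Key-value pair extraction
- Semantic schema inference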

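Tournaments and Benchmarking

Model competitions:
- Run competitions between different models and configurations
- Compare code generation capabilities across providers
- Generate statistical performance reports
- Store competition results for historical analysis
Code extraction:
- Extract clean code from model responses
- Analyze and validate the extracted code
- Support for multiple programming languages

Vector Operations

Embedding service:
- Efficient text embedding generation
- Embedding caching to reduce API costs
- Batch processing for better performance
Semantic search:
- Find semantically similar content
- Configurable similarity thresholds
- Fast vector operations
Advanced fused search (Marqo):
- Uses Marqo for combined keyword and semantic search
- Tunable weighting between keyword and vector relevance
- Supports complex filtering and faceting

Retrieval-Augmented Generation (RAG)

Contextual generation:
- Augments LLM prompts with relevant retrieved information
- Improves factual accuracy and reduces hallucinations
- Integrates with vector search and document stores
Workflow integration:
- Seamlessly combines document retrieval with generation tasks
- Customizable retrieval and generation strategies

Local Text Processing

Offline operations:
- Provides text manipulation tools that run locally, with no API calls
- Includes functions for cleaning, formatting, and basic analysis
- Useful for preprocessing text before sending it to an LLM, or for postprocessing results

Meta Operations

Introspection and management:
- Tools for querying server capabilities and status
- May include functions for managing configuration or tool settings dynamically
- Facilitates more complex agent interactions and self-management

System Features

Rich logging:
- Attractive console output using Rich
- Emoji indicators for different operations
- Detailed context information
- Performance metrics in log entries
Streaming support:
- A consistent streaming interface across all providers
- Token-by-token delivery
- Cost tracking during streaming
Health monitoring:
- Health check endpoint (/healthz)
- Resource usage monitoring
- Provider availability tracking
- Error rate statistics
Command-line interface:
- A rich interactive CLI for server management
- Direct tool invocation from the command line
- Configuration management
- Cache and server status inspection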

Tool Usage Examples

This section gives examples of how an MCP client (such as Claude 3.7) might call specific tools provided by LLM Gateway. The examples assume you have initialized an mcp.client.Client instance named client and connected it to the gateway.

Basic Completion

To get a simple text completion from a chosen provider:

    response = await client.tools.completion(
        prompt="Write a short poem about a robot learning to dream.",
        provider="openai",    # Or "anthropic", "gemini", "deepseek"
        model="gpt-4o-mini",  # Specify the desired model
        max_tokens=100,
        temperature=0.7
    )

    if response["success"]:
        print(f"Completion: {response['completion']}")
        print(f"Cost: ${response['cost']:.6f}")
    else:
        print(f"Error: {response['error']}")

Document Summarization

To summarize a piece of text, optionally delegating to a cost-effective model:

    document_text = "... your long document content here ..."

    summary_response = await client.tools.summarize_document(
        document=document_text,
        provider="gemini",
        model="gemini-2.0-flash-lite",  # Using a cheaper model for summarization
        format="bullet_points",         # Options: "paragraph", "bullet_points"
        max_length=150                  # Target summary length in tokens (approximate)
    )

    if summary_response["success"]:
        print(f"Summary:\n{summary_response['summary']}")
        print(f"Cost: ${summary_response['cost']:.6f}")
    else:
        print(f"Error: {summary_response['error']}")

Entity Extraction

To extract specific types of entities from text:

    text_to_analyze = "Apple Inc. announced its quarterly earnings on May 5th, 2024, reporting strong iPhone sales from its headquarters in Cupertino."

    entity_response = await client.tools.extract_entities(
        document=text_to_analyze,
        entity_types=["organization", "date", "product", "location"],
        provider="openai",
        model="gpt-4o-mini"
    )

    if entity_response["success"]:
        print(f"Extracted Entities: {entity_response['entities']}")
        print(f"Cost: ${entity_response['cost']:.6f}")
    else:
        print(f"Error: {entity_response['error']}")

Executing an Optimized Workflow

To run a multi-step workflow in which the gateway optimizes model selection for each step:

    doc_content = "... content for workflow processing ..."

    workflow_definition = [
        {
            "name": "Summarize",
            "operation": "summarize_document",
            "provider_preference": "cost",  # Prioritize cheaper models
            "params": {"format": "paragraph"},
            "input_from": "original",
            "output_as": "step1_summary"
        },
        {
            "name": "ExtractKeywords",
            "operation": "extract_keywords",  # Assuming an extract_keywords tool exists
            "provider_preference": "speed",
            "params": {"count": 5},
            "input_from": "step1_summary",
            "output_as": "step2_keywords"
        }
    ]

    workflow_response = await client.tools.execute_optimized_workflow(
        documents=[doc_content],
        workflow=workflow_definition
    )

    if workflow_response["success"]:
        print("Workflow executed successfully.")
        print(f"Results: {workflow_response['results']}")  # Contains outputs like step1_summary, step2_keywords
        print(f"Total Cost: ${workflow_response['total_cost']:.6f}")
        print(f"Processing Time: {workflow_response['processing_time']:.2f}s")
    else:
        print(f"Workflow Error: {workflow_response['error']}")

Listing Available Tools (Meta Tool)

To discover dynamically which tools are currently registered and available on the gateway:

    # Assuming a meta-tool for listing capabilities
    list_tools_response = await client.tools.list_tools()

    if list_tools_response["success"]:
        print("Available Tools:")
        for tool_name, tool_info in list_tools_response["tools"].items():
            print(f"- {tool_name}: {tool_info.get('description', 'No description')}")
            # You might also get parameters, etc.
    else:
        print(f"Error listing tools: {list_tools_response['error']}")

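Real-World Use Cases

AI Agent Orchestration

Claude or other advanced AI agents can use LLM Gateway to:

Delegate routine tasks to cheaper models
Process large documents in parallel
Extract structured data from unstructured text
Generate drafts for review and refinement

Enterprise Document Processing

Process large document collections efficiently:

Break documents into meaningful chunks
Distribute processing across the best-suited models
Extract structured data at scale
Enable semantic search across documents

Research and Analysis

Research teams can use LLM Gateway to:

Compare outputs from different models
Process research papers efficiently
Extract structured information from studies
Track token usage and optimize research budgets

Model Benchmarking and Selection

Organizations can use the tournament feature to:

Run controlled competitions between different models
Generate quantitative performance metrics
Make data-driven decisions about model selection
Build custom model evaluation frameworks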

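Security Considerations

When deploying and operating LLM Gateway, consider the following security aspects:

API key management:
- Never hard-code API keys in source code.
- Use environment variables (a .env file for local development; system environment variables or a secrets manager such as HashiCorp Vault, AWS Secrets Manager, or GCP Secret Manager for production).
- Make sure the .env file, if used, has strict file permissions (readable only by the user running the gateway).
- Rotate keys regularly and revoke any suspect key immediately.
Network exposure and access control:
- By default the server binds to 127.0.0.1, which allows local connections only. If you intend to expose it externally, change SERVER_HOST to 0.0.0.0 and make sure appropriate controls are in place.
- Use a reverse proxy (Nginx, Caddy, etc.) to handle incoming connections. This lets you manage TLS/SSL encryption, apply access controls (such as IP allowlists), and potentially add gateway-level authentication.
- Use firewall rules on the host or network to restrict access to SERVER_PORT to trusted sources only (such as the reverse proxy or specific internal clients).
Authentication and authorization:
- The gateway itself may have no built-in user authentication. Access control typically relies on network security (firewalls, VPNs) and on authentication handled by the reverse proxy (e.g., Basic Auth or an OAuth2 proxy).
- Make sure only authorized clients (such as your trusted AI agents or applications) can reach the gateway endpoints.
Rate limiting and abuse prevention:
- Implement rate limiting at the reverse proxy level, or use dedicated middleware, to prevent denial-of-service attacks or excessive API usage (which can be costly).
Input validation:
- Although LLM input is usually plain text, be aware that the way a tool interprets its input can introduce vulnerabilities (for example, if a tool executes code based on the input). Sanitize or validate input as appropriate for each tool's functionality.
Dependency security:
- Update dependencies regularly (uv pip install --upgrade ... or similar) to patch known vulnerabilities in third-party libraries.
- Consider using security scanning tools (such as pip-audit or GitHub Dependabot alerts) to identify vulnerable dependencies.
Logging:
- Be aware that DEBUG-level logging may record full prompts and responses, which can contain sensitive information. Set LOG_LEVEL appropriately for your environment and make sure log files have appropriate permissions.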
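
The rate-limiting item above recommends enforcing limits at the proxy or in middleware. As one illustration of what such middleware might do, here is a minimal token-bucket sketch. The `TokenBucket` class is hypothetical and is not part of the gateway; production setups usually rely on the proxy's built-in rate limiting instead.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock  # injectable so refill can be tested
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost=1.0):
        """Return True and consume tokens if the request may proceed."""
        now = self._clock()
        # Refill according to the time elapsed since the last check
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


bucket = TokenBucket(rate=1.0, capacity=2)
print([bucket.allow() for _ in range(3)])
```

In practice one bucket would be kept per client (for example, per API key or source IP) so that one heavy user cannot exhaust the shared budget.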

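License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgements

Model Context Protocol, for the foundation of the API
Rich, for attractive terminal output
Pydantic, for data validation
uv, for fast and reliable Python package management
All the LLM providers that make their models available through APIs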
