# OpenRouter MCP Server Implementation with Thinking Token Control: Best Practices Guide 2025
## Table of Contents
1. [Introduction](#introduction)
2. [OpenRouter MCP Server Implementation](#openrouter-mcp-server-implementation)
3. [Thinking Models Token Control](#thinking-models-token-control)
4. [Implementation Best Practices](#implementation-best-practices)
5. [Advanced Techniques](#advanced-techniques)
6. [Cost Optimization Strategies](#cost-optimization-strategies)
7. [Common Pitfalls and Solutions](#common-pitfalls-and-solutions)
8. [Complete Implementation Examples](#complete-implementation-examples)
9. [References](#references)
## Introduction
The Model Context Protocol (MCP) is an open standard, introduced by Anthropic, for connecting AI assistants to external data sources and tools. This guide collects best practices for implementing OpenRouter MCP servers with fine-grained thinking-token control for reasoning models such as OpenAI o1/o3, Claude 3.7+, and DeepSeek R1.
### Key Technologies Covered
- **OpenRouter API**: Unified interface for multiple AI models
- **MCP Protocol**: Standardized AI tool integration
- **Thinking Models**: OpenAI o1/o3, Claude 3.7+, and DeepSeek R1 reasoning controls
- **TypeScript SDK**: Production-ready implementation patterns
- **Token Optimization**: Cost and performance management
## OpenRouter MCP Server Implementation
### Architecture Overview
OpenRouter integrates with MCP by converting MCP (Anthropic-format) tool definitions into OpenAI-compatible tool definitions; the sketch after the list below shows this conversion. The architecture involves three key components:
1. **MCP Server**: Provides tools and context to AI models
2. **OpenRouter API**: Unified interface for model access
3. **Client Application**: Consumes MCP tools via OpenRouter
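OpenRouter itself exposes an OpenAI-compatible chat API, so the MCP bridge amounts to listing tools from the MCP server and re-expressing them as OpenAI-style `tools` entries. The sketch below illustrates that conversion; it assumes an already-connected MCP `Client`, and the helper name `convertMcpToolsToOpenAI` is illustrative rather than part of either SDK.
```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import type OpenAI from "openai";

// Illustrative bridge: re-express MCP tool definitions as OpenAI-compatible
// tool definitions so OpenRouter can offer them to any supported model.
async function convertMcpToolsToOpenAI(
  mcpClient: Client
): Promise<OpenAI.Chat.Completions.ChatCompletionTool[]> {
  const { tools } = await mcpClient.listTools();
  return tools.map((tool) => ({
    type: "function" as const,
    function: {
      name: tool.name,
      description: tool.description ?? "",
      // MCP input schemas are already JSON Schema objects
      parameters: tool.inputSchema as Record<string, unknown>
    }
  }));
}
```
Tool calls selected by the model come back in the standard OpenAI `tool_calls` shape; the bridge forwards each call to the MCP server with `mcpClient.callTool({ name, arguments })` and appends the result as a `tool` message before the next completion round.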
### Authentication and Setup
#### API Key Management
```typescript
// Environment configuration
interface Config {
OPENROUTER_API_KEY: string;
DEFAULT_MODEL?: string;
BASE_URL?: string;
}
const config: Config = {
OPENROUTER_API_KEY: process.env.OPENROUTER_API_KEY!,
DEFAULT_MODEL: process.env.DEFAULT_MODEL || "anthropic/claude-3.7-sonnet",
BASE_URL: "https://openrouter.ai/api/v1"
};
// OpenAI SDK configuration for OpenRouter
import OpenAI from 'openai';
const openai = new OpenAI({
apiKey: config.OPENROUTER_API_KEY,
baseURL: config.BASE_URL,
});
```
#### Security Best Practices
- **Never commit API keys** to public repositories
- Use environment variables or secure secret management (see the sketch after this list)
- Implement key rotation policies
- Monitor usage to detect unauthorized access
- Use the minimum required permissions
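As a minimal sketch of the first two practices above (assuming the key lives only in the environment), validate it at startup and redact it anywhere it might appear in logs:
```typescript
// Fail fast when the key is missing rather than sending unauthenticated requests.
function loadApiKey(): string {
  const key = process.env.OPENROUTER_API_KEY;
  if (!key) {
    throw new Error("OPENROUTER_API_KEY is not set; refusing to start");
  }
  return key;
}

// Show only enough of the key to identify it in audit logs.
function redactKey(key: string): string {
  return key.length > 8 ? `${key.slice(0, 4)}...${key.slice(-4)}` : "****";
}

console.error(`Using OpenRouter key ${redactKey(loadApiKey())}`);
```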
### Server Architecture Patterns
#### Basic MCP Server Setup
```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
const server = new McpServer({
name: "openrouter-thinking-server",
version: "1.0.0"
});
// Transport configuration
const transport = new StdioServerTransport();
```
#### Production-Grade Server Template
```typescript
interface ServerConfig {
name: string;
version: string;
observability?: {
enableTracing: boolean;
enableMetrics: boolean;
};
auth?: {
enabled: boolean;
provider: 'jwt' | 'oauth';
};
}
class ProductionMCPServer {
private server: McpServer;
private config: ServerConfig;
constructor(config: ServerConfig) {
this.config = config;
this.server = new McpServer({
name: config.name,
version: config.version
});
this.setupErrorHandling();
this.setupObservability();
}
private setupErrorHandling() {
this.server.onerror = (error) => {
console.error("MCP Server Error:", error);
// Don't exit - let MCP handle reconnection
};
}
private setupObservability() {
if (this.config.observability?.enableTracing) {
// OpenTelemetry integration
this.setupTracing();
}
}
}
```
### Request/Response Handling
#### Tool Registration with Zod Validation
```typescript
// Thinking model completion tool
const thinkingCompletionSchema = z.object({
model: z.string().describe("Model to use (e.g., 'anthropic/claude-3.7-sonnet')"),
messages: z.array(z.object({
role: z.enum(["system", "user", "assistant"]),
content: z.string()
})),
reasoning: z.object({
effort: z.enum(["minimal", "low", "medium", "high"]).optional(),
max_tokens: z.number().int().positive().max(32000).optional()
}).optional(),
max_tokens: z.number().int().positive().optional(),
temperature: z.number().min(0).max(2).optional(),
stream: z.boolean().optional().default(false)
});
// Note: setRequestHandler is exposed by the low-level Server class; when using
// the high-level McpServer wrapper, reach it via server.server.setRequestHandler
server.setRequestHandler(CallToolRequestSchema, async (request) => {
if (request.params.name === "thinking_completion") {
const args = thinkingCompletionSchema.parse(request.params.arguments);
try {
const response = await handleThinkingCompletion(args);
return {
content: [{
type: "text",
text: JSON.stringify(response, null, 2)
}]
};
} catch (error) {
throw new McpError(
ErrorCode.InternalError,
`Thinking completion failed: ${error.message}`
);
}
}
});
```
#### Error Management
```typescript
enum MCPErrorType {
AUTHENTICATION_FAILED = "AUTHENTICATION_FAILED",
RATE_LIMIT_EXCEEDED = "RATE_LIMIT_EXCEEDED",
MODEL_UNAVAILABLE = "MODEL_UNAVAILABLE",
THINKING_TOKENS_EXCEEDED = "THINKING_TOKENS_EXCEEDED",
VALIDATION_ERROR = "VALIDATION_ERROR"
}
class MCPError extends Error {
constructor(
public type: MCPErrorType,
message: string,
public details?: any
) {
super(message);
this.name = "MCPError";
}
}
// Error handling with automatic retry
async function handleThinkingCompletion(args: ThinkingCompletionArgs) {
const maxRetries = 3;
let attempt = 0;
while (attempt < maxRetries) {
try {
return await performCompletion(args);
} catch (error) {
attempt++;
if (error.status === 429) { // Rate limit
const delay = Math.pow(2, attempt) * 1000; // Exponential backoff
await new Promise(resolve => setTimeout(resolve, delay));
continue;
}
if (error.status >= 500 && attempt < maxRetries) {
continue; // Retry server errors
}
throw new MCPError(
MCPErrorType.MODEL_UNAVAILABLE,
`Completion failed after ${attempt} attempts: ${error.message}`,
{ originalError: error, attempt }
);
}
}
// The loop can also exit after repeated rate-limit retries; surface that explicitly
throw new MCPError(
MCPErrorType.RATE_LIMIT_EXCEEDED,
`Completion failed after ${maxRetries} attempts (rate limited)`
);
}
```
### Rate Limiting and Cost Management
#### Built-in Rate Limiting
```typescript
interface RateLimitConfig {
requestsPerMinute: number;
requestsPerDay: number;
tokensPerMinute: number;
costLimitPerDay: number; // In dollars
}
class RateLimiter {
private requests: Map<string, number[]> = new Map();
private costs: Map<string, number> = new Map();
constructor(private config: RateLimitConfig) {}
async checkLimits(userId: string, estimatedCost: number): Promise<void> {
const now = Date.now();
const userRequests = this.requests.get(userId) || [];
// Remove old requests (older than 1 minute)
const recentRequests = userRequests.filter(time => now - time < 60000);
if (recentRequests.length >= this.config.requestsPerMinute) {
throw new MCPError(
MCPErrorType.RATE_LIMIT_EXCEEDED,
"Rate limit exceeded: too many requests per minute"
);
}
// Check daily cost limit
const dailyCost = this.costs.get(userId) || 0;
if (dailyCost + estimatedCost > this.config.costLimitPerDay) {
throw new MCPError(
MCPErrorType.RATE_LIMIT_EXCEEDED,
"Daily cost limit exceeded"
);
}
// Update tracking
recentRequests.push(now);
this.requests.set(userId, recentRequests);
}
updateCost(userId: string, actualCost: number): void {
const currentCost = this.costs.get(userId) || 0;
this.costs.set(userId, currentCost + actualCost);
}
}
```
## Thinking Models Token Control
### Understanding Thinking Tokens
Thinking tokens are internal reasoning tokens a model spends before producing its final answer. Providers expose them through different controls:
- **OpenAI o1/o3 Series**: `reasoning_effort` parameter
- **Claude 3.7/4.x**: `thinking.budget_tokens` parameter
- **DeepSeek R1**: automatic Chain of Thought (CoT) reasoning with no explicit budget
When these models are called through OpenRouter, the provider-specific controls can also be driven through a single unified `reasoning` request parameter (`effort` or `max_tokens`), as sketched below.
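A minimal sketch of the unified form, assuming the `openai` client configured for OpenRouter earlier in this guide; the `reasoning` object is OpenRouter-specific and not part of the OpenAI SDK types, so the parameters are built as an untyped object:
```typescript
// OpenRouter's unified reasoning parameter: one request shape for OpenAI,
// Anthropic, and DeepSeek models; OpenRouter maps it to each provider's control.
const params: any = {
  model: "anthropic/claude-3.7-sonnet",
  messages: [{ role: "user", content: "Plan a migration from REST to gRPC." }],
  max_tokens: 2048,
  reasoning: { max_tokens: 4096 } // or: { effort: "high" }
};

const response = await openai.chat.completions.create(params);

// When the provider returns its reasoning, OpenRouter surfaces it on the message
const reasoningText = (response.choices[0].message as any).reasoning ?? null;
console.log(reasoningText, response.choices[0].message.content);
```
The provider-specific parameters shown in the rest of this section (`reasoning_effort`, `thinking.budget_tokens`) remain useful when you need a provider's native semantics.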
### OpenAI o1 Series Configuration
#### Reasoning Effort Control
```typescript
interface OpenAIReasoningConfig {
model: string;
reasoning_effort: "minimal" | "low" | "medium" | "high";
max_completion_tokens: number;
}
// Effort level mapping
const EFFORT_RATIOS = {
minimal: 0.05, // ~5% of max_tokens for reasoning
low: 0.2, // ~20% of max_tokens for reasoning
medium: 0.5, // ~50% of max_tokens for reasoning
high: 0.8 // ~80% of max_tokens for reasoning
};
async function createO1Completion(config: OpenAIReasoningConfig, messages: any[]) {
const reasoningBudget = Math.floor(
config.max_completion_tokens * EFFORT_RATIOS[config.reasoning_effort]
);
// Ensure max_tokens > reasoning budget for final response
const adjustedMaxTokens = Math.max(
config.max_completion_tokens,
reasoningBudget + 100 // Minimum 100 tokens for final response
);
const response = await openai.chat.completions.create({
model: config.model,
messages,
reasoning_effort: config.reasoning_effort,
max_completion_tokens: adjustedMaxTokens,
// Note: temperature, top_p not supported with reasoning models
});
return {
content: response.choices[0].message.content,
reasoning_tokens: response.usage?.reasoning_tokens || 0,
completion_tokens: response.usage?.completion_tokens || 0,
total_cost: calculateCost(response.usage, config.model)
};
}
```
#### Dynamic Reasoning Effort
```typescript
function determineReasoningEffort(
queryComplexity: "simple" | "medium" | "complex",
userBudget: number
): "minimal" | "low" | "medium" | "high" {
const complexityMapping = {
simple: {
budget_low: "minimal",
budget_medium: "low",
budget_high: "medium"
},
medium: {
budget_low: "low",
budget_medium: "medium",
budget_high: "high"
},
complex: {
budget_low: "medium",
budget_medium: "high",
budget_high: "high"
}
};
const budgetLevel = userBudget < 0.01 ? "budget_low" :
userBudget < 0.05 ? "budget_medium" : "budget_high";
return complexityMapping[queryComplexity][budgetLevel];
}
```
### Claude 3.7+ Budget Tokens
#### Budget Token Implementation
```typescript
interface ClaudeThinkingConfig {
model: string;
budget_tokens: number;
max_tokens: number;
}
async function createClaudeThinking(
config: ClaudeThinkingConfig,
messages: any[]
) {
// Validate budget constraints
if (config.budget_tokens > 32000) {
throw new MCPError(
MCPErrorType.THINKING_TOKENS_EXCEEDED,
"Budget tokens cannot exceed 32,000"
);
}
if (config.budget_tokens < 1024) {
config.budget_tokens = 1024; // Minimum budget
}
const response = await openai.chat.completions.create({
model: config.model,
messages,
max_tokens: config.max_tokens,
// Anthropic-style extended-thinking budget (not in the OpenAI SDK types);
// OpenRouter's unified alternative is `reasoning: { max_tokens }`
thinking: {
budget_tokens: config.budget_tokens
}
});
return {
content: response.choices[0].message.content,
thinking_content: response.choices[0].message.thinking,
thinking_tokens: response.usage?.thinking_tokens || 0,
output_tokens: response.usage?.completion_tokens || 0,
total_cost: calculateClaudeCost(response.usage)
};
}
```
#### Adaptive Budget Allocation
```typescript
class AdaptiveBudgetManager {
private performanceHistory: Map<string, ThinkingPerformance[]> = new Map();
calculateOptimalBudget(
taskType: string,
targetQuality: number,
maxBudgetTokens: number
): number {
const history = this.performanceHistory.get(taskType) || [];
if (history.length === 0) {
return 4096; // Default starting budget
}
// Find the minimum budget that achieves target quality
const qualityMet = history
.filter(h => h.qualityScore >= targetQuality)
.sort((a, b) => a.budgetUsed - b.budgetUsed);
if (qualityMet.length === 0) {
return Math.min(16384, Math.floor(maxBudgetTokens * 0.8)); // Conservative approach when no run has met the target
}
const optimalBudget = qualityMet[0].budgetUsed;
// Add 20% buffer for safety
return Math.min(Math.floor(optimalBudget * 1.2), maxBudgetTokens, 32000);
}
recordPerformance(
taskType: string,
budgetUsed: number,
qualityScore: number,
cost: number
): void {
const history = this.performanceHistory.get(taskType) || [];
history.push({
budgetUsed,
qualityScore,
cost,
timestamp: Date.now()
});
// Keep only recent history (last 100 entries)
if (history.length > 100) {
history.splice(0, history.length - 100);
}
this.performanceHistory.set(taskType, history);
}
}
interface ThinkingPerformance {
budgetUsed: number;
qualityScore: number;
cost: number;
timestamp: number;
}
```
### DeepSeek R1 Implementation
#### Chain of Thought Access
```typescript
interface DeepSeekR1Config {
model: "deepseek-reasoner";
max_tokens: number;
enable_cot_display: boolean;
}
async function createDeepSeekCompletion(
config: DeepSeekR1Config,
messages: any[]
) {
const response = await openai.chat.completions.create({
model: config.model,
messages,
max_tokens: config.max_tokens
// Note: DeepSeek R1 automatically manages thinking tokens
});
const choice = response.choices[0];
return {
content: choice.message.content,
reasoning_content: choice.message.reasoning_content || null,
reasoning_tokens: calculateReasoningTokens(choice.message.reasoning_content),
output_tokens: response.usage?.completion_tokens || 0,
total_cost: calculateDeepSeekCost(response.usage)
};
}
function calculateReasoningTokens(reasoningContent: string | null): number {
if (!reasoningContent) return 0;
// Rough estimation: ~4 characters per token
return Math.floor(reasoningContent.length / 4);
}
```
### Unified Thinking Token Interface
```typescript
interface UnifiedThinkingRequest {
model: string;
messages: any[];
thinking_config: {
effort?: "minimal" | "low" | "medium" | "high";
budget_tokens?: number;
max_thinking_tokens?: number;
};
max_tokens: number;
stream?: boolean;
}
class UnifiedThinkingClient {
async createThinkingCompletion(request: UnifiedThinkingRequest) {
const provider = this.detectProvider(request.model);
switch (provider) {
case "openai":
return this.handleOpenAIReasoning(request);
case "anthropic":
return this.handleClaudeThinking(request);
case "deepseek":
return this.handleDeepSeekReasoning(request);
default:
throw new MCPError(
MCPErrorType.MODEL_UNAVAILABLE,
`Thinking tokens not supported for provider: ${provider}`
);
}
}
private detectProvider(model: string): string {
if (model.includes("openai/") || model.includes("o1") || model.includes("o3")) {
return "openai";
}
if (model.includes("anthropic/") || model.includes("claude")) {
return "anthropic";
}
if (model.includes("deepseek")) {
return "deepseek";
}
return "unknown";
}
private async handleOpenAIReasoning(request: UnifiedThinkingRequest) {
const effort = request.thinking_config.effort || "medium";
return createO1Completion({
model: request.model,
reasoning_effort: effort,
max_completion_tokens: request.max_tokens
}, request.messages);
}
private async handleClaudeThinking(request: UnifiedThinkingRequest) {
const budget = request.thinking_config.budget_tokens || 4096;
return createClaudeThinking({
model: request.model,
budget_tokens: budget,
max_tokens: request.max_tokens
}, request.messages);
}
}
```
## Implementation Best Practices
### TypeScript Architecture
#### Strict Type Safety
```typescript
// Never use 'any' - always define proper types
interface MCPToolResponse<T = unknown> {
content: Array<{
type: "text" | "image" | "resource";
text?: string;
data?: string;
mimeType?: string;
}>;
metadata?: {
thinking_tokens?: number;
cost?: number;
model_used?: string;
response_time_ms?: number;
};
}
// Use discriminated unions for tool configurations
type ThinkingModelConfig =
| { provider: "openai"; reasoning_effort: "minimal" | "low" | "medium" | "high" }
| { provider: "anthropic"; budget_tokens: number }
| { provider: "deepseek"; enable_cot: boolean };
// Proper error typing
type MCPResult<T> =
| { success: true; data: T }
| { success: false; error: MCPError };
```
#### Zod Schema Patterns
```typescript
// Comprehensive validation schemas
const ThinkingRequestSchema = z.object({
model: z.string().min(1),
messages: z.array(z.object({
role: z.enum(["system", "user", "assistant"]),
content: z.union([
z.string(),
z.array(z.object({
type: z.enum(["text", "image_url"]),
text: z.string().optional(),
image_url: z.object({
url: z.string().url()
}).optional()
}))
])
})),
thinking_config: z.object({
effort: z.enum(["minimal", "low", "medium", "high"]).optional(),
budget_tokens: z.number().int().min(1024).max(32000).optional(),
quality_threshold: z.number().min(0).max(1).optional()
}).optional(),
optimization: z.object({
enable_caching: z.boolean().default(true),
max_cost: z.number().positive().optional(),
timeout_ms: z.number().int().positive().default(30000)
}).optional()
});
// Schema composition for complex tools
const BaseToolSchema = z.object({
metadata: z.object({
user_id: z.string(),
session_id: z.string(),
timestamp: z.number()
})
});
const ThinkingToolSchema = BaseToolSchema.extend({
thinking_request: ThinkingRequestSchema
});
```
#### Modular Tool Architecture
```typescript
abstract class BaseMCPTool<TInput, TOutput> {
abstract name: string;
abstract description: string;
abstract inputSchema: z.ZodSchema<TInput>;
protected startTime = 0;
constructor(protected config: ToolConfig) {}
async execute(input: TInput): Promise<MCPToolResponse<TOutput>> {
this.startTime = Date.now(); // record the start so response_time_ms below is meaningful
try {
const validatedInput = this.inputSchema.parse(input);
const result = await this.process(validatedInput);
return {
content: [{
type: "text",
text: JSON.stringify(result, null, 2)
}],
metadata: {
response_time_ms: Date.now() - this.startTime,
cost: this.calculateCost(result)
}
};
} catch (error) {
throw this.handleError(error);
}
}
protected abstract process(input: TInput): Promise<TOutput>;
protected abstract calculateCost(result: TOutput): number;
protected abstract handleError(error: unknown): MCPError;
}
// Concrete implementation
class ThinkingCompletionTool extends BaseMCPTool<ThinkingRequest, ThinkingResponse> {
name = "thinking_completion";
description = "Generate responses using thinking models with controlled reasoning";
inputSchema = ThinkingRequestSchema;
protected async process(input: ThinkingRequest): Promise<ThinkingResponse> {
const client = new UnifiedThinkingClient();
return client.createThinkingCompletion(input);
}
protected calculateCost(result: ThinkingResponse): number {
const inputCost = result.input_tokens * this.getInputRate(result.model);
const outputCost = result.output_tokens * this.getOutputRate(result.model);
const thinkingCost = result.thinking_tokens * this.getOutputRate(result.model);
return inputCost + outputCost + thinkingCost;
}
}
```
### Server Configuration
#### Environment-Based Configuration
```typescript
interface ServerEnvironment {
NODE_ENV: "development" | "staging" | "production";
OPENROUTER_API_KEY: string;
LOG_LEVEL: "debug" | "info" | "warn" | "error";
ENABLE_TELEMETRY: boolean;
MAX_THINKING_TOKENS: number;
COST_LIMIT_PER_USER: number;
}
class ConfigManager {
private static instance: ConfigManager;
private config: ServerEnvironment;
private constructor() {
this.config = this.loadConfiguration();
this.validateConfiguration();
}
static getInstance(): ConfigManager {
if (!ConfigManager.instance) {
ConfigManager.instance = new ConfigManager();
}
return ConfigManager.instance;
}
private loadConfiguration(): ServerEnvironment {
return {
NODE_ENV: (process.env.NODE_ENV as any) || "development",
OPENROUTER_API_KEY: process.env.OPENROUTER_API_KEY!,
LOG_LEVEL: (process.env.LOG_LEVEL as any) || "info",
ENABLE_TELEMETRY: process.env.ENABLE_TELEMETRY === "true",
MAX_THINKING_TOKENS: parseInt(process.env.MAX_THINKING_TOKENS || "16384"),
COST_LIMIT_PER_USER: parseFloat(process.env.COST_LIMIT_PER_USER || "10.0")
};
}
private validateConfiguration(): void {
if (!this.config.OPENROUTER_API_KEY) {
throw new Error("OPENROUTER_API_KEY is required");
}
if (this.config.MAX_THINKING_TOKENS > 32000) {
console.warn("MAX_THINKING_TOKENS exceeds recommended limit of 32,000");
}
}
get<K extends keyof ServerEnvironment>(key: K): ServerEnvironment[K] {
return this.config[key];
}
}
```
#### Transport Management
```typescript
class TransportManager {
private transports: Map<string, Transport> = new Map();
registerTransport(name: string, transport: Transport): void {
this.transports.set(name, transport);
}
async connectAll(server: McpServer): Promise<void> {
const connections = Array.from(this.transports.entries()).map(
async ([name, transport]) => {
try {
await server.connect(transport);
console.log(`Connected transport: ${name}`);
} catch (error) {
console.error(`Failed to connect transport ${name}:`, error);
throw error;
}
}
);
await Promise.all(connections);
}
async disconnectAll(): Promise<void> {
for (const [name, transport] of this.transports) {
try {
await transport.close();
console.log(`Disconnected transport: ${name}`);
} catch (error) {
console.error(`Error disconnecting ${name}:`, error);
}
}
}
}
// Usage
const transportManager = new TransportManager();
// Stdio transport for local development
transportManager.registerTransport(
"stdio",
new StdioServerTransport()
);
// HTTP/SSE transport for production. SSEServerTransport takes the POST endpoint
// path plus the live HTTP response, so it is created per connection inside the
// web framework's handler (Express-style handler shown; `app` is assumed)
app.get("/mcp", async (_req, res) => {
const sseTransport = new SSEServerTransport("/messages", res);
transportManager.registerTransport(`sse-${Date.now()}`, sseTransport);
await server.connect(sseTransport);
});
```
### Streaming Implementation
#### Server-Sent Events (SSE)
```typescript
interface StreamingConfig {
enable_streaming: boolean;
chunk_size: number;
heartbeat_interval: number;
}
class StreamingThinkingClient {
private config: StreamingConfig;
constructor(config: StreamingConfig) {
this.config = config;
}
async createStreamingCompletion(
request: UnifiedThinkingRequest
): Promise<AsyncIterable<ThinkingChunk>> {
if (!this.config.enable_streaming) {
throw new MCPError(
MCPErrorType.VALIDATION_ERROR,
"Streaming is not enabled"
);
}
const response = await openai.chat.completions.create({
...this.buildRequestParams(request),
stream: true
});
return this.processStreamingResponse(response);
}
private async* processStreamingResponse(
stream: AsyncIterable<any>
): AsyncIterable<ThinkingChunk> {
let thinkingTokens = 0;
let outputTokens = 0;
for await (const chunk of stream) {
const delta = chunk.choices[0]?.delta;
if (delta?.reasoning) {
thinkingTokens += this.estimateTokens(delta.reasoning);
yield {
type: "thinking",
content: delta.reasoning,
tokens: thinkingTokens
};
}
if (delta?.content) {
outputTokens += this.estimateTokens(delta.content);
yield {
type: "output",
content: delta.content,
tokens: outputTokens
};
}
if (chunk.choices[0]?.finish_reason) {
yield {
type: "complete",
thinking_tokens: thinkingTokens,
output_tokens: outputTokens,
finish_reason: chunk.choices[0].finish_reason
};
}
}
}
private estimateTokens(text: string): number {
// Rough estimation: ~4 characters per token
return Math.ceil(text.length / 4);
}
}
interface ThinkingChunk {
type: "thinking" | "output" | "complete";
content?: string;
tokens?: number;
thinking_tokens?: number;
output_tokens?: number;
finish_reason?: string;
}
```
#### Stream Cancellation
```typescript
class CancellableStream {
private abortController: AbortController;
private onCancel?: () => void;
constructor() {
this.abortController = new AbortController();
}
async createCancellableCompletion(
request: UnifiedThinkingRequest
): Promise<AsyncIterable<ThinkingChunk>> {
const signal = this.abortController.signal;
const response = await openai.chat.completions.create({
...this.buildRequestParams(request),
stream: true
}, {
signal // Pass abort signal to request
});
// Set up cancellation handler
signal.addEventListener('abort', () => {
if (this.onCancel) {
this.onCancel();
}
});
return this.processStreamingResponse(response);
}
cancel(): void {
this.abortController.abort();
}
onCancellation(callback: () => void): void {
this.onCancel = callback;
}
}
// Usage in MCP tool
class StreamingThinkingTool extends BaseMCPTool<any, any> {
private activeStreams: Map<string, CancellableStream> = new Map();
async execute(input: any): Promise<MCPToolResponse<any>> {
const streamId = this.generateStreamId();
const stream = new CancellableStream();
this.activeStreams.set(streamId, stream);
try {
const completion = await stream.createCancellableCompletion(input);
return {
content: [{
type: "text",
text: JSON.stringify({
stream_id: streamId,
message: "Streaming started. Use cancel_stream tool to stop."
})
}]
};
} finally {
this.activeStreams.delete(streamId);
}
}
cancelStream(streamId: string): boolean {
const stream = this.activeStreams.get(streamId);
if (stream) {
stream.cancel();
this.activeStreams.delete(streamId);
return true;
}
return false;
}
}
```
## Advanced Techniques
### Dynamic Token Allocation
#### Complexity-Based Budgeting
```typescript
interface QueryComplexity {
syntactic_complexity: number; // 0-1 based on sentence structure
semantic_complexity: number; // 0-1 based on topic difficulty
reasoning_required: number; // 0-1 based on logical steps needed
domain_complexity: number; // 0-1 based on specialized knowledge
}
class ComplexityAnalyzer {
analyzeQuery(query: string): QueryComplexity {
return {
syntactic_complexity: this.analyzeSyntax(query),
semantic_complexity: this.analyzeSemantics(query),
reasoning_required: this.analyzeReasoning(query),
domain_complexity: this.analyzeDomain(query)
};
}
private analyzeSyntax(query: string): number {
const sentences = query.split(/[.!?]+/).filter(s => s.trim());
const avgWordsPerSentence = sentences.reduce((acc, s) =>
acc + s.trim().split(/\s+/).length, 0) / sentences.length;
// Normalize to 0-1 scale
return Math.min(avgWordsPerSentence / 30, 1);
}
private analyzeSemantics(query: string): number {
const complexityIndicators = [
/\b(analyze|compare|evaluate|synthesize|integrate)\b/i,
/\b(multiple|various|several|different)\b/i,
/\b(relationship|correlation|causation)\b/i,
/\b(implication|consequence|impact)\b/i
];
const matches = complexityIndicators.reduce((count, pattern) =>
count + (pattern.test(query) ? 1 : 0), 0);
return matches / complexityIndicators.length;
}
private analyzeReasoning(query: string): number {
const reasoningKeywords = [
/\bwhy\b/i, /\bhow\b/i, /\bexplain\b/i, /\bprove\b/i,
/\bif.*then\b/i, /\bassume\b/i, /\bgiven.*find\b/i,
/\bstep.*step\b/i, /\blogical\b/i, /\breason\b/i
];
const matches = reasoningKeywords.reduce((count, pattern) =>
count + (pattern.test(query) ? 1 : 0), 0);
return Math.min(matches / 3, 1);
}
private analyzeDomain(query: string): number {
const specializedDomains = [
/\b(quantum|molecular|biochemical|neurological)\b/i,
/\b(algorithm|cryptographic|computational)\b/i,
/\b(legal|constitutional|statutory)\b/i,
/\b(financial|actuarial|economic)\b/i,
/\b(medical|clinical|diagnostic)\b/i
];
return specializedDomains.some(pattern => pattern.test(query)) ? 0.8 : 0.2;
}
}
class DynamicTokenAllocator {
private complexityAnalyzer = new ComplexityAnalyzer();
calculateOptimalTokens(
query: string,
maxBudget: number,
qualityTarget: number
): number {
const complexity = this.complexityAnalyzer.analyzeQuery(query);
// Weighted complexity score
const weights = {
syntactic_complexity: 0.1,
semantic_complexity: 0.3,
reasoning_required: 0.4,
domain_complexity: 0.2
};
const overallComplexity = Object.entries(complexity).reduce(
(sum, [key, value]) => sum + value * weights[key as keyof typeof weights],
0
);
// Calculate base allocation
const baseAllocation = Math.floor(maxBudget * 0.3); // 30% minimum
const complexityAllocation = Math.floor(
maxBudget * 0.7 * overallComplexity
);
// Quality adjustment
const qualityMultiplier = 0.5 + (qualityTarget * 0.5);
const finalAllocation = Math.floor(
(baseAllocation + complexityAllocation) * qualityMultiplier
);
return Math.max(1024, Math.min(finalAllocation, maxBudget));
}
}
```
#### Adaptive Learning System
```typescript
interface LearningData {
query_hash: string;
allocated_tokens: number;
actual_tokens_used: number;
quality_score: number;
cost: number;
response_time: number;
user_feedback?: number; // 1-5 rating
}
class AdaptiveLearningSystem {
private learningData: LearningData[] = [];
private modelWeights: Map<string, number> = new Map();
recordExecution(data: LearningData): void {
this.learningData.push(data);
this.updateModelWeights();
// Cleanup old data (keep last 10,000 entries)
if (this.learningData.length > 10000) {
this.learningData = this.learningData.slice(-10000);
}
}
private updateModelWeights(): void {
// Simple gradient descent-like approach
const recentData = this.learningData.slice(-100); // Last 100 executions
for (const data of recentData) {
const efficiency = data.actual_tokens_used / data.allocated_tokens;
const qualityPerCost = data.quality_score / data.cost;
// Adjust weights based on efficiency and quality/cost ratio
const currentWeight = this.modelWeights.get(data.query_hash) || 1.0;
const adjustment = this.calculateAdjustment(efficiency, qualityPerCost);
this.modelWeights.set(data.query_hash, currentWeight + adjustment);
}
}
private calculateAdjustment(efficiency: number, qualityPerCost: number): number {
// Reward efficient token usage and high quality per cost
const efficiencyScore = efficiency > 0.8 ? 0.1 : -0.05;
const qualityScore = qualityPerCost > 0.5 ? 0.1 : -0.05;
return (efficiencyScore + qualityScore) * 0.1; // Small adjustments
}
predictOptimalTokens(queryHash: string, baseAllocation: number): number {
const weight = this.modelWeights.get(queryHash) || 1.0;
return Math.floor(baseAllocation * weight);
}
getPerformanceMetrics(): {
averageEfficiency: number;
averageQuality: number;
costEffectiveness: number;
} {
if (this.learningData.length === 0) {
return { averageEfficiency: 0, averageQuality: 0, costEffectiveness: 0 };
}
const recent = this.learningData.slice(-1000);
const avgEfficiency = recent.reduce((sum, d) =>
sum + (d.actual_tokens_used / d.allocated_tokens), 0) / recent.length;
const avgQuality = recent.reduce((sum, d) =>
sum + d.quality_score, 0) / recent.length;
const costEffectiveness = recent.reduce((sum, d) =>
sum + (d.quality_score / d.cost), 0) / recent.length;
return { averageEfficiency: avgEfficiency, averageQuality: avgQuality, costEffectiveness };
}
}
```
### Caching Strategies
#### Prompt Caching Implementation
```typescript
interface CacheEntry {
key: string;
response: any;
thinking_tokens: number;
cost: number;
timestamp: number;
hit_count: number;
quality_score?: number;
}
class ThinkingCache {
private cache: Map<string, CacheEntry> = new Map();
private maxSize: number;
private ttlMs: number;
constructor(maxSize: number = 10000, ttlMs: number = 24 * 60 * 60 * 1000) {
this.maxSize = maxSize;
this.ttlMs = ttlMs;
}
generateKey(
model: string,
messages: any[],
thinkingConfig: any
): string {
const content = {
model,
messages: messages.map(m => ({ role: m.role, content: m.content })),
thinking: thinkingConfig
};
return this.hashObject(content);
}
private hashObject(obj: any): string {
// Serialize the whole object (a replacer key array would drop nested keys);
// generateKey already builds the object with a stable key order
const str = JSON.stringify(obj);
return createHash('sha256').update(str).digest('hex'); // import { createHash } from "node:crypto"
}
get(key: string): CacheEntry | null {
const entry = this.cache.get(key);
if (!entry) return null;
// Check TTL
if (Date.now() - entry.timestamp > this.ttlMs) {
this.cache.delete(key);
return null;
}
// Update hit count and return
entry.hit_count++;
return entry;
}
set(
key: string,
response: any,
thinking_tokens: number,
cost: number,
quality_score?: number
): void {
// Evict if cache is full
if (this.cache.size >= this.maxSize) {
this.evictLeastUsed();
}
this.cache.set(key, {
key,
response,
thinking_tokens,
cost,
timestamp: Date.now(),
hit_count: 0,
quality_score
});
}
private evictLeastUsed(): void {
let leastUsedKey = '';
let minHits = Infinity;
let oldestTime = Infinity;
for (const [key, entry] of this.cache) {
// Prefer evicting entries with fewer hits, then older entries
if (entry.hit_count < minHits ||
(entry.hit_count === minHits && entry.timestamp < oldestTime)) {
leastUsedKey = key;
minHits = entry.hit_count;
oldestTime = entry.timestamp;
}
}
if (leastUsedKey) {
this.cache.delete(leastUsedKey);
}
}
getCacheStats(): {
size: number;
hitRate: number;
totalSavings: number;
averageQuality: number;
} {
const entries = Array.from(this.cache.values());
const totalHits = entries.reduce((sum, e) => sum + e.hit_count, 0);
const totalRequests = entries.length + totalHits;
const hitRate = totalRequests > 0 ? totalHits / totalRequests : 0;
const totalSavings = entries.reduce((sum, e) =>
sum + (e.cost * e.hit_count), 0);
const qualityEntries = entries.filter(e => e.quality_score !== undefined);
const averageQuality = qualityEntries.length > 0
? qualityEntries.reduce((sum, e) => sum + e.quality_score!, 0) / qualityEntries.length
: 0;
return {
size: this.cache.size,
hitRate,
totalSavings,
averageQuality
};
}
}
// Integration with thinking client
class CachedThinkingClient extends UnifiedThinkingClient {
private cache = new ThinkingCache();
async createThinkingCompletion(request: UnifiedThinkingRequest) {
const cacheKey = this.cache.generateKey(
request.model,
request.messages,
request.thinking_config
);
// Check cache first
const cached = this.cache.get(cacheKey);
if (cached) {
return {
...cached.response,
cache_hit: true,
cached_cost_savings: cached.cost
};
}
// Generate new response
const response = await super.createThinkingCompletion(request);
// Cache the response
this.cache.set(
cacheKey,
response,
response.thinking_tokens || 0,
response.cost || 0,
response.quality_score
);
return {
...response,
cache_hit: false
};
}
}
```
### Parallel Processing
#### Concurrent Model Requests
```typescript
interface ParallelRequest {
id: string;
model: string;
messages: any[];
thinking_config: any;
priority: number;
}
interface ParallelResponse {
id: string;
response: any;
thinking_tokens: number;
cost: number;
response_time: number;
error?: Error;
}
class ParallelThinkingProcessor {
private maxConcurrency: number;
private queue: ParallelRequest[] = [];
private processing: Set<string> = new Set();
constructor(maxConcurrency: number = 5) {
this.maxConcurrency = maxConcurrency;
}
async processRequests(
requests: ParallelRequest[]
): Promise<ParallelResponse[]> {
// Sort by priority (higher number = higher priority)
const sortedRequests = [...requests].sort((a, b) => b.priority - a.priority);
this.queue.push(...sortedRequests);
const results = await Promise.allSettled(
sortedRequests.map(request => this.processRequest(request))
);
// Map results back by index into this batch, not the shared queue,
// which may already hold requests from other batches
return results.map((result, index) => {
const request = sortedRequests[index];
if (result.status === 'fulfilled') {
return result.value;
} else {
return {
id: request.id,
response: null,
thinking_tokens: 0,
cost: 0,
response_time: 0,
error: result.reason
};
}
});
}
private async processRequest(request: ParallelRequest): Promise<ParallelResponse> {
// Wait for available slot
await this.waitForSlot();
this.processing.add(request.id);
const startTime = Date.now();
try {
const client = new CachedThinkingClient();
const response = await client.createThinkingCompletion({
model: request.model,
messages: request.messages,
thinking_config: request.thinking_config
});
return {
id: request.id,
response,
thinking_tokens: response.thinking_tokens || 0,
cost: response.cost || 0,
response_time: Date.now() - startTime
};
} finally {
this.processing.delete(request.id);
}
}
private async waitForSlot(): Promise<void> {
while (this.processing.size >= this.maxConcurrency) {
await new Promise(resolve => setTimeout(resolve, 100));
}
}
getQueueStatus(): {
queued: number;
processing: number;
capacity: number;
} {
return {
queued: this.queue.length,
processing: this.processing.size,
capacity: this.maxConcurrency
};
}
}
// Usage in MCP tool
class BatchThinkingTool extends BaseMCPTool<any, any> {
private processor = new ParallelThinkingProcessor();
async execute(input: {
requests: Array<{
model: string;
messages: any[];
thinking_config?: any;
priority?: number;
}>
}): Promise<MCPToolResponse<any>> {
const parallelRequests: ParallelRequest[] = input.requests.map((req, index) => ({
id: `req_${index}`,
model: req.model,
messages: req.messages,
thinking_config: req.thinking_config || {},
priority: req.priority || 1
}));
const results = await this.processor.processRequests(parallelRequests);
const summary = {
total_requests: results.length,
successful: results.filter(r => !r.error).length,
failed: results.filter(r => r.error).length,
total_thinking_tokens: results.reduce((sum, r) => sum + r.thinking_tokens, 0),
total_cost: results.reduce((sum, r) => sum + r.cost, 0),
average_response_time: results.reduce((sum, r) => sum + r.response_time, 0) / results.length
};
return {
content: [{
type: "text",
text: JSON.stringify({ summary, results }, null, 2)
}]
};
}
}
```
## Cost Optimization Strategies
### Real-Time Cost Monitoring
```typescript
interface CostTracker {
input_tokens: number;
output_tokens: number;
thinking_tokens: number;
total_cost: number;
model_used: string;
timestamp: number;
}
class CostMonitor {
private costs: CostTracker[] = [];
private dailyLimits: Map<string, number> = new Map();
private monthlyLimits: Map<string, number> = new Map();
// Model pricing (per 1M tokens)
private modelPricing = {
"openai/o1": { input: 15.0, output: 60.0 },
"openai/o1-mini": { input: 3.0, output: 12.0 },
"openai/o3-mini": { input: 1.25, output: 10.0 },
"anthropic/claude-3.7-sonnet": { input: 3.0, output: 15.0 },
"deepseek/deepseek-reasoner": { input: 0.55, output: 2.19 }
};
calculateCost(
model: string,
inputTokens: number,
outputTokens: number,
thinkingTokens: number = 0
): number {
const pricing = this.modelPricing[model];
if (!pricing) {
throw new Error(`Pricing not found for model: ${model}`);
}
const inputCost = (inputTokens / 1000000) * pricing.input;
const outputCost = ((outputTokens + thinkingTokens) / 1000000) * pricing.output;
return inputCost + outputCost;
}
trackUsage(
model: string,
inputTokens: number,
outputTokens: number,
thinkingTokens: number,
userId?: string
): CostTracker {
const cost = this.calculateCost(model, inputTokens, outputTokens, thinkingTokens);
const tracker: CostTracker = {
input_tokens: inputTokens,
output_tokens: outputTokens,
thinking_tokens: thinkingTokens,
total_cost: cost,
model_used: model,
timestamp: Date.now()
};
this.costs.push(tracker);
// Check limits
if (userId) {
this.checkUserLimits(userId, cost);
}
return tracker;
}
private checkUserLimits(userId: string, newCost: number): void {
const dailySpent = this.getDailySpent(userId);
const monthlySpent = this.getMonthlySpent(userId);
const dailyLimit = this.dailyLimits.get(userId) || 50.0;
const monthlyLimit = this.monthlyLimits.get(userId) || 1000.0;
if (dailySpent + newCost > dailyLimit) {
throw new MCPError(
MCPErrorType.RATE_LIMIT_EXCEEDED,
`Daily cost limit exceeded: $${dailySpent + newCost} > $${dailyLimit}`
);
}
if (monthlySpent + newCost > monthlyLimit) {
throw new MCPError(
MCPErrorType.RATE_LIMIT_EXCEEDED,
`Monthly cost limit exceeded: $${monthlySpent + newCost} > $${monthlyLimit}`
);
}
}
getDailySpent(userId: string): number {
const today = new Date();
today.setHours(0, 0, 0, 0);
return this.costs
.filter(c => c.timestamp >= today.getTime())
.reduce((sum, c) => sum + c.total_cost, 0);
}
getMonthlySpent(userId: string): number {
const thisMonth = new Date();
thisMonth.setDate(1);
thisMonth.setHours(0, 0, 0, 0);
return this.costs
.filter(c => c.timestamp >= thisMonth.getTime())
.reduce((sum, c) => sum + c.total_cost, 0);
}
getCostReport(userId?: string): {
totalCost: number;
totalThinkingTokens: number;
totalOutputTokens: number;
averageCostPerRequest: number;
modelBreakdown: Map<string, number>;
dailyTrend: Array<{ date: string; cost: number }>;
} {
const relevantCosts = userId
? this.costs // In real implementation, filter by userId
: this.costs;
const totalCost = relevantCosts.reduce((sum, c) => sum + c.total_cost, 0);
const totalThinkingTokens = relevantCosts.reduce((sum, c) => sum + c.thinking_tokens, 0);
const totalOutputTokens = relevantCosts.reduce((sum, c) => sum + c.output_tokens, 0);
const averageCostPerRequest = relevantCosts.length > 0 ? totalCost / relevantCosts.length : 0;
const modelBreakdown = new Map<string, number>();
for (const cost of relevantCosts) {
const current = modelBreakdown.get(cost.model_used) || 0;
modelBreakdown.set(cost.model_used, current + cost.total_cost);
}
// Daily trend for last 30 days
const dailyTrend: Array<{ date: string; cost: number }> = [];
for (let i = 29; i >= 0; i--) {
const date = new Date();
date.setDate(date.getDate() - i);
date.setHours(0, 0, 0, 0);
const nextDay = new Date(date);
nextDay.setDate(nextDay.getDate() + 1);
const dayCost = relevantCosts
.filter(c => c.timestamp >= date.getTime() && c.timestamp < nextDay.getTime())
.reduce((sum, c) => sum + c.total_cost, 0);
dailyTrend.push({
date: date.toISOString().split('T')[0],
cost: dayCost
});
}
return {
totalCost,
totalThinkingTokens,
totalOutputTokens,
averageCostPerRequest,
modelBreakdown,
dailyTrend
};
}
}
```
### Budget-Aware Model Selection
```typescript
interface ModelCapability {
reasoning_quality: number; // 0-1 scale
speed: number; // tokens/second
cost_per_1k_tokens: number;
max_thinking_tokens: number;
supports_streaming: boolean;
}
class BudgetAwareModelSelector {
private modelCapabilities: Map<string, ModelCapability> = new Map([
["openai/o1", {
reasoning_quality: 0.95,
speed: 15,
cost_per_1k_tokens: 0.075,
max_thinking_tokens: 65536,
supports_streaming: true
}],
["openai/o1-mini", {
reasoning_quality: 0.85,
speed: 25,
cost_per_1k_tokens: 0.015,
max_thinking_tokens: 65536,
supports_streaming: true
}],
["anthropic/claude-3.7-sonnet", {
reasoning_quality: 0.88,
speed: 30,
cost_per_1k_tokens: 0.018,
max_thinking_tokens: 32000,
supports_streaming: true
}],
["deepseek/deepseek-reasoner", {
reasoning_quality: 0.80,
speed: 21,
cost_per_1k_tokens: 0.0027,
max_thinking_tokens: 23000,
supports_streaming: false
}]
]);
selectOptimalModel(
requirements: {
min_quality: number;
max_cost_per_request: number;
estimated_tokens: number;
require_streaming?: boolean;
max_response_time?: number;
}
): string {
const candidates = Array.from(this.modelCapabilities.entries())
.filter(([model, caps]) => {
// Filter by quality requirement
if (caps.reasoning_quality < requirements.min_quality) return false;
// Filter by streaming requirement
if (requirements.require_streaming && !caps.supports_streaming) return false;
// Filter by cost requirement
const estimatedCost = (requirements.estimated_tokens / 1000) * caps.cost_per_1k_tokens;
if (estimatedCost > requirements.max_cost_per_request) return false;
// Filter by response time requirement
if (requirements.max_response_time) {
const estimatedTime = requirements.estimated_tokens / caps.speed;
if (estimatedTime > requirements.max_response_time) return false;
}
return true;
});
if (candidates.length === 0) {
throw new MCPError(
MCPErrorType.MODEL_UNAVAILABLE,
"No models meet the specified requirements"
);
}
// Select the most cost-effective model that meets requirements
const selected = candidates.reduce((best, current) => {
const [bestModel, bestCaps] = best;
const [currentModel, currentCaps] = current;
// Calculate value score (quality per dollar)
const bestValue = bestCaps.reasoning_quality / bestCaps.cost_per_1k_tokens;
const currentValue = currentCaps.reasoning_quality / currentCaps.cost_per_1k_tokens;
return currentValue > bestValue ? current : best;
});
return selected[0];
}
estimateRequestCost(
model: string,
inputTokens: number,
expectedOutputTokens: number,
thinkingTokensRatio: number = 0.5
): number {
const caps = this.modelCapabilities.get(model);
if (!caps) {
throw new Error(`Unknown model: ${model}`);
}
const thinkingTokens = expectedOutputTokens * thinkingTokensRatio;
const totalTokens = inputTokens + expectedOutputTokens + thinkingTokens;
return (totalTokens / 1000) * caps.cost_per_1k_tokens;
}
getModelRecommendations(
taskType: "simple" | "medium" | "complex",
budgetTier: "budget" | "standard" | "premium"
): string[] {
const taskRequirements = {
simple: { min_quality: 0.7, complexity_factor: 1.0 },
medium: { min_quality: 0.8, complexity_factor: 1.5 },
complex: { min_quality: 0.9, complexity_factor: 2.0 }
};
const budgetLimits = {
budget: 0.01, // $0.01 per request
standard: 0.05, // $0.05 per request
premium: 0.20 // $0.20 per request
};
const task = taskRequirements[taskType];
const budget = budgetLimits[budgetTier];
const estimatedTokens = 2000 * task.complexity_factor;
try {
const primary = this.selectOptimalModel({
min_quality: task.min_quality,
max_cost_per_request: budget,
estimated_tokens: estimatedTokens
});
// Also suggest alternatives
const alternatives = Array.from(this.modelCapabilities.entries())
.filter(([model, caps]) =>
model !== primary &&
caps.reasoning_quality >= task.min_quality * 0.9
)
.sort((a, b) => a[1].cost_per_1k_tokens - b[1].cost_per_1k_tokens)
.slice(0, 2)
.map(([model]) => model);
return [primary, ...alternatives];
} catch (error) {
// If no models meet requirements, return cheapest options
return Array.from(this.modelCapabilities.entries())
.sort((a, b) => a[1].cost_per_1k_tokens - b[1].cost_per_1k_tokens)
.slice(0, 3)
.map(([model]) => model);
}
}
}
```
### Provider Routing Optimization
```typescript
interface ProviderRoute {
provider: string;
models: string[];
latency_avg: number;
reliability_score: number;
cost_multiplier: number;
fallback_providers: string[];
}
class ProviderRouter {
private routes: Map<string, ProviderRoute> = new Map();
private failureHistory: Map<string, number[]> = new Map();
constructor() {
this.initializeRoutes();
}
private initializeRoutes(): void {
this.routes.set("openai", {
provider: "openai",
models: ["o1", "o1-mini", "o3-mini"],
latency_avg: 2500,
reliability_score: 0.99,
cost_multiplier: 1.0,
fallback_providers: ["anthropic"]
});
this.routes.set("anthropic", {
provider: "anthropic",
models: ["claude-3.7-sonnet", "claude-4"],
latency_avg: 1800,
reliability_score: 0.98,
cost_multiplier: 0.8,
fallback_providers: ["openai", "deepseek"]
});
this.routes.set("deepseek", {
provider: "deepseek",
models: ["deepseek-reasoner"],
latency_avg: 3200,
reliability_score: 0.95,
cost_multiplier: 0.2,
fallback_providers: ["anthropic"]
});
}
async routeRequest(
model: string,
request: any,
preferences: {
prioritize: "cost" | "speed" | "reliability";
max_retries: number;
}
): Promise<any> {
const provider = this.getProviderForModel(model);
const route = this.routes.get(provider);
if (!route) {
throw new MCPError(
MCPErrorType.MODEL_UNAVAILABLE,
`No route found for model: ${model}`
);
}
// Try primary provider
try {
return await this.executeRequest(provider, model, request);
} catch (error) {
this.recordFailure(provider);
if (preferences.max_retries <= 0) {
throw error;
}
// Try fallback providers
for (const fallbackProvider of route.fallback_providers) {
try {
// Find equivalent model on fallback provider
const fallbackModel = this.findEquivalentModel(model, fallbackProvider);
if (fallbackModel) {
return await this.executeRequest(fallbackProvider, fallbackModel, request);
}
} catch (fallbackError) {
this.recordFailure(fallbackProvider);
continue;
}
}
throw new MCPError(
MCPErrorType.MODEL_UNAVAILABLE,
`All providers failed for model: ${model}`
);
}
}
private getProviderForModel(model: string): string {
for (const [provider, route] of this.routes) {
if (route.models.some(m => model.includes(m))) {
return provider;
}
}
throw new Error(`Unknown provider for model: ${model}`);
}
private findEquivalentModel(model: string, targetProvider: string): string | null {
const route = this.routes.get(targetProvider);
if (!route) return null;
// Simple mapping - in production, use more sophisticated logic
const modelMappings = {
"o1": "claude-3.7-sonnet",
"o1-mini": "claude-3.7-sonnet",
"claude-3.7-sonnet": "o1-mini",
"deepseek-reasoner": "claude-3.7-sonnet"
};
const baseModel = model.split('/').pop() || model;
const equivalent = modelMappings[baseModel];
return equivalent && route.models.includes(equivalent)
? `${targetProvider}/${equivalent}`
: null;
}
private async executeRequest(
provider: string,
model: string,
request: any
): Promise<any> {
// Provider-specific request execution
// This would integrate with actual provider SDKs
console.log(`Executing request on ${provider} with model ${model}`);
// Simulate request execution
await new Promise(resolve => setTimeout(resolve, 100));
return { provider, model, result: "success" };
}
private recordFailure(provider: string): void {
const failures = this.failureHistory.get(provider) || [];
failures.push(Date.now());
// Keep only failures from last hour
const oneHourAgo = Date.now() - 60 * 60 * 1000;
const recentFailures = failures.filter(time => time > oneHourAgo);
this.failureHistory.set(provider, recentFailures);
// Update reliability score
const route = this.routes.get(provider);
if (route && recentFailures.length > 5) {
route.reliability_score = Math.max(0.5, route.reliability_score * 0.95);
}
}
getProviderHealth(): Map<string, {
reliability: number;
recent_failures: number;
avg_latency: number;
status: "healthy" | "degraded" | "unhealthy";
}> {
const health = new Map();
for (const [provider, route] of this.routes) {
const failures = this.failureHistory.get(provider) || [];
const status = route.reliability_score > 0.95 ? "healthy" :
route.reliability_score > 0.8 ? "degraded" : "unhealthy";
health.set(provider, {
reliability: route.reliability_score,
recent_failures: failures.length,
avg_latency: route.latency_avg,
status
});
}
return health;
}
}
```
## Common Pitfalls and Solutions
### Authentication Issues
**Problem**: API key exposure and rotation challenges
```typescript
// ❌ Bad: Hardcoded API keys
const API_KEY = "sk-1234567890abcdef";
// ✅ Good: Secure key management
class SecureKeyManager {
private keyRotationInterval = 24 * 60 * 60 * 1000; // 24 hours
private lastRotation = 0;
async getValidKey(): Promise<string> {
if (this.shouldRotateKey()) {
await this.rotateKey();
}
return this.getCurrentKey();
}
private shouldRotateKey(): boolean {
return Date.now() - this.lastRotation > this.keyRotationInterval;
}
private async rotateKey(): Promise<void> {
// Implement key rotation logic
this.lastRotation = Date.now();
}
private getCurrentKey(): string {
const key = process.env.OPENROUTER_API_KEY;
if (!key) {
throw new MCPError(
MCPErrorType.AUTHENTICATION_FAILED,
"API key not found in environment"
);
}
return key;
}
}
```
### Token Budget Misconfiguration
**Problem**: Thinking tokens exceeding limits or inefficient allocation
```typescript
// ❌ Bad: No validation or optimization
async function createCompletion(budgetTokens: number) {
return await openai.chat.completions.create({
thinking: { budget_tokens: budgetTokens } // No validation
});
}
// ✅ Good: Proper validation and optimization
class BudgetValidator {
private static readonly MIN_BUDGET = 1024;
private static readonly MAX_BUDGET = 32000;
private static readonly RECOMMENDED_RATIOS = {
simple: 0.1, // 10% of max_tokens
medium: 0.3, // 30% of max_tokens
complex: 0.6 // 60% of max_tokens
};
static validateAndOptimize(
requestedBudget: number,
maxTokens: number,
complexity: keyof typeof BudgetValidator.RECOMMENDED_RATIOS
): number {
// Validate range
if (requestedBudget < this.MIN_BUDGET) {
console.warn(`Budget ${requestedBudget} below minimum, using ${this.MIN_BUDGET}`);
requestedBudget = this.MIN_BUDGET;
}
if (requestedBudget > this.MAX_BUDGET) {
console.warn(`Budget ${requestedBudget} above maximum, using ${this.MAX_BUDGET}`);
requestedBudget = this.MAX_BUDGET;
}
// Optimize based on complexity
const recommendedRatio = this.RECOMMENDED_RATIOS[complexity];
const recommendedBudget = Math.floor(maxTokens * recommendedRatio);
// Use the minimum of requested and recommended
const optimizedBudget = Math.min(requestedBudget, recommendedBudget);
// Ensure we leave room for final response
const maxAllowedBudget = Math.floor(maxTokens * 0.8);
return Math.min(optimizedBudget, maxAllowedBudget);
}
}
```
### Error Handling Anti-patterns
**Problem**: Poor error handling leading to cascading failures
```typescript
// ❌ Bad: Swallowing errors or crashing
async function badErrorHandling() {
try {
return await someAsyncOperation();
} catch (error) {
console.log("Something went wrong"); // Lost error details
return null; // Hides the problem
}
}
// ✅ Good: Comprehensive error handling
class RobustErrorHandler {
static async withRetry<T>(
operation: () => Promise<T>,
options: {
maxRetries: number;
backoffMs: number;
retryableErrors: string[];
}
): Promise<T> {
let lastError: Error;
for (let attempt = 1; attempt <= options.maxRetries; attempt++) {
try {
return await operation();
} catch (error) {
lastError = error as Error;
// Check if error is retryable
const isRetryable = options.retryableErrors.some(errType =>
error.message.includes(errType) || error.code === errType
);
if (!isRetryable || attempt === options.maxRetries) {
break;
}
// Exponential backoff
const delay = options.backoffMs * Math.pow(2, attempt - 1);
await new Promise(resolve => setTimeout(resolve, delay));
console.warn(
`Attempt ${attempt} failed, retrying in ${delay}ms: ${error.message}`
);
}
}
throw new MCPError(
MCPErrorType.INTERNAL_ERROR, // note: add INTERNAL_ERROR to the MCPErrorType enum defined earlier
`Operation failed after ${options.maxRetries} attempts`,
{ originalError: lastError }
);
}
}
// Usage
const result = await RobustErrorHandler.withRetry(
() => thinkingClient.createCompletion(request),
{
maxRetries: 3,
backoffMs: 1000,
retryableErrors: ["rate_limit", "timeout", "server_error"]
}
);
```
### Memory Leaks in Streaming
**Problem**: Unclosed streams and event listeners
```typescript
// ❌ Bad: Memory leaks from unclosed streams
async function badStreaming() {
const stream = await openai.chat.completions.create({
stream: true
});
for await (const chunk of stream) {
console.log(chunk);
// No cleanup, potential memory leak
}
}
// ✅ Good: Proper stream management
class StreamManager {
private activeStreams = new Set<AbortController>();
async createManagedStream(request: any): Promise<AsyncIterable<any>> {
const abortController = new AbortController();
this.activeStreams.add(abortController);
try {
const stream = await openai.chat.completions.create({
...request,
stream: true
}, {
signal: abortController.signal
});
return this.wrapStreamWithCleanup(stream, abortController);
} catch (error) {
this.activeStreams.delete(abortController);
throw error;
}
}
private async* wrapStreamWithCleanup(
stream: AsyncIterable<any>,
controller: AbortController
): AsyncIterable<any> {
try {
for await (const chunk of stream) {
if (controller.signal.aborted) {
break;
}
yield chunk;
}
} finally {
this.activeStreams.delete(controller);
}
}
cleanup(): void {
for (const controller of this.activeStreams) {
controller.abort();
}
this.activeStreams.clear();
}
}
// Ensure cleanup on process exit
const streamManager = new StreamManager();
process.on('SIGTERM', () => {
streamManager.cleanup();
});
```
### Schema Validation Errors
**Problem**: Runtime errors due to invalid tool inputs
```typescript
// ❌ Bad: No validation, runtime errors
server.setRequestHandler(CallToolRequestSchema, async (request) => {
const model = request.params.arguments.model; // Might be undefined
const messages = request.params.arguments.messages; // Might be wrong format
// Will crash if arguments are invalid
return await client.createCompletion({ model, messages });
});
// ✅ Good: Comprehensive validation with helpful errors
const ThinkingToolInputSchema = z.object({
model: z.string()
.min(1, "Model name cannot be empty")
.refine(model =>
model.includes("openai/") ||
model.includes("anthropic/") ||
model.includes("deepseek/"),
"Model must be from supported provider"
),
messages: z.array(
z.object({
role: z.enum(["system", "user", "assistant"]),
content: z.string().min(1, "Message content cannot be empty")
})
).min(1, "At least one message is required"),
thinking_config: z.object({
effort: z.enum(["minimal", "low", "medium", "high"]).optional(),
budget_tokens: z.number()
.int("Budget tokens must be an integer")
.min(1024, "Minimum budget is 1024 tokens")
.max(32000, "Maximum budget is 32000 tokens")
.optional()
}).optional()
});
server.setRequestHandler(CallToolRequestSchema, async (request) => {
try {
const validatedArgs = ThinkingToolInputSchema.parse(request.params.arguments);
return await client.createCompletion(validatedArgs);
} catch (error) {
if (error instanceof z.ZodError) {
throw new McpError(
ErrorCode.InvalidParams,
`Invalid tool arguments: ${error.errors.map(e => e.message).join(", ")}`,
{ validationErrors: error.errors }
);
}
throw error;
}
});
```
## Complete Implementation Examples
### Basic OpenRouter MCP Server
```typescript
#!/usr/bin/env node
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { CallToolRequestSchema, ListToolsRequestSchema } from "@modelcontextprotocol/sdk/types.js";
import { z } from "zod";
import OpenAI from "openai";
// Configuration
const config = {
apiKey: process.env.OPENROUTER_API_KEY!,
baseURL: "https://openrouter.ai/api/v1",
defaultModel: "anthropic/claude-3.7-sonnet"
};
// Initialize OpenAI client for OpenRouter
const openai = new OpenAI({
apiKey: config.apiKey,
baseURL: config.baseURL
});
// Input validation schemas
const ThinkingCompletionSchema = z.object({
model: z.string().default(config.defaultModel),
messages: z.array(z.object({
role: z.enum(["system", "user", "assistant"]),
content: z.string()
})),
thinking_config: z.object({
effort: z.enum(["minimal", "low", "medium", "high"]).optional(),
budget_tokens: z.number().int().min(1024).max(32000).optional()
}).optional(),
max_tokens: z.number().int().positive().optional().default(4096),
stream: z.boolean().optional().default(false)
});
// Initialize MCP server (the low-level Server exposes setRequestHandler directly)
const server = new Server(
{
name: "openrouter-thinking-server",
version: "1.0.0"
},
{
capabilities: { tools: {} }
}
);
// Register tools
server.setRequestHandler(ListToolsRequestSchema, async () => ({
tools: [
{
name: "thinking_completion",
description: "Generate responses using thinking models with controlled reasoning",
inputSchema: {
type: "object",
properties: {
model: {
type: "string",
description: "Model to use (default: anthropic/claude-3.7-sonnet)",
default: config.defaultModel
},
messages: {
type: "array",
items: {
type: "object",
properties: {
role: { type: "string", enum: ["system", "user", "assistant"] },
content: { type: "string" }
},
required: ["role", "content"]
},
description: "Array of chat messages"
},
thinking_config: {
type: "object",
properties: {
effort: {
type: "string",
enum: ["minimal", "low", "medium", "high"],
description: "Reasoning effort level"
},
budget_tokens: {
type: "number",
minimum: 1024,
maximum: 32000,
description: "Maximum thinking tokens to use"
}
},
description: "Thinking/reasoning configuration"
},
max_tokens: {
type: "number",
minimum: 1,
default: 4096,
description: "Maximum response tokens"
},
stream: {
type: "boolean",
default: false,
description: "Enable streaming response"
}
},
required: ["messages"]
}
}
]
}));
// Handle tool calls
server.setRequestHandler(CallToolRequestSchema, async (request) => {
if (request.params.name === "thinking_completion") {
try {
const args = ThinkingCompletionSchema.parse(request.params.arguments);
// Build request parameters
const requestParams: any = {
model: args.model,
messages: args.messages,
max_tokens: args.max_tokens,
stream: args.stream
};
// Add provider-specific thinking/reasoning parameters based on the model
// (OpenRouter's unified `reasoning` object is a cross-provider alternative)
if (args.thinking_config) {
if (args.model.includes("anthropic/") && args.thinking_config.budget_tokens) {
requestParams.thinking = {
budget_tokens: args.thinking_config.budget_tokens
};
} else if (args.model.includes("openai/") && args.thinking_config.effort) {
requestParams.reasoning_effort = args.thinking_config.effort;
}
}
// Make API request
const response = await openai.chat.completions.create(requestParams);
// Extract response data
const choice = response.choices[0];
const usage = response.usage;
const result = {
content: choice.message.content,
thinking_content: choice.message.thinking || null,
reasoning_content: choice.message.reasoning || null,
finish_reason: choice.finish_reason,
usage: {
input_tokens: usage?.prompt_tokens || 0,
output_tokens: usage?.completion_tokens || 0,
thinking_tokens: usage?.thinking_tokens || 0,
reasoning_tokens: usage?.reasoning_tokens || 0,
total_tokens: usage?.total_tokens || 0
},
model_used: args.model
};
return {
content: [{
type: "text",
text: JSON.stringify(result, null, 2)
}]
};
} catch (error) {
console.error("Thinking completion error:", error);
return {
content: [{
type: "text",
text: JSON.stringify({
error: "Thinking completion failed",
message: error.message,
type: error.constructor.name
}, null, 2)
}],
isError: true
};
}
}
throw new Error(`Unknown tool: ${request.params.name}`);
});
// Start server
async function main() {
if (!config.apiKey) {
console.error("OPENROUTER_API_KEY environment variable is required");
process.exit(1);
}
const transport = new StdioServerTransport();
await server.connect(transport);
console.error("OpenRouter Thinking MCP Server running on stdio");
}
main().catch((error) => {
console.error("Failed to start server:", error);
process.exit(1);
});
```
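To consume this server from an MCP client, register it as a stdio server in the client's configuration. The snippet below uses the Claude Desktop configuration format as an example; the path and key value are placeholders:
```json
{
  "mcpServers": {
    "openrouter-thinking": {
      "command": "node",
      "args": ["/path/to/openrouter-thinking-mcp-server/dist/server.js"],
      "env": {
        "OPENROUTER_API_KEY": "your-openrouter-api-key"
      }
    }
  }
}
```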
### Package.json Configuration
```json
{
"name": "openrouter-thinking-mcp-server",
"version": "1.0.0",
"description": "Production-grade OpenRouter MCP server with thinking token control",
"main": "dist/server.js",
"type": "module",
"scripts": {
"build": "tsc",
"dev": "tsx watch src/server.ts",
"start": "node dist/server.js",
"test": "vitest",
"test:coverage": "vitest --coverage",
"lint": "eslint src/**/*.ts",
"lint:fix": "eslint src/**/*.ts --fix",
"typecheck": "tsc --noEmit"
},
"dependencies": {
"@modelcontextprotocol/sdk": "^1.0.0",
"@hono/node-server": "^1.8.2",
"hono": "^4.0.0",
"openai": "^5.10.2",
"zod": "^3.25.76",
"winston": "^3.11.0",
"dotenv": "^16.3.1"
},
"devDependencies": {
"@types/node": "^20.11.0",
"@typescript-eslint/eslint-plugin": "^6.19.0",
"@typescript-eslint/parser": "^6.19.0",
"eslint": "^8.56.0",
"tsx": "^4.7.0",
"typescript": "^5.3.3",
"vitest": "^1.2.0",
"@vitest/coverage-v8": "^1.2.0"
},
"engines": {
"node": ">=18.0.0"
},
"keywords": [
"mcp",
"openrouter",
"thinking-models",
"ai",
"typescript"
]
}
```
## References
### Official Documentation
- [OpenRouter API Documentation](https://openrouter.ai/docs/api-reference/overview)
- [Model Context Protocol Specification](https://modelcontextprotocol.io/specification/2025-06-18)
- [OpenRouter MCP Integration Guide](https://openrouter.ai/docs/use-cases/mcp-servers)
- [Anthropic Extended Thinking Guide](https://docs.anthropic.com/en/docs/build-with-claude/extended-thinking)
- [DeepSeek R1 API Documentation](https://api-docs.deepseek.com/guides/reasoning_model)
### GitHub Repositories
- [MCP TypeScript SDK](https://github.com/modelcontextprotocol/typescript-sdk)
- [MCP Production Template](https://github.com/cyanheads/mcp-ts-template)
- [OpenRouter MCP Multimodal](https://github.com/stabgan/openrouter-mcp-multimodal)
- [MCP Server Examples](https://github.com/modelcontextprotocol/servers)
### Technical Resources
- [OpenRouter Reasoning Tokens](https://openrouter.ai/docs/use-cases/reasoning-tokens)
- [OpenRouter Streaming API](https://openrouter.ai/docs/api-reference/streaming)
- [Zod Schema Validation](https://zod.dev/)
- [TypeScript Best Practices](https://www.typescriptlang.org/docs/)
### Cost Optimization
- [OpenRouter Pricing](https://openrouter.ai/docs/api-reference/limits)
- [Model Performance Analysis](https://artificialanalysis.ai/models)
- [Provider Routing Options](https://openrouter.ai/docs/features/provider-routing)
### Community Resources
- [MCP Community Examples](https://lobehub.com/mcp)
- [OpenRouter Community](https://discord.gg/openrouter)
- [Stack Overflow MCP Tag](https://stackoverflow.com/questions/tagged/model-context-protocol)
---
*This guide represents best practices as of August 2025. For the most current information, always consult the official documentation and community resources.*