llm_benchmark
Runs a benchmark over multiple prompts to evaluate LLM performance metrics, including response latency and token throughput, for model comparison and testing.
Instructions
Runs a benchmark with multiple prompts to evaluate model performance.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| baseURL | No | URL of the OpenAI-compatible server (e.g. http://localhost:1234/v1, http://localhost:11434/v1) | |
| apiKey | No | API key (required for OpenAI/Azure, optional for local servers) | |
| prompts | Yes | List of prompts to run in the benchmark | |
| model | No | Model ID | |
| maxTokens | No | Max tokens per response | 256 |
| temperature | No | Sampling temperature | 0.7 |
| topP | No | Top-p value for nucleus sampling | |
| runs | No | Number of runs per prompt | 1 |
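For orientation, a call might pass arguments like the following. Every value here is illustrative: the endpoint matches one of the example URLs in the table above, and the model ID is hypothetical.

```typescript
// Illustrative llm_benchmark arguments; the endpoint and model ID are
// placeholders, not values the tool prescribes.
const benchmarkArgs = {
  baseURL: "http://localhost:11434/v1", // e.g. a local Ollama-compatible endpoint
  prompts: [
    "Summarize the TCP three-way handshake in two sentences.",
    "Write a haiku about garbage collection.",
  ],
  model: "llama3", // hypothetical model ID; depends on what the server serves
  maxTokens: 128,  // overrides the default of 256
  runs: 3,         // each prompt is executed three times
  // temperature defaults to 0.7; topP is omitted, so the server default applies
};
```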
Implementation Reference
- `src/tools.ts:306-332` (handler): Main handler for the `llm_benchmark` tool. It accepts arguments typed by `BenchmarkSchema`, obtains an `LLMClient` via `getClient`, runs the benchmark through `client.runBenchmark`, and formats the results into a markdown report with summary statistics and detailed per-prompt metrics. (A hedged sketch of how those metrics might be computed follows this list.)

  ```typescript
  async llm_benchmark(args: z.infer<typeof BenchmarkSchema>) {
    const client = getClient(args);
    // note: args.topP is accepted by the schema but not forwarded here
    const { results, summary } = await client.runBenchmark(args.prompts, {
      model: args.model,
      maxTokens: args.maxTokens,
      temperature: args.temperature,
      runs: args.runs,
    });

    let output = `# 📊 Benchmark Results\n\n`;
    output += `## Resumen\n`;
    output += `- **Prompts totales:** ${summary.totalPrompts}\n`;
    output += `- **Latencia promedio:** ${summary.avgLatencyMs.toFixed(2)} ms\n`;
    output += `- **Tokens/segundo promedio:** ${summary.avgTokensPerSecond.toFixed(2)}\n`;
    output += `- **Total tokens generados:** ${summary.totalTokensGenerated}\n\n`;
    output += `## Resultados Detallados\n\n`;

    results.forEach((r, i) => {
      output += `### Prompt ${i + 1}\n`;
      output += `> ${r.prompt.substring(0, 100)}${r.prompt.length > 100 ? "..." : ""}\n\n`;
      output += `- Latencia: ${r.latencyMs} ms\n`;
      output += `- Tokens: ${r.completionTokens}\n`;
      output += `- Velocidad: ${r.tokensPerSecond.toFixed(2)} tok/s\n\n`;
    });

    return { content: [{ type: "text" as const, text: output }] };
  },
  ```
- `src/tools.ts:31-38` (schema): Zod schema defining the input parameters for the `llm_benchmark` tool: a required `prompts` array plus optional model ID, max-token limit, temperature, top-p, and run count.

  ```typescript
  export const BenchmarkSchema = ConnectionConfigSchema.extend({
    prompts: z.array(z.string()).describe("Lista de prompts para el benchmark"),
    model: z.string().optional().describe("ID del modelo a usar"),
    maxTokens: z.number().optional().default(256).describe("Máximo de tokens por respuesta"),
    temperature: z.number().optional().default(0.7).describe("Temperatura"),
    topP: z.number().optional().describe("Top P para nucleus sampling"),
    runs: z.number().optional().default(1).describe("Número de ejecuciones por prompt"),
  });
  ```
- `src/tools.ts:116-136` (registration): Entry in the exported `tools` array declaring the name, description, and `inputSchema` that `ListToolsRequest` responses advertise for `llm_benchmark`.

  ```typescript
  {
    name: "llm_benchmark",
    description: "Ejecuta un benchmark con múltiples prompts para evaluar rendimiento del modelo",
    inputSchema: {
      type: "object" as const,
      properties: {
        ...connectionProperties,
        prompts: {
          type: "array",
          items: { type: "string" },
          description: "Lista de prompts para el benchmark",
        },
        model: { type: "string", description: "ID del modelo" },
        maxTokens: { type: "number", description: "Max tokens por respuesta (default: 256)" },
        temperature: { type: "number", description: "Temperatura (default: 0.7)" },
        topP: { type: "number", description: "Top P para nucleus sampling" },
        runs: { type: "number", description: "Ejecuciones por prompt (default: 1)" },
      },
      required: ["prompts"],
    },
  },
  ```
- `src/index.ts:64-65` (dispatch): Case in the `CallToolRequest` handler's switch statement that routes execution to the `llm_benchmark` handler.

  ```typescript
  case "llm_benchmark":
    return await toolHandlers.llm_benchmark(args as any);
  ```
- `src/index.ts:42-44` (registration): `ListToolsRequest` handler that returns the `tools` array containing the `llm_benchmark` definition. (An end-to-end invocation sketch follows this list.)

  ```typescript
  server.setRequestHandler(ListToolsRequestSchema, async () => {
    return { tools };
  });
  ```
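`LLMClient.runBenchmark` itself is not quoted above. As a minimal sketch, assuming the client wraps the official `openai` SDK and times each chat-completion request, the per-prompt fields the handler reads (`latencyMs`, `completionTokens`, `tokensPerSecond`) and the summary keys it prints could be derived as follows. Only those field names come from the handler; the function shapes and SDK usage are assumptions.

```typescript
import OpenAI from "openai";

// Hedged sketch: one way to produce the per-prompt metrics the handler reads.
async function benchmarkOne(client: OpenAI, model: string, prompt: string, maxTokens: number) {
  const start = Date.now();
  const res = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
    max_tokens: maxTokens,
  });
  const latencyMs = Date.now() - start;
  const completionTokens = res.usage?.completion_tokens ?? 0;
  return {
    prompt,
    latencyMs,
    completionTokens,
    // throughput over the whole request, in tokens per second
    tokensPerSecond: latencyMs > 0 ? completionTokens / (latencyMs / 1000) : 0,
  };
}

// Aggregation matching the summary fields the handler prints (assumed shape).
function summarize(results: Awaited<ReturnType<typeof benchmarkOne>>[]) {
  return {
    totalPrompts: results.length,
    avgLatencyMs: results.reduce((s, r) => s + r.latencyMs, 0) / results.length,
    avgTokensPerSecond: results.reduce((s, r) => s + r.tokensPerSecond, 0) / results.length,
    totalTokensGenerated: results.reduce((s, r) => s + r.completionTokens, 0),
  };
}
```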
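To see the tool end to end, a minimal MCP client over stdio might invoke it as below. This is a sketch under assumptions: the server's built entry point (`dist/index.js`) is a guess about this repo's build output, and the client side uses the `@modelcontextprotocol/sdk` client API rather than anything quoted above.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Assumed entry point; adjust command/args to the server's actual build output.
const transport = new StdioClientTransport({ command: "node", args: ["dist/index.js"] });
const client = new Client({ name: "benchmark-caller", version: "1.0.0" });
await client.connect(transport);

const result = await client.callTool({
  name: "llm_benchmark",
  arguments: {
    baseURL: "http://localhost:1234/v1", // e.g. a local LM Studio server
    prompts: ["Explain nucleus sampling in one paragraph."],
    runs: 2,
  },
});

// The handler returns a single text content item: the markdown report.
console.log(result.content);
```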