gepa_evaluate_prompt

Evaluate AI prompt performance across multiple tasks to measure effectiveness and identify improvement areas for optimization.

Instructions

Evaluate prompt candidate performance across multiple tasks

Input Schema

Name          Required  Description                                        Default
promptId      Yes       Unique identifier for the prompt to evaluate
taskIds       Yes       List of task IDs to evaluate the prompt against
rolloutCount  No        Number of evaluation rollouts per task             5
parallel      No        Whether to run evaluations in parallel             true
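
For illustration, a client-side call to this tool might look like the sketch below. The promptId and taskIds values are placeholders, and the snippet assumes an already-connected Client instance from the MCP TypeScript SDK (@modelcontextprotocol/sdk); it is a usage sketch, not part of the server's source.

    // Hypothetical invocation sketch; `client` is assumed to be a connected
    // MCP Client instance created elsewhere.
    const result = await client.callTool({
      name: 'gepa_evaluate_prompt',
      arguments: {
        promptId: 'prompt_candidate_001',            // placeholder prompt ID
        taskIds: ['task_summarization', 'task_qa'],  // placeholder task IDs
        rolloutCount: 5,                             // optional, defaults to 5
        parallel: true,                              // optional, defaults to true
      },
    });
    // The tool responds with a single text content item containing a
    // Markdown evaluation report (see the handler under Implementation Reference).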

Implementation Reference

  • The primary handler function that implements the core logic of the 'gepa_evaluate_prompt' MCP tool. It performs simulated evaluations across multiple tasks with configurable rollouts, computes aggregate performance metrics (success rate, average score, execution time), and integrates the results into the Pareto frontier for prompt optimization.
    private async evaluatePrompt(params: EvaluatePromptParams): Promise<{ content: { type: string; text: string; }[]; }> {
      const { promptId, taskIds, rolloutCount = 5, parallel = true } = params;

      // Validate required parameters
      if (!promptId || !taskIds || taskIds.length === 0) {
        throw new Error('promptId and taskIds are required');
      }

      const evaluationId = `eval_${Date.now()}_${Math.random().toString(36).substring(7)}`;
      const totalEvaluations = taskIds.length * rolloutCount;

      try {
        // Simulate prompt evaluation process
        const evaluationResults: any[] = [];
        for (const taskId of taskIds) {
          for (let rollout = 0; rollout < rolloutCount; rollout++) {
            const rolloutResult = {
              taskId,
              rollout: rollout + 1,
              success: Math.random() > 0.2, // 80% success rate simulation
              score: Math.random() * 0.5 + 0.5, // Score between 0.5-1.0
              executionTime: Math.random() * 2000 + 500, // 500-2500ms
              details: `Rollout ${rollout + 1} for task ${taskId}`,
            };
            evaluationResults.push(rolloutResult);
          }
        }

        // Calculate aggregate metrics
        const successfulEvaluations = evaluationResults.filter(r => r.success);
        const successRate = successfulEvaluations.length / totalEvaluations;
        const averageScore = successfulEvaluations.reduce((sum, r) => sum + r.score, 0) / successfulEvaluations.length;
        const averageExecutionTime = evaluationResults.reduce((sum, r) => sum + r.executionTime, 0) / evaluationResults.length;

        // Update prompt candidate in Pareto frontier
        const candidate: GEPAPromptCandidate = {
          id: promptId,
          content: '', // Will be retrieved if needed
          generation: 0,
          taskPerformance: new Map(taskIds.map(taskId => [taskId, averageScore])),
          averageScore: averageScore * successRate, // Combined fitness score
          rolloutCount: totalEvaluations,
          createdAt: new Date(),
          lastEvaluated: new Date(),
          mutationType: 'initial',
        };
        this.paretoFrontier.addCandidate(candidate);

        return {
          content: [
            {
              type: 'text',
              text: `# Prompt Evaluation Complete

## Evaluation Details
- **Evaluation ID**: ${evaluationId}
- **Prompt ID**: ${promptId}
- **Tasks Evaluated**: ${taskIds.length}
- **Rollouts per Task**: ${rolloutCount}
- **Total Evaluations**: ${totalEvaluations}
- **Parallel Execution**: ${parallel ? 'Yes' : 'No'}

## Performance Metrics
- **Success Rate**: ${(successRate * 100).toFixed(1)}%
- **Average Score**: ${averageScore.toFixed(3)}
- **Combined Fitness**: ${(averageScore * successRate).toFixed(3)}
- **Average Execution Time**: ${averageExecutionTime.toFixed(0)}ms

## Task Breakdown
${taskIds.map(taskId => {
  const taskResults = evaluationResults.filter(r => r.taskId === taskId);
  const taskSuccessRate = taskResults.filter(r => r.success).length / taskResults.length;
  const taskAvgScore = taskResults.filter(r => r.success).reduce((sum, r) => sum + r.score, 0) / taskResults.filter(r => r.success).length || 0;
  return `- **${taskId}**: ${(taskSuccessRate * 100).toFixed(1)}% success, ${taskAvgScore.toFixed(3)} avg score`;
}).join('\n')}

✨ Candidate updated in Pareto frontier with performance metrics.`,
            },
          ],
        };
      } catch (error) {
        throw new Error(`Failed to evaluate prompt: ${error instanceof Error ? error.message : 'Unknown error'}`);
      }
    }
  • MCP input schema and metadata definition for the 'gepa_evaluate_prompt' tool, used for tool discovery (listTools) and input validation.
    name: 'gepa_evaluate_prompt',
    description: 'Evaluate prompt candidate performance across multiple tasks',
    inputSchema: {
      type: 'object',
      properties: {
        promptId: { type: 'string', description: 'Unique identifier for the prompt to evaluate' },
        taskIds: { type: 'array', items: { type: 'string' }, description: 'List of task IDs to evaluate the prompt against' },
        rolloutCount: { type: 'number', default: 5, description: 'Number of evaluation rollouts per task' },
        parallel: { type: 'boolean', default: true, description: 'Whether to run evaluations in parallel' }
      },
      required: ['promptId', 'taskIds']
    }
  • Registration and dispatch logic in the central MCP tool request handler switch statement, mapping tool calls to the evaluatePrompt handler method (a hypothetical sketch of the surrounding wiring appears after this list).
    case 'gepa_evaluate_prompt': return await this.evaluatePrompt(args as unknown as EvaluatePromptParams);
  • TypeScript interface defining the typed parameters for the evaluatePrompt handler function.
    export interface EvaluatePromptParams { promptId: string; taskIds: string[]; rolloutCount?: number; parallel?: boolean; }
  • Type-safe constant defining the tool name string for use in code references.
    EVALUATE_PROMPT: 'gepa_evaluate_prompt',
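
Taken together, the schema definition and dispatch case above plug into the MCP server roughly as in the following sketch. This is a hypothetical reconstruction for orientation only: the class name PromptOptimizerServer, the server metadata, and the stubbed evaluatePrompt body are assumptions; the handler registration pattern (ListToolsRequestSchema / CallToolRequestSchema) follows the standard @modelcontextprotocol/sdk approach.

    import { Server } from '@modelcontextprotocol/sdk/server/index.js';
    import {
      ListToolsRequestSchema,
      CallToolRequestSchema,
    } from '@modelcontextprotocol/sdk/types.js';

    // Parameter interface as shown above.
    interface EvaluatePromptParams { promptId: string; taskIds: string[]; rolloutCount?: number; parallel?: boolean; }

    class PromptOptimizerServer { // hypothetical class name
      private server = new Server(
        { name: 'prompt-auto-optimizer', version: '0.1.0' }, // assumed server metadata
        { capabilities: { tools: {} } },
      );

      constructor() {
        // Tool discovery: expose the 'gepa_evaluate_prompt' metadata and input schema.
        this.server.setRequestHandler(ListToolsRequestSchema, async () => ({
          tools: [
            {
              name: 'gepa_evaluate_prompt',
              description: 'Evaluate prompt candidate performance across multiple tasks',
              inputSchema: {
                type: 'object',
                properties: { /* promptId, taskIds, rolloutCount, parallel as defined above */ },
                required: ['promptId', 'taskIds'],
              },
            },
          ],
        }));

        // Tool invocation: route calls by name to the matching handler method.
        this.server.setRequestHandler(CallToolRequestSchema, async (request) => {
          const { name, arguments: args } = request.params;
          switch (name) {
            case 'gepa_evaluate_prompt':
              return await this.evaluatePrompt(args as unknown as EvaluatePromptParams);
            default:
              throw new Error(`Unknown tool: ${name}`);
          }
        });
      }

      // Stub standing in for the full handler shown in the first bullet above.
      private async evaluatePrompt(params: EvaluatePromptParams) {
        return { content: [{ type: 'text', text: `Evaluated prompt ${params.promptId}` }] };
      }
    }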
