gepa_evaluate_prompt

Evaluate AI prompt performance across multiple tasks to measure effectiveness and identify improvement areas for optimization.

Instructions

Evaluate prompt candidate performance across multiple tasks

Input Schema

Name          Required  Description                                        Default
promptId      Yes       Unique identifier for the prompt to evaluate
taskIds       Yes       List of task IDs to evaluate the prompt against
rolloutCount  No        Number of evaluation rollouts per task             5
parallel      No        Whether to run evaluations in parallel             true
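
For illustration, a client-side call to this tool might look like the sketch below. The promptId and taskIds values are placeholders, and the snippet assumes an already-connected Client instance from the MCP TypeScript SDK (@modelcontextprotocol/sdk); it is a usage sketch, not part of the server's source.

    // Hypothetical invocation sketch; `client` is assumed to be a connected
    // MCP Client instance created elsewhere.
    const result = await client.callTool({
      name: 'gepa_evaluate_prompt',
      arguments: {
        promptId: 'prompt_candidate_001',            // placeholder prompt ID
        taskIds: ['task_summarization', 'task_qa'],  // placeholder task IDs
        rolloutCount: 5,                             // optional, defaults to 5
        parallel: true,                              // optional, defaults to true
      },
    });
    // The tool responds with a single text content item containing a
    // Markdown evaluation report (see the handler under Implementation Reference).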

Implementation Reference

  • The primary handler function that implements the core logic of the 'gepa_evaluate_prompt' MCP tool. It performs simulated evaluations across multiple tasks with configurable rollouts, computes aggregate performance metrics (success rate, average score, execution time), and integrates the results into the Pareto frontier for prompt optimization.
    private async evaluatePrompt(params: EvaluatePromptParams): Promise<{ content: { type: string; text: string; }[]; }> {
      const { promptId, taskIds, rolloutCount = 5, parallel = true } = params;

      // Validate required parameters
      if (!promptId || !taskIds || taskIds.length === 0) {
        throw new Error('promptId and taskIds are required');
      }

      const evaluationId = `eval_${Date.now()}_${Math.random().toString(36).substring(7)}`;
      const totalEvaluations = taskIds.length * rolloutCount;

      try {
        // Simulate prompt evaluation process
        const evaluationResults: any[] = [];
        for (const taskId of taskIds) {
          for (let rollout = 0; rollout < rolloutCount; rollout++) {
            const rolloutResult = {
              taskId,
              rollout: rollout + 1,
              success: Math.random() > 0.2, // 80% success rate simulation
              score: Math.random() * 0.5 + 0.5, // Score between 0.5-1.0
              executionTime: Math.random() * 2000 + 500, // 500-2500ms
              details: `Rollout ${rollout + 1} for task ${taskId}`,
            };
            evaluationResults.push(rolloutResult);
          }
        }

        // Calculate aggregate metrics
        const successfulEvaluations = evaluationResults.filter(r => r.success);
        const successRate = successfulEvaluations.length / totalEvaluations;
        const averageScore = successfulEvaluations.reduce((sum, r) => sum + r.score, 0) / successfulEvaluations.length;
        const averageExecutionTime = evaluationResults.reduce((sum, r) => sum + r.executionTime, 0) / evaluationResults.length;

        // Update prompt candidate in Pareto frontier
        const candidate: GEPAPromptCandidate = {
          id: promptId,
          content: '', // Will be retrieved if needed
          generation: 0,
          taskPerformance: new Map(taskIds.map(taskId => [taskId, averageScore])),
          averageScore: averageScore * successRate, // Combined fitness score
          rolloutCount: totalEvaluations,
          createdAt: new Date(),
          lastEvaluated: new Date(),
          mutationType: 'initial',
        };
        this.paretoFrontier.addCandidate(candidate);

        return {
          content: [
            {
              type: 'text',
              text: `# Prompt Evaluation Complete

## Evaluation Details
- **Evaluation ID**: ${evaluationId}
- **Prompt ID**: ${promptId}
- **Tasks Evaluated**: ${taskIds.length}
- **Rollouts per Task**: ${rolloutCount}
- **Total Evaluations**: ${totalEvaluations}
- **Parallel Execution**: ${parallel ? 'Yes' : 'No'}

## Performance Metrics
- **Success Rate**: ${(successRate * 100).toFixed(1)}%
- **Average Score**: ${averageScore.toFixed(3)}
- **Combined Fitness**: ${(averageScore * successRate).toFixed(3)}
- **Average Execution Time**: ${averageExecutionTime.toFixed(0)}ms

## Task Breakdown
${taskIds.map(taskId => {
  const taskResults = evaluationResults.filter(r => r.taskId === taskId);
  const taskSuccessRate = taskResults.filter(r => r.success).length / taskResults.length;
  const taskAvgScore = taskResults.filter(r => r.success).reduce((sum, r) => sum + r.score, 0) / taskResults.filter(r => r.success).length || 0;
  return `- **${taskId}**: ${(taskSuccessRate * 100).toFixed(1)}% success, ${taskAvgScore.toFixed(3)} avg score`;
}).join('\n')}

✨ Candidate updated in Pareto frontier with performance metrics.`,
            },
          ],
        };
      } catch (error) {
        throw new Error(`Failed to evaluate prompt: ${error instanceof Error ? error.message : 'Unknown error'}`);
      }
    }
  • MCP input schema and metadata definition for the 'gepa_evaluate_prompt' tool, used for tool discovery (listTools) and input validation.
    name: 'gepa_evaluate_prompt',
    description: 'Evaluate prompt candidate performance across multiple tasks',
    inputSchema: {
      type: 'object',
      properties: {
        promptId: { type: 'string', description: 'Unique identifier for the prompt to evaluate' },
        taskIds: { type: 'array', items: { type: 'string' }, description: 'List of task IDs to evaluate the prompt against' },
        rolloutCount: { type: 'number', default: 5, description: 'Number of evaluation rollouts per task' },
        parallel: { type: 'boolean', default: true, description: 'Whether to run evaluations in parallel' }
      },
      required: ['promptId', 'taskIds']
    }
  • Registration and dispatch logic in the central MCP tool request handler switch statement, mapping tool calls to the evaluatePrompt handler method (a hypothetical sketch of the surrounding wiring appears after this list).
    case 'gepa_evaluate_prompt': return await this.evaluatePrompt(args as unknown as EvaluatePromptParams);
  • TypeScript interface defining the typed parameters for the evaluatePrompt handler function.
    export interface EvaluatePromptParams { promptId: string; taskIds: string[]; rolloutCount?: number; parallel?: boolean; }
  • Type-safe constant defining the tool name string for use in code references.
    EVALUATE_PROMPT: 'gepa_evaluate_prompt',
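
Taken together, the schema definition and dispatch case above plug into the MCP server roughly as in the following sketch. This is a hypothetical reconstruction for orientation only: the class name PromptOptimizerServer, the server metadata, and the stubbed evaluatePrompt body are assumptions; the handler registration pattern (ListToolsRequestSchema / CallToolRequestSchema) follows the standard @modelcontextprotocol/sdk approach.

    import { Server } from '@modelcontextprotocol/sdk/server/index.js';
    import {
      ListToolsRequestSchema,
      CallToolRequestSchema,
    } from '@modelcontextprotocol/sdk/types.js';

    // Parameter interface as shown above.
    interface EvaluatePromptParams { promptId: string; taskIds: string[]; rolloutCount?: number; parallel?: boolean; }

    class PromptOptimizerServer { // hypothetical class name
      private server = new Server(
        { name: 'prompt-auto-optimizer', version: '0.1.0' }, // assumed server metadata
        { capabilities: { tools: {} } },
      );

      constructor() {
        // Tool discovery: expose the 'gepa_evaluate_prompt' metadata and input schema.
        this.server.setRequestHandler(ListToolsRequestSchema, async () => ({
          tools: [
            {
              name: 'gepa_evaluate_prompt',
              description: 'Evaluate prompt candidate performance across multiple tasks',
              inputSchema: {
                type: 'object',
                properties: { /* promptId, taskIds, rolloutCount, parallel as defined above */ },
                required: ['promptId', 'taskIds'],
              },
            },
          ],
        }));

        // Tool invocation: route calls by name to the matching handler method.
        this.server.setRequestHandler(CallToolRequestSchema, async (request) => {
          const { name, arguments: args } = request.params;
          switch (name) {
            case 'gepa_evaluate_prompt':
              return await this.evaluatePrompt(args as unknown as EvaluatePromptParams);
            default:
              throw new Error(`Unknown tool: ${name}`);
          }
        });
      }

      // Stub standing in for the full handler shown in the first bullet above.
      private async evaluatePrompt(params: EvaluatePromptParams) {
        return { content: [{ type: 'text', text: `Evaluated prompt ${params.promptId}` }] };
      }
    }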
