
Prompt Auto-Optimizer MCP

by sloth-wq

gepa_evaluate_prompt

Evaluate AI prompt performance across multiple tasks to measure effectiveness and identify improvement areas for optimization.

Instructions

Evaluate prompt candidate performance across multiple tasks

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| promptId | Yes | Unique identifier for the prompt to evaluate | |
| taskIds | Yes | List of task IDs to evaluate the prompt against | |
| rolloutCount | No | Number of evaluation rollouts per task | 5 |
| parallel | No | Whether to run evaluations in parallel | true |
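
For example, an agent invokes the tool through an ordinary MCP tools/call request. The sketch below only restates the schema above; the promptId and task ID values are placeholders, not identifiers that exist on the server.

    // Sketch of a tools/call request for this tool. The identifiers shown
    // (promptId, task IDs) are placeholders.
    const request = {
      jsonrpc: "2.0",
      id: 1,
      method: "tools/call",
      params: {
        name: "gepa_evaluate_prompt",
        arguments: {
          promptId: "prompt_123",          // required
          taskIds: ["task_a", "task_b"],   // required, at least one entry
          rolloutCount: 5,                 // optional, default 5
          parallel: true,                  // optional, default true
        },
      },
    };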

Implementation Reference

  • The primary handler that implements the core logic of the 'gepa_evaluate_prompt' MCP tool. It runs simulated evaluations across multiple tasks with a configurable number of rollouts, computes aggregate performance metrics (success rate, average score, execution time), and folds the results into the Pareto frontier used for prompt optimization (an illustrative sketch of that frontier update appears after this reference list).
      private async evaluatePrompt(params: EvaluatePromptParams): Promise<{
        content: { type: string; text: string; }[];
      }> {
        const { promptId, taskIds, rolloutCount = 5, parallel = true } = params;
    
        // Validate required parameters
        if (!promptId || !taskIds || taskIds.length === 0) {
          throw new Error('promptId and taskIds are required');
        }
    
        const evaluationId = `eval_${Date.now()}_${Math.random().toString(36).substring(7)}`;
        const totalEvaluations = taskIds.length * rolloutCount;
    
        try {
          // Simulate prompt evaluation process
          const evaluationResults: any[] = [];
          
          for (const taskId of taskIds) {
            for (let rollout = 0; rollout < rolloutCount; rollout++) {
              const rolloutResult = {
                taskId,
                rollout: rollout + 1,
                success: Math.random() > 0.2, // 80% success rate simulation
                score: Math.random() * 0.5 + 0.5, // Score between 0.5-1.0
                executionTime: Math.random() * 2000 + 500, // 500-2500ms
                details: `Rollout ${rollout + 1} for task ${taskId}`,
              };
              evaluationResults.push(rolloutResult);
            }
          }
    
          // Calculate aggregate metrics
          const successfulEvaluations = evaluationResults.filter(r => r.success);
          const successRate = successfulEvaluations.length / totalEvaluations;
          const averageScore = successfulEvaluations.reduce((sum, r) => sum + r.score, 0) / successfulEvaluations.length;
          const averageExecutionTime = evaluationResults.reduce((sum, r) => sum + r.executionTime, 0) / evaluationResults.length;
    
          // Update prompt candidate in Pareto frontier
          const candidate: GEPAPromptCandidate = {
            id: promptId,
            content: '', // Will be retrieved if needed
            generation: 0,
            taskPerformance: new Map(taskIds.map(taskId => [taskId, averageScore])),
            averageScore: averageScore * successRate, // Combined fitness score
            rolloutCount: totalEvaluations,
            createdAt: new Date(),
            lastEvaluated: new Date(),
            mutationType: 'initial',
          };
    
          this.paretoFrontier.addCandidate(candidate);
    
          return {
            content: [
              {
                type: 'text',
                text: `# Prompt Evaluation Complete
    
    ## Evaluation Details
    - **Evaluation ID**: ${evaluationId}
    - **Prompt ID**: ${promptId}
    - **Tasks Evaluated**: ${taskIds.length}
    - **Rollouts per Task**: ${rolloutCount}
    - **Total Evaluations**: ${totalEvaluations}
    - **Parallel Execution**: ${parallel ? 'Yes' : 'No'}
    
    ## Performance Metrics
    - **Success Rate**: ${(successRate * 100).toFixed(1)}%
    - **Average Score**: ${averageScore.toFixed(3)}
    - **Combined Fitness**: ${(averageScore * successRate).toFixed(3)}
    - **Average Execution Time**: ${averageExecutionTime.toFixed(0)}ms
    
    ## Task Breakdown
    ${taskIds.map(taskId => {
      const taskResults = evaluationResults.filter(r => r.taskId === taskId);
      const taskSuccessRate = taskResults.filter(r => r.success).length / taskResults.length;
      const taskAvgScore = taskResults.filter(r => r.success).reduce((sum, r) => sum + r.score, 0) / taskResults.filter(r => r.success).length || 0;
      return `- **${taskId}**: ${(taskSuccessRate * 100).toFixed(1)}% success, ${taskAvgScore.toFixed(3)} avg score`;
    }).join('\n')}
    
    ✨ Candidate updated in Pareto frontier with performance metrics.`,
              },
            ],
          };
        } catch (error) {
          throw new Error(`Failed to evaluate prompt: ${error instanceof Error ? error.message : 'Unknown error'}`);
        }
      }
  • MCP input schema and metadata definition for the 'gepa_evaluate_prompt' tool, used for tool discovery (listTools) and input validation.
    name: 'gepa_evaluate_prompt',
    description: 'Evaluate prompt candidate performance across multiple tasks',
    inputSchema: {
      type: 'object',
      properties: {
        promptId: {
          type: 'string',
          description: 'Unique identifier for the prompt to evaluate'
        },
        taskIds: {
          type: 'array',
          items: { type: 'string' },
          description: 'List of task IDs to evaluate the prompt against'
        },
        rolloutCount: {
          type: 'number',
          default: 5,
          description: 'Number of evaluation rollouts per task'
        },
        parallel: {
          type: 'boolean',
          default: true,
          description: 'Whether to run evaluations in parallel'
        }
      },
      required: ['promptId', 'taskIds']
    }
  • Registration and dispatch logic in the central MCP tool request handler switch statement, mapping tool calls to the evaluatePrompt handler method.
    case 'gepa_evaluate_prompt':
      return await this.evaluatePrompt(args as unknown as EvaluatePromptParams);
  • TypeScript interface defining the typed parameters for the evaluatePrompt handler function.
    export interface EvaluatePromptParams {
      promptId: string;
      taskIds: string[];
      rolloutCount?: number;
      parallel?: boolean;
    }
  • Type-safe constant defining the tool name string for use in code references.
    EVALUATE_PROMPT: 'gepa_evaluate_prompt',
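
The ParetoFrontier class that the handler calls into is not included above. As a rough illustration of what its addCandidate step typically involves, here is a minimal dominance check over per-task scores; this is a hypothetical sketch, not the server's actual implementation, and CandidateLike is a simplified stand-in for GEPAPromptCandidate.

    // Hypothetical sketch only: one way a Pareto frontier could decide whether
    // to keep a newly evaluated candidate. The server's ParetoFrontier class
    // may implement this differently.
    interface CandidateLike {
      id: string;
      taskPerformance: Map<string, number>; // taskId -> average score
    }

    // a dominates b if a scores at least as well on every task b was scored on,
    // and strictly better on at least one of them.
    function dominates(a: CandidateLike, b: CandidateLike): boolean {
      let strictlyBetter = false;
      for (const [taskId, scoreB] of b.taskPerformance) {
        const scoreA = a.taskPerformance.get(taskId) ?? 0;
        if (scoreA < scoreB) return false;
        if (scoreA > scoreB) strictlyBetter = true;
      }
      return strictlyBetter;
    }

    // Keep the candidate only if nothing on the frontier dominates it, and
    // evict any existing members the new candidate now dominates.
    function addCandidate(frontier: CandidateLike[], candidate: CandidateLike): CandidateLike[] {
      if (frontier.some(existing => dominates(existing, candidate))) {
        return frontier;
      }
      return [...frontier.filter(existing => !dominates(candidate, existing)), candidate];
    }
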
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It mentions evaluating performance but fails to detail critical aspects like whether this is a read-only analysis or a mutating operation, potential side effects (e.g., logging results), performance implications (e.g., time-intensive due to multiple tasks), or error handling. This leaves significant gaps in understanding the tool's behavior.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
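
For instance, the optional annotations block defined by the MCP specification could carry much of this disclosure. The values below are a sketch inferred from the handler shown above (which does mutate server state by writing to the Pareto frontier); they are not annotations the server actually publishes.

    // Sketch only: MCP ToolAnnotations fields this definition could declare.
    // The values are inferred from the handler's observed behavior, not taken
    // from the server.
    const evaluatePromptAnnotations = {
      title: 'Evaluate Prompt Candidate',
      readOnlyHint: false,     // evaluation results are written to the Pareto frontier
      destructiveHint: false,  // evaluation does not delete prompts or tasks
      idempotentHint: false,   // each call records a fresh set of rollout results
      openWorldHint: false,    // operates only on the server's own prompts and tasks
    };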

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is a single, efficient sentence that states the tool's purpose directly, front-loading the core action so it is easy to grasp. It could still hint at key parameters or outcomes to add a little more clarity.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the complexity of evaluating prompts across tasks, the description is incomplete. With no annotations and no output schema, it fails to explain what the evaluation entails (e.g., metrics used, return format like scores or reports), or how it integrates with sibling tools. This lack of context makes it inadequate for understanding the tool's full scope and usage.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The schema description coverage is 100%, meaning all parameters are documented in the input schema. The description adds no additional meaning beyond the schema, such as explaining how 'promptId' relates to other tools or what 'taskIds' represent in context. Since the schema handles the heavy lifting, a baseline score of 3 is appropriate, but the description doesn't compensate with extra insights.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
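
One low-cost improvement would be richer schema metadata for the numeric parameter. In the sketch below, the minimum and maximum bounds are assumptions chosen for illustration, not limits the server documents; the total-evaluations relationship comes from the handler above.

    // Illustrative only: a tighter rolloutCount entry. The minimum/maximum
    // values are assumptions for the sketch, not documented server limits.
    const rolloutCountSchema = {
      type: 'number',
      default: 5,
      minimum: 1,    // assumed lower bound: at least one rollout per task
      maximum: 50,   // assumed upper bound to cap taskIds.length * rolloutCount
      description:
        'Number of evaluation rollouts per task; total evaluations = taskIds.length * rolloutCount',
    };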

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the action ('evaluate') and the target ('prompt candidate performance across multiple tasks'), making the purpose understandable. However, it doesn't differentiate this tool from its siblings like 'gepa_select_optimal' or 'gepa_reflect', which might also involve evaluation or selection processes, leaving room for confusion about its unique role.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives, such as 'gepa_select_optimal' for choosing prompts or 'gepa_reflect' for analysis. It lacks context on prerequisites, like needing existing prompts and tasks, or exclusions, such as not being for single-task evaluation, which limits its utility in decision-making.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
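
As a concrete illustration, a description with explicit guidance might read as follows. The wording is only a suggestion, and the relationships to gepa_select_optimal and gepa_reflect are inferred from the tool names cited in this review rather than from the server's own documentation.

    // Suggested wording only; not the server's published description.
    const improvedDescription =
      'Evaluate an existing prompt candidate against a batch of tasks and record the ' +
      'scores in the Pareto frontier. Use it to benchmark candidates before choosing one ' +
      'with gepa_select_optimal; use gepa_reflect instead when you need analysis of why a ' +
      'prompt underperforms. Requires an existing promptId and at least one taskId.';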

