# Core GEPA Tools Reference
This document provides detailed specifications for the core GEPA MCP tools used for prompt evolution, trajectory recording, and reflection analysis.
## Table of Contents
- [gepa_start_evolution](#gepa_start_evolution)
- [gepa_record_trajectory](#gepa_record_trajectory)
- [gepa_evaluate_prompt](#gepa_evaluate_prompt)
- [gepa_reflect](#gepa_reflect)
---
## gepa_start_evolution
Initializes a genetic evolution process with configuration parameters and an optional seed prompt.
### Purpose
Starts a new prompt evolution experiment by setting up the initial population, defining evolution parameters, and creating the evolutionary framework for iterative improvement.
### Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `taskDescription` | `string` | ✅ | Clear description of the task to optimize prompts for |
| `seedPrompt` | `string` | ❌ | Initial prompt to start evolution from (optional) |
| `targetModules` | `string[]` | ❌ | Specific modules or components to target (optional) |
| `config` | `object` | ❌ | Evolution configuration parameters (optional) |
### Configuration Object
```typescript
interface EvolutionConfig {
  populationSize?: number;    // Default: 20, Range: 5-50
  generations?: number;       // Default: 10, Range: 1-100
  mutationRate?: number;      // Default: 0.15, Range: 0.0-1.0
  crossoverRate?: number;     // Default: 0.7, Range: 0.0-1.0
  elitismPercentage?: number; // Default: 0.1, Range: 0.0-0.5
}
```
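The defaults and ranges above can be enforced client-side before calling the tool. The sketch below is a minimal illustration of that validation; `resolveConfig` and `LIMITS` are hypothetical helper names, not part of the GEPA API.

```typescript
interface EvolutionConfig {
  populationSize?: number;
  generations?: number;
  mutationRate?: number;
  crossoverRate?: number;
  elitismPercentage?: number;
}

// Documented defaults and valid ranges for each parameter.
const LIMITS: Record<keyof EvolutionConfig, { def: number; min: number; max: number }> = {
  populationSize:    { def: 20,   min: 5,   max: 50 },
  generations:       { def: 10,   min: 1,   max: 100 },
  mutationRate:      { def: 0.15, min: 0.0, max: 1.0 },
  crossoverRate:     { def: 0.7,  min: 0.0, max: 1.0 },
  elitismPercentage: { def: 0.1,  min: 0.0, max: 0.5 },
};

// Fill in defaults and reject out-of-range values before the request is sent.
function resolveConfig(partial: EvolutionConfig): Required<EvolutionConfig> {
  const resolved = {} as Required<EvolutionConfig>;
  for (const key of Object.keys(LIMITS) as (keyof EvolutionConfig)[]) {
    const { def, min, max } = LIMITS[key];
    const value = partial[key] ?? def;
    if (value < min || value > max) {
      throw new RangeError(`${key} must be between ${min} and ${max}`);
    }
    resolved[key] = value;
  }
  return resolved;
}
```

Validating locally surfaces errors such as `Invalid mutation rate` before a round-trip to the server.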
### Request Example
```typescript
const response = await mcpClient.callTool('gepa_start_evolution', {
  taskDescription: 'Generate comprehensive API documentation from code comments',
  seedPrompt: 'Analyze the following code and generate detailed API documentation including parameters, return types, and usage examples:',
  targetModules: ['documentation', 'code_analysis'],
  config: {
    populationSize: 25,
    generations: 15,
    mutationRate: 0.12,
    crossoverRate: 0.75,
    elitismPercentage: 0.15
  }
});
```
### Response Example
```markdown
# Evolution Process Started
## Evolution Details
- **Evolution ID**: evolution_1733140800_abc123
- **Task**: Generate comprehensive API documentation from code comments
- **Target Modules**: documentation, code_analysis
- **Seed Prompt**: Provided
## Configuration
- **Population Size**: 25
- **Max Generations**: 15
- **Mutation Rate**: 0.12
## Initial Population
- **Total Candidates**: 25
- **Seed Candidates**: 1
- **Generated Candidates**: 24
Evolution process initialized successfully. Use `gepa_evaluate_prompt` to begin evaluating candidates.
```
### Error Cases
| Error | Cause | Solution |
|-------|-------|----------|
| `taskDescription is required` | Missing task description | Provide a clear, specific task description |
| `Invalid mutation rate` | Rate outside 0.0-1.0 range | Use values between 0.0 and 1.0 |
| `Population size too large` | Size exceeds system limits | Reduce population size to ≤50 |
---
## gepa_record_trajectory
Records an execution trajectory for prompt evaluation, capturing detailed performance metrics and execution steps.
### Purpose
Captures comprehensive execution data to enable reflection analysis, performance tracking, and evolutionary feedback. Essential for building the dataset needed for intelligent prompt improvement.
### Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `promptId` | `string` | ✅ | Unique identifier for the prompt candidate |
| `taskId` | `string` | ✅ | Identifier for the specific task instance |
| `executionSteps` | `ExecutionStep[]` | ✅ | Sequence of execution steps with details |
| `result` | `ExecutionResult` | ✅ | Final execution result and performance score |
| `metadata` | `object` | ❌ | Additional execution metadata (optional) |
### ExecutionStep Schema
```typescript
interface ExecutionStep {
  action: string;     // Action performed (e.g., "parse_input", "generate_code")
  input?: object;     // Input data for this step
  output?: object;    // Output produced by this step
  timestamp: string;  // ISO timestamp of step execution
  success: boolean;   // Whether step completed successfully
  reasoning?: string; // AI reasoning for this step
  toolName?: string;  // Tool used in this step
  error?: string;     // Error message if step failed
}
```
### ExecutionResult Schema
```typescript
interface ExecutionResult {
  success: boolean; // Overall execution success
  score: number;    // Performance score (0.0-1.0)
  output: object;   // Final output of execution
  error?: string;   // Error message if execution failed
}
```
### Metadata Schema
```typescript
interface TrajectoryMetadata {
  llmModel?: string;      // LLM model used (e.g., "claude-3-sonnet")
  executionTime?: number; // Total execution time in milliseconds
  tokenUsage?: number;    // Total tokens consumed
  retryCount?: number;    // Number of retries attempted
  environment?: string;   // Execution environment info
}
```
### Request Example
```typescript
const response = await mcpClient.callTool('gepa_record_trajectory', {
  promptId: 'evolution_1733140800_candidate_5',
  taskId: 'api_documentation_task_001',
  executionSteps: [
    {
      action: 'parse_code_structure',
      input: { codeText: '...' },
      output: { functions: [...], classes: [...] },
      timestamp: '2024-12-02T14:30:00.000Z',
      success: true,
      reasoning: 'Successfully identified 5 functions and 2 classes'
    },
    {
      action: 'generate_documentation',
      input: { parsedStructure: {...} },
      output: { documentation: '...' },
      timestamp: '2024-12-02T14:30:05.000Z',
      success: true,
      toolName: 'documentation_generator'
    },
    {
      action: 'validate_output',
      input: { documentation: '...' },
      output: { validationResult: { isValid: true, score: 0.87 } },
      timestamp: '2024-12-02T14:30:08.000Z',
      success: true
    }
  ],
  result: {
    success: true,
    score: 0.87,
    output: {
      documentationText: 'Generated API documentation...',
      coverageScore: 0.91,
      qualityMetrics: { clarity: 0.85, completeness: 0.89 }
    }
  },
  metadata: {
    llmModel: 'claude-3-sonnet',
    executionTime: 8500,
    tokenUsage: 1250,
    environment: 'production'
  }
});
```
### Response Example
```markdown
# Trajectory Recorded Successfully
## Trajectory Details
- **Trajectory ID**: trajectory_1733140850_def456
- **Prompt ID**: evolution_1733140800_candidate_5
- **Task ID**: api_documentation_task_001
- **Execution Steps**: 3
- **Success**: ✅
- **Performance Score**: 0.870
## Execution Summary
- **Total Steps**: 3
- **Successful Steps**: 3
- **Failed Steps**: 0
- **Execution Time**: 8500ms
- **Token Usage**: 1250
## Storage
- **File**: ./data/trajectories/trajectory_1733140850_def456.json
- **Success**: Yes
- **ID**: trajectory_1733140850_def456
✨ Candidate added to Pareto frontier for optimization.
```
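The step tallies reported in the execution summary above can be reproduced client-side before recording, which helps catch a malformed `executionSteps` array early. This is a minimal sketch; `summarizeSteps` is an illustrative helper, not part of the GEPA API.

```typescript
interface ExecutionStep {
  action: string;
  timestamp: string; // ISO timestamp
  success: boolean;
  error?: string;
}

// Tally step outcomes the same way the trajectory response summarizes them.
function summarizeSteps(steps: ExecutionStep[]) {
  const successful = steps.filter((s) => s.success).length;
  return {
    totalSteps: steps.length,
    successfulSteps: successful,
    failedSteps: steps.length - successful,
  };
}
```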
### Error Cases
| Error | Cause | Solution |
|-------|-------|----------|
| `promptId is required` | Missing prompt identifier | Provide valid prompt ID from evolution |
| `Invalid execution steps` | Malformed steps array | Ensure each step has required fields |
| `Score out of range` | Score not between 0.0-1.0 | Use normalized scores in valid range |
| `Storage failure` | File system error | Check disk space and permissions |
---
## gepa_evaluate_prompt
Evaluates a prompt candidate's performance across multiple tasks using configurable rollout counts and execution strategies.
### Purpose
Systematically tests prompt candidates across diverse scenarios to gather robust performance data for evolutionary selection and Pareto frontier updates.
### Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `promptId` | `string` | ✅ | Unique identifier for prompt to evaluate |
| `taskIds` | `string[]` | ✅ | List of task IDs to evaluate against |
| `rolloutCount` | `number` | ❌ | Number of evaluation rollouts per task (default: 5) |
| `parallel` | `boolean` | ❌ | Whether to run evaluations in parallel (default: true) |
### Request Example
```typescript
const response = await mcpClient.callTool('gepa_evaluate_prompt', {
  promptId: 'evolution_1733140800_candidate_12',
  taskIds: [
    'code_documentation_basic',
    'code_documentation_complex',
    'api_reference_generation',
    'example_code_creation'
  ],
  rolloutCount: 8,
  parallel: true
});
```
### Response Example
```markdown
# Prompt Evaluation Complete
## Evaluation Details
- **Evaluation ID**: eval_1733140900_ghi789
- **Prompt ID**: evolution_1733140800_candidate_12
- **Tasks Evaluated**: 4
- **Rollouts per Task**: 8
- **Total Evaluations**: 32
- **Parallel Execution**: Yes
## Performance Metrics
- **Success Rate**: 87.5%
- **Average Score**: 0.763
- **Combined Fitness**: 0.668
- **Average Execution Time**: 1850ms
## Task Breakdown
- **code_documentation_basic**: 100.0% success, 0.821 avg score
- **code_documentation_complex**: 87.5% success, 0.742 avg score
- **api_reference_generation**: 75.0% success, 0.698 avg score
- **example_code_creation**: 87.5% success, 0.791 avg score
✨ Candidate updated in Pareto frontier with performance metrics.
```
### Evaluation Metrics
The evaluation process tracks several key metrics:
| Metric | Description | Range |
|--------|-------------|-------|
| **Success Rate** | Percentage of successful executions | 0-100% |
| **Average Score** | Mean performance score across rollouts | 0.0-1.0 |
| **Combined Fitness** | Success rate × Average score | 0.0-1.0 |
| **Execution Time** | Mean time per evaluation | milliseconds |
| **Token Efficiency** | Output quality per token used | 0.0-1.0 |
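The combined-fitness formula in the table (success rate × average score) can be sketched as follows; `RolloutResult` and `combinedFitness` are illustrative names for this example, not part of the GEPA API.

```typescript
interface RolloutResult {
  success: boolean;
  score: number; // Performance score, 0.0-1.0
}

// Combined fitness = success rate × average score, per the metrics table.
function combinedFitness(rollouts: RolloutResult[]): number {
  if (rollouts.length === 0) return 0;
  const successRate =
    rollouts.filter((r) => r.success).length / rollouts.length;
  const avgScore =
    rollouts.reduce((sum, r) => sum + r.score, 0) / rollouts.length;
  return successRate * avgScore;
}
```

Because both factors lie in [0, 1], combined fitness penalizes candidates that score well on average but fail often, which is why it sits below the raw average score in the response example.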
### Error Cases
| Error | Cause | Solution |
|-------|-------|----------|
| `promptId not found` | Invalid prompt identifier | Use valid prompt ID from active evolution |
| `Empty taskIds array` | No tasks specified | Provide at least one task ID |
| `Rollout count too high` | Exceeds system limits | Use rolloutCount ≤ 20 |
| `Evaluation timeout` | Tasks taking too long | Reduce complexity or increase timeouts |
---
## gepa_reflect
Analyzes execution trajectories to identify failure patterns and generate actionable prompt improvement suggestions.
### Purpose
Performs intelligent failure analysis to understand why prompts fail and provides specific, actionable recommendations for improvement. Powers the reflection-driven evolution cycle.
### Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `trajectoryIds` | `string[]` | ✅ | List of trajectory IDs to analyze for patterns |
| `targetPromptId` | `string` | ✅ | Prompt ID to generate improvements for |
| `analysisDepth` | `'shallow' \| 'deep'` | ❌ | Depth of analysis to perform (default: 'deep') |
| `focusAreas` | `string[]` | ❌ | Specific areas to focus analysis on (optional) |
### Analysis Depth Options
| Depth | Description | Use Case |
|-------|-------------|----------|
| `shallow` | Quick pattern identification | Fast iteration, initial analysis |
| `deep` | Comprehensive root cause analysis | Detailed improvement, final optimization |
### Focus Areas
Common focus areas for targeted analysis:
- `instruction_clarity` - Prompt instruction quality
- `example_quality` - Example effectiveness
- `constraint_handling` - Constraint specification
- `error_recovery` - Failure handling strategies
- `output_formatting` - Response structure
- `reasoning_depth` - Logical reasoning quality
### Request Example
```typescript
const response = await mcpClient.callTool('gepa_reflect', {
  trajectoryIds: [
    'trajectory_1733140850_abc123',
    'trajectory_1733140855_def456',
    'trajectory_1733140860_ghi789',
    'trajectory_1733140865_jkl012'
  ],
  targetPromptId: 'evolution_1733140800_candidate_15',
  analysisDepth: 'deep',
  focusAreas: ['instruction_clarity', 'example_quality', 'error_recovery']
});
```
### Response Example
```markdown
# Reflection Analysis Complete
## Analysis Details
- **Reflection ID**: reflection_1733141000_mno345
- **Target Prompt**: evolution_1733140800_candidate_15
- **Trajectories Analyzed**: 4/4
- **Analysis Depth**: deep
- **Focus Areas**: instruction_clarity, example_quality, error_recovery
## Failure Pattern Analysis
- **Patterns Detected**: 3
- **Recommendations**: 5
- **Confidence**: 89.2%
## Key Findings
1. **Instruction Ambiguity** (75.0% frequency)
- Severity: 75.0%
- Description: Instructions lack specific output format requirements
2. **Insufficient Examples** (50.0% frequency)
- Severity: 50.0%
- Description: Limited examples for complex edge cases
3. **Error Handling Gaps** (25.0% frequency)
- Severity: 25.0%
- Description: No guidance for handling malformed input
## Improvement Recommendations
1. **High Priority**: Add explicit output format specifications
- Issue: Instruction Ambiguity
- Frequency: 75.0%
2. **High Priority**: Include diverse edge case examples
- Issue: Insufficient Examples
- Frequency: 50.0%
3. **Medium Priority**: Add error handling instructions
- Issue: Error Handling Gaps
- Frequency: 25.0%
4. **Medium Priority**: Clarify constraint boundaries
- Issue: Instruction Ambiguity
- Frequency: 75.0%
5. **Low Priority**: Improve reasoning chain structure
- Issue: Insufficient Examples
- Frequency: 50.0%
## Summary
The analysis identified 3 distinct failure patterns across 4 trajectories. Focus on addressing high-priority issues first to maximize improvement impact.
```
### Reflection Output Components
| Component | Description | Use Case |
|-----------|-------------|----------|
| **Failure Patterns** | Common failure modes and frequencies | Understanding systematic issues |
| **Root Cause Analysis** | Deep dive into underlying problems | Targeted improvement focus |
| **Improvement Suggestions** | Specific, actionable prompt changes | Direct implementation guidance |
| **Confidence Scores** | Reliability of analysis results | Risk assessment for changes |
| **Priority Ranking** | Ordered list of improvement areas | Resource allocation decisions |
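The priority ranking component above can be applied client-side when deciding which recommendations to implement first. The sketch below orders recommendations by priority, then by failure frequency, matching the "address high-priority issues first" guidance; the `Recommendation` shape and `rankRecommendations` helper are assumptions for illustration, not part of the GEPA API.

```typescript
type Priority = 'High' | 'Medium' | 'Low';

interface Recommendation {
  priority: Priority;
  issue: string;
  frequency: number; // How often the underlying pattern occurred, 0.0-1.0
  suggestion: string;
}

const PRIORITY_RANK: Record<Priority, number> = { High: 0, Medium: 1, Low: 2 };

// Sort high-priority items first; break ties by descending frequency.
function rankRecommendations(recs: Recommendation[]): Recommendation[] {
  return [...recs].sort(
    (a, b) =>
      PRIORITY_RANK[a.priority] - PRIORITY_RANK[b.priority] ||
      b.frequency - a.frequency
  );
}
```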
### Error Cases
| Error | Cause | Solution |
|-------|-------|----------|
| `No valid trajectories` | All trajectory IDs invalid | Verify trajectory IDs exist and are accessible |
| `Insufficient data` | Too few trajectories for analysis | Provide at least 3 trajectories |
| `Analysis timeout` | Complex analysis taking too long | Use 'shallow' depth or reduce trajectory count |
| `Target prompt not found` | Invalid target prompt ID | Verify prompt ID exists in current evolution |
---
## Best Practices
### Evolution Initialization
- **Task Descriptions**: Be specific and measurable
- **Seed Prompts**: Use high-quality, representative examples
- **Population Size**: Start small (10-20) for faster iteration
- **Generations**: Monitor convergence, typically 10-30 generations
### Trajectory Recording
- **Step Granularity**: Capture meaningful decision points
- **Error Context**: Include rich error information for failures
- **Performance Metrics**: Use consistent scoring scales (0.0-1.0)
- **Metadata**: Include environment and execution context
### Evaluation Strategy
- **Task Diversity**: Use varied, representative task sets
- **Rollout Counts**: Balance thoroughness vs. speed (3-10 rollouts)
- **Parallel Processing**: Enable for faster evaluation cycles
- **Score Calibration**: Ensure consistent scoring across tasks
### Reflection Analysis
- **Trajectory Selection**: Include both successful and failed executions
- **Analysis Depth**: Use 'deep' for final optimization, 'shallow' for iteration
- **Focus Areas**: Target specific improvement areas when known
- **Implementation**: Apply suggestions systematically and test impact
## Integration Patterns
### Sequential Workflow
```typescript
// 1. Start evolution
const evolution = await startEvolution({...});
// 2. Evaluate candidates
const evaluation = await evaluatePrompt({...});
// 3. Record trajectories
await recordTrajectory({...});
// 4. Analyze failures
const reflection = await reflect({...});
// 5. Generate new candidates based on insights
```
### Parallel Evaluation
```typescript
// Evaluate multiple candidates simultaneously
const evaluations = await Promise.all([
  evaluatePrompt({ promptId: 'candidate_1', ... }),
  evaluatePrompt({ promptId: 'candidate_2', ... }),
  evaluatePrompt({ promptId: 'candidate_3', ... })
]);
```
### Continuous Improvement
```typescript
// Monitor performance and trigger reflection automatically
if (successRate < threshold) {
  const analysis = await reflect({
    trajectoryIds: recentFailures,
    targetPromptId: currentBest
  });
  // Apply improvements...
}
```