run_evaluation_tests
Run evaluation tests on CircleCI pipelines by triggering new pipelines with generated configuration files and returning URLs to monitor progress.
Instructions
This tool allows users to run evaluation tests on a CircleCI pipeline.
They can be referred to as "Prompt Tests" or "Evaluation Tests".
This tool triggers a new CircleCI pipeline and returns the URL to monitor its progress.
The tool will generate an appropriate CircleCI configuration file and trigger a pipeline using this temporary configuration.
The returned pipeline URL includes the resolved project slug.
Input options (EXACTLY ONE of these THREE options must be used):
Option 1 - Project Slug and branch (BOTH required):
- projectSlug: The project slug obtained from listFollowedProjects tool (e.g., "gh/organization/project")
- branch: The name of the branch (required when using projectSlug)
Option 2 - Direct URL (provide ONE of these):
- projectURL: The URL of the CircleCI project in any of these formats:
* Project URL with branch: https://app.circleci.com/pipelines/gh/organization/project?branch=feature-branch
* Pipeline URL: https://app.circleci.com/pipelines/gh/organization/project/123
* Workflow URL: https://app.circleci.com/pipelines/gh/organization/project/123/workflows/abc-def
* Job URL: https://app.circleci.com/pipelines/gh/organization/project/123/workflows/abc-def/jobs/xyz
Option 3 - Project Detection (ALL of these must be provided together):
- workspaceRoot: The absolute path to the workspace root
- gitRemoteURL: The URL of the git remote repository
- branch: The name of the current branch
Test Files:
- promptFiles: Array of prompt template file objects from the ./prompts directory, each containing:
* fileName: The name of the prompt template file
* fileContent: The contents of the prompt template file
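As an illustration of the inputs described above, an Option 1 call with a single prompt file might pass a `params` object like the following sketch (the slug, branch, and file contents are placeholders, not real project data):

```typescript
// Illustrative params object: Option 1 (projectSlug + branch) plus one prompt file.
// All values below are made-up placeholders for the sketch.
const params = {
  projectSlug: 'gh/organization/project', // as returned by listFollowedProjects
  branch: 'main',
  promptFiles: [
    {
      fileName: 'summarize-article.prompt.yml',
      fileContent: 'name: summarize-article\n# ...rest of the prompt template',
    },
  ],
};
```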
Pipeline Selection:
- If the project has multiple pipeline definitions, the tool will return a list of available pipelines
- You must then make another call with the chosen pipeline name using the pipelineChoiceName parameter
- The pipelineChoiceName must exactly match one of the pipeline names returned by the tool
- If the project has only one pipeline definition, pipelineChoiceName is not needed
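If the first call reports multiple pipeline definitions, the follow-up call repeats the same inputs and adds `pipelineChoiceName`, as in this hedged sketch (the pipeline name shown is a placeholder and must match one of the names the tool actually returned):

```typescript
// Hypothetical follow-up call after the tool listed several pipelines,
// e.g. "build-and-test"; only an exact name from that list is valid.
const paramsWithChoice = {
  projectSlug: 'gh/organization/project',
  branch: 'main',
  pipelineChoiceName: 'build-and-test',
  promptFiles: [
    { fileName: 'summarize-article.prompt.yml', fileContent: '...' },
  ],
};
```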
Additional Requirements:
- Never call this tool with incomplete parameters
- If using Option 1, make sure to extract the projectSlug exactly as provided by listFollowedProjects
- If using Option 2, the URLs MUST be provided by the user - do not attempt to construct or guess URLs
- If using Option 3, ALL THREE parameters (workspaceRoot, gitRemoteURL, branch) must be provided
- If none of the options can be fully satisfied, ask the user for the missing information before making the tool call
Returns:
- A URL to the newly triggered pipeline that can be used to monitor its progress
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| params | No | Object containing the tool's input parameters, as defined by `runEvaluationTestsInputSchema` (see the options above). | |
Implementation Reference
- The primary handler function implementing the `run_evaluation_tests` tool logic: it processes the inputs for project and branch resolution, handles prompt files by gzipping them and injecting them into the generated CircleCI config, builds the evaluation workflow, and triggers the pipeline.

```typescript
export const runEvaluationTests: ToolCallback<{
  params: typeof runEvaluationTestsInputSchema;
}> = async (args) => {
  const {
    workspaceRoot,
    gitRemoteURL,
    branch,
    projectURL,
    pipelineChoiceName,
    projectSlug: inputProjectSlug,
    promptFiles,
  } = args.params ?? {};

  let projectSlug: string | undefined;
  let branchFromURL: string | undefined;

  if (inputProjectSlug) {
    if (!branch) {
      return mcpErrorOutput(
        'Branch not provided. When using projectSlug, a branch must also be specified.',
      );
    }
    projectSlug = inputProjectSlug;
  } else if (projectURL) {
    projectSlug = getProjectSlugFromURL(projectURL);
    branchFromURL = getBranchFromURL(projectURL);
  } else if (workspaceRoot && gitRemoteURL && branch) {
    projectSlug = await identifyProjectSlug({
      gitRemoteURL,
    });
  } else {
    return mcpErrorOutput(
      'Missing required inputs. Please provide either: 1) projectSlug with branch, 2) projectURL, or 3) workspaceRoot with gitRemoteURL and branch.',
    );
  }

  if (!projectSlug) {
    return mcpErrorOutput(`
      Project not found. Ask the user to provide the inputs user can provide based on the tool description.

      Project slug: ${projectSlug}
      Git remote URL: ${gitRemoteURL}
      Branch: ${branch}
      `);
  }

  const foundBranch = branchFromURL || branch;
  if (!foundBranch) {
    return mcpErrorOutput(
      'No branch provided. Try using the current git branch.',
    );
  }

  if (!promptFiles || promptFiles.length === 0) {
    return mcpErrorOutput(
      'No prompt template files provided. Please ensure you have prompt template files in the ./prompts directory (e.g. <relevant-name>.prompt.yml) and include them in the promptFiles parameter.',
    );
  }

  const circleci = getCircleCIClient();
  const { id: projectId } = await circleci.projects.getProject({
    projectSlug,
  });

  const pipelineDefinitions = await circleci.pipelines.getPipelineDefinitions({
    projectId,
  });

  const pipelineChoices = [
    ...pipelineDefinitions.map((definition) => ({
      name: definition.name,
      definitionId: definition.id,
    })),
  ];

  if (pipelineChoices.length === 0) {
    return mcpErrorOutput(
      'No pipeline definitions found. Please make sure your project is set up on CircleCI to run pipelines.',
    );
  }

  const formattedPipelineChoices = pipelineChoices
    .map(
      (pipeline, index) =>
        `${index + 1}. ${pipeline.name} (definitionId: ${pipeline.definitionId})`,
    )
    .join('\n');

  if (pipelineChoices.length > 1 && !pipelineChoiceName) {
    return {
      content: [
        {
          type: 'text',
          text: `Multiple pipeline definitions found. Please choose one of the following:\n${formattedPipelineChoices}`,
        },
      ],
    };
  }

  const chosenPipeline = pipelineChoiceName
    ? pipelineChoices.find((pipeline) => pipeline.name === pipelineChoiceName)
    : undefined;

  if (pipelineChoiceName && !chosenPipeline) {
    return mcpErrorOutput(
      `Pipeline definition with name ${pipelineChoiceName} not found. 
      Please choose one of the following:\n${formattedPipelineChoices}`,
    );
  }

  const runPipelineDefinitionId =
    chosenPipeline?.definitionId || pipelineChoices[0].definitionId;

  // Process each file for compression and encoding
  const processedFiles = promptFiles.map((promptFile) => {
    const fileExtension = promptFile.fileName.toLowerCase();
    let processedPromptFileContent: string;

    if (fileExtension.endsWith('.json')) {
      // For JSON files, parse and re-stringify to ensure proper formatting
      const json = JSON.parse(promptFile.fileContent);
      processedPromptFileContent = JSON.stringify(json, null);
    } else if (
      fileExtension.endsWith('.yml') ||
      fileExtension.endsWith('.yaml')
    ) {
      // For YAML files, keep as-is
      processedPromptFileContent = promptFile.fileContent;
    } else {
      // Default to treating as text content
      processedPromptFileContent = promptFile.fileContent;
    }

    // Gzip compress the content and then base64 encode for compact transport
    const gzippedContent = gzipSync(processedPromptFileContent);
    const base64GzippedContent = gzippedContent.toString('base64');

    return {
      fileName: promptFile.fileName,
      base64GzippedContent,
    };
  });

  // Generate file creation commands with conditional logic for parallelism
  const fileCreationCommands = processedFiles
    .map(
      (file, index) =>
        `            if [ "$CIRCLE_NODE_INDEX" = "${index}" ]; then
              sudo mkdir -p /prompts
              echo "${file.base64GzippedContent}" | base64 -d | gzip -d | sudo tee /prompts/${file.fileName} > /dev/null
            fi`,
    )
    .join('\n');

  // Generate individual evaluation commands with conditional logic for parallelism
  const evaluationCommands = processedFiles
    .map(
      (file, index) =>
        `            if [ "$CIRCLE_NODE_INDEX" = "${index}" ]; then
              python eval.py ${file.fileName}
            fi`,
    )
    .join('\n');

  const configContent = `
version: 2.1
jobs:
  evaluate-prompt-template-tests:
    parallelism: ${processedFiles.length}
    docker:
      - image: cimg/python:3.12.0
    steps:
      - run: |
          curl https://gist.githubusercontent.com/jvincent42/10bf3d2d2899033ae1530cf429ed03f8/raw/acf07002d6bfcfb649c913b01a203af086c1f98d/eval.py > eval.py
          echo "deepeval>=3.0.3
          openai>=1.84.0
          anthropic>=0.54.0
          PyYAML>=6.0.2
          " > requirements.txt
          pip install -r requirements.txt
      - run: |
${fileCreationCommands}
      - run: |
${evaluationCommands}
workflows:
  mcp-run-evaluation-tests:
    jobs:
      - evaluate-prompt-template-tests
`;

  const runPipelineResponse = await circleci.pipelines.runPipeline({
    projectSlug,
    branch: foundBranch,
    definitionId: runPipelineDefinitionId,
    configContent,
  });

  return {
    content: [
      {
        type: 'text',
        text: `Pipeline run successfully. View it at: https://app.circleci.com/pipelines/${projectSlug}/${runPipelineResponse.number}`,
      },
    ],
  };
};
```
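To make the transport scheme above concrete, here is a minimal sketch of the gzip-then-base64 encoding the handler applies to each prompt file, and the equivalent of the `base64 -d | gzip -d` decoding that the generated pipeline step performs. It uses only Node's built-in `zlib`; the prompt string is a made-up placeholder.

```typescript
import { gzipSync, gunzipSync } from 'zlib';

// Encode the way the handler does: gzip first, then base64, so the file
// content can be embedded as a single shell-safe string in the config.
const original = 'name: summarize-article\n# ...rest of the prompt template\n';
const encoded = gzipSync(original).toString('base64');

// The generated step reverses this with `echo "<encoded>" | base64 -d | gzip -d`;
// the Node equivalent of that decode is:
const decoded = gunzipSync(Buffer.from(encoded, 'base64')).toString('utf8');
console.log(decoded === original); // true
```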
- Zod input schema defining the parameters for the tool, including the project identification options (projectSlug+branch, projectURL, or workspaceRoot+gitRemoteURL+branch) and the array of promptFiles with name and content.

```typescript
export const runEvaluationTestsInputSchema = z.object({
  projectSlug: z.string().describe(projectSlugDescription).optional(),
  branch: z.string().describe(branchDescription).optional(),
  workspaceRoot: z
    .string()
    .describe(
      'The absolute path to the root directory of your project workspace. ' +
        'This should be the top-level folder containing your source code, configuration files, and dependencies. ' +
        'For example: "/home/user/my-project" or "C:\\Users\\user\\my-project"',
    )
    .optional(),
  gitRemoteURL: z
    .string()
    .describe(
      'The URL of the remote git repository. This should be the URL of the repository that you cloned to your local workspace. ' +
        'For example: "https://github.com/user/my-project.git"',
    )
    .optional(),
  projectURL: z
    .string()
    .describe(
      'The URL of the CircleCI project. Can be any of these formats:\n' +
        '- Project URL with branch: https://app.circleci.com/pipelines/gh/organization/project?branch=feature-branch\n' +
        '- Pipeline URL: https://app.circleci.com/pipelines/gh/organization/project/123\n' +
        '- Workflow URL: https://app.circleci.com/pipelines/gh/organization/project/123/workflows/abc-def\n' +
        '- Job URL: https://app.circleci.com/pipelines/gh/organization/project/123/workflows/abc-def/jobs/xyz',
    )
    .optional(),
  pipelineChoiceName: z
    .string()
    .describe(
      'The name of the pipeline to run. This parameter is only needed if the project has multiple pipeline definitions. ' +
        'If not provided and multiple pipelines exist, the tool will return a list of available pipelines for the user to choose from. ' +
        'If provided, it must exactly match one of the pipeline names returned by the tool.',
    )
    .optional(),
  promptFiles: z
    .array(
      z.object({
        fileName: z.string().describe('The name of the prompt template file'),
        fileContent: z
          .string()
          .describe('The contents of the prompt template file'),
      }),
    )
    .describe(
      `Array of prompt template files in the ${promptsOutputDirectory} directory (e.g. ${fileNameTemplate}).`,
    ),
});
```
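As a quick illustration of how this schema behaves, the sketch below validates an Option 1 payload and then a payload missing `promptFiles`. The import path is an assumption for the example; every field except `promptFiles` is optional at the schema level.

```typescript
// Sketch only: the module path of the schema is assumed, not taken from the repo.
import { runEvaluationTestsInputSchema } from './tools/runEvaluationTests/inputSchema.js';

const ok = runEvaluationTestsInputSchema.safeParse({
  projectSlug: 'gh/organization/project',
  branch: 'main',
  promptFiles: [
    { fileName: 'example.prompt.yml', fileContent: 'name: example\n' },
  ],
});
console.log(ok.success); // true

const missingFiles = runEvaluationTestsInputSchema.safeParse({
  projectSlug: 'gh/organization/project',
  branch: 'main',
});
console.log(missingFiles.success); // false: promptFiles is the one required field
```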
- `src/tools/runEvaluationTests/tool.ts:7-55` (registration): Tool specification object registering `run_evaluation_tests` with its name, a detailed description of the usage options, and a reference to the input schema.

```typescript
export const runEvaluationTestsTool = {
  name: 'run_evaluation_tests' as const,
  description: `
  This tool allows the users to run evaluation tests on a circleci pipeline.
  They can be referred to as "Prompt Tests" or "Evaluation Tests".

  This tool triggers a new CircleCI pipeline and returns the URL to monitor its progress.
  The tool will generate an appropriate circleci configuration file and trigger a pipeline using this temporary configuration.
  The tool will return the project slug.

  Input options (EXACTLY ONE of these THREE options must be used):

  ${option1DescriptionBranchRequired}

  Option 2 - Direct URL (provide ONE of these):
  - projectURL: The URL of the CircleCI project in any of these formats:
    * Project URL with branch: https://app.circleci.com/pipelines/gh/organization/project?branch=feature-branch
    * Pipeline URL: https://app.circleci.com/pipelines/gh/organization/project/123
    * Workflow URL: https://app.circleci.com/pipelines/gh/organization/project/123/workflows/abc-def
    * Job URL: https://app.circleci.com/pipelines/gh/organization/project/123/workflows/abc-def/jobs/xyz

  Option 3 - Project Detection (ALL of these must be provided together):
  - workspaceRoot: The absolute path to the workspace root
  - gitRemoteURL: The URL of the git remote repository
  - branch: The name of the current branch

  Test Files:
  - promptFiles: Array of prompt template file objects from the ${promptsOutputDirectory} directory, each containing:
    * fileName: The name of the prompt template file
    * fileContent: The contents of the prompt template file

  Pipeline Selection:
  - If the project has multiple pipeline definitions, the tool will return a list of available pipelines
  - You must then make another call with the chosen pipeline name using the pipelineChoiceName parameter
  - The pipelineChoiceName must exactly match one of the pipeline names returned by the tool
  - If the project has only one pipeline definition, pipelineChoiceName is not needed

  Additional Requirements:
  - Never call this tool with incomplete parameters
  - If using Option 1, make sure to extract the projectSlug exactly as provided by listFollowedProjects
  - If using Option 2, the URLs MUST be provided by the user - do not attempt to construct or guess URLs
  - If using Option 3, ALL THREE parameters (workspaceRoot, gitRemoteURL, branch) must be provided
  - If none of the options can be fully satisfied, ask the user for the missing information before making the tool call

  Returns:
  - A URL to the newly triggered pipeline that can be used to monitor its progress
  `,
  inputSchema: runEvaluationTestsInputSchema,
};
```
- `src/circleci-tools.ts:20-85` (registration): Server-level registration: imports the tool and handler, adds `runEvaluationTestsTool` to the `CCI_TOOLS` array (line 47), and maps `run_evaluation_tests` to `runEvaluationTests` in `CCI_HANDLERS` (line 78).

```typescript
import { runEvaluationTestsTool } from './tools/runEvaluationTests/tool.js';
import { runEvaluationTests } from './tools/runEvaluationTests/handler.js';
import { rerunWorkflowTool } from './tools/rerunWorkflow/tool.js';
import { rerunWorkflow } from './tools/rerunWorkflow/handler.js';
import { downloadUsageApiDataTool } from './tools/downloadUsageApiData/tool.js';
import { downloadUsageApiData } from './tools/downloadUsageApiData/handler.js';
import { findUnderusedResourceClassesTool } from './tools/findUnderusedResourceClasses/tool.js';
import { findUnderusedResourceClasses } from './tools/findUnderusedResourceClasses/handler.js';
import { analyzeDiffTool } from './tools/analyzeDiff/tool.js';
import { analyzeDiff } from './tools/analyzeDiff/handler.js';
import { runRollbackPipelineTool } from './tools/runRollbackPipeline/tool.js';
import { runRollbackPipeline } from './tools/runRollbackPipeline/handler.js';
import { listComponentVersionsTool } from './tools/listComponentVersions/tool.js';
import { listComponentVersions } from './tools/listComponentVersions/handler.js';

// Define the tools with their configurations
export const CCI_TOOLS = [
  getBuildFailureLogsTool,
  getFlakyTestLogsTool,
  getLatestPipelineStatusTool,
  getJobTestResultsTool,
  configHelperTool,
  createPromptTemplateTool,
  recommendPromptTemplateTestsTool,
  runPipelineTool,
  listFollowedProjectsTool,
  runEvaluationTestsTool,
  rerunWorkflowTool,
  downloadUsageApiDataTool,
  findUnderusedResourceClassesTool,
  analyzeDiffTool,
  runRollbackPipelineTool,
  listComponentVersionsTool,
];

// Extract the tool names as a union type
type CCIToolName = (typeof CCI_TOOLS)[number]['name'];

export type ToolHandler<T extends CCIToolName> = ToolCallback<{
  params: Extract<(typeof CCI_TOOLS)[number], { name: T }>['inputSchema'];
}>;

// Create a type for the tool handlers that directly maps each tool to its appropriate input schema
type ToolHandlers = {
  [K in CCIToolName]: ToolHandler<K>;
};

export const CCI_HANDLERS = {
  get_build_failure_logs: getBuildFailureLogs,
  find_flaky_tests: getFlakyTestLogs,
  get_latest_pipeline_status: getLatestPipelineStatus,
  get_job_test_results: getJobTestResults,
  config_helper: configHelper,
  create_prompt_template: createPromptTemplate,
  recommend_prompt_template_tests: recommendPromptTemplateTests,
  run_pipeline: runPipeline,
  list_followed_projects: listFollowedProjects,
  run_evaluation_tests: runEvaluationTests,
  rerun_workflow: rerunWorkflow,
  download_usage_api_data: downloadUsageApiData,
  find_underused_resource_classes: findUnderusedResourceClasses,
  analyze_diff: analyzeDiff,
  run_rollback_pipeline: runRollbackPipeline,
  list_component_versions: listComponentVersions,
} satisfies ToolHandlers;
```
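The design keeps `CCI_TOOLS` and `CCI_HANDLERS` keyed by the same tool names, so dispatch is a simple lookup. A minimal sanity-check sketch of that invariant is below; it is illustrative only and assumes these exports are importable from the compiled module, not that the repo ships such a check.

```typescript
// Illustrative check: every tool registered in CCI_TOOLS should have a handler
// keyed by its name in CCI_HANDLERS (assumed import path for this sketch).
import { CCI_TOOLS, CCI_HANDLERS } from './circleci-tools.js';

for (const tool of CCI_TOOLS) {
  if (!(tool.name in CCI_HANDLERS)) {
    throw new Error(`No handler registered for tool: ${tool.name}`);
  }
}
console.log(`All ${CCI_TOOLS.length} tools have matching handlers.`);
```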