CatoBot autoexperiment MCP Server

Official
by IamCatoBot

Domain-Agnostic MCP Server for Autonomous Experimentation

A generalisation of Karpathy's autoresearch pattern into a reusable Model Context Protocol (MCP) server that any AI agent can drive, pointed at any domain.

The Pattern

modify something → run it → measure a result → keep or discard → repeat

The server exposes this loop as a standard set of MCP tools. The domain (what gets modified, how it runs, and what gets measured) is defined entirely in a JSON config file. The agent-side logic stays the same regardless of domain.
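As a toy illustration of the loop (not the server's code), the sketch below optimises a single number. `run_and_measure` and `propose_change` are hypothetical stand-ins for the domain and for the agent's reasoning; in a real session the run step is a shell command or an external system, and discarding a change means rolling files back.

```python
import random


def run_and_measure(x: float) -> float:
    """Hypothetical domain: score a candidate (lower is better)."""
    return (x - 3.0) ** 2


def propose_change(x: float, rng: random.Random) -> float:
    """Hypothetical agent step: nudge the current best candidate."""
    return x + rng.uniform(-0.5, 0.5)


def experiment_loop(start: float, iterations: int, seed: int = 0) -> tuple[float, float]:
    rng = random.Random(seed)
    best_x, best_score = start, run_and_measure(start)  # baseline run
    for _ in range(iterations):
        candidate = propose_change(best_x, rng)  # modify something
        score = run_and_measure(candidate)       # run it, measure a result
        if score < best_score:                   # keep
            best_x, best_score = candidate, score
        # otherwise discard: best_x is left untouched, which plays the role of a rollback
    return best_x, best_score


if __name__ == "__main__":
    x, score = experiment_loop(start=0.0, iterations=200)
    print(f"best x={x:.3f}, score={score:.6f}")
```

Because a candidate is only kept when it strictly improves the score, the best score never gets worse than the baseline.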

Architecture

┌─────────────────────────────────────────────────────┐
│  AI Agent (Claude Code, Codex, etc.)                │
│                                                     │
│  Reads status → plans change → edits file →         │
│  runs experiment → checks result → keeps/discards   │
└──────────────┬──────────────────────────────────────┘
               │ MCP (stdio)
┌──────────────▼──────────────────────────────────────┐
│  autoexperiment MCP server                          │
│                                                     │
│  Tools:                                             │
│    autoexp_get_status          — session overview    │
│    autoexp_read_file           — read allowed file   │
│    autoexp_update_file         — full file replace   │
│    autoexp_patch_file          — targeted find/repl  │
│    autoexp_run_experiment      — execute + measure   │
│    autoexp_begin_experiment    — open pending record │
│    autoexp_complete_experiment — close with metric   │
│    autoexp_set_baseline        — mark as baseline    │
│    autoexp_rollback            — revert to last good │
│    autoexp_get_history         — review past runs    │
│    autoexp_run_setup           — one-time setup      │
│                                                     │
│  Resources:                                         │
│    autoexp://status        — session status (JSON)   │
│    autoexp://history       — experiment history      │
│    autoexp://file/{path}   — read allowed files      │
│                                                     │
│  Config:  autoexperiment.json  (domain adapter)     │
│  Ledger:  .autoexperiment_ledger.json (state)       │
└──────────────┬──────────────────────────────────────┘
               │ subprocess / external MCP server
┌──────────────▼──────────────────────────────────────┐
│  Your domain                                        │
│  (training script, benchmark, simulation, etc.)     │
└─────────────────────────────────────────────────────┘

Code Structure

The server is implemented as a Python package (autoexperiment_mcp/) with a thin server.py entry point:

autoexperiment-mcp-server/
├── server.py                         # Entry point: imports mcp, calls mcp.run()
└── autoexperiment_mcp/
    ├── models.py                     # Pydantic models (DomainConfig, ExperimentRecord, …)
    ├── utils.py                      # Pure utilities (git, hash, path, regex, time, coercion)
    ├── store.py                      # State I/O, snapshot management, TSV logging, query helpers
    ├── experiment.py                 # Core lifecycle: begin/complete experiment, keep decision
    ├── lifespan.py                   # Startup validation, app_lifespan context manager
    ├── app.py                        # mcp = FastMCP("autoexperiment_mcp", lifespan=…)
    ├── tools.py                      # All 11 @mcp.tool() registrations
    ├── resources.py                  # All 3 @mcp.resource() registrations
    └── __init__.py                   # Imports app + triggers tool/resource registration

Installation

Prerequisites

Install uv

macOS / Linux

curl -LsSf https://astral.sh/uv/install.sh | sh

Windows (PowerShell)

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Learn more: astral-sh/uv

Clone the repository

git clone https://github.com/IamCatoBot/catobot-autoexperiment-mcp.git
cd catobot-autoexperiment-mcp

Install dependencies

uv sync

Quick Start

1. Prepare your experiment folder

Your experiment folder needs a working baseline, an evaluation script, and a config file:

my-experiment/
├── autoexperiment.json     ← config (you write this)
├── solution.py             ← editable (agent modifies this)
├── benchmark.py            ← evaluation (read-only)
└── data.csv                ← test data (read-only)

2. Create autoexperiment.json

{
  "project_name": "My Experiment",
  "description": "What you're trying to optimise",
  "workspace_dir": "/absolute/path/to/my-experiment",
  "editable_files": ["solution.py"],
  "read_only_files": ["benchmark.py", "data.csv"],
  "run_command": "python benchmark.py 2>&1",
  "timeout_seconds": 60,
  "metric_name": "rmse",
  "metric_regex": "^rmse:\\s*([\\d.]+)",
  "metric_direction": "lower",
  "use_git": true
}

3. Initialise git in the experiment folder

Git tracking is enabled by default (use_git: true). The experiment folder must be a git repository with an initial commit before the server will start.

cd /path/to/my-experiment
git init
git add -A
git commit -m "initial baseline"

4. Verify the run command works

Run your experiment command manually and check the output contains the metric in the expected format:

cd /path/to/my-experiment
python benchmark.py
# Should print something like:  rmse: 12.345678
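
To check the regex as well as the command, the pattern from the config can be applied to some captured output. This is a minimal sketch: `extract_metric` is a hypothetical helper, and it assumes the pattern is matched line by line (`re.MULTILINE`), which this README does not state. If the server matches against the whole output without that flag, a leading `^` would only match at the very start of the output.

```python
import re
from typing import Optional


def extract_metric(output: str, metric_regex: str) -> Optional[float]:
    """Return the first captured float, or None if nothing usable matches.

    Assumption: '^' may match at the start of any line (re.MULTILINE).
    """
    match = re.search(metric_regex, output, flags=re.MULTILINE)
    if match is None:
        return None
    try:
        return float(match.group(1))
    except ValueError:
        return None  # the capture was not a valid float, e.g. "1.2.3"


sample_output = "loading data.csv\nrmse: 12.345678\ndone\n"
# The JSON config doubles each backslash; a Python raw string holds the same pattern.
print(extract_metric(sample_output, r"^rmse:\s*([\d.]+)"))  # 12.345678
```

Note that `[\d.]+` does not match a minus sign or exponent notation, so a metric that can be negative or very small would need a broader pattern.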

5. Register the MCP server

Recommended: pass AUTOEXPERIMENT_CONFIG pointing to your config file. MCP hosts may launch the server process from a different working directory, so an explicit path is the safest default.

Claude Code (-e for env vars):

claude mcp add autoexperiment \
  -e AUTOEXPERIMENT_CONFIG=/path/to/my-experiment/autoexperiment.json \
  -- uv run \
  --project PATH_TO_AUTOEXPERIMENT_MCP_SERVER \
  python PATH_TO_AUTOEXPERIMENT_MCP_SERVER/server.py

Codex (--env for env vars):

codex mcp add autoexperiment \
  --env AUTOEXPERIMENT_CONFIG=/path/to/my-experiment/autoexperiment.json \
  -- uv run \
  --project PATH_TO_AUTOEXPERIMENT_MCP_SERVER \
  python PATH_TO_AUTOEXPERIMENT_MCP_SERVER/server.py

Replace /path/to/my-experiment/autoexperiment.json with the absolute path to your config file.

Optional shortcut: if the server process is launched from your experiment folder and the config filename is autoexperiment.json, you can omit the environment variable.

You only need to register the MCP server once per MCP client profile; after that, reconnect normally in new sessions.

Note: Replace PATH_TO_AUTOEXPERIMENT_MCP_SERVER with the actual path to your cloned repository. If the uv command is not found, run which uv (Unix) or Get-Command uv (PowerShell) and use the full path in place of uv (in a JSON-based client config, this is the "command" field).

6. Start experimenting

Launch Claude Code, Codex, or another MCP client from your experiment folder and prompt it:

Read the experiment status, review the editable and read-only files, run the baseline first, then iterate until improvements plateau and no meaningful gains remain.

Security Warning

setup_command and run_command execute shell commands on your host machine. This server does not provide sandboxing or container isolation by default.

Startup Validation

The server validates the configuration at startup and will refuse to start if:

  • workspace_dir does not exist or is not a directory

  • Any file in editable_files or read_only_files is missing

  • A file appears in both editable_files and read_only_files

  • metric_regex is not a valid regular expression

  • use_git is true but the workspace is not a git repository

Error messages are specific and tell you exactly what to fix.
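
The checks above can be approximated before launching the server. The sketch below is an illustration only, not the code in lifespan.py: `validate_config` is a hypothetical function, and the git check is simplified to looking for a `.git` entry directly inside the workspace.

```python
import re
from pathlib import Path


def validate_config(cfg: dict) -> list:
    """Return a list of problems; an empty list means these checks passed."""
    raw = cfg.get("workspace_dir")
    if not raw or not Path(raw).is_dir():
        return [f"workspace_dir does not exist or is not a directory: {raw!r}"]
    ws = Path(raw)
    problems = []

    editable = cfg.get("editable_files", [])
    read_only = cfg.get("read_only_files", [])
    for name in [*editable, *read_only]:
        if not (ws / name).is_file():
            problems.append(f"listed file is missing: {name}")
    for name in sorted(set(editable) & set(read_only)):
        problems.append(f"file is both editable and read-only: {name}")

    try:
        re.compile(cfg.get("metric_regex", ""))
    except re.error as exc:
        problems.append(f"metric_regex is not a valid regular expression: {exc}")

    # Simplification: a real check would also accept a workspace nested inside a repo.
    if cfg.get("use_git", True) and not (ws / ".git").exists():
        problems.append("use_git is true but workspace_dir has no .git entry")
    return problems
```

Loading autoexperiment.json with `json.load` and passing the resulting dict to this function gives a quick preflight report without starting an MCP client.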

Configuration

Everything domain-specific lives in autoexperiment.json:

| Field | Required | Default | Description |
|---|---|---|---|
| project_name | yes | | Human-readable name |
| description | no | "" | What you're trying to achieve |
| workspace_dir | yes | | Absolute path to the experiment folder |
| editable_files | yes | | Files the agent is allowed to modify (at least one) |
| read_only_files | no | [] | Files the agent can read but not change |
| execution_mode | no | "hybrid" | "shell", "external", or "hybrid" |
| run_command | shell/hybrid | | Shell command to run one experiment |
| timeout_seconds | no | 300 | Max time per experiment (10–7200s) |
| setup_command | no | null | One-time setup (deps, data download, etc.) |
| metric_name | yes | | Name of the metric being optimised |
| metric_regex | shell/hybrid | | Regex with one capture group to extract a float from stdout |
| metric_direction | yes | | "lower" or "higher" |
| require_baseline_first | no | true | Require a baseline experiment before non-baseline runs |
| use_git | no | true | Track experiments as git commits. Requires the workspace to be a git repo with an initial commit. |
| git_branch_prefix | no | "autoexp" | Prefix for experiment branches |
| keep_policy | no | see below | Multi-gate keep/discard policy |

Keep Policy

The keep_policy object controls when a completed experiment is kept vs discarded. All gates must pass for a run to be kept.

| Field | Default | Description |
|---|---|---|
| required_true_keys | [] | Metadata keys that must be boolean true |
| numeric_min | {} | Metadata keys with a floor value (e.g. {"utilization": 45}) |
| numeric_max | {} | Metadata keys with a ceiling value (e.g. {"latency_ms": 250}) |
| require_numeric_keys_present | true | If true, missing keys in numeric_min/numeric_max cause discard |
| allow_equal_metric_if_simpler | true | Keep a tied run if its complexity_score is lower |
| equal_metric_tolerance | 1e-9 | Tolerance for treating two metric values as equal |
| complexity_key | "complexity_score" | Metadata key used for complexity tie-breaking |

The agent sees the full keep_policy in autoexp_get_status and receives a required_metadata_keys reminder in every autoexp_begin_experiment response, so it knows which keys to include in the metadata argument when calling autoexp_complete_experiment.
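
One plausible reading of these gates is sketched below. It is not the server's implementation: `passes_gates` and `should_keep` are hypothetical names, bounds are assumed to be inclusive, and the treatment of a first result with no previous best is a guess.

```python
from typing import Optional

DEFAULT_POLICY = {
    "required_true_keys": [],
    "numeric_min": {},
    "numeric_max": {},
    "require_numeric_keys_present": True,
    "allow_equal_metric_if_simpler": True,
    "equal_metric_tolerance": 1e-9,
    "complexity_key": "complexity_score",
}


def passes_gates(metadata: dict, policy: dict) -> tuple:
    """Check the metadata gates; return (ok, reason)."""
    for key in policy["required_true_keys"]:
        if metadata.get(key) is not True:  # must be the boolean True, not merely truthy
            return False, f"{key} is not true"
    for bounds, is_floor in ((policy["numeric_min"], True), (policy["numeric_max"], False)):
        for key, limit in bounds.items():
            if key not in metadata:
                if policy["require_numeric_keys_present"]:
                    return False, f"{key} is missing"
                continue
            value = metadata[key]
            if (is_floor and value < limit) or (not is_floor and value > limit):
                return False, f"{key}={value} violates limit {limit}"
    return True, "all gates passed"


def should_keep(metric: float, best: Optional[float], direction: str,
                metadata: dict, best_metadata: dict, policy: dict) -> tuple:
    """Combine the gates with the metric comparison; return (kept, reason)."""
    ok, reason = passes_gates(metadata, policy)
    if not ok:
        return False, reason
    if best is None:
        return True, "no previous best"
    if abs(metric - best) <= policy["equal_metric_tolerance"]:
        key = policy["complexity_key"]
        simpler = (key in metadata and key in best_metadata
                   and metadata[key] < best_metadata[key])
        if policy["allow_equal_metric_if_simpler"] and simpler:
            return True, "tied metric, lower complexity"
        return False, "tied metric, not simpler"
    improved = metric < best if direction == "lower" else metric > best
    return (True, "improved") if improved else (False, "no improvement")
```

For example, with `policy = {**DEFAULT_POLICY, "numeric_max": {"latency_ms": 250}}`, a run that improves the metric but reports `latency_ms` of 300 is discarded by the ceiling gate before the metric is compared at all.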

Tool Reference

| Tool | Purpose | Destructive? |
|---|---|---|
| autoexp_get_status | Session overview, best score, editable files, keep_policy gates | No |
| autoexp_read_file | Read any allowed file | No |
| autoexp_update_file | Replace entire file contents | Yes |
| autoexp_patch_file | Targeted find-and-replace | No |
| autoexp_run_experiment | Execute the run command, extract metric (shell mode) | No (but slow) |
| autoexp_begin_experiment | Open a pending experiment record (external/hybrid mode) | No |
| autoexp_complete_experiment | Close a pending experiment with metric + metadata (external/hybrid mode) | No |
| autoexp_set_baseline | Mark an existing completed experiment as the baseline | No |
| autoexp_rollback | Revert files to a specific experiment's state via git | Yes |
| autoexp_get_history | Review past experiments and results | No |
| autoexp_run_setup | Run one-time setup command | No |

How the Loop Works

Shell mode (execution_mode: "shell")

  1. Agent calls autoexp_get_status → learns the domain, metric, and current best.

  2. Agent calls autoexp_read_file → reads the editable file(s) to understand the code.

  3. Agent calls autoexp_patch_file or autoexp_update_file → makes a change.

  4. Agent calls autoexp_run_experiment with a hypothesis → server runs it, extracts metric.

  5. If improved: server auto-commits via git and records the commit hash. Agent plans next experiment.

  6. If regressed or crashed: agent calls autoexp_rollback, then tries something else.

  7. Agent calls autoexp_get_history periodically to review trends and avoid repetition.

  8. Repeat indefinitely.
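
Step 4 can be pictured as running the command under a timeout and then searching its output for the metric. The sketch below is an assumption about the general shape of that step, not the server's code: `run_once` and its status strings are hypothetical, and `re.MULTILINE` is assumed so that `^` can match on any output line.

```python
import re
import subprocess
from typing import Optional


def run_once(run_command: str, metric_regex: str, timeout_seconds: float,
             cwd: Optional[str] = None) -> dict:
    """Run one experiment and classify the outcome."""
    try:
        proc = subprocess.run(
            run_command,
            shell=True,            # run_command is a shell string, as in the config
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "metric": None}
    if proc.returncode != 0:
        return {"status": "crashed", "metric": None}
    match = re.search(metric_regex, proc.stdout, flags=re.MULTILINE)
    if match is None:
        return {"status": "no_metric", "metric": None}
    try:
        return {"status": "ok", "metric": float(match.group(1))}
    except ValueError:
        return {"status": "no_metric", "metric": None}


if __name__ == "__main__":
    print(run_once("echo rmse: 1.5", r"^rmse:\s*([\d.]+)", timeout_seconds=10))
```

Since only stdout is searched here, a run command that ends in `2>&1` (as in the Quick Start config) also exposes anything the benchmark prints to stderr.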

External / hybrid mode (execution_mode: "external" or "hybrid")

Use this when another MCP server (e.g. a physics sim, a cloud evaluator) runs the experiment.

  1. Agent calls autoexp_get_status → note the keep_policy field — it lists every metadata key the policy will gate on.

  2. Agent edits the editable file(s) via autoexp_update_file / autoexp_patch_file.

  3. Agent calls autoexp_begin_experiment → receives experiment_id and a required_metadata_keys reminder.

  4. Agent triggers the external system and waits for results.

  5. Agent assembles a metadata dict containing all keys from required_metadata_keys (both from simulation output and any input-parameter constraints defined in numeric_min/numeric_max).

  6. Agent calls autoexp_complete_experiment with experiment_id, metric_value, and the assembled metadata dict.

  7. Server evaluates the keep policy and responds with kept, keep_reason, is_best.

  8. If not kept: agent calls autoexp_rollback and adjusts its approach.

Important: numeric_min/numeric_max gates often reference input parameters (e.g. service-time bounds from a config file) rather than simulation outputs. You must read those values yourself and include them in metadata alongside the simulator's results.
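
Steps 5 and 6 amount to merging two sources of values and checking that nothing required is missing before the experiment is completed. A minimal sketch, assuming the agent already holds the simulator output and the input parameters as dicts; `assemble_metadata` and the key `service_time_min` are hypothetical.

```python
def assemble_metadata(required_keys: list, sim_output: dict, input_params: dict) -> dict:
    """Merge input parameters with simulator results and verify required keys."""
    merged = {**input_params, **sim_output}  # simulator results win on a key clash
    missing = [key for key in required_keys if key not in merged]
    if missing:
        raise KeyError(f"metadata is missing required keys: {missing}")
    return merged


if __name__ == "__main__":
    metadata = assemble_metadata(
        required_keys=["utilization", "service_time_min"],
        sim_output={"utilization": 52.0},        # reported by the external system
        input_params={"service_time_min": 1.5},  # read from the edited config file
    )
    print(metadata)
```

Failing early on a missing key is cheaper than completing the experiment and having it discarded, which is what happens when require_numeric_keys_present is true.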

Design Principles

  • Domain-agnostic. The server knows nothing about ML, sorting, prompts, or any specific domain. All domain knowledge lives in the config file and the agent's reasoning.

  • Single metric. One number determines success. If your problem needs multiple metrics, your run command should combine them into a single score.

  • Fixed time budget. Each experiment gets the same wall-clock timeout, making results comparable.

  • Git as memory. Every improvement is committed with its commit hash recorded. Every regression can be rolled back to a specific experiment. The full history is always recoverable.

  • Agent autonomy. The server provides tools, not opinions. The agent decides what to try, when to rollback, and when to change strategy.
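
For the single-metric principle, one common way to fold several measurements into one number is a weighted sum printed in the format that metric_regex expects. The weights and metric names below are made up for illustration; choosing them is a domain decision this README does not prescribe.

```python
def combined_score(metrics: dict, weights: dict) -> float:
    """Weighted sum of the metrics named in weights (lower is better here)."""
    return sum(weights[name] * metrics[name] for name in weights)


if __name__ == "__main__":
    metrics = {"rmse": 12.0, "latency_s": 0.5}   # hypothetical measurements
    weights = {"rmse": 1.0, "latency_s": 4.0}    # hypothetical trade-off
    # Matches a config with "metric_name": "score" and
    # "metric_regex": "^score:\\s*([\\d.]+)"
    print(f"score: {combined_score(metrics, weights):.6f}")
```

With these values the script prints `score: 14.000000`. All terms should point in the same direction (all lower-is-better or all higher-is-better) so that metric_direction stays meaningful.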

Maintainer

The CatoBot autoexperiment MCP Server is an open source project developed and maintained by Nikolaos Maniatis, The Cato Bot Company Limited.

Disclaimer

  • Work in progress: the software is actively evolving; features may change and some functionality may be incomplete.

  • LLM-powered workflow: model/code quality depends on the capabilities of the LLM driving the loop.

  • Validate outputs: always critically review and validate generated models, code changes, and metrics before relying on results.

Citation

For academic use, cite:

Maniatis, N. (2026). CatoBot autoexperiment MCP Server (v1.0.0). https://github.com/IamCatoBot/catobot-autoexperiment-mcp. Copyright The Cato Bot Company Limited. Licensed under Apache 2.0.
