CatoBot Autoexperiment Server

Author: IamCatoBot

Description

A domain-agnostic MCP server for autonomous experimentation, generalizing Karpathy's autoresearch pattern into a reusable server that any AI agent can drive, pointed at any domain defined by a JSON configuration.

README

Domain-Agnostic MCP Server for Autonomous Experimentation

A generalisation of Karpathy's autoresearch pattern into a reusable Model Context Protocol (MCP) server that any AI agent can drive, pointed at any domain.

Documentation Map

Core usage and setup: this README
Example catalog: example_experiments/README.md
Experiment design guide: example_experiments/Autoexperiment_Design_Guide.md
Contribution guide: CONTRIBUTING.md
Security policy: SECURITY.md
Support channels: SUPPORT.md
License and notices: LICENSE, NOTICE

Example Experiments

Shell-only experiment with CatoBot autoexperiment MCP: example_experiments/shell/
External orchestration (CatoBot autoexperiment MCP + Text2Sim MCP): example_experiments/external/DES_Text2Sim/

The Pattern

modify something → run it → measure a result → keep or discard → repeat

The server exposes this loop as a standard set of MCP tools. The domain (what gets modified, how it runs, and what gets measured) is defined entirely in a JSON config file. The agent-side logic stays the same regardless of domain.
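The loop itself can be sketched in a few lines of Python. This is a toy illustration only, not the server's code: `LoopState`, `experiment_loop`, `propose`, and `measure` are hypothetical names, and the "domain" here is a one-variable function instead of a real experiment.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LoopState:
    """Best candidate so far plus a log of every attempt (hypothetical helper)."""
    best_candidate: float
    best_metric: float
    history: List[Tuple[float, float, bool]] = field(default_factory=list)


def experiment_loop(start, propose, measure, iterations, lower_is_better=True):
    """Run the modify -> run -> measure -> keep/discard cycle a fixed number of times."""
    state = LoopState(best_candidate=start, best_metric=measure(start))  # baseline first
    for _ in range(iterations):
        candidate = propose(state.best_candidate)   # modify something
        metric = measure(candidate)                 # run it and measure a result
        if lower_is_better:
            improved = metric < state.best_metric
        else:
            improved = metric > state.best_metric
        if improved:                                # keep
            state.best_candidate = candidate
            state.best_metric = metric
        # a discarded candidate is simply never promoted to "best"
        state.history.append((candidate, metric, improved))
    return state


# Toy domain: minimise (x - 3)^2 by stepping x upward in increments of 0.5.
result = experiment_loop(
    start=0.0,
    propose=lambda x: x + 0.5,
    measure=lambda x: (x - 3.0) ** 2,
    iterations=8,
)
print(result.best_candidate, result.best_metric)  # prints: 3.0 0.0
```

In the real setup the agent plays the role of `propose`, the configured run command plays the role of `measure`, and the keep/discard decision is made by the server.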

Architecture

┌─────────────────────────────────────────────────────┐
│  AI Agent (Claude Code, Codex, etc.)                │
│                                                     │
│  Reads status → plans change → edits file →         │
│  runs experiment → checks result → keeps/discards   │
└──────────────┬──────────────────────────────────────┘
               │ MCP (stdio)
┌──────────────▼──────────────────────────────────────┐
│  autoexperiment MCP server                          │
│                                                     │
│  Tools:                                             │
│    autoexp_get_status          — session overview    │
│    autoexp_read_file           — read allowed file   │
│    autoexp_update_file         — full file replace   │
│    autoexp_patch_file          — targeted find/repl  │
│    autoexp_run_experiment      — execute + measure   │
│    autoexp_begin_experiment    — open pending record │
│    autoexp_complete_experiment — close with metric   │
│    autoexp_set_baseline        — mark as baseline    │
│    autoexp_rollback            — revert to last good │
│    autoexp_get_history         — review past runs    │
│    autoexp_run_setup           — one-time setup      │
│                                                     │
│  Resources:                                         │
│    autoexp://status        — session status (JSON)   │
│    autoexp://history       — experiment history      │
│    autoexp://file/{path}   — read allowed files      │
│                                                     │
│  Config:  autoexperiment.json  (domain adapter)     │
│  Ledger:  .autoexperiment_ledger.json (state)       │
└──────────────┬──────────────────────────────────────┘
               │ subprocess / external MCP server
┌──────────────▼──────────────────────────────────────┐
│  Your domain                                        │
│  (training script, benchmark, simulation, etc.)     │
└─────────────────────────────────────────────────────┘

Code Structure

The server is implemented as a Python package (autoexperiment_mcp/) with a thin server.py entry point:

autoexperiment-mcp-server/
├── server.py                         # Entry point: imports mcp, calls mcp.run()
└── autoexperiment_mcp/
    ├── models.py                     # Pydantic models (DomainConfig, ExperimentRecord, …)
    ├── utils.py                      # Pure utilities (git, hash, path, regex, time, coercion)
    ├── store.py                      # State I/O, snapshot management, TSV logging, query helpers
    ├── experiment.py                 # Core lifecycle: begin/complete experiment, keep decision
    ├── lifespan.py                   # Startup validation, app_lifespan context manager
    ├── app.py                        # mcp = FastMCP("autoexperiment_mcp", lifespan=…)
    ├── tools.py                      # All 11 @mcp.tool() registrations
    ├── resources.py                  # All 3 @mcp.resource() registrations
    └── __init__.py                   # Imports app + triggers tool/resource registration

Installation

Prerequisites

Python 3.12 or higher
uv package manager

Install `uv`

macOS / Linux

curl -LsSf https://astral.sh/uv/install.sh | sh

Windows (PowerShell)

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Learn more: astral-sh/uv

Clone the repository

git clone https://github.com/IamCatoBot/catobot-autoexperiment-mcp.git
cd catobot-autoexperiment-mcp

Install dependencies

uv sync

Quick Start

1. Prepare your experiment folder

Your experiment folder needs a working baseline, an evaluation script, and a config file:

my-experiment/
├── autoexperiment.json     ← config (you write this)
├── solution.py             ← editable (agent modifies this)
├── benchmark.py            ← evaluation (read-only)
└── data.csv                ← test data (read-only)

2. Create autoexperiment.json

{
  "project_name": "My Experiment",
  "description": "What you're trying to optimise",
  "workspace_dir": "/absolute/path/to/my-experiment",
  "editable_files": ["solution.py"],
  "read_only_files": ["benchmark.py", "data.csv"],
  "run_command": "python benchmark.py 2>&1",
  "timeout_seconds": 60,
  "metric_name": "rmse",
  "metric_regex": "^rmse:\\s*([\\d.]+)",
  "metric_direction": "lower",
  "use_git": true
}

3. Initialise git in the experiment folder

Git tracking is enabled by default (use_git: true). The experiment folder must be a git repository with an initial commit before the server will start.

cd /path/to/my-experiment
git init
git add -A
git commit -m "initial baseline"

4. Verify the run command works

Run your experiment command manually and check the output contains the metric in the expected format:

cd /path/to/my-experiment
python benchmark.py
# Should print something like:  rmse: 12.345678
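Before registering the server, it can help to check the regex against a sample of the output. The sketch below uses the pattern from the example config. The `extract_metric` helper is hypothetical, and the use of `re.MULTILINE` (so that `^` matches at the start of any output line) is an assumption about how the pattern is applied, not something the README states.

```python
import re
from typing import Optional

# Same pattern as in the example config, written without JSON escaping.
METRIC_REGEX = r"^rmse:\s*([\d.]+)"


def extract_metric(output: str, pattern: str = METRIC_REGEX) -> Optional[float]:
    """Return the first captured value as a float, or None if nothing usable is found."""
    # Assumption: MULTILINE, so the metric line does not have to be the first line.
    match = re.search(pattern, output, flags=re.MULTILINE)
    if match is None:
        return None
    try:
        return float(match.group(1))
    except ValueError:
        # The pattern [\d.]+ also accepts strings such as "1.2.3".
        return None


sample_output = "loading data.csv\nrmse: 12.345678\n"
print(extract_metric(sample_output))      # prints: 12.345678
print(extract_metric("RMSE = 12.3\n"))    # prints: None
```

If the second form is what your benchmark prints, either change the print statement or adjust `metric_regex` so the two agree.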

5. Register the MCP server

Recommended: pass AUTOEXPERIMENT_CONFIG pointing to your config file. MCP hosts may launch the server process from a different working directory, so an explicit path is the safest default.

Claude Code (-e for env vars):

claude mcp add autoexperiment \
  -e AUTOEXPERIMENT_CONFIG=/path/to/my-experiment/autoexperiment.json \
  -- uv run \
  --project PATH_TO_AUTOEXPERIMENT_MCP_SERVER \
  python PATH_TO_AUTOEXPERIMENT_MCP_SERVER/server.py

Codex (--env for env vars):

codex mcp add autoexperiment \
  --env AUTOEXPERIMENT_CONFIG=/path/to/my-experiment/autoexperiment.json \
  -- uv run \
  --project PATH_TO_AUTOEXPERIMENT_MCP_SERVER \
  python PATH_TO_AUTOEXPERIMENT_MCP_SERVER/server.py

Replace /path/to/my-experiment/autoexperiment.json with the absolute path to your config file.

Optional shortcut: if the server process is launched from your experiment folder and the config filename is autoexperiment.json, you can omit the environment variable.

You only need to register the MCP server once per MCP client profile; after that, reconnect normally in new sessions.

Note: Replace PATH_TO_AUTOEXPERIMENT_MCP_SERVER with the actual path to your cloned repository. If the uv command is not found, run which uv (Unix) or Get-Command uv (PowerShell) and use that full path in place of uv in the commands above (or in the "command" field, if your MCP client is configured through a JSON file).

6. Start experimenting

Launch Claude Code, Codex, or another MCP client from your experiment folder and prompt it:

Read the experiment status, review the editable and read-only files, run the baseline first, then iterate until improvements plateau and no meaningful gains remain.

Security Warning

setup_command and run_command execute shell commands on your host machine. This server does not provide sandboxing or container isolation by default.

Startup Validation

The server validates the configuration at startup and will refuse to start if:

workspace_dir does not exist or is not a directory
Any file in editable_files or read_only_files is missing
A file appears in both editable_files and read_only_files
metric_regex is not a valid regular expression
use_git is true but the workspace is not a git repository

Error messages are specific and tell you exactly what to fix.
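The checks listed above can be approximated with a small pre-flight script, which is handy for catching config mistakes before launching an MCP client. This is a sketch under assumptions, not the server's own validation: `validate_config` is a hypothetical helper, the error wording is invented, and the git check only looks for a `.git` entry in the workspace root.

```python
import re
import tempfile
from pathlib import Path
from typing import List


def validate_config(config: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means the config passed."""
    workspace_dir = config.get("workspace_dir")
    if not workspace_dir or not Path(workspace_dir).is_dir():
        return [f"workspace_dir does not exist or is not a directory: {workspace_dir!r}"]

    workspace = Path(workspace_dir)
    editable = config.get("editable_files", [])
    read_only = config.get("read_only_files", [])
    errors = []

    for name in [*editable, *read_only]:
        if not (workspace / name).is_file():
            errors.append(f"listed file is missing: {name}")
    for name in sorted(set(editable) & set(read_only)):
        errors.append(f"file is both editable and read-only: {name}")

    if "metric_regex" in config:
        try:
            re.compile(config["metric_regex"])
        except re.error as exc:
            errors.append(f"metric_regex is not a valid regular expression: {exc}")

    # Assumption: a .git entry in the workspace root is enough to call it a repository.
    if config.get("use_git", True) and not (workspace / ".git").exists():
        errors.append("use_git is true but the workspace is not a git repository")
    return errors


# Usage: build a throwaway workspace with several deliberate mistakes.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "solution.py").write_text("x = 1\n")
    broken = {
        "workspace_dir": tmp,
        "editable_files": ["solution.py"],
        "read_only_files": ["solution.py", "benchmark.py"],
        "metric_regex": "([",
        "use_git": True,
    }
    broken_errors = validate_config(broken)

    (Path(tmp) / ".git").mkdir()
    fixed = {"workspace_dir": tmp, "editable_files": ["solution.py"],
             "metric_regex": r"^rmse:\s*([\d.]+)"}
    fixed_errors = validate_config(fixed)

for problem in broken_errors:
    print(problem)
```

The broken config trips four checks (a missing file, an overlapping file, an invalid regex, and no git repository), while the fixed one returns an empty list.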

Configuration

Everything domain-specific lives in autoexperiment.json:

Field	Required	Default	Description
`project_name`	yes		Human-readable name
`description`	no	`""`	What you're trying to achieve
`workspace_dir`	yes		Absolute path to the experiment folder
`editable_files`	yes		Files the agent is allowed to modify (at least one)
`read_only_files`	no	`[]`	Files the agent can read but not change
`execution_mode`	no	`"hybrid"`	`"shell"`, `"external"`, or `"hybrid"`
`run_command`	shell/hybrid		Shell command to run one experiment
`timeout_seconds`	no	`300`	Max time per experiment (10–7200s)
`setup_command`	no	`null`	One-time setup (deps, data download, etc.)
`metric_name`	yes		Name of the metric being optimised
`metric_regex`	shell/hybrid		Regex with one capture group to extract a float from stdout
`metric_direction`	yes		`"lower"` or `"higher"`
`require_baseline_first`	no	`true`	Require a baseline experiment before non-baseline runs
`use_git`	no	`true`	Track experiments as git commits. Requires the workspace to be a git repo with an initial commit.
`git_branch_prefix`	no	`"autoexp"`	Prefix for experiment branches
`keep_policy`	no	see below	Multi-gate keep/discard policy

Keep Policy

The keep_policy object controls when a completed experiment is kept vs discarded. All gates must pass for a run to be kept.

Field	Default	Description
`required_true_keys`	`[]`	Metadata keys that must be boolean `true`
`numeric_min`	`{}`	Metadata keys with a floor value (e.g. `{"utilization": 45}`)
`numeric_max`	`{}`	Metadata keys with a ceiling value (e.g. `{"latency_ms": 250}`)
`require_numeric_keys_present`	`true`	If `true`, missing keys in `numeric_min`/`numeric_max` cause discard
`allow_equal_metric_if_simpler`	`true`	Keep a tied run if its `complexity_score` is lower
`equal_metric_tolerance`	`1e-9`	Tolerance for treating two metric values as equal
`complexity_key`	`"complexity_score"`	Metadata key used for complexity tie-breaking

The agent sees the full keep_policy in autoexp_get_status and receives a required_metadata_keys reminder in every autoexp_begin_experiment response, so it always knows which keys to include in the metadata argument when calling autoexp_complete_experiment.
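To make the gate semantics concrete, here is one way the policy fields in the table could be evaluated. It is a sketch, not the server's implementation: `should_keep` and its return shape are hypothetical, and the ordering (gates first, then the tie check, then the metric comparison) is an assumption.

```python
from typing import Optional, Tuple

DEFAULT_POLICY = {
    "required_true_keys": [],
    "numeric_min": {},
    "numeric_max": {},
    "require_numeric_keys_present": True,
    "allow_equal_metric_if_simpler": True,
    "equal_metric_tolerance": 1e-9,
    "complexity_key": "complexity_score",
}


def should_keep(metric: float, best_metric: Optional[float], direction: str,
                metadata: dict, policy: dict,
                best_complexity: Optional[float] = None) -> Tuple[bool, str]:
    """Apply every gate; return (kept, reason). All gates must pass."""
    p = {**DEFAULT_POLICY, **policy}

    for key in p["required_true_keys"]:
        if metadata.get(key) is not True:
            return False, f"{key} is not boolean true"

    for bounds, is_floor in ((p["numeric_min"], True), (p["numeric_max"], False)):
        for key, limit in bounds.items():
            if key not in metadata:
                if p["require_numeric_keys_present"]:
                    return False, f"metadata key missing: {key}"
                continue
            value = metadata[key]
            if (value < limit) if is_floor else (value > limit):
                return False, f"{key}={value} violates limit {limit}"

    if best_metric is None:
        return True, "first result"

    if abs(metric - best_metric) <= p["equal_metric_tolerance"]:
        complexity = metadata.get(p["complexity_key"])
        if (p["allow_equal_metric_if_simpler"] and complexity is not None
                and best_complexity is not None and complexity < best_complexity):
            return True, "tied metric, lower complexity"
        return False, "tied metric, not simpler"

    improved = metric < best_metric if direction == "lower" else metric > best_metric
    return (True, "improved") if improved else (False, "regressed")


policy = {"required_true_keys": ["converged"],
          "numeric_min": {"utilization": 45},
          "numeric_max": {"latency_ms": 250}}
good = {"converged": True, "utilization": 60, "latency_ms": 200}
print(should_keep(10.0, 12.0, "lower", good, policy))
print(should_keep(10.0, 12.0, "lower", {"converged": True, "utilization": 60}, policy))
```

The second call is discarded even though the metric improved, because `latency_ms` is absent and `require_numeric_keys_present` defaults to `true`.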

Tool Reference

Tool	Purpose	Destructive?
`autoexp_get_status`	Session overview, best score, editable files, keep_policy gates	No
`autoexp_read_file`	Read any allowed file	No
`autoexp_update_file`	Replace entire file contents	Yes
`autoexp_patch_file`	Targeted find-and-replace	No
`autoexp_run_experiment`	Execute the run command, extract metric (shell mode)	No (but slow)
`autoexp_begin_experiment`	Open a pending experiment record (external/hybrid mode)	No
`autoexp_complete_experiment`	Close a pending experiment with metric + metadata (external/hybrid mode)	No
`autoexp_set_baseline`	Mark an existing completed experiment as the baseline	No
`autoexp_rollback`	Revert files to a specific experiment's state via git	Yes
`autoexp_get_history`	Review past experiments and results	No
`autoexp_run_setup`	Run one-time setup command	No

How the Loop Works

Shell mode (`execution_mode: "shell"`)

Agent calls autoexp_get_status → learns the domain, metric, and current best.
Agent calls autoexp_read_file → reads the editable file(s) to understand the code.
Agent calls autoexp_patch_file or autoexp_update_file → makes a change.
Agent calls autoexp_run_experiment with a hypothesis → server runs it, extracts metric.
If improved: server auto-commits via git and records the commit hash. Agent plans next experiment.
If regressed or crashed: agent calls autoexp_rollback, then tries something else.
Agent calls autoexp_get_history periodically to review trends and avoid repetition.
Repeat indefinitely.
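The "server runs it, extracts metric" step amounts to running a shell command under a timeout and parsing its output. The sketch below shows one way to do that with the standard library. `run_once`, the status strings, and the returned dict are hypothetical and do not describe the actual response format of autoexp_run_experiment.

```python
import re
import subprocess
from typing import Optional


def run_once(command: str, metric_regex: str, timeout_seconds: float,
             cwd: Optional[str] = None) -> dict:
    """Run one experiment command and classify the outcome."""
    try:
        proc = subprocess.run(
            command,
            shell=True,              # run_command is a shell command line
            cwd=cwd,                 # would be workspace_dir in a real setup
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "metric": None}

    if proc.returncode != 0:
        return {"status": "crashed", "metric": None, "returncode": proc.returncode}

    # Assumption: MULTILINE, so ^ matches at the start of any output line.
    match = re.search(metric_regex, proc.stdout, flags=re.MULTILINE)
    if match is None:
        return {"status": "no_metric", "metric": None}
    try:
        return {"status": "ok", "metric": float(match.group(1))}
    except ValueError:
        return {"status": "no_metric", "metric": None}


pattern = r"^rmse:\s*([\d.]+)"
ok_result = run_once("echo rmse: 1.5", pattern, timeout_seconds=10)
crash_result = run_once("exit 3", pattern, timeout_seconds=10)
print(ok_result)
print(crash_result)
```

A timeout, a crash, and a missing metric are kept as separate outcomes so that whoever drives the loop can tell a broken benchmark from a slow one.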

External / hybrid mode (`execution_mode: "external"` or `"hybrid"`)

Use this when another MCP server (e.g. a physics sim, a cloud evaluator) runs the experiment.

Agent calls autoexp_get_status → note the keep_policy field — it lists every metadata key the policy will gate on.
Agent edits the editable file(s) via autoexp_update_file / autoexp_patch_file.
Agent calls autoexp_begin_experiment → receives experiment_id and a required_metadata_keys reminder.
Agent triggers the external system and waits for results.
Agent assembles a metadata dict containing all keys from required_metadata_keys (both from simulation output and any input-parameter constraints defined in numeric_min/numeric_max).
Agent calls autoexp_complete_experiment with experiment_id, metric_value, and the assembled metadata dict.
Server evaluates the keep policy and responds with kept, keep_reason, is_best.
If not kept: agent calls autoexp_rollback and adjusts its approach.

Important: numeric_min/numeric_max gates often reference input parameters (e.g. service-time bounds from a config file) rather than simulation outputs. You must read those values yourself and include them in metadata alongside the simulator's results.
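Assembling the metadata dict can be sketched as a merge of two sources, with an explicit list of anything still missing. The helper name `assemble_metadata` and the example keys (`utilization`, `service_time_mean`, `mean_queue_length`) are hypothetical; the real keys come from required_metadata_keys in your own setup.

```python
from typing import Any, Dict, Iterable, List, Tuple


def assemble_metadata(required_keys: Iterable[str],
                      simulator_output: Dict[str, Any],
                      input_parameters: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Collect each required key from the simulator output or, failing that, the inputs.

    Returns (metadata, missing), where missing lists keys found in neither source.
    """
    metadata: Dict[str, Any] = {}
    missing: List[str] = []
    for key in required_keys:
        if key in simulator_output:
            metadata[key] = simulator_output[key]
        elif key in input_parameters:
            metadata[key] = input_parameters[key]
        else:
            missing.append(key)
    return metadata, missing


# Hypothetical example: one gate on an output, one on an input parameter.
required = ["utilization", "service_time_mean"]
sim_output = {"utilization": 62.5, "mean_queue_length": 4.1}
inputs = {"service_time_mean": 3.0, "arrival_rate": 0.25}

metadata, missing = assemble_metadata(required, sim_output, inputs)
print(metadata)  # prints: {'utilization': 62.5, 'service_time_mean': 3.0}
print(missing)   # prints: []
```

Checking `missing` before calling autoexp_complete_experiment avoids a discard caused purely by an omitted key.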

Design Principles

Domain-agnostic. The server knows nothing about ML, sorting, prompts, or any specific domain. All domain knowledge lives in the config file and the agent's reasoning.
Single metric. One number determines success. If your problem needs multiple metrics, your run command should combine them into a single score.
Fixed time budget. Each experiment gets the same wall-clock timeout, making results comparable.
Git as memory. Every improvement is committed with its commit hash recorded. Every regression can be rolled back to a specific experiment. The full history is always recoverable.
Agent autonomy. The server provides tools, not opinions. The agent decides what to try, when to rollback, and when to change strategy.

Maintainer

The CatoBot autoexperiment MCP Server is an open source project developed and maintained by Nikolaos Maniatis, The Cato Bot Company Limited.

Disclaimer

Work in progress: the software is actively evolving; features may change and some functionality may be incomplete.
LLM-powered workflow: model/code quality depends on the capabilities of the LLM driving the loop.
Validate outputs: always critically review and validate generated models, code changes, and metrics before relying on results.

Citation

For academic use, cite:

Maniatis, N. (2026). CatoBot autoexperiment MCP Server (v1.0.0). https://github.com/IamCatoBot/catobot-autoexperiment-mcp. Copyright The Cato Bot Company Limited. Licensed under Apache 2.0.

from github.com/IamCatoBot/catobot-autoexperiment-mcp

Install CatoBot Autoexperiment Server in Claude Desktop, Claude Code, Cursor

Recommended · one command, all IDEs

unyly install catobot-autoexperiment-mcp-server

Installs into Claude Desktop, Claude Code, Cursor, and VS Code; it handles npx, uvx, and building from source on its own.

First time? Install the CLI: curl -fsSL https://unyly.org/install | sh

Or configure manually

Run in the terminal:

claude mcp add catobot-autoexperiment-mcp-server -- uvx --from git+https://github.com/IamCatoBot/catobot-autoexperiment-mcp catobot-autoexperiment-mcp-server
