A domain-agnostic Model Context Protocol (MCP) server for autonomous experimentation. It generalises Karpathy's autoresearch pattern into a reusable server that any AI agent can drive, pointed at any domain defined by a JSON configuration.
The pattern is a simple loop:
modify something → run it → measure a result → keep or discard → repeat
The server exposes this loop as a standard set of MCP tools. The domain (what gets modified, how it runs, and what gets measured) is defined entirely in a JSON config file. The agent-side logic stays the same regardless of domain.
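As a rough illustration of the loop itself (not of the server's code), the sketch below runs the same keep-or-discard cycle on a toy problem. The names `run_loop`, `propose`, and `evaluate` are hypothetical and exist only for this example.

```python
import random
from typing import Callable


def run_loop(
    propose: Callable[[float], float],
    evaluate: Callable[[float], float],
    start: float,
    steps: int,
    direction: str = "lower",
) -> tuple[float, float]:
    """Keep a candidate only if it improves the metric; otherwise discard it."""
    best_candidate = start
    best_metric = evaluate(start)  # baseline run
    for _ in range(steps):
        candidate = propose(best_candidate)  # modify something
        metric = evaluate(candidate)         # run it and measure a result
        if direction == "lower":
            improved = metric < best_metric
        else:
            improved = metric > best_metric
        if improved:                         # keep
            best_candidate, best_metric = candidate, metric
        # otherwise discard: state stays at the last good candidate
    return best_candidate, best_metric


# Toy domain: find x that minimises (x - 3)^2 by random perturbation.
rng = random.Random(0)
best_x, best_err = run_loop(
    propose=lambda x: x + rng.uniform(-1.0, 1.0),
    evaluate=lambda x: (x - 3.0) ** 2,
    start=0.0,
    steps=200,
)
print(best_x, best_err)
```

In the real setup the agent plays the role of `propose`, and the configured run command plus metric extraction play the role of `evaluate`.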
┌─────────────────────────────────────────────────────┐
│ AI Agent (Claude Code, Codex, etc.) │
│ │
│ Reads status → plans change → edits file → │
│ runs experiment → checks result → keeps/discards │
└──────────────┬──────────────────────────────────────┘
│ MCP (stdio)
┌──────────────▼──────────────────────────────────────┐
│ autoexperiment MCP server │
│ │
│ Tools: │
│ autoexp_get_status — session overview │
│ autoexp_read_file — read allowed file │
│ autoexp_update_file — full file replace │
│ autoexp_patch_file — targeted find/repl │
│ autoexp_run_experiment — execute + measure │
│ autoexp_begin_experiment — open pending record │
│ autoexp_complete_experiment — close with metric │
│ autoexp_set_baseline — mark as baseline │
│ autoexp_rollback — revert to last good │
│ autoexp_get_history — review past runs │
│ autoexp_run_setup — one-time setup │
│ │
│ Resources: │
│ autoexp://status — session status (JSON) │
│ autoexp://history — experiment history │
│ autoexp://file/{path} — read allowed files │
│ │
│ Config: autoexperiment.json (domain adapter) │
│ Ledger: .autoexperiment_ledger.json (state) │
└──────────────┬──────────────────────────────────────┘
│ subprocess / external MCP server
┌──────────────▼──────────────────────────────────────┐
│ Your domain │
│ (training script, benchmark, simulation, etc.) │
└─────────────────────────────────────────────────────┘
The server is implemented as a Python package (autoexperiment_mcp/) with a thin server.py entry point:
autoexperiment-mcp-server/
├── server.py              # Entry point: imports mcp, calls mcp.run()
└── autoexperiment_mcp/
    ├── models.py          # Pydantic models (DomainConfig, ExperimentRecord, …)
    ├── utils.py           # Pure utilities (git, hash, path, regex, time, coercion)
    ├── store.py           # State I/O, snapshot management, TSV logging, query helpers
    ├── experiment.py      # Core lifecycle: begin/complete experiment, keep decision
    ├── lifespan.py        # Startup validation, app_lifespan context manager
    ├── app.py             # mcp = FastMCP("autoexperiment_mcp", lifespan=…)
    ├── tools.py           # All 11 @mcp.tool() registrations
    ├── resources.py       # All 3 @mcp.resource() registrations
    └── __init__.py        # Imports app + triggers tool/resource registration
Install uv (macOS/Linux):
curl -LsSf https://astral.sh/uv/install.sh | sh
Install uv (Windows PowerShell):
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
Learn more: astral-sh/uv
git clone https://github.com/IamCatoBot/catobot-autoexperiment-mcp.git
cd catobot-autoexperiment-mcp
uv sync
Your experiment folder needs a working baseline, an evaluation script, and a config file:
my-experiment/
├── autoexperiment.json ← config (you write this)
├── solution.py ← editable (agent modifies this)
├── benchmark.py ← evaluation (read-only)
└── data.csv ← test data (read-only)
{
"project_name": "My Experiment",
"description": "What you're trying to optimise",
"workspace_dir": "/absolute/path/to/my-experiment",
"editable_files": ["solution.py"],
"read_only_files": ["benchmark.py", "data.csv"],
"run_command": "python benchmark.py 2>&1",
"timeout_seconds": 60,
"metric_name": "rmse",
"metric_regex": "^rmse:\\s*([\\d.]+)",
"metric_direction": "lower",
"use_git": true
}
Git tracking is enabled by default (use_git: true). The experiment folder must be a git repository with an initial commit before the server will start.
cd /path/to/my-experiment
git init
git add -A
git commit -m "initial baseline"
Run your experiment command manually and check the output contains the metric in the expected format:
cd /path/to/my-experiment
python benchmark.py
# Should print something like: rmse: 12.345678
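To check the pattern before starting the server, you can test metric_regex against a sample of your benchmark output. This is a minimal sketch: `extract_metric` and the sample output are hypothetical, and the use of `re.MULTILINE` is an assumption about how the server applies the regex. Without multiline matching, the `^` anchor only matches at the very start of the output.

```python
import re
from typing import Optional

# Same pattern as in autoexperiment.json, without the JSON escaping.
METRIC_REGEX = r"^rmse:\s*([\d.]+)"


def extract_metric(output: str, pattern: str = METRIC_REGEX) -> Optional[float]:
    """Return the first captured metric as a float, or None if nothing matches."""
    # Assumption: ^ should match at the start of any line, not only the first.
    match = re.search(pattern, output, flags=re.MULTILINE)
    return float(match.group(1)) if match else None


# Hypothetical benchmark output with log lines before the metric.
sample_output = "loading data.csv\nfitting model\nrmse: 12.345678\n"
print(extract_metric(sample_output))  # 12.345678
```

If this prints None for your real output, adjust either the regex or the line your benchmark prints.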
Recommended: pass AUTOEXPERIMENT_CONFIG pointing to your config file. MCP hosts may launch the server process from a different working directory, so an explicit path is the safest default.
Claude Code (-e for env vars):
claude mcp add autoexperiment \
-e AUTOEXPERIMENT_CONFIG=/path/to/my-experiment/autoexperiment.json \
-- uv run \
--project PATH_TO_AUTOEXPERIMENT_MCP_SERVER \
python PATH_TO_AUTOEXPERIMENT_MCP_SERVER/server.py
Codex (--env for env vars):
codex mcp add autoexperiment \
--env AUTOEXPERIMENT_CONFIG=/path/to/my-experiment/autoexperiment.json \
-- uv run \
--project PATH_TO_AUTOEXPERIMENT_MCP_SERVER \
python PATH_TO_AUTOEXPERIMENT_MCP_SERVER/server.py
Replace /path/to/my-experiment/autoexperiment.json with the absolute path to your config file.
Optional shortcut: if the server process is launched from your experiment folder and the config filename is autoexperiment.json, you can omit the environment variable.
You only need to register the MCP server once per MCP client profile; after that, reconnect normally in new sessions.
Note: Replace PATH_TO_AUTOEXPERIMENT_MCP_SERVER with the actual path to your cloned repository. If the uv command is not found, run which uv (Unix) or Get-Command uv (PowerShell) and use the full path in the "command" field.
Launch Claude Code, Codex, or another MCP client from your experiment folder and prompt it:
Read the experiment status, review the editable and read-only files, run the baseline first, then iterate until improvements plateau and no meaningful gains remain.
setup_command and run_command execute shell commands on your host machine. This server does not provide sandboxing or container isolation by default.
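To make the implication concrete, here is a rough sketch of what running a configured command with a timeout can look like in Python. It is not the server's implementation; `run_with_timeout` and the demo command are hypothetical. Whatever string is passed in is handed to the host shell, with the permissions of the calling process.

```python
import subprocess
import sys
from typing import Optional


def run_with_timeout(command: str, workspace_dir: str, timeout_seconds: int) -> Optional[str]:
    """Run a shell command in the workspace; return its output, or None on timeout."""
    try:
        completed = subprocess.run(
            command,
            shell=True,                # the command string goes to the host shell
            cwd=workspace_dir,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout, like "2>&1"
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return None
    return completed.stdout


# Hypothetical stand-in for a real benchmark command.
demo_command = f"\"{sys.executable}\" -c \"print('rmse: 1.5')\""
output = run_with_timeout(demo_command, ".", 60)
print(output)
```

If you need isolation, consider launching the server itself inside a container or a restricted user account.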
The server validates the configuration at startup and will refuse to start if:
- workspace_dir does not exist or is not a directory
- a file listed in editable_files or read_only_files is missing
- a file appears in both editable_files and read_only_files
- metric_regex is not a valid regular expression
- use_git is true but the workspace is not a git repository

Error messages are specific and tell you exactly what to fix.
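You can run similar checks yourself before registering the server. The sketch below is an approximation of those startup checks, not the server's validation code: `validate_config` is hypothetical, the missing-file and overlap checks reflect one reading of the conditions listed above, and a `.git` entry in the workspace is used as a simple stand-in for "is a git repository".

```python
import re
import tempfile
from pathlib import Path


def validate_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no problems found."""
    errors: list[str] = []
    workspace_dir = cfg.get("workspace_dir")
    if not workspace_dir or not Path(workspace_dir).is_dir():
        return [f"workspace_dir is not an existing directory: {workspace_dir!r}"]
    workspace = Path(workspace_dir)

    editable = cfg.get("editable_files", [])
    read_only = cfg.get("read_only_files", [])
    for name in sorted(set(editable) | set(read_only)):
        if not (workspace / name).is_file():
            errors.append(f"file not found in workspace: {name}")
    for name in sorted(set(editable) & set(read_only)):
        errors.append(f"file is both editable and read-only: {name}")

    try:
        re.compile(cfg.get("metric_regex", ""))
    except re.error as exc:
        errors.append(f"metric_regex is not a valid regular expression: {exc}")

    # Assumption: a .git entry is enough to treat the workspace as a git repo.
    if cfg.get("use_git", True) and not (workspace / ".git").exists():
        errors.append("use_git is true but the workspace is not a git repository")
    return errors


# Deliberately broken config in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "solution.py").write_text("print('hello')\n")
    errors = validate_config({
        "workspace_dir": tmp,
        "editable_files": ["solution.py"],
        "read_only_files": ["solution.py", "benchmark.py"],
        "metric_regex": "([",
        "use_git": True,
    })
for problem in errors:
    print(problem)
```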
Everything domain-specific lives in autoexperiment.json:
| Field | Required | Default | Description |
|---|---|---|---|
| project_name | yes | | Human-readable name |
| description | no | "" | What you're trying to achieve |
| workspace_dir | yes | | Absolute path to the experiment folder |
| editable_files | yes | | Files the agent is allowed to modify (at least one) |
| read_only_files | no | [] | Files the agent can read but not change |
| execution_mode | no | "hybrid" | "shell", "external", or "hybrid" |
| run_command | shell/hybrid | | Shell command to run one experiment |
| timeout_seconds | no | 300 | Max time per experiment (10–7200s) |
| setup_command | no | null | One-time setup (deps, data download, etc.) |
| metric_name | yes | | Name of the metric being optimised |
| metric_regex | shell/hybrid | | Regex with one capture group to extract a float from stdout |
| metric_direction | yes | | "lower" or "higher" |
| require_baseline_first | no | true | Require a baseline experiment before non-baseline runs |
| use_git | no | true | Track experiments as git commits. Requires the workspace to be a git repo with an initial commit. |
| git_branch_prefix | no | "autoexp" | Prefix for experiment branches |
| keep_policy | no | see below | Multi-gate keep/discard policy |
The keep_policy object controls when a completed experiment is kept vs discarded. All gates must pass for a run to be kept.
| Field | Default | Description |
|---|---|---|
| required_true_keys | [] | Metadata keys that must be boolean true |
| numeric_min | {} | Metadata keys with a floor value (e.g. {"utilization": 45}) |
| numeric_max | {} | Metadata keys with a ceiling value (e.g. {"latency_ms": 250}) |
| require_numeric_keys_present | true | If true, missing keys in numeric_min/numeric_max cause discard |
| allow_equal_metric_if_simpler | true | Keep a tied run if its complexity_score is lower |
| equal_metric_tolerance | 1e-9 | Tolerance for treating two metric values as equal |
| complexity_key | "complexity_score" | Metadata key used for complexity tie-breaking |
The agent sees the full keep_policy in autoexp_get_status and receives a required_metadata_keys reminder in every autoexp_begin_experiment response — so it always knows exactly what to include in the metadata argument when calling autoexp_complete_experiment.
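The gates can be read as a sequence of checks followed by a metric comparison. The sketch below is one plausible interpretation of the fields in the table, not the server's decision code. `should_keep`, the order of the checks, the reason strings, and the metadata key `converged` are all assumptions made for illustration.

```python
from typing import Any, Optional

DEFAULT_POLICY: dict[str, Any] = {
    "required_true_keys": [],
    "numeric_min": {},
    "numeric_max": {},
    "require_numeric_keys_present": True,
    "allow_equal_metric_if_simpler": True,
    "equal_metric_tolerance": 1e-9,
    "complexity_key": "complexity_score",
}


def should_keep(
    metric: float,
    best_metric: Optional[float],
    direction: str,
    metadata: dict[str, Any],
    best_metadata: dict[str, Any],
    policy: dict[str, Any],
) -> tuple[bool, str]:
    """Return (kept, reason). Every gate must pass before the metric is compared."""
    p = {**DEFAULT_POLICY, **policy}
    for key in p["required_true_keys"]:
        if metadata.get(key) is not True:
            return False, f"{key} is not true"
    for bounds, label, violates in (
        (p["numeric_min"], "below min", lambda value, bound: value < bound),
        (p["numeric_max"], "above max", lambda value, bound: value > bound),
    ):
        for key, bound in bounds.items():
            if key not in metadata:
                if p["require_numeric_keys_present"]:
                    return False, f"missing metadata key {key}"
                continue
            if violates(metadata[key], bound):
                return False, f"{key} {label} {bound}"
    if best_metric is None:
        return True, "first run"
    if abs(metric - best_metric) <= p["equal_metric_tolerance"]:
        ck = p["complexity_key"]
        simpler = (
            p["allow_equal_metric_if_simpler"]
            and ck in metadata
            and ck in best_metadata
            and metadata[ck] < best_metadata[ck]
        )
        return (True, "tied but simpler") if simpler else (False, "tied, not simpler")
    improved = metric < best_metric if direction == "lower" else metric > best_metric
    return (True, "improved") if improved else (False, "no improvement")


policy = {"required_true_keys": ["converged"], "numeric_max": {"latency_ms": 250}}
better = should_keep(11.8, 12.3, "lower", {"converged": True, "latency_ms": 180}, {}, policy)
too_slow = should_keep(11.8, 12.3, "lower", {"converged": True, "latency_ms": 300}, {}, policy)
tied = should_keep(
    12.3, 12.3, "lower",
    {"converged": True, "latency_ms": 100, "complexity_score": 3},
    {"complexity_score": 5},
    policy,
)
print(better, too_slow, tied)
```

Note that in this reading a run with a better metric is still discarded when any gate fails, which matches the "all gates must pass" rule.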
| Tool | Purpose | Destructive? |
|---|---|---|
| autoexp_get_status | Session overview, best score, editable files, keep_policy gates | No |
| autoexp_read_file | Read any allowed file | No |
| autoexp_update_file | Replace entire file contents | Yes |
| autoexp_patch_file | Targeted find-and-replace | No |
| autoexp_run_experiment | Execute the run command, extract metric (shell mode) | No (but slow) |
| autoexp_begin_experiment | Open a pending experiment record (external/hybrid mode) | No |
| autoexp_complete_experiment | Close a pending experiment with metric + metadata (external/hybrid mode) | No |
| autoexp_set_baseline | Mark an existing completed experiment as the baseline | No |
| autoexp_rollback | Revert files to a specific experiment's state via git | Yes |
| autoexp_get_history | Review past experiments and results | No |
| autoexp_run_setup | Run one-time setup command | No |
Shell mode (execution_mode: "shell")

1. The agent calls autoexp_get_status → learns the domain, metric, and current best.
2. The agent calls autoexp_read_file → reads the editable file(s) to understand the code.
3. The agent calls autoexp_patch_file or autoexp_update_file → makes a change.
4. The agent calls autoexp_run_experiment with a hypothesis → server runs it, extracts metric.
5. If the run is not kept, the agent calls autoexp_rollback, then tries something else.
6. The agent calls autoexp_get_history periodically to review trends and avoid repetition.

External or hybrid mode (execution_mode: "external" or "hybrid")

Use this when another MCP server (e.g. a physics sim, a cloud evaluator) runs the experiment.

1. The agent calls autoexp_get_status → notes the keep_policy field, which lists every metadata key the policy will gate on.
2. The agent edits files via autoexp_update_file / autoexp_patch_file.
3. The agent calls autoexp_begin_experiment → receives experiment_id and a required_metadata_keys reminder.
4. The agent runs the experiment through the external server and builds a metadata dict containing all keys from required_metadata_keys (both from simulation output and any input-parameter constraints defined in numeric_min/numeric_max).
5. The agent calls autoexp_complete_experiment with experiment_id, metric_value, and the assembled metadata dict.
6. The server responds with kept, keep_reason, is_best.
7. If the run is not kept, the agent calls autoexp_rollback and adjusts its approach.

Important: numeric_min/numeric_max gates often reference input parameters (e.g. service-time bounds from a config file) rather than simulation outputs. You must read those values yourself and include them in metadata alongside the simulator's results.
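Assembling the metadata dict is the step most likely to go wrong in external or hybrid mode, so it can help to check it before calling autoexp_complete_experiment. The following is a minimal sketch under stated assumptions: `assemble_metadata` is a hypothetical helper, and the key names `utilization` and `service_time_min` are made-up examples of a simulation output and an input parameter.

```python
from typing import Any


def assemble_metadata(
    required_keys: list[str],
    simulation_output: dict[str, Any],
    input_parameters: dict[str, Any],
) -> dict[str, Any]:
    """Merge simulator results with input parameters and verify required keys exist."""
    # Simulation output wins if both sources define the same key.
    merged = {**input_parameters, **simulation_output}
    missing = [key for key in required_keys if key not in merged]
    if missing:
        raise KeyError(f"metadata is missing required keys: {missing}")
    return merged


# Hypothetical values: one key from the simulator, one read from an input file.
required_metadata_keys = ["utilization", "service_time_min"]
metadata = assemble_metadata(
    required_metadata_keys,
    simulation_output={"utilization": 52.0},
    input_parameters={"service_time_min": 4.0},
)
print(metadata)
```

Failing early on a missing key avoids a discard caused by require_numeric_keys_present rather than by the experiment itself.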
The CatoBot autoexperiment MCP Server is an open source project developed and maintained by Nikolaos Maniatis, The Cato Bot Company Limited.
For academic use, cite:
Maniatis, N. (2026). CatoBot autoexperiment MCP Server (v1.0.0). https://github.com/IamCatoBot/catobot-autoexperiment-mcp. Copyright The Cato Bot Company Limited. Licensed under Apache 2.0.