Enables AI assistants to capture screenshots and control desktop input (mouse, keyboard) to see and interact with your screen. Features user-first safety controls including automatic pause on user activity and app allowlists to restrict interactions to approved applications only.
AI-native test agent that sees your app like a real user — 15x faster than Claude Code, without touching your screen.
An MCP server for autonomous visual testing. The AI plans test steps in natural language, the server executes them all without LLM round-trips. Works in the background via CDP (Chrome) or Accessibility API (native apps).
```python
# The AI plans. The server executes. No LLM round-trips. Background. 3 seconds.
run_test(name="Login Flow", steps=[
    {"find": "Email", "action": "click_and_type", "text": "[email protected]"},
    {"find": "Password", "action": "click_and_type", "text": "secret123"},
    {"find": "Log in", "action": "click"},
    {"verify": "Dashboard"},
])
# → ✅ 4/4 passed in 800ms. Screenshot evidence attached.
```
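The speed claim rests on one idea: the LLM writes the whole plan once, and a local loop runs it. A minimal sketch of such a loop, assuming a hypothetical `Driver` with `locate`, `click` and `type_text` methods (these names are illustrations, not screen-agent's API) and an in-memory `FakeDriver` so it runs without a screen:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


class Driver(Protocol):
    """Hypothetical UI driver; a real one would sit on OCR, CDP or input backends."""

    def locate(self, label: str) -> tuple[int, int] | None: ...
    def click(self, x: int, y: int) -> None: ...
    def type_text(self, text: str) -> None: ...


@dataclass
class PlanResult:
    passed: int = 0
    failures: list[str] = field(default_factory=list)


def run_plan(driver: Driver, steps: list[dict]) -> PlanResult:
    """Run every step locally; no model call happens between steps."""
    result = PlanResult()
    for i, step in enumerate(steps, 1):
        if "verify" in step:
            ok = driver.locate(step["verify"]) is not None
        else:
            pos = driver.locate(step["find"])
            ok = pos is not None
            if ok:
                driver.click(*pos)
                if step.get("action") == "click_and_type":
                    driver.type_text(step["text"])
        if ok:
            result.passed += 1
        else:
            result.failures.append(f"step {i} failed: {step}")
    return result


class FakeDriver:
    """In-memory stand-in that 'finds' labels from a dict and records actions."""

    def __init__(self, labels: dict[str, tuple[int, int]]) -> None:
        self.labels = labels
        self.actions: list[tuple] = []

    def locate(self, label: str) -> tuple[int, int] | None:
        return self.labels.get(label)

    def click(self, x: int, y: int) -> None:
        self.actions.append(("click", x, y))

    def type_text(self, text: str) -> None:
        self.actions.append(("type", text))
```

Each step costs only local work (locating and injecting input), which is why per-step latency can drop far below a model round-trip.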
Every testing tool makes you choose: fast but fragile (Playwright) or smart but slow (Claude Code computer use). Screen Agent is both:
- `run_test()` executes ALL steps server-side, with no LLM round-trips: about 150 ms/step vs. Claude Code's 1-3 s/step, roughly 15x faster.
- `act` + `eval_js`: `act` returns a screenshot for the LLM to analyze visually, then executes at the coordinates the LLM provides; `eval_js` runs JavaScript via CDP for assertions. 5 tests in 0.6 s.
- `window_scope` + CDP lets you test Chrome apps on any macOS Space without touching the user's screen. For native apps, it can test windows that sit behind other windows on the same Space.

```
┌──────────────────────────────────┐
│  MCP Layer                       │  22 tools via Model Context Protocol
├──────────────────────────────────┤
│  Engine Layer                    │  InputChain (fallback) + Guardian (safety)
│                                  │  + WindowSession (background testing)
├──────────────────────────────────┤
│  Platform Layer                  │  Protocol-based backends
│  AX → CGEvent → pyautogui        │  macOS / Windows / Linux
└──────────────────────────────────┘
```
The core design challenge: pyautogui works for ~80% of apps but fails for game engines and many Electron apps. Screen Agent solves this with a Chain of Responsibility pattern:
| Priority | Backend | Method | Best For |
|---|---|---|---|
| 1 | AX | `AXPerformAction` | Native macOS apps; semantic, no coordinates needed |
| 2 | CGEvent | `CGEventPost` | Games, Electron; native OS event injection |
| 3 | pyautogui | Python wrapper | Cross-platform fallback |
Each backend implements the same InputBackend protocol. If one fails, the chain automatically tries the next. All attempts are logged with telemetry for observability.
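A minimal sketch of this Chain of Responsibility, assuming a simplified `InputBackend` protocol with a single `click` method that returns `True` on success (the real protocol presumably covers typing, scrolling and dragging too, and its method names may differ). `StubBackend` is a test double, not a real backend:

```python
from __future__ import annotations

import logging
import time
from typing import Protocol

log = logging.getLogger("input_chain")


class InputBackend(Protocol):
    name: str

    def click(self, x: int, y: int) -> bool:
        """Return True on success; False (or an exception) defers to the next backend."""
        ...


class InputChain:
    """Try each backend in priority order until one succeeds."""

    def __init__(self, backends: list[InputBackend]) -> None:
        self.backends = backends
        self.attempts: list[tuple[str, bool, float]] = []  # (backend, ok, seconds)

    def click(self, x: int, y: int) -> str:
        for backend in self.backends:
            start = time.perf_counter()
            try:
                ok = backend.click(x, y)
            except Exception as exc:  # a broken backend must not abort the chain
                log.warning("%s raised %r", backend.name, exc)
                ok = False
            elapsed = time.perf_counter() - start
            self.attempts.append((backend.name, ok, elapsed))
            log.info("%s click(%d, %d) ok=%s in %.1f ms", backend.name, x, y, ok, elapsed * 1000)
            if ok:
                return backend.name
        raise RuntimeError(f"all backends failed for click({x}, {y})")


class StubBackend:
    """Test double: works=True succeeds, False declines, None raises."""

    def __init__(self, name: str, works: bool | None) -> None:
        self.name = name
        self.works = works

    def click(self, x: int, y: int) -> bool:
        if self.works is None:
            raise OSError(f"{self.name} unavailable")
        return self.works
```

Recording every attempt, including the failures, is what makes the fallback observable: you can see which backend actually handled an app and how long the failed ones cost.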
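```bash
pip install screen-agent
# Recommended: install macOS native backends (quotes keep zsh from globbing the brackets)
pip install "screen-agent[macos]"
```

To register the server with Claude Code:

```bash
claude mcp add screen -- screen-agent serve
```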
Add to your MCP config:
```json
{
  "mcpServers": {
    "screen": {
      "command": "screen-agent",
      "args": ["serve"]
    }
  }
}
```
To check the setup, run:

```bash
screen-agent check
```
| Tool | Description |
|---|---|
| `capture_screen` | Screenshot (full or region); returns the image for vision analysis |
| `list_windows` | List all visible windows with positions |
| `get_active_window` | Currently focused window |
| `get_cursor_position` | Current mouse position |
The input tools support `verify: true` for post-action screenshots.

| Tool | Description |
|---|---|
| `click` | Click at coordinates (left/right/middle button, multi-click) |
| `type_text` | Type text at the cursor (Unicode via the clipboard on macOS) |
| `press_key` | Key press with modifiers (e.g., Cmd+C) |
| `scroll` | Scroll the wheel, optionally at a given position |
| `move_mouse` | Move the cursor without clicking |
| `drag` | Click-drag between two points |
| `focus_window` | Bring a window to the front by partial title match |
| Tool | Description |
|---|---|
| `ocr` | Extract all text with bounding boxes |
| `find_text` | Find text and return its location |
| `click_text` | Find text and click its center |
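How `click_text` turns an OCR hit into a click point is not spelled out here. A plausible sketch, assuming OCR boxes in screenshot pixels with a top-left origin, case-insensitive substring matching, and a Retina backing scale of 2 (all assumptions; `OcrBox`, `find_text_center` and `vision_to_pixels` are hypothetical names). The last function shows the conversion from Apple Vision's normalized, bottom-left-origin bounding boxes:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class OcrBox:
    text: str
    x: float  # left edge, screenshot pixels
    y: float  # top edge, screenshot pixels (top-left origin)
    w: float
    h: float


def find_text_center(boxes: list[OcrBox], query: str, scale: float = 2.0) -> tuple[float, float] | None:
    """Return the click point in logical screen points, or None if not found.

    On a Retina display a screenshot has `scale` times as many pixels per
    axis as the logical coordinate space that mouse events use.
    """
    q = query.casefold()
    for box in boxes:
        if q in box.text.casefold():
            cx = box.x + box.w / 2
            cy = box.y + box.h / 2
            return cx / scale, cy / scale
    return None


def vision_to_pixels(bx: float, by: float, bw: float, bh: float,
                     img_w: int, img_h: int) -> tuple[float, float, float, float]:
    """Convert a normalized, bottom-left-origin box to top-left pixel (x, y, w, h)."""
    return bx * img_w, (1 - by - bh) * img_h, bw * img_w, bh * img_h
```

Forgetting the scale division is a classic Retina bug: the click lands at twice the intended coordinates.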
| Tool | Description |
|---|---|
| `run_test` | Execute a full test plan autonomously, with no LLM round-trips (about 15x faster) |
| `act` | Vision-first: returns a screenshot → LLM looks → executes at coordinates |
| `eval_js` | Execute JavaScript via CDP: DOM assertions, element clicks, state checks |
| `interact` | OCR-based: find an element by text and click/type in one call |
| Tool | Description |
|---|---|
| `window_scope` | Lock to a window. Chrome: auto-CDP (any Space). Native: CGWindowList (same Space). |
| `window_release` | Release the window scope and return to full-screen mode |
| Tool | Description |
|---|---|
| `test_start` | Start a test session with automatic screenshot collection |
| `test_step` | Begin a test step (auto-captures a "before" screenshot) |
| `test_verify` | Verify a step via OCR text check or screenshot diff |
| `test_end` | End the session and generate a markdown report with evidence |
| `test_status` | Current session status |
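The session tools map naturally onto a small record-keeping object. A sketch of the reporting side, assuming each step stores the paths of its before/after screenshots; `Session`, `Step` and the report layout are hypothetical and not screen-agent's actual format:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    passed: bool | None = None  # None until the step is verified
    before: str | None = None   # path to the "before" screenshot
    after: str | None = None    # path to the "after" screenshot


@dataclass
class Session:
    name: str
    steps: list[Step] = field(default_factory=list)

    def step(self, name: str, before: str) -> Step:
        """Roughly what test_step does: open a step with its before-shot."""
        s = Step(name, before=before)
        self.steps.append(s)
        return s

    def report(self) -> str:
        """Roughly what test_end does: render results plus image evidence."""
        passed = sum(1 for s in self.steps if s.passed)
        lines = [f"# {self.name}", "", f"{passed}/{len(self.steps)} steps passed", ""]
        for i, s in enumerate(self.steps, 1):
            mark = {True: "PASS", False: "FAIL", None: "NOT VERIFIED"}[s.passed]
            lines.append(f"## {i}. {s.name}: {mark}")
            if s.before:
                lines.append(f"![before]({s.before})")
            if s.after:
                lines.append(f"![after]({s.after})")
            lines.append("")
        return "\n".join(lines)
```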
| Tool | Description |
|---|---|
| `add_app` | Add an app to the allowlist; the agent can ONLY interact with listed apps |
| `remove_app` | Remove an app from the allowlist |
| `set_region` | Restrict to a pixel region |
| `clear_scope` | Remove all restrictions |
| `get_agent_status` | Guardian state, backend stats, scope info |
Screen Agent can test applications without occupying your screen. Three modes, auto-selected:
```bash
# Start Chrome with a remote debugging port
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
  --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-test
```

```python
# Connect: works even if Chrome is on a different desktop (Space)
window_scope(app="Chrome", url="localhost:3000")
# All operations go through Chrome's internal pipeline
interact(target="Submit", action="click")
interact(target="Email", action="click_and_type", text="[email protected]")
window_release()
```
CDP bypasses the macOS window server entirely. Screenshots come from Chrome's renderer and clicks go through Chrome's input system, so your screen is never touched.
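For readers new to CDP: the commands are plain JSON messages sent over a page's WebSocket (its `webSocketDebuggerUrl`, listed at `http://localhost:9222/json` when Chrome runs with the flag above). A sketch that only builds the messages for a click, a screenshot and a JS evaluation; it assumes nothing about how screen-agent itself sends them, and the helper names are hypothetical:

```python
import itertools
import json

_ids = itertools.count(1)


def cdp(method: str, **params) -> str:
    """Serialize one CDP command; each command on a connection needs a unique id."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})


def click_messages(x: float, y: float) -> list[str]:
    """A click is a press then a release at the same viewport position (CSS pixels)."""
    common = {"x": x, "y": y, "button": "left", "clickCount": 1}
    return [
        cdp("Input.dispatchMouseEvent", type="mousePressed", **common),
        cdp("Input.dispatchMouseEvent", type="mouseReleased", **common),
    ]


def screenshot_message() -> str:
    # The reply's result.data is a base64-encoded PNG rendered by Chrome itself.
    return cdp("Page.captureScreenshot", format="png")


def eval_message(expression: str) -> str:
    # returnByValue asks Chrome for the JSON value instead of a remote object handle.
    return cdp("Runtime.evaluate", expression=expression, returnByValue=True)
```

Because the input events are dispatched inside the page rather than through the OS, the real cursor never moves.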
```python
# Works with Figma, Xcode, Terminal, games: any app
window_scope(app="Figma", title="Design v2")
interact(target="Export", action="click")
window_release()
```
Native windows are captured with CGWindowListCreateImage, which works even when the window is behind other apps. The window must be on the same macOS Space.
Without window_scope, Screen Agent operates on the full screen as before.
```
window_scope called → try CDP (Chrome) → try CGWindowList (same Space) → error
no scope            → full screen mode
```
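The same selection order written as a function. The two booleans stand in for real probes (connecting to the debugging port, checking whether the window is on the current Space), and `Mode`/`select_mode` are hypothetical names:

```python
from __future__ import annotations

from enum import Enum


class Mode(Enum):
    CDP = "cdp"
    CG_WINDOW = "cgwindowlist"
    FULL_SCREEN = "full_screen"


def select_mode(app: str | None, cdp_reachable: bool, on_current_space: bool) -> Mode:
    """CDP for Chrome first, then CGWindowList, else fail loudly."""
    if app is None:  # window_scope was never called
        return Mode.FULL_SCREEN
    if "chrome" in app.casefold() and cdp_reachable:
        return Mode.CDP
    if on_current_space:
        return Mode.CG_WINDOW
    raise LookupError(f"cannot scope to {app!r}: no CDP endpoint and not on this Space")
```

Raising instead of silently falling back to full-screen mode matters here: a scoped test should never start driving the user's visible screen by surprise.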
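Screen Agent's safety system provides two guarantees. First, the Guardian automatically pauses the agent when it detects user activity. Second, an app allowlist (or a pixel region) restricts interactions to approved applications only: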
```python
# Agent can only interact with Chrome and Figma
add_app("Chrome")
add_app("Figma")
# Or restrict to a region
set_region(x=0, y=0, width=800, height=600)
```
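A sketch of the check such a scope implies before every input action, assuming exact, case-insensitive app-name matching and a half-open pixel rectangle (the real matching rules may differ; `Scope` is a hypothetical name):

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Scope:
    apps: set[str] = field(default_factory=set)      # empty set = no app restriction
    region: tuple[int, int, int, int] | None = None  # (x, y, width, height)

    def add_app(self, name: str) -> None:
        self.apps.add(name.casefold())

    def allows(self, app: str, x: int, y: int) -> bool:
        """Permit an action only if both the target app and the point are in scope."""
        if self.apps and app.casefold() not in self.apps:
            return False
        if self.region is not None:
            rx, ry, rw, rh = self.region
            if not (rx <= x < rx + rw and ry <= y < ry + rh):
                return False
        return True
```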
All parameters are configurable via environment variables:
| Variable | Default | Description |
|---|---|---|
| `SCREEN_AGENT_COOLDOWN` | `1.5` | Guardian cooldown in seconds |
| `SCREEN_AGENT_GUARDIAN_DISABLED` | `0` | Set to `1` to disable the Guardian |
| `SCREEN_AGENT_INPUT_BACKENDS` | `ax,cgevent,pyautogui` | Backend priority order |
| `SCREEN_AGENT_MAX_DIMENSION` | `2560` | Maximum screenshot dimension in pixels |
| `SCREEN_AGENT_LOG_LEVEL` | `INFO` | Logging level |
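How these variables might be read, and how the cooldown could gate input, as a sketch. The parsing (comma-separated backends, `"1"` to disable) follows the table; `Settings`, `Guardian` and the "idle for `cooldown` seconds" rule are assumptions about the Guardian's behavior, not its actual code. The clock is injectable so the logic can be tested without sleeping:

```python
from __future__ import annotations

import os
import time
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class Settings:
    cooldown: float = 1.5
    guardian_disabled: bool = False
    backends: tuple[str, ...] = ("ax", "cgevent", "pyautogui")
    max_dimension: int = 2560
    log_level: str = "INFO"

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> Settings:
        raw_backends = env.get("SCREEN_AGENT_INPUT_BACKENDS", "ax,cgevent,pyautogui")
        return cls(
            cooldown=float(env.get("SCREEN_AGENT_COOLDOWN", "1.5")),
            guardian_disabled=env.get("SCREEN_AGENT_GUARDIAN_DISABLED", "0") == "1",
            backends=tuple(b.strip() for b in raw_backends.split(",") if b.strip()),
            max_dimension=int(env.get("SCREEN_AGENT_MAX_DIMENSION", "2560")),
            log_level=env.get("SCREEN_AGENT_LOG_LEVEL", "INFO").upper(),
        )


class Guardian:
    """Block agent input until the user has been idle for `cooldown` seconds."""

    def __init__(self, settings: Settings, clock: Callable[[], float] = time.monotonic) -> None:
        self.settings = settings
        self.clock = clock
        self.last_user_activity = float("-inf")  # no user activity seen yet

    def user_activity(self) -> None:
        """Call whenever a real mouse or keyboard event from the user is observed."""
        self.last_user_activity = self.clock()

    def may_act(self) -> bool:
        if self.settings.guardian_disabled:
            return True
        return self.clock() - self.last_user_activity >= self.settings.cooldown
```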
| Feature | macOS | Windows | Linux |
|---|---|---|---|
| Screenshot | mss | mss | mss |
| AX Input | Quartz AX | - | - |
| CGEvent Input | Quartz | - | - |
| pyautogui Input | fallback | fallback | fallback |
| Window Management | AppleScript | - | wmctrl |
| OCR | Vision Framework | - | - |
| Retina Scaling | auto-detect | - | - |
| Window Capture | CGWindowListCreateImage | PrintWindow | xdotool+ImageMagick |
```bash
git clone https://github.com/chriswu727/screen-agent
cd screen-agent
pip install -e ".[dev,macos]"
pytest tests/unit/ -v
ruff check src/ tests/
```
See DEVPATH.md for development history and architectural decisions.
License: MIT