Skip to content

TMHSDigital/screencast-mcp

Screencast MCP

An MCP server that lets an agent record the screen, watch the footage, and make minimal ffmpeg edits — over stdio, on Windows.

Capture the desktop, a monitor, a window, or a region; sample a recording into frames the agent can actually look at; trim, crop, scale, overlay, compress, redact, and convert. No cloud, no streaming — ffmpeg and ffprobe wrapped behind nineteen typed tools.


Documentation


CI License: CC BY-NC-ND 4.0 Last commit Repo size Top language

TypeScript Node.js Model Context Protocol ffmpeg Capture

Note

Screen capture uses gdigrab and is Windows-only; the watch, edit, and produce tools work anywhere ffmpeg runs. The full pipeline is in place: capture, watch, the edit surface, redact_region safety redaction, system-audio capture, and the produce layer (transitions, title cards, music bed, aspect variants, platform export). Cross-platform capture backends are next — see ROADMAP.md.

Overview

Screencast MCP gives an agent a screen recorder it can drive and reason about. It speaks Model Context Protocol over stdio and exposes ffmpeg as a small, typed tool surface rather than a flag soup. The defining capability is the watch loop: an agent records footage, samples it into PNG frames, and views those frames to confirm what actually happened on screen.

The design choices are deliberate rather than incidental:

  • Capture is always explicit. A recording or screenshot happens only on a tool call — nothing auto-fires and there is no background or scheduled capture.
  • Footage is made viewable. sample_frames turns a video into images an agent can open, so "watch what happened" is a first-class operation, not an afterthought.
  • Presets over raw flags. Quality is draft / standard / high; the agent never reasons about codecs, CRF, or pixel formats.
  • Safe by default. Output lands under SCREENCAST_HOME, never inside a project checkout, and the public repo's .gitignore blocks captured media from being committed.
  • Crash-safe sessions. A recording interrupted by a crash is reconciled on the next start (orphan reaping), so no ffmpeg child silently outlives the server.

Tools

Twenty-five tools across four concerns. The manifest in mcp-tools.json is the canonical surface and is kept in sync with src/tools/.

Capture

Tool Purpose
start_recording Start a background recording. target = full | monitor:<index> | window:<title> | region:<x>,<y>,<w>,<h>; optional fps, quality, and audio (set audio.source = system to also capture loopback audio). Returns a session id and output path.
stop_recording Stop a session by id. Sends ffmpeg a graceful quit so the file is finalized, not truncated. Returns the final path and duration.
list_sessions List active and finished sessions.
get_session Inspect a single session by id.
screenshot Capture a single PNG of a target.
list_audio_devices List the DirectShow audio devices ffmpeg can see and flag a likely system-audio loopback device for start_recording.

Watch

Tool Purpose
sample_frames Extract frames from a video — at a fixed fps or at explicit timestamps — so the agent can view what happened. Returns the frame paths.
get_media_info ffprobe wrapper: duration, resolution, fps, codecs, container format, and size.

Minimal edit

Tool Purpose
trim Cut a sub-clip by start + (end or duration). Stream-copy for speed.
concat Join two or more videos into one.
convert Convert between mp4, gif, and webm.

Edit surface

These tools re-encode (a filter rewrites pixels, so stream copy does not apply). They reuse the same draft/standard/high presets as capture.

Tool Purpose
crop Crop to a pixel rectangle (x, y, width, height). A rectangle that runs off the frame is rejected, not silently clamped.
scale Resize to a width and/or height. One side keeps the aspect ratio; both set an exact size.
speed Change playback speed by a factor (>1 faster, <1 slower). Audio is retempo'd when present.
overlay Composite a logo, watermark, or picture-in-picture onto a base video at a position, optionally scaled and time-limited.
compress Re-encode smaller with a light/medium/heavy CRF ladder and an optional maxWidth that only downscales.
extract_audio Write the audio track to its own file (mp3, aac, wav, or copy).
clip Extract one or more frame-accurate sub-segments to separate files. Unlike trim, it re-encodes so cuts land exactly on the given times.

Redact

Tool Purpose
redact_region Cover declared rectangles in a video. style is box (default, a solid irreversible fill), blur, or pixelate; each region may be limited to a start/end window and expanded with pad.

Important

redact_region covers the regions you declare. It is not automatic secret detection, so it cannot find a secret you did not point it at. The default box style is a solid fill, which is irreversible; blur and pixelate are softer but can be partially recovered, so prefer box for real secrets. A region that falls outside the frame is rejected rather than silently doing nothing.

Produce

The production layer turns raw clips into a finished piece. Tools that combine clips auto-normalize each input to a common resolution, fps, and audio rate first, so heterogeneous sources compose cleanly.

Tool Purpose
xfade_transition Crossfade two videos into one with an xfade transition (fade, wipeleft, slideup, ...). Audio is crossfaded when both clips have a track.
assemble_highlights Stitch two or more clips into one, with hard cuts (transition: "cut") or an xfade transition between each.
title_card Generate a standalone title card (centered text on a solid background, with a silent track so it composes). Text uses a bundled font, so no system font is needed.
music_bed Lay a music track under a video: looped/trimmed to length, faded in and out, leveled, and mixed with any existing audio (optionally ducked).
reframe Re-aspect to 16:9, 9:16, 1:1, or 4:5 with pad (letterbox, no content lost) or crop (fill, trims overflow).
export_preset Encode a platform-ready file (youtube, instagram_reel, tiktok, x, square): reframes, caps fps, and encodes H.264 at the platform bitrate with faststart.

Targets

Every capture tool takes a single target string, so an agent never has to juggle quoting:

Target Captures
full The whole virtual desktop
monitor:<index> One display; 0 is always primary
window:<title> The on-screen rectangle a window occupies (case-insensitive exact title, else substring; topmost wins)
region:<x>,<y>,<w>,<h> An absolute pixel rectangle

Output is written under SCREENCAST_HOME (default <homedir>/.screencast-mcp) into recordings/, frames/, screenshots/, and edits/. Any tool also accepts an explicit output path.

Prerequisites

ffmpeg and ffprobe are external dependencies and must be on PATH (or pointed at via the FFMPEG_PATH / FFPROBE_PATH environment variables). The server detects them per call and returns a clear error with an install hint if either is missing.

Platform Install
Windows winget install Gyan.FFmpeg or choco install ffmpeg
macOS brew install ffmpeg
Linux apt install ffmpeg

Installation

npm install -g @tmhs/screencast-mcp

Or run it from a clone:

git clone https://github.com/TMHSDigital/screencast-mcp.git
cd screencast-mcp
npm install
npm run build      # produces dist/index.js, the server entry point

MCP client configuration

{
  "mcpServers": {
    "screencast": {
      "command": "npx",
      "args": ["-y", "@tmhs/screencast-mcp"]
    }
  }
}

Tip

Running from a clone instead of the published package? Point the client straight at the build: "command": "node", "args": ["C:/path/to/screencast-mcp/dist/index.js"].

Usage

A typical watch loop — record, sample, look:

// 1. record a region for a few seconds
start_recording { "target": "region:0,0,1280,720", "quality": "draft" }
//    -> { "sessionId": "rec-…", "outputPath": "…/recordings/rec-….mp4" }

// 2. finalize the file
stop_recording { "sessionId": "rec-…" }
//    -> { "durationSec": 4.2, "finalizedGracefully": true }

// 3. turn it into frames the agent can view
sample_frames { "input": "…/recordings/rec-….mp4", "timestamps": [0.5, 2, 3.5] }
//    -> { "frames": ["…/frames/…/frame_000_0.5s.png", …] }

screenshot { "target": "window:My App" } is the one-shot equivalent for a still.

Windows notes

  • Multi-monitor offsets. gdigrab has no "capture monitor N" selector, so a monitor target captures the whole virtual desktop and crops to that display's pixel bounds (from System.Windows.Forms.Screen.AllScreens). monitor:1 grabs the second display at its real offset; monitor:0 is always primary.
  • Window capture resolves the window to the on-screen rectangle it occupies and captures that, rather than the window's own surface — gdigrab's native title= grab returns a blank frame for GPU- or DirectComposition-composited windows (Chrome, Electron, UWP). The window must be visible, on top, and not minimized (a minimized window is rejected with a clear error), the capture includes anything drawn over that rectangle, and for start_recording the rectangle is fixed once at start. True per-window background capture (Windows Graphics Capture API) is a future phase.
  • Fullscreen-exclusive apps often produce black frames under gdigrab; run the source in borderless-windowed mode for reliable capture.
  • System audio needs a loopback device. gdigrab is video-only, so start_recording with audio.source = system captures from a separate dshow input. Windows has no native loopback, so this needs a virtual-audio device (enable Stereo Mix, or install a driver such as screen-capture-recorder's virtual-audio-capturer or VB-CABLE). Run list_audio_devices to find it. Microphone capture is intentionally not supported.

Threat model

Warning

Screen capture can record anything on screen at the moment of capture — passwords, tokens, private messages, and other secrets. window: captures the screen rectangle a window occupies, so overlays, notifications, or another window drawn over it are captured too. Treat recordings, screenshots, and sampled frames as sensitive by default.

  • Capture is always explicit. Nothing auto-fires; capture is gated behind an explicit tool call.
  • Output stays local. Files are written to the local filesystem only — never uploaded, streamed, or transmitted anywhere.
  • Public repo, private captures. The .gitignore blocks recordings, frames, screenshots, and common video/image output so test media cannot be committed by accident.
  • Review before sharing. Sample frames or inspect a recording before handing a file to another tool or person, so you know what it contains.
  • Redaction is declared, not detected. redact_region covers only the rectangles you specify, so it depends on you (or the agent) having found the secret first. Use the default box style for a solid irreversible fill, and still review the output before sharing.
  • System audio is sensitive too. When audio.source = system, the recording captures everything playing on the machine (call audio, notifications, media). Treat audio-bearing recordings with the same care as the video.

Project structure

.
├── src/
│   ├── index.ts          # MCP server entry (stdio); registers every tool, reaps orphans
│   ├── context.ts        # shared session-store singleton
│   ├── tools/            # one file per tool (capture, watch, edit)
│   ├── utils/            # ffmpeg, monitors, windows, sessions, paths, targets
│   └── __tests__/        # vitest unit tests + guarded local-capture harness
├── docs/                 # GitHub Pages documentation site
├── mcp-tools.json        # canonical tool manifest (kept in sync with src/tools)
├── .github/workflows/    # CI, release, npm publish, Pages, ecosystem drift check
├── ROADMAP.md · CONTRIBUTING.md · SECURITY.md · LICENSE
└── package.json

Development

npm install
npm run build      # tsc -> dist/
npm test           # vitest (pure unit tests; no ffmpeg or display required)
npm run dev        # tsx watch

The capture path can't be exercised on CI's headless Linux runners, so an end-to-end harness lives behind a flag and is skipped by default:

RUN_LOCAL_CAPTURE_TESTS=1 npm test   # Windows + ffmpeg + a real display

Contributing

Issues and pull requests are welcome — see CONTRIBUTING.md and the Code of Conduct. Security reports go through SECURITY.md.

License

Released under the CC-BY-NC-ND-4.0 license.

About

MCP server for Windows screen recording, frame sampling, and minimal ffmpeg edits

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors