How to do visual regression testing when the output is a folder, not a dashboard.
Every other how-to for this topic walks you through installing a runner, wiring a config, generating a baseline image, and clicking around a vendor UI to approve diffs. This one skips that entire workflow. Assrt writes one new UUID folder at /tmp/assrt/<runId> per run. Inside: per-step PNGs with a zero-padded numeric index, a WebM, a 48-line self-contained HTML player, an execution.log, an events.json, and a matching results/<runId>.json. The review step is `open /tmp/assrt/<runId>`. That is it.
What every other how-to guide assumes, and why this one does not
Read any guide for this topic written in the last five years and you get the same three chapters. Chapter one: install the runner and the assertion library. Chapter two: initialize a baseline store, usually a __snapshots__ folder next to the test file or a cloud project you link to your repo. Chapter three: run once with --update-snapshots, then learn the review dashboard so you can approve diffs on every subsequent run.
Assrt deletes all three chapters. There is no assertion library because the model is the assertion layer. There is no baseline store because every frame is judged against the English plan, not against a golden PNG. There is no dashboard because the run output is a folder on your disk and the review surface is Finder. The uncopyable part of this story is that folder. Every path below is a real file that exists five seconds after a run completes.
What gets written where
The folder layout, verbatim
This is what a freshly finished run looks like. The run ID is a crypto.randomUUID() value generated up front at assrt-mcp/src/mcp/server.ts line 429, before any browser work starts, so every path inside is known before the first action runs.
- `screenshots/` — per-step PNGs with a zero-padded numeric index, deduped per step. The filename already tells you which action ran.
- `video/recording.webm` — native Playwright recording at 1600x900. Finalized with browser_stop_video, moved into the runId folder at server.ts:587-590.
- `video/player.html` — 48 lines of self-contained HTML5. Space toggles play, arrow keys seek, 1/2/3/5/0 set 1x/2x/3x/5x/10x speed.
- `execution.log` — flat text. One line per step, per reasoning note, per assertion. Human-greppable after the run finishes.
- `events.json` — the full SSE timeline. Every event the agent emitted during the run, with timestamps. Replay with jq.
- `results/<runId>.json` — the TestReport shape at types.ts:28-35. scenarios[].assertions[] gives you description + passed + evidence. jq-ready.
The screenshot naming is load-bearing
Filenames look like 02_step3_type_text.png. The leading two-digit index is zero-padded so ls returns them in execution order. The step number maps back to the assertion list in the results JSON. The action tag (navigate, click, type_text, select_option, scroll, press_key) tells you what the agent was doing when the frame was captured. And the server deduplicates within a single step, so if the model retries an interaction you keep the final state of that step, not the thrash.
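The naming scheme is simple enough to reproduce in one line. A hedged sketch — `screenshot_name` is a hypothetical helper for illustration, not Assrt source:

```shell
# Sketch of the screenshot naming scheme described above.
# screenshot_name is a made-up helper, not a function in the Assrt repo.
screenshot_name() {
  # $1 = running frame index, $2 = step number, $3 = action tag
  printf '%02d_step%d_%s.png\n' "$1" "$2" "$3"
}
screenshot_name 2 3 type_text   # -> 02_step3_type_text.png
```

The `%02d` is the load-bearing part: it is what keeps alphabetical order identical to execution order for up to 100 frames.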
> "The generated player.html is small enough to paste into a comment. Autoplay, speed buttons from 1x to 10x, and three keyboard shortcuts. No vendor player, no auth, no CDN."
> — assrt-mcp/src/cli.ts lines 310-349
The auto-generated HTML player
This is the single most copied-and-pasted file Assrt produces. Zip a run folder, send it to a reviewer on a different machine, they double-click player.html, and the recording plays with full speed controls. No extension, no electron wrapper, no signed URL. The hotkeys are bound on the document: Space toggles play, ArrowLeft and ArrowRight seek five seconds, and 1/2/3/5/0 map to 1x/2x/3x/5x/10x speed.
The results JSON is the audit trail
Every run also writes a structured summary to /tmp/assrt/results/latest.json and /tmp/assrt/results/<runId>.json. The shape matches TestReport at assrt-mcp/src/core/types.ts line 28, so every assertion is { description, passed, evidence }. The evidence field is plain English, written by the model judging the frame, and it is what you read first when a test fails.
End to end, on your disk
Install once, run once, review once. The whole loop fits in a single terminal buffer. Nothing in here is mocked for the article; this is what the actual CLI prints.
The pipeline, actor by actor
The six steps, in order
This is the whole how-to. Nothing else is required to produce a reviewable visual regression run.
Run setup once, globally
npx -y @assrt-ai/assrt setup registers the MCP server with scope=user, writes the QA reminder hook to ~/.claude/hooks/, and appends a QA testing section to ~/.claude/CLAUDE.md. There is no per-project config file to commit.
Write a plan in plain English
Create scenario.md with one or more `#Case` blocks. No DSL, no YAML, no JSON. The header regex is `#?\s*(?:Scenario|Test|Case)` (assrt-mcp/src/core/scenario-files.ts). Commit the file next to the code it tests.
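A minimal plan file, assuming only the header convention matched by the regex above — the two cases and their wording are invented for illustration:

```shell
# Write an example scenario.md. The cases below are invented examples;
# only the '# Case' header convention comes from the matcher above.
cat > /tmp/scenario-demo.md <<'EOF'
# Case: signup form shows a success toast
Navigate to /signup, fill in an email and password, submit,
and assert the green "Account created" toast appears.

# Case: pricing page renders three tiers
Navigate to /pricing and assert three plan cards are visible.
EOF
grep -c '^# Case' /tmp/scenario-demo.md   # -> 2
```

Because the plan is plain English, a PR reviewer can diff it the same way they diff the code it tests.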
Call assrt_test from the agent or the CLI
The Claude Code agent already has the tool because setup made it global. You can also call it as `npx assrt run --url http://localhost:3000 --plan-file scenario.md`. It spawns headless Chromium at 1600x900 via Playwright MCP.
Let Claude Haiku 4.5 judge each frame live
Every visual action (navigate, click, type_text, select_option, scroll, press_key) emits a fresh JPEG screenshot. The agent attaches it as base64 image content to the next tool-result message and the model decides pass/fail in prose (agent.ts line 987).
Open the auto-generated /tmp/assrt/<runId>/video/player.html
autoOpenPlayer defaults to true, so the player already opened. It is a 48-line self-contained HTML file you can double-click on any machine. Space toggles play, arrow keys seek, number keys set 1x through 10x.
Grep results/latest.json for failures
The results JSON is the exact TestReport shape from types.ts line 28. `jq '.scenarios[].assertions[] | select(.passed|not)'` gives you every failed assertion with its evidence field and no cloud login.
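When jq is not installed, a plain grep over the pretty-printed results file approximates the same failure count. The mock JSON below is invented and assumes only the { description, passed, evidence } assertion shape described above:

```shell
# Mock results file in the assertion shape described above; the
# scenario contents are invented for illustration, not real output.
cat > /tmp/latest-demo.json <<'EOF'
{
  "scenarios": [
    { "assertions": [
      { "description": "toast appears", "passed": true, "evidence": "green toast visible" },
      { "description": "banner shows error", "passed": false, "evidence": "banner never rendered" }
    ] }
  ]
}
EOF
# jq-free approximation of: jq '.scenarios[].assertions[] | select(.passed|not)'
grep -c '"passed": false' /tmp/latest-demo.json   # -> 1
```

The grep version only counts; for reading the evidence strings themselves, the jq filter remains the better tool.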
By the numbers
Concrete counts for the things you can grep on disk today. Every value below maps to a file path or a line of source.
- 6 — distinct artifact files per run, all on your disk
- 48 — lines in the auto-generated video player.html
- 10x — max playback speed via the built-in hotkey '0'
- 0 — dashboards, login flows, or signed URLs to expire
Where built-in actually pays off
The moments below are where having the artifacts on disk, not in a cloud, changes the feel of the loop. Every item is something a reviewer has run into with the traditional setup.
When a folder beats a dashboard
- You want to test a new user flow without adding a library, writing a config, or seeding a baseline folder.
- A reviewer on a different machine needs to see what happened on a failing CI run, without a cloud account.
- You are about to zip a failed run as a GitHub Actions artifact and want the player to work locally when unzipped.
- The agent that wrote the code also needs to verify the UI still works, in the same turn, before committing.
- You care about the evidence field more than a pixel diff score, because the question is 'did the toast appear' and not 'did the border shift 1px'.
- You want to grep a year of regression runs with jq, not click through proprietary paginated tables.
Works with the stack you already have
Because every artifact is a standard format, the review surface plugs into anything. The HTML player opens in any browser, the JSON parses with jq, the PNGs attach to any bug tracker, the WebM plays in any CI artifact viewer, and the scenario.md commits to any repo.
Built-in vs. the traditional stack, row by row
Every row is a file path or a line of source, not a marketing adjective. Open the Assrt repo and verify any claim in thirty seconds.
| Feature | Traditional visual regression stack | Assrt (built-in) |
|---|---|---|
| Where the plan lives | Proprietary YAML, low-code builder, or vendor DSL | scenario.md in plain English, committable to git |
| Where the baseline images live | __snapshots__/ folder (pixel diff) or vendor cloud (AI diff) | No baseline. Claude Haiku 4.5 judges each JPEG live at agent.ts:987 |
| Where the run output lives | Vendor cloud dashboard, signed URLs, session expiry | /tmp/assrt/<runId>/ on your disk, plain PNG/WebM/JSON |
| How to replay a recording | Log in to the dashboard, find the build, click play | open /tmp/assrt/<runId>/video/player.html |
| Playback speed controls | 1x, maybe 2x if the vendor shipped it | 1x, 2x, 3x, 5x, 10x via generated player (cli.ts:331-335) |
| Archive a run as a CI artifact | Often blocked by vendor cloud or requires paid tier | Zip the runId folder. Everything is standard formats. |
| Fail signal format | Pixel count, diff overlay image, proprietary score | TestAssertion { description, passed, evidence } at types.ts:13-17 |
| Swap tools tomorrow | Scenarios are stuck in the vendor format | Plan runs on any Playwright MCP agent; tool calls are public |
Both approaches are valid. Pixel diff catches 1px drifts a model will not. Built-in catches page-level behavior and does not need a __snapshots__ folder that churns on every theme tweak.
What built-in does not replace
Pixel-perfect component regression. If your design system breaks when a 1px border color drifts, keep Playwright's toHaveScreenshot() on that component. A model will not flag one pixel and you will miss the regression. The built-in semantic approach is for page-level and user-journey regressions: did the onboarding flow render correctly, did the pricing table reflow, did the error banner appear when it was supposed to. Both stacks live happily in the same repo, answering different questions.
Want to see a /tmp/assrt/<runId> folder appear against your own app?
Fifteen minutes. Bring a user flow that is a pain to test. We will run Assrt against it live and you will open the artifacts folder yourself at the end.
Book a call →

FAQ: doing visual regression when it's built-in
What does 'built-in' actually mean here, as opposed to installing a visual regression library?
It means three separate install steps collapse into one. With a traditional setup you install a runner (Playwright, Cypress), an assertion library (toHaveScreenshot, pixelmatch, resemble.js, Percy SDK), and a hosting account for the baseline store (Percy cloud, Chromatic, Applitools Eyes). With Assrt, running `npx assrt setup` once registers the Assrt MCP server globally for Claude Code (scope: user), writes a PostToolUse hook to ~/.claude/settings.json, and appends a QA testing section to ~/.claude/CLAUDE.md. After that the `assrt_test` tool is always present in the agent's tool list. There is no per-project config file and no baseline folder. The scenario is a plain-text `.md`, and the result lives on your disk at /tmp/assrt/<runId>. Source: assrt-mcp/src/cli.ts lines 214-308.
Where is the output of a visual regression run, step by step, on disk?
Every run creates a freshly-minted UUID directory at /tmp/assrt/<runId>/ (assrt-mcp/src/mcp/server.ts line 430). Inside that folder you get: screenshots/NN_step<n>_<action>.png, one per visual action, with a zero-padded index starting at 00 (naming pattern at server.ts line 468); video/recording.webm, the native Playwright recording (server.ts lines 587-590); video/player.html, an auto-generated self-contained HTML video player (server.ts line 619 calls generateVideoPlayerHtml); execution.log, a flat text trace of every step and assertion (server.ts line 602); events.json, the full event timeline (server.ts line 606). Results land separately at /tmp/assrt/results/<runId>.json and /tmp/assrt/results/latest.json. None of these files are uploaded to a cloud by default. Open any folder in Finder and you have the full review surface.
The generated player.html — what does it actually do?
It is 48 lines of HTML + CSS + vanilla JS, written by generateVideoPlayerHtml at assrt-mcp/src/cli.ts lines 310-349. It renders the .webm inline with autoplay, a dark chrome, a status header showing pass/fail counts and duration, and a row of playback speed buttons (1x, 2x, 3x, 5x, 10x). Keyboard shortcuts are bound: Space toggles play/pause, ArrowLeft seeks back 5 seconds, ArrowRight seeks forward 5 seconds, and keys 1/2/3/5/0 set speed to 1x/2x/3x/5x/10x. The file is self-contained, so you can zip the whole runId folder, mail it to a teammate, and they can double-click player.html on any machine. No proprietary player, no auth token, no browser extension required.
Does Assrt actually generate Playwright code, or is it a wrapper around its own DSL?
It uses Playwright directly. The Assrt browser layer spawns @playwright/mcp as a stdio child process with --viewport-size 1600x900 --caps devtools (assrt-mcp/src/core/browser.ts line 296). The native MCP tool calls the agent dispatches are real Playwright actions: browser_navigate, browser_click, browser_type, browser_take_screenshot, browser_start_video, browser_stop_video. The recorded WebM is Playwright's native video, not a synthetic reconstruction. Nothing in the pipeline is a proprietary YAML DSL. The plan format is plain English in scenario.md and the execution layer is Playwright MCP. If you cancel Assrt tomorrow, the plan still runs on any other Playwright MCP agent because the tool calls are public.
I already use Playwright's toHaveScreenshot(). How is this different on the ground?
toHaveScreenshot() writes a golden PNG to __snapshots__/ on the first run, then pixelmatches every later run against that baseline. You tune maxDiffPixels and threshold to control false positives, and you run `npx playwright test --update-snapshots` any time you intentionally change the UI. Assrt has zero references to toHaveScreenshot, pixelmatch, resemble.js, or maxDiffPixels in the repo. Each visual action takes a JPEG screenshot, attaches it to the next Claude Haiku 4.5 tool-result message as base64 (agent.ts line 987), and the model decides whether the frame matches the English plan. Instead of a pixel count, the fail signal is an evidence string. Both approaches are valid. Pixel diffs catch 1px border drifts better. Semantic evidence catches page-level behavior better and does not need a __snapshots__ folder that churns on every theme tweak.
How do I review a test run without opening a cloud dashboard?
Three commands cover 95% of cases. First, open the auto-generated HTML player: `open /tmp/assrt/<runId>/video/player.html` (or just let assrt_test open it for you, autoOpenPlayer defaults to true, server.ts line 403). Second, read the assertion list: `jq '.scenarios[].assertions' /tmp/assrt/results/<runId>.json` gives you `description`, `passed`, `evidence` for every check (shape at assrt-mcp/src/core/types.ts lines 13-17). Third, look at the exact frame that failed: `open /tmp/assrt/<runId>/screenshots/` lists them in order because the filename prefix is a zero-padded numeric index. No dashboard login, no signed URL expiry, no 'session has ended, please refresh'.
How do I archive a run as a CI artifact?
Zip the /tmp/assrt/<runId> folder and upload it. Everything is already a standard format: PNG, WebM, HTML, JSON, plain text. On GitHub Actions: `- uses: actions/upload-artifact@v4` pointing at `/tmp/assrt/<runId>`. On GitLab CI: add it to `artifacts.paths`. The HTML player is self-contained so reviewers can unzip the artifact locally and double-click the player to replay the run, no GitLab Pages deploy required. Because the scenario.md is also plain text you can commit it to your repo next to the code it tests and diff it in PRs, which is painful with most proprietary scenario formats.
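Locally, the same archive step is a single tar command. This sketch round-trips a mock run folder (the paths are stand-ins, not a real /tmp/assrt/<runId>) and checks the player survives unpacking:

```shell
# Archive a run folder as one portable artifact and unpack it elsewhere.
# /tmp/assrt-archive-demo is a mock stand-in for a real run folder.
RUN=/tmp/assrt-archive-demo/run
mkdir -p "$RUN/video"
printf '<!doctype html>' > "$RUN/video/player.html"
tar -czf /tmp/assrt-archive-demo/run.tgz -C /tmp/assrt-archive-demo run
# Unpack on the reviewer's side; player.html is self-contained,
# so it double-clicks open with no server or auth.
mkdir -p /tmp/assrt-archive-demo/unpacked
tar -xzf /tmp/assrt-archive-demo/run.tgz -C /tmp/assrt-archive-demo/unpacked
ls /tmp/assrt-archive-demo/unpacked/run/video   # -> player.html
```

Everything in the archive is a standard format, which is the whole reason the round trip works without a hosting step.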
What if I want to run the same scenario against multiple browsers or viewports?
Pass `viewport` to assrt_test as either a preset string or an explicit object like `{ width: 414, height: 896 }` (TestRunOptions.viewport at assrt-mcp/src/core/types.ts line 46). Each run creates its own /tmp/assrt/<runId>/ so a mobile run and a desktop run against the same scenario produce two independently reviewable folders. Playwright's default browser is Chromium and Assrt uses the bundled chromium binary from @playwright/mcp. If you need Firefox or WebKit, swap the channel in the browser launch config at browser.ts line 296. Viewport defaults to 1600x900 for the MCP spawn (browser.ts line 296) and the video recording honors the same dimensions via browser_start_video at line 628.
Does this phone home? Is there a cloud component I do not see?
The cloud component is optional and off by default. The default mode writes every artifact to your local disk and makes zero outbound calls beyond the LLM request for semantic judgment (Anthropic for Claude Haiku 4.5, optionally Google for Gemini 3.1 Flash Lite retrospective video analysis if GEMINI_API_KEY is set). If you sign in with `assrt login` the server also mirrors scenario metadata to Firestore for multi-device sync, but that path is guarded by ASSRT_NO_SAVE=1 (server.ts line 404). The comparable tier-3 AI testing platforms charge roughly $7,500 a month at scale and keep scenarios, diffs, and evidence in their cloud permanently. Assrt ships as an open-source npm package @assrt-ai/assrt, and a marker file at ~/.assrt/installed confirms an idempotent install. Cancel the relationship and every past /tmp/assrt/<runId> folder still plays, parses, and reproduces.
What about assertions that are not visual at all, like an API response or a database row?
The same `assert` tool handles them. Its three fields (description, passed, evidence) are model-agnostic, so the agent can make an http_request, inspect the response body, and emit an assertion with evidence like 'POST /api/orders returned 201 with order_id ord_8821'. The http_request tool at agent.ts lines 172-184 supports GET/POST/PUT/DELETE and headers. A single scenario can mix visual assertions and integration assertions, which matters when you are testing an end-to-end user journey: the toast says 'Order placed' visually and the backend has a new row in the orders table. Built-in means the same plan format covers both.
The rest of the built-in visual regression stack, in the same repo.
Keep reading
- Visual regression tutorial without toHaveScreenshot() — why each JPEG goes to Claude Haiku 4.5 instead of pixelmatch. The in-run side of the built-in loop.
- Visual regression testing, built into the coding agent — the PostToolUse shell hook in ~/.claude/settings.json that fires on every git commit and reminds the agent to run assrt_test.
- AI visual regression in two phases — Claude Haiku 4.5 judges live; Gemini 3.1 Flash Lite answers English questions about the WebM after the run.