Automated user testing, one agent, both jobs

Automated user testing that asserts pass/fail and files UX bugs, in the same run.

Automated UI testing tools return pass or fail. Usability research platforms return UX commentary but need human participants. Assrt collapses both categories because the agent picks from a closed set of exactly 18 tools per step, and one of them, suggest_improvement, lets it file a UX finding mid-scenario instead of only asserting. Everything on this page is verifiable in the source repo at assrt-mcp/src/core/agent.ts.

Matthew Diakonov
10 min read
18-tool closed set, one of them files UX bugs
Up to 20 pages auto-discovered per run, 3 concurrent
Open-source, self-hosted, Playwright under the hood

One sentence, if you only read one

The automated user is an LLM agent picking one of 18 tools per step, and one of those tools is suggest_improvement.

That single tool in the closed set is why a run that passes every assertion can still hand you three UX findings with severity and a suggested fix. No research platform integration, no second tool, no post-run human review required to produce the findings.

The phrase “automated user testing” hides two incompatible products

Read the top ten results for this keyword. Half are automated UI testing frameworks (Selenium, Cypress, Playwright, Katalon). The other half are usability research platforms (UserTesting, Maze, UXtweak). The two halves have almost nothing in common. One returns a row of green checkmarks. The other returns a set of five-minute sessions where real humans think aloud. Teams end up buying one of each and writing glue code to pretend they are the same pipeline.

This page is about the version that collapses the two. Not because the research platform is redundant (it is not; a representative human finds things the model cannot), but because the fast, continuous, every-PR feedback loop deserves both outputs from one run. Assrt handles that loop. Your recruited research budget stays available for the hard studies.

What a run outputs, before and after the UX-finding tool

Same scenario file, same Playwright runtime underneath. The only difference is whether the agent is allowed to call suggest_improvement during the run or is restricted to assertion-only tools.

Case 1: First-time signup PASSED (42.1s)
Case 2: Onboarding copy PASSED (8.3s)
Run complete. 2/2 passed.

  • Pass / fail per #Case
  • No UX commentary in the report
  • A green run can still ship a confusing UI

The anchor fact: one tool definition, thirteen lines of code

The whole difference comes down to whether the agent's tool list includes this declaration. It is at line 158 of assrt-mcp/src/core/agent.ts, identical in shape to the other seventeen tools in the closed set. The agent does not get a special mode or a different prompt. It gets one more tool in the array, and that tool's description tells it when to reach for the UX-bug-filing hammer.

assrt-mcp/src/core/agent.ts

Four required fields. A title. A severity in one of three bands (critical, major, minor). A description of what is wrong. A suggestion for how to fix it. The agent is free to call this between assertions or instead of one. The run emits an improvement_suggestion event and attaches the record to the final report. That is the entire delta between “automated UI testing” and “automated user testing that reports like a user”.
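The thirteen lines themselves are not reproduced on this page. As a hedged sketch, a declaration with these four required fields and three severity bands might look like the following, written in the JSON-schema style most LLM tool APIs use; the real declaration at agent.ts:158 may differ in wording:

```typescript
// Hedged reconstruction of the suggest_improvement tool declaration.
// Field names and severity bands follow the article; everything else
// (shape, property order) is an assumption, not the actual source.
const suggestImprovement = {
  name: "suggest_improvement",
  description: "Report an obvious bug or UX issue in the application",
  input_schema: {
    type: "object",
    properties: {
      title: { type: "string" },       // short title of the finding
      severity: { type: "string", enum: ["critical", "major", "minor"] },
      description: { type: "string" }, // what is wrong
      suggestion: { type: "string" },  // how to fix it
    },
    required: ["title", "severity", "description", "suggestion"],
  },
} as const;
```

Structurally it is just one more entry in the tool array, which is the article's point: no special mode, only one more choice per turn.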

The numbers that define the automated user

Four integers, all source-verifiable, that bound what the automated user can do on one run.

18 tools in the closed set
1 UX-bug-filing tool
20 max auto-discovered pages
3 concurrent discoveries
assrt-mcp/src/core/agent.ts

The ratio of 1 UX-reporter tool to 17 functional tools is the reason the agent does not turn every run into a wall of unsolicited opinions. It has to choose, on each turn, whether this is a moment to assert or a moment to file a finding. Most turns are the former. The handful that matter are the latter.

How one plan fans out into four kinds of output

Inputs on the left, the agent in the middle, outputs on the right. The run handles all four output shapes without separate pipelines, separate tools, or a second pass.

One agent, four output shapes

plan.md + URL + live Chrome → Assrt agent → pass / fail, UX findings, new #Cases, video + log

A scenario file that exercises both tools

Nothing about the plan format is special for UX findings. You write the same English #Case blocks. The second case here explicitly permits the agent to file a finding if the copy would confuse a first-time user; the first case is a straight functional assertion.

ux.md
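The file contents are not reproduced here. A minimal plan in the same spirit, reusing the two case names from the sample run earlier on this page, might look like this (wording illustrative, not the original file):

```markdown
#Case 1: First-time signup
Go to the signup page and create an account with a temporary email.
Complete the verification step. Assert the dashboard is visible.

#Case 2: Onboarding copy
Read the onboarding screens as a first-time user would.
If any copy or button label would confuse a first-time user,
file an improvement suggestion rather than failing the case.
```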

What the terminal actually shows during a run

Assertions and findings interleave. Auto-discovered pages are reported at the end. The video and JSON report land on disk.

assrt run --plan-file ux.md

What the closed set of tools looks like in practice

The agent cannot invent new tool names. On each turn it picks one of the eighteen, with full input validation. Six groups cover the surface area of automated user testing.

Functional assertions

The agent calls assert(description, passed, evidence) for each thing the scenario claims. A failed assertion fails the #Case. This is the pass/fail half most automated user testing tools stop at.

UX findings, inline

suggest_improvement(title, severity, description, suggestion) is an equal-status tool in the same 18-tool set. When the agent sees copy that would confuse a real user, it can file a finding instead of returning green.

Disposable inbox

create_temp_email + wait_for_verification_code mean the automated user can complete an OTP signup like a human would, without a MailSlurp account or hand-maintained fixtures.

Auto-discovered pages

Every navigate() is queued for background discovery. A second agent reads the accessibility tree of each page and proposes 1-2 new #Case ideas. Up to 20 pages, 3 concurrent, deduplicated by origin + pathname.

Stability without timeouts

wait_for_stable injects a MutationObserver and returns when the DOM has been silent for N seconds (default 2, max 10). A real user perceives a page as done when nothing is moving, not when a millisecond budget expires.
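The quiet-period logic can be sketched independently of the browser. In this illustrative version the MutationObserver is abstracted into a subscribe callback so the timing is visible on its own; the names and signatures are assumptions, not Assrt's API:

```typescript
// Sketch of a debounced "DOM has been silent for quietMs" wait, with a
// hard cap at maxMs. In the real tool the activity source would be a
// MutationObserver injected into the page; here it is a callback.
type Unsubscribe = () => void;

function waitForQuiet(
  subscribe: (onActivity: () => void) => Unsubscribe,
  quietMs = 2000, // default: 2s of silence counts as stable
  maxMs = 10000,  // cap: give up after 10s either way
): Promise<"stable" | "timeout"> {
  return new Promise((resolve) => {
    let unsubscribe: Unsubscribe = () => {};
    const finish = (result: "stable" | "timeout") => {
      clearTimeout(quietTimer);
      clearTimeout(maxTimer);
      unsubscribe();
      resolve(result);
    };
    let quietTimer = setTimeout(() => finish("stable"), quietMs);
    const maxTimer = setTimeout(() => finish("timeout"), maxMs);
    unsubscribe = subscribe(() => {
      // Any activity resets the quiet window, like a debounce.
      clearTimeout(quietTimer);
      quietTimer = setTimeout(() => finish("stable"), quietMs);
    });
  });
}
```

The design choice worth noting is that activity resets the window instead of consuming a fixed budget, which is exactly the "done when nothing is moving" behavior the paragraph describes.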

Shared Chrome session

--extension attaches the run to the user's running Chrome, so the automated user inherits real cookies, real logins, real 2FA state. The session is indistinguishable from a human because, on the trust plane, it is one.

The eight steps of one run

Across a single invocation, here is what the system does between the first npx assrt and the final JSON report.

1. Parse the plan

parseScenarios at agent.ts:620 splits scenario.md on /(?:#?\s*(?:Scenario|Test|Case))\s*\d*[:.]\s*/gi. Every #Case is a block. No selectors, no fixtures, no imports in that file.
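As an illustrative re-implementation of that split (the regex is quoted from the source; the surrounding function is a sketch and the real parseScenarios may differ):

```typescript
// The split regex quoted above, applied the way step 1 describes.
// parseScenarios here is a re-implementation for illustration, not the
// actual code from agent.ts:620.
const CASE_SPLIT = /(?:#?\s*(?:Scenario|Test|Case))\s*\d*[:.]\s*/gi;

function parseScenarios(plan: string): string[] {
  return plan
    .split(CASE_SPLIT)
    .map((block) => block.trim())
    .filter((block) => block.length > 0);
}
```

Each resulting block is the English body of one #Case, with no selectors or fixtures to strip out.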

2. Preflight the URL

A HEAD request with an 8-second timeout fails fast if the dev server is wedged. A real user would give up and you want your automated user to give up the same way.
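A minimal sketch of such a preflight, assuming a fetch-based HEAD check with an AbortController deadline; the helper name and exact behavior are illustrative, not Assrt's implementation:

```typescript
// Fail-fast reachability check: HEAD the URL and abort after timeoutMs.
// Any response at all means the server is up; a timeout or network error
// means give up, the way a user would.
async function preflight(url: string, timeoutMs = 8000): Promise<boolean> {
  const controller = new AbortController();
  const deadline = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { method: "HEAD", signal: controller.signal });
    return res.status < 500; // any non-5xx response counts as reachable
  } catch {
    return false; // aborted or unreachable: fail fast
  } finally {
    clearTimeout(deadline);
  }
}
```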

3. Attach or launch Chrome

--extension attaches to the running Chrome over Playwright MCP; otherwise Assrt launches one. Cookies and logins from an extension attach carry over; that is why the agent can skip past auth.

4. Snapshot before every action

The accessibility tree is the selector. Each element has a stable ref like [ref=e5]. The agent matches your English 'Click Get started' against a node with role=button and name containing 'Get started', then calls click with the ref.
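A toy version of that matching step, with a deliberately simple heuristic; the node shape follows the article, but the matching code is an assumption, not Assrt's:

```typescript
// Match an English step like "Click Get started" against accessibility-tree
// nodes of the shape the article describes: a stable ref, a role, a name.
interface AxNode {
  ref: string;  // stable reference like "e5"
  role: string; // ARIA role from the accessibility tree
  name: string; // accessible name
}

function findTarget(nodes: AxNode[], step: string): AxNode | undefined {
  const wantsClick = /^click\s+/i.test(step);
  // Strip the verb and any surrounding quotes to get the target label.
  const label = step.replace(/^click\s+/i, "").replace(/^['"]|['"]$/g, "");
  return nodes.find(
    (n) =>
      (!wantsClick || n.role === "button" || n.role === "link") &&
      n.name.toLowerCase().includes(label.toLowerCase()),
  );
}
```

The point of the sketch is the interface, not the heuristic: selection happens by role and accessible name against a ref, never by CSS selector.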

5. Loop across 18 tools

Each turn, the model picks exactly one tool from the closed set: navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, wait_for_stable, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, suggest_improvement, complete_scenario, http_request.
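The closed set above can be written down as a readonly tuple, which is one way to make "the agent cannot invent new tool names" a compile-time property; this is a sketch of the idea, not Assrt's actual type definitions:

```typescript
// The eighteen tool names listed in step 5, as a closed set. A readonly
// tuple plus a derived union type means any other string is rejected at
// compile time. Names come from the article; the types are illustrative.
const TOOLS = [
  "navigate", "snapshot", "click", "type_text", "select_option",
  "scroll", "press_key", "wait", "wait_for_stable", "screenshot",
  "evaluate", "create_temp_email", "wait_for_verification_code",
  "check_email_inbox", "assert", "suggest_improvement",
  "complete_scenario", "http_request",
] as const;

type ToolName = (typeof TOOLS)[number]; // the agent can emit nothing else
```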

6. Emit findings and assertions

assert writes {description, passed, evidence}. suggest_improvement writes {title, severity, description, suggestion}. Both end up on the TestReport; a scenario can be green on assertions and still have findings attached.
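Hypothetical TypeScript shapes for those two record types, with field names taken from the article and everything else (interface names, the example values) illustrative:

```typescript
// Record shapes from step 6. A case's pass flag is driven by assertions
// only, so findings can be attached to a green case.
interface Assertion { description: string; passed: boolean; evidence: string; }
interface Finding {
  title: string;
  severity: "critical" | "major" | "minor";
  description: string;
  suggestion: string;
}
interface CaseResult {
  name: string;
  passed: boolean;       // from assertions only
  assertions: Assertion[];
  findings: Finding[];   // may be non-empty even when passed is true
}

// A green case that still carries a UX finding (values invented):
const example: CaseResult = {
  name: "Onboarding copy",
  passed: true,
  assertions: [{ description: "Page loads", passed: true, evidence: "h1 visible" }],
  findings: [{
    title: "CTA label is ambiguous",
    severity: "minor",
    description: "'Continue' does not say what happens next",
    suggestion: "Rename the button to 'Create account'",
  }],
};
```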

7. Crawl in the background

queueDiscoverPage collects every new origin+pathname up to 20. A second agent reads each page's accessibility tree and emits 1-2 #Case ideas, capped at 3 concurrent. You end with more test ideas than you started with.

8. Write the report

TestReport lands at /tmp/assrt/results/latest.json with per-#Case pass, assertions, findings, and duration. /tmp/assrt/scenario.md holds the plan and auto-syncs back to cloud storage on edit.

Versus the rest of the “automated user testing” category

Two tables, two different categories. The first is against pure functional runners. The second is against human-in-the-loop research platforms. Assrt sits across both categories rather than inside either one.

Versus functional automation (Playwright, Cypress, Selenium)

• Output of a run
  Functional runner: pass / fail per spec, with HTML report
  Assrt: pass / fail per #Case, plus UX findings with severity, plus discovered pages
• Where UX complaints go
  Functional runner: nowhere, unless a human watches the trace afterwards
  Assrt: suggest_improvement records attached to the scenario in the JSON report
• How the user is defined
  Functional runner: a script you write with locators and awaits
  Assrt: an LLM agent picking from 18 tools against the accessibility tree
• Emails and OTP
  Functional runner: bring your own inbox service (MailSlurp, Mailosaur, custom SMTP)
  Assrt: create_temp_email + wait_for_verification_code, built into the tool list
• Coverage of unseen pages
  Functional runner: exactly what your specs cover
  Assrt: your #Cases plus up to 20 auto-discovered pages with 1-2 #Case ideas each
• Real user sessions
  Functional runner: fresh storage state per run; bring your own login fixture
  Assrt: --extension mode attaches to your running Chrome with real cookies
• License and hosting
  Functional runner: open-source framework plus proprietary recorders / SaaS
  Assrt: open-source, self-hosted, Playwright under the hood

Versus usability research (UserTesting, Maze, UXtweak)

• Who drives the app
  Research platform: recruited human participants, one at a time
  Assrt: an LLM agent, end to end, in minutes per run
• How findings are captured
  Research platform: moderator notes + session replays, reviewed afterwards
  Assrt: suggest_improvement records with severity, emitted mid-scenario
• Cost per additional run
  Research platform: participant incentives, scheduling, recruitment
  Assrt: LLM tokens for the agent loop (Claude Haiku by default)
• Functional assertions in the same run
  Research platform: not the point; functional correctness is out of scope
  Assrt: same agent, same run, via the assert tool
• Speed from change to feedback
  Research platform: days, sometimes weeks, per study
  Assrt: minutes; the agent kicks off from your coding agent
• Representativeness
  Research platform: real humans, real confusion, real edge cases
  Assrt: one simulated user; catches the obvious, cannot replace humans entirely

What the automated user never does

The point of a closed tool set is that the blast radius is knowable. The agent cannot spawn processes, read the filesystem, hit arbitrary APIs without http_request, or talk to other agents on the machine. It cannot file P0 tickets on your tracker; it writes a suggest_improvement record into the test report, and you decide what to do with it. And it does not re-run scenarios until they pass. A single failure at step three, after three genuine attempts at recovery, marks the scenario failed and moves on.

If you want retries, wrap the entire run in CI. If you want stricter pass criteria than the agent infers from the English steps, pass a passCriteria string and the prompt will append it to every scenario as a mandatory checklist. The automated user is fast and opinionated; the guardrails are on you.

Want to see an automated user file a UX finding against your app?

Fifteen minutes. We point Assrt at a URL of yours and walk through the report together. The agent does the clicking; you watch the findings land.

Book a call

Frequently asked questions

How is automated user testing with Assrt different from running Playwright specs?

Playwright specs are code you write and maintain; they answer one question per test: did the assertion pass or fail? Assrt's agent answers that question and a second one in the same run: would a real user find any of this confusing? The second question is answered via the suggest_improvement tool at agent.ts:158, which records {title, severity, description, suggestion}. A scenario can pass every assertion and still emit UX findings. Playwright has no equivalent because its surface is locators and expectations, not a model choosing between an assertion and a UX complaint on each turn.

How can an LLM agent genuinely behave like a real user?

It cannot replace a representative human, and the product does not claim to. What it does is use the accessibility tree the way a screen-reader user would: the snapshot tool returns nodes with roles and names, and the model matches your English step ('Click Get started') to a node and clicks by ref. It waits for DOM stability the way a user waits for a spinner to stop. It requests a verification email from an inbox the way a user would check Gmail for an OTP. And when the copy or layout doesn't make sense to the model reading the page, it files that as a suggest_improvement. None of this replaces a five-person research study; it catches the obvious-to-a-user issues at the speed of an automated run.

What exactly does suggest_improvement record, and how does it appear in the report?

Four fields, all required: title (short), severity (the string 'critical', 'major', or 'minor'), description (what is wrong), suggestion (how to fix). The tool is defined at agent.ts:158 in the same closed set as click, type_text, and assert. When the agent calls it, the run emits an improvement_suggestion event and the finding is attached to the test report. Functional assertions still drive pass/fail; UX findings sit alongside them. In practice this means a scenario can be green and still tell you three things the team should look at.

What does the auto-discovery system actually do during a test run?

Every time the main agent calls navigate(), the new URL is normalized to origin+pathname (trailing slash stripped, query string ignored) and queued. A background loop picks up to 3 queued URLs at a time, snapshots each page, and asks a second agent to produce 1-2 #Case ideas per page using the DISCOVERY_SYSTEM_PROMPT at agent.ts:256. The cap is 20 unique pages per session (MAX_DISCOVERED_PAGES). URLs matching /logout, /api/, about:blank, javascript:, data:, or chrome: are skipped. The discovered #Case ideas are emitted as events so you can review them and paste the good ones into the next run's plan.
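A sketch of those normalization and skip rules; the function and constant names here are illustrative, not Assrt's:

```typescript
// Normalize a navigated URL to origin + pathname (query ignored, trailing
// slash stripped) and return null for URLs the discovery queue skips.
// Skip patterns follow the answer above; the regexes are a sketch.
const SKIP = [/\/logout/, /\/api\//, /^about:blank/, /^javascript:/, /^data:/, /^chrome:/];

function normalizeForDiscovery(raw: string): string | null {
  if (SKIP.some((re) => re.test(raw))) return null;
  const u = new URL(raw);
  const path = u.pathname.replace(/\/$/, ""); // strip trailing slash
  return u.origin + path;
}
```

Deduplication then falls out naturally: keep a Set of normalized strings and stop adding once it holds 20 entries.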

Do I need an LLM API key, a cloud account, or a paid tier to use this?

You need an LLM key (Anthropic by default; Gemini supported). You do not need a cloud account. Scenarios, plans, results, and video all land on your local filesystem under /tmp/assrt. The cloud sync is optional; it uploads artifacts and returns shareable URLs if you want them. The MCP server and CLI are open-source on npm (@assrt-ai/assrt and @assrt-ai/assrt-mcp). Compared to enterprise automated user testing platforms that charge five figures per month and store every run on their infrastructure, the default posture here is: your key, your filesystem, your Playwright.

Can I use this for automated user testing against a site I am already logged into?

Yes, via --extension mode. The flag makes assrt-run attach to your running Chrome over the Playwright MCP extension bridge. Your real cookies, 2FA state, and sign-in history come along. First run asks you to approve in Chrome and save a token to ~/.assrt/extension-token; subsequent runs reuse it. This matters for automated user testing because the automated user's session is not synthetic; it is on the same trust plane as your own. The corollary is that you should not run destructive scenarios in extension mode unless you mean them to touch your real data.

What is in the final report a run produces?

Three things, all on disk. /tmp/assrt/scenario.md is the plan you (or the auto-discovery) wrote. /tmp/assrt/results/latest.json is the TestReport: each #Case with its pass flag, assertions (description, passed, evidence), UX findings (title, severity, description, suggestion), steps, and duration. If --video was set, a WebM file plus a player.html with 1x/2x/3x/5x/10x playback shortcuts. In CI, pass --json to print the same TestReport to stdout so a pipeline can grep pass counts without reading files.

Where does this fit between functional automation (Playwright) and UX research (Maze, UserTesting)?

It overlaps both without replacing either. Against Playwright: same runtime (Playwright MCP under the hood), different interface (Markdown #Cases instead of *.spec.ts), extra outputs (UX findings, auto-discovered pages). Against Maze or UserTesting: no recruited humans, no per-participant cost, runs in minutes, emits findings with severity. The honest answer is that your five-user usability study still surfaces things one LLM cannot. Assrt fits as automated user testing that runs every PR, reports the obvious-to-a-user issues continuously, and frees your human research budget for the complex studies.

What triggers the agent to file a UX finding rather than just assert?

The system prompt at agent.ts:198 tells it to 'Report an obvious bug or UX issue in the application' and the scenario's English steps can explicitly request that it evaluate copy or layout. Cases where it commonly fires: confusing button labels, missing progress indicators, broken visual hierarchy, copy that refers to features not visible, forms that reject valid input without an error message, or navigation that loops. The agent does not file style opinions; the tool definition is for obvious-to-a-user bugs, which in practice means things a product owner would also flag on a demo walkthrough.

Is there a way to see the run happen, or is the output only text?

There are three ways to see it. One, pass --headed and watch the browser window during the run. Two, pass --video to record a WebM with a cursor overlay (the browser manager injects a 20px red dot on every click) and auto-open a player.html with keyboard shortcuts for speed and seeking. Three, in MCP mode inside a coding agent, the agent receives a screencast WebSocket URL that streams the session at roughly 15 frames per second so the user sitting in Claude Code or Cursor can watch it live. The purpose is that automated user testing should produce something you can actually watch, not just a pass-fail signal.
