AI in automation testing is five primitives, not one smart model
Every article on the phrase stops at "the agent clicks like a human." That is the easy half. The hard half is a set of adaptive primitives classical Playwright never needed, because classical Playwright ran against deterministic apps written by humans. One of those primitives is a single JavaScript expression that pastes a six-digit OTP into six fields in one DOM event. Get that one wrong and your AI test stalls on every signup flow.
This guide walks through the five primitives that make AI in automation testing actually work, with the exact line numbers in the Assrt agent source where each one lives. You can open the file, read the code, and fork it.
“The model is not the clever part. The clever parts are a DataTransfer paste, a MutationObserver, an inbox poll, an HTTP client, and an unbounded retry loop. Every one is in a single file.”
agent.ts in assrt-mcp (lines 7-196)
Why "AI in automation testing" is the right phrase
The common framing is that AI replaces automation testing. That framing is wrong in a way that produces bad tools. Automation testing still happens. Playwright is still the runtime. Chromium still launches. A selector still has to resolve to a DOM node. What AI adds is a small set of primitives that handle situations classical automation could not: streaming AI-rendered content that doesn't finish when the network goes quiet, multi-field OTP inputs that drop characters when you type them one at a time, disposable email verification inside a test run, webhook assertions that require an HTTP client. These are primitives, not intelligence. The AI picks which one to call. The primitive does the work.
The rest of this guide walks through each of the five primitives, with the specific lines in the Assrt agent source where each is implemented, so you can replicate the pattern in any stack.
The five primitives at a glance
Adaptive DOM stability
wait_for_stable injects a MutationObserver, polls window.__assrt_mutations every 500ms, and returns as soon as the page has been quiet for 2 seconds. Fast pages return fast; streaming pages wait as long as they actually stream.
Disposable inbox + OTP polling
create_temp_email hits temp-mail.io for a fresh inbox. wait_for_verification_code polls that inbox until the verification mail arrives. Your signup test does not need a fixture account or a magic link shortcut.
One-event multi-field OTP paste
The system prompt at line 235 hardcodes the exact DataTransfer + ClipboardEvent expression the agent must pass to 'evaluate'. One synthetic paste fills all six fields atomically. React-controlled inputs stop dropping characters.
External API verification
http_request is a first-class tool. The agent can poll Telegram's getUpdates after a UI action, fetch a Stripe event, or call a custom assertion endpoint. Integration tests stop needing a separate framework.
Infinite-step recovery
MAX_STEPS_PER_SCENARIO = Infinity (line 7). On a failed action the agent re-snapshots, picks a different accessibility ref, and tries again instead of bailing at the first flake. complete_scenario is the only way out, and it always writes pass/fail with evidence.
Primitive 1: the OTP paste one-liner (the anchor fact)
OTP-gated signup flows are where every previous generation of AI browser tester broke. The pattern is a row of six single-character inputs, usually input[maxlength="1"], controlled by React. An agent that types one character per field loses characters to focus transitions and auto-advance logic. An agent that tries a paste on a single input fills only that input. The version that works is a synthetic ClipboardEvent fired on the parent of the first input, carrying a DataTransfer with the full code. The browser fans the paste across the sibling inputs, React processes it as one event, every digit lands.
Assrt hardcodes the working expression into the system prompt the agent sees on every run, at line 235 of agent.ts. The next line forbids the agent from modifying it except to substitute the real code. This is a deliberate choice: letting the model improvise produces a different broken variant every run.
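A minimal reconstruction of that pattern, following the description above. This is not the verbatim line-235 expression; the helper name and the OTP-shape guard are illustrative, but the event shape matches what the prompt describes: a DataTransfer carrying the full code, dispatched as one ClipboardEvent on the parent of the first single-character input.

```typescript
// Sketch of the one-event OTP paste. The real agent hardcodes the browser-side
// expression in its system prompt; this helper builds an equivalent string to
// hand to an evaluate-style tool. buildOtpPasteExpression is an illustrative
// name, not an identifier from agent.ts.
function buildOtpPasteExpression(code: string): string {
  if (!/^\d{4,8}$/.test(code)) throw new Error(`unexpected OTP shape: ${code}`);
  return `(() => {
    const first = document.querySelector('input[maxlength="1"]');
    if (!first || !first.parentElement) return 'no-otp-inputs';
    const dt = new DataTransfer();
    dt.setData('text/plain', '${code}');
    // Dispatch on the PARENT so the browser fans the paste across siblings
    // and React handles it as a single synthetic event.
    const evt = new ClipboardEvent('paste', {
      clipboardData: dt, bubbles: true, cancelable: true,
    });
    first.parentElement.dispatchEvent(evt);
    return 'pasted';
  })()`;
}
```

The agent would pass the returned string to its evaluate tool once the verification code arrives, e.g. `evaluate({ expression: buildOtpPasteExpression('482917') })`.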
What one signup flow looks like end to end
The primitives don't matter on their own. They matter when an agent strings them together. Here is a real six-step signup flow as the agent traverses it, with the specific tool at each step.
Agent path through an OTP-gated signup
Click Sign up
Agent: snapshot → finds button by accessibility ref → click(ref=e12)
Fill email
Agent: create_temp_email → temp-mail.io returns abc123@temp-inbox.tld → type_text
Submit form
Agent: click Submit → wait_for_stable → snapshot shows 'Check your email' heading
Poll the real inbox
Agent: wait_for_verification_code(60) → polls temp-mail.io → returns '482917'
Paste the OTP atomically
Agent: evaluate(OTP_PASTE_EXPRESSION) with 482917 → single ClipboardEvent fills 6 fields
Assert verified
Agent: wait_for_stable → snapshot → assert heading == 'Welcome'. complete_scenario(passed=true)
The same flow as a sequence diagram, showing the actors the agent talks to: the browser (via Playwright MCP), the disposable-inbox service, and eventually the app's own UI.
Agent, browser, and disposable inbox on one OTP step
Primitive 2: adaptive DOM stability, not setTimeout
AI apps stream. Chat UIs paint tokens a character at a time for several seconds after the network goes quiet. Generative UIs rewrite whole sections after a user clicks. A hardcoded sleep either flakes (too short for a slow stream) or wastes time (too long for a fast one). Playwright's built-in waits key off network events, which miss this entirely because the network went quiet long before the DOM did.
wait_for_stable measures what actually matters for a downstream click: DOM activity. It injects a MutationObserver into the live page, counts mutations, and returns as soon as the counter has been flat for the configured quiet window (default 2 seconds, clamped to 10). The timeout defaults to 30 seconds, clamped to 60.
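A minimal sketch of the polling half of that loop, with the browser side abstracted behind a counter callback. In the real agent the counter is window.__assrt_mutations, incremented by an injected MutationObserver; the function name and options object here are illustrative.

```typescript
// Poll a mutation counter every pollMs; return stable once the counter has
// been flat for quietMs, or give up after timeoutMs. Defaults mirror the
// described behavior (500 ms poll, 2 s quiet window, 30 s timeout).
async function waitForStable(
  readMutationCount: () => Promise<number>,
  opts: { quietMs?: number; timeoutMs?: number; pollMs?: number } = {},
): Promise<{ stable: boolean; elapsedMs: number }> {
  const { quietMs = 2_000, timeoutMs = 30_000, pollMs = 500 } = opts;
  const start = Date.now();
  let last = await readMutationCount();
  let quietSince = Date.now();
  while (Date.now() - start < timeoutMs) {
    await new Promise((r) => setTimeout(r, pollMs));
    const now = await readMutationCount();
    if (now !== last) {
      last = now;                 // DOM still changing: reset the quiet window
      quietSince = Date.now();
    } else if (Date.now() - quietSince >= quietMs) {
      return { stable: true, elapsedMs: Date.now() - start };
    }
  }
  return { stable: false, elapsedMs: Date.now() - start };
}
```

Fast pages hit the quiet window almost immediately; a page that streams for eight seconds simply keeps resetting the window until it stops.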
Primitive 3: a real inbox, not a mock
Three tools make up the inbox layer, all defined between lines 115 and 131 of agent.ts: create_temp_email, wait_for_verification_code, and check_email_inbox. They talk to temp-mail.io through a thin client in email.ts. When a test hits a signup form, the system prompt tells the agent to reach for this tool sequence: create an address, use it in the form, submit, poll for the OTP, then paste with the expression from Primitive 1.
This is the difference between "test the signup flow" and "seed a fixture user and skip verification." The fixture route hides every bug that lives in the verify-email step, which is exactly the set of bugs that makes it to production. A real inbox primitive makes the full path testable.
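The shape of the polling loop behind wait_for_verification_code can be sketched like this, with the temp-mail.io client abstracted as a callback. The 6-digit regex, 2-second poll, and 60-second default are illustrative stand-ins, not the library's exact values.

```typescript
// Poll an inbox until a message containing an OTP-looking code arrives,
// then return the code for the paste step. InboxMessage and the callback
// are placeholders for the real email client in email.ts.
type InboxMessage = { subject: string; body: string };

async function waitForVerificationCode(
  fetchInbox: () => Promise<InboxMessage[]>,
  opts: { timeoutMs?: number; pollMs?: number; pattern?: RegExp } = {},
): Promise<string | null> {
  const { timeoutMs = 60_000, pollMs = 2_000, pattern = /\b(\d{6})\b/ } = opts;
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    for (const msg of await fetchInbox()) {
      const hit = (msg.subject + ' ' + msg.body).match(pattern);
      if (hit) return hit[1];      // first captured group is the code
    }
    await new Promise((r) => setTimeout(r, pollMs));
  }
  return null;                     // timed out: let the agent fail the step
}
```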
Primitive 4: http_request for downstream assertions
Half of a modern app's tests are about what the page did to a downstream system, not what the page shows. When you click "Send notification," the interesting assertion is that Telegram got a message, that your Slack webhook fired, or that a Stripe event landed. Classical automation suites either skipped these (flaky, partial) or mocked them (lies).
http_request is a first-class tool in the same agent. The agent decides when to call it based on the English step. A #Case like the one below mixes UI actions and webhook verification in a single scenario, no second framework.
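A hypothetical #Case of that shape, since the plan format is plain Markdown. The case number, step wording, and bot URL are illustrative, not copied from the Assrt repo:

```markdown
#Case 3: Send notification reaches Telegram
1. Open /dashboard/notifications and wait for the page to settle
2. Click "Send notification"
3. http_request GET https://api.telegram.org/bot<token>/getUpdates
4. Assert the newest update contains the notification text
5. Mark the scenario passed with the Telegram response as evidence
```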
Primitive 5: unbounded recovery, bounded exits
Line 7 of agent.ts sets MAX_STEPS_PER_SCENARIO = Infinity. That looks dangerous; it is the most conservative choice. A flaky middle step should not kill a long integration run. Instead, when an action fails, the recovery rules in the system prompt say: snapshot the page fresh, look at what changed, pick a different accessibility ref, try again, and if still stuck after three tries, call complete_scenario with passed=false and evidence.
The agent always exits. The exit is explicit, records evidence, and is reviewable in results/latest.json. The infinity cap buys the agent the room to recover across a transient flake instead of surfacing it as a failure.
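The recovery contract reads as a loop shape more than a step budget. A sketch, with the real tool calls (snapshot, click, complete_scenario) abstracted as callbacks; the function and type names are illustrative.

```typescript
// No step cap, but a bounded retry per step: every failure triggers a fresh
// snapshot and a different accessibility ref before the next attempt, and
// after maxAttempts the step resolves to a failure the caller records via
// complete_scenario(passed=false).
type StepResult = { ok: boolean; error?: string };

async function runStepWithRecovery(
  act: (ref: string) => Promise<StepResult>,
  snapshotFreshRef: () => Promise<string>,
  maxAttempts = 3,
): Promise<StepResult> {
  let ref = await snapshotFreshRef();
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await act(ref);
    if (result.ok) return result;
    // Failure: re-read the accessibility tree and pick a different ref,
    // rather than retrying the same stale selector.
    ref = await snapshotFreshRef();
  }
  return { ok: false, error: `step failed after ${maxAttempts} attempts` };
}
```

The outer scenario loop can run as many of these steps as the #Case needs; the only exit is an explicit pass/fail.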
One scenario from launch to exit
Launch + navigate
preflightUrl probes your server before a Chromium boot. launchLocal brings up @playwright/mcp over stdio. navigate loads the URL with a 30s bound.
First snapshot + reasoning
The agent gets an accessibility tree with [ref=eN] tags for every interactive element. It chooses a tool call (click, type_text, scroll, etc.) based on the next step of the #Case.
Primitive stack activates as needed
wait_for_stable when the page streams. create_temp_email + the OTP paste on signup. http_request for webhook checks. The agent composes them on demand; no separate frameworks get involved.
Recovery on failure
When a tool call fails, the agent re-snapshots, picks a different ref, retries up to three times. If still stuck, it scrolls and tries once more. The loop has no step cap; it has an exit contract.
complete_scenario with evidence
Every path out of the loop calls complete_scenario(passed, summary). Assertions along the way produce evidence lines. /tmp/assrt/results/latest.json captures everything; the optional cloud sync mirrors it for review.
The system, one beam at a time
Inputs flow into a single model call per turn. The model picks a primitive, the primitive talks to its specific external surface, and structured events stream back to the caller. There is no orchestration layer hidden behind it.
Tools in, decisions out
What the numbers actually are
No invented benchmarks. Each number is a constant you can find in the source: agent.ts lines 7 (MAX_STEPS), 9 (DEFAULT_ANTHROPIC_MODEL), 16-196 (TOOLS array), 957-958 (wait_for_stable clamps), 124-125 (OTP timeout default), and 1087 for the file length.
Watching it run on a real signup flow
This is the log from a single assrt run against a staging signup. Notice how the tool calls alternate: snapshot, act, wait_for_stable, snapshot again. The OTP paste is a single evaluate call. The webhook check is a single http_request.
Against what "AI testing" usually means
| Feature | Classical / closed | Assrt |
|---|---|---|
| OTP-gated signup flow | Skip the OTP step; seed a fixture user in your DB | Real temp-mail.io inbox + one-event paste trick; full flow runs |
| Waiting for streaming / AI content | Hardcoded setTimeout or waitForLoadState('networkidle') | MutationObserver watches real DOM activity; adapts per run |
| Element targeting | CSS / XPath selectors that rot when a class name changes | Accessibility-tree refs like [ref=e12] re-read every step |
| Verifying a downstream webhook | Separate test suite in a different framework, or mocked | http_request tool call in the same #Case; no second runner |
| Flaky middle step | Retry the whole test once, then give up | Infinite-step inner loop; re-snapshot, pick a new ref, continue |
| Test plan format | TypeScript .spec.ts files with locator strings | Plain Markdown #Case blocks at /tmp/assrt/scenario.md |
| Source you can read | Closed SaaS backend; maybe a thin open wrapper | 1,087 lines of open TypeScript at agent.ts (assrt-mcp repo) |
| Cost at seat parity | $7,500 / month for closed AI QA platforms | Open source + your own Anthropic or Gemini key |
Primitive surface at a glance
Every chip below is a concrete name in the agent source or the #Case DSL. No marketing terms.
Where this leaves a team considering AI in automation testing
The moment you start using AI to drive a browser, the question stops being "how smart is the model" and starts being "what primitives did the framework give it." Ask a vendor how they stabilize for streaming content. Ask them how they clear a real OTP. Ask them whether their agent can call an external API in the same #Case. If those answers are not on-disk TypeScript, you are buying a black box whose failure modes nobody can fix for you.
Assrt's five primitives are on disk, named, and within two hundred lines of each other. The LLM is swappable. Your test plan is Markdown. The point of this guide is not to pitch the product; it is to show you the specific shape of the primitives so you can evaluate any tool in the category, not just this one. If you want the easiest path to actually running them on your stack, talk to us.
Want the five primitives running on your signup flow this week?
30 minutes. We'll plug Assrt into your staging environment, clear an OTP flow end to end, and hand you the #Case file to keep.
Book a call →

Frequently asked questions about AI in automation testing
What does 'AI in automation testing' actually change, at the code level, versus plain Playwright?
Five specific primitives get added, all visible in the Assrt agent source. First, wait_for_stable injects a MutationObserver at run time and waits for 2 consecutive seconds of zero DOM mutations or 30 seconds elapsed (agent.ts lines 962-999), which replaces hardcoded setTimeout sleeps that either flake or waste time. Second, create_temp_email + wait_for_verification_code + check_email_inbox give the agent its own inbox from temp-mail.io (agent.ts lines 115-131), so signup flows run end-to-end without fixture accounts. Third, a DataTransfer + ClipboardEvent one-liner hardcoded in the system prompt at agent.ts line 235 pastes a full OTP into N single-character input fields in one DOM event, bypassing the one-character-at-a-time race that breaks React-controlled inputs. Fourth, http_request (agent.ts lines 172-184) lets the agent call external APIs like Telegram's getUpdates to verify a webhook fired after a UI action. Fifth, MAX_STEPS_PER_SCENARIO is set to Infinity at agent.ts line 7, so when a step fails the loop re-snapshots the page and tries again with fresh accessibility refs instead of bailing at the first red mark.
Why is the OTP paste trick baked into the system prompt as a hardcoded JS expression instead of being a separate tool?
Because the reliable version is a single DOM event on a specific parent element with a specific event shape, and leaving the agent to derive it from first principles every time produces a different broken version on each run. The exact expression at agent.ts line 235 builds a DataTransfer, sets text/plain on it, and dispatches a synthetic ClipboardEvent on the PARENT of the first input[maxlength="1"]. That matches how browsers actually paste clipboard content across a chunk of grouped inputs, so React's synthetic event handlers process it once with the whole code. The prompt explicitly says 'Do NOT modify this expression except to replace CODE_HERE' because any deviation (typing into each field, dispatching on the input instead of the parent, using keydown) gets swallowed by React's controlled-input logic and the agent ends up with 2 digits entered, a stuck form, and a retry loop. Hardcoding the one working expression in the prompt is a more robust design choice than giving the model freedom to improvise.
How is wait_for_stable different from Playwright's built-in waitForLoadState('networkidle')?
networkidle fires when there are no network requests for 500ms, which says nothing about whether the page has finished rendering. AI-generated UIs, streaming chat, and React SPAs routinely finish their network activity long before they finish painting. wait_for_stable measures the thing that actually matters for a downstream click: DOM mutations. The implementation at agent.ts lines 962-999 injects a MutationObserver watching document.body with childList, subtree, and characterData, polls window.__assrt_mutations every 500 ms, and breaks out of the loop when stable_seconds (default 2) pass with no change in the counter. Maximum timeout is 30 seconds by default, clamped to 60 (line 957). That primitive was added specifically because AI-assisted tests run against AI-assisted apps, and AI-assisted apps stream content for seconds after the network goes quiet.
Why does Assrt's agent need http_request if it already drives a browser?
Because half of modern app tests are not about what shows up on the page. They are about what the page did to a downstream system. After you click 'Send notification' in a web app, the assertion that matters is whether Telegram actually got a message, whether your Slack incoming webhook got called, whether Stripe recorded the payment. http_request (agent.ts lines 172-184) lets the agent poll api.telegram.org/bot<token>/getUpdates or post to a verification endpoint or fetch a Stripe event, and then make assertions against the HTTP response instead of guessing from the UI. Classical Playwright tests either skipped this (flaky, partial) or mocked it (lies). An AI agent with http_request can verify it end-to-end without leaving the one tool surface it already knows.
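The poll-then-assert pattern that answer describes can be sketched as a single helper. This is not the agent's http_request implementation, just the shape of a downstream assertion built on it; it assumes a global fetch (Node 18+), and the names are illustrative.

```typescript
// Poll an HTTP endpoint until a predicate matches the JSON body or time runs
// out. The URL and predicate come from the #Case step (Telegram getUpdates,
// a Stripe event, a custom verification endpoint).
async function assertDownstream(
  url: string,
  matches: (body: any) => boolean,
  opts: { timeoutMs?: number; pollMs?: number } = {},
): Promise<boolean> {
  const { timeoutMs = 15_000, pollMs = 1_000 } = opts;
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const res = await fetch(url);
    if (res.ok && matches(await res.json())) return true;
    await new Promise((r) => setTimeout(r, pollMs));
  }
  return false;
}
```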
What stops an AI agent from getting stuck in an infinite retry loop when MAX_STEPS is set to Infinity?
Three structural checks. First, the agent must call complete_scenario (tool at agent.ts lines 146-156) to end a scenario, and every call sets passed=true or passed=false with a summary, so 'infinite' means the scenario stays open until the agent explicitly decides it is done. Second, every assert (lines 133-144) produces evidence and pass/fail, so the agent accumulates a record that biases it toward closing out once it has a decisive result. Third, when a step fails the recovery rule in the system prompt (lines 220-226) says: snapshot, look at the changed page, try a different ref, scroll and retry, and if still stuck after 3 attempts, call complete_scenario with passed=false. The Infinity cap is there so a flaky middle step doesn't kill a long integration run; it is not carte blanche to loop forever.
Is the disposable inbox actually real, or is it a simulated LLM response?
It is real email. create_temp_email calls out to temp-mail.io and returns a working inbox address (implementation in /Users/matthewdi/assrt-mcp/src/core/email.ts). wait_for_verification_code polls that inbox for up to 60 seconds, parses the OTP out of the newest matching message, and returns it as a string the agent can feed into its paste expression. If your app sends real SMTP mail, Assrt's agent sees the real mail. This is the primitive that turns 'test the signup flow' from 'seed a fixture user and skip verification' into 'create a fresh user on prod-like infra, clear the email-verify step, and continue into the logged-in experience'.
Which model is actually running the agent loop, and can I swap it out?
claude-haiku-4-5-20251001 by default (agent.ts line 9). Haiku is the right choice for a tool-use loop that is heavy on fast decisions and light on long-form generation. The whole run() path is provider-agnostic: both Anthropic (with either an API key or a Claude Code OAuth token) and Gemini 3.1 Pro work through the same GEMINI_FUNCTION_DECLARATIONS bridge at agent.ts lines 277-301. You can set ANTHROPIC_MODEL or GEMINI_MODEL env vars, or point ANTHROPIC_BASE_URL at a local proxy to keep all data on your network. Nothing in the 18 tool definitions is provider-specific.
Does my test file end up proprietary once I use Assrt?
No. The test file is Markdown. Each test is a #Case N block with English steps, stored at /tmp/assrt/scenario.md and optionally synced to the app.assrt.ai dashboard if you want sharing. The scenario layout is defined at /Users/matthewdi/assrt-mcp/src/core/scenario-files.ts: scenario.md is your plan, scenario.json is metadata, results/latest.json is the last run. You can grep, diff, and commit the plan alongside your Playwright project. If you ever leave Assrt, your authoritative test artifacts are already plain text on your disk.
How do I actually run this against my own app today?
Two paths. Inside Claude Code or another MCP client: add the assrt-mcp server (one line in your MCP config), then prompt 'Use assrt_test on http://localhost:3000' with a plan, and the three tools assrt_plan, assrt_test, and assrt_diagnose become available. From a plain terminal: npx @assrt-ai/assrt setup, then `assrt run --url http://localhost:3000 --plan '#Case 1: Signup flow completes'`. Add --video to record and auto-open a player, --extension to attach to your real Chrome so you keep your saved sessions, --isolated if you want an in-memory profile. Every flag is documented in /Users/matthewdi/assrt-mcp/README.md.