AI in automation testing is five primitives, not one smart model

Every article written around the phrase stops at "the agent clicks like a human." That is the easy half. The hard half is a set of adaptive primitives classical Playwright never needed, because classical Playwright ran against deterministic apps written by humans. One of those primitives is a single JavaScript expression that pastes a six-digit OTP into six fields in one DOM event. Get that one wrong and your AI test stalls on every signup flow.

This guide walks through the five primitives that make AI in automation testing actually work, with the exact line numbers in the Assrt agent source where each one lives. You can open the file, read the code, and fork it.

Matthew Diakonov · 12 min read · 4.8 from 189 engineers
- Open source: 1,087 lines of TypeScript at assrt-mcp/src/core/agent.ts
- 5 adaptive primitives documented with file:line citations
- Works with Claude Haiku 4.5 or Gemini 3.1 Pro, your key

The model is not the clever part. The clever parts are a DataTransfer paste, a MutationObserver, an inbox poll, an HTTP client, and an unbounded retry loop. Every one is in a single file.

agent.ts in assrt-mcp (lines 7-196)

Why "AI in automation testing" is the right phrase

The common framing is that AI replaces automation testing. That framing is wrong in a way that produces bad tools. Automation testing still happens. Playwright is still the runtime. Chromium still launches. A selector still has to resolve to a DOM node. What AI adds is a small set of primitives that handle situations classical automation could not: streaming AI-rendered content that doesn't finish when the network goes quiet, multi-field OTP inputs that drop characters when you type them one at a time, disposable email verification inside a test run, webhook assertions that require an HTTP client. These are primitives, not intelligence. The AI picks which one to call. The primitive does the work.

The rest of this guide walks through each of the five primitives, with the specific lines in the Assrt agent source where each is implemented, so you can replicate the pattern in any stack.

The five primitives at a glance

Adaptive DOM stability

wait_for_stable injects a MutationObserver, polls window.__assrt_mutations every 500ms, and returns as soon as the page has been quiet for 2 seconds. Fast pages return fast; streaming pages wait as long as they actually stream.

Disposable inbox + OTP polling

create_temp_email hits temp-mail.io for a fresh inbox. wait_for_verification_code polls that inbox until the verification mail arrives. Your signup test does not need a fixture account or a magic link shortcut.

One-event multi-field OTP paste

The system prompt at line 235 hardcodes the exact DataTransfer + ClipboardEvent expression the agent must pass to 'evaluate'. One synthetic paste fills all six fields atomically. React-controlled inputs stop dropping characters.

External API verification

http_request is a first-class tool. The agent can poll Telegram's getUpdates after a UI action, fetch a Stripe event, or call a custom assertion endpoint. Integration tests stop needing a separate framework.

Infinite-step recovery

MAX_STEPS_PER_SCENARIO = Infinity (line 7). On a failed action the agent re-snapshots, picks a different accessibility ref, and tries again instead of bailing at the first flake. complete_scenario is the only way out, and it always writes pass/fail with evidence.

Primitive 1: the OTP paste one-liner (the anchor fact)

OTP-gated signup flows are where every previous generation of AI browser tester broke. The pattern is a row of six single-character inputs, usually input[maxlength="1"], controlled by React. An agent that types one character per field loses characters to focus transitions and auto-advance logic. An agent that tries a paste on a single input fills only that input. The version that works is a synthetic ClipboardEvent fired on the parent of the first input, carrying a DataTransfer with the full code. The browser fans the paste across the sibling inputs, React processes it as one event, and every digit lands.

Assrt hardcodes the working expression into the system prompt the agent sees on every run, at line 235 of agent.ts. The next line forbids the agent from modifying it except to substitute the real code. This is a deliberate choice: letting the model improvise produces a different broken variant every run.

agent.ts: the OTP paste system-prompt verbatim
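The verbatim expression lives in the prompt embed above. As a hypothetical sketch of the shape such an expression takes (the selector, the parent dispatch, and the event options here are assumptions, not the repo's exact string):

```typescript
// Hypothetical sketch of the one-event OTP paste, NOT the verbatim prompt
// expression from agent.ts line 235. It builds the JS string an agent would
// pass to evaluate: one DataTransfer carrying the whole code, dispatched as
// a single synthetic ClipboardEvent on the PARENT of the first OTP input.
function buildOtpPasteExpression(code: string): string {
  return `(() => {
  const first = document.querySelector('input[maxlength="1"]');
  const dt = new DataTransfer();
  dt.setData('text/plain', '${code}');
  // Dispatch on the parent so the browser fans the paste across the sibling
  // inputs and React processes it as one event instead of six keystrokes.
  first.parentElement.dispatchEvent(
    new ClipboardEvent('paste', { clipboardData: dt, bubbles: true, cancelable: true })
  );
})()`;
}
```

An agent call would then look like evaluate(buildOtpPasteExpression('482917')). The parent dispatch is the load-bearing detail; dispatching on the input itself fills one field only.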

What one signup flow looks like end to end

The primitives don't matter on their own. They matter when an agent strings them together. Here is a real six-step signup flow as the agent traverses it, with the specific tool at each step.

Agent path through an OTP-gated signup

1. Click Sign up. Agent: snapshot → finds button by accessibility ref → click(ref=e12)
2. Fill email. Agent: create_temp_email → temp-mail.io returns abc123@temp-inbox.tld → type_text
3. Submit form. Agent: click Submit → wait_for_stable → snapshot shows 'Check your email' heading
4. Poll the real inbox. Agent: wait_for_verification_code(60) → polls temp-mail.io → returns '482917'
5. Paste the OTP atomically. Agent: evaluate(OTP_PASTE_EXPRESSION) with 482917 → single ClipboardEvent fills 6 fields
6. Assert verified. Agent: wait_for_stable → snapshot → assert heading == 'Welcome' → complete_scenario(passed=true)

The same flow as a sequence diagram, showing the actors the agent talks to: the browser (via Playwright MCP), the disposable-inbox service, and eventually the app's own UI.

Agent, browser, and disposable inbox on one OTP step

Agent → Browser: snapshot(), returns accessibility tree with ref=e12 (OTP container)
Agent → Inbox: wait_for_verification_code(60), returns code=482917
Agent → Browser: evaluate(OTP_PASTE_EXPRESSION with 482917), pasted 6 fields
Agent → Browser: wait_for_stable (2s quiet), page stabilized after 1.4s, 17 mutations
Agent → Browser: snapshot(), heading='Welcome, Matthew'

Primitive 2: adaptive DOM stability, not setTimeout

AI apps stream. Chat UIs paint tokens a character at a time for several seconds after the network goes quiet. Generative UIs rewrite whole sections after a user clicks. A hardcoded sleep either flakes (too short for a slow stream) or wastes time (too long for a fast one). Playwright's built-in waits key off network events, which miss this entirely because the network went quiet long before the DOM did.

wait_for_stable measures what actually matters for a downstream click: DOM activity. It injects a MutationObserver into the live page, counts mutations, and returns as soon as the counter has been flat for the configured quiet window (default 2 seconds, clamped to 10). The timeout defaults to 30 seconds, clamped to 60.

agent.ts: wait_for_stable implementation
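The embedded source shows the real implementation; the control flow reduces to a loop like this sketch, where readCount stands in for polling window.__assrt_mutations (names and structure are illustrative, not the repo's code):

```typescript
// Sketch of the adaptive-stability loop: poll a mutation counter every
// pollMs, and return once the counter has been flat for quietMs, or give
// up after timeoutMs. Defaults mirror the values described in the article.
async function waitForStable(
  readCount: () => number,
  quietMs = 2000,
  timeoutMs = 30000,
  pollMs = 500,
): Promise<{ stable: boolean; elapsedMs: number }> {
  const start = Date.now();
  let last = readCount();
  let quietSince = Date.now();
  while (Date.now() - start < timeoutMs) {
    await new Promise((r) => setTimeout(r, pollMs));
    const now = readCount();
    if (now !== last) {
      last = now;
      quietSince = Date.now(); // page still mutating: reset the quiet window
    } else if (Date.now() - quietSince >= quietMs) {
      return { stable: true, elapsedMs: Date.now() - start };
    }
  }
  return { stable: false, elapsedMs: Date.now() - start };
}
```

A fast page exits after one quiet window; a streaming page keeps resetting the window until it actually stops painting, which is exactly what a hardcoded sleep cannot do.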

Primitive 3: a real inbox, not a mock

Three tools make up the inbox layer, all defined between lines 115 and 131 of agent.ts: create_temp_email, wait_for_verification_code, and check_email_inbox. They talk to temp-mail.io through a thin client in email.ts. When a test hits a signup form, the system prompt tells the agent to reach for this tool sequence: create an address, use it in the form, submit, poll for the OTP, then paste with the expression from Primitive 1.

This is the difference between "test the signup flow" and "seed a fixture user and skip verification." The fixture route hides every bug that lives in the verify-email step, which is exactly the set of bugs that makes it to production. A real inbox primitive makes the full path testable.
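The three tools reduce to one pattern: poll the inbox until a message containing a code shows up, then hand the code to the paste expression. A minimal sketch, assuming fetchMessages stands in for the temp-mail.io client in email.ts and that codes are six digits (both assumptions):

```typescript
// Sketch of wait_for_verification_code's polling pattern. fetchMessages is
// a stand-in for the real inbox client; the 6-digit regex is a guess at the
// code format, not the repo's parser.
async function waitForVerificationCode(
  fetchMessages: () => Promise<string[]>,
  timeoutSec = 60,
  pollMs = 2000,
): Promise<string | null> {
  const deadline = Date.now() + timeoutSec * 1000;
  while (Date.now() < deadline) {
    for (const body of await fetchMessages()) {
      const m = body.match(/\b(\d{6})\b/); // first 6-digit run in the mail body
      if (m) return m[1];
    }
    await new Promise((r) => setTimeout(r, pollMs));
  }
  return null; // timed out: the agent reports failure with evidence
}
```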

Primitive 4: http_request for downstream assertions

Half of a modern app's tests are about what the page did to a downstream system, not what the page shows. When you click "Send notification," the interesting assertion is that Telegram got a message, that your Slack webhook fired, or that a Stripe event landed. Classical automation suites either skipped these (flaky, partial) or mocked them (lies).

http_request is a first-class tool in the same agent. The agent decides when to call it based on the English step. A #Case like the one below mixes UI actions and webhook verification in a single scenario, no second framework.

telegram-alert.md
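The embed above carries the actual plan file; its shape is plain Markdown with English steps. A hypothetical plan in that shape (every step, the token variable, and the asserted text below are invented for illustration, not copied from the repo):

```markdown
#Case 1: Send notification reaches Telegram
1. Open the dashboard and click "Send notification"
2. Wait for the page to stabilize
3. http_request GET https://api.telegram.org/bot$TELEGRAM_TOKEN/getUpdates
4. Assert the newest update's message text contains "Deploy finished"
```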

Primitive 5: unbounded recovery, bounded exits

Line 7 of agent.ts sets MAX_STEPS_PER_SCENARIO = Infinity. That looks dangerous; it is the most conservative choice. A flaky middle step should not kill a long integration run. Instead, when an action fails, the recovery rules in the system prompt say: snapshot the page fresh, look at what changed, pick a different accessibility ref, try again, and if still stuck after three tries, call complete_scenario with passed=false and evidence.

The agent always exits. The exit is explicit, records evidence, and is reviewable in results/latest.json. The infinity cap buys the agent the room to recover across a transient flake instead of surfacing it as a failure.
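The contract can be sketched as follows, with act and resnapshot standing in for real tool calls; the three-attempt rule mirrors the system prompt, but this loop is illustrative, not the repo's code:

```typescript
// Sketch of unbounded recovery with a bounded exit: retry a failing action
// with a fresh snapshot between attempts, and always leave through an
// explicit pass/fail exit instead of crashing on the first flake.
type Exit = { passed: boolean; summary: string };

async function runStep(
  act: () => Promise<void>,
  resnapshot: () => Promise<void>,
  maxAttempts = 3,
): Promise<Exit> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await act();
      return { passed: true, summary: `succeeded on attempt ${attempt}` };
    } catch {
      await resnapshot(); // re-read the page and pick a fresh ref before retrying
    }
  }
  // The only way out is an explicit exit that records the outcome.
  return { passed: false, summary: `gave up after ${maxAttempts} attempts` };
}
```

The point is that "infinite steps" and "always terminates" are compatible: the loop has no step budget, but every path out of it goes through the explicit exit.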

One scenario from launch to exit

1. Launch + navigate. preflightUrl probes your server before a Chromium boot. launchLocal brings up @playwright/mcp over stdio. navigate loads the URL with a 30s bound.
2. First snapshot + reasoning. The agent gets an accessibility tree with [ref=eN] tags for every interactive element. It chooses a tool call (click, type_text, scroll, etc.) based on the next step of the #Case.
3. Primitive stack activates as needed. wait_for_stable when the page streams, create_temp_email plus the OTP paste on signup, http_request for webhook checks. The agent composes them on demand; no separate frameworks get involved.
4. Recovery on failure. When a tool call fails, the agent re-snapshots, picks a different ref, and retries up to three times. If still stuck, it scrolls and tries once more. The loop has no step cap; it has an exit contract.
5. complete_scenario with evidence. Every path out of the loop calls complete_scenario(passed, summary). Assertions along the way produce evidence lines. /tmp/assrt/results/latest.json captures everything; the optional cloud sync mirrors it for review.

The system, one beam at a time

Inputs flow into a single model call per turn. The model picks a primitive, the primitive talks to its specific external surface, and structured events stream back to the caller. There is no orchestration layer hidden behind it.

Tools in, decisions out

Your URL + #Case plan + Variables → claude-haiku-4-5 → Browser · Inbox · External APIs → Your test report

What the numbers actually are

- 18 agent tool definitions in agent.ts
- 2 seconds of DOM quiet before wait_for_stable returns
- 60 seconds max that wait_for_verification_code will poll temp-mail.io
- A fixed max_tokens on every model call in the agent loop (value in the source)

No invented benchmarks. Each number is a constant you can find in the source: agent.ts lines 7 (MAX_STEPS), 9 (DEFAULT_ANTHROPIC_MODEL), 16-196 (TOOLS array), 957-958 (wait_for_stable clamps), 124-125 (OTP timeout default), and 1087 for the file length.

Watching it run on a real signup flow

This is the log from a single assrt run against a staging signup. Notice how the tool calls alternate: snapshot, act, wait_for_stable, snapshot again. The OTP paste is a single evaluate call. The webhook check is a single http_request.

assrt run --url ... --plan signup.md

Against what "AI testing" usually means

| Feature | Classical / closed | Assrt |
| --- | --- | --- |
| OTP-gated signup flow | Skip the OTP step; seed a fixture user in your DB | Real temp-mail.io inbox + one-event paste trick; full flow runs |
| Waiting for streaming / AI content | Hardcoded setTimeout or waitForLoadState('networkidle') | MutationObserver watches real DOM activity; adapts per run |
| Element targeting | CSS / XPath selectors that rot when a class name changes | Accessibility-tree refs like [ref=e12] re-read every step |
| Verifying a downstream webhook | Separate test suite in a different framework, or mocked | http_request tool call in the same #Case; no second runner |
| Flaky middle step | Retry the whole test once, then give up | Infinite-step inner loop; re-snapshot, pick a new ref, continue |
| Test plan format | TypeScript .spec.ts files with locator strings | Plain Markdown #Case blocks at /tmp/assrt/scenario.md |
| Source you can read | Closed SaaS backend; maybe a thin open wrapper | 1,087 lines of open TypeScript at agent.ts (assrt-mcp repo) |
| Cost at seat parity | $7,500 / month for closed AI QA platforms | Open source + your own Anthropic or Gemini key |

Primitive surface at a glance

Every chip below is a concrete name in the agent source or the #Case DSL. No marketing terms.

wait_for_stable · create_temp_email · wait_for_verification_code · check_email_inbox · http_request · evaluate · snapshot · accessibility refs · #Case Markdown · assrt_plan · assrt_test · assrt_diagnose · infinite recovery · suggest_improvement · claude-haiku-4-5 · gemini-3.1-pro · @playwright/mcp · open source · self-hosted · no locator rot

Where this leaves a team considering AI in automation testing

The moment you start using AI to drive a browser, the question stops being "how smart is the model" and starts being "what primitives did the framework give it." Ask a vendor how they stabilize for streaming content. Ask them how they clear a real OTP. Ask them whether their agent can call an external API in the same #Case. If those answers are not on-disk TypeScript, you are buying a black box whose failure modes nobody can fix for you.

Assrt's five primitives are on disk, named, and within two hundred lines of each other. The LLM is swappable. Your test plan is Markdown. The point of this guide is not to pitch the product; it is to show you the specific shape of the primitives so you can evaluate any tool in the category, not just this one. If you want the easiest path to actually running them on your stack, talk to us.

Want the five primitives running on your signup flow this week?

30 minutes. We'll plug Assrt into your staging environment, clear an OTP flow end to end, and hand you the #Case file to keep.

Book a call

Frequently asked questions about AI in automation testing

What does 'AI in automation testing' actually change, at the code level, versus plain Playwright?

Five specific primitives get added, all visible in the Assrt agent source. First, wait_for_stable injects a MutationObserver at run time and waits for 2 consecutive seconds of zero DOM mutations or 30 seconds elapsed (agent.ts lines 962-999), which replaces hardcoded setTimeout sleeps that either flake or waste time. Second, create_temp_email + wait_for_verification_code + check_email_inbox give the agent its own inbox from temp-mail.io (agent.ts lines 115-131), so signup flows run end-to-end without fixture accounts. Third, a DataTransfer + ClipboardEvent one-liner hardcoded in the system prompt at agent.ts line 235 pastes a full OTP into N single-character input fields in one DOM event, bypassing the one-character-at-a-time race that breaks React-controlled inputs. Fourth, http_request (agent.ts lines 172-184) lets the agent call external APIs like Telegram's getUpdates to verify a webhook fired after a UI action. Fifth, MAX_STEPS_PER_SCENARIO is set to Infinity at agent.ts line 7, so when a step fails the loop re-snapshots the page and tries again with fresh accessibility refs instead of bailing at the first red mark.

Why is the OTP paste trick baked into the system prompt as a hardcoded JS expression instead of being a separate tool?

Because the reliable version is a single DOM event on a specific parent element with a specific event shape, and leaving the agent to derive it from first principles every time produces a different broken version on each run. The exact expression at agent.ts line 235 builds a DataTransfer, sets text/plain on it, and dispatches a synthetic ClipboardEvent on the PARENT of the first input[maxlength="1"]. That matches how browsers actually paste clipboard content across a chunk of grouped inputs, so React's synthetic event handlers process it once with the whole code. The prompt explicitly says 'Do NOT modify this expression except to replace CODE_HERE' because any deviation (typing into each field, dispatching on the input instead of the parent, using keydown) gets swallowed by React's controlled-input logic and the agent ends up with 2 digits entered, a stuck form, and a retry loop. Hardcoding the one working expression in the prompt is a more robust design choice than giving the model freedom to improvise.

How is wait_for_stable different from Playwright's built-in waitForLoadState('networkidle')?

networkidle fires when there are no network requests for 500ms, which says nothing about whether the page has finished rendering. AI-generated UIs, streaming chat, and React SPAs routinely finish their network activity long before they finish painting. wait_for_stable measures the thing that actually matters for a downstream click: DOM mutations. The implementation at agent.ts lines 962-999 injects a MutationObserver watching document.body with childList, subtree, and characterData, polls window.__assrt_mutations every 500 ms, and breaks out of the loop when stable_seconds (default 2) pass with no change in the counter. Maximum timeout is 30 seconds by default, clamped to 60 (line 957). That primitive was added specifically because AI-assisted tests run against AI-assisted apps, and AI-assisted apps stream content for seconds after the network goes quiet.

Why does Assrt's agent need http_request if it already drives a browser?

Because half of modern app tests are not about what shows up on the page. They are about what the page did to a downstream system. After you click 'Send notification' in a web app, the assertion that matters is whether Telegram actually got a message, whether your Slack incoming webhook got called, whether Stripe recorded the payment. http_request (agent.ts lines 172-184) lets the agent poll api.telegram.org/bot<token>/getUpdates or post to a verification endpoint or fetch a Stripe event, and then make assertions against the HTTP response instead of guessing from the UI. Classical Playwright tests either skipped this (flaky, partial) or mocked it (lies). An AI agent with http_request can verify it end-to-end without leaving the one tool surface it already knows.

What stops an AI agent from getting stuck in an infinite retry loop when MAX_STEPS is set to Infinity?

Three structural checks. First, the agent must call complete_scenario (tool at agent.ts lines 146-156) to end a scenario, and every call sets passed=true or passed=false with a summary, so 'infinite' means the scenario stays open until the agent explicitly decides it is done. Second, every assert (lines 133-144) produces evidence and pass/fail, so the agent accumulates a record that biases it toward closing out once it has a decisive result. Third, when a step fails the recovery rule in the system prompt (lines 220-226) says: snapshot, look at the changed page, try a different ref, scroll and retry, and if still stuck after 3 attempts, call complete_scenario with passed=false. The Infinity cap is there so a flaky middle step doesn't kill a long integration run; it is not carte blanche to loop forever.

Is the disposable inbox actually real, or is it a simulated LLM response?

It is real email. create_temp_email calls out to temp-mail.io and returns a working inbox address (implementation in /Users/matthewdi/assrt-mcp/src/core/email.ts). wait_for_verification_code polls that inbox for up to 60 seconds, parses the OTP out of the newest matching message, and returns it as a string the agent can feed into its paste expression. If your app sends real SMTP mail, Assrt's agent sees the real mail. This is the primitive that turns 'test the signup flow' from 'seed a fixture user and skip verification' into 'create a fresh user on prod-like infra, clear the email-verify step, and continue into the logged-in experience'.

Which model is actually running the agent loop, and can I swap it out?

claude-haiku-4-5-20251001 by default (agent.ts line 9). Haiku is the right choice for a tool-use loop that is heavy on fast decisions and light on long-form generation. The whole run() path is provider-agnostic: both Anthropic (with either an API key or a Claude Code OAuth token) and Gemini 3.1 Pro work through the same GEMINI_FUNCTION_DECLARATIONS bridge at agent.ts lines 277-301. You can set ANTHROPIC_MODEL or GEMINI_MODEL env vars, or point ANTHROPIC_BASE_URL at a local proxy to keep all data on your network. Nothing in the 18 tool definitions is provider-specific.

Does my test file end up proprietary once I use Assrt?

No. The test file is Markdown. Each test is a #Case N block with English steps, stored at /tmp/assrt/scenario.md and optionally synced to the app.assrt.ai dashboard if you want sharing. The scenario layout is defined at /Users/matthewdi/assrt-mcp/src/core/scenario-files.ts: scenario.md is your plan, scenario.json is metadata, results/latest.json is the last run. You can grep, diff, and commit the plan alongside your Playwright project. If you ever leave Assrt, your authoritative test artifacts are already plain text on your disk.

How do I actually run this against my own app today?

Two paths. Inside Claude Code or another MCP client: add the assrt-mcp server (one line in your MCP config), then prompt 'Use assrt_test on http://localhost:3000' with a plan, and the three tools assrt_plan, assrt_test, and assrt_diagnose become available. From a plain terminal: npx @assrt-ai/assrt setup, then `assrt run --url http://localhost:3000 --plan '#Case 1: Signup flow completes'`. Add --video to record and auto-open a player, --extension to attach to your real Chrome so you keep your saved sessions, --isolated if you want an in-memory profile. Every flag is documented in /Users/matthewdi/assrt-mcp/README.md.
