Test Reliability

Flaky Tests: Why They Happen and How to Actually Kill Them

A flaky test is one that passes and fails on the same code and the same inputs. It is not evidence of a broken feature; it is evidence of a nondeterministic test. Treating the two the same is the first mistake most teams make.

Flaky tests are the second worst thing in a test suite. The worst is tests that always pass because they assert nothing. Flaky tests are worse than bugs because they train engineers to ignore red builds. Once the team learns that red sometimes means nothing, red always means nothing, and the suite becomes decorative.

The mental model for eliminating flakiness is narrow: every flaky test has a hidden input. Find the hidden input, declare it, pin it, and the flakiness is gone. There are only eight places that hidden input can live. Here they are.

Before classifying, run the test in isolation five times, then in the full suite five times, then shuffle the order and run five more. The pattern of where it passes and fails tells you which of the eight classes you are dealing with before you read a single stack trace. A test that fails 1 in 20 in isolation is almost always an async race or unseeded randomness. A test that passes alone and fails in the suite is global state or order dependency. A test that passes locally and fails on CI is time, network, or environment.

Taxonomy of flakiness

1. Async race: timing
2. Time-dependent: clock
3. Network: external I/O
4. Filesystem: shared paths
5. Randomness: unseeded RNG
6. Global state: bleed between tests
7. Order-dependent: A needs B
8. Environment: CPU, viewport

Async races are most of it

The first class covers roughly 60 percent of flakes in browser and component tests. The test reads state before the async work that produces that state has finished. On a fast machine the work happens to complete first and the test passes. On a slow CI worker it does not, and the test fails.

The tell: the assertion message shows the expected value and an empty, null, or stale actual value. The failure is never halfway; the state simply was not there yet.

The fix: never use a fixed setTimeout to wait. Wait on the thing you actually care about. In Testing Library use findBy* or waitFor. In Playwright use expect.poll or web-first assertions (await expect(locator).toHaveText), which retry until the condition holds or a timeout elapses.

A common sub-case is the pre-unmount state update. A React component fires an async request, unmounts before the response lands, and the test asserts on state that was torn down. The symptom is an intermittent act() warning in the test output, often ignored, followed by a flake a day later. Fix it by aborting the request in the cleanup effect and asserting that the request was canceled, not that the response was ignored.
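The cancel-on-cleanup pattern can be sketched without React at all. This is a minimal, framework-free illustration of the same idea: the "unmount" aborts the in-flight request via an AbortController, and the test asserts cancellation. `fetchUser` and `mountThenUnmount` are illustrative names, not real component or library APIs.

```typescript
// Sketch of the pre-unmount fix: cancel the in-flight request on cleanup and
// assert that it was canceled. fetchUser stands in for a real network call.

// An abortable async operation, standing in for a fetch call.
function fetchUser(signal: AbortSignal): Promise<string> {
  return new Promise((resolve, reject) => {
    const t = setTimeout(() => resolve("alice"), 100); // response "arrives" late
    signal.addEventListener("abort", () => {
      clearTimeout(t);
      reject(new DOMException("canceled", "AbortError")); // surface cancellation
    });
  });
}

// Simulates mount, then unmount before the response lands.
async function mountThenUnmount(): Promise<string> {
  const controller = new AbortController();
  const pending = fetchUser(controller.signal);
  controller.abort(); // the "cleanup effect": unmount cancels the request
  try {
    await pending;
    return "resolved"; // would mean state was updated after unmount
  } catch (e) {
    return e instanceof DOMException && e.name === "AbortError" ? "canceled" : "error";
  }
}
```

The assertion target matters: the test should observe the AbortError, proving the request was canceled rather than silently raced.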

checkout.spec.ts
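To make the waiting pattern concrete, here is a framework-free sketch of the retry-until-condition loop that findBy* and expect.poll implement internally. All names here (`waitFor`, `loadCartTotal`, the cart total value) are illustrative, not Playwright or Testing Library APIs.

```typescript
// Poll a condition until it yields a value or the timeout elapses.
async function waitFor<T>(
  probe: () => T | undefined,
  { timeoutMs = 2000, intervalMs = 20 } = {},
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const value = probe();
    if (value !== undefined) return value; // condition holds: stop waiting
    if (Date.now() > deadline) throw new Error(`timed out after ${timeoutMs}ms`);
    await new Promise((r) => setTimeout(r, intervalMs));
  }
}

// Simulated async work: the "cart total" appears some time after the call,
// like a network response landing after render.
let cartTotal: string | undefined;
function loadCartTotal(): void {
  setTimeout(() => { cartTotal = "$42.00"; }, 50);
}

async function demo(): Promise<string> {
  loadCartTotal();
  // Wrong: a fixed sleep races the work. Right: wait on the value itself.
  return waitFor(() => cartTotal);
}
```

The loop retries until the state exists, so the test is insensitive to whether the worker is fast or starved; only a genuine failure to produce the value trips the timeout.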

The four classes that come from undeclared inputs

Time, network, filesystem, and randomness are flaky for the same reason. The test reads a value the test did not set. Fix them by setting the value.

2. Time-dependent tests (Straightforward)

Tell: the test passes locally, fails in CI at certain hours, or fails after a DST transition. Assertions touch Date.now() or new Date(), or rely on the CI container's timezone.

Fix: freeze the clock. In Vitest or Jest use vi.useFakeTimers() and vi.setSystemTime(new Date('2026-01-15T12:00:00Z')). Set TZ=UTC in your CI workflow. Treat timezone as a declared input.
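The same principle works without a framework: make the clock an injected input instead of reading Date.now() directly. vi.setSystemTime does this job inside Vitest; the sketch below shows the idea in plain TypeScript, with `isExpired` as an illustrative piece of code under test.

```typescript
// A clock is just a function that returns the current time.
type Clock = () => Date;

// Code under test takes the clock as a parameter rather than reaching for
// Date.now(), so time becomes a declared input.
function isExpired(expiresAt: Date, now: Clock): boolean {
  return now().getTime() > expiresAt.getTime();
}

// In tests, pin the clock to a fixed instant so the result never depends on
// when, or in which timezone, the test runs.
const frozen: Clock = () => new Date("2026-01-15T12:00:00Z");
```

Production code passes `() => new Date()`; tests pass `frozen`. Either way, the hidden input is now visible in the signature.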

3. Network-dependent tests (Moderate)

Tell: intermittent timeouts, 503s, or DNS errors with no change to application code. The test reaches a real host: a CDN, a third party API, a staging environment that restarts during deploys.

Fix: mock the boundary. Use msw for HTTP at the fetch layer, or Playwright's page.route() for end-to-end tests. If you genuinely need to hit a real service, that is an integration test: tag it, and run it on its own schedule with a retry budget that does not gate PRs.
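"Mock the boundary" reduces to a dependency-injection move: the code under test talks to the network only through a fetch-shaped parameter, and the test substitutes a deterministic one. msw does this at the real fetch layer; `getPrice` and `fakeFetch` below are illustrative names in a minimal sketch.

```typescript
// The boundary: anything with this shape can stand in for the network.
type FetchLike = (url: string) => Promise<{ status: number; body: string }>;

// Code under test: reaches the network only through the injected boundary.
async function getPrice(fetchImpl: FetchLike): Promise<string> {
  const res = await fetchImpl("https://api.example.com/price");
  if (res.status !== 200) throw new Error(`upstream error ${res.status}`);
  return res.body;
}

// Test double: no DNS, no 503s, no timeouts, the same answer every run.
const fakeFetch: FetchLike = async () => ({ status: 200, body: "19.99" });
```

The double also makes error paths cheap to test: a `fakeFetch` returning 503 exercises the failure branch deterministically, which a real flaky upstream never can.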

4. Filesystem collisions (Moderate)

Tell: tests pass in serial, fail with EEXIST or ENOENT when sharded. Two workers are writing the same path under /tmp.

Fix: derive every path from a per-test unique id. Use fs.mkdtemp or os.tmpdir() + crypto.randomUUID() per test. Clean up in an afterEach so a crash mid-test does not poison the next run.
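A runnable sketch of the per-test temp directory pattern, using only Node built-ins. `withTestDir` is an illustrative helper name; in a real suite the cleanup would live in an afterEach.

```typescript
import { mkdtempSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Each call returns a fresh, uniquely named directory under the OS temp root,
// so two sharded workers can never collide on the same path.
function makeTestDir(): string {
  return mkdtempSync(join(tmpdir(), "suite-")); // e.g. /tmp/suite-Ab3xQ9
}

// Run a test body against its own directory, then remove it even on failure,
// so a crash mid-test cannot poison the next run.
function withTestDir<T>(fn: (dir: string) => T): T {
  const dir = makeTestDir();
  try {
    return fn(dir);
  } finally {
    rmSync(dir, { recursive: true, force: true });
  }
}
```

The `finally` block is the important part: cleanup that only runs on success is exactly how one failing test poisons the ten after it.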

5. Unseeded randomness (Straightforward)

Tell: a property-based or fuzz test fails about once in 50 runs with a value you cannot reproduce locally. Generators use Math.random, crypto.randomUUID, or Faker without a seed.

Fix: seed every generator at the top of the test file. faker.seed(42), fc.assert(prop, { seed: 1337 }). When a fuzzer does find a real bug, log the seed so you can replay it. An unreproducible failure is not actionable.
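Seeding in miniature: a tiny deterministic PRNG (mulberry32, a well-known 32-bit generator) standing in for faker.seed or fast-check's seed option. The point the sketch makes is the contract: same seed, same sequence, every run.

```typescript
// mulberry32: a small, fast, seedable PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    let t = (a += 0x6d2b79f5);
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Generate n pseudo-random test values from a given seed. Two calls with the
// same seed produce identical data, so a failure is always replayable.
function sample(seed: number, n: number): number[] {
  const rng = mulberry32(seed);
  return Array.from({ length: n }, () => Math.floor(rng() * 100));
}
```

When a failure does occur, the seed printed in the log is all you need to replay the exact input locally, which is what makes the failure actionable.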


The state problems: global bleed and order dependency

Classes 6 and 7 look similar in the logs. The test fails only when run with others, or only when the suite runs in a specific order. The root cause is state that outlives a single test: module-level caches, singleton instances, mutated environment variables, a database row that a previous test left behind.

The tell for global state bleed: the test passes in isolation (vitest run path/to/test.ts) but fails in the full suite. The failure depends on which other tests ran first, not on the code under test.

The tell for order dependency: shuffle the test order (most runners support --sequence.shuffle) and the failure moves. Test B passes after A runs, fails when B runs first.

The fix is the same for both: make each test own its setup and teardown. Do not rely on state built up across tests. Reset the module registry, the DI container, the database, and the environment in beforeEach.

test-setup.ts
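A minimal sketch of what such a setup file resets, in plain TypeScript. The cache and environment-variable names are illustrative; the pattern is that every piece of shared state has a known baseline that a beforeEach restores.

```typescript
// Module-level state that would otherwise bleed between tests.
const cache = new Map<string, string>();

// Snapshot the environment at load time so tests can mutate it freely.
const savedEnv = { ...process.env };

// Called from beforeEach: rebuild all shared state from scratch.
function resetTestState(): void {
  cache.clear();                 // wipe module-level caches
  process.env = { ...savedEnv }; // restore env vars a previous test mutated
}
```

The same shape extends to a DI container reset, a database truncate, or a module-registry reset (vi.resetModules): snapshot the baseline once, restore it before every test.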

Class 8: the environment itself

The eighth class is the hardest to diagnose because the test, the code, and the inputs are all fine. The test fails because the CI worker has one CPU and is running at 100 percent, because the headless browser is animating a transition the test does not wait for, or because the viewport is 1024 wide on CI and 1440 locally.

The tell: the test passes on a developer laptop every time and fails on CI one run in five. The failure pattern correlates with CI load rather than code changes.

The fix: disable animations in the test environment (Playwright has page.emulateMedia({ reducedMotion: 'reduce' })), pin the viewport in the config, and give browser tests enough CPU. A 2 vCPU runner that hits 100 percent utilization is the most common cause of flakiness that looks like race conditions but is actually starvation.
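Those three fixes can be pinned in configuration so they apply to every test. A hypothetical Playwright config fragment, with illustrative values rather than requirements:

```typescript
// playwright.config.ts (sketch): pin the rendering environment so CI and
// laptops agree on viewport, motion, and parallelism.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    viewport: { width: 1280, height: 720 },      // same size everywhere
    contextOptions: { reducedMotion: 'reduce' }, // suppress CSS animations
  },
  workers: process.env.CI ? 2 : undefined, // don't oversubscribe a small runner
});
```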

Two more environment-level culprits deserve named attention. Persistent browser storage (service workers, IndexedDB, localStorage) carries state between sessions, so a test that passed on the first run fails on the second run in the same worker. Clear storage in a beforeEach or use Playwright's storageState to start from a known snapshot. Stale DOM references are the other: clicking a node, waiting for a re-render, then calling a method on the old handle. Always re-query the locator after any action that might re-render. Playwright locators are lazy by design, exactly to dodge this.

Three responses that are worse than the flakiness

Once a team learns a test is flaky, three bad responses are common. Each looks like a fix and each makes the problem worse.

Anti-pattern 1: Retry-on-failure in CI

Configuring retries: 2 so that a flaky test shows green if any of three attempts passes. This does not fix the test; it hides it. Every real bug that fails only intermittently is now also hidden. The team loses visibility into a whole class of production defects.

Anti-pattern 2: Skip or .only the test away

Marking the test .skip with a TODO comment. The TODO is never revisited. Six months later the feature the test covered is broken in production and no one notices because the test has not run in six months. A skipped test without a deadline is a deleted test.

Anti-pattern 3: Raise timeouts until the failure rate stabilizes

Raising testTimeout from 5s to 30s until the flake stops. You now have a real bug where a page takes 20s to render and no one will ever see it, because the test has been given permission to wait that long.

The correct response: quarantine, then fix or delete

The workflow that actually reduces flakiness over time has three phases. First, isolate the flaky test from the main signal so it does not train the team to ignore red builds. Second, fix it within a bounded window. Third, delete it if no one fixes it in that window, because a test no one owns is not worth the CI minutes.

vitest.config.ts
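A sketch of the defaults such a config would pin down, with illustrative values:

```typescript
// vitest.config.ts (sketch): zero retries, shuffled order, tight timeout.
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    retry: 0,           // a flaky test must fail loudly, never be retried green
    sequence: {
      shuffle: true,    // surface order dependencies on every run
      seed: 1337,       // ...but keep the shuffle reproducible
    },
    testTimeout: 5_000, // tight enough that a 20s render counts as a failure
  },
});
```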

Tag flaky tests with a filename suffix (.flaky.test.ts) or a test-runner tag. Run them in a separate CI job that reports results but does not gate merges. The main suite stays clean and honest.

.github/workflows/flaky-quarantine.yml
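A sketch of such a quarantine job. The cron schedule, file paths, and the dedicated config file are illustrative assumptions, not a prescribed setup:

```yaml
# Quarantine job (sketch): runs the .flaky.test.ts files on a schedule,
# reports results, and never gates merges.
name: flaky-quarantine
on:
  schedule:
    - cron: "0 6 * * *"   # nightly, off the PR critical path
  workflow_dispatch: {}

jobs:
  quarantined:
    runs-on: ubuntu-latest
    continue-on-error: true   # a red quarantine run never blocks anyone
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      # Hypothetical config whose include pattern is **/*.flaky.test.ts
      - run: npx vitest run --config vitest.flaky.config.ts
```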

Every quarantined test carries an owner and a deadline. Fourteen days is a reasonable default. If a test is still flaky after 14 days, it gets deleted. Not disabled. Deleted. The code it covered either gets a new test that is not flaky, or loses coverage and the team accepts that tradeoff consciously.

This sounds harsh until you track the data. Teams that quarantine aggressively and delete on a 14-day timer end up with fewer flaky tests and more real coverage than teams that retry-on-failure forever, because the quarantine pressure makes engineers actually fix the root cause instead of hiding it.

The shape of a flake-free suite

A suite that stays green honestly has four properties: zero retries, shuffled order, pinned clock and seed, and mocked external I/O. Every test declares its inputs. Every test owns its setup and cleanup. Nothing leaks across tests. Red means broken and green means working, every time.

Getting to that state from a suite that is already 15 percent flaky takes weeks, not hours. Work through the classes in order of prevalence: async races first, then state bleed, then time and network, then the long tail. Track the flake rate as a first-class metric alongside coverage. When the rate trends to zero, the whole team learns to trust the signal again, and the testing culture comes back.

Treat the flake rate as a product metric. Post a weekly number in the team channel. When the rate is above 1 percent, fixing flakes takes priority over new feature work. When the rate is below 0.1 percent, teams stop talking about testing infrastructure and start talking about the product again, which is the actual goal.

If you are starting a new suite from scratch, or auto-generating end-to-end tests with a tool like Assrt, these defaults are cheap to bake in up front: seeded data, mocked network, web-first assertions, zero retries. It is a lot easier to stay flake-free than to get back to it.