Every engineering team has the same conversation eventually.
A test fails on the build. Someone investigates. Forty minutes later they figure out the product is fine. A class name changed. The selector did not match. They update the locator, push a fix, the build goes green.
That story repeats itself dozens of times a week across most teams. And nobody is tracking what it actually costs.
This post covers three things: the hidden tax that brittle tests impose on engineering teams, why most AI testing tools made it worse instead of better, and what an honest test suite actually looks like.
At the end there is a calculator that estimates how much your current test suite is costing you in noise. Most teams are surprised by the number.
The Two Kinds of Test Failures
When a test fails, only one of two things is true.
The product broke. The test caught a real regression. This is the failure you want. It is the entire reason you have a test suite.
Or, the test broke. The product works fine. A class name changed, a div got wrapped in another div, a button moved three pixels to the right. The test is failing because of how it was written, not because something is wrong.
That second category is noise. Expensive, trust-destroying noise.
Every minute spent investigating a noise failure is a minute that could have been spent building the product. Every false alarm is a small chip taken out of your team's confidence in the test suite. Eventually, teams stop trusting their tests. Builds get merged with red CI. Releases ship without verification. The test suite quietly dies as a useful signal.
Why Self-Healing Is a Patch, Not a Fix
The industry knows this is a problem. The standard response has been self-healing tests.
Tools like mabl, Testim, and Octomind built ML layers on top of their selector-based engines. When a locator breaks, the ML tries to find the element using nearby attributes, parent structure, or semantic similarity. Sometimes it works. Sometimes it does not.
It is a clever patch. But it is still a patch.
The fragile foundation is still there. The test still depends on selectors. The ML layer just makes the dependency slightly less brittle. You are adding intelligence on top of fragility instead of replacing the fragility itself.
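To make the fallback step concrete, here is a minimal sketch of a self-healing lookup. It is not how mabl, Testim, or Octomind implement healing; those tools use trained models. Here a plain string-similarity score from Python's standard library stands in for the ML layer, and the attribute names (tag, cls, text) and the 0.6 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Attributes compared when trying to re-find an element (an assumption for this sketch).
KEYS = ("tag", "cls", "text")


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def score(last_known: dict, candidate: dict) -> float:
    """Average attribute similarity between the remembered element and a candidate."""
    return sum(similarity(last_known[k], candidate[k]) for k in KEYS) / len(KEYS)


def heal_locator(last_known: dict, candidates: list, threshold: float = 0.6):
    """Return the best-matching candidate, or None if nothing is similar enough."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: score(last_known, c))
    return best if score(last_known, best) >= threshold else None


# What the test remembered about the element before the locator broke.
remembered = {"tag": "button", "cls": "btn-primary-v2", "text": "Submit"}

# The class was renamed, but tag and label survived: the heuristic recovers the button.
renamed = {"tag": "button", "cls": "btn-cta", "text": "Submit"}
other = {"tag": "a", "cls": "nav-link", "text": "Help"}
healed = heal_locator(remembered, [other, renamed])

# Nothing on the page resembles the old element: the heuristic gives up.
gave_up = heal_locator(remembered, [other])
```

The threshold is the weak point. Set it too high and the lookup gives up on harmless changes. Set it too low and it can settle on the wrong element. Either way the test still starts from a remembered selector.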
A genuine fix requires asking a different question. Not "how do we make selectors less brittle?" but "why are tests coupled to selectors at all?"
What a Human Tester Actually Does
A QA engineer does not look at the DOM. They look at the screen.
They see a button that says "Submit." They click it. They check whether the right thing happened.
They do not know or care that the button has the class btn-primary-v2 or that it lives inside div.checkout-wrapper > form > section:nth-child(3). If the designer renames the class tomorrow, the human still finds the button. They still click it. The test still runs.
That is the bar. A test should be sensitive to what the user experiences, and blind to how it was implemented. Selector-based testing gets this exactly backwards.
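The difference can be shown in a few lines. The sketch below is an illustration of the coupling problem only, not of how Text2Test works: it parses HTML with Python's standard library and uses the button's visible label as a rough stand-in for what a person sees on screen. The markup and the helper names (index_buttons, find_by_class, find_by_label) are hypothetical.

```python
from html.parser import HTMLParser


class ButtonIndex(HTMLParser):
    """Collects (class attribute, visible text) for every <button> in a page."""

    def __init__(self):
        super().__init__()
        self.buttons = []
        self._cls = None   # class of the button currently being read, else None
        self._text = []    # text chunks seen inside that button

    def handle_starttag(self, tag, attrs):
        if tag == "button":
            self._cls = dict(attrs).get("class") or ""
            self._text = []

    def handle_data(self, data):
        if self._cls is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "button" and self._cls is not None:
            self.buttons.append((self._cls, "".join(self._text).strip()))
            self._cls = None


def index_buttons(html: str) -> list:
    parser = ButtonIndex()
    parser.feed(html)
    parser.close()
    return parser.buttons


def find_by_class(html: str, cls: str) -> list:
    """Implementation-coupled lookup: depends on a class name."""
    return [b for b in index_buttons(html) if cls in b[0].split()]


def find_by_label(html: str, label: str) -> list:
    """Experience-coupled lookup: depends on what the button says."""
    return [b for b in index_buttons(html) if b[1] == label]


# Same product behavior before and after a refactor; only the markup changed.
BEFORE = '<form><button class="btn-primary-v2">Submit</button></form>'
AFTER = '<form><div><button class="btn-cta">Submit</button></div></form>'

class_before = find_by_class(BEFORE, "btn-primary-v2")  # finds the button
class_after = find_by_class(AFTER, "btn-primary-v2")    # finds nothing: a noise failure
label_before = find_by_label(BEFORE, "Submit")          # finds the button
label_after = find_by_label(AFTER, "Submit")            # still finds the button
```

The class-based lookup fails after the refactor even though nothing the user sees has changed. The label-based lookup only fails if the Submit button actually disappears.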
How Specialized AI Models Change the Contract
At Text2Test we have developed specialized AI models for software testing. They read the screen the way a human tester does, visually and semantically, by what things look like and what they say, rather than by parsing HTML.
When you write a test in plain text, such as "Go to the checkout page, fill in the card details, click Submit, verify the confirmation screen appears", Text2Test executes that instruction by looking at what is on screen, not by querying the DOM.
The practical consequence: the test does not break when a class name changes, a component gets refactored, or a designer moves a button. It breaks when the product breaks. Submit stops working. The confirmation screen does not appear. The checkout flow is genuinely broken.
That is what determinism in testing means. A test that fails when, and only when, the product fails.
The Dishonest Test Suite
Most test suites today are dishonest. They tell you something is broken when it is not.
The honesty of a test suite can be measured: it is the percentage of failures that represent real product regressions, as opposed to noise from selector drift, timing issues, and infrastructure flakiness.
For most teams that number is well below 50%. Some teams are operating with test suites where 80% of failures are noise.
That is a tax on every engineering hour. And it compounds. The more tests you add, the more noise you generate. Eventually the suite becomes net-negative. It slows you down more than it protects you.
Calculate Your Test Suite Honesty Score
We built a calculator that quantifies this. Plug in five numbers about your current test suite and it outputs three things: your Honesty Score, hours per week wasted on noise failures, and the annualized cost in engineering time.
Default values are grounded in industry research. Google engineering data shows that 84% of observed pass-to-fail transitions in CI involved a flaky test rather than a real regression. Tricentis reports that 30 to 50% of QA time goes to test maintenance. Industry benchmarks place noise investigation at 20 to 30% of a QA engineer's work week. Sources: Google Engineering Blog (Micco, 2016), Tricentis State of Testing, LambdaTest Future of QA Survey 2024.
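For readers who want to run the arithmetic themselves, here is a rough sketch of the calculation. This post does not list the calculator's five inputs or its exact formulas, so everything below is an assumption: the input names, the default of 48 working weeks, and the simple model in which every noise failure costs one investigation. The example figures are illustrative, not measured data.

```python
from dataclasses import dataclass


@dataclass
class SuiteInputs:
    """Five hypothetical inputs; the real calculator's inputs may differ."""
    failures_per_week: int            # test failures the team looks at each week
    noise_fraction: float             # share of those failures that are not real regressions
    minutes_per_investigation: float  # time spent per noise failure
    hourly_cost: float                # loaded engineering cost per hour
    weeks_per_year: int = 48          # working weeks, an assumption


def honesty_report(inp: SuiteInputs) -> dict:
    """Return honesty score (%), weekly hours lost to noise, and annualized cost."""
    if not 0.0 <= inp.noise_fraction <= 1.0:
        raise ValueError("noise_fraction must be between 0 and 1")

    # Honesty score: percentage of failures that are real product regressions.
    honesty_score = (1.0 - inp.noise_fraction) * 100.0

    # Every noise failure is assumed to trigger one investigation.
    noise_failures = inp.failures_per_week * inp.noise_fraction
    hours_per_week = noise_failures * inp.minutes_per_investigation / 60.0
    annual_cost = hours_per_week * inp.hourly_cost * inp.weeks_per_year

    return {
        "honesty_score": round(honesty_score, 1),
        "noise_hours_per_week": round(hours_per_week, 1),
        "annual_cost": round(annual_cost, 2),
    }


# Illustrative team: 50 failures a week, 80% noise, 40 minutes each, 75 per hour.
example = honesty_report(SuiteInputs(50, 0.8, 40, 75.0))
```

The 40-minute figure echoes the investigation described at the top of this post, and the 80% noise share matches the worst case mentioned above. Swap in your own numbers to see how quickly the weekly hours add up.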
Once a test suite is honest, three things start happening.
Engineering trust returns. Red builds get taken seriously again because they actually mean something. The team stops merging through failures.
Coverage expands. Writing tests stops feeling like a tax because the tests stay stable. Teams add more, not fewer.
Releases get faster, not slower. The test suite becomes a green-light system instead of a blocker, because passing tests are evidence the product works rather than evidence the selectors did not drift.
A test suite built on specialized AI for software testing changes the fundamental contract of what a test is. A test becomes a behavioral specification, written in plain language, executed against what the user actually sees. The selector is not the test. The behavior is the test.
That is what we are building toward at Text2Test. A test suite that is honest with you.
The bar for AI testing is not generating tests faster. It is generating tests you can actually trust.
