Blog · May 2026 · 7 min read

The LLM Only Sees What You Paste

When a QA engineer pastes a fragment of a user story into an LLM, the model fills in the rest with guesswork. Here is why AI-generated test cases look thorough and still miss what matters.

When a QA engineer pastes a fragment of a user story into an LLM and asks for test cases, the model generates output based only on what is in the prompt. Every piece of context that lives outside that prompt, in a Slack thread, a design comment, a verbal decision from standup, or institutional memory, gets filled in with statistical guesswork. The output looks thorough. The gaps are invisible.

This is the core problem with LLM-only test case generation: the model cannot test what it was not told.

Why LLM-Generated Test Cases Look Complete But Are Not

There is a moment every QA engineer recognizes.

You paste a user story into an LLM. You ask it to generate test cases. What comes back looks impressive. Structured. Comprehensive. It covers the happy path, a few edge cases, some negative scenarios. You feel like you made progress.

Then you ship. And something breaks that was never in the test cases.

Not because the model was wrong about what you pasted. Because you only pasted part of the picture.

The model does not know what the designer added in a comment three weeks ago. It does not know what the product manager said in the standup. It does not know this feature has a known edge case from the last sprint that the team agreed to handle differently. Everything outside the prompt gets filled in with probability. What comes out is plausible. It reads like real test cases. But it is built on guesswork about the parts you did not paste.

You are the one filling the gaps from memory. The model just makes that invisible.

The Oracle Problem in AI Test Generation

This has a name in software testing research: the oracle problem.

A test oracle is the mechanism that decides whether the behavior a test observes is correct, and therefore whether the test should pass or fail. When you generate tests from code, the oracle is the code itself. If the code has a bug, the generated tests encode the buggy behavior as the expected result and pass confidently. This is exactly the scenario we covered in "We Asked 50 QA Professionals. 61% Gave the Same Answer": when we asked whether AI should write tests from requirements, the majority said yes, precisely because requirement-based generation avoids this trap.
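To make the oracle problem concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `apply_discount` function, the assumed requirement that discounts are capped at 50 percent, and the two oracle helpers are illustrations, not code from any real product. It shows a test whose expected value is derived from the code passing against a bug, while a test whose expected value is derived from the requirement fails.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical implementation under test.

    Assumed requirement: discounts are capped at 50 percent.
    Bug: the cap is never applied.
    """
    return price_cents - (price_cents * percent) // 100


def expected_from_code(price_cents: int, percent: int) -> int:
    """Oracle derived from the code: whatever the code returns today."""
    return apply_discount(price_cents, percent)


def expected_from_requirement(price_cents: int, percent: int) -> int:
    """Oracle derived from the requirement, written without reading the code."""
    capped = min(percent, 50)
    return price_cents - (price_cents * capped) // 100


def run_test(oracle, price_cents: int, percent: int) -> bool:
    """Return True when the implementation agrees with the given oracle."""
    return apply_discount(price_cents, percent) == oracle(price_cents, percent)


# An 80 percent discount request exercises the missing cap.
code_based_passes = run_test(expected_from_code, 10_000, 80)
requirement_based_passes = run_test(expected_from_requirement, 10_000, 80)

print("code-derived test passes:", code_based_passes)
print("requirement-derived test passes:", requirement_based_passes)
```

The code-derived test can never fail, because it compares the implementation with itself. Only the requirement-derived test carries independent information about what the product is supposed to do.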

Google's testing team reported that about 84% of the pass-to-fail transitions it observed in CI involved a flaky test rather than a real product regression (Micco, 2016). Some of that noise plausibly stems from tests that were never correctly specified to begin with.

A test case that looks thorough is more dangerous than one that obviously looks incomplete. When a test case looks complete, you stop questioning it. You trust the coverage. You ship.

This connects directly to what we wrote about in "Volume is Not Coverage. Speed is Not Strategy." Generating 500 test cases from a pasted fragment is not coverage. It is a large volume of plausible guesses. The generated tests cover the checkout flow when you paste the checkout spec. But the edge case where a user with an expired card tries to apply a discount code at the same time? That lives in a comment thread from four months ago that you were not thinking about when you opened the chat window.

Why Gap-Filling Happens Silently

What makes this hard to catch is that you do not know what the model assumed.

The gaps get filled silently. The output looks complete. You read it and your brain, knowing the full context, reads the gaps as covered even when they are not.

This is a documented decision-making shortcut called satisficing. When something looks good enough, we stop looking for what is wrong with it. AI-generated test cases trigger satisficing precisely because they look so structured and thorough.

The result is a test suite that gives you confidence you have not earned. And as we covered in "The Hidden Cost of Dishonest Tests," that misplaced confidence has a measurable cost. Industry research shows most teams are spending between $80k and $200k per year investigating failures that should never have passed in the first place.

How to Fix It: Start From Your Source of Truth

The instinct is to improve the prompt. Add more context. Paste more of the story. Be more specific about edge cases. That helps at the margins. But it does not solve the structural problem: you cannot prompt your way to test cases that capture context you have not written down.

The actual fix is generating tests from the artifacts your team already works from. Jira tickets written to be complete. Figma designs that capture the intended behavior. API documentation that specifies what the system is supposed to do. Requirements that were created deliberately, not reconstructed from memory.

As we explained in "AI Test Case Generation from Source of Truth: How It Works and Why It Matters," when test cases are generated from a connected source of truth rather than from a pasted fragment, the guesswork disappears. The model is reading the same thing your engineers read when they built the feature. That is the difference between tests that confirm implementation and tests that validate correctness.
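One way to picture the difference between a pasted fragment and a connected source of truth is a small pre-generation check. The Python sketch below is an assumption for illustration only, not a description of how Text2Test or any other tool works: the `Artifact` type, the `REQUIRED_KINDS` list, and the example references are all hypothetical. It refuses to build a generation context until every expected kind of artifact is present, so missing context becomes visible instead of being silently guessed.

```python
from dataclasses import dataclass

# Hypothetical artifact kinds, mirroring the sources named in this section.
REQUIRED_KINDS = ("ticket", "design", "api_doc")


@dataclass(frozen=True)
class Artifact:
    kind: str  # one of REQUIRED_KINDS
    ref: str   # hypothetical identifier, e.g. a ticket key
    text: str  # the written content of the artifact


def missing_kinds(artifacts: list[Artifact]) -> list[str]:
    """List the required kinds that have no non-empty artifact."""
    present = {a.kind for a in artifacts if a.text.strip()}
    return [kind for kind in REQUIRED_KINDS if kind not in present]


def build_context(artifacts: list[Artifact]) -> str:
    """Join all artifacts into one labeled context, or fail loudly."""
    missing = missing_kinds(artifacts)
    if missing:
        raise ValueError("missing sources: " + ", ".join(missing))
    return "\n\n".join(
        f"[{a.kind}: {a.ref}]\n{a.text.strip()}" for a in artifacts
    )


# A pasted fragment: only the ticket text is available.
fragment = [Artifact("ticket", "SHOP-1", "User can apply a discount code.")]

# A connected set: ticket, design note, and API description together.
connected = fragment + [
    Artifact("design", "checkout-v2", "Expired cards show an inline error."),
    Artifact("api_doc", "POST /discounts", "Returns 409 if the card is expired."),
]

print(missing_kinds(fragment))
print(build_context(connected))
```

With the fragment alone, the check reports the design and API documentation as missing before any test case is generated. That is exactly the context a model would otherwise have to guess.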

Key Takeaways

  • LLMs generating test cases from prompts can only test what is in the prompt. Everything else is guesswork.
  • Plausible-looking test cases are more dangerous than obviously incomplete ones because they trigger satisficing and suppress critical review.
  • The oracle problem means tests generated from code inherit that code's bugs. They confirm implementation rather than validate correctness.
  • The fix is not a better prompt. It is generating tests from a connected, complete source of truth: Jira, Figma, API docs, or plain text requirements.
  • When the test and the code are independent of each other, tests can actually catch bugs. When they share the same assumptions, they share the same blind spots.

Frequently Asked Questions

Can LLMs generate good test cases?

Yes, when given complete and accurate input. The quality of LLM-generated test cases depends directly on the completeness of the source material. A fragment of a user story produces incomplete tests. A complete, connected requirement gives the model what it needs to produce accurate ones.

Why do AI-generated tests pass even when there is a bug?

Because the tests were generated from the same source as the bug. If you generate tests from code that has a bug, the tests describe what the buggy code does and pass against it. This is the oracle problem: the test and the implementation share the same incorrect assumption.

What is the best source of truth for test case generation?

Design files (Figma), requirement tickets (Jira), API documentation, and plain text behavioral descriptions written independently of the implementation. Anything that captures what the product should do rather than what the code currently does.

How is Text2Test different from pasting a prompt into ChatGPT?

Text2Test connects directly to your sources of truth rather than relying on pasted fragments. Test cases are generated from complete Jira tickets, Figma designs, or plain text requirements, and executed against the actual product rather than confirmed against the implementation.

Sources: Google Testing Blog, "Flaky Tests at Google and How We Mitigate Them" (Micco, 2016); Tricentis State of Testing 2024; LambdaTest Future of QA Survey 2024.
