How do you test a product nobody uses yet?
You can put rigorous users in front of your work before a single real one shows up, and, done honestly, the exercise finds real defects instead of hypothetical ones.
You test it by building the users you don’t have yet. Not survey respondents, not a focus group of friends who will be kind to you. A small set of evaluators, each handed a specific kind of professional life and the real product, each told to use it the way that person would and report honestly where it would fail them. Then you check your design choices against published research instead of your own taste. That is the method, and the part worth your attention is that it produces real, falsifiable problems rather than a list of opinions.
The oldest pre-launch problem has two halves. You are too close to your own work to see its blind spots, and you have no one to test it on. Most makers solve the second half by waiting. Ship it, watch the support inbox, learn from the people who churn. That works, but it spends your first real users as crash-test dummies, and the first impression they form is the one they tell other people about. The alternative is to find the failures while the cost of finding them is still a few hours of reading.
What a simulated user actually is
The instinct, when someone says “simulate users,” is to picture made-up testimonials. That is the wrong picture, and it is the version of this that deserves the suspicion it gets.
A useful simulated user is an evaluator given three things. A persona with a real use-pattern, so the lens is specific rather than generic. The actual built artifact, meaning the real code and the real behavior, not the marketing page that describes what the product is supposed to do. And an honest brief, which is the part that does the work: find where this fails me, stay read-only, and report what you see, not what I hope you’ll see.
The personas should disagree with each other on purpose. A solo user who opens the app once a week wants different things than someone overseeing many commitments at once, who wants different things again than someone whose work has hard external cutoffs. A newcomer seeing the empty screen for the first time tests something none of the others can. And one evaluator whose only job is to distrust the product, to hunt for the way it loses data or quietly does the wrong thing, finds the failures the optimists walk past. Five lenses on the same artifact catch what one lens, however careful, will miss.
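One way to keep the lenses and the brief explicit is to write them down as data. The following is a minimal sketch, not the setup actually used: the Persona class, the build_brief function, the persona names, and the ./app path are all hypothetical, and the brief wording only paraphrases the three ingredients described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    """One evaluator lens: a label plus the use-pattern it stands for."""
    name: str
    use_pattern: str


# The five lenses from the text. Names and phrasing are illustrative.
PERSONAS = [
    Persona("solo", "opens the app about once a week"),
    Persona("overseer", "tracks many commitments at once"),
    Persona("hard-cutoff", "works against hard external cutoffs"),
    Persona("newcomer", "sees the empty screen for the first time"),
    Persona("skeptic", "distrusts the product and hunts for lost data "
                       "or quietly wrong behavior"),
]


def build_brief(persona: Persona, artifact_path: str) -> str:
    """Compose an honest, read-only brief for a single evaluator."""
    return "\n".join([
        f"You are a user who {persona.use_pattern}.",
        f"Evaluate the real built product at {artifact_path}, "
        "not a description of it.",
        "Stay read-only: do not modify anything.",
        "Find where this would fail you. Report what you see, "
        "not what the maker hopes you will see.",
    ])


# One brief per persona, all pointed at the same artifact.
briefs = {p.name: build_brief(p, "./app") for p in PERSONAS}
```

Every brief shares the same artifact and the same read-only, find-the-failure instructions. Only the use-pattern changes, which is what makes the lenses disagree.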
The reason this beats your own judgment is not that the evaluators are smarter than you. It’s that they aren’t you. You know what the product is supposed to do, so you unconsciously use it correctly. A persona built to be a casual, distractible user does the thing a casual, distractible user actually does, which is close the tab and forget. That single behavior, simulated, exposed the most dangerous decision in the whole product.
The proof is in what it caught
A method like this is only worth writing about if it finds things you would have shipped. It found three.
The first was a silent correctness bug. Every account stored its timezone as a default placeholder, and no sign-in path ever wrote the real one. The evaluator playing the distrustful user traced it through the code and saw the consequence. For anyone not on that default zone, “today,” “due today,” “overdue,” and the send-time of a reminder would all compute against the wrong day boundary. A deadline due today could read as overdue a day early. The whole promise of a tool like this is that you can trust the dates. A date wrong by a day is the exact failure that breaks that trust, and it was invisible from the inside because the person who built it happened to be testing near the default zone. The fix was small. The catch was everything, and you only catch it if something is rigorously pretending to be a real user far from where you sit.
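The day-boundary failure is easy to reproduce in a few lines. This is a minimal sketch under stated assumptions, not the product's code: the names DEFAULT_TZ, USER_TZ, and status are hypothetical, the placeholder zone is assumed to be UTC, and fixed UTC offsets stand in for real timezone rules.

```python
from datetime import date, datetime, timedelta, timezone

# Assumption: the stored placeholder zone is UTC.
DEFAULT_TZ = timezone.utc
# Assumption: a user eight hours behind UTC (fixed offset, no DST rules).
USER_TZ = timezone(timedelta(hours=-8))


def status(due: date, now_utc: datetime, tz: timezone) -> str:
    """Classify a deadline against "today" as seen in the given zone."""
    today = now_utc.astimezone(tz).date()
    if due < today:
        return "overdue"
    if due == today:
        return "due today"
    return "upcoming"


# 02:00 UTC on March 2 is 18:00 on March 1 for the user at UTC-8.
now_utc = datetime(2024, 3, 2, 2, 0, tzinfo=timezone.utc)
due = date(2024, 3, 1)

with_placeholder = status(due, now_utc, DEFAULT_TZ)  # "overdue"
with_real_zone = status(due, now_utc, USER_TZ)       # "due today"
```

The same instant and the same deadline produce two different answers. With the placeholder zone, the user's evening of March 1 already counts as March 2, so a deadline that is still due today reads as overdue.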
The second was a default. Reminders, one per deadline, shipped switched off by default, out of a sense of restraint: don’t email people about things they can already see. The casual persona flipped that reasoning in one move. A user signs up, adds deadlines, closes the tab, and the tool whose entire job is to remember does nothing at all. The asymmetry settles it. A missed deadline is unrecoverable and visible to other people. Too much email is recoverable and has an off switch. This is the shape of loss aversion that Kahneman and Tversky described in 1979, where a loss looms larger than an equivalent gain. The annoyance of one extra email does not weigh the same as the deadline it would have saved. A tool built to remember cannot default to silence.
The third was a feature that failed when it mattered, and the honest part is in how that was measured. The product had a browser pop-up reminder that fired only while a tab was open. The evaluators simulated normal conditions and estimated its hit rate. These are estimates from the simulation, not measured numbers from real use, and that distinction matters. The estimate ran around 95 percent for the rare user who keeps the dashboard open all day, somewhere in the 40 to 65 percent range for an occasional checker, near zero on a phone, and near zero for anything that came due at night or on a weekend. A reminder that mostly fires when you are already looking, and goes quiet exactly when you aren’t, is not doing the job a reminder exists to do. So it was cut. Cutting something you built, when the evidence says it fails at the moment it’s supposed to work, is a design decision and not a defeat.
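The mechanism behind those estimates can be shown with a toy model. This sketch does not reproduce the evaluators' numbers and is not how they were derived. It only assumes that a pop-up fires when the tab happens to be open at the moment a deadline comes due, and every name, hour range, and due time in it is hypothetical.

```python
from datetime import datetime


def tab_is_open(moment: datetime, open_hours: range) -> bool:
    """Assume the tab is open only on weekdays during the given hours."""
    is_weekday = moment.weekday() < 5  # Monday is 0, Saturday is 5
    return is_weekday and moment.hour in open_hours


def estimated_hit_rate(due_times: list[datetime], open_hours: range) -> float:
    """Fraction of due times that land while the tab is open."""
    if not due_times:
        return 0.0
    hits = sum(tab_is_open(t, open_hours) for t in due_times)
    return hits / len(due_times)


# Hypothetical user who keeps the dashboard open 09:00 to 17:59 on weekdays.
OPEN_ALL_WORKDAY = range(9, 18)

workday_due = [
    datetime(2024, 3, 4, 10, 0),   # Monday morning
    datetime(2024, 3, 5, 15, 0),   # Tuesday afternoon
]
off_hours_due = [
    datetime(2024, 3, 6, 23, 0),   # Wednesday night
    datetime(2024, 3, 9, 11, 0),   # Saturday
]

workday_rate = estimated_hit_rate(workday_due, OPEN_ALL_WORKDAY)
off_hours_rate = estimated_hit_rate(off_hours_due, OPEN_ALL_WORKDAY)
```

Even for the most attentive user in this toy model, every deadline that comes due at night or on a weekend is missed. The output is a projection from the conditions set up in the code, not a measurement.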
Notice what those three have in common. None is a matter of taste. Each is falsifiable. The timezone bug is either there in the code or it isn’t. The silent default either leaves a new user unreminded or it doesn’t. The pop-up either fires when you’re away from the screen or it doesn’t. The simulation didn’t produce a mood board of preferences. It produced claims you could check.
Where the research does the second half
Simulated users are good at finding what’s broken in front of them. They are weak at telling you whether a choice holds up across thousands of people you haven’t met. That is the second half of the method, and it’s the cheaper half. For a question like how often to send something, the literature has already run the experiment at a scale you never will before launch. A randomized study by Fitz and colleagues in 2019 found that people whose notifications were batched a few times a day felt more attentive and less stressed than those getting a constant stream, and that the group receiving nothing at all fared worse than both. A predictable, batched signal beats both the firehose and the silence. You don’t have to relearn that on the backs of your first users. You read it, apply it, and spend your own testing budget on what no paper can tell you, which is the specifics of your own artifact.
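Applying a batched schedule can be as simple as rounding each pending reminder up to the next fixed delivery slot. This is a sketch of one possible approach, not the product's scheduler: the three slot times are an assumption chosen for illustration, the function name is hypothetical, and the datetimes are naive local times with no timezone handling.

```python
from datetime import datetime, time, timedelta

# Assumption: three delivery slots per day, chosen only for illustration.
BATCH_SLOTS = (time(9, 0), time(13, 0), time(17, 0))


def next_batch_time(now: datetime, slots=BATCH_SLOTS) -> datetime:
    """Return the first delivery slot strictly after `now`."""
    for slot in sorted(slots):
        candidate = datetime.combine(now.date(), slot)
        if candidate > now:
            return candidate
    # Every slot for today has passed, so use the earliest slot tomorrow.
    tomorrow = now.date() + timedelta(days=1)
    return datetime.combine(tomorrow, min(slots))


mid_morning = next_batch_time(datetime(2024, 3, 4, 10, 30))
after_hours = next_batch_time(datetime(2024, 3, 4, 18, 0))
```

A reminder queued at 10:30 waits for the 13:00 slot, and one queued at 18:00 goes out at 09:00 the next day. The user gets a predictable signal a few times a day rather than a stream or nothing at all.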
The pairing is the whole point. The simulation finds the concrete defect in the thing you actually built. The research grounds the judgment call that no amount of staring at your own product will settle. One without the other leaves a gap.
There’s a failure mode here worth naming, because it’s the one that makes people distrust the whole approach. If you write a soft brief, you get soft results. Tell an evaluator your product is great and ask what they like, and you have built a machine for flattering yourself. The honesty isn’t a nicety. It’s the load-bearing wall. The brief has to invite the worst finding, the read-only constraint has to keep the evaluator from rationalizing on your behalf, and the personas have to be allowed to walk away unimpressed. A simulation that can only return good news has told you nothing, and you paid for it in false confidence, which is the most expensive thing a maker can buy before launch.
You will still need real users. Nothing here replaces the day someone you’ve never met does something with your product that no persona predicted. What it replaces is the version of launch where that person is also the one who finds the bug that a far-from-home, rigorously skeptical, honestly-briefed simulated user would have found in an afternoon, for free, before anyone was watching.
Common questions
Isn’t a simulated user just you talking to yourself? It is if you let it be. The guardrails are what keep it honest. Each evaluator gets a persona unlike you, reads the real built product rather than your description of it, and is briefed to find failures and stay read-only. A brief that invites the worst finding returns real ones. A brief that asks what the evaluator likes returns flattery. The method lives or dies on which brief you write.
How is this different from just asking friends to try it? Friends are kind, and they share your assumptions about how the thing is meant to work, so they use it correctly and miss the same blind spots you do. A persona built to be careless, or skeptical, or working under a hard external cutoff, behaves in ways your friends won’t, and surfaces the failures those behaviors trigger. The point isn’t more goodwill. It’s more unlike-you behavior pointed at the same artifact.
Doesn’t this still miss things a real user would catch? Yes. It is a pre-launch instrument, not a substitute for launch. Real users do things no persona predicts, and you find those only by shipping. What the method buys you is that the obvious, falsifiable defects (the wrong day boundary, the silent default, the feature that fails when you look away) get caught before a paying stranger is the one to find them.
Are the usage numbers from this method real data? The estimates a simulation produces are estimates, and they have to be labeled as such. A simulated hit-rate is a reasoned projection from the conditions you set up, not a measurement of real-world behavior. Treat it as a signal that points you at a problem, not as a production statistic. The honesty about that distinction is part of what makes the rest of the findings trustworthy.