Sandfield Associates | Synthetic User Testing: What AI can and can't tell you about your UX

It’s a familiar story: a software project runs out of time, and "nice-to-have" tasks get cut from the scope. Sadly, user testing is usually the first to go. It is incredibly valuable, but setting up proper sessions with human participants takes time that a squeezed project rarely has.

That's when Synthetic User Testing comes into play.

Synthetic User Testing is an AI-assisted UX evaluation method in which a language model, given browser access via tools like Chrome DevTools MCP, navigates a working prototype and reports friction points, accessibility violations, and goal-completion outcomes — without involving human participants.

Think of it as a pre-flight check for your UI. While it isn't a total replacement for human testing, it allows designers to rapidly test complex user journeys and validate interactions early in the prototyping phase.

What is Synthetic User Testing good for?

Synthetic User Testing is most effective for functional validation, accessibility auditing, and catching obvious UX friction points early in the design process - tasks where machine speed and consistency outperform the time cost of recruiting human participants.

Validate user flows

Check that the happy paths work and that your personas can recover from the unhappy paths. This is functional testing, automated: gather basic functional feedback for a multitude of scenarios early on, then translate it into actual test cases afterwards.

Accessibility and performance audits

Catch WCAG violations during the development phase, before they become expensive to fix.

Rapid iterations

Test a large number of iterations of a landing page's layout or structure to see which solution fits best for a specific persona - quickly, and at scale.

Catch obvious UX friction points

The synthetic tests will surface the obvious friction, leaving the in-depth, nuanced testing to actual humans.

What is Synthetic User Testing not useful for?

Synthetic User Testing cannot replicate emotional depth, real-world unpredictability, or the final human judgment that determines whether a product truly makes sense to the people using it.

Emotional depth

While AI can simulate a persona, it can't truly map the frustrations and emotions involved while interacting with a system.

Real-world chaos

Synthetic testing happens in a safe environment. There is no chaos: everything is predictable and predetermined, so it does not simulate real-world conditions.

Final decisions

This is a very inhuman way to test human software. It tells us that the system works functionally, but it doesn't explain whether it actually makes sense to a real person.

How does Synthetic User Testing compare to regular user testing?

The limitations and advantages of synthetic user testing follow directly from the inherent strengths of machines versus humans. Fitts listed these strengths in 1951:

Human strengths	Machine strengths
Detection & perception	Speed & power
Judgment & induction	Computation & replication
Improvisation	Simultaneous operations
Long-term memory	Short-term memory

Source: Fitts, 1951

For a designer, the main advantage is catching UX friction points rapidly during the prototyping phase, a phase that is starting increasingly early in projects. The tests can provide quick, validated answers from different perspectives, answering not just 'does this work?' but 'does this interaction work for this persona in this scenario?'.

How do you set up a Synthetic User Testing proof of concept?

A Synthetic User Testing proof of concept requires three components: an LLM with browser access, a working software prototype to be tested, and configured browser MCP servers, such as chrome-devtools-mcp and playwright-mcp, that give the agent the ability to navigate the prototype.

I was tasked with evaluating whether this approach would make sense in Sandfield's workflow. Our LLM licensing covers GitHub Copilot and Claude Code, so we used both interchangeably throughout the project. We first tested the concept on an existing prototype for one of our clients; for this blog post, we re-ran it on our Origin website.

To be able to talk to the browsers, we'll use chrome-devtools-mcp and playwright-mcp. The Chrome MCP allows for easy navigation in the browser, while the Playwright MCP adds automated quality testing and regression testing to the mix - more on that later.
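As a rough illustration of the MCP setup, the sketch below builds a server configuration and prints it as JSON. It assumes both servers are started on demand with npx, using the npm packages chrome-devtools-mcp and @playwright/mcp. The server labels and the `mcpServers` key are assumptions: the file name and top-level key differ between clients, so check the documentation for GitHub Copilot or Claude Code before copying it.

```typescript
// Shape of one MCP server entry: a command plus its arguments.
interface McpServer {
  command: string;
  args: string[];
}

// Assumption: both servers are launched on demand via npx.
// The labels "chrome-devtools" and "playwright" are arbitrary names we chose.
const mcpConfig: { mcpServers: Record<string, McpServer> } = {
  mcpServers: {
    "chrome-devtools": { command: "npx", args: ["-y", "chrome-devtools-mcp@latest"] },
    playwright: { command: "npx", args: ["-y", "@playwright/mcp@latest"] },
  },
};

// Serialise to JSON that can be pasted into the client's MCP config file.
const mcpConfigJson: string = JSON.stringify(mcpConfig, null, 2);
console.log(mcpConfigJson);
```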

After creating a folder for the project and setting up the MCPs, it's time to test the waters. We organised the project into a handful of folders; the main ones are described below.

The requirements folder is where functional requirements live. The more context the LLM understands, the better it can fine-tune the personas to the project and ensure the scenarios being tested are correct and useful. In most cases, we'll create our own personas - but it doesn't hurt to have the LLM generate several personas from the requirements alone. After all, the agent might surface creative ideas we'd never thought of.

The reports folder holds the final Markdown reports for each test run, highlighting UX friction points, successes, failures, and WCAG audit results.

The tests folder is the place for Playwright. Since Synthetic User Testing is persona- and scenario-based, it's useful to turn the scenarios that have been created into Playwright tests for automated quality assurance.
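A minimal sketch of that layout, assuming the folder names requirements, personas, reports and tests/personas that appear in this post. The project root name is hypothetical, and the snippet only lists paths; it does not create anything on disk.

```typescript
// Assumed folder names, inferred from the descriptions in this post.
const projectFolders: Record<string, string> = {
  "requirements": "Functional requirements that give the agent project context",
  "personas": "One Markdown file per persona",
  "reports": "Markdown report for each test run",
  "tests/personas": "Playwright specs derived from report findings",
};

// Build the list of folder paths under a given project root.
function folderPaths(root: string): string[] {
  return Object.keys(projectFolders).map((folder) => root + "/" + folder);
}

// "synthetic-user-testing" is a hypothetical root folder name.
folderPaths("synthetic-user-testing").forEach((path) => console.log(path));
```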

What does a Synthetic User Testing session look like in practice?

To show this in action, we ran a session on our own Origin website, which is currently undergoing a revamp, so the findings are directly useful.

Defining the persona

We created a project called 'Origin Supply Chain' and used three example personas. One of them is Marcus:

Marcus

Role: IT Systems Manager
Company: Multi-entity logistics group
Location: Auckland, New Zealand

Who they are

Marcus is 38 and the IT Systems Manager at a multi-entity logistics group that runs SAP for finance and a legacy WMS built in-house. He's been pulled into a TMS evaluation by the ops team and his job is to answer two questions before the exec team signs off: can this integrate with what we already have, and is it secure enough to pass our ISO compliance audit? He's not the decision-maker, but he's the veto. A bad outcome for Marcus is recommending a vendor who later turns out to have a "custom integration project" price tag attached, or a security posture that fails the audit.

Goals for this session

Understand what integration options Origin offers — specifically whether SAP and existing WMS connections are standard or bespoke
Find Origin's security posture, certifications, and data residency information — he needs to complete a vendor risk assessment form
Determine whether customers get API access or if everything goes through a managed integration service

What matters to them

Technical specifics, not marketing language: "seamless integration" means nothing; supported protocols, connectors, and EDI formats mean everything
Security documentation: ISO certifications, data residency, penetration testing cadence — he'll need to screenshot these for the audit
Honest scoping of custom work: What's in the standard product vs. what requires a project? He's been burned by scope creep before
The ability to do his own research: He'll avoid sales calls until he has a clear picture; a demo request is a last resort

Behaviours to simulate

Go directly to the Products > Integration page first
Look for protocol/standard/connector lists on integration pages
Navigate to About > Security looking for certification badges or download links
Search for API documentation or developer resources
Check footer and nav for a "developers" or "docs" link
Read carefully for "managed service" vs. "self-service" language around integration

Success criteria

Finds the Integration product page and understands the service model
Locates the Security page and identifies any certifications
Gets a clear answer on whether API access is available to customers
Can partially fill out a vendor risk assessment

Red flags to watch for

Integration page is vague about protocols and connectors
No security page or certification information
No distinction between what's standard and what requires scoping
API or developer documentation absent or behind a contact wall
Security certification links broken or generic
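Our personas live in Markdown files, but it can help to think of them as structured data so that every persona carries the same sections. The sketch below is one possible shape, not the format we actually used: the `Persona` interface and the `renderPersona` helper are hypothetical, and the Marcus entry is abbreviated from the persona above.

```typescript
// Hypothetical persona shape; the field names mirror the headings used for Marcus.
interface Persona {
  name: string;
  role: string;
  goals: string[];
  behaviours: string[];
  successCriteria: string[];
  redFlags: string[];
}

// Render a titled bullet list.
function renderSection(title: string, items: string[]): string {
  return title + "\n" + items.map((item) => "- " + item).join("\n");
}

// Turn the persona into a Markdown brief that can be handed to the agent.
function renderPersona(p: Persona): string {
  return [
    "# " + p.name,
    "Role: " + p.role,
    renderSection("## Goals for this session", p.goals),
    renderSection("## Behaviours to simulate", p.behaviours),
    renderSection("## Success criteria", p.successCriteria),
    renderSection("## Red flags to watch for", p.redFlags),
  ].join("\n\n");
}

// Abbreviated version of Marcus, taken from the persona above.
const marcus: Persona = {
  name: "Marcus",
  role: "IT Systems Manager",
  goals: [
    "Understand whether SAP and WMS connections are standard or bespoke",
    "Find security posture, certifications, and data residency information",
    "Determine whether customers get API access",
  ],
  behaviours: ["Go directly to the Products > Integration page first"],
  successCriteria: ["Locates the Security page and identifies any certifications"],
  redFlags: ["No security page or certification information"],
};

console.log(renderPersona(marcus));
```

Keeping the sections fixed makes it easier to compare runs across personas, since each brief gives the agent goals, behaviours, success criteria and red flags in the same order.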

We’ll test whether Marcus can reach his goals: learning whether SAP and WMS connections are a ‘standard’ or a ‘bespoke’ option, completing a vendor risk assessment, and finding out whether our API is accessible to customers. These are specific goals, so let’s put our current website to the test.

MCP setup

Before starting the test, it's worth confirming that both MCPs are running correctly so the agent can browse through the project. With chrome-devtools and playwright both confirmed as live, the agent has access to both - meaning we can perform the user testing process and follow up with automated QA testing.

Approach

Since working with LLMs is inherently non-deterministic, we need to provide sufficient guidelines and structure to ensure consistent output. To do this, we created two Markdown files, both containing instructions for the workflow.

The synthetic user testing process is highly structured to ensure consistent, actionable results. It begins by defining a clear persona and real-world task scenarios (Goals). The core execution involves a "think-aloud" session, where the agent narrates the persona's inner monologue while navigating the product at both desktop and mobile viewports.

Friction is classified by severity (Goal-blocking 🔴, Goal-friction 🟠) and documented in a report that prioritises actionable recommendations. The process concludes by generating two separate engineering artefacts: Playwright tests for automated regression coverage on critical findings, and an automated accessibility audit using axe-core across all tested pages.
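The severity scheme above can be sketched as a small data model that sorts findings so goal-blocking issues lead the report. The type names, the `sortBySeverity` helper and the second finding are illustrative assumptions; only the two severity labels and the Goal 1 outcome come from our run.

```typescript
// The two severity levels used in our reports.
type Severity = "goal-blocking" | "goal-friction";

interface Finding {
  goal: string;
  severity: Severity;
  summary: string;
}

// Display labels, using the same markers as the report.
const severityLabel: Record<Severity, string> = {
  "goal-blocking": "🔴 Goal-blocking",
  "goal-friction": "🟠 Goal-friction",
};

// Lower rank sorts first, so blocking issues lead the report.
const severityRank: Record<Severity, number> = {
  "goal-blocking": 0,
  "goal-friction": 1,
};

// Return a sorted copy; the input array is left untouched.
function sortBySeverity(findings: Finding[]): Finding[] {
  return findings.slice().sort((a, b) => severityRank[a.severity] - severityRank[b.severity]);
}

const findings: Finding[] = [
  { goal: "G1", severity: "goal-friction", summary: "Unclear whether SAP is a standard connector" },
  // Illustrative placeholder, not a finding from the Marcus run.
  { goal: "G-example", severity: "goal-blocking", summary: "Placeholder for a blocking issue" },
];

sortBySeverity(findings).forEach((f) => {
  console.log(severityLabel[f.severity] + " | " + f.goal + " | " + f.summary);
});
```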

The think-aloud method

The most useful insights from user tests often come to light using the think-aloud method: a simple approach that asks participants to verbalise their thoughts while using a product. The markdown file stresses the importance of first-person, present-tense narration - including honesty about confusion and reasoning for decision-making.

Workflow summary

Persona defined (.md file in personas/)
Chrome MCP live browser exploration, goal-first
Friction report written to reports/
Playwright .spec.ts written from report findings, saved to tests/personas/
Accessibility audit run
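To give a feel for the fourth step, here is a sketch that turns one navigation finding into the source of a Playwright spec. In our workflow the agent writes the spec itself, so the `NavFinding` shape and the `specFor` helper are hypothetical. The example URL is a placeholder, and treating the About menu as a button is an assumption about the page markup. The generated code relies on `page.goto`, `page.getByRole` and `toBeVisible` from @playwright/test.

```typescript
// A navigation finding that should be guarded by a regression test.
interface NavFinding {
  title: string;    // test title
  startUrl: string; // page the persona starts on
  menuName: string; // top-level nav item to open
  linkName: string; // link expected inside that menu
}

// Produce the source of a Playwright spec that checks the link is reachable from the menu.
function specFor(f: NavFinding): string {
  return [
    'import { test, expect } from "@playwright/test";',
    "",
    "test(" + JSON.stringify(f.title) + ", async ({ page }) => {",
    "  await page.goto(" + JSON.stringify(f.startUrl) + ");",
    '  await page.getByRole("button", { name: ' + JSON.stringify(f.menuName) + " }).click();",
    '  await expect(page.getByRole("link", { name: ' + JSON.stringify(f.linkName) + " })).toBeVisible();",
    "});",
  ].join("\n");
}

// Based on the finding that Security is missing from the About dropdown.
const securityNavSpec: string = specFor({
  title: "Marcus can reach Security from the About menu",
  startUrl: "https://example.com/", // placeholder URL
  menuName: "About",
  linkName: "Security",
});

console.log(securityNavSpec);
```

Saved under tests/personas/, a spec like this fails until the fix ships and then guards against the link disappearing again.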

What did the Synthetic User Testing run reveal?

The synthetic run for Marcus, the IT Systems Manager, provided immediate, valuable clarity on the self-service evaluation journey. While he ultimately needed a sales call to complete his vendor audit, the test identified several strong points and key areas for improvement.

On the positive side, Marcus easily found our verifiable ISO 27001:2022 certification. However, friction arose in two areas: the website was ambiguous about whether SAP integration was a standard connector or required a bespoke project, and the "fully managed service" messaging wrongly suggested that customers get no API access at all.

The agent did its job well - it recognised the earlier report without being specifically instructed to, and decided to start a fresh session. The browser opened at two different screen resolutions. The first issue came to light quickly: we do not mention API access on our website. The agent then moved on to the second goal - the security audit - found the security page, and cleared the ISO 27001 goal after independently verifying the JASANZ register link.

Within five minutes, the report was written.

Think-aloud excerpt: Goal 1

Goal 1: Understand what integration options Origin offers - specifically whether SAP and existing WMS connections are standard or bespoke

Outcome: 🟠 Goal-friction

"OK, I'm on the homepage. 'Logistics software for operators who refuse to compromise' - fine, I'm not here for the marketing. Let me find Integration." Marcus scans the nav - Products is right there. He opens the dropdown.

"Integration - yes, and the description says 'connect your systems, partners, and customers with our fully managed supply chain integration service.' Fully managed. I'll note that." He clicks Integration.

He reaches the connector ecosystem list. "Internal Systems: ERP (SAP, NetSuite), WMS, TMS, Forwarding. Good - SAP is there by name. WMS too. But it's a bullet point, not a connector catalogue. I don't know if SAP is a standard connector or if 'we've done it before, let's scope your project' is the actual answer."

He scrolls to the bottom. Contact form. "Of course. Nothing else here - no spec sheet, no connector list, no pricing. I'll look at Crossfire." He navigates to Crossfire's site, finds the SAP connector listed. "So SAP is a real connector, not a hypothetical. But I'm now on a completely different website and I still don't know what's pre-built versus what Origin would want to scope as a project."

Recommendations

Based on the Marcus audit, here are the prioritised fixes:

High impact

Add a "What's included vs. what's scoped" section to the Integration product page — even a simple table listing standard connectors vs. connectors requiring a project conversation would let evaluators like Marcus answer their core question without booking a demo
Add Security to the About dropdown nav — it currently exists only in the footer, meaning any user who navigates About > [looking for Security] hits a dead end and may not find the page at all
Specify the AWS data region on the Security page — "hosted in Amazon Web Services" with no region is an automatic audit gap for any enterprise vendor assessment in NZ or AU

Medium impact

Surface Crossfire's protocol details (REST, SOAP, JSON, XML, webhooks, EDI) on Origin's own Integration page — currently Marcus must navigate off-site and dig into Crossfire's FAQ to find protocol specifics that would answer G1 and G3 directly
Clarify on the Integration page that customers can self-manage API keys via the Crossfire Customer Portal — the "fully managed service" framing implies a black box when in fact there is a customer-accessible API layer
Add a NZ Privacy Act / Australian Privacy Act reference to the Security page alongside GDPR — the current compliance section cites GDPR but neither regional framework despite serving exclusively Australasian enterprise clients

Low impact

Replace "For details of our ISO 27001 certification, please contact us" with an unambiguous CTA — the link beside it already goes to the live JASANZ register, making the "contact us" instruction confusing and underselling the fact that the cert is independently verifiable right now
Add a named pen testing vendor and annual/biannual cadence detail to the Environment section — "frequent pen testing" is too vague for a vendor risk assessment form

What did we learn from running the proof of concept?

Fine-tuning of reporting is ongoing

The initial test runs quickly showed that LLM agents need ongoing refinement to ensure output is consistently useful. While the think-aloud method immediately pinpointed friction, raw outputs sometimes lacked the structured clarity needed for a busy engineering team. The lesson: continually fine-tune the agent's prompts to produce not just data, but highly specific and actionable analysis - so every test run delivers maximum value for sprint planning and design review.

A dedicated recommendations section changes everything

A key structural improvement was adding a 'Recommendations' section at the top of the final report. Initially, findings were buried deep within the goal-specific outcomes, making it difficult for busy stakeholders to grasp the high-impact fixes quickly. By introducing prioritised, action-oriented items - such as "Add Security to the About dropdown nav" - we created an easily scannable list that transformed the report from an audit document into a sprint-ready playbook.
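A minimal sketch of that report structure, with recommendations grouped by impact and placed ahead of the goal-by-goal detail. The type and function names are hypothetical, and the three recommendations are shortened from the Marcus report.

```typescript
type Impact = "High" | "Medium" | "Low";

interface Recommendation {
  impact: Impact;
  action: string;
}

// Fixed display order, highest impact first.
const impactOrder: Impact[] = ["High", "Medium", "Low"];

// Group recommendations under an impact heading, skipping empty groups.
function renderRecommendations(recs: Recommendation[]): string {
  const blocks: string[] = [];
  impactOrder.forEach((impact) => {
    const actions = recs.filter((r) => r.impact === impact).map((r) => "- " + r.action);
    if (actions.length > 0) {
      blocks.push(impact + " impact\n" + actions.join("\n"));
    }
  });
  return blocks.join("\n\n");
}

// Recommendations come first; the goal-by-goal detail follows.
function renderReport(recs: Recommendation[], goalOutcomes: string): string {
  return "Recommendations\n\n" + renderRecommendations(recs) + "\n\n" + goalOutcomes;
}

const report: string = renderReport(
  [
    { impact: "Low", action: "Add a named pen testing vendor and cadence detail" },
    { impact: "High", action: "Add Security to the About dropdown nav" },
    { impact: "Medium", action: "Clarify that customers can self-manage API keys" },
  ],
  "Goal 1: 🟠 Goal-friction"
);

console.log(report);
```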

Persona quality determines result quality

The utility of synthetic user testing is highly dependent on matching digital personas to real-life users and specific scenarios. Generalised testing yields general, low-value feedback. When we apply highly specific goals - like Marcus's need to find "AWS data residency" or "standard vs. bespoke integration" - the results immediately become powerful. This underscores the need for rigorous persona engineering at the outset of every project, ensuring scenarios are not just functional checks, but true reflections of actual high-value user behaviour.

When should design teams use Synthetic User Testing?

Synthetic User Testing is most valuable during the prototyping phase, before real-world user acceptance testing - when catching obvious friction early saves the most time and cost.

Its core value lies in providing a pre-flight check for your UX and UI: rapidly testing complex user journeys, validating flows, and catching obvious friction points in a fraction of the time that recruiting and running human sessions would take. For Sandfield, this method is useful both internally and externally: it accelerates our design process by providing quick, validated answers from different perspectives during prototyping, and it helps us deliver higher-quality, functionally sound products to our clients through automated validation and accessibility audits.

The specific findings from the Marcus audit - run on our current Origin website - are now being actioned to directly inform the new website revamp, ensuring our new design addresses critical evaluator needs like security detail and integration clarity. This systematic approach ensures that by the time a product reaches real-world User Acceptance Testing, the majority of obvious friction has been smoothed out.

Furthermore, by using Playwright to turn key friction points into automated QA tests, we build a robust regression suite into the project from the very start.

Frequently asked questions

Is Synthetic User Testing a replacement for real user testing?

No. Think of it as the pre-flight check before you bring in the pilots. It catches the obvious friction so your real user sessions can be dedicated to exploring more complex, high-impact challenges.

What LLMs work for Synthetic User Testing?

Any LLM that can drive a browser through tools such as MCP servers should work. We used both GitHub Copilot and Claude Code, configured with chrome-devtools-mcp and playwright-mcp.

How long does a synthetic user testing session take?

From prompt to full Markdown report, the Marcus session took under five minutes. Writing Playwright specs from the findings adds more time, but the initial audit is very fast. The key to executing this quickly is having the right persona and testing criteria set up before starting the test run.

What's the difference between Synthetic User Testing and automated QA testing?

Automated QA testing checks whether functionality works. Synthetic User Testing checks whether a specific persona can achieve their goals. It's scenario-based and judgment-driven, not simply pass/fail.

When in a project should Synthetic User Testing happen?

Ideally during the prototyping phase, early and often. The earlier friction is caught, the cheaper it is to fix — and the more focused your real user testing sessions can be.
