Claude Code with Playwright: 4-agent test generation pipeline

Learn how to use Claude Code with Playwright using a 4-agent pipeline. Real token costs, app.context.md setup, Playwright MCP quick-start, and lessons from testing TestDino itself.

Your team ships faster than QA can write tests. Single-prompt AI generators make it worse.

They read the DOM, not your product rules, so scripts break on the next deploy. Using Claude Code with Playwright in a structured 4-agent pipeline fixes this.

We built it for TestDino itself. A 20-line prompt became 3 page objects and a full spec in 15 minutes.

TL;DR
  • 4 agents, 1 job each: Exploration, Test Case Generation, Automation, Maintenance. File-based handoffs, no shared memory.

  • Context is the differentiator: app.context.md + JSDoc + Playwright Skills give agents product knowledge the DOM never will.

  • Human checkpoint before code runs. Test Case Agent proposes coverage. You approve. Then automation starts.

  • CLI, not MCP, for this pipeline. About 4x lower token usage. Full comparison here.

  • Agents propose. Humans approve. The Maintenance Agent shows a diff. Nothing auto-applies.

Playwright MCP setup with Claude Code (quick-start)

Playwright MCP is how Claude Code gets live browser control. If you're already set up, skip ahead. If you're coming from Playwright MCP on Cursor, the server is identical.

terminal
claude mcp add playwright npx @playwright/mcp@latest   # register the MCP server (run from your shell)
claude                                                 # start Claude Code
/mcp                                                   # verify inside the session

Expected output:

expected-output
playwright    ✓ connected    npx @playwright/mcp@latest
              Tools: browser_navigate, browser_click,
                     browser_snapshot, browser_take_screenshot (+18 more)

Note: Use @playwright/mcp, not the deprecated @modelcontextprotocol/server-playwright. Same API, but the old package stopped getting updates.

This setup also works from VS Code and Windsurf; the server is editor-agnostic. The pipeline below uses Playwright CLI instead of MCP for token efficiency, but MCP is what you want for exploratory testing and self-QA loops.
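Once connected, exploration is plain conversation. A sample exploratory prompt; the wording is ours, reusing the staging URL from the example later in this article:

prompt.txt
Open https://app.cms.testdino.com and log in with the staging credentials
from .env. Explore the Settings → API tab as a Viewer. List every action
that is visible but disabled, and take a snapshot after each state change.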

Playwright CLI vs Playwright MCP

| | Playwright MCP | Playwright CLI |
| --- | --- | --- |
| How it works | Streams full browser state into model context each turn | Writes artifacts to disk; model reads only what it needs |
| Token cost | High. Complex pages: 50k+ tokens per snapshot | ~4x lower. Cached context reuses between runs |
| Context risk | High on long sessions | Low |
| Best for | Exploratory debugging, self-QA, one-off flows | Repeated test generation, large projects, CI |

For the 4-agent pipeline, use CLI. The agents already get product context from app.context.md and JSDoc, so they don't need a full accessibility snapshot every turn.
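In practice, "artifacts on disk" just means standard Playwright reporters and traces. A minimal config sketch, assuming our own output paths rather than anything the pipeline mandates:

playwright.config.ts
// playwright.config.ts: write artifacts to disk for agents to read (illustrative excerpt)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  reporter: [
    ['list'],                                               // human-readable console output
    ['json', { outputFile: 'test-results/results.json' }],  // agents read this file, not a live snapshot
  ],
  use: { trace: 'retain-on-failure' },                      // traces on disk for the Maintenance Agent
});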

Pro Tip: Don't mix interfaces in one pipeline run. MCP for exploration + CLI for automation in the same session breaks caching and inflates context unpredictably.

Built-in Playwright agents vs. the 4-agent pipeline

Playwright ships 3 built-in agents (Planner, Generator, Healer) that work with Claude Code and Cursor out of the box. Good for simple flows. Here's where the custom pipeline adds real value:

| Capability | Built-in Playwright agents | 4-agent pipeline |
| --- | --- | --- |
| Business logic awareness | Reads accessibility tree only | app.context.md + JSDoc: routes, permissions, edge cases |
| Test case review before code | None | Human checkpoint before Automation Agent runs |
| Multi-role tests | Manual re-recording per role | storageState per user, built in |
| Maintenance on failures | Healer auto-applies fixes (masks bugs) | Maintenance Agent proposes a diff; you approve |

The Healer auto-applying fixes is the gap that matters most. If a test broke because the feature's permission rule changed, a silent fix hides a real bug.

The 4-agent pipeline

Figure: 4-agent pipeline architecture. Exploration produces app.context.md, which feeds Test Case Generation, then Automation, then Maintenance; the Maintenance Agent also queries TestDino MCP for CI failure history.

Every handoff is a file on disk. No orchestrator. You can stop between any 2 phases, edit the output, and resume the next day.

| Agent | Input | Output |
| --- | --- | --- |
| Exploration | Docs + live app via Playwright CLI | app.context.md + per-feature JSDoc |
| Test Case Generation | app.context.md + JSDoc | Test case files with steps, preconditions, expected results |
| Automation | Test cases + live app + Playwright Skills | Page objects + .spec.ts files + results |
| Maintenance | Failing tests + page objects + TestDino MCP | Fix proposals or bug reports |

Note: The intelligence is not in the model. It's in what you feed the model. Without app.context.md, JSDoc, and Skills, you get generic scripts.
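For illustration, a Test Case Generation handoff file might look like this; the file name and field layout are our convention, not a Playwright standard:

tests/cases/api-keys.cases.md
## TC-03: Rotate an existing API key
- Preconditions: signed in as Admin; key "e2e-test" exists
- Steps: open Settings → API tab, click Rotate on "e2e-test", confirm in dialog
- Expected: PUT /api-key returns a new trx_ token; the old token is rejected on the next upload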

The 3-layer context strategy

Figure: the 3-layer context strategy. app.context.md at the base, per-feature JSDoc in the middle, Playwright Skills on top, with TestDino MCP as a fourth source for the Maintenance Agent only.

Layer 1: app.context.md - product knowledge before the agent touches a selector.

tests/app.context.md
# tests/app.context.md
## What this app does
TestDino is a Playwright test reporting platform.
- Core workflows: upload runs, view failure groups, triage flaky tests, manage API tokens
- User roles: Owner, Admin, Member, Viewer
- Auth: email/password at /auth/login
- NOT in scope: billing/Stripe, Google OAuth

Layer 2: Per-feature JSDoc covers what the accessibility tree won't reveal: clipboard-only tokens, permission rules, conditional flows.

tests/pages/api-keys.page.ts
/**
 * @feature API Keys Management
 * @route /org_.../projects/.../settings?tab=api
 * @rules
 *   - Only Owner and Admin can create/rotate/delete
 *   - Token prefix: trx_
 *   - Create: token visible in response dialog
 *   - Rotate: token clipboard-only, NOT shown in UI — use API interception
 */
export class ApiKeysPage { ... }

Pro Tip: Only document what the accessibility tree can't reveal. If the agent can find it by exploring the UI, don't put it in JSDoc. Start minimal and add only where the agent loops.

Layer 3: Playwright Skills are curated markdown guides that enforce production patterns: getByRole over CSS selectors, storageState for auth, auto-waiting assertions, Page Object Model structure.

terminal
# Install the Playwright Skill into your project
npx skills add testdino-hq/playwright-skill

Without Skills, Claude generates valid Playwright code. With Skills, it generates code that matches what your team would write by hand. The Playwright Skill guide covers the full 70+ guide structure.
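As a sketch of what's inside, a single Skill guide file might look like this; the file name and wording are illustrative, not the published Skill's exact content:

skills/playwright/locators.md
## Locators
- Prefer page.getByRole('button', { name: 'Save' }) over CSS selectors.
- Fall back to getByTestId only when no accessible role and name exist.
- Never page.locator('.btn-primary'); class names change with every restyle.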

Real example: API key workflow (20-line prompt to working spec)

This is the hardest part: a workflow that crosses browser and CLI boundaries, involves conditional logic, and has 2 different token-capture strategies depending on the path.

The Automation Agent prompt:

prompt.txt
Write a Playwright test suite using Page Object Model.
App: https://app.cms.testdino.com — login via env vars.
Tests run serially. Token from test 2 must be available to test 3.

Test 1: Happy path login
Test 2: Create or rotate API key named "e2e-test"
- If "e2e-test" doesn't exist: create it, capture token
- If it exists: rotate it, capture new token
- Token only shown once after create/rotate
Test 3: Upload report via CLI and verify counts in UI
  npx tdpw upload ./playwright-report --token=<TOKEN>
  --environment="staging" --tag="e2e-pipeline" --json

The tricky part: rotation sends the token directly to the clipboard. It's never rendered in the DOM. The agent needs API response interception, and the promise must be registered before the click, not after. If you click first, the response arrives before the listener exists. That's 1-in-10 flakiness in CI.

tests/pages/api-keys.page.ts
// tests/pages/api-keys.page.ts
async rotateKey(name: string): Promise<string> {
  const row = this.page.getByRole('row', { name });
  await row.getByRole('button', { name: 'Rotate' }).click();
  await expect(this.rotateDialog).toBeVisible();

  // Promise-first: register BEFORE confirming rotation
  const responsePromise = this.page.waitForResponse(
    (resp) => resp.url().includes('/api-key') && resp.request().method() === 'PUT',
  );
  await this.confirmRotateButton.click();

  const response = await responsePromise;
  const body = await response.json();
  return body.data.token; // trx_... prefix
}

Both create and rotate use this interception pattern; it's more reliable than parsing dialog text that could change with any UI update.
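For reference, a create-path sketch using the same promise-first pattern. The field label and POST method are assumptions; the "Generate Key" button name comes from the member-permission spec later in this article:

tests/pages/api-keys.page.ts
// Hypothetical create counterpart: same interception, POST instead of PUT
async createKey(name: string): Promise<string> {
  await this.page.getByRole('button', { name: 'Generate Key' }).click();
  await this.page.getByLabel('Key name').fill(name); // label assumed

  // Promise-first again: register BEFORE submitting
  const responsePromise = this.page.waitForResponse(
    (resp) => resp.url().includes('/api-key') && resp.request().method() === 'POST',
  );
  await this.page.getByRole('button', { name: 'Create' }).click();

  const body = await (await responsePromise).json();
  return body.data.token; // same trx_ prefix
}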

Expected output:

terminal
Running 2 tests using 1 worker

  ✓  [chromium] › api-keys.spec.ts › create or rotate key (4.3s)
  ✓  [chromium] › api-keys.spec.ts › upload report and verify (9.1s)

  2 passed (13.4s)

Where agents actually get stuck

These are real loops from our runs, not hypotheticals:

| Stuck point | Root cause | Fix |
| --- | --- | --- |
| Rotation token not in UI | Clipboard-only, never rendered | "Intercept the PUT /api-key response instead" |
| Over-reading files | Agent reads full files, not sections | Restructured page objects to be smaller; no prompt fix |
| Create vs rotate confusion | Ambiguous conditional | "Check whether the key name exists in the table first" |
| Wrong CLI flags | Non-standard tdpw syntax | Pasted the exact upload command with all flags |
| Token lost across serial tests | Missing test.describe.serial() | "Write the token to a variable at describe scope" (sketch below) |

Standard patterns (login, form fill, navigation) worked first try. The loops happened exclusively where the answer wasn't in the DOM or codebase but in someone's head. That's what JSDoc is for.
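The describe-scope fix from the last table row, as a minimal sketch; it assumes the ApiKeysPage excerpt from earlier:

tests/api-keys.spec.ts
// Serial execution keeps order deterministic and lets tests share state
import { test, expect } from '@playwright/test';
import { ApiKeysPage } from './pages/api-keys.page';

test.describe.serial('API key workflow', () => {
  let apiToken: string; // describe scope: survives across tests in this block

  test('create or rotate key', async ({ page }) => {
    const keys = new ApiKeysPage(page);
    apiToken = await keys.rotateKey('e2e-test');
    expect(apiToken).toMatch(/^trx_/);
  });

  test('upload report and verify', async () => {
    // apiToken captured above is available here; pass it to the tdpw upload step
  });
});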

Token costs from real runs

API key workflow session: 110k of the 200k context window (55%) at one mid-session snapshot.

Single file read: 137.9k tokens in 1 tool call. One large unfocused page object. Keep them small.

The costliest loop: 3 attempts on the clipboard token before 1 sentence of guidance resolved it. The caching of app.context.md and JSDoc means subsequent feature runs cost significantly less because the Exploration phase doesn't need to repeat.

Expert insight: The RBAC test on a 207-file refactor was a lesson. About 40% of the generated tests needed significant rework because the agent guessed at permission rules instead of reading every changed file. Write tests alongside development while the diff is still small; large refactors overwhelm context reasoning.

CLAUDE.md for the pipeline

CLAUDE.md tells Claude Code its role, where context lives, and what it must never do autonomously.

CLAUDE.md
# CLAUDE.md
## Read first
tests/app.context.md — read at session start, once.

## Playwright patterns
Load the Playwright Skill before writing any test code.
Install: npx skills add testdino-hq/playwright-skill

## Hard rules
- Never auto-apply test fixes. Show a diff first, wait for approval.
- Never modify a page object without reading its JSDoc block first.
- Locators: getByRole or getByTestId. No CSS selectors.
- Auth: storageState from tests/auth/ only. Never hardcode credentials.
- Staging only. Never run against production.

One CLAUDE.md covers every agent, every session, every team member. It eliminates the "Claude forgot the rules" problem on long runs. See the Playwright best practices guide for how this fits into a full project setup.

Multi-user auth with storageState

RBAC tests need sessions for each role. Set them up once in the Exploration phase; every downstream agent reuses them.

tests/auth/global-setup.ts
// tests/auth/global-setup.ts
import { chromium } from '@playwright/test';

const roles = [
  { name: 'owner', email: process.env.OWNER_EMAIL!, password: process.env.OWNER_PASS! },
  { name: 'admin', email: process.env.ADMIN_EMAIL!, password: process.env.ADMIN_PASS! },
  { name: 'member', email: process.env.MEMBER_EMAIL!, password: process.env.MEMBER_PASS! },
  { name: 'viewer', email: process.env.VIEWER_EMAIL!, password: process.env.VIEWER_PASS! },
];

export default async function globalSetup() {
  const browser = await chromium.launch();
  for (const role of roles) {
    const context = await browser.newContext();
    const page = await context.newPage();
    await page.goto(`${process.env.BASE_URL}/auth/login`);
    await page.getByLabel('Email').fill(role.email);
    await page.getByLabel('Password').fill(role.password);
    await page.getByRole('button', { name: 'Sign in' }).click();
    await page.waitForURL('**/dashboard');
    // Persist the authenticated session so specs can reuse it via storageState
    await context.storageState({ path: `tests/auth/${role.name}.json` });
    await context.close();
  }
  await browser.close();
}

Use role-scoped storage per spec:

tests/permissions/api-keys-member.spec.ts
import { test, expect } from '@playwright/test';

test.use({ storageState: 'tests/auth/member.json' });

test('member cannot create API keys', async ({ page }) => {
  await page.goto('/settings?tab=api');
  await expect(page.getByRole('button', { name: 'Generate Key' })).not.toBeVisible();
});
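Wiring it together: the config registers the global setup, and you can optionally define one project per role so an entire project runs under a single role without per-spec test.use. Paths follow the examples above; a sketch, not the full structure:

playwright.config.ts
// playwright.config.ts: auth wiring (illustrative excerpt)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  globalSetup: './tests/auth/global-setup.ts',
  projects: [
    { name: 'member', use: { storageState: 'tests/auth/member.json' } },
    { name: 'admin', use: { storageState: 'tests/auth/admin.json' } },
  ],
});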

See the E2E testing setup guide for the full playwright.config.ts structure for multi-environment runs.

5 mistakes that will hurt you

1. Skipping app.context.md. The agent explores the DOM and generates generic scripts. 30 minutes of documentation prevents 10 loops per feature.

2. Trusting the Healer on permission-sensitive tests. Auto-applied fixes on RBAC tests silently change your permission coverage without flagging it. Use the Maintenance Agent with human review.

3. Running the full pipeline on a large refactor. At 50+ changed files, the agent guesses at scope instead of reading. Keep diffs small, run the pipeline incrementally.

4. MCP for the Automation Agent. 3-4x higher token cost over CLI for the same output. MCP is for exploration, not code generation at scale.

5. Missing test.describe.serial() on data-dependent tests. Parallel execution silently breaks tests that share state between steps. Check the Playwright cheatsheet for the serial test pattern.

Common mistake: Using page.waitForSelector() over expect(locator).toBeVisible(). The old API bypasses Playwright's auto-waiting and forces you to manage timing manually. See the locators guide for the full breakdown.
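The contrast in code, with illustrative locators rather than ones from the TestDino suite:

example.spec.ts
// Avoid: raw selector wait; no retrying assertion semantics, manual timing
await page.waitForSelector('.toast-success');

// Prefer: web-first assertion on a semantic locator; Playwright retries until visible or timeout
await expect(page.getByRole('status')).toBeVisible();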

Tracking AI-generated tests in CI

TestDino tracks failure patterns across runs. Tag AI-generated tests so you can compare their flakiness rates against hand-written ones:

terminal
npx tdpw upload ./playwright-report \
  --token=$TESTDINO_TOKEN \
  --environment="staging" \
  --tag="ai-generated,claude-code" \
  --json

When tests break, the Maintenance Agent queries error grouping and trace viewer data across CI runs, not just 1 failing trace in isolation. That history changes the fix proposal: a test failing 3 of 5 runs gets a retry strategy; a test failing after every deployment gets a bug report.

Conclusion

The 4 things worth remembering:

  • app.context.md quality determines output quality more than prompt length.

  • The human checkpoint between the Test Case Agent and the Automation Agent catches most coverage mistakes before they cost tokens.

  • Large refactors break the pipeline; keep diffs small.

  • The Maintenance Agent needs CI history, not just a local trace, to diagnose correctly.

Read the full Playwright AI ecosystem overview if you want to see where this fits. For a fresh project, start with the Playwright Skill: the fastest path to Claude Code writing production-quality tests from session one.

FAQ

What is Claude Code with Playwright?
Claude Code (Anthropic's CLI agent) writes, runs, and fixes Playwright tests. It connects via Playwright MCP for live browser control or Playwright CLI for disk-based artifacts. The 4-agent pipeline structures this across Exploration, Test Case Generation, Automation, and Maintenance, each with 1 focused job.
How do I install Playwright MCP on Claude Code?
Run claude mcp add playwright npx @playwright/mcp@latest from your terminal, then start Claude Code and verify with /mcp. Node.js 18+ is required. Use @playwright/mcp, not the deprecated @modelcontextprotocol/server-playwright.
How does Claude Code compare to Cursor for Playwright tests?

Both use the same MCP server. Cursor with Playwright gives inline editor suggestions. Claude Code gives a terminal agent that reads your full project, runs tests, and iterates on failure output. For structured test generation with app.context.md + JSDoc, Claude Code produces more consistent results; for quick in-file fixes, Cursor is faster.

Can I use this pipeline with Playwright MCP on Cursor?

Yes. The app.context.md, JSDoc, and Skills architecture is editor-agnostic. Swap the Claude Code MCP config for Cursor's and the pipeline is identical.

How much does the pipeline cost in tokens?
The API key workflow (3 page objects + 1 spec) used 110k of 200k context window in 1 session. The Automation Agent phase costs the most (~70k). app.context.md caches between runs, so subsequent features cost less. One large unfocused file read hit 137.9k tokens. Keep page objects small.
What Node.js version is required?
Node.js 18 LTS or later. Both Claude Code and @playwright/mcp require it.
When should I use built-in Playwright agents instead?
Use built-in agents for simple flows (login, form, basic navigation) with no complex permission model. Use the 4-agent pipeline when tests need to understand permission rules, clipboard behaviors, or API-level data the DOM doesn't expose, or when you need human review of coverage before automation runs.
Does it work for visual regression testing?
Yes. The Automation Agent generates toHaveScreenshot() assertions. Load the visual testing guide into context alongside the Skill and include your baseline paths in app.context.md.
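A minimal sketch of such an assertion; the baseline file name and diff threshold are our choices:

tests/visual/dashboard.spec.ts
// Visual regression: compares against a stored baseline, fails on drift beyond the threshold
await expect(page).toHaveScreenshot('dashboard.png', { maxDiffPixelRatio: 0.01 });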
Pratik Patel

Founder & CEO

Pratik Patel is the founder of TestDino, a Playwright-focused observability and CI optimization platform that helps engineering and QA teams gain clear visibility into automated test results, flaky failures, and CI pipeline health. With 12+ years of QA automation experience, he has worked closely with startups and enterprise organizations to build and scale high-performing QA teams at companies such as Scotts Miracle-Gro, Avenue One, and Huma.

Pratik is an active contributor to the open-source community and a member of the Test Tribe community. He previously authored Make the Move to Automation with Appium, has supported many QA engineers with practical tools, consulting, and educational resources, and regularly writes about modern testing practices, Playwright, and developer productivity.
