A couple of weeks ago a colleague posted this in our team Slack:

I realized a couple of weeks ago that the development bottleneck has quickly shifted from LLM context to my personal brain’s context window.

To which I replied:

@elon hurry up with those brain chips we’re falling behind

Funny in the moment. Less funny when I thought about it for more than five seconds.

Waiting for human evolution is probably not the fastest way to ship products. And betting so hard on Elon developing a world-changing brain-computer interface that I’d volunteer as an early adopter is not in the best interest of my health, my sanity, or—let’s be honest—my free will.

So I started thinking about a more practical question: how do we actually reduce the cognitive load of reviewing the robot’s work?

Yes, we do call the agents “the robot.”

The problem with fast and parallel

We work on a complex product with an already large codebase. Lots of interactions between systems. Many side effects. Any change has to be safe.

Here’s the uncomfortable truth: we’ve been operating as a fast-paced startup, and that means our testing infrastructure is lackluster. We recently started adding unit tests with real intent, but there are no end-to-end tests. No reliable integration tests. Every time a change lands, someone has to manually test everything, and then review the code.

Working locally across multiple worktrees is the norm these days. Three or four threads running in parallel—features, bug fixes, investigations—is a regular Tuesday. On top of that, there are meetings, architecture discussions, pairing sessions with colleagues. Attention is already split across a wide surface area.

When I started spinning up multiple parallel agents from different environments, they proved very prone to making mistakes. Not subtle logic errors—straight-up breaking the app. Changes that looked fine in the diff but crumbled the moment you opened a browser.

The first problem to solve was obvious: whatever I end up reviewing and testing should at minimum not degrade existing functionality. If I’m going to spend my limited brain bandwidth on a review, the thing should at least work.

The worktree bootstrapping problem

Before we get to the testing breakthrough, some context on why “just spin up a new environment” wasn’t trivial.

We use Convex as our backend. Convex deployments aren’t just a database connection string you swap out—each environment needs its own deployed backend with synced functions, schema, and data. When you create a git worktree, you get a fresh copy of the source code, but none of the environment wiring comes with it.

The manual workflow looked like this:

  1. Create the worktree
  2. Install dependencies
  3. Run npx convex dev to provision a fresh Convex deployment
  4. Grab the generated NEXT_PUBLIC_CONVEX_URL from .env.local
  5. Make sure the Next.js dev server picks it up
  6. Seed any test data the feature needs

Not terrible on paper. In practice, it took me and Claude about an hour of back-and-forth to make this reliable. The .env.local file is gitignored (correctly), so each worktree starts blank. Convex’s CLI detects the missing deployment config and prompts for project setup. Dependencies need a fresh install. It’s all solvable, but it’s friction that compounds when you’re doing it three times a day.

The automation for this lives in Claude Code’s worktree hooks. You can configure a WorktreeCreate hook that runs whenever a new worktree is created—either through claude --worktree or when a subagent uses worktree isolation:

{
  "hooks": {
    "WorktreeCreate": [
      {
        "hooks": [
          {
            "type": "command",
            "command": ".claude/scripts/setup-worktree.sh"
          }
        ]
      }
    ],
    "WorktreeRemove": [
      {
        "hooks": [
          {
            "type": "command",
            "command": ".claude/scripts/cleanup-worktree.sh"
          }
        ]
      }
    ]
  }
}

The setup script reads the worktree name from stdin, provisions the environment, and prints the worktree path to stdout (the hook contract requires this):

#!/bin/bash
set -e

# WorktreeCreate hook receives JSON on stdin
WORKTREE_INFO=$(cat)
NAME=$(echo "$WORKTREE_INFO" | jq -r '.name')
PROJECT_DIR=$(echo "$WORKTREE_INFO" | jq -r '.cwd')
WORKTREE_DIR="$PROJECT_DIR/.claude/worktrees/$NAME"

# Create the worktree
git worktree add "$WORKTREE_DIR" -b "worktree-$NAME" >&2

cd "$WORKTREE_DIR"

# Fresh dependency install
npm install >&2

# Remove any stale env config and provision a fresh Convex deployment
rm -f .env.local
npx convex dev --once >&2

# Hook contract: print the worktree path to stdout
echo "$WORKTREE_DIR"

The --once flag on npx convex dev pushes your functions and schema to Convex without starting the long-running watcher. It writes the deployment URL to .env.local, which Next.js picks up automatically through NEXT_PUBLIC_CONVEX_URL.
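If you want the setup script to log or health-check the freshly provisioned deployment, a small helper can read the URL back out of .env.local. This is a sketch: `extract_convex_url` is a hypothetical name of ours, though `NEXT_PUBLIC_CONVEX_URL` is the actual key Convex writes.

```shell
#!/bin/bash
# Hypothetical helper: read back the deployment URL that
# `npx convex dev --once` wrote into a worktree's .env.local.
extract_convex_url() {
  grep '^NEXT_PUBLIC_CONVEX_URL=' "$1" | cut -d= -f2-
}
```

Dropping something like `echo "Convex ready at $(extract_convex_url .env.local)" >&2` into the setup script makes it obvious at a glance which deployment each worktree got.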

The cleanup hook tears everything down when the worktree is removed:

#!/bin/bash
set -e

WORKTREE_INFO=$(cat)
WORKTREE_PATH=$(echo "$WORKTREE_INFO" | jq -r '.worktree_path')

# Kill any dev servers still running in this worktree
pkill -f "convex dev.*$WORKTREE_PATH" 2>/dev/null || true
pkill -f "next dev.*$WORKTREE_PATH" 2>/dev/null || true

# Git worktree removal is handled by Claude Code itself

With this in place, spinning up an isolated environment becomes a single command. The agent gets its own Convex backend, its own Next.js server, its own branch. No cross-contamination between parallel workstreams.
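Concretely, that single command is just the worktree flag mentioned earlier; the hooks do the rest. The worktree name here is an example, not a convention:

```shell
# One command: Claude Code creates the worktree, the WorktreeCreate hook
# installs dependencies and provisions a fresh Convex deployment, and the
# agent starts inside the isolated directory.
claude --worktree fix-invite-flow
```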

The moment everything clicked

I spent about an hour getting the worktree automation solid. Then I got tired of yak-shaving and decided to try something different.

I installed the Playwright MCP server—a tool that gives Claude Code direct control over a real browser:

{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}
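If you'd rather not edit the config file by hand, Claude Code's `claude mcp add` subcommand registers the same server from the CLI (shown here as the equivalent of the JSON above):

```shell
# Register the Playwright MCP server via the CLI instead of editing config.
# Everything after `--` is the command Claude Code will run to start it.
claude mcp add playwright -- npx @playwright/mcp@latest
```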

Then I told Claude to check the app and test it.

This is where the magic happened.

Claude opened a browser, navigated to localhost:3000, and started clicking through the app. Not blindly—it read the accessibility tree, understood the page structure, interacted with forms, validated that elements appeared where they should. When something broke, it saw the error in the browser console and reported it back. When things worked, it moved on to the next flow.

I sat there watching a browser window being driven by an AI agent, testing the application I’d been manually clicking through for months.

In about an hour of total setup—the worktree hooks and the Playwright MCP config—I had completely sidestepped the need for a traditional QA process. Not by writing a thousand test files. Not by setting up Cypress or Selenium or any test runner. By giving an agent a browser and telling it what the app should do.

Why specs are the real tests

Here’s the realization that followed: the bottleneck isn’t writing tests. It’s describing what the application should do.

When you write a Playwright test the traditional way, you’re encoding a specification into code. expect(page.getByRole('button', { name: 'Submit' })).toBeVisible() is just a machine-readable way of saying “there should be a submit button.” The test framework is a translation layer between human intent and automated verification.

With agentic testing, that translation layer disappears. You describe what the app should do in plain language, and the agent verifies it directly in the browser. The specification is the test.

This makes spec-driven development more important than ever. If your feature specs are detailed enough—what screens exist, what interactions are possible, what the expected outcomes are—they double as your test suite. You don’t need to do the work twice.

Writing tests is dead. Automated testing is just getting started.

The distinction matters. Traditional test-writing—the act of translating human understanding into assertion code—is the part that’s dying. The testing itself, the verification that software does what it should, is becoming more powerful and more accessible than ever.

What’s left to build

The setup I described works for ad-hoc verification. The next step is making it systematic:

  1. A complete feature manifest — a structured document describing every feature, its expected behavior, and its edge cases. This becomes the source of truth for both humans and agents.
  2. Scenario-driven test runs — instead of “test the app,” give the agent specific scenarios: “create a new project, invite a collaborator, verify they can edit in real time.”
  3. CI integration — run agentic tests on every PR, in a headless browser, against an isolated Convex deployment spun up by the worktree hooks.
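The scenario-driven runs can be sketched with a tiny wrapper. Everything here is an assumption—the function name, the `scenarios/` layout, and the prompt wording—except `claude -p`, which is Claude Code's non-interactive print mode:

```shell
#!/bin/bash
# Hypothetical sketch: wrap a plain-language scenario file into a
# verification prompt for the agent. Nothing here is a real convention.
spec_to_prompt() {
  printf 'Using the Playwright browser against http://localhost:3000, verify this scenario:\n\n%s\n' "$(cat "$1")"
}

# Ad-hoc run over every scenario file:
# for f in scenarios/*.md; do claude -p "$(spec_to_prompt "$f")"; done
```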

None of this requires writing test code in the traditional sense. It requires writing good specs. Which, if you’re building software well, you should be doing anyway.

The takeaway

The cognitive load problem I started with—my brain’s context window being the bottleneck—didn’t get solved by expanding my capacity. It got solved by reducing what I need to hold in my head. If an agent can verify that a change doesn’t break existing functionality before I even look at it, my review shifts from “does this work?” to “is this the right approach?” That’s a fundamentally different—and more sustainable—use of attention.

The hour I spent on worktree automation and browser testing setup has already paid for itself many times over. If you’re running parallel agents on a complex codebase without automated verification, you’re doing the cognitive equivalent of juggling chainsaws. Put the chainsaws down. Give the robot a browser.