Judging the Winner: Tests and LLM-as-Judge as the Referee

In the fan-out post I built a script that
runs the same task in N isolated worktrees and hands you N finished diffs. Then it stops — on purpose. I
wrote that the script "can't tell you which is correct… a human (or a test) deciding the winner is
load-bearing." This post is about that load-bearing part, because it's the half of best-of-N that actually
determines whether the whole exercise was worth it. Generating five attempts is easy. Choosing well is
the entire game.

The mistake is to treat judging as one problem. It's two, and they want different tools:

1. Is it correct? This is an objective question with an objective referee: tests. This stage is cheap,
   automatable, and ruthless. It eliminates the broken attempts before you waste a second of attention on
   them.
2. Among the ones that work, which is best? This is a question of taste and judgment: simplicity,
   readability, whether it solved the right problem. Here the referee is your eyes for a handful of
   diffs, and an LLM-as-judge when there are too many to read by hand.

Run them in that order. Tests first to cut the field, judgment second to pick the winner.

Stage 1: tests as the first cut

This stage is gorgeous in its simplicity, and it's exactly what worktrees were built for. Each attempt is
already a separate, complete checkout — so you just run the test suite inside each one and keep the
greens. Here's judge.sh, the companion to last post's fan-out.sh:

#!/usr/bin/env bash
# judge.sh — run a test command in each fan-out worktree and rank by pass/fail.
# Usage:  ./judge.sh <stamp> <test-command...>
# No -e on purpose: a failing test command must not abort the loop.
set -uo pipefail

stamp="${1:?usage: judge.sh <stamp> <test-command...>}"; shift
[ "$#" -gt 0 ] || { echo "need a test command" >&2; exit 1; }
TEST=("$@")

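repo_root="$(git rev-parse --show-toplevel)"
runs_dir="$(dirname "$repo_root")/$(basename "$repo_root")-fanout-$stamp"
# Fail loudly on a wrong stamp instead of silently reporting "none".
[ -d "$runs_dir" ] || { echo "no fan-out directory: $runs_dir" >&2; exit 1; }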

passed=(); failed=()
for dir in "$runs_dir"/run-*; do
  [ -d "$dir" ] || continue
  i="${dir##*run-}"
  # Subshell: the cd stays local to this run; test.log lands inside the worktree.
  if ( cd "$dir" && "${TEST[@]}" >test.log 2>&1 ); then
    echo "run $i  PASS"; passed+=("$i")
  else
    echo "run $i  FAIL  ($(tail -n 1 "$dir/test.log"))"; failed+=("$i")
  fi
done

echo
echo "green (survived): ${passed[*]:-none}"
echo "red   (eliminated): ${failed[*]:-none}"

judge.sh (and its siblings fan-out.sh / orchestrate.sh) are in
github.com/egarim/agent-fanout — MIT, with a runnable demo.

The only clever bit is ( cd "$dir" && "${TEST[@]}" ) — a subshell that runs your test command in each
worktree's own directory. Because the worktrees are isolated, the suites can't interfere with each other,
and the exit code is the verdict: zero is green, anything else is red.
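To see why the subshell matters, here is a minimal sketch that uses a throwaway temp directory in place of a real worktree (an assumption made so it runs anywhere, with no git involved). It shows the two properties judge.sh leans on: the cd never leaks into the parent shell, and the subshell's exit status is the test command's exit status.

```shell
#!/usr/bin/env bash
# Sketch: a subshell confines `cd` and passes the command's exit status through.
# The temp directory stands in for a worktree; nothing here touches git.
workdir="$(mktemp -d)"
start="$PWD"

# Passing "test": `true` exits 0, so the if-branch runs.
if ( cd "$workdir" && true ); then pass_verdict="PASS"; else pass_verdict="FAIL"; fi

# Failing "test": `false` exits 1, so the else-branch runs.
if ( cd "$workdir" && false ); then fail_verdict="PASS"; else fail_verdict="FAIL"; fi

echo "passing command: $pass_verdict"
echo "failing command: $fail_verdict"

# The parent shell never left its starting directory.
if [ "$PWD" = "$start" ]; then echo "parent directory unchanged"; fi

rmdir "$workdir"
```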

I ran it for real. I set up three "attempts" at implementing an add function — run 1 correct and clean,
run 2 deliberately buggy (it subtracts), run 3 correct but more elaborate — and pointed a one-line test
suite at them. The output:

$ ./judge.sh demo bash test.sh
run 1  PASS
run 2  FAIL  (FAIL: add 2 3 = '-1', want 5)
run 3  PASS

green (survived): 1 3
red   (eliminated): 2

That's the whole point of stage one. Run 2 looked perfectly plausible as a diff: it was a one-line change
that ran without errors. The test killed it instantly, with a concrete reason. No human attention was
spent on the broken attempt. The field went from three to two without anyone reading a line of code.
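The demo's test.sh isn't reproduced in this post, so the following is only a guess at its shape: a hypothetical one-assertion suite that prints a reason and exits non-zero on failure. The add function is inlined here so the sketch runs on its own; in a worktree it would be sourced from whatever file the attempt edited.

```shell
#!/usr/bin/env bash
# Hypothetical test.sh: the real demo file is not shown in the post.
# add() is inlined so this runs standalone; a real suite would source it
# from the worktree instead.
add() { echo $(($1 + $2)); }

# check_add: prints one line and returns non-zero if add 2 3 is not 5.
check_add() {
  local got
  got="$(add 2 3)"
  if [ "$got" = "5" ]; then
    echo "ok: add 2 3 = 5"
  else
    echo "FAIL: add 2 3 = '$got', want 5"
    return 1
  fi
}

# As the last command, its status becomes the script's exit code,
# which is all judge.sh looks at.
check_add
```

The last line of output doubles as the reason judge.sh prints next to FAIL, which is why it pays to make a suite's final line informative.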

A couple of honest notes on this stage. First, the dependency caveat from the last two posts applies
doubly here: a fresh worktree contains only tracked files, so it has no node_modules or bin/obj, and tests
need the project to build. Your test command therefore usually has to install or build first:
bash -c "npm ci && npm test" rather than bare npm test.

Second, and more important: green does not mean correct. Tests only check what they cover. An
attempt can pass every test and still be wrong in a way you didn't write a test for. Stage one eliminates
the provably broken; it does not certify the survivors. That's why there's a stage two.
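To make "green does not mean correct" concrete, here is a small sketch with two hypothetical implementations (neither is one of the three runs above). One hardcodes the answer the suite happens to ask for. A one-case suite passes both; a second case separates them.

```shell
#!/usr/bin/env bash
# Sketch: a wrong implementation that still passes a one-case suite.
# Both functions are hypothetical illustrations, not the post's runs.
add_hardcoded() { echo 5; }              # wrong in general, right for (2, 3)
add_real()      { echo $(($1 + $2)); }

# One-case suite: only checks add 2 3.
narrow_suite() { [ "$("$1" 2 3)" = "5" ]; }

# Two-case suite: a second input exposes the hardcoded answer.
wider_suite() { [ "$("$1" 2 3)" = "5" ] && [ "$("$1" 10 -4)" = "6" ]; }

for impl in add_hardcoded add_real; do
  if narrow_suite "$impl"; then n=PASS; else n=FAIL; fi
  if wider_suite  "$impl"; then w=PASS; else w=FAIL; fi
  echo "$impl  narrow=$n  wider=$w"
done
```

The hardcoded version is green under the narrow suite and red under the wider one. The suite defines what "correct" means to stage one, and nothing beyond it.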

Stage 2: judging the survivors on quality

Now I have two green attempts that both pass the suite. They are not the same. Here are the actual diffs:

# run 1 — minimal and direct
-add() { :; }   # TODO: implement
+add() { echo $(($1 + $2)); }

# run 3 — correct, but more general (sums any number of args)
-add() { :; }   # TODO: implement
+add() { local s=0; for x in "$@"; do s=$((s + x)); done; echo "$s"; }

Both pass. Which do you want? There's no universal answer — and that's the point. If the task was "add
two numbers," run 1 is the better citizen: it's the simplest thing that solves the stated problem. If you
secretly wanted a general summation, run 3 read your mind. This is a judgment call, not a fact, which
is precisely why you can't fully automate it away.

For a handful of small diffs, you are the right judge — read them and decide. The whole reason stage one
exists is to make sure the diffs you read by hand are all already working, so your scarce attention goes
to taste, not to bug-hunting.

When there are too many to read: LLM-as-judge

Best-of-8 across several files is too much to eyeball every time. This is where LLM-as-judge earns its
place: you hand a model the task description and the candidate diffs and ask it to pick, with reasons. The
recipe is simple — pipe the diffs into whatever model CLI you use:

# sketch — feed the green candidates to a model and ask for a pick + rationale
{
  echo "TASK: implement add(); simplest correct solution preferred."
  for i in 1 3; do
    echo "=== CANDIDATE $i ==="
    # Three dots: only what the branch changed since it forked from main.
    git diff "main...fanout/$stamp/$i"
  done
  echo "Pick the best candidate. Answer with the number and one sentence why."
} | your-model-cli

That's the shape. A few things separate a judge you can trust from one that quietly misleads you:

- Grade against explicit criteria, not vibes. "Pick the best" invites the model to rationalize. "Pick
  the simplest correct solution; penalize unrequested generality" gives it a rubric and gives you a
  result you can defend.
- Prefer pairwise comparison for close calls. Models are more reliable saying "A is simpler than B"
  than assigning B a 7/10. For many candidates, run a small tournament of pairwise judgments.
- Mind the known biases. LLM judges have a documented position bias (they favor whichever
  candidate came first or last) and a verbosity bias (they mistake longer for better). Shuffle the
  order, and tell the judge explicitly that length is not merit.
- The judge is itself non-deterministic and can be wrong. It's another model doing another fallible
  pass. Use it to rank and filter, not as a final authority, and especially not to auto-merge.
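The pairwise and position-bias points can be combined in one small sketch: ask about each pair twice with the order swapped, and only count a verdict that survives the swap. Everything named here is an assumption for illustration. In particular, judge_pair is a hypothetical stand-in for a model call, stubbed to prefer the shorter text so the script runs without any model CLI.

```shell
#!/usr/bin/env bash
# Sketch: pairwise judging with an order swap to damp position bias.
# judge_pair is a hypothetical stand-in for a model call.

# The rubric a real judge_pair would send along with both diffs (unused by the stub).
RUBRIC="Pick the simplest correct solution; penalize unrequested generality. Length is not merit."

# judge_pair FIRST SECOND -> prints "first" or "second".
# Stub: prefers the shorter text. A real version would call a model and parse its answer.
judge_pair() {
  if [ "${#1}" -le "${#2}" ]; then echo first; else echo second; fi
}

# compare A B -> prints "A", "B", or "tie".
# Asks twice with the order swapped; only a verdict that survives the swap counts.
compare() {
  local a="$1" b="$2" v1 v2
  v1="$(judge_pair "$a" "$b")"   # A shown first
  v2="$(judge_pair "$b" "$a")"   # B shown first
  if   [ "$v1" = first ]  && [ "$v2" = second ]; then echo A
  elif [ "$v1" = second ] && [ "$v2" = first ];  then echo B
  else echo tie                  # verdict flipped with position: do not trust it
  fi
}

# Usage with the two green candidates from the demo.
run1='add() { echo $(($1 + $2)); }'
run3='add() { local s=0; for x in "$@"; do s=$((s + x)); done; echo "$s"; }'
echo "winner: $(compare "$run1" "$run3")"
```

A tie here is information, not a failure: it means the judge's answer depended on ordering, which is a good cue to read that pair yourself. For more than two candidates, the same compare can drive a small tournament.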

The strongest version: make the judge adversarial

The judging move I trust most isn't "score these and pick the top one" — it's try to refute each one.
Instead of asking a model which candidate is best, ask it to find the bug in each candidate, defaulting
to suspicion. A candidate that survives a genuine attempt to break it is worth more than one that merely
scored well in a beauty contest. Stack a few of these skeptics and take a majority vote, and you've built
something much harder to fool than a single judge — the same logic as adversarial review among humans.
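The voting half of that idea is easy to sketch on its own. The function below assumes each skeptic's answer has already been reduced to one word per line, "survived" or "refuted"; that protocol is my assumption, and how you prompt the skeptics and parse their replies is left out. It defaults to suspicion: ties, unparseable votes, and an empty ballot all count against the candidate.

```shell
#!/usr/bin/env bash
# Sketch: majority vote over skeptic verdicts, defaulting to suspicion.
# Input protocol (an assumption): one vote per line on stdin, "survived" or "refuted".

# tally_votes -> prints "survived" only if a strict majority voted "survived".
tally_votes() {
  local vote survived=0 total=0
  while IFS= read -r vote; do
    [ -n "$vote" ] || continue          # skip blank lines
    total=$((total + 1))
    if [ "$vote" = survived ]; then survived=$((survived + 1)); fi
  done
  # Ties, unrecognized votes, and zero votes all fall through to "refuted".
  if [ $((survived * 2)) -gt "$total" ]; then echo survived; else echo refuted; fi
}

# Usage: three skeptics, two of whom failed to break the candidate.
printf '%s\n' survived refuted survived | tally_votes
```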

The honest bottom line

Put the two stages together and you have a real pipeline: fan out → tests cut the field → judgment picks
the winner. But keep the failure modes in view, because a confident-looking pipeline is the most
dangerous kind:

- Tests only prove the absence of the bugs you tested for. Green narrows the field; it never certifies
  correctness.
- The quality judge, human or model, is fallible, and the model kind is non-deterministic and biased.
  It's a filter and a ranker, not an oracle.
- Keep a human gate on anything that matters. Automate the elimination (let red tests kill attempts
  freely) and the ranking (let the judge order the survivors), but a person should approve the merge of
  anything with real consequences. The pipeline exists to bring you a short list of working, ranked
  candidates, not to push to main while you sleep.
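A human gate can be as small as one function. This is a sketch under my own assumptions (the prompt wording and the branch name are placeholders): it approves only on an explicit "yes", so an unattended run with no one at the keyboard never merges.

```shell
#!/usr/bin/env bash
# Sketch: the pipeline may rank, but a person approves the merge.
# The branch name passed in is a placeholder.

# approve_merge BRANCH -> reads one line from stdin; returns 0 only on an explicit "yes".
approve_merge() {
  local branch="$1" answer=""
  printf 'Merge %s? Type "yes" to approve: ' "$branch" >&2
  read -r answer || true      # EOF (no human present) leaves the answer empty
  [ "$answer" = "yes" ]
}

# Intended use, left commented out so the sketch has no side effects:
#   if approve_merge "$branch"; then git merge --no-ff "$branch"; else echo "not merged"; fi
```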

This is the other half of best-of-N, and it's the half people skip. Generating attempts is the easy,
flashy part; the referee is where the quality actually comes from. It's also exactly what the better
agent harnesses now do for you under the hood — Claude Code, for instance, can run tests against each
isolated attempt and then put candidates through adversarial verification before anything is committed.
Building the two-script version by hand — fan-out.sh to generate, judge.sh to referee — is worth it for
the same reason as before: once you've watched tests eliminate a plausible-but-broken diff and a judge
reason about two working ones, you know precisely what your tools are doing when they do it for you. And
you'll never again mistake "the agent produced something" for "the agent produced the right thing."