Tournaments Are Feedback Loops

2 days ago

Picking a model should not be a belief.

It should be a routing decision based on evidence.

Different models are good at different work. Some are cheaper. Some are faster. Some produce cleaner patches. Some ask better questions. Some are worth using only when the task is hard enough to justify the cost. A tournament is how the system learns that boundary.

[Diagram: TASK, ROUTE, MODEL A, MODEL B, MODEL C, SCORE, POLICY, QUESTION]

A tournament turns candidate outputs, costs, gates, and review effort into routing evidence for future tasks.

The Prize Is Not Winning

A tournament is easy to misunderstand.

It is not about finding one permanent best model. It is about learning which model should receive which kind of task under which constraints.

The useful output is a policy:

small mechanical patch -> cheap fast model
ambiguous architecture task -> stronger planner
UI runtime proof -> browser-capable worker
semantic rebase -> model with low review correction rate
high-risk public API change -> panel or tournament

The winner of one task is less important than the pattern across many tasks.
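A policy like this can be held as plain data. The sketch below is one possible shape, not a prescribed implementation: the task-kind keys, the route names, and the choice to send unknown task kinds to a tournament are all assumptions made for illustration.

```python
# Hypothetical policy table mirroring the list above.
# Keys are task kinds, values are route names; both are illustrative.
POLICY = {
    "small_mechanical_patch": "cheap_fast_model",
    "ambiguous_architecture_task": "stronger_planner",
    "ui_runtime_proof": "browser_capable_worker",
    "semantic_rebase": "low_review_correction_model",
    "high_risk_public_api_change": "panel_or_tournament",
}

# Assumption: a task kind the policy has never seen is treated as uncertain.
FALLBACK_ROUTE = "panel_or_tournament"


def route(task_kind: str) -> str:
    """Look up the route for a task kind, falling back when it is unknown."""
    return POLICY.get(task_kind, FALLBACK_ROUTE)


print(route("small_mechanical_patch"))  # cheap_fast_model
print(route("never_seen_before"))       # panel_or_tournament
```

Keeping the policy as data rather than branching logic means tournament results can rewrite entries without touching the lookup.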

Score The Whole Attempt

Model selection should not score only the final answer.

It should score the attempt as a system event:

did the patch apply
did gates pass
how much review was needed
how often was evidence missing
how many tokens were spent
how long did it take
did it ask a useful question
did it create follow-up work

A cheap answer that creates review debt may end up costing more than the expensive model's answer. A slow model that produces admissible work may be cheaper than three fast reruns.
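As a rough sketch, an attempt can be recorded with the fields from the list above and priced as a whole. The field names, the per-minute review rate, the follow-up cost, and every number below are made up for illustration; they are not measurements.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One model attempt, recorded as a system event (fields are illustrative)."""
    model: str
    patch_applied: bool
    gates_passed: bool
    review_minutes: float
    missing_evidence_count: int
    tokens: int
    wall_seconds: float
    asked_useful_question: bool
    followup_tasks: int


def total_cost(a: Attempt, usd_per_1k_tokens: float,
               usd_per_review_minute: float = 1.5,
               usd_per_followup: float = 20.0) -> float:
    """Model spend plus the review and follow-up work the attempt caused."""
    model_spend = a.tokens / 1000 * usd_per_1k_tokens
    review_spend = a.review_minutes * usd_per_review_minute
    followup_spend = a.followup_tasks * usd_per_followup
    return model_spend + review_spend + followup_spend


# Made-up attempts: the cheap one needs heavy review and leaves follow-up work.
cheap = Attempt("cheap", True, True, review_minutes=40, missing_evidence_count=2,
                tokens=20_000, wall_seconds=30, asked_useful_question=False,
                followup_tasks=1)
strong = Attempt("strong", True, True, review_minutes=5, missing_evidence_count=0,
                 tokens=60_000, wall_seconds=240, asked_useful_question=True,
                 followup_tasks=0)

cheap_total = total_cost(cheap, usd_per_1k_tokens=0.002)
strong_total = total_cost(strong, usd_per_1k_tokens=0.03)
print(cheap_total > strong_total)  # True: review debt dominates the token bill
```

With these assumed rates the token bill is the smallest term in both totals, which is the point: the comparison is decided by review and follow-up work.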

Status | Surface | Claim | Required proof | Current evidence | Route
Proved | Patch quality | Candidate is easier to admit | Apply and gate evidence | Clean apply, focused tests passed | Increase routing weight
Required | Cost | Model is efficient for this task kind | Cost per accepted patch | Tokens plus retries plus review time | Prefer when cheap enough
Required | Uncertainty | Model knows when to ask | Question usefulness | Human answer unblocked dependents | Use on ambiguous lanes
Missing | Risk | Task needs stronger review | Impact and proof gap | Public API or runtime surface | Escalate to panel

The system should score the work outcome, not just whether a generated answer looked plausible.

Tournaments Are For Boundaries

Most tasks should not be tournaments.

If the task is low risk and routine, a tournament wastes budget. If the task is broad, under-specified, or high-impact, a tournament can be cheaper than picking wrong once and paying for review, rework, and debugging later.

The system should use tournaments where the route is uncertain:

new task kind
high conflict history
large public surface
weak existing gate
unclear human intent
model telemetry disagrees
previous workers failed differently

That means tournaments are a boundary tool. They are used when the system is learning where its routing policy is still weak.
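The uncertainty signals listed above can be checked before any model runs. This is a minimal sketch under stated assumptions: the `RouteSignals` type, the `low_risk_routine` flag, and the `min_signals` threshold are hypothetical names, and a real system would derive each signal from its own history and telemetry.

```python
from dataclasses import dataclass


@dataclass
class RouteSignals:
    """One flag per uncertainty signal from the list above (names illustrative)."""
    new_task_kind: bool = False
    high_conflict_history: bool = False
    large_public_surface: bool = False
    weak_existing_gate: bool = False
    unclear_human_intent: bool = False
    telemetry_disagrees: bool = False
    prior_workers_failed_differently: bool = False


def uncertainty_reasons(signals: RouteSignals) -> list[str]:
    """Names of the signals that are set, in declaration order."""
    return [name for name, is_set in vars(signals).items() if is_set]


def should_run_tournament(signals: RouteSignals,
                          low_risk_routine: bool = False,
                          min_signals: int = 1) -> bool:
    """Run a tournament only when the route is uncertain enough."""
    if low_risk_routine:
        return False  # routine, low-risk work: a tournament wastes budget
    return len(uncertainty_reasons(signals)) >= min_signals


print(should_run_tournament(RouteSignals()))                         # False
print(should_run_tournament(RouteSignals(weak_existing_gate=True)))  # True
```

Returning the reasons separately keeps the decision explainable: the router can log why a task was escalated.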

Feedback Changes The Router

The tournament should feed the next decision.

If a cheap model keeps passing parser fixture tasks, it should get more of them. If it keeps failing semantic rebases, those tasks should move up a tier. If a strong model does not improve admission rate for a class of work, the system should stop paying for it there.

[Diagram: ROUTE, WORK, COST, QUALITY, LATENCY, REVIEW, ACCEPTED WORK, UPDATE ROUTER]

Routing should change as models produce evidence about cost, latency, quality, and review burden.

This creates a loop:

route task
run model or panel
collect patch, proof, cost, time, review effort
admit, rerun, split, or reject
update model-task policy
route the next task better

The router is not a hard-coded leaderboard. It is a policy that keeps being corrected by outcomes.
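One way to picture the loop is a router that keeps admission counts per model and task kind and reads them on the next decision. The sketch below assumes a fixed list of tiers ordered cheapest first, a minimum admission rate, and a minimum sample count; the `Router` class and all three parameters are illustrative choices, not a known implementation.

```python
class Router:
    """Routes by observed admission rate per (model, task kind)."""

    def __init__(self, tiers, min_rate=0.7, min_samples=5):
        self.tiers = list(tiers)        # assumed non-empty, cheapest first
        self.min_rate = min_rate        # admission rate needed to keep a route
        self.min_samples = min_samples  # evidence needed before judging a model
        self.stats = {}                 # (model, kind) -> [accepted, attempts]

    def record(self, model, kind, accepted):
        """Feed one outcome (admitted or not) back into the policy."""
        counts = self.stats.setdefault((model, kind), [0, 0])
        counts[0] += int(accepted)
        counts[1] += 1

    def pick(self, kind):
        """Cheapest model with a good enough record for this task kind."""
        best = None
        for model in self.tiers:
            accepted, attempts = self.stats.get((model, kind), (0, 0))
            if attempts < self.min_samples:
                return model  # too little evidence: try this tier
            rate = accepted / attempts
            if rate >= self.min_rate:
                return model
            # Strict '>' keeps the cheaper model when a stronger one is no better.
            if best is None or rate > best[0]:
                best = (rate, model)
        return best[1]


router = Router(["cheap", "strong"], min_samples=3)
for _ in range(3):
    router.record("cheap", "parser_fixture", accepted=True)
    router.record("cheap", "semantic_rebase", accepted=False)
print(router.pick("parser_fixture"))   # cheap
print(router.pick("semantic_rebase"))  # strong
```

If the strong model later fails semantic rebases at the same rate, `pick` falls back to the cheap one, since the extra spend bought no improvement. A fuller version would likely escalate that case to a panel instead.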

Panels Are Expensive Evidence

Sometimes the best use of a tournament is not to pick a patch.

It is to produce disagreement.

If three models solve the same task in three incompatible ways, that is evidence that the task is underspecified. The correct next step may be a human question, a narrower task, or a stronger oracle. The tournament did not fail. It revealed uncertainty before shared state moved.

Semantic Rebase Panel

Candidates: Three rebases. Each model adapted the patch through the new head.
Agreement: Two preserve type shape, one changes public API. Disagreement is itself evidence.
Cost: Panel cost exceeded single worker. Worth it only because the public boundary was risky.
Route: Ask human, then rerun narrow gate. The tournament produced the next decision, not an automatic merge.

A tournament can reveal missing intent before the system spends review time on the wrong patch.
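Disagreement can be detected mechanically if each candidate is reduced to a small signature. The sketch below assumes two hypothetical signature fields taken from the panel example (type shape and public API); what counts as "incompatible" in a real system would need a richer comparison.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A panel candidate reduced to a coarse signature (fields illustrative)."""
    model: str
    preserves_type_shape: bool
    changes_public_api: bool


def panel_decision(candidates):
    """Return the next step: ask a human on disagreement, else go to the gate."""
    signatures = Counter(
        (c.preserves_type_shape, c.changes_public_api) for c in candidates
    )
    if len(signatures) > 1:
        return "ask_human"       # incompatible solutions: intent is unclear
    return "proceed_to_gate"     # candidates agree: let the gate decide


panel = [
    Candidate("model_a", preserves_type_shape=True, changes_public_api=False),
    Candidate("model_b", preserves_type_shape=True, changes_public_api=False),
    Candidate("model_c", preserves_type_shape=False, changes_public_api=True),
]
print(panel_decision(panel))  # ask_human
```

Note that the function never picks a winning patch. Its output is the next decision, which matches the role the panel plays here.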

Optimize For Accepted Work

The cost unit should not be tokens.

Tokens matter, but they are only one part of the bill. The real unit is accepted work under evidence.

cost per accepted patch
cost per useful question
cost per proof bundle
cost per avoided human review
cost per correctly rejected change

That changes the optimization target.

The system should not simply minimize model spend. It should minimize total coordination cost while keeping throughput and safety high.
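Cost per accepted patch can be computed by charging every attempt, including rejected ones, against the work that was actually admitted. This is a minimal sketch: the `Outcome` fields and the cent amounts are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """Cost and result of one attempt, in cents (illustrative fields)."""
    model_cents: int
    review_cents: int
    accepted: bool


def cost_per_accepted(outcomes):
    """Total spend across all attempts divided by the number admitted."""
    total = sum(o.model_cents + o.review_cents for o in outcomes)
    accepted = sum(1 for o in outcomes if o.accepted)
    return total / accepted if accepted else float("inf")


# Made-up numbers: three fast reruns (one admitted) versus one slow attempt.
fast_reruns = [
    Outcome(model_cents=10, review_cents=200, accepted=False),
    Outcome(model_cents=10, review_cents=200, accepted=False),
    Outcome(model_cents=10, review_cents=200, accepted=True),
]
slow_once = [Outcome(model_cents=150, review_cents=100, accepted=True)]

print(cost_per_accepted(fast_reruns))  # 630.0
print(cost_per_accepted(slow_once))    # 250.0
```

The fast model spends 30 cents on tokens against 150 for the slow one, yet its accepted patch costs more than twice as much once the rejected attempts and their review time are counted.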

The Mental Model

Model choice is part of the merge system.

The same evidence that decides whether work can be admitted should also teach the system which model should attempt that kind of work next time.

A tournament is not a spectacle. It is a controlled way to learn routing policy.
