Tournaments Are Feedback Loops
Picking a model should not be a belief.
It should be a routing decision based on evidence.
Different models are good at different work. Some are cheaper. Some are faster. Some produce cleaner patches. Some ask better questions. Some are worth using only when the task is hard enough to justify the cost. A tournament is how the system learns that boundary.
The Prize Is Not Winning
A tournament is easy to misunderstand.
It is not about finding one permanent best model. It is about learning which model should receive which kind of task under which constraints.
The useful output is a policy:
small mechanical patch -> cheap fast model
ambiguous architecture task -> stronger planner
UI runtime proof -> browser-capable worker
semantic rebase -> model with low review correction rate
high-risk public API change -> panel or tournament
The winner of one task is less important than the pattern across many tasks.
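A policy of this kind can be as plain as a lookup table. The following is a minimal sketch, not a prescribed implementation: the `Task` type, the route names, and the fallback to a tournament for unknown task kinds are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    kind: str          # hypothetical label, e.g. "mechanical_patch"
    risk: str = "low"  # "low" or "high"


# Hypothetical policy table: task kind -> route name.
POLICY = {
    "mechanical_patch": "cheap_fast_model",
    "architecture": "strong_planner",
    "ui_runtime_proof": "browser_worker",
    "semantic_rebase": "low_correction_model",
}


def route(task: Task) -> str:
    # High-risk public API changes skip the table and go to a panel.
    if task.kind == "public_api_change" and task.risk == "high":
        return "panel"
    # Assumption: a kind with no entry has no evidence yet, so run a tournament.
    return POLICY.get(task.kind, "tournament")
```

The table is data, not logic, so it can be rewritten by outcomes without touching the routing code.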
Score The Whole Attempt
Model selection should not score only the final answer.
It should score the attempt as a system event:
did the patch apply
did gates pass
how much review was needed
how often was evidence missing
how many tokens were spent
how long did it take
did it ask a useful question
did it create follow-up work
A cheap answer that creates review debt may cost more overall than an answer from the expensive model. A slow model that produces admissible work may be cheaper than three fast reruns.
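To make the comparison concrete, here is a small sketch that prices an attempt as model spend plus the review time it caused. The `Attempt` fields, the reviewer rate, and the sample numbers are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    patch_applied: bool
    gates_passed: bool
    review_minutes: float
    model_cost_usd: float  # cost of a single run
    reruns: int = 0


# Assumed price of reviewer time; substitute a value that fits your team.
REVIEW_USD_PER_MINUTE = 1.50


def admissible(a: Attempt) -> bool:
    """An attempt is admissible only if the patch applied and the gates passed."""
    return a.patch_applied and a.gates_passed


def total_cost(a: Attempt) -> float:
    """Model spend across all runs plus the review effort the attempt caused."""
    runs = 1 + a.reruns
    return a.model_cost_usd * runs + a.review_minutes * REVIEW_USD_PER_MINUTE


# Illustrative numbers only.
cheap = Attempt(True, True, review_minutes=40, model_cost_usd=0.20)
strong = Attempt(True, True, review_minutes=5, model_cost_usd=3.00)
```

With these made-up figures the cheap attempt costs more in total than the strong one, because review time dominates the bill.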
Tournaments Are For Boundaries
Most tasks should not be tournaments.
If the task is low risk and routine, a tournament wastes budget. If the task is broad, under-specified, or high-impact, a tournament can be cheaper than picking wrong once and paying for review, rework, and debugging later.
The system should use tournaments where the route is uncertain:
new task kind
high conflict history
large public surface
weak existing gate
unclear human intent
model telemetry disagrees
previous workers failed differently
That means tournaments are a boundary tool. They are used when the system is learning where its routing policy is still weak.
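One way to turn the signals above into a decision is to count how many are present. This is a sketch under stated assumptions: the `RouteSignals` field names mirror the list, and both the equal weighting and the threshold of two are arbitrary choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class RouteSignals:
    new_task_kind: bool = False
    high_conflict_history: bool = False
    large_public_surface: bool = False
    weak_gate: bool = False
    unclear_intent: bool = False
    telemetry_disagrees: bool = False
    workers_failed_differently: bool = False


def uncertainty(signals: RouteSignals) -> int:
    # Each signal counts once; True sums as 1, False as 0.
    return sum(vars(signals).values())


def should_run_tournament(signals: RouteSignals, threshold: int = 2) -> bool:
    # Assumed rule: run a tournament only when enough signals agree
    # that the route is uncertain.
    return uncertainty(signals) >= threshold
```

A routine task with no signals set stays on its usual route, which keeps tournaments rare by default.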
Feedback Changes The Router
The tournament should feed the next decision.
If a cheap model keeps passing parser fixture tasks, it should get more of them. If it keeps failing semantic rebases, those tasks should move up a tier. If a strong model does not improve admission rate for a class of work, the system should stop paying for it there.
This creates a loop:
route task
run model or panel
collect patch, proof, cost, time, review effort
admit, rerun, split, or reject
update model-task policy
route the next task better
The router is not a hard-coded leaderboard. It is a policy that keeps being corrected by outcomes.
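The loop can be sketched as a router that records outcomes and moves a task kind up a tier when its admission rate stays low. The tier names, the sample minimum, and the rate threshold are hypothetical, and this sketch only promotes; moving work back down when a stronger model does not help is left out.

```python
from collections import defaultdict

TIERS = ["cheap", "mid", "strong"]  # hypothetical tier names, cheapest first


class Router:
    def __init__(self, min_samples: int = 5, min_admit_rate: float = 0.6):
        self.min_samples = min_samples
        self.min_admit_rate = min_admit_rate
        # (task kind, tier index) -> [admitted count, total count]
        self.stats = defaultdict(lambda: [0, 0])
        # task kind -> index into TIERS; every kind starts at the cheapest tier
        self.tier_for = defaultdict(int)

    def route(self, kind: str) -> str:
        return TIERS[self.tier_for[kind]]

    def record(self, kind: str, admitted: bool) -> None:
        tier = self.tier_for[kind]
        counts = self.stats[(kind, tier)]
        counts[0] += int(admitted)
        counts[1] += 1
        enough = counts[1] >= self.min_samples
        failing = counts[0] / counts[1] < self.min_admit_rate
        # Promote the task kind once there is enough evidence of failure.
        if enough and failing and tier < len(TIERS) - 1:
            self.tier_for[kind] = tier + 1
```

For example, three rejected semantic rebases with `min_samples=3` would move that kind from the cheap tier to the middle one, while a kind that keeps being admitted stays where it is.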
Panels Are Expensive Evidence
Sometimes the best use of a tournament is not to pick a patch.
It is to produce disagreement.
If three models solve the same task in three incompatible ways, that is evidence that the task is underspecified. The correct next step may be a human question, a narrower task, or a stronger oracle. The tournament did not fail. It revealed uncertainty before shared state moved.
- Candidates: Three rebases. Each model adapted the patch through the new head.
- Agreement: Two preserve type shape, one changes public API. Disagreement is itself evidence.
- Cost: Panel cost exceeded single worker. Worth it only because the public boundary was risky.
- Route: Ask human, then rerun narrow gate. The tournament produced the next decision, not an automatic merge.
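A panel result like this one can be reduced to a simple check for disagreement. In the sketch below, each candidate is summarized by a hypothetical fingerprint string describing its effect on the public surface; how such a fingerprint would be computed is outside the scope of the example, and the route names are assumptions.

```python
from collections import Counter


def panel_decision(candidates: dict) -> str:
    """Map a panel's results to a next step.

    candidates maps model name -> fingerprint of the patch's public shape
    (a hypothetical summary string, e.g. "keeps_type_shape").
    """
    shapes = Counter(candidates.values())
    if len(shapes) == 1:
        # Unanimous panel: candidates still have to pass the normal gates.
        return "run_gates"
    # Any disagreement is treated as a sign the task is underspecified.
    return "ask_human"


panel = {
    "model_a": "keeps_type_shape",
    "model_b": "keeps_type_shape",
    "model_c": "changes_public_api",
}
```

Here `panel_decision(panel)` yields `"ask_human"`: a two-to-one split is not treated as a vote to merge the majority patch.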
Optimize For Accepted Work
The cost unit should not be tokens.
Tokens matter, but they are only one part of the bill. The real unit is accepted work under evidence.
cost per accepted patch
cost per useful question
cost per proof bundle
cost per avoided human review
cost per correctly rejected change
That changes the optimization target.
The system should not simply minimize model spend. It should minimize total coordination cost while keeping throughput and safety high.
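As a rough illustration of the first unit, cost per accepted patch divides everything spent on a class of work by the number of attempts that were actually admitted. The function name and the sample figures are assumptions; each cost is meant to include review and rework, not only model spend.

```python
def cost_per_accepted(attempts):
    """attempts: list of (total_cost_usd, accepted) pairs.

    Rejected attempts still count toward spend, so reruns raise the price
    of the patch that is finally accepted.
    """
    spend = sum(cost for cost, _ in attempts)
    accepted = sum(1 for _, ok in attempts if ok)
    # No accepted work means the cost per accepted patch is unbounded.
    return spend / accepted if accepted else float("inf")


# Illustrative numbers only: three cheap tries versus one strong try.
cheap_runs = [(12.0, False), (12.0, False), (12.0, True)]
strong_runs = [(20.0, True)]
```

With these made-up figures, the cheap route costs 36.0 per accepted patch and the strong route 20.0, even though each cheap attempt is individually less expensive.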
The Mental Model
Model choice is part of the merge system.
The same evidence that decides whether work can be admitted should also teach the system which model should attempt that kind of work next time.
A tournament is not a spectacle. It is a controlled way to learn routing policy.