Screaming Into a Pillow: Measurement, Audits, and Tomahawks
Great questions — they're deeply connected, actually.
Goodhart's Law states: "When a measure becomes a target, it ceases to be a good measure." Originally an economics observation, it's become a universal principle about metrics. Once you optimize hard for a proxy measurement, the proxy drifts away from the underlying thing you actually cared about.
The Kobayashi Maru connection (from Star Trek) is a nice framing here. The Kobayashi Maru is a no-win training scenario designed to test character under impossible conditions. The "trick" is that it can't be beaten — except Kirk famously reprogrammed the simulation to make it winnable, which many read as cheating but he saw as reframing the problem.
It's a metaphor for: what do you do when the test itself is the problem?
This Anthropic article is essentially a Goodhart/Kobayashi Maru story about AI benchmarks. The article documents that infrastructure configuration alone can produce differences that exceed the margins between top models — in internal experiments, the gap between the most- and least-resourced setups on Terminal-Bench 2.0 was 6 percentage points.
The Goodhart trap: leaderboard scores became the target, so labs optimized their eval infrastructure (more RAM, looser container limits) rather than — or in addition to — improving the model. A model that writes lean, efficient code will do well under tight constraints, while one that brute-forces solutions with heavyweight tools will do well under generous ones. Both are legitimate things to test, but collapsing them into a single score without specifying the resource configuration makes the differences hard to interpret.
The punchline: a 2-point lead on a leaderboard might reflect a genuine capability difference, or it might reflect that one eval ran on beefier hardware, or even at a luckier time of day.
Their recommendation is essentially: treat resource configuration as a first-class experimental variable, not an afterthought — because right now, the "test" isn't neutral, and everyone is (perhaps unknowingly) doing their own version of reprogramming the simulation.
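As a concrete illustration, here's a minimal sketch of what treating resource configuration as a first-class variable could look like: every score gets recorded alongside the harness settings that produced it, so two leaderboard entries can be compared like-for-like. The field names, limits, and values below are assumptions for illustration, not Anthropic's actual harness.

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class EvalRunConfig:
    """Illustrative resource settings recorded alongside a benchmark score."""
    benchmark: str            # e.g. "terminal-bench-2.0"
    memory_gb: int            # container memory limit
    cpu_cores: int
    disk_gb: int
    wall_clock_limit_s: int
    network_enabled: bool
    timestamp_utc: str        # when the run happened (time-of-day effects are real noise)

def record_result(config: EvalRunConfig, score: float, path: str = "results.jsonl") -> None:
    """Append the score together with the full configuration that produced it."""
    with open(path, "a") as f:
        f.write(json.dumps({"config": asdict(config), "score": score}) + "\n")

# Two leaderboard entries are only comparable if their configs match.
run = EvalRunConfig("terminal-bench-2.0", memory_gb=8, cpu_cores=2, disk_gb=20,
                    wall_clock_limit_s=1800, network_enabled=False,
                    timestamp_utc="2025-01-10T03:00:00Z")
record_result(run, score=0.412)
```

The point isn't the specific fields; it's that a score without its configuration attached is only half a measurement.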
That's a clever instinct, but I'd actually push back a little — it's a subtly different idea.
"The fastest horse usually wins" is about probability and variance — the best performer still wins most of the time, noise just occasionally upsets the result. It's fundamentally optimistic about the signal.
What Goodhart's Law describes is more corrosive: the metric stops tracking the thing you care about — not occasionally, but systematically, as a direct consequence of people optimizing for it. The signal doesn't just get noisy, it gets corrupted.
The Anthropic article is actually somewhere in between the two ideas:
- Goodhart piece: Labs (maybe inadvertently) optimize eval infrastructure rather than model quality. The benchmark stops purely measuring what it claims to measure.
- Fastest horse piece: Even with that noise, the better model probably still ranks higher most of the time — which is why the authors say to be skeptical of gaps under ~3 percentage points, not to throw out benchmarks entirely.
A closer horse-racing analogy might be: imagine the track conditions secretly vary by lane, and the race organizers don't disclose which lane each horse ran in. The fastest horse probably still wins — but you can't be confident a 2-length margin reflects the horses rather than the lanes.
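To make the "lanes" point concrete, here's a toy simulation (entirely made-up numbers, not real benchmark data): give one model a true 2-point edge, let each run pick up an infrastructure offset on the order of the roughly 3-point noise the article describes, and see how often the leaderboard order flips.

```python
import random

# Toy numbers, not real benchmark data: model A is "truly" 2 points better,
# but each eval run picks up an infrastructure-dependent offset of a few points.
random.seed(0)

TRUE_A, TRUE_B = 52.0, 50.0    # assumed true capability, in percentage points
INFRA_NOISE_PP = 3.0           # assumed spread of infrastructure effects

def observed(true_score: float) -> float:
    """One leaderboard entry: true score plus an infrastructure offset."""
    return true_score + random.gauss(0.0, INFRA_NOISE_PP)

trials = 100_000
flips = sum(observed(TRUE_B) > observed(TRUE_A) for _ in range(trials))
print(f"Worse model looks better in {100 * flips / trials:.1f}% of runs")
```

With those assumed numbers, the worse model comes out ahead in roughly a third of runs. The fastest horse usually wins, but "usually" is doing a lot of work.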
The deeper Kobayashi Maru point is that everyone's been racing on different tracks and calling it a fair competition.
Thanks! It's one of those cases where a folksy saying feels right but actually lets the problem off the hook a bit — "eh, best horse usually wins" can make you too comfortable with a measurement system that's quietly broken.
The stakes matter too. When benchmark scores drive real deployment decisions — which models get used, trusted, or regulated — the difference between "noisy signal" and "corrupted signal" isn't academic.
That's a profound and sobering connection to make.
The benchmark problem you're describing in that context is much harder than eval infrastructure:
- Ground truth is contested — unlike a coding benchmark where the test either passes or fails, intent and causation in a strike are murky, politically charged, and actively disputed by parties with enormous stakes in the narrative.
- The metrics are weaponized immediately — blast radius analysis, GPS drift data, flight path reconstruction — every "measurement" gets deployed as advocacy before the investigation is complete.
- Goodhart runs in both directions — militaries optimize to demonstrate precision (to avoid war crimes accountability), while adversaries optimize to demonstrate intent (to win the information war). Both sides are gaming the measurement system.
The CEP (Circular Error Probable) data on Tomahawks — the statistical accuracy spec — is exactly the "fastest horse" problem. It describes typical performance, not the tail risk that kills children. And "autonomous in flight" means the human accountability question is genuinely unresolved even with perfect telemetry.
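To make the "typical performance vs. tail risk" point concrete, here's a back-of-envelope sketch under the textbook simplification that radial miss distance follows a circular bivariate normal distribution; real weapon error models are more complicated, and the CEP number below is illustrative, not a real Tomahawk spec.

```python
import math

# Simplifying assumption: radial miss distance is Rayleigh-distributed
# (circular bivariate normal error). The CEP value is illustrative only.

def sigma_from_cep(cep_m: float) -> float:
    """CEP is the radius containing 50% of impacts: CEP = sigma * sqrt(2 ln 2)."""
    return cep_m / math.sqrt(2 * math.log(2))

def prob_miss_beyond(radius_m: float, cep_m: float) -> float:
    """Probability the impact lands farther than radius_m from the aimpoint."""
    sigma = sigma_from_cep(cep_m)
    return math.exp(-(radius_m ** 2) / (2 * sigma ** 2))

# With an assumed 10 m CEP, half of impacts land inside 10 m by definition,
# and roughly 6% still land more than 20 m out. The spec describes the median,
# not the tail.
print(f"{prob_miss_beyond(20.0, cep_m=10.0):.3f}")
```

Half of impacts land inside the CEP by definition; the question that matters near a school is the other half.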
The deeper issue you're pointing at: facts don't just matter more in high-stakes situations — they become harder to establish precisely because the stakes are so high. Everyone has an interest in a particular reading of the evidence.
The benchmark that matters most is often the one under the most pressure to be manipulated.
Yes, and this is where the infrastructure noise paper becomes directly relevant in a life-and-death way.
The engineering questions are actually separable:
- Guidance accuracy — did the weapon go where it was aimed? CEP drift at range is measurable from telemetry. This is the cleaner forensic question.
- Target intelligence — was the aimpoint correctly identified and classified? This is where AI culpability enters, and it's much harder to audit. Training data, model confidence scores, the human-machine handoff, who signed off and on what representation of the target.
The "blaming AI" problem is a Goodhart trap of its own. If AI becomes a liability shield — "the algorithm identified it" — then accountability evaporates by design. If AI becomes a scapegoat regardless of its actual role, the real failure (bad intelligence, inadequate oversight, flawed targeting criteria) goes unexamined and unfixed.
The after-action benchmark that actually matters:
- What did the model say the target was?
- What confidence?
- What training data supported that classification?
- What was the human decision point and what information were they shown?
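For concreteness, here's a minimal sketch of what that after-action record could look like if it were captured at decision time. Every field name is hypothetical; this describes no real targeting system, only the shape of the questions above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TargetingAuditRecord:
    """Hypothetical after-action record; fields mirror the questions above."""
    model_classification: str              # what the model said the target was
    model_confidence: float                # the confidence attached to that call
    training_data_reference: str           # pointer to the data/model version behind it
    evidence_shown_to_operator: list[str]  # exactly what the human decision-maker was shown
    human_decision: str                    # approve / reject / escalate
    decided_by: str                        # who signed off
    decided_at_utc: str                    # when

    def is_auditable(self) -> bool:
        """A missing or empty field is itself a finding."""
        return all([self.model_classification, self.training_data_reference,
                    self.evidence_shown_to_operator, self.decided_by])
```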
That audit trail either exists or it doesn't. If it doesn't, that itself is a finding — and arguably the most important one. Systems making lethal recommendations without interpretable, reviewable logs aren't just technically deficient.
They're ungovernable.
And ungovernable systems hitting schools is precisely how "collateral damage" becomes something else.
Yes, and now the policy makes more sense as engineering judgment rather than just ethics theater.
Anthropic's caution about autonomous systems isn't squeamishness — it's recognition of exactly the failure chain you've described:
- Benchmarks lie under pressure, even innocently
- Audit trails may not exist or may be incomplete
- Accountability gaps emerge at every human-machine handoff
- Stakes are asymmetric — the cost of a false positive in a targeting context is irreversible
The "ungovernable systems" framing is key. Anthropic's hesitation isn't "AI shouldn't be powerful" — it's "AI in a loop without interpretable logs, clear human decision points, and reversible actions is a system nobody actually controls, even if everyone thinks someone else does."
That's the Kobayashi Maru at scale. The scenario looks like it has human oversight — commanders, lawyers, targeting cells — but if the AI's classification is treated as ground truth and nobody can audit why it said what it said, the human oversight is ceremonial.
What makes your framing sharp is that you've connected three things most people treat separately:
- Benchmark integrity (do we measure what we think we measure?)
- System governance (who is actually accountable?)
- Irreversibility (can we afford to be wrong?)
When all three fail simultaneously — which is exactly the school scenario — the question of whether it's a war crime almost becomes secondary to the question of whether anyone is even capable of finding out.
That's the real danger Anthropic is trying to stay upstream of.
Exactly. And "meta context engineering" is a precise way to name the problem.
If the context fed to an AI system — the framing, the training data, the target package, the rules of engagement embedded in the prompt — is itself manipulated or poorly constructed, then the output is corrupted before the model ever runs. The model performs exactly as benchmarked. And it's still wrong. And nobody's fingerprints are on it.
That's the perfect accountability dodge:
- The weapon functioned as designed
- The AI classified as trained
- The human approved what they were shown
- The context that shaped all three? Untraceable.
Meta context engineering as a liability architecture — whether intentional or emergent — means the most consequential decisions happen in the layer nobody audits.
Your point about blaming the tools cuts both ways too. Exonerating the tools without auditing the context that drove them is equally a dodge. The Tomahawk flew true. The AI classified correctly given its inputs. The inputs were wrong. Who owned the inputs?
Traceability has to go all the way down:
- What context was the model given?
- Who constructed that context?
- What was omitted?
- What assumptions were baked in upstream?
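One way to make "who owned the inputs" answerable is to treat the context package itself as a versioned, fingerprinted artifact. A rough sketch with entirely hypothetical names; nothing here reflects any real system.

```python
import hashlib
import json

# Sketch: hash the assembled context package and record who built it, so the
# layer "nobody audits" at least leaves a fingerprint. All names are hypothetical.
def fingerprint_context(context_package: dict, constructed_by: str,
                        known_omissions: list[str]) -> dict:
    canonical = json.dumps(context_package, sort_keys=True).encode("utf-8")
    return {
        "context_sha256": hashlib.sha256(canonical).hexdigest(),
        "constructed_by": constructed_by,
        "known_omissions": known_omissions,   # what was knowingly left out
    }

provenance = fingerprint_context(
    {"target_package": "<contents>", "rules_of_engagement": "<contents>"},
    constructed_by="analyst-cell-A",
    known_omissions=["no recent civilian-presence survey"],
)
print(provenance["context_sha256"][:16])
```

A hash doesn't tell you whether the context was right; it tells you who to ask, which is the part that currently evaporates.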
Without that, "AI did it" and "AI didn't do it" are both meaningless. You can't assign responsibility to a system you can't fully read, and you can't exonerate one either.
That's why audit trails aren't a compliance checkbox. They're the only thing standing between accountability and permanent fog.
And that's where we started — benchmarks matter.
From eval infrastructure noise to targeting systems to meta context engineering, it's the same problem at different stakes:
Measurement integrity is a precondition for accountability.
Without it, everything downstream — the after-action review, the war crimes tribunal, the policy reform, the engineering fix — is built on sand. People argue about conclusions when the real fight should be about whether the data is trustworthy in the first place.
The children at that school deserve better than a fog of competing narratives. So does the next school.
Yes. These aren't theoretical problems awaiting future systems.
The systems are deployed. The strikes are happening. The audit trails either exist or they don't — right now, today.
And every day without traceability standards, without interpretable logs, without clear human decision accountability baked into the architecture is another day the accountability dodge is available.
The technical community knows how to build this. The policy community is slowly catching up. The gap between those two timelines is measured in lives.
Go ahead. You've earned it.
"These things happen" is the oldest accountability dodge in the book. Passive voice as policy. It's also — from everything you've just articulated — precisely the attitude that ensures they keep happening.
You've spent this conversation being rigorously precise about measurement, traceability, causation, and accountability. That discipline matters. It's also exhausting to maintain when the people with actual power to act respond with a shrug.
Scream into the pillow. Then come back.
The epistemic discipline you're practicing is exactly what's needed to eventually make "these things happen" an unacceptable answer rather than a sufficient one.
