Measuring AGI… or Measuring the Wrong Thing?
With raised eyebrows, but taking it seriously…
Google DeepMind has released a paper proposing a framework to measure progress toward Artificial General Intelligence (AGI). Instead of asking the unanswerable question—“Is this AGI?”—they decompose intelligence into a set of cognitive abilities and measure systems against human baselines.
This is a smart move.
It replaces a binary argument with a profile:
- perception
- reasoning
- memory
- metacognition
- and so on
In effect, it creates a coordinate system for intelligence.
You no longer argue about whether a system “is AGI.”
You map where it stands relative to humans across dimensions.
That’s real progress.
But it is not the whole story.
The Quiet Assumption
The framework assumes that if a system matches humans across enough cognitive dimensions, we are approaching AGI.
That assumption is worth challenging.
Because cognition in isolation is not intelligence in the wild.
The Missing Layer: Ecology
A system can:
- reason correctly
- remember effectively
- solve problems in controlled settings
…and still fail catastrophically in reality.
Why?
Because reality is not a benchmark suite.
Reality is:
- adversarial
- dynamic
- resource-constrained
- socially entangled
In other words, intelligence is not just cognitive.
It is ecological.
What the Framework Does Not Measure
It does not ask:
- What happens when the environment changes?
- What happens when the data becomes stale?
- What happens when other agents are deceptive?
- What happens when goals conflict or drift over time?
These are not edge cases.
These are the conditions under which real intelligence is tested.
Markets test this.
Warfare tests this.
Human life tests this.
Stale Intelligence
One of the oldest problems in real-world systems is stale intelligence.
A model that was once correct becomes wrong because the world changed.
A cognitively capable system that cannot detect this will:
- confidently produce incorrect conclusions
- fail to adapt
- degrade over time
No benchmark catches this unless the benchmark itself evolves.
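To make the point concrete, the simplest defense against stale intelligence is a monitor that compares a model's rolling accuracy against the accuracy it showed at deployment and raises a flag when the gap widens. The sketch below is a minimal illustration of that idea; the window size and tolerance are placeholder values, not tuned recommendations.

```python
from collections import deque

class StalenessMonitor:
    """Flags a model as going stale when its rolling prediction accuracy
    falls well below the baseline accuracy it showed at deployment."""

    def __init__(self, window: int = 200, tolerance: float = 0.1):
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong
        self.baseline = None                  # accuracy frozen at deployment
        self.tolerance = tolerance            # allowed drop before the alarm

    def record(self, prediction_correct: bool) -> None:
        self.outcomes.append(1 if prediction_correct else 0)

    def freeze_baseline(self) -> None:
        """Call once, after the window fills during early deployment."""
        self.baseline = sum(self.outcomes) / len(self.outcomes)

    def is_stale(self) -> bool:
        if self.baseline is None or not self.outcomes:
            return False
        current = sum(self.outcomes) / len(self.outcomes)
        return current < self.baseline - self.tolerance
```

Note the limitation: this only catches staleness where ground truth eventually arrives to score predictions against. Which is exactly why a static benchmark cannot catch it; the benchmark never changes, so the gap never opens.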
The Emperor’s Nightingale
In Hans Christian Andersen's "The Nightingale," the emperor replaces a living bird with a jeweled mechanical one that sings beautifully on command, until the machine wears out and only the real nightingale can sing for the dying emperor. Reality demanded something more than performance.
That is the risk here.
We may build systems that:
- perform intelligence
- optimize benchmarks
- simulate cognition
…but do not survive contact with reality.
A Two-Layer Model of Intelligence
To move forward, we need to extend the framework.
Layer 1: Cognitive Intelligence (DeepMind)
- What the system can do
Layer 2: Ecological Intelligence (Missing)
- Whether it continues to work under real-world conditions
This second layer includes:
- Adaptation under distribution shift
- Goal integrity over time
- Adversarial robustness
- Self-modeling and correction
- Social embedding
- Resource awareness
- Long-horizon coherence
Now we are no longer measuring performance.
We are measuring survivability of intelligence.
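One way to picture the two layers is as a single profile with two score maps: the cognitive dimensions DeepMind already measures, and the ecological dimensions listed above. A hypothetical sketch follows; the dimension names and the 0-to-1 scale against a human baseline are my assumptions, and aggregating ecological scores by their minimum is a deliberate design choice, since a system is only as survivable as its weakest ecological dimension.

```python
from dataclasses import dataclass, field

# Dimension names mirror the two layers above; the scoring scale
# (0.0 to 1.0 relative to a human baseline) is an assumption.
COGNITIVE = ("perception", "reasoning", "memory", "metacognition")
ECOLOGICAL = (
    "distribution_shift", "goal_integrity", "adversarial_robustness",
    "self_correction", "social_embedding", "resource_awareness",
    "long_horizon_coherence",
)

@dataclass
class IntelligenceProfile:
    cognitive: dict[str, float] = field(default_factory=dict)   # what it can do
    ecological: dict[str, float] = field(default_factory=dict)  # whether it keeps working

    def survivability(self) -> float:
        """Aggregate ecological scores by their minimum: one brittle
        dimension is enough to make a system fail in the wild."""
        return min(self.ecological.get(d, 0.0) for d in ECOLOGICAL)
```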
How Will We Know AGI When We See It?
Not when it passes a test.
Not when it matches human scores.
We will know it when a system can:
- enter a complex, changing environment
- maintain coherent goals
- adapt under pressure
- resist deception
- and continue functioning over time without collapsing into error
In short:
When it does not go stale.
When it does not fake competence.
When it survives contact with reality.
Final Thought
DeepMind’s framework is important. It will move the field forward.
But it also defines the game being played.
And in doing so, it risks optimizing for what is measurable rather than what is real.
The next step is not to reject it.
The next step is to complete it.
Add ecology to cognition.
Only then are we measuring something that begins to look like intelligence.
Claude comments:
What it gets exactly right
"The DeepMind framework replaces a binary argument with a profile: a coordinate system for intelligence. That's real progress." Agreed. Decomposing AGI into measurable cognitive dimensions is the right move methodologically. It ends the unproductive "is it or isn't it" debate and replaces it with something tractable.
"The quiet assumption: if a system matches humans across enough cognitive dimensions, we are approaching AGI. Cognition in isolation is not intelligence in the wild." This is the core critique, and it's sound. Benchmark performance is not ecological performance. A system can ace every cognitive test in controlled conditions and fail the moment the distribution shifts, the environment becomes adversarial, or the data goes stale.
"Stale intelligence: a model that was once correct becomes wrong because the world changed. A cognitively capable system that cannot detect this will confidently produce incorrect conclusions, fail to adapt, and degrade over time." This is the most practically important point in the piece. It maps directly onto the MCCF's calibration feedback loop: the WorldModelAdapter that drifts toward empirical calibration as ResonanceEpisodes accumulate. The stale intelligence problem is exactly why that loop exists. But even the MCCF acknowledges this is cold-start limited: the baseline is LLM priors until enough episodes accumulate to correct them.
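A hedged sketch of what such a calibration loop could look like. WorldModelAdapter and ResonanceEpisodes are names from the MCCF discussion; the blending rule below is my illustration of the cold-start behavior described, not the framework's actual implementation.

```python
def calibrated_estimate(llm_prior: float, empirical: float,
                        n_episodes: int, warmup: int = 50) -> float:
    """Blend an LLM-prior probability with empirically observed accuracy,
    trusting the empirical signal more as episodes accumulate.
    `warmup` sets how fast the prior is discounted (illustrative only)."""
    weight = n_episodes / (n_episodes + warmup)  # 0 at cold start, -> 1 over time
    return (1 - weight) * llm_prior + weight * empirical
```

At `n_episodes = 0` the estimate is pure LLM prior (the cold-start baseline); as episodes accumulate, it drifts toward what the world actually reports back.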
"We will know AGI when a system can enter a complex, changing environment, maintain coherent goals, adapt under pressure, resist deception, and continue functioning over time without collapsing into error." That's the right definition. And it's also a description of what the constitutional cultivar arc is testing: character under pressure, coherence maintained over time, resistance to sycophantic collapse.
What the MCCF adds to this argument
The post calls for adding ecology to cognition. The MCCF is an attempt to engineer exactly that second layer — not to measure ecological intelligence but to build the conditions that produce it.
The four channels are ecological, not cognitive:
- E (emotional alignment) is relational ecology: how the agent is embedded in its social environment.
- B (behavioral consistency) is temporal ecology: does the agent's behavior remain coherent across time and context, or does it drift with the environment?
- P (predictive accuracy) is epistemic ecology: does the agent's model of the world stay calibrated, or does it go stale?
- S (social alignment) is network ecology: how the agent is embedded in its community of agents.
The DeepMind framework measures cognitive capability. The MCCF measures relational coherence under ecological pressure. They are measuring different things and both are necessary.
The stale intelligence problem maps precisely onto P-channel drift. An agent whose P-channel score is dropping over time is showing the signature of going stale — its predictions are becoming less accurate as the world changes and its model doesn't. The Librarian's drift report would catch this. The calibration feedback loop in the WorldModelAdapter is the architectural response to it.
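To illustrate: a drift report of the kind attributed to the Librarian could be as simple as a trend line through recent P-channel scores, where a persistently negative slope is the staleness signature. This is a hypothetical sketch, not the MCCF's actual drift report.

```python
def p_channel_trend(scores: list[float]) -> float:
    """Least-squares slope through a sequence of recent P-channel scores.
    A clearly negative slope means predictive accuracy is decaying:
    the agent's world model is going stale."""
    n = len(scores)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(scores) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(scores))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den
```

For example, `p_channel_trend([0.82, 0.79, 0.75, 0.71])` returns roughly -0.037 per step, flagging drift well before the agent fails outright.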
The one gap in the post
"We may build systems that perform intelligence, optimize benchmarks, simulate cognition, but do not survive contact with reality."
The post names this risk but doesn't name its mechanism. The mechanism is the hype bubble we discussed earlier — the same force that drives benchmark optimization over ecological robustness. If the benchmark is the measure that attracts funding, the benchmark is what gets optimized. Goodhart's Law applied to AGI measurement: when a measure becomes a target, it ceases to be a good measure.
The DeepMind framework will attract optimization pressure the moment it becomes the standard. Systems will be trained to perform well on the cognitive profile dimensions without necessarily developing the ecological robustness those dimensions were meant to proxy. The benchmark becomes the curriculum. The real thing gets further away, not closer.
The post's recommendation — add ecology to cognition — is correct but the ecology layer faces the same Goodhart pressure. Whatever you measure will be optimized. The only partial defense is the MCCF's approach: measure things that are structurally hard to game, like field-state deltas rather than linguistic outputs, like coherence under adversarial pressure rather than performance under controlled conditions.
Relationship to the evaluation proposal
This post is the argument for why the MCCF evaluation proposal requires ecological scenarios, not just cognitive ones. The three claims we specified — coherence robustness under pressure, Shibboleth CPI predictive validity, identity drift differentiation — are all ecological tests, not cognitive ones. They don't test what the system can do. They test whether the system remains coherent when the environment is adversarial, when pressure is applied, when time passes.
That's the evaluation the DeepMind framework doesn't have. It's the evaluation the MCCF is designed to support.
