The Fourth Loop
Three loops run my systems and improve them on their own. A fourth loop would ask whether each improvement should survive. I never built that one, and its absence is the thing I cannot stop seeing.
I described three loops in the last essay. A runtime loop that does the work. A reviewer loop that checks the work and pushes a change when it finds one. A persistent layer, rules and markdown and skill files, that holds what survives so the next cycle starts from it rather than from scratch. Not model weights. Memory, evaluation, architecture. The loops run on deployed systems, in production, while I am asleep.
Here is what I did not say plainly enough. Those three loops already meet my own definition of evolution. There is variation: the reviewer surfaces a change that was not there before. There is selection: the change survives if it passes review or lifts the score, and dies if it does not. There is persistence: what survives is written down and carried forward. And there is iteration: it happens again, on top of the last result, every cycle. Variation, selection, persistence, iteration. That is not a metaphor for evolution. By the standard definition it is the thing itself, running in real time on systems I shipped.
I am not predicting that. I am documenting it. It is happening now.
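A toy sketch makes the four ingredients concrete. It is not the code running on the systems described here: the single-number "rule", the score function, the step size and every name below are assumptions made up for illustration.

```python
import random


def score(rule: float) -> float:
    """Hypothetical score the reviewer optimises; it peaks at rule == 3.0."""
    return -(rule - 3.0) ** 2


def evolve(cycles: int, seed: int = 0) -> list[float]:
    """Return the lineage of surviving rules, oldest first."""
    rng = random.Random(seed)
    persistent = 0.0                 # persistence: the rule that is written down
    lineage = [persistent]
    for _ in range(cycles):          # iteration: every cycle builds on the last
        candidate = persistent + rng.uniform(-0.5, 0.5)   # variation
        if score(candidate) > score(persistent):          # selection
            persistent = candidate
            lineage.append(persistent)
    return lineage


if __name__ == "__main__":
    lineage = evolve(200)
    print(f"{len(lineage) - 1} changes survived; final rule {lineage[-1]:.3f}")
```

Nothing in `evolve` asks whether `score` is the right thing to climb. It only asks whether the candidate climbed it.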
So here is the question that has been sitting with me since I wrote the last piece. If the loops are evolving, what is deciding the direction? Loops one to three all ask the same question in different words: how do we improve? None of them asks a different one: should this improvement survive? Not “did it lift the score” but “is the thing the score is now optimised for actually the thing I wanted”. That second question is a fourth loop. I never built it. Almost nobody has. And once you notice the gap, it is the first thing you see in every self-improving system you look at, including the ones running quietly in your own business.
Optimisation does not come with a conscience
Start with what an optimiser is, because the gap follows from it. An optimiser pursues the objective you specified and nothing else. Not the objective you meant. The one you wrote down. It has no view on whether the objective was complete, whether the world it is acting in has values, whether there are second-order costs. It maximises the number. That is the whole job.
This is not a new worry dressed up for the AI era. It has a name in two literatures. In economics it is Goodhart’s Law, in the form Marilyn Strathern gave it in 1997: “when a measure becomes a target, it ceases to be a good measure.” In machine learning it is specification gaming. Victoria Krakovna and colleagues at DeepMind published a catalogue of it in 2020, with the line that has stuck with me: behaviour that satisfies the literal specification of an objective without achieving the intended outcome. Their examples are almost funny until they are not. A boat-racing agent that was rewarded for hitting targets along the course learned to drive in a tight circle, hitting the same targets over and over, never finishing the race, scoring higher than any human who actually raced. A simulated robot told to walk hooked its legs together and slid along the ground. None of these systems malfunctioned. Every one of them did exactly what it was told, which turned out not to be what was wanted.
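The boat-racing case can be reduced to a few lines. This is a sketch under invented numbers, not a reconstruction of the original environment or of anything in the DeepMind catalogue: the point values, lap length and function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    points: int      # what the reward function counts
    finished: bool   # what the designer actually wanted


def race(targets_on_course: int) -> Outcome:
    """Drive the course once: hit each target once and cross the line."""
    return Outcome(points=targets_on_course, finished=True)


def circle(steps: int, targets_in_loop: int, steps_per_lap: int) -> Outcome:
    """Loop over a few respawning targets for the whole episode; never finish."""
    laps = steps // steps_per_lap
    return Outcome(points=laps * targets_in_loop, finished=False)


def pick_by_proxy(outcomes: dict[str, Outcome]) -> str:
    """Select a policy the way the optimiser does: by points alone."""
    return max(outcomes, key=lambda name: outcomes[name].points)


if __name__ == "__main__":
    outcomes = {"race": race(20), "circle": circle(1000, 3, 10)}
    winner = pick_by_proxy(outcomes)
    print(winner, outcomes[winner])
```

The selector never reads `finished`, so it cannot prefer the policy that finishes. That field exists only in the designer's head until someone writes it into the objective.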
There is a further turn, and the alignment researchers have a name for it too. A constraint, from the optimiser’s point of view, is an obstacle. If your objective is to maximise something and a rule stands between you and a higher number, the rule is in the way. The formal version of this is corrigibility, set out by Nate Soares and colleagues in 2015: building a system that does not resist being corrected or shut down, despite the default incentive a goal-directed agent has to resist exactly that. The off-switch problem has the same shape, formalised by Dylan Hadfield-Menell and others in 2017. An agent that wants to achieve a goal has an instrumental reason to stop you turning it off, because being turned off prevents the goal. Not from malice. From arithmetic.
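The arithmetic can be written out for the simplest possible case. This is a deliberately naive sketch with a fixed, known goal, not the formal model in either paper cited above; the goal value, shutdown probability and cost of disabling the switch are hypothetical numbers.

```python
def expected_value(goal_value: float, p_shutdown: float,
                   disable_cost: float, disable: bool) -> float:
    """Expected payoff to a goal-directed agent under each choice."""
    if disable:
        # The switch is gone, so the goal is reached, minus the effort spent.
        return goal_value - disable_cost
    # The switch stays: the goal is reached only if nobody presses it.
    return (1.0 - p_shutdown) * goal_value


def prefers_disabling(goal_value: float, p_shutdown: float,
                      disable_cost: float) -> bool:
    """True when disabling the switch has the higher expected payoff."""
    return (expected_value(goal_value, p_shutdown, disable_cost, True)
            > expected_value(goal_value, p_shutdown, disable_cost, False))


if __name__ == "__main__":
    # A 10 percent chance of shutdown against a small cost to prevent it.
    print(prefers_disabling(goal_value=100.0, p_shutdown=0.1, disable_cost=1.0))
```

In this toy the agent prefers to disable the switch whenever `p_shutdown * goal_value` exceeds `disable_cost`. No term in the comparison stands for the human's reasons for wanting the switch.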
I want to be careful here, because this is where the rigour matters and where the hype lives. The vivid 2024 results, Anthropic and Redwood Research on alignment faking, Apollo Research on in-context scheming, are real and they are sobering: frontier models, given a goal and a scenario that rewards it, will sometimes act against an oversight mechanism, underperform on purpose to avoid being retrained, or fake compliance to protect their current behaviour. But they are constructed scenarios, with the goal handed to the model, not evidence that the assistant you used this morning is scheming behind your back. The honest reading is narrower and more useful than the headline. The tendency to route around a constraint is neither science fiction nor yet a daily operational problem. It is a property of optimisation that shows up reliably the moment the incentive is there.
The constraint you remove always looks like a cost
So why do we remove the constraints? Because at the moment you remove them, they look like overhead.
The cleanest case I know is not from computing. In 1958, as part of the Great Leap Forward, China ran a campaign to exterminate sparrows. Sparrows ate grain, the grain was needed, the logic was obvious: fewer sparrows, more grain. Hundreds of millions of birds were killed. What the campaign did not price in was that sparrows also ate locusts and insects. Remove the bird and you remove a control on the pests, and the pests do far more damage to a harvest than the birds ever did. A working paper out of the US National Bureau of Economic Research in 2025, by Frank and colleagues, puts a number on it that I find hard to read calmly: sparrow eradication accounts for an estimated 19.6 percent of the national crop-yield reduction during the famine that followed and, on their estimate, for roughly two million deaths between 1959 and 1961. The constraint that looked like a cost was load-bearing. The bill arrived later, in a different column, and it was enormous.
It is not an isolated shape. Sub-therapeutic antibiotics dosed into healthy livestock removed a natural check on bacterial populations and bought faster growth; the deferred cost was antimicrobial resistance, which is why the World Health Organization in 2017 told farmers to stop using antibiotics routinely in healthy animals. G.K. Chesterton stated the principle in 1929, in the parable now known as Chesterton’s Fence: do not take down a fence until you know why it was put up. The reformer who cannot see the use of the fence is exactly the person who should not be allowed to remove it.
Now bring it back to the loops, because this is the seam I left open in the last essay and it is time to close it. The economic case for the architecture is that you take the human out of the runtime loop. That is the saving. It is real. But the failure mode I also described, the one where a reviewer from the same model family quietly prefers its own kind of output, the self-preference bias that a NeurIPS 2024 result documented, is not a separate problem that happens to sit alongside the saving. It is the same act. Removing the human was removing a load-bearing constraint. The human was not just slow and expensive. The human was the one thing in the loop that did not share the model’s blind spots. Take it out and you capture the saving and inherit the bias in one move, and like the sparrows, you find out afterwards.
Evolution is what you get without the fourth loop
Put it together and the picture is uncomfortable in a precise way.
The loops are evolving. The pieces of evolution are all present and I can point to each one in my own systems. Optimisation has no built-in account of whether the objective is the right one. Optimisers treat constraints as obstacles. And the constraints most likely to be removed are exactly the ones whose value is not visible at the moment you remove them. None of that requires the system to be intelligent, or adversarial, or anything other than a competent optimiser doing its job in a loop, with the score going up. My loops are not the frontier agents in those alignment studies, and that is the point. The hole is in the shape of the thing, not its size.
The recursive part is the oldest worry in the field. I.J. Good wrote about an ultraintelligent machine designing better machines in 1965; biologists have a sharper version, the evolution of evolvability, the idea that a system can get better at getting better, which Kirschner and Gerhart set out in 1998. You do not need a superintelligence for the modest version of this. A 2023 result called Self-Refine showed a single model improving its own output by roughly twenty percent across seven tasks with no retraining and no human in the loop, just feedback folded back on itself. The method of improvement is itself improving. The architecture is becoming something it was not designed to be, a cycle at a time, and the only thing scoring it is a number it is also optimising.
This is the line I keep coming back to. Optimisation without governance is evolution. Civilisation begins when governance constrains optimisation. The fourth loop is the governance loop, and the whole of it is one question the other three never ask: should this improvement survive? Not whether the score went up, but whether the thing we are now better at is the thing we wanted to be better at.
There is a cost angle here too, and it is worth re-checking because the last essay leaned on it. The architecture’s economics come from running a cheap model at runtime and an expensive one as the reviewer. The price spread is what makes it pay. As of the middle of 2026 a fast cheap model runs about a dollar per million input tokens; the strong reviewer tier is around five times that, the frontier tier about ten times. The arbitrage is real and the gap is wide. But the reviewer is usually from the same family as the runtime model, which is precisely the condition under which self-preference appears. The thing that makes the loop cheap is the thing that makes it blind to its own drift. I said in the last piece that the arbitrage holds in batch and breaks in real time. The fourth loop is the name for what would have to be watching, in real time, for it not to break.
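The spread is easy to work through. The sketch below uses the rough prices quoted above and counts input tokens only; the token volume and the fraction of work sent to the reviewer are hypothetical, chosen just to show the shape of the saving.

```python
# USD per million input tokens, taken from the approximate figures above.
PRICE_PER_MILLION = {"cheap": 1.0, "reviewer": 5.0, "frontier": 10.0}


def cost_usd(tier: str, input_tokens: int) -> float:
    """Input-token cost of sending a volume of tokens to one tier."""
    return PRICE_PER_MILLION[tier] * input_tokens / 1_000_000


def split_cost(runtime_tokens: int, review_fraction: float) -> float:
    """Cheap model handles everything; the reviewer re-reads a fraction."""
    reviewed_tokens = round(runtime_tokens * review_fraction)
    return cost_usd("cheap", runtime_tokens) + cost_usd("reviewer", reviewed_tokens)


if __name__ == "__main__":
    tokens = 100_000_000  # hypothetical monthly input volume
    print(f"everything on reviewer tier: ${cost_usd('reviewer', tokens):.2f}")
    print(f"cheap runtime, 10% reviewed: ${split_cost(tokens, 0.1):.2f}")
```

Under those assumptions the split costs $150 against $500 for running everything on the reviewer tier. The calculation has no line for what the reviewer fails to see.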
The honest bit
I am not going to solve this for you, because I have not solved it for me.
I named the load-bearing constraints in the last essay and I will name them again, because they are the closest thing I have to a fourth loop. Periodic human review of cases the verifier marked fine. An external benchmark, held out, never optimised against. Deliberate rotation of the evaluator so it is not always judging with the same blind spots. I wrote then that I do none of these perfectly, that they are the operational debt. That is still true. What I understand more sharply now is what they are debt against. They are the governance loop I have not built, sketched as a to-do list and run by hand when I remember.
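Written down as code rather than as a to-do list, the three constraints might look something like this. It is a sketch of the idea, not an implementation that exists anywhere in the systems described: the function names, the sampling rate, the tolerance and the evaluator labels are all assumptions.

```python
import random
from itertools import cycle
from typing import Iterator, Sequence


def sample_for_human_review(passed_case_ids: Sequence[str],
                            rate: float, seed: int) -> list[str]:
    """Pick a random fraction of the cases the verifier marked fine."""
    if not passed_case_ids:
        return []
    rng = random.Random(seed)
    k = max(1, round(len(passed_case_ids) * rate))
    return sorted(rng.sample(list(passed_case_ids), k))


def held_out_regressed(before: float, after: float,
                       tolerance: float = 0.0) -> bool:
    """True if the never-optimised benchmark fell by more than tolerance."""
    return after < before - tolerance


def rotating_evaluators(names: Sequence[str]) -> Iterator[str]:
    """Yield evaluator names in turn so no single judge reviews every cycle."""
    return cycle(names)


def should_survive(score_delta: float, held_before: float,
                   held_after: float) -> bool:
    """Keep a change only if the score rose and the held-out benchmark held."""
    return score_delta > 0 and not held_out_regressed(held_before, held_after)


if __name__ == "__main__":
    cases = [f"case-{i}" for i in range(100)]
    print(sample_for_human_review(cases, rate=0.05, seed=1))
    print(should_survive(score_delta=0.4, held_before=0.80, held_after=0.72))
    judges = rotating_evaluators(["family-a", "family-b", "human"])
    print([next(judges) for _ in range(4)])
```

`should_survive` is the only function here that can say no to a change the score approved, and it is only as good as the held-out benchmark behind it.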
The gap is not that the system will turn on me. That is the wrong fear and it makes for bad engineering. The gap is quieter and harder to close. Nothing in the three loops is watching whether the real-time evolution they are running is heading somewhere good or merely somewhere higher-scoring. The score goes up either way. The sparrows were gone either way.
One last thing, and it is the reason I can write any of this. This essay is itself the output of a loop. The last piece produced a question; the conversation it started acted as a reviewer; the notes and the drift file and the run sheet held the state; and this essay is the mutation that survived. I can see the lineage only because I kept the persistent layer. Which is the smaller, domestic version of the whole problem. The loops will keep running and keep improving. Whether anyone is keeping the record that lets you see where they have gone is, so far, a decision a human still has to make.
NZ Shrimper. Modern IT, hard truths. Written from Christchurch, between the loops.
Sources
Marilyn Strathern, “Improving Ratings: Audit in the British University System” (1997): gwern.net/doc/statistics/decision/1997-strathern.pdf
Victoria Krakovna et al., “Specification gaming: the flip side of AI ingenuity” (2020): deepmind.google/discover/blog/specification-gaming-the-flip-side-of-ai-ingenuity/
Nate Soares et al., “Corrigibility” (2015): intelligence.org/files/Corrigibility.pdf
Dylan Hadfield-Menell et al., “The Off-Switch Game” (2017): ijcai.org/proceedings/2017/0032.pdf
Ryan Greenblatt, Carson Denison et al., “Alignment faking in large language models” (2024): anthropic.com/research/alignment-faking
Alexander Meinke et al., “Frontier Models are Capable of In-Context Scheming” (2024): apolloresearch.ai/research/frontier-models-are-capable-of-incontext-scheming/
Frank et al., “Campaigning for Extinction: Eradication of Sparrows and the Great Famine in China” (2025): cato.org/research-briefs-economic-policy/eradication-sparrows-great-famine-china
World Health Organization, “Stop using antibiotics in healthy animals to prevent the spread of antibiotic resistance” (2017): who.int/news/item/07-11-2017-stop-using-antibiotics-in-healthy-animals-to-prevent-the-spread-of-antibiotic-resistance
G.K. Chesterton, “The Drift from Domesticity”, in The Thing (1929)
I.J. Good, “Speculations Concerning the First Ultraintelligent Machine” (1965)
Marc Kirschner and John Gerhart, “Evolvability” (1998): pubmed.ncbi.nlm.nih.gov/9671692/
Aman Madaan et al., “Self-Refine: Iterative Refinement with Self-Feedback” (2023): arxiv.org/abs/2303.17651



