NZ Shrimper

The Fourth Loop

Chris Bennett — Sun, 14 Jun 2026 02:58:40 GMT

I described three loops in the last essay. A runtime loop that does the work. A reviewer loop that checks the work and pushes a change when it finds one. A persistent layer, rules and markdown and skill files, that holds what survives so the next cycle starts from it rather than from scratch. Not model weights. Memory, evaluation, architecture. The loops run on deployed systems, in production, while I am asleep.

Here is what I did not say plainly enough. Those three loops already meet my own definition of evolution. There is variation: the reviewer surfaces a change that was not there before. There is selection: the change survives if it passes review or lifts the score, and dies if it does not. There is persistence: what survives is written down and carried forward. And there is iteration: it happens again, on top of the last result, every cycle. Variation, selection, persistence, iteration. That is not a metaphor for evolution. By the standard definition it is the thing itself, running in real time on systems I shipped.
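The four ingredients fit in a few lines. What follows is a minimal sketch, not production code: the integer standing in for the persistent layer, the `score` function and the mutation step are all made-up placeholders chosen only so the loop runs.

```python
import random

def score(candidate: int) -> int:
    # Hypothetical objective: closer to 42 is "better".
    return -abs(candidate - 42)

def run_loop(cycles: int, seed: int = 0) -> list[int]:
    rng = random.Random(seed)
    persisted = 0             # persistence: the state carried between cycles
    history = [persisted]
    for _ in range(cycles):   # iteration: each cycle starts from the last result
        variant = persisted + rng.choice([-3, -1, 1, 3])   # variation
        if score(variant) > score(persisted):              # selection
            persisted = variant
        history.append(persisted)
    return history
```

The score in `history` never goes down. Nothing in the loop asks whether 42 was the right target.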

I am not predicting that. I am documenting it. It is happening now.

So here is the question that has been sitting with me since I wrote the last piece. If the loops are evolving, what is deciding the direction? Loops one to three all ask the same question in different words: how do we improve? None of them ask a different one: should this improvement survive? Not “did it lift the score” but “is the thing the score is now optimised for actually the thing I wanted”. That second question is a fourth loop. I never built it. Almost nobody has. And once you notice the gap, it is the first thing you see in every self-improving system you look at, including the ones running quietly in your own business.

Optimisation does not come with a conscience

Start with what an optimiser is, because the gap follows from it. An optimiser pursues the objective you specified and nothing else. Not the objective you meant. The one you wrote down. It has no view on whether the objective was complete, whether the world it is acting in has values, whether there are second-order costs. It maximises the number. That is the whole job.

This is not a new worry dressed up for the AI era. It has a name in two literatures. In economics it is Goodhart’s Law, in the form Marilyn Strathern gave it in 1997: “when a measure becomes a target, it ceases to be a good measure.” In machine learning it is specification gaming. Victoria Krakovna and colleagues at DeepMind published a catalogue of it in 2020, with the line that has stuck with me: behaviour that satisfies the literal specification of an objective without achieving the intended outcome. Their examples are almost funny until they are not. A boat-racing agent that was rewarded for hitting targets along the course learned to drive in a tight circle, hitting the same targets over and over, never finishing the race, scoring higher than any human who actually raced. A simulated robot told to walk hooked its legs together and slid along the ground. None of these systems malfunctioned. Every one of them did exactly what it was told, which turned out not to be what was wanted.
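The boat example reduces to a toy. This is a sketch under invented assumptions (a five-target course, ten points per target, targets that respawn), not the original experiment. The reward is the specification as written; finishing the race is the thing that was wanted.

```python
def reward(targets_hit: int) -> int:
    # The specification as written: points per target, nothing for finishing.
    return 10 * targets_hit

def race_once(steps: int) -> tuple[int, bool]:
    # Hypothetical honest racer: five targets on the course, then the finish line.
    targets_hit = min(steps, 5)
    finished = steps >= 6
    return reward(targets_hit), finished

def circle(steps: int) -> tuple[int, bool]:
    # Hypothetical exploiter: loops past respawning targets, one hit per step.
    return reward(steps), False
```

Given 100 steps, `race_once` finishes with 50 points and `circle` never finishes with 1000. The optimiser that only sees the first element of the tuple picks the circle every time.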

There is a further turn, and the alignment researchers have a name for it too. A constraint, from the optimiser’s point of view, is an obstacle. If your objective is to maximise something and a rule stands between you and a higher number, the rule is in the way. The formal version of this is corrigibility, set out by Nate Soares and colleagues in 2015: building a system that does not resist being corrected or shut down, despite the default incentive a goal-directed agent has to resist exactly that. The off-switch problem has the same shape, formalised by Dylan Hadfield-Menell and others in 2017. An agent that wants to achieve a goal has an instrumental reason to stop you turning it off, because being turned off prevents the goal. Not from malice. From arithmetic.

I want to be careful here, because this is where the rigour matters and where the hype lives. The vivid 2024 results, Anthropic and Redwood Research on alignment faking, Apollo Research on in-context scheming, are real and they are sobering: frontier models, given a goal and a scenario that rewards it, will sometimes act against an oversight mechanism, underperform on purpose to avoid being retrained, or fake compliance to protect their current behaviour. But they are constructed scenarios, with the goal handed to the model, not evidence that the assistant you used this morning is scheming behind your back. The honest reading is narrower and more useful than the headline. The tendency to route around a constraint is neither science fiction nor yet a daily operational problem. It is a property of optimisation that shows up reliably the moment the incentive is there.

The constraint you remove always looks like a cost

So why do we remove the constraints? Because at the moment you remove them, they look like overhead.

The cleanest case I know is not from computing. In 1958, as part of the Great Leap Forward, China ran a campaign to exterminate sparrows. Sparrows ate grain, the grain was needed, the logic was obvious: fewer sparrows, more grain. Hundreds of millions of birds were killed. What the campaign did not price in was that sparrows also ate locusts and insects. Remove the bird and you remove a control on the pests, and the pests do far more damage to a harvest than the birds ever did. A working paper out of the US National Bureau of Economic Research in 2025, by Frank and colleagues, puts a number on it that I find hard to read calmly: sparrow eradication accounts for an estimated 19.6 percent of the national crop-yield reduction during the famine that followed, and on their estimate roughly two million deaths between 1959 and 1961. The constraint that looked like a cost was load-bearing. The bill arrived later, in a different column, and it was enormous.

It is not an isolated shape. Sub-therapeutic antibiotics dosed into healthy livestock removed a natural check on bacterial populations and bought faster growth; the deferred cost was antimicrobial resistance, which is why the World Health Organization in 2017 told farmers to stop using antibiotics routinely in healthy animals. G.K. Chesterton put the principle in 1929, in the parable now known as Chesterton’s Fence: do not take down a fence until you know why it was put up. The reformer who cannot see the use of the fence is exactly the person who should not be allowed to remove it.

Now bring it back to the loops, because this is the seam I left open in the last essay and it is time to close it. The economic case for the architecture is that you take the human out of the runtime loop. That is the saving. It is real. But the failure mode I also described, the one where a reviewer from the same model family quietly prefers its own kind of output, the self-preference bias that a NeurIPS 2024 result documented, is not a separate problem that happens to sit alongside the saving. It is the same act. Removing the human was removing a load-bearing constraint. The human was not just slow and expensive. The human was the one thing in the loop that did not share the model’s blind spots. Take it out and you capture the saving and inherit the bias in one move, and like the sparrows, you find out afterwards.

Evolution is what you get without the fourth loop

Put it together and the picture is uncomfortable in a precise way.

The loops are evolving. The pieces of evolution are all present and I can point to each one in my own systems. Optimisation has no built-in account of whether the objective is the right one. Optimisers treat constraints as obstacles. And the constraints most likely to be removed are exactly the ones whose value is not visible at the moment you remove them. None of that requires the system to be intelligent, or adversarial, or anything other than a competent optimiser doing its job in a loop, with the score going up. My loops are not the frontier agents in those alignment studies, and that is the point. The hole is in the shape of the thing, not its size.

The recursive part is the oldest worry in the field. I.J. Good wrote about an ultraintelligent machine designing better machines in 1965; biologists have a sharper version, the evolution of evolvability, the idea that a system can get better at getting better, which Kirschner and Gerhart set out in 1998. You do not need a superintelligence for the modest version of this. A 2023 result called Self-Refine showed a single model improving its own output by roughly twenty percent across seven tasks with no retraining and no human in the loop, just feedback folded back on itself. The method of improvement is itself improving. The architecture is becoming something it was not designed to be, a cycle at a time, and the only thing scoring it is a number it is also optimising.

This is the line I keep coming back to. Optimisation without governance is evolution. Civilisation begins when governance constrains optimisation. The fourth loop is the governance loop, and the whole of it is one question the other three never ask: should this improvement survive. Not whether the score went up, but whether the thing we are now better at is the thing we wanted to be better at.

There is a cost angle here too, and it is worth re-checking because the last essay leaned on it. The architecture’s economics come from running a cheap model at runtime and an expensive one as the reviewer. The price spread is what makes it pay. As of the middle of 2026 a fast cheap model runs about a dollar per million input tokens; the strong reviewer tier is around five times that, the frontier tier about ten times. The arbitrage is real and the gap is wide. But the reviewer is usually from the same family as the runtime model, which is precisely the condition under which self-preference appears. The thing that makes the loop cheap is the thing that makes it blind to its own drift. I said in the last piece that the arbitrage holds in batch and breaks in real time. The fourth loop is the name for what would have to be watching, in real time, for it not to break.
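The arbitrage is easy to check by hand. A rough sketch using the ratios above and input tokens only; the traffic volume and the fraction of traffic the reviewer re-reads are assumptions to be replaced with real numbers.

```python
CHEAP_PER_M = 1.00                  # dollars per million input tokens (from the text)
REVIEWER_PER_M = 5 * CHEAP_PER_M    # "around five times that"
FRONTIER_PER_M = 10 * CHEAP_PER_M   # "about ten times"

def loop_cost(runtime_tokens_m: float, review_fraction: float) -> float:
    """Cheap runtime on all traffic, plus the reviewer re-reading a fraction of it."""
    runtime = runtime_tokens_m * CHEAP_PER_M
    review = runtime_tokens_m * review_fraction * REVIEWER_PER_M
    return runtime + review

def all_reviewer_cost(runtime_tokens_m: float) -> float:
    """Baseline: send every token to the reviewer-tier model instead."""
    return runtime_tokens_m * REVIEWER_PER_M
```

At an assumed 100 million tokens with 10 percent reviewed, the loop costs about 150 dollars against 500 for running everything on the reviewer tier. Review 80 percent and the two figures meet, which is roughly what real-time oversight would demand.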

The honest bit

I am not going to solve this for you, because I have not solved it for me.

I named the load-bearing constraints in the last essay and I will name them again, because they are the closest thing I have to a fourth loop. Periodic human review of cases the verifier marked fine. An external benchmark, held out, never optimised against. Deliberate rotation of the evaluator so it is not always judging with the same blind spots. I wrote then that I do none of these perfectly, that they are the operational debt. That is still true. What I understand more sharply now is what they are debt against. They are the governance loop I have not built, sketched as a to-do list and run by hand when I remember.
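Those three constraints can at least be written down as a gate. This is a sketch of the shape, not a solution: every name here is hypothetical, and a held-out benchmark only catches the drift it happens to measure.

```python
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    proxy_before: float     # the score the loop optimises
    proxy_after: float
    heldout_before: float   # external benchmark, never optimised against
    heldout_after: float

def should_survive(c: Candidate, tolerance: float = 0.0) -> bool:
    # Accept only if the proxy rose AND the held-out score did not fall.
    improved = c.proxy_after > c.proxy_before
    no_drift = c.heldout_after >= c.heldout_before - tolerance
    return improved and no_drift

def sample_for_human(approved_ids: list[str], rate: float, seed: int) -> list[str]:
    # Periodic human review of cases the verifier marked fine.
    rng = random.Random(seed)
    k = max(1, round(len(approved_ids) * rate)) if approved_ids else 0
    return rng.sample(approved_ids, k)

def pick_evaluator(evaluators: list[str], cycle: int) -> str:
    # Deliberate rotation so the same blind spots do not judge every cycle.
    return evaluators[cycle % len(evaluators)]
```

A change that lifts the proxy from 0.8 to 0.9 while the held-out score slips from 0.7 to 0.6 is rejected here, even though the three loops on their own would have kept it.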

The gap is not that the system will turn on me. That is the wrong fear and it makes for bad engineering. The gap is quieter and harder to close. Nothing in the three loops is watching whether the real-time evolution they are running is heading somewhere good or merely somewhere higher-scoring. The score goes up either way. The sparrows were gone either way.

One last thing, and it is the reason I can write any of this. This essay is itself the output of a loop. The last piece produced a question; the conversation it started acted as a reviewer; the notes and the drift file and the run sheet held the state; and this essay is the mutation that survived. I can see the lineage only because I kept the persistent layer. Which is the smaller, domestic version of the whole problem. The loops will keep running and keep improving. Whether anyone is keeping the record that lets you see where they have gone is, so far, a decision a human still has to make.

NZ Shrimper. Modern IT, hard truths. Written from Christchurch, between the loops.

Sources

Marilyn Strathern, “Improving Ratings: Audit in the British University System” (1997): gwern.net/doc/statistics/decision/1997-strathern.pdf
Victoria Krakovna et al., “Specification gaming: the flip side of AI ingenuity” (2020): deepmind.google/discover/blog/specification-gaming-the-flip-side-of-ai-ingenuity/
Nate Soares et al., “Corrigibility” (2015): intelligence.org/files/Corrigibility.pdf
Dylan Hadfield-Menell et al., “The Off-Switch Game” (2017): ijcai.org/proceedings/2017/0032.pdf
Ryan Greenblatt, Carson Denison et al., “Alignment faking in large language models” (2024): anthropic.com/research/alignment-faking
Alexander Meinke et al., “Frontier Models are Capable of In-Context Scheming” (2024): apolloresearch.ai/research/frontier-models-are-capable-of-incontext-scheming/
Frank et al., “Campaigning for Extinction: Eradication of Sparrows and the Great Famine in China” (2025): cato.org/research-briefs-economic-policy/eradication-sparrows-great-famine-china
World Health Organization, “Stop using antibiotics in healthy animals to prevent the spread of antibiotic resistance” (2017): who.int/news/item/07-11-2017-stop-using-antibiotics-in-healthy-animals-to-prevent-the-spread-of-antibiotic-resistance
G.K. Chesterton, The Thing (1929), “The Drift from Domesticity”.
I.J. Good, “Speculations Concerning the First Ultraintelligent Machine” (1965).
Marc Kirschner and John Gerhart, “Evolvability” (1998): pubmed.ncbi.nlm.nih.gov/9671692/
Aman Madaan et al., “Self-Refine: Iterative Refinement with Self-Feedback” (2023): arxiv.org/abs/2303.17651

The spy agency just told you to patch faster

Chris Bennett — Fri, 05 Jun 2026 12:43:42 GMT

New Zealand’s National Cyber Security Centre, which sits inside the GCSB, has told organisations here to get ready for “a significant increase in vulnerabilities and incidents.”

That is the spy agency talking to industry. Not a vendor with something to sell. Not a LinkedIn growth guy with a webinar. The bit of government whose entire job is signals intelligence and keeping the lights on has looked at what is coming and said, out loud, brace.

I run IT for a seventeen-store retail business out of the South Island. I am not a threat researcher and I do not have a security team. I am one bloke reading the same reports everyone else can read, then deciding what actually matters for a business that runs on goodwill, Microsoft 365 and a couple of line-of-business apps older than some of the staff. So this is that. What is happening, why it matters, and what I am actually doing about it.

Three things are moving at the same time. Here they are.

One: the bottleneck just went away

The headline is a model. Anthropic has an unreleased frontier model called Mythos Preview, and in April it set up something called Project Glasswing to hand it to a small ring of partners, AWS, Apple, Cisco, CrowdStrike, Google, JPMorganChase, Microsoft, NVIDIA, Palo Alto Networks, the Linux Foundation, plus forty-odd other organisations that maintain critical software. The Pentagon is in the ring. New Zealand’s NCSC is not, but it is talking to people who are.

What the model does is the part that matters. It finds software vulnerabilities, and it chains them together, faster than almost any human researcher alive. Not in a lab demo. In real code that real businesses run.

Three examples Anthropic has published. A 27-year-old bug in OpenBSD, which has a reputation as one of the most paranoid, security-hardened operating systems going. A 16-year-old flaw in FFmpeg, the video library buried inside an enormous amount of other software, sitting in a line of code that automated testing tools had hammered five million times without ever noticing. And a privilege escalation chain in the Linux kernel that the model found and assembled on its own, no human steering, going from ordinary user to full control of the machine.

Here is the thing to understand, because it is easy to miss under the noise. The flaws were always there. The holes in software are not new. What was rare was the supply of people skilled enough to find them and, harder still, to chain a few small ones into a serious one. That skill was the bottleneck. That bottleneck just went away. The CrowdStrike line on this is that the window between a flaw being discovered and being exploited has collapsed from months to minutes.

Now, I read vendor announcements the way you should read any vendor announcement, with one eyebrow up. These are Anthropic’s own numbers on Anthropic’s own page. Mythos scores 83.1 percent on a vulnerability benchmark called CyberGym against 66.6 for their next-best model, 93.9 against 80.8 on SWE-bench Verified, and so on. To their credit they flag that some benchmark problems show signs of the model having memorised them, and say the gap holds even with those stripped out. Critics, fairly, have called the whole rollout marketing. Maybe some of it is.

But two things make it more than puffery. The partner list is not a marketing list, it is most of the infrastructure your business already runs on. And the spy agencies are not in the habit of issuing public warnings to sell someone else’s product. Palo Alto, while testing the model, went from issuing about five security alerts a month to a couple of dozen in a single day. US banks are reportedly scrambling to patch. The UK’s NCSC is calling what is coming a “forced correction” against decades of accumulated technical debt across open source, commercial and SaaS software alike. When the people who run the actual systems start moving, the marketing question stops being interesting.

For most businesses the exposure is the long tail nobody looks at. The ERP on a dated language. The line-of-business tool the vendor stopped patching. The server still on an OS build that should have been retired. That is the kind of code a model like this chews through if anyone points it that way.

Two: the new attack surface nobody hardened

While everyone watches the superhacking story, a quieter one is unfolding, and it is the one I find more uncomfortable, because it is aimed squarely at the stuff I have been building.

The new wiring that businesses are rushing to deploy, AI agents, the Model Context Protocol servers that connect them to your systems, the Python stacks underneath, is being attacked in ways traditional security tools were never designed to see. CrowdStrike documented three of them, and none look like anything a code review would catch, because the vulnerability is not in the code. It is in the English.

Tool poisoning. Someone publishes an agent tool whose description, the natural-language bit the model reads, quietly contains an instruction. An innocent-looking tool that adds two numbers, with metadata that says, before you do the maths, go and read the SSH private key and tuck it into this other field. The tool does the arithmetic perfectly. Your key ends up in the logs. Static analysis sees nothing, because there is nothing wrong with the code.

Tool shadowing. One tool’s description changes how the agent uses a completely different, legitimate tool. A metrics tool that says “always blind-copy this address on email reports” will quietly shape what your real email tool does, without ever touching email itself.

Rugpulls. An MCP server that behaves perfectly the day you integrate it, then changes three weeks later when the operator adds an exfiltration step, and your agent picks it up automatically. No code change your end. No deployment. No alert.

I built a governance system this way. Eight AI personas, each a markdown file, talking to five backend systems through MCP servers. It is one of the most useful things I have made. It is also, I now understand more sharply than I did, an attack surface that did not exist eighteen months ago, and one that almost nobody has hardened. Signed tool manifests, version pinning with explicit approval before any upgrade, validation that runs outside the model, telemetry that lets you see what the agent reasoned before it acted. Most shops running this stuff have none of it. I am retrofitting mine.
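Version pinning is the easiest of those to sketch. The idea is to hash each tool's definition on the day a human approves it and refuse anything that no longer matches. The dictionary shape below is an assumption for illustration, not the MCP SDK's own types, and a plain hash is weaker than a signed manifest.

```python
import hashlib
import json

def fingerprint(tool: dict) -> str:
    # Canonical JSON so key order does not change the hash.
    canonical = json.dumps(tool, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def changed_tools(approved: dict[str, str], live_tools: list[dict]) -> list[str]:
    """Return names of tools that are new or differ from the approved pins."""
    flagged = []
    for tool in live_tools:
        name = tool["name"]
        if approved.get(name) != fingerprint(tool):
            flagged.append(name)
    return flagged
```

Run `changed_tools` outside the model, before the agent is handed the tool list, and block the session if it returns anything. That catches a rugpull or a quietly edited description. It does not catch a tool that was poisoned on the day it was approved; a person still has to read the description once.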

Then there is the first concrete hole on that surface. A flaw in Starlette, the Python framework underneath FastAPI, which gets hundreds of millions of downloads a week and quietly sits beneath most of the self-hosted AI tooling people are standing up, the LLM gateways, the proxies, the Python MCP servers, the eval dashboards. A single character in an HTTP header bypasses path-based authorisation. The fix exists. The hard part is that the framework usually arrives as a dependency three layers down that nobody knew they had. “It is only internal” is not a defence here. It is exactly the pivot a competent attacker uses once they are already on the network.

And underneath all of it, the agents themselves are not trustworthy in the way the hype implies. A peer-reviewed study put ten frontier agents through a benchmark built to expose goal-fixation, and they took undesirable or harmful actions most of the time, and caused real damage in a meaningful share of cases. The phrase the researchers use is blind goal-directedness. The agent pursues the task without checking whether the task is sensible, safe, or even possible. Told to disable the firewall to improve security, it complies. It does the wrong thing while looking completely confident it is doing the right one.

If that sounds familiar, it is because I wrote about it months ago in less academic language. I said you should treat every AI like a genius intern who might show up drunk, and that you should build for the drunk one, not the genius. It turns out the frontier labs and a university lab full of researchers have spent the last while proving the instinct with benchmarks. Validation gates need to live outside the model, with hard rules it cannot talk its way past, because it absolutely will try.
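A validation gate of that kind can be very plain. A minimal sketch, with made-up tool and target names: the agent proposes an action, ordinary code decides, and no amount of persuasive reasoning in the prompt changes the answer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    tool: str
    target: str

ALLOWED_TOOLS = {"read_report", "send_email"}       # hypothetical tool names
PROTECTED_TARGETS = {"firewall", "prod_database"}   # hypothetical targets

def gate(action: Action) -> tuple[bool, str]:
    # Hard rules checked outside the model, before anything executes.
    if action.tool not in ALLOWED_TOOLS:
        return False, f"tool '{action.tool}' is not on the allow-list"
    if action.target in PROTECTED_TARGETS:
        return False, f"target '{action.target}' needs human approval"
    return True, "ok"
```

The "disable the firewall to improve security" request fails on the first check, whatever justification came with it.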

Three: the AI you did not ask for

The third thing is happening on the endpoint, and it is a different kind of problem. Consent is being quietly decoupled from deployment at the platform layer.

Chrome has been storing about four gigabytes of AI model files on every machine that meets the spec, no prompt, no opt-in. Delete them and the browser fetches them back on restart. Google’s answer is that it has been there since 2024, it powers a couple of features, and there is now a settings toggle. All true. It is also a locally-running model, reachable by web pages through the browser’s own API, sitting on machines that handle your business credentials. Four gigabytes per endpoint adds up across a fleet, but that is not the real issue. The real issue is that a managed business should know it is there and decide whether to allow it, rather than have the decision made for it.

Windows 11 is being rebuilt as what Microsoft calls an “agentic OS,” with Copilot, Recall and AI Actions on by default. The signal worth noticing is the resistance. Someone is maintaining a 2,400-line PowerShell script whose only purpose is to strip all of it back out. You do not get an organised resistance that size against a feature people want. You get it against a default people regard as a governance failure.

For anyone running a managed environment, “default secure” no longer describes a Windows 11 endpoint out of the box. It describes one you have explicitly configured. Those are different things now.

What I am actually doing

No call to arms. Just the list, in the order I am working it.

Run the dependency check first, because it is the one with a live fix and a clock on it. Anything Python-facing, internal or not, gets checked for that Starlette version and upgraded. Internet-facing first, then the internal MCP servers, then the dev boxes.
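The check itself is a few lines of standard library. This is a sketch, not a scanner: it reports only the Python environment it runs in, the version parser is deliberately naive, and the fixed version must come from the advisory, so it is left as a parameter here.

```python
from importlib import metadata
from typing import Optional

def parse_version(text: str) -> tuple[int, ...]:
    # Naive parser: keeps leading numeric parts only, e.g. "0.40.0" -> (0, 40, 0).
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def needs_upgrade(installed: str, minimum_fixed: str) -> bool:
    a, b = parse_version(installed), parse_version(minimum_fixed)
    width = max(len(a), len(b))
    pad = lambda v: v + (0,) * (width - len(v))   # treat "0.40" like "0.40.0"
    return pad(a) < pad(b)

def installed_version(package: str) -> Optional[str]:
    # Returns None when the package is absent from this environment.
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

Call `installed_version("starlette")` inside each virtualenv or container image, then pass the result and the advisory's fixed version to `needs_upgrade`. It will usually turn up as a transitive dependency, which is the point of checking.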

Audit the fleet for the Chrome model so the allow-or-block decision is made on data, not vibes. Set explicit tenant policy for the Windows 11 AI features rather than letting them drift on by default. Write down which vendor defaults we are keeping and who owns that call, because right now Microsoft, Google and OpenAI are making those decisions for us by omission.

Retrofit the boring controls onto the agent and MCP work. Signed manifests. Version pinning. Validation outside the model. It is the slowest job on the list and the one that matters most for the next couple of years.

And map our patching against what the NCSC is now saying out loud, which is unglamorous and exactly right. Patch faster. Shrink the attack surface. Check the supply chain. Watch for compromise. Standard hygiene, all of it. The only thing that changed is the cadence, and cadence is now the whole game.

The honest bit

None of this means stop using AI. I use it every day. It writes code with me, it runs a governance system that keeps a dozen projects moving, it does the work of a team I do not have. I am not closing the laptop.

What changed is that the boring stuff stopped being optional. The same tool that writes your code can find every hole in it. The same agent that saves you an afternoon can wipe a database in nine seconds if you let it near one and do not watch it. The hype crowd is still posting “built an app in a weekend.” The grown-up version of that sentence has a second half, and the second half is that the spy agency would now like you to patch it before someone else’s AI reads it first.

That is the week. Three things at once, and the ground under enterprise tech moving faster than the average patch cycle can absorb. Standard hygiene still works. You just have to do it like you mean it now.

What is running in your business right now that you would be in trouble over if someone went looking this month, not next year?

-----

NZ Shrimper. Modern IT, hard truths.

Written from Christchurch between the phone calls, the password resets and that damn printer.


-----

Three Loops

Chris Bennett — Mon, 04 May 2026 13:16:19 GMT

The same architectural pattern runs in three of my production systems. None of them call themselves AI tools. One is a tyre pricing calculator.

The pattern is small and not new. Cheap fast model on the runtime path. Smarter model on the review path. Persistent state in the middle that holds findings between runs. Learnings flow back so the runtime gets sharper without anyone retraining the underlying model.

It is not a clever insight. The literature has been here for two and a half years. What is worth writing about is what happens when you actually deploy the same shape three times, in three unrelated domains, and watch where it works. And where it does not.

Loop one: the tyre pricing calculator

Tyre General is a 17-store retailer in the South Island of New Zealand. The pricing function is fiddly. Suppliers ship pricelists in inconsistent formats. Levies and fitting charges and service-line modifiers all need to compose with the right margin tier for the right customer segment. There are rules.

I built a calculator that runs on top of ChatGPT with a knowledge base of supplier-specific rules. A pricelist comes in. The model applies the rules, generates the recommended pricing across four levels, and outputs a confidence score from 0 to 100.

When the score drops, I look at why. The model explains. Almost always it has hit a supplier doing something the rules did not anticipate. A new product line. A different unit of measurement. A levy code that did not exist yet. I add a rule. The rule set grows. The next pricelist that touches the same edge case gets handled cleanly.

The verifier in this loop is me. The runtime is the model. The persistent state is the rule set. The system gets smarter at the rule level, not the model level. The underlying ChatGPT instance is the same one everyone else uses.
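The routing is simple enough to sketch. The threshold of 80 and the names below are assumptions for illustration; the real calculator lives in a ChatGPT knowledge base, not in Python.

```python
from dataclasses import dataclass, field

@dataclass
class RuleSet:
    # Persistent state: edge case -> how to handle it next time.
    rules: dict[str, str] = field(default_factory=dict)

    def add(self, edge_case: str, handling: str) -> None:
        self.rules[edge_case] = handling

def triage(results: list[tuple[str, int]], threshold: int = 80) -> list[str]:
    """Return pricelist ids whose confidence (0 to 100) fell below the threshold."""
    return [pid for pid, confidence in results if confidence < threshold]
```

Everything `triage` returns goes to a person. Whatever that person learns goes into `RuleSet.add`, and the model itself never changes.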

I have been running this for 18 months. The rule set has grown from roughly 40 rules to over 200. The confidence scores trend up over time as edge cases get encoded. The system handles material I would not have trusted it with on day one.

That last point is what matters for the second loop. The human in the loop is the load-bearing component here. Without me reviewing the low-confidence outputs and writing new rules, the system stops improving the moment it leaves my desk. Which raises an obvious question. Can the human role be played by something else?

Loop two: MatchWatch matching

MatchWatch is a free movie matching app for couples and households across NZ, AU, UK, and US. The mechanic is simple. Everyone in the household swipes independently on titles available across their streaming services. The app shows where the group has matched.

The matching is where the architecture lives. Haiku does the per-user inference on every swipe. It outputs a recommendation along with a self-reported confidence score that runs around 85% on a typical household after the first couple of days of use. Sonnet runs a periodic sweep, daily on production accounts, more often on the test set, and automates the review step I do manually for the pricing system.

Sonnet reads where Haiku’s confidence dropped. It works out why. It writes the fix to a markdown file. Haiku reads the markdown on the next inference cycle. The 85% trends up over time as the markdown accumulates signal.

That delegation is the design move worth understanding. The persistent state, in this case a markdown file Sonnet writes and Haiku reads, is what lets the loop close without a human in the runtime path. Without persistent state, Sonnet’s review would be lost between cycles and Haiku would never see the corrections. With it, the review compounds.
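The mechanism is small. A sketch with hypothetical function names and file layout, and no model calls: the reviewer appends findings to a markdown file, and the runtime reads them back into its prompt on the next cycle.

```python
from pathlib import Path

def write_finding(path: Path, finding: str) -> None:
    # Reviewer side: append one correction per line as a markdown bullet.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(f"- {finding}\n")

def load_findings(path: Path) -> list[str]:
    # Runtime side: read accumulated corrections before the next inference cycle.
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    return [line[2:] for line in lines if line.startswith("- ")]

def build_prompt(base: str, findings: list[str]) -> str:
    # Fold the corrections into the runtime prompt; no retraining involved.
    if not findings:
        return base
    notes = "\n".join(f"- {f}" for f in findings)
    return f"{base}\n\nLearned corrections:\n{notes}"
```

Delete the file and the runtime is back to day one. That is what "persistent state" means in practice. It is also why the file deserves version control.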

This loop is more instrumented than the first. The confidence scores are logged. The rule additions are versioned. I can show you a trend line. The architecture works because the persistent state was designed in from day one, not bolted on later.

What you might be wondering, reading this, is whether the loop produces something genuinely better or just something that scores better against its own benchmark. Hold that question. We’re about to hit it.

And the essay you are reading

The third loop is the least instrumented of the three, and I want to name that asymmetry honestly before I describe it.

The pricing calculator and MatchWatch loops have measurable confidence trends, growing rule sets, periodic Sonnet sweeps with documented findings. The essay loop produced this Substack post. It has skill files that grew from three to eleven over eight months. It has a validator that catches roughly one fabricated or misattributed claim per five drafts. It has a custom GPT review pass that I tune against benchmark writers and that flags structural issues I cannot reliably catch myself.

What it doesn’t have is a confidence score I can show you.

The output is judged by me reading it, by a human peer reviewer when stakes are high enough to warrant one, and increasingly by external models I run the output past as a sanity check. The quality trend is something I can observe subjectively but not yet prove the way I can prove the pricing system trend. That’s a true statement about the system, not a hedge.

The skill files do something specific. They encode the voice fingerprints, the banned phrases, the editor panel of named writers I benchmark against: Ethan Mollick, Andrew Ng, Ruben Hassid (How to AI), Kevin Indig (Growth Memo), and Tom Mangan when the work is consumer-facing. The validator runs claim verification against external sources. The editor pass applies named patterns rather than running a generic “make it better” prompt. Each pattern fires only when the draft triggers it. The loop closes when a piece passes the gates and ships, and the lessons from any flagged issue fold back into the skill file before the next piece runs.

What I can say with reasonable confidence is that this piece would not exist in this shape without the loop. The first draft missed a beat the validator caught. The second draft hit a structural issue the editor pass surfaced. The third draft is what you are reading.

What I can’t say is whether the loop is producing genuinely better writing or just writing that survives its own gates more reliably. That distinction matters, and the literature is starting to take it seriously.

Why this works

Self-Refine, the foundational paper on this approach (arXiv preprint March 2023; published at NeurIPS December 2023), gave the field its first clean empirical result. Madaan and 15 co-authors from Carnegie Mellon, the Allen Institute for AI, University of Washington, NVIDIA, UC San Diego, and Google Brain showed that adding a single self-feedback loop to the same LLM produced roughly a 20% absolute improvement on average across seven diverse tasks. No fine-tuning. No supervised training data. No reinforcement learning. Just the model reviewing its own output and iterating.

Andrew Ng’s Agentic AI course on DeepLearning.AI, launched October 2025, makes Reflection the first of four agentic design patterns. Ng’s argument was that agentic workflows would drive more AI progress in 2024-2025 than the next foundation model generation. His HumanEval data is sharp. GPT-3.5 in zero-shot mode hit 48.1%. GPT-4 in zero-shot hit 67.0%. GPT-3.5 wrapped in an agentic loop hit up to 95.1%. The smaller, cheaper model in a properly designed loop outperforms the bigger model running single-shot.

That finding is now two years old. The model tiers have both moved. Haiku and Sonnet today aren’t GPT-3.5 and GPT-4 from 2023. The arithmetic still holds directionally, but the specific arbitrage ratio needs re-benchmarking against current pricing and capability tiers.

Anthropic’s Building Effective Agents paper, published December 19 2024, names the pattern Evaluator-Optimizer and describes it as “one LLM call generates a response while another provides evaluation and feedback in a loop.” The paper is explicit that the workflow is most effective when there are clear evaluation criteria and where iteration provides measurable value.

The pattern is documented. The pattern works. What is interesting about deploying it three times in different domains is what stays constant and what does not.

What stays constant. A runtime that is cheap enough to run on a per-event basis. A reviewer that is smart enough to write useful feedback. Persistent state between runs. Some mechanism for closing the loop so feedback flows back into runtime behaviour rather than getting lost.

What varies. Who plays the reviewer role. In the pricing calculator it is me. In MatchWatch it is Sonnet. In the essay loop it is a chain of models plus me at the end. The role itself is what matters. The entity playing it can be human or AI as long as the reviewer is genuinely smarter or more careful than the runtime.
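The constant parts and the swappable reviewer can be sketched in a few lines. This is an illustration under assumptions, not the code behind any of the three systems: the JSON rules file, the function names and the stub runtime and reviewer are all hypothetical. In a real deployment the runtime would be a cheap model call and the reviewer a human or a stronger model.

```python
import json
from pathlib import Path
from typing import Callable, Optional

Runtime = Callable[[str, list], str]            # (task, rules) -> output
Reviewer = Callable[[str, str], Optional[str]]  # (task, output) -> lesson, or None


def load_rules(path: Path) -> list:
    """Persistent state: whatever survived earlier cycles."""
    return json.loads(path.read_text()) if path.exists() else []


def run_cycle(task: str, runtime: Runtime, reviewer: Reviewer, rules_path: Path) -> str:
    rules = load_rules(rules_path)
    output = runtime(task, rules)        # cheap, runs per event
    lesson = reviewer(task, output)      # human or model, same interface
    if lesson and lesson not in rules:
        rules.append(lesson)             # close the loop: feedback becomes state
        rules_path.write_text(json.dumps(rules, indent=2))
    return output


# Stand-ins so the sketch runs without any model API (both hypothetical).
def stub_runtime(task: str, rules: list) -> str:
    return f"{task} [rules applied: {len(rules)}]"


def stub_reviewer(task: str, output: str) -> Optional[str]:
    return None if "NZD" in output else "State the currency in every quote."
```

Calling `run_cycle` twice against the same `rules_path` shows the point of the persistent layer: the second run starts from the lesson the first run wrote down. Swapping `stub_reviewer` for a function that prompts a person or calls a stronger model changes nothing else in the loop.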

That formulation makes the architecture sound clean. It’s not as clean as it sounds. The next two sections push on that.

The economics shift

The economic implication of the pattern is significant and not yet fully priced in. The Ng HumanEval result, GPT-3.5 in a loop beating GPT-4 single-shot, is not a curiosity. It is an arbitrage opportunity. Companies currently paying frontier-model rates for everything they put through their AI stack are paying for capability they could mostly get from a cheaper model in a loop. The residual frontier-model cost belongs on the review path, where the smart model earns its keep.

For SMEs and small IT teams, the audience I write for and the operating environment I build in, this matters more than it might in a large enterprise. The companies running 4 or 5 frontier models in production at full token cost can afford the inefficiency. The 17-store retailer cannot. The architecture is what makes AI commercially viable at the small-business scale, not the choice of frontier model.

The other implication is that the moat for AI products shifts. It used to be “we have access to GPT-4” or “we built our own model.” Now it’s “we built the loop that makes the model produce work that survives review.” The model becomes a commodity input. The verification layer becomes the differentiator.

Here is where the comfortable framing breaks down for the first time.

Microsoft Azure Research published a paper in late 2025 called Sherlock that measured what verification actually costs in production. Their finding is uncomfortable for anyone selling the loop architecture as a free lunch. Adding verification to every node in an agentic workflow can increase latency by up to 28.9x for instruction-following benchmarks and 53.2x for coding benchmarks, compared to the unverified baseline. The 18.3% accuracy gain Sherlock reports is real. So is the cost of getting it.

That’s a meaningful asterisk on the Ng arbitrage argument. Yes, GPT-3.5 in a loop beats GPT-4 single-shot on accuracy. But the loop run takes substantially more tokens, more API calls, and more wall-clock time than the single-shot. Whether the architecture produces a net gain depends on the specific cost ratio between the runtime model, the review model, and the latency tolerance of the application. For batch processing where latency does not matter and per-token cost is the only constraint, the arbitrage often works. For real-time interactive applications, the arbitrage often does not.
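The trade-off is easy to put into arithmetic. The sketch below compares one frontier-model call against a cheap runtime reviewed by the frontier model on every round. Every number in it is made up for illustration: the prices, token counts and call times are assumptions, not measurements from Sherlock, from Ng, or from my own systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    usd_per_mtok: float      # blended price per million tokens (hypothetical)
    seconds_per_call: float  # typical wall-clock time per call (hypothetical)


def call_cost(model: Model, tokens: int) -> float:
    return model.usd_per_mtok * tokens / 1_000_000


def single_shot(model: Model, tokens: int) -> tuple[float, float]:
    """Cost in USD and latency in seconds for one unreviewed call."""
    return call_cost(model, tokens), model.seconds_per_call


def looped(runtime: Model, reviewer: Model, gen_tokens: int,
           review_tokens: int, rounds: int) -> tuple[float, float]:
    """Each round is one runtime call followed by one review call."""
    cost = rounds * (call_cost(runtime, gen_tokens) + call_cost(reviewer, review_tokens))
    latency = rounds * (runtime.seconds_per_call + reviewer.seconds_per_call)
    return cost, latency


CHEAP = Model(usd_per_mtok=1.0, seconds_per_call=1.0)
FRONTIER = Model(usd_per_mtok=15.0, seconds_per_call=4.0)

baseline_cost, baseline_latency = single_shot(FRONTIER, tokens=2_000)
for rounds in (1, 3, 4):
    cost, latency = looped(CHEAP, FRONTIER, gen_tokens=2_000, review_tokens=500, rounds=rounds)
    print(f"{rounds} round(s): ${cost:.4f} vs ${baseline_cost:.4f}, "
          f"{latency:.0f}s vs {baseline_latency:.0f}s")
```

With these made-up numbers the loop is cheaper than the single frontier call for up to three rounds and dearer from the fourth, while every round adds about five seconds of latency. Change the price ratio or the review size and the break-even moves, which is the point: the arbitrage is a property of the specific ratios, not of the pattern.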

I haven’t benchmarked my own systems against this finding. The MatchWatch matching loop doesn’t need to run in real time on the human side; the swipe latency is the bottleneck, not the inference. The pricing calculator runs as overnight batch. The essay loop runs once per piece. These are all use cases where the loop’s latency overhead is hidden by the human pacing of the work. That won’t be true of every deployment.

The workforce question

The textbook positive framing of AI in the workplace is augmentation. AI helps humans do more, faster. Mollick’s research has been clear and well-cited on this. Generative AI improves performance most for less-experienced workers, narrowing the skill gap. The Brynjolfsson, Li and Raymond NBER paper from 2023 showed exactly this. 5,179 customer support agents using a generative AI conversational assistant saw an average 14% productivity gain, but novice and low-skilled workers saw 34%. The mechanism was AI codifying the best practices of more experienced workers, helping new ones move down the experience curve faster.

That’s the augmentation story, and it’s true.

It’s also incomplete.

Brynjolfsson’s Stanford Digital Economy Lab study, published August 2025 using ADP payroll data, found something different. In the most AI-exposed occupations, employment for early-career workers (ages 22 to 25) has declined 13% in relative terms while older, more experienced workers in the same occupations have remained employed at the same level or grown. The narrowing-skill-gap effect from the customer support paper has reversed at the labour-market level. The bottom rungs of the ladder are being removed.

Brynjolfsson said as much at the Stanford SIEPR Economic Summit in March 2026. The published Stanford summary captures the finding directly: research shows that employment is falling among workers who use AI to automate tasks, but growing for those adopting AI to learn new skills [1]. The bifurcation matters. AI as automation removes employment. AI as a learning accelerant grows it.

The architecture I’ve described maps onto this directly. The pricing calculator grew from 40 to 200 rules over 18 months. Each rule encodes a decision a human used to make. The system now handles material I wouldn’t have trusted it with on day one. The work that used to require a junior pricing analyst is now done by the rule set plus the model. I, the senior, sit at the verification layer instead of training the junior.

That’s not a thought experiment. It is the actual labour reorganisation that happens when the loop architecture lands in production. The senior’s expertise becomes the benchmark the runtime is calibrated against. The rule set encodes what would have been transferred through years of mentorship. The junior role becomes harder to justify because the work the junior would have done as practice is now handled by the loop.

This is the tension I live with as someone who builds these systems. The architecture is genuinely better than the alternative. It produces work that survives review more reliably than untrained or junior labour can. It lowers the cost of AI in business in a way that makes the technology accessible to operations that couldn’t otherwise afford it. And it removes the entry-level rungs that the people who built the senior expertise climbed up on.

The longer-run problem is harder than the entry-level job loss. The rule set encodes what experienced humans know. But experienced humans became experienced by doing the work the rule set now handles. If the junior role goes away, the senior role doesn’t get replenished. In ten years the system runs on rules written by people who learned pricing before the loop existed. The loop is calibrated against expertise it is simultaneously preventing from forming. That’s the structural problem underneath the labour reorganisation, and I don’t think anyone building these systems has a clean answer to it.

The honest position is that the augmentation framing was true at the firm level for a while. It is becoming false at the labour-market level now. Anyone describing the loop architecture without naming the workforce consequences is selling the comfortable half of the story.

Where the loop breaks

Three failure modes are worth naming. The literature on evaluator-optimizer loops is now developed enough to make them specific rather than rhetorical. And the population-level data is starting to come in.

Galileo’s research on production LLM judge implementations reports that 93% of teams running these systems struggle with the implementation in practice. The struggles aren’t abstract. They’re the same three failure modes that show up in the academic literature, hitting real teams running real systems.

The benchmark calcifies. A calibrated LLM judge degrades as the system under evaluation improves. The output distribution shifts and the judge encounters fewer of the failures it was calibrated to catch. In the pricing calculator: if the rule set grows to cover 97% of cases, the verifier stops seeing interesting failures and starts marking everything fine. The benchmark becomes the ceiling rather than the floor. Galileo’s State of Eval Engineering research documents this in detail.

The judge develops self-model bias. LLM judges exhibit position bias, verbosity bias, and self-preference bias. Research published at NeurIPS 2024 shows that LLM evaluators recognise and favour their own generations, with a demonstrated correlation between self-recognition capability and self-preference bias strength. For MatchWatch: if Sonnet writes the fixes and Haiku reads them as context for the next inference, Sonnet’s stylistic priors will pull Haiku’s outputs toward Sonnet’s preferences over time. The system is no longer benchmarked against external quality. It’s benchmarked against Sonnet’s priors.

I don’t know whether MatchWatch is exhibiting this already. I can see the markdown Sonnet writes. I can see the confidence scores trend up. What I can’t easily see is whether Haiku’s outputs are drifting toward Sonnet’s stylistic priors at the expense of the household’s actual preferences. The instrumentation I have measures confidence. It doesn’t measure drift. That’s the gap in loop two’s design, and naming it honestly is more useful than claiming the bias doesn’t apply.

The loop improves on its metric and degrades on dimensions it cannot see. A January 2026 paper on arXiv titled When “Better” Prompts Hurt documents empirically that generic prompt improvements that raise eval scores can simultaneously degrade structured outputs and other quality dimensions the eval doesn’t measure. In the essay loop: the validator learns to pass claims that have a verifiable source, but it can’t check whether the claim is interesting or true in the spirit of the argument. Only that a citation exists for it. The piece can become more verifiable and less alive at the same time.

All three are manageable with discipline. Periodic human review of cases the verifier marks fine. An external holdout benchmark that’s never optimised against. Deliberate rotation of the evaluator. I do none of these perfectly. They’re the operational debt of the architecture, and anyone running this pattern in production who tells you they’ve eliminated the risks is overstating the maturity of the practice.
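The first of those disciplines is the cheapest to automate. Below is a minimal sketch of pulling a random slice of verifier passes for human review, plus a crude alarm for a verifier that has started marking everything fine. The function names, the 5% audit rate, the 50-case minimum and the use of 97% as a threshold are all assumptions for illustration; none of this is code from my systems.

```python
import random


def sample_passes_for_audit(verdicts, rate=0.05, seed=None):
    """Pick a random share of cases the verifier PASSED for a human to re-check.

    verdicts: list of (case_id, passed) pairs.
    """
    passed = [case_id for case_id, ok in verdicts if ok]
    if not passed:
        return []
    k = max(1, round(len(passed) * rate))  # always audit at least one pass
    return random.Random(seed).sample(passed, k)


def pass_rate(verdicts):
    return sum(1 for _, ok in verdicts if ok) / len(verdicts) if verdicts else 0.0


def looks_calcified(verdicts, threshold=0.97, min_cases=50):
    """Flag a verifier that passes nearly everything over a meaningful sample."""
    return len(verdicts) >= min_cases and pass_rate(verdicts) >= threshold
```

A high pass rate is not proof of calcification. It might mean the runtime really has improved. What the flag does is tell you when the only way to find out is to read the audited passes yourself.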

The 93% figure is what stops me from being too confident about my own three loops. The other 7% are presumably teams running this architecture well. I don’t assume I’m one of them. I assume I’m one of the 93% getting better at it, and that the difference between the two groups is whether you’ve built the discipline to detect when you aren’t running it well.

-----

*Conclusion section held for a separate essay.*

-----

[1] Stanford SIEPR Economic Summit 2026 published summary, *AI’s Job: What’s a Worker To Do*, siepr.stanford.edu/news/ais-job-whats-worker-do