The Monkeys, the Librarian, and the Magician
Why LLMs aren't an absolute path to AGI
Ah, the hockey stick curve we all came to know and love via pitch decks. It's all too common in AI hype posts: current capabilities marked somewhere in the middle, and "AGI" floating at the top right corner like a destination on a map. The implicit assumption is that we just need to keep ascending the curve, feeding it more compute, more parameters, more data. The line continues upward because… such lines continue upward. You've seen a hockey stick, right?
This LLM-related line doesn’t entirely work like that, though, and some executives approving budgets are making bets on an AGI finish line at the top of the graph that may not exist.
Those impressive scaling curves use logarithmic scales, a visualization choice that can make diminishing returns look like steady progress. When DeepMind researchers mapped compute-optimal pre-training in their Chinchilla study, they found a power-law relationship with a roughly 20:1 ratio of training tokens to parameters [1]. In other words, the loss curve flattens: each additional unit of compute buys less improvement.
For now, the industry has learned to work around this for production. Companies like Meta and Microsoft deliberately “overtrain” smaller models (Llama 3 used 200 tokens per parameter, Phi-3 pushed past 800) because smaller models are cheaper to run at inference time when millions of users are hitting the API. This isn’t truly escaping the curve, rather it’s trading training cost for inference cost. You’re still fighting diminishing returns; you’re just choosing where to take the hit. The fundamental ceiling on what pre-training can achieve hasn’t moved, so the hockey stick bends whether you want it to or not.
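The flattening is easy to see if you plug numbers into the parametric loss form the Chinchilla paper fits, L(N, D) = E + A/N^α + B/D^β. This is a sketch using the constants reported by Hoffmann et al.; the specific values matter less than the shape of the curve:

```python
# Chinchilla-style parametric loss: L(N, D) = E + A/N^alpha + B/D^beta,
# with the fitted constants reported in Hoffmann et al. (2022).
# E is the irreducible loss floor that no amount of compute removes.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    """Predicted pre-training loss for N parameters trained on D tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Compute-optimal allocation: hold tokens at ~20x parameters and scale both.
# C ~ 6*N*D is the standard rough estimate of training FLOPs.
for n in [1e9, 1e10, 1e11, 1e12]:
    d = 20 * n
    flops = 6 * n * d
    print(f"N={n:.0e}  D={d:.0e}  C~{flops:.1e} FLOPs  loss={loss(n, d):.3f}")
```

Each 10x jump in parameters (a 100x jump in compute) shaves off less loss than the one before it, and the predicted loss can never drop below E. That's the bend in the hockey stick.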
Not clear yet? It was a little muddy for me until I saw a completely unrelated TikTok this morning. Let me explain with three metaphors.
The Monkeys
You’ve probably heard the infinite monkeys thought experiment: Give infinite monkeys infinite typewriters and infinite time, and eventually one of them produces the complete works of Shakespeare. It’s a statement about probability and infinity asserting that, given enough random attempts, any specific outcome becomes inevitable. (This metaphor was featured in the TikTok that prompted me to write this article.)
This is roughly how some (I would argue many) people think about scaling LLMs toward AGI. Keep generating, keep training, and eventually emergent capabilities compound into something that looks like general intelligence. The math is fuzzy, but the trajectory feels right. Well, right enough to keep wanting the outcome.
But the thought experiment leaves out that someone (or some auditable process) has to read the output.
Walk with me here. Infinite monkeys produce infinite pages, and the overwhelming majority is gibberish. Assuming you have the smartest of monkeys, they can get good at guessing what letters come next. But finding Shakespeare in that pile requires a reader who already knows what Shakespeare looks like to act as a verifier. That verification process can't scale infinitely, because it requires judgment, context, and an understanding of quality. It requires, in other words, the very intelligence you're trying to create, which is in part why Berkeley RDI runs agentic AI benchmark evaluation competitions trying to automate that process [2].
The same organizations demanding that we scale toward AGI are also (rightly) demanding safety reviews, alignment checks, output verification, and human-in-the-loop processes in general. These aren’t optional add-ons, but regulatory requirements, liability shields, and basic operational sanity. But they’re also a structural speed limit. You cannot have infinite throughput AND human review. The verification/eval wall exists, and more monkeys don’t solve it.
Turing Award winner Yann LeCun formalized a related problem. If each generated token has some probability e of being wrong, then after n tokens, your probability of a fully correct answer is (1-e)^n. That’s exponential decay [3]. Even with a 1% error rate per token, which is optimistic, after 100 tokens you’re down to a 37% chance of correctness. After 500 tokens, you’re at 0.6%. And that math doesn’t care how impressive the demo looked.
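The decay is brutal to watch in plain numbers. Here's the idealized version of LeCun's formula from the slide (treating per-token errors as independent, which critics rightly note they are not in practice):

```python
# LeCun's compounding-error argument: if each generated token is
# independently wrong with probability e, the chance of a fully
# correct n-token answer is (1 - e)^n. Independence is the slide's
# simplifying assumption, not a property of real transformers.

def p_fully_correct(error_rate: float, n_tokens: int) -> float:
    """Probability that every one of n tokens is correct."""
    return (1 - error_rate) ** n_tokens

# Even an optimistic 1% per-token error rate collapses fast.
for n in [10, 100, 500, 1000]:
    print(f"{n:>5} tokens -> {p_fully_correct(0.01, n):.2%} chance fully correct")
```

With e = 0.01, 100 tokens lands at roughly 37% and 500 tokens at well under 1%, matching the figures above.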

As the math above shows, scaling produces more output. It doesn't necessarily produce more signal.
The Librarian
Let’s say you somehow solved the verification problem. You built the filter that could read infinite output and surface only the good stuff. You’ve still got a more fundamental issue.
LLMs are librarians.
Really, really good librarians. They’ve read everything, can cross-reference, summarize, retrieve, recombine, and give you the LOC identifiers. Ask for something in the style of Hemingway about the themes of Dostoevsky and they’ll hand you something impressive. Maybe a little trope-y, but they have the entire card catalog at their fingertips and they’re very fast.
But librarians are in the business of organizing and cataloging human knowledge. They generally don't produce original research outside that core pursuit.
This is the interpolation problem: LLMs operate within the boundaries of their (pre- and post-) training data. They find patterns, recombine elements, and generate outputs that exist somewhere in the distribution of what they've seen. They interpolate, sure, but what they don't do is truly extrapolate. They don't generate genuinely novel solutions outside the boundaries of what they've ingested, anything that doesn't at least "rhyme with" their known universe. (In other words: pattern-matching at a higher level of abstraction doesn't escape the pattern.)
Recent research from Stanford and Meta quantified this gap. In a study applying information theory to compare how LLMs and humans organize knowledge, researchers found that LLMs are optimized for aggressive statistical compression, making them extraordinarily efficient at pattern-matching within their training distribution [4]. Humans, by contrast, prioritize what the researchers call “adaptive nuance and contextual richness.” We sacrifice compression efficiency for flexibility. That flexibility is what lets us extrapolate, reason about novel situations, and recognize when we’re outside our domain of competence. LLMs have no such mechanism. They’ll confidently interpolate forever, even when the question requires something they’ve never seen. Until fairly recently, they didn’t have the guardrails to say, “You know, I just don’t know enough about that,” and guardrails are exactly what those are - installed in post.
You can make the library bigger and faster. You can furnish better tools for finding connections between sources, like introducing knowledge graphs instead of vanilla vector embeddings. What you can't do is make them produce knowledge that isn't derived from the collection, making the training data a very real ceiling.
We're also running low on high-quality training data. The internet has been scraped, the books ingested. Surely the well will keep being refreshed, but how much more value does another pass at Reddit provide? Some companies are generating synthetic data to fill the gap, which is roughly like asking a librarian to write new books by recombining the old ones.
The Magician
So we’ve got a probability problem and a category problem. But there’s a third thing happening, and it’s about perception.
We've all seen the demos where the AI produces something genuinely surprising, well-crafted to create a specific impression. Think of the first time somebody opens ChatGPT with a simple prompt and gets delighted by an unexpectedly novel-seeming reply, except here the demo is aimed at a specific actual or perceived business pain point. These moments can be highly effective.
In effect, a magician pulls a few rabbits out of a hat, and a lot of the audience wants a magic hat. Companies license a technology or solution, say the incantations, and the rabbits don't make an appearance. Knowing which prompt to use, which context to provide, and which outputs to favor is the magician's unfair advantage. The impressive result was real enough, but the implication that the success is reproducible - or can be extended to pulling, say, a dove - is usually overstated.
I've seen this play out across multiple enterprise AI implementations now. The vendor demo looks transformative, the marketing hype is SUPER-consistent in promises of transformation, and the pilot itself may show promise. But the production deployment is a very sophisticated autocomplete that requires extensive prompt engineering, careful guardrails, constant supervision, and significant human oversight to produce anything reliably useful. Then the company is in for a penny, in for a pound, spending almost as much in indirect expense making the AI work as it would have spent just optimizing the process.
The Bar
[Note: I had a hard time naming this section, so I went with a 3,000-year old joke structure made popular in the Vaudeville era. Jokes are made better with explanation. No further notes, please.]
The counterargument usually goes something like this:
What about agentic systems?
Multi-agent architectures?
RLHF-scaffolded reasoning?
World models?
And sure, those approaches might move the needle, increasingly so as you get to the latter (world models). If the answer to “LLMs won’t reach AGI” is “we’ll need to bolt on external planning systems, memory architectures, and verification layers,” then you’ve conceded the core claim. You’re no longer arguing that scaling LLMs is the path, but that LLMs might be one component of a much more complex system still being built. That’s a different bet, with different timelines, different capital requirements, and a much hazier finish line.
Some will argue that bolting components together is itself the mechanism of intelligence, that modularity is the point. That’s a deeper debate, and worth having, but it doesn’t change the investment calculus in front of us today. The most interesting frontier is not scaling the same architecture but hybridizing it via these very world models, planning systems, memory substrates, and agentic scaffolding. Those paths may move capability forward, but their cost structures, safety models, and governance requirements differ radically from “just scale LLMs.”
And to be fair, nobody serious believes infinite LLM scaling alone produces AGI. The strongest research communities already acknowledge this. The real debate is not whether new cognitive structures are needed, but how much existing architectures can contribute before hitting diminishing returns.
The scaling thesis assumes that more of the same approach yields qualitatively different results. It doesn't, not really, not just with scaling LLMs. You get (quantitatively) more of the same results, with diminishing returns as you approach the structural limits built into the architecture itself, whether you call it the verification wall, the training data ceiling, or just the category error between retrieval and creation.
The executives drawing hockey stick curves to AGI solely through LLM scaling are betting on a trajectory that bends in ways the underlying technology can't support. Not because LLMs are bad, but because they're a specific tool with specific capabilities and specific limits, and pretending otherwise doesn't make the limits disappear.
So this is a call for businesses to recalibrate, to ask harder questions about what we’re actually buying. Scrutinize demos, demand measurability of impact on the business, and stop assuming the line goes up forever just because it’s gone up so far.
If AGI eventually emerges, it won’t be because we scaled token prediction harder, but because we built fundamentally different coordination, reasoning, and embodiment layers on top of it. The next layer or layers will look less like a model and more like an ecosystem. We’re not going to get King Lear from scaling LLMs. But maybe, if we’re lucky, those monkeys will deliver a dystopian play about the corrupting nature of power that helps in a different way.
CREDITS: Claude Sonnet 4.5 for editorial, Google Gemini Nano Banana Pro for artwork.
References
[1] Hoffmann, J., et al. “Training Compute-Optimal Large Language Models.” DeepMind, March 2022. The study established scaling laws for pre-training, finding roughly 20 tokens per parameter as compute-optimal. Subsequent work, including a replication attempt by Epoch AI, found some statistical issues with the original estimates but confirmed the core ratio. Industry practice has since moved toward “overtraining” smaller models (Llama 3, Phi-3) to optimize for inference costs rather than training efficiency, a workaround that confirms rather than refutes the underlying diminishing returns.
[2] Berkeley RDI. “AgentX-AgentBeats Competition.” Launched October 2025. A two-phase competition challenging participants to build standardized benchmarks for agentic AI (Phase 1) and then develop agents to excel on them (Phase 2). The competition explicitly addresses the interoperability, reproducibility, and fragmentation problems in current agent evaluation. https://rdi.berkeley.edu/agentx-agentbeats
[3] LeCun, Y. “Towards Machines that can Understand, Reason & Plan.” Presentation at Santa Fe Institute workshop: AI and the Barrier of Meaning, April 24, 2023, slide 14. LeCun has maintained and strengthened this position, reiterating the argument at the AI Action Summit (February 2025) and declaring at NVIDIA GTC 2025 (March 2025) that he’s “not interested in LLMs anymore” because “they are just token generators.” The argument has been debated; critics note that error rates aren’t independent across tokens due to attention mechanisms and self-correction capabilities (Chain-of-Thought, etc.). The core concern about compounding errors in long-form generation remains relevant to production reliability. Slides available at: https://www.slideshare.net/slideshow/yann-lecun-20230424-santa-fe-institute-pdf/269726578
[4] Shani, C., Soffer, L., Jurafsky, D., LeCun, Y., & Shwartz-Ziv, R. “From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning.” arXiv:2505.17117, May 2025. The study applied Rate-Distortion Theory and the Information Bottleneck principle to compare conceptual organization across 40+ LLMs against human categorization benchmarks.