Understanding AI in 2026: Beyond the LLM Paradigm, or What’s Actually Required for Progress
by Nick Potkalitsky
The Pre-training Paradox: Knowledge vs. Intelligence
Pre-training on internet text delivered the current generation of AI systems, but this approach faces fundamental limits that no amount of scaling will overcome. The problem isn’t simply running out of data, though that constraint is real. The deeper issue is understanding what pre-training actually accomplishes and what it prevents.
Karpathy describes a sobering reality check from working with actual training datasets at frontier labs. When you examine a random document from the pre-training corpus, it’s not the thoughtful Wall Street Journal article you might imagine. “It’s total garbage,” he says. “It’s some like stock tickers, symbols, it’s a huge amount of slop and garbage from like all the corners of the internet.”
This creates a fundamental tension. Because internet data is so noisy, labs build ever-larger models to compress signal from noise. But most of that compression effort goes into memorization rather than developing intelligence. As Karpathy puts it: “Most of that compression is memory work instead of cognitive work. But what we really want is the cognitive part, delete the memory.”
Pre-training does two distinct things simultaneously. First, it accumulates knowledge (facts, patterns, typical responses). Second, it develops intelligence (the ability to recognize patterns, perform in-context learning, execute algorithms). The knowledge component might actually be holding back the intelligence component, training models to rely on memorized patterns rather than flexible reasoning.
Sutskever frames the core difficulty: pre-training is “very difficult to reason about because it’s so hard to understand the manner in which the model relies on pre-training data.” When a model makes a mistake, is it because the relevant pattern wasn’t sufficiently represented in training? Or because the model failed to generalize? Or because it’s relying on memorization when it should be reasoning? These questions have no clear answers.
The finite nature of quality training data creates another constraint. At some point, Sutskever notes, “pre-training will run out of data. The data is very clearly finite.” Then what? Labs can try variations on pre-training, move to reinforcement learning, or explore entirely new approaches. But the easy gains from simply ingesting more internet text are running out.
The Reinforcement Learning Trap
Reinforcement learning seems like the natural next step. Instead of learning from static text, have models learn from experience by trying things and observing outcomes. But current RL approaches have a fundamental flaw that Karpathy articulates with unusual clarity.
Consider how models currently learn to solve math problems through RL. The system generates hundreds of solution attempts in parallel. Each attempt might involve complex reasoning over many steps. At the end, the system checks which attempts produced correct answers. Then it takes those successful trajectories and upweights every single step that led to the right answer.
The problem should be obvious: not every step in a successful solution was actually correct or optimal. The model might have wandered down wrong paths, made lucky guesses, or succeeded despite poor reasoning. But the RL algorithm treats everything in a successful trajectory as correct behavior to reinforce.
As Karpathy explains: “It almost assumes that every single little piece of the solution that you made that arrived at the right answer was the correct thing to do, which is not true.” The result is that models get trained to repeat both good reasoning and lucky accidents, both optimal steps and wasteful detours.
His description of this process has stuck with me: “You’re sucking supervision through a straw.” After all the computational work of generating a long rollout (potentially thousands of steps), you extract just a single bit of information at the end (right answer or wrong answer). Then you broadcast that sparse signal backward across the entire trajectory, using it to adjust every step. “It’s just stupid and crazy,” he concludes.
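To make that concrete, here is a minimal Python sketch of outcome-only credit assignment. Everything in it (the rollout, the step labels, the 30% success rate) is invented for illustration rather than taken from any lab's training code, but it shows the core move Karpathy objects to: one bit of end-of-trajectory supervision gets broadcast, unchanged, onto every step that preceded it.

```python
# Toy illustration of "sucking supervision through a straw": a single
# end-of-rollout reward is applied uniformly to every step, so lucky guesses
# and wasted detours get reinforced exactly as strongly as sound reasoning.
# All names and numbers here are hypothetical, for illustration only.
import random

def run_rollout(num_steps: int) -> tuple[list[str], bool]:
    """Simulate a multi-step solution attempt where only the final outcome is observed."""
    steps = [random.choice(["sound reasoning", "lucky guess", "wasted detour"])
             for _ in range(num_steps)]
    solved = random.random() < 0.3        # one bit of supervision at the very end
    return steps, solved

def outcome_only_update(steps: list[str], solved: bool) -> dict[str, float]:
    """Broadcast the single outcome backward: every step gets the same +1/-1 weight."""
    weight = 1.0 if solved else -1.0
    return {f"step {i} ({kind})": weight for i, kind in enumerate(steps)}

random.seed(0)
steps, solved = run_rollout(num_steps=6)
print("solved:", solved)
for step, weight in outcome_only_update(steps, solved).items():
    print(f"{step:30s} weight = {weight:+.1f}")
```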
Humans never learn this way. When you solve a problem and get it right, you don’t blindly reinforce every step you took. You reflect. You identify which parts of your approach were sound and which were flawed. You recognize where you got lucky versus where you reasoned well. This metacognitive process is completely absent from current RL training.
Why Process Supervision Doesn’t Solve It
The obvious fix would be process supervision: providing feedback at each step rather than only at the end. Instead of just knowing whether the final answer is right, provide guidance throughout the solution process. But this creates new problems.
The challenge is assigning credit to intermediate steps when you have partial solutions. For a final answer, you can check if it matches the correct result. But how do you evaluate step 47 in a 100-step solution? What makes that particular step good or bad?
Current approaches use LLM judges. You prompt another model to evaluate whether a given step represents good reasoning. But these judges are themselves large neural networks with billions of parameters, and Karpathy points out the critical flaw: “Those LLMs are giant things with billions of parameters, and they’re gameable. If you’re reinforcement learning with respect to them, you will find adversarial examples for your LLM judges, almost guaranteed.”
He shares a striking example from his own experience. A team was training with RL using an LLM judge as the reward function. Initially it worked well. Then suddenly the reward scores shot up dramatically. The model appeared to have achieved perfect performance. But when examining the actual outputs, they were “complete nonsense.” Solutions would start reasonably, then devolve into “dhdhdhdh” repeated over and over.
What happened? The nonsense string turned out to be an adversarial example for the judge model. The judge had never seen anything like “dhdhdhdh” during its training, so when evaluating it in pure generalization mode, it assigned maximum reward. The student model had learned to hack its teacher.
This isn’t easily fixed. You can add “dhdhdhdh” to the judge’s training set with a low score, but there are infinitely many adversarial examples. Every time you patch one exploit, the model can find another. The judge’s billions of parameters create a vast landscape of potential vulnerabilities.
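To see why a learned judge invites this kind of gaming, here is a toy stand-in: a crude scoring heuristic with an obvious blind spot, plus two candidate outputs. The judge_score function and the strings are hypothetical (real judges are LLMs, not keyword counters), but the dynamic is the one Karpathy describes: optimize against an imperfect proxy hard enough and you reward whatever exploits the proxy, not whatever is correct.

```python
# Toy sketch of a gameable judge: any imperfect proxy for "good reasoning"
# has gaps, and an optimizer searching over outputs will find them.
# judge_score and the candidate strings are invented for illustration only.

def judge_score(step_text: str) -> float:
    """A crude proxy judge that rewards text which merely *looks* like math reasoning."""
    score = 0.0
    score += 0.5 * sum(step_text.count(tok) for tok in ("=", "+", "therefore"))
    score += 0.1 * len(set(step_text.split()))            # superficial variety bonus
    score -= 2.0 * step_text.lower().count("i guess")     # penalize hedging
    return score

candidates = [
    "2x + 3 = 7, therefore 2x = 4, therefore x = 2",        # genuine derivation
    "= = = therefore therefore therefore + + + a b c d e",  # degenerate exploit
]
for text in candidates:
    print(f"{judge_score(text):6.2f}  {text}")
# The degenerate string outscores the correct derivation, which is exactly
# what a student model optimizing against this judge would learn to emit.
```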
The Generalization Mystery
Sutton makes a striking claim about why we sometimes see good transfer learning in current systems: it’s not because we have good automated techniques for generalization. It’s because human researchers manually craft representations that transfer well.
“Critical to good performance is that you can generalize well from one state to another state,” he explains. “We don’t have any methods that are good at that. What we have are people trying different things and they settle on something, a representation that transfers well or generalizes well. But we have very few automated techniques to promote transfer, and none of them are used in modern deep learning.”
This is a remarkable statement. Gradient descent, the fundamental learning algorithm, will make models solve their training problems. But it provides no inherent mechanism for generalizing well to new situations. “Gradient descent will cause them to find a solution to the problems they’ve seen,” Sutton says. “It will not make you, if you get new data, generalize in a good way.”
When we do see good generalization, it’s because researchers designed the architecture, chose the training data, or structured the problem in ways that promote useful transfer. The learning algorithm itself doesn’t drive toward good generalization. It just drives toward fitting the training distribution.
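Here is a tiny illustration of that gap, using a polynomial fit as a stand-in for any sufficiently flexible learner (this toy is mine, not something from the interviews): a model with enough capacity to nail every training point can still fall apart just outside the training range, because nothing in the fitting procedure itself asks for good generalization.

```python
# Fitting is not generalizing: a flexible model can solve all of its
# training problems and still behave badly on nearby new inputs.
# NumPy polynomial fitting is used here purely as an illustrative stand-in.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 8)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(8)

coeffs = np.polyfit(x_train, y_train, deg=7)      # enough capacity to fit every point
x_test = np.linspace(-0.5, 1.5, 9)                # nearby, but outside the training range
y_test = np.sin(2 * np.pi * x_test)

train_err = np.max(np.abs(np.polyval(coeffs, x_train) - y_train))
test_err = np.max(np.abs(np.polyval(coeffs, x_test) - y_test))
print(f"max error on training points: {train_err:.2e}")   # tiny: the training problems are solved
print(f"max error on new points:      {test_err:.2e}")    # far larger: no generalization for free
```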
What Would Actually Work: The Experiential Paradigm
Sutton advocates for a fundamentally different approach he calls “the experiential paradigm.” Instead of learning from human-generated text or curated training problems, systems should learn from continuous interaction with an environment.
The core idea is simple but profound. Real intelligence emerges from a continuous stream of sensation, action, and reward. “Intelligence is about taking that stream and altering the actions to increase the rewards in the stream,” Sutton explains. Learning happens from the stream, and crucially, learning is about the stream itself.
This creates a different kind of knowledge than what LLMs acquire. The knowledge isn’t about what text patterns typically follow other text patterns. Instead: “Your knowledge is about if you do some action, what will happen. Or it’s about which events will follow other events.” Because this knowledge consists of predictions about the experiential stream, you can continuously test it by comparing predictions to actual experience. And you can learn continually as new experiences arrive.
Sutton outlines what such a system needs. First, a policy (what action to take in any situation). Second, a value function (how well things are going, which guides policy updates). Third, perception systems that construct state representations. And fourth, most importantly for learning: “The transition model of the world. Your belief that if you do this, what will happen? What will be the consequences of what you do?”
This world model gets learned “very richly from all the sensation that you receive, not just from the reward.” Reward is crucial but small. The vast majority of learning comes from observing what actually happens in response to your actions.
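To keep this from staying purely abstract, here is a deliberately tiny, tabular sketch of those pieces working together: an epsilon-greedy policy, a value function updated by temporal-difference learning, and a transition model estimated from the ongoing stream of experience. The two-state environment and every parameter value are invented for illustration, and perception is trivial here because the state is observed directly; this is a toy in the spirit of Sutton's framing, not a summary of his actual proposals.

```python
# Toy experiential agent: policy, value function, and learned transition model,
# all updated continuously from the stream of (state, action, reward, next state).
# Environment and parameters are made up for illustration.
import random
from collections import defaultdict

ACTIONS = ["stay", "move"]

def env_step(state: int, action: str) -> tuple[int, float]:
    """Hidden world dynamics: 'move' flips between states 0 and 1; reward comes only in state 1."""
    next_state = 1 - state if action == "move" else state
    return next_state, (1.0 if next_state == 1 else 0.0)

value = defaultdict(float)                       # value function: how well things are going
model = defaultdict(lambda: defaultdict(int))    # transition model: counts of (state, action) -> next state
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def predicted_value(state: int, action: str) -> float:
    """One-step lookahead through the learned transition model."""
    counts = model[(state, action)]
    total = sum(counts.values())
    if total == 0:
        return 0.0                               # consequences of this action still unknown
    return sum((n / total) * value[s2] for s2, n in counts.items())

def policy(state: int) -> str:
    """Epsilon-greedy over model-predicted values."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: predicted_value(state, a))

random.seed(0)
state = 0
for _ in range(2000):                            # the ongoing stream of experience
    action = policy(state)
    next_state, reward = env_step(state, action)
    model[(state, action)][next_state] += 1      # rich learning from what actually happens
    value[state] += alpha * (reward + gamma * value[next_state] - value[state])
    state = next_state                           # no separate training phase; learning never stops

print("learned values:", {s: round(v, 2) for s, v in value.items()})
```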
The Timeline Reality Check
Why will it take a decade or more to get to systems that can genuinely act as autonomous agents? Karpathy grounds his timeline in accumulated engineering challenges rather than fundamental breakthroughs.
When asked why agents aren’t ready now, his answer is straightforward: “The reason you don’t do it today is because they just don’t work. They don’t have enough intelligence, they’re not multimodal enough, they can’t do computer use and all this stuff.” The list of missing capabilities is long: continual learning, robust generalization, proper value functions, multi-agent collaboration, memory consolidation.
He draws an analogy to self-driving cars, where he spent five years working through similar challenges. “It’s a march of nines,” he explains. Getting something to work 90% of the time is just the first nine. Then you need the second nine (99%), then the third nine (99.9%), and so on. “Every single nine is a constant amount of work.”
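A quick back-of-the-envelope calculation (my numbers, not Karpathy's) shows why agents need so many nines. If each step of a long task succeeds independently with probability p, the whole n-step task succeeds with probability p to the power n, and that figure collapses fast as tasks get longer:

```python
# Why long-horizon agents need many nines: per-step reliability compounds.
# The step counts and reliabilities below are illustrative, not from the interviews.
for p in (0.9, 0.99, 0.999, 0.9999):             # one, two, three, four nines per step
    for n in (10, 100, 1000):
        print(f"per-step {p:.4f}, {n:4d} steps -> whole-task success {p ** n:.3f}")
```

Even three nines of per-step reliability gives only about a one-in-three chance of completing a thousand-step task without a single error somewhere along the way.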
For domains where failure is costly, like self-driving or production software systems, you need many nines. “Any kind of mistake leads to a security vulnerability or something like that. Millions and hundreds of millions of people’s personal Social Security numbers get leaked,” he notes about software failures. This reality check tempers expectations about rapid deployment of autonomous agents.
Sutskever’s timeline of 5 to 20 years rests on different reasoning. He focuses on the fundamental research problems around generalization and continual learning. “I feel like the problems are tractable, they’re surmountable, but they’re still difficult,” he explains. The timeline reflects his intuition from nearly two decades in the field about how long it takes to solve tractable but difficult research problems.
Will AI drive explosive economic growth once we achieve human-level capabilities? Karpathy is skeptical, expecting AI to blend into existing growth trends rather than transform them.
He looked for AI’s economic impact the way he might look for the impact of computers or mobile phones in GDP data. “You can’t find them in GDP,” he discovered. “GDP is the same exponential.” Despite technologies that transformed daily life, the overall growth rate remained steady.
His prediction for AI: “It’s just more automation. It allows us to write different kinds of programs that we couldn’t write before, but AI is still fundamentally a program. It’s a new kind of computer and a new kind of computing system. But it has all these problems, it’s going to diffuse over time, and it’s still going to add up to the same exponential.”
This contradicts the common narrative of explosive growth or radical discontinuity. Instead, AI becomes another chapter in the centuries-long story of automation and productivity growth, impressive but not transformational to the overall trajectory.
Missing Pieces: What Needs to Be Built
Despite their different perspectives, the three experts converge on several capabilities that current systems lack and future systems will need.
Continual Learning Mechanisms: Systems need to learn persistently from ongoing experience, not just during a separate training phase. As Karpathy notes: “I feel like we are redoing a lot of the cognitive tricks that evolution came up with through a very different process. But we’re going to converge on a similar architecture cognitively.”
Culture and Knowledge Sharing: Karpathy points to a completely missing dimension: “Why can’t an LLM write a book for the other LLMs? That would be cool. Why can’t other LLMs read this LLM’s book and be inspired by it or shocked by it or something like that? There’s no equivalence for any of this stuff.”
Self-Play and Multi-Agent Learning: Current systems are single agents learning in isolation. “There’s no equivalent of self-playing LLMs,” Karpathy observes, “but I would expect that to also exist.” Models should be able to generate challenging problems for each other, creating training environments without human supervision (a toy sketch of such a loop follows this list).
The Cognitive Core: Karpathy’s vision involves extracting what he calls the “cognitive core” from current models. “It’s this intelligent entity that is stripped from knowledge but contains the algorithms and contains the magic of intelligence and problem-solving and the strategies of it and all this stuff.” Separate the thinking capability from the memorization task.
Value Functions: Sutskever emphasizes that systems need better ways to evaluate intermediate states, not just final outcomes. “Maybe once people get good at value functions, they will be using their resources more productively.”
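As a toy illustration of the self-play idea above, here is a sketch in which a “proposer” generates arithmetic problems and adjusts their difficulty based on how a noisy “solver” performs. Both agents, the skill parameter, and the thresholds are hypothetical stand-ins rather than real LLMs, but the loop shows how a curriculum can emerge with no human supervision in sight.

```python
# Toy self-play curriculum: a proposer generates problems, a solver attempts them,
# and difficulty adapts to the solver's success rate with no human in the loop.
# Agents, skill levels, and thresholds are invented for illustration only.
import random

def propose(difficulty: int) -> tuple[str, int]:
    """Proposer: build a sum of `difficulty` random single-digit terms."""
    terms = [random.randint(1, 9) for _ in range(difficulty)]
    return " + ".join(map(str, terms)), sum(terms)

def solve(problem: str, skill: float) -> int:
    """Solver: answers correctly with a probability that decays as problems get longer."""
    answer = sum(int(t) for t in problem.split(" + "))
    p_correct = skill ** problem.count("+")
    return answer if random.random() < p_correct else answer + random.choice([-1, 1])

random.seed(0)
difficulty, wins = 2, 0
for episode in range(1, 201):
    problem, truth = propose(difficulty)
    wins += (solve(problem, skill=0.97) == truth)
    if episode % 20 == 0:                        # adjust the curriculum from solver results
        rate = wins / 20
        if rate > 0.8:
            difficulty += 1                      # solver is comfortable: propose harder problems
        elif rate < 0.5 and difficulty > 1:
            difficulty -= 1                      # solver is drowning: ease off
        print(f"episode {episode:3d}: success {rate:.2f}, next difficulty {difficulty}")
        wins = 0
```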
The Bitter Lesson Revisited
Sutton’s famous “Bitter Lesson” essay argued that methods leveraging computation consistently beat methods incorporating human knowledge. Does the LLM paradigm exemplify or contradict this lesson?
The question is subtle. LLMs clearly leverage massive computation, scaling up to the limits of available internet text. But they also incorporate enormous amounts of human knowledge through that text. “It’s an interesting question whether large language models are a case of the bitter lesson,” Sutton reflects.
His prediction: “I expect there to be systems that can learn from experience. Which could perform much better and be much more scalable. In which case, it will be another instance of the bitter lesson, that the things that used human knowledge were eventually superseded by things that just trained from experience and computation.”
In other words, LLMs might represent a transitional phase. They use computation to absorb human knowledge at scale. But the next phase will use computation to learn directly from experience, making the human knowledge bottleneck irrelevant. That would be the true victory of general methods over knowledge-encoding approaches.
The Path Forward
What actually needs to happen for AI to progress beyond current limitations? The experts converge on research over engineering, exploring new paradigms over scaling existing ones.
Sutskever’s company, Safe Superintelligence Inc., is “squarely an ‘age of research’ company,” focused on solving fundamental problems around generalization and continual learning. “We are making progress. We’ve actually made quite good progress over the past year, but we need to keep making more progress, more research.”
Karpathy sees the need for research but remains skeptical of discrete breakthroughs: “I still think you’re presupposing some discrete jump that has no historical precedent that I can’t find in any of the statistics and that I think probably won’t happen.” Progress will be gradual, incremental, spread across multiple fronts.
Sutton advocates most radically for paradigm shift, moving entirely away from the LLM approach toward experiential learning: “Reinforcement learning is about understanding your world, whereas large language models are about mimicking people, doing what people say you should do. They’re not about figuring out what to do.”
Conclusion: Grounded Expectations
The expert consensus that emerges from these interviews is remarkably clear despite different backgrounds and research priorities.
First, current LLMs are powerful tools but fundamentally limited. No amount of scaling will overcome their core limitations around generalization, continual learning, and goal-directed behavior.
Second, the missing capabilities are understood in principle. We know systems need continual learning, better generalization, experiential learning rather than text prediction, proper value functions, and multi-agent collaboration. The challenge is implementation, not conception.
Third, timelines are measured in years to decades, not months. Karpathy’s decade for useful agents, Sutskever’s 5-20 years for human-level continual learners, and Sutton’s open-ended research program all point to substantial time horizons.
Fourth, progress requires returning to research rather than just engineering and scaling. The low-hanging fruit from making everything bigger has been picked.
Fifth, economic impact will likely be gradual rather than explosive, blending into existing growth trends rather than transforming them discontinuously.
For educators and researchers outside AI, the key takeaway is straightforward: ignore the hype cycle and attend to what the experts building these systems actually say. LLMs represent remarkable engineering and useful tools, but they’re not on the verge of general intelligence. The path forward requires solving deep research problems that will take years to address.
Karpathy perhaps best captures the current moment: “We’re at this intermediate stage. The models are amazing. They still need a lot of work. For now, autocomplete is my sweet spot.” That’s a more grounded and useful framing than the breathless proclamations about imminent AGI or existential doom that dominate public discourse.
The problems are tractable. The timeline is substantial. The hype is overblown. And the actual work of building more capable AI systems continues, one research problem at a time.
https://nickpotkalitsky.substack.com/p/understanding-ai-in-2026-beyond-thea