The Bitter Lesson

Richard Sutton’s 2019 essay “The Bitter Lesson” makes a simple argument: in AI, general methods that scale with compute beat clever engineering. Every time. It’s one of those observations that sounds reductive until you look at the track record.

The pattern

Researchers spend years encoding human knowledge into systems - handcrafted rules, expert heuristics, domain-specific tricks. Then someone comes along with a simpler approach that just throws more computation at the problem, and it wins.

The clearest example is chess. For decades, the best chess engines were built on human understanding. Grandmasters and computer scientists worked together to encode evaluation functions - piece values, positional heuristics, opening books, endgame tables. Thousands of rules, painstakingly tuned, representing centuries of accumulated chess knowledge. These engines were impressive. They could beat almost any human.

Then AlphaZero threw all of it away. No opening books. No endgame tables. No human chess knowledge at all. It started from the rules of the game, played against itself millions of times, and within hours it was crushing the best traditional engines. The patterns it discovered weren’t the same patterns humans had encoded - some of its play looked alien, sacrificing material in ways no grandmaster would. It had found something in the search space that human expertise had missed, because the search space was too large for humans to explore fully.

The same story played out in computer vision (years of hand-designed feature detectors, then CNNs learned better features from scratch), speech recognition (decades of hand-built acoustic and language models, then end-to-end deep learning won), machine translation, protein folding, code generation. Each time, the community was surprised. Each time, the lesson was the same.

Why it’s bitter

It’s “bitter” because it feels wrong. If you’ve spent years understanding the structure of a problem - really understanding it, deeply - you want that to matter. You want the insight to be worth something. And in the short term, it is. Domain knowledge gives you a head start. A well-crafted heuristic can outperform a general method when data and compute are limited.

But the head start doesn’t last. Compute gets cheaper over time. Human expertise doesn’t get cheaper. And it turns out that learning from data, given enough of it, finds patterns we wouldn’t have thought to encode. Not because the patterns are obvious, but because the search space is too large for humans to explore manually.

This is what makes it bitter rather than just interesting. It’s not that human knowledge is wrong. It’s that it’s eventually unnecessary. The handcrafted features in computer vision weren’t bad - they were genuine insights about how vision works. They just couldn’t compete with a system that had the capacity to discover those insights and more on its own.

Why it keeps happening

Sutton presents this as an empirical observation, but there’s a reason it’s not a coincidence.

The universal approximation theorem says that a neural network with enough hidden units can approximate any continuous function on a compact domain to arbitrary accuracy. The theorem only guarantees that a good approximation exists - it says nothing about whether training will find it - but it means any handcrafted solution a human can build, a learned model can in principle match, and potentially surpass, because it isn’t limited to the patterns humans thought to look for. The question is never “can a general method solve this?” It’s “do we have enough compute for it to solve this yet?”
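As a toy illustration of that claim (not a proof - the network size, learning rate, and target function here are all arbitrary choices), a one-hidden-layer network trained by plain gradient descent can approximate a fixed target function, with sin(3x) standing in for whatever rule a human might have handcrafted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target function: a stand-in for some handcrafted rule we want to match
X = np.linspace(-1, 1, 200).reshape(-1, 1)
y = np.sin(3 * X)

# One hidden layer of tanh units (sizes and learning rate are illustrative)
H, lr = 50, 0.1
W1 = rng.normal(0, 1.0, (1, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 1.0 / np.sqrt(H), (H, 1)); b2 = np.zeros(1)

losses = []
for _ in range(3000):
    h = np.tanh(X @ W1 + b1)          # hidden activations
    yhat = h @ W2 + b2                # network output
    err = yhat - y
    losses.append(float((err ** 2).mean()))
    # Backpropagate the mean-squared error through both layers
    d_out = 2 * err / len(X)
    dW2 = h.T @ d_out; db2 = d_out.sum(0)
    dh = d_out @ W2.T * (1 - h ** 2)
    dW1 = X.T @ dh; db1 = dh.sum(0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
```

The fit error keeps shrinking as training proceeds - the theorem guarantees a good approximation exists somewhere in the parameter space; gradient descent plus compute is how we go looking for it.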

And the answer to “yet” keeps changing, because scaling laws show that model performance improves predictably with compute - a power-law relationship that has held across many orders of magnitude. Every year, the set of problems where general methods have enough compute to win gets larger. The heuristics that outperform today are the heuristics that get outperformed tomorrow.
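The shape of that claim is easy to state: loss falls as a power of compute, L(C) = a · C^(−b), which is a straight line on log-log axes. A minimal sketch - the coefficients here are made up for illustration, not measured values:

```python
import numpy as np

# Hypothetical coefficients for illustration; real ones are fit empirically
a, b = 10.0, 0.05

compute = np.logspace(18, 24, 7)   # compute budgets in FLOPs
loss = a * compute ** (-b)         # power-law loss curve

# A power law is linear in log-log space; the fitted slope recovers -b
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)
```

Each doubling of compute multiplies loss by the same constant factor, 2^(−b) - which is exactly why “just add compute” keeps paying off instead of saturating.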

This is why the lesson keeps surprising people. At any given moment, the clever approach is winning on the problems that are just beyond what current compute can handle. It looks like insight is necessary. Then compute catches up, and it wasn’t. The frontier moves, but the pattern at the frontier stays the same.

Beyond AI

What I find most interesting about the bitter lesson is that it generalizes past AI. It’s really about the tension between human understanding and search - and that tension shows up everywhere.

We build heuristics because we can’t search exhaustively. That’s bounded rationality - Herbert Simon’s observation that we satisfice because we don’t have the information or processing power to optimize. Our shortcuts are good. Evolution and experience gave us useful ones. But they’re workarounds for a compute limitation. A chess grandmaster’s intuition about piece placement is a heuristic built from thousands of games. AlphaZero had the compute to play millions. The grandmaster’s heuristic isn’t wrong - it’s just a lossy compression of what exhaustive search would find.

That’s the connection to compression. Understanding is compression - finding the shorter description of a complex phenomenon. A handcrafted heuristic is a human-compressed version of the problem. Newton’s laws compress thousands of observations into F = ma. The chess grandmaster compresses thousands of games into “control the center.” These compressions are genuinely powerful. But they’re human-scale compressions, limited by what we can hold in our heads and what we’ve had time to observe. A system with enough compute can find compressions we’d never reach - not because it’s smarter, but because it can search a larger space.
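A toy version of the “laws as compression” point, with synthetic data and an arbitrary noise level: a thousand noisy force measurements collapse into a single fitted coefficient plus tiny residuals, which is a far shorter description than the raw table:

```python
import numpy as np

rng = np.random.default_rng(0)

# A thousand synthetic "experiments": masses, accelerations, measured forces
mass = rng.uniform(1, 10, 1000)
accel = rng.uniform(1, 10, 1000)
force = mass * accel + rng.normal(0, 0.01, 1000)  # small measurement noise

# One-parameter model F = k * m * a, fitted by least squares
x = mass * accel
k = (x @ force) / (x @ x)
residuals = force - k * x

# The fitted k lands near 1 and the residuals are tiny relative to the data:
# one number plus noise now stands in for the whole table of measurements
```

The fitted coefficient plus a noise floor is the compressed description; the thousand raw numbers were the uncompressed one. Human science found this particular compression by insight - the point of the bitter lesson is that enough search can find such compressions too.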

In software engineering, the pattern is quieter but real. A developer spends weeks hand-tuning a query plan, and then the database gets faster hardware and a better query optimizer and the hand-tuning becomes irrelevant. A team builds a clever architecture to work around memory constraints, and then memory gets cheap and the cleverness becomes unnecessary complexity. Clever workarounds are bets that a constraint will persist long enough to justify the engineering. When the constraint disappears faster than expected, the workaround becomes debt.

There’s something humbling about this. The problems we find hard aren’t hard because they require human insight specifically. They’re hard because they require search through a large space. We happen to be good at certain kinds of search because evolution optimized us for it. But “good at certain kinds of search” is a weaker position than “has more compute for general search.” And compute keeps getting cheaper.

What to take from it

Sutton’s practical implication is: bet on approaches that scale with compute, not on clever tricks that don’t. This is why AI research has converged on large models and self-supervised learning, and why the people who internalized the bitter lesson early are the ones building the systems that are working now.

But I think the deeper lesson is about what human understanding actually is. It’s not magic. It’s compression under constraints. It’s the best search we can do with the hardware we have. That makes it valuable - especially now, when compute is still limited relative to the hardest problems. Domain knowledge accelerates learning. It points the search in useful directions. It matters in data-scarce settings where brute force can’t get enough signal.

It’s just not a permanent advantage. The long-run trajectory favors general methods and more compute, and the long run keeps getting shorter.