Why the Bitter Lesson Works

Sutton’s Bitter Lesson is usually presented as a purely empirical observation - compute beats cleverness, and it keeps happening. But two theoretical results give the pattern real teeth.

The ceiling is compute, not capability

The universal approximation theorem says a neural network with enough parameters can approximate any continuous function on a compact domain to arbitrary accuracy. Any solution a human can handcraft, a learned model can in principle match. The theorem is an existence result - it doesn’t say training will find those weights, and that search is exactly what compute buys. So the ceiling on learned approaches isn’t capability - it’s compute. Given enough of it, the general method can find whatever the specialist found, and potentially more, because it isn’t constrained to the patterns humans thought to look for.

This is what makes the bitter lesson feel inevitable rather than accidental. The handcrafted solution isn’t reaching a region of the solution space that learning can’t access. It’s just getting there faster with less compute. That’s a real advantage - but it’s the kind of advantage that erodes as compute gets cheaper.
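As a toy illustration (not a proof), a small one-hidden-layer network trained by plain gradient descent can recover a handcrafted target like sin(x). Every choice below - hidden width, learning rate, step count - is arbitrary and illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 256).reshape(-1, 1)   # inputs
y = np.sin(x)                                        # the "handcrafted" target

H = 64                               # hidden width: more parameters -> closer fit
W1 = rng.normal(0.0, 1.0, (1, H)); b1 = np.zeros(H)
W2 = rng.normal(0.0, 0.1, (H, 1)); b2 = np.zeros(1)

lr = 0.05
for _ in range(5000):
    h = np.tanh(x @ W1 + b1)         # hidden activations, shape (256, H)
    pred = h @ W2 + b2               # network output, shape (256, 1)
    err = pred - y
    # Backpropagate the squared-error gradient through both layers.
    gW2 = h.T @ err / len(x); gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1.0 - h ** 2)
    gW1 = x.T @ dh / len(x); gb1 = dh.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

mse = float(np.mean((np.tanh(x @ W1 + b1) @ W2 + b2 - y) ** 2))
print(f"final MSE vs sin(x): {mse:.4f}")
```

The point is only that nothing special about the target is needed - the same loop fits whatever function the data encodes, at the cost of compute the handcrafted version never spends.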

The ceiling keeps moving

The scaling laws are what turn this from a theoretical possibility into a practical prediction. Model performance improves predictably with compute: test loss falls as a power law,

L(C) \propto C^{-\alpha}

where C is training compute and α is a small positive exponent.

This has held across vision, language, and games with no sign of a ceiling. We’re not bumping against fundamental limits - we’re bumping against compute limits. And compute limits keep moving.
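In practice such fits are done in log-log space, where L(C) = a · C^(−α) becomes a straight line: log L = log a − α log C. A minimal sketch on synthetic, noiseless data (the compute values and α = 0.07 are made up for illustration):

```python
import numpy as np

# Synthetic compute/loss pairs following an exact power law
# L(C) = a * C**(-alpha); illustrative values, not measurements.
compute = np.array([1e3, 1e4, 1e5, 1e6, 1e7])   # training compute (arbitrary units)
loss = 5.0 * compute ** -0.07                   # a = 5.0, alpha = 0.07

# In log-log space the power law is a straight line with slope -alpha.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
alpha = -slope
print(f"recovered alpha: {alpha:.3f}")  # recovers the 0.07 we generated
```

Real scaling-law papers fit noisy measurements the same way; the straight line in log-log space is what "no sign of a ceiling" looks like on a plot.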

What the theory doesn’t say

There are other results people cite in support of the bitter lesson - Kolmogorov complexity, No Free Lunch - and they gesture in the same direction, but they don’t quite land the argument. No Free Lunch says no algorithm is universally better across all possible problems, but we don’t face all possible problems. We face real ones with exploitable structure, which is exactly what heuristics are good at. Kolmogorov complexity says some functions are incompressible, but that doesn’t mean useful approximations require exhaustive search - humans find good approximations all the time without it.
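The "exploitable structure" point is easy to make concrete. On sorted data, a handcrafted heuristic (binary search) beats a blind scan by orders of magnitude in probe count - a sketch, with a made-up problem size:

```python
def linear_search(xs, target):
    """Blind scan: assumes nothing about the structure of xs."""
    probes = 0
    for i, v in enumerate(xs):
        probes += 1
        if v == target:
            return i, probes
    return -1, probes

def binary_search(xs, target):
    """Heuristic that exploits the fact that xs is sorted."""
    lo, hi, probes = 0, len(xs) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if xs[mid] == target:
            return mid, probes
        if xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes

xs = list(range(1_000_000))          # sorted: the exploitable structure
_, lin = linear_search(xs, 999_999)
_, bin_ = binary_search(xs, 999_999)
print(lin, bin_)                     # 1000000 probes vs 20
```

That million-fold gap is what domain knowledge buys on structured problems - which is why No Free Lunch, which averages over all problems including unstructured ones, doesn’t settle the argument either way.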

I think it’s more honest to say: universal approximation and scaling laws give the bitter lesson theoretical weight. The other results are suggestive, not conclusive.

The short-term caveat

None of this means domain knowledge is useless. Heuristics win when compute is scarce - and for most practical problems right now, compute is still scarce relative to what brute-force search would need. A well-chosen inductive bias can get you 80% of the way there with a fraction of the compute. That’s valuable. That’s engineering.

But it’s a time-limited advantage. The set of problems where general methods have enough compute to win gets larger every year. What the theory adds is that this isn’t a coincidence - it’s the expected trajectory.