The Bitter Lesson: Theoretical Foundations
Sutton’s Bitter Lesson is usually presented as an empirical observation - historically, general methods that leverage compute have beaten human-designed cleverness. But there are theoretical reasons to expect this pattern.
Universal approximation. Neural networks with enough parameters can approximate any continuous function on a compact domain to arbitrary accuracy. This means learned models can, in principle, match any handcrafted solution that computes such a function - you just need enough capacity.
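A minimal sketch of the idea, not from the original text: the target function, hidden widths, and random-feature setup below are my own illustrative choices. A one-hidden-layer network with random ReLU features, fit by least squares, approximates an arbitrary continuous target more closely as its width grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # Stand-in for a "handcrafted solution": any continuous function will do.
    return np.sin(3 * x) + 0.5 * np.abs(x)

def random_relu_features(x, width):
    # One hidden layer with fixed random weights: phi(x) = relu(x W + b).
    W = rng.normal(size=(1, width))
    b = rng.normal(size=width)
    return np.maximum(x[:, None] * W + b, 0.0)

x_train = np.linspace(-1, 1, 500)
y_train = target(x_train)

for width in (10, 100, 1000):
    phi = random_relu_features(x_train, width)
    # Fit only the output layer, in closed form, by least squares.
    coef = np.linalg.lstsq(phi, y_train, rcond=None)[0]
    err = np.max(np.abs(phi @ coef - y_train))
    print(f"width={width:5d}  max abs error={err:.4f}")
# The error shrinks as width grows: more capacity, closer approximation.
```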
Kolmogorov complexity. The shortest representation of a function may be incompressible - no simpler description of it exists, so there is no analytical shortcut to finding it. If so, brute-force search (or learning, which is a form of search) is the only way to get there.
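For reference, the standard definition, in the usual notation (a fixed universal machine U and programs p; these symbols are not from the original text):

```latex
% Kolmogorov complexity: length of the shortest program p that makes a fixed
% universal machine U output x. K_U is uncomputable in general.
K_U(x) \;=\; \min \{\, |p| \;:\; U(p) = x \,\}
% "Incompressible" means K_U(x) \approx |x|: no description of the object is
% much shorter than the object itself, so there is no shortcut to recover it.
```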
No Free Lunch. Averaged over all possible problems, no algorithm outperforms any other. Handcrafted heuristics commit to specific structure in advance; general learning methods adapt to whatever structure the data actually exhibits. Since the problems we care about are not drawn uniformly from all possibilities, the general approach, given enough compute and data, tends to win on the problems that matter.
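Stated in the notation of Wolpert and Macready, where d^y_m is the sequence of observed cost values after m evaluations (these symbols are theirs, not the original text's): for any two search algorithms a_1 and a_2,

```latex
% No Free Lunch: summed over all objective functions f, any two algorithms
% a_1 and a_2 induce the same distribution over observed cost sequences d^y_m.
\sum_{f} P\!\left(d^{y}_{m} \mid f, m, a_1\right)
  \;=\;
\sum_{f} P\!\left(d^{y}_{m} \mid f, m, a_2\right)
```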
Scaling laws. Empirically, model performance improves predictably with compute: loss falls roughly as a power law as training compute grows. This has held across vision, language, and games. It suggests we’re not hitting fundamental limits - just compute limits.
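An illustrative form of such a power law (the symbols and the irreducible-loss term are generic placeholders, not figures from the text; the actual constants are fit empirically per domain):

```latex
% Generic compute scaling law: test loss L as a function of training compute C.
L(C) \;\approx\; L_{\infty} + \left(\frac{C_0}{C}\right)^{\alpha},
\qquad \alpha,\, C_0 > 0
% L_\infty is the irreducible loss; on a log-log plot the reducible part is a
% straight line with slope -\alpha, which is what "predictable scaling" means.
```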
VC dimension and generalization. Larger models can fit more complex functions, and VC theory says their generalization gap shrinks as data grows. With enough data, high-capacity models that learn from scratch will outperform constrained models whose baked-in assumptions become a source of irreducible error when those assumptions are wrong.
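One standard form of the VC generalization bound (constants differ across derivations; d, n, and delta are the usual symbols, not values from the text): with probability at least 1 - delta over a sample of size n, every hypothesis h in a class of VC dimension d satisfies

```latex
% VC bound: the gap between true risk R(h) and empirical risk \hat{R}(h)
% shrinks as n grows relative to the capacity d.
R(h) \;\le\; \hat{R}(h)
  + \sqrt{\frac{d\left(\ln\frac{2n}{d} + 1\right) + \ln\frac{4}{\delta}}{n}}
% Large d demands more data, but once n is large enough the gap vanishes and
% the low-bias, high-capacity model wins.
```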
The implication: heuristics can win in the short term (less data, less compute needed), but scale favors general learning. Domain knowledge is a bet that your problem has exploitable structure. That bet might be wrong, or the structure might be learnable anyway.
This doesn’t mean domain knowledge is useless - it can accelerate learning and help in small-data regimes. But the long-run trajectory favors compute and general methods.