Designing Data-Intensive Applications

Martin Kleppmann's Designing Data-Intensive Applications is the best book I know on how data systems actually work.

Most resources on databases and distributed systems are either too shallow (marketing docs) or too deep (academic papers). DDIA hits the middle: rigorous enough to be useful, accessible enough to actually read.

The book covers:

Storage engines. How databases store and retrieve data - B-trees vs LSM trees, indexing strategies, when to use each. After reading this you'll understand why one database handles a write-heavy workload well while another is better suited to reads.

Replication and partitioning. How data gets distributed across machines. Leader-based vs leaderless replication, the actual tradeoffs involved, what can go wrong.

Consistency and consensus. Linearizability, eventual consistency, Paxos, Raft - the concepts that underpin distributed coordination. Kleppmann explains these clearly without handwaving past the hard parts.

Batch and stream processing. MapReduce, Kafka, Flink - how data flows through systems at scale. The evolution from batch to streaming and what each is good for.
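
To give a flavor of the storage-engine material, here is a minimal sketch of the LSM-tree idea: writes go to an in-memory table, which is flushed to an immutable sorted segment once it fills up, and reads check the newest data first. This is my own toy illustration, not code from the book. The class name `TinyLSM`, the `memtable_limit` parameter, and the in-memory lists standing in for on-disk segment files are all assumptions made for the example, and it leaves out compaction, write-ahead logging, and deletes entirely.

```python
import bisect


class TinyLSM:
    """Toy LSM-style store: a memtable plus immutable sorted segments.

    Hypothetical illustration only. Real engines keep segments on disk
    and merge them in the background (compaction).
    """

    def __init__(self, memtable_limit=2):
        self.memtable = {}      # most recent writes, unsorted
        self.segments = []      # oldest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes only touch memory, which is why this write path is cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the memtable into a sorted, never-modified segment.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then segments newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            # (key,) sorts just before any (key, value) pair with the same key.
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None
```

Even in this toy version the tradeoff shows: a write is a single dictionary update, while a read for a missing key has to look through every segment. A B-tree makes the opposite bargain, updating pages in place so that a lookup follows one path through the tree.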

What makes the book good isn't just the coverage - it's the approach. Kleppmann focuses on why systems work the way they do, not just what they do. You come away with mental models that transfer to new systems you haven't seen yet.

If you build anything that stores or processes data, this book is worth your time.