Erasure Coding
Erasure coding is a way to store data redundantly without just copying it multiple times.
The naive approach to fault tolerance: store three copies of everything. If one fails, you still have two. Works, but costs 3x storage.
Erasure coding does better. You take your data, split it into k chunks, then use math (Reed-Solomon codes, usually) to generate m additional “parity” chunks. Now you can reconstruct the original from any k of the k+m chunks.
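To make the parity idea concrete, here’s a minimal sketch of the simplest possible erasure code: k data chunks plus a single XOR parity chunk (RAID-5-style parity, not full Reed-Solomon, which needs Galois-field math to support more than one parity chunk). The function names and chunk contents are made up for illustration.

```python
def xor_chunks(chunks):
    """Byte-wise XOR of equal-length chunks."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

def encode(data_chunks):
    """k data chunks -> k data chunks + 1 parity chunk (survives any single loss)."""
    return list(data_chunks) + [xor_chunks(data_chunks)]

def recover(stored_chunks, missing_index):
    """Rebuild the one missing chunk by XOR-ing everything that survived."""
    survivors = [c for i, c in enumerate(stored_chunks) if i != missing_index]
    return xor_chunks(survivors)

data = [b"AAAA", b"BBBB", b"CCCC"]                    # k = 3 data chunks
stored = encode(data)                                 # 4 chunks go to disk
assert recover(stored, missing_index=1) == data[1]    # "lose" chunk 1, rebuild it
```

One parity chunk tolerates one failure. To tolerate m failures you need m parity chunks computed with richer math, which is where Reed-Solomon comes in.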
A common configuration is (10,4): 10 data chunks plus 4 parity chunks, 14 chunks total. You can lose any 4 of the 14 and still recover everything. That’s at least as good fault tolerance as 3x replication (which only survives two lost copies), but with 14/10 = 1.4x storage overhead instead of 3x.
Data: [D1] [D2] [D3] [D4] ... [D10]
Parity: [P1] [P2] [P3] [P4]
Lose any 4? Still recoverable from the remaining 10.
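Here’s a hedged sketch of that “any k of k+m” property, using the polynomial view of Reed-Solomon: fit a degree-(k-1) polynomial through the k data values, evaluate it at m extra points to get parity, and rebuild from any k surviving points by Lagrange interpolation. Real codecs work on fixed-size byte blocks over GF(2^8) with precomputed matrices; this toy stores one integer per chunk and is only meant to show the recovery math.

```python
import random

P = 2**31 - 1  # a Mersenne prime; production codecs use GF(2^8) or GF(2^16)

def interpolate(points, x, p=P):
    """Evaluate the unique degree-(len(points)-1) polynomial through `points` at x, mod p."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % p
                den = den * (xi - xj) % p
        total = (total + yi * num * pow(den, -1, p)) % p
    return total

def encode(data, m, p=P):
    """Systematic encode: data values sit at x = 0..k-1, parity at x = k..k+m-1."""
    k = len(data)
    points = list(enumerate(data))
    return data + [interpolate(points, x, p) for x in range(k, k + m)]

def reconstruct(surviving, k, p=P):
    """Rebuild the k data values from ANY k surviving (index, value) pairs."""
    pts = surviving[:k]
    return [interpolate(pts, x, p) for x in range(k)]

k, m = 10, 4
data = [random.randrange(P) for _ in range(k)]
stored = list(enumerate(encode(data, m)))        # 14 (index, value) chunks

lost = set(random.sample(range(k + m), m))       # drop any 4 of them
surviving = [(x, v) for x, v in stored if x not in lost]

assert reconstruct(surviving, k) == data
```

Because the code is systematic (the first k chunks are the raw data), a read with nothing missing never touches the parity at all.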
The tradeoff is compute and repair traffic. Replication is simple - just copy bytes. Erasure coding requires encoding on write and decoding on read (though reads are fast if no chunks are missing, since the data chunks are stored as-is). Recovering a lost chunk means reading k surviving chunks and re-running the math, versus copying a single replica.
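To put rough numbers on that, here’s a back-of-the-envelope comparison, assuming hypothetical 1 GB chunks and ignoring the cost of the parity math itself:

```python
CHUNK_GB = 1.0

def storage_overhead(k, m):
    return (k + m) / k        # bytes stored per byte of user data

def repair_read_gb(k):
    return k * CHUNK_GB       # bytes read to rebuild one lost chunk

# 3x replication: repair is just copying one surviving replica
print(f"3x replication: overhead 3.0x, repair reads {CHUNK_GB:.0f} GB")

# (10,4) erasure coding: repair reads k surviving chunks
k, m = 10, 4
print(f"({k},{m}) EC:      overhead {storage_overhead(k, m):.1f}x, "
      f"repair reads {repair_read_gb(k):.0f} GB")
```

Less than half the storage, but roughly 10x the read traffic every time a chunk has to be rebuilt.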
This is why it’s common in “cold” storage where data is written once and read rarely, and less common in hot paths where latency matters. HDFS, Ceph, S3, Azure Blob Storage - all use erasure coding under the hood for durability without burning 3x storage.