Synchronization in Distributed Systems
Synchronization in distributed systems is about getting independent nodes to agree on something: the time, shared state, the order of events, or who’s in charge.
It’s hard because:
- No shared memory - nodes can’t just read each other’s state
- No global clock - physical clocks drift, so timestamps from different nodes can’t be compared directly
- Networks are unreliable - messages get lost, delayed, reordered
- Nodes fail - and you can’t always tell if a node is dead or just slow
Common approaches:
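Logical clocks (Lamport clocks, vector clocks) order events by causality rather than physical time. A Lamport clock guarantees that if event a causally precedes event b, a gets the smaller timestamp, but a smaller timestamp doesn’t prove causal precedence. Vector clocks capture causality exactly, so they can also detect when two events are concurrent. Useful for ordering events when you don’t need wall-clock timestamps.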
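A minimal sketch of both clock update rules, assuming a fixed set of processes numbered 0..n-1 and reliable message delivery. The class and function names (`LamportClock`, `VectorClock`, `happened_before`, `concurrent`) are hypothetical, chosen for illustration.

```python
class LamportClock:
    """Scalar logical clock: a single counter per process."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        # Local event or message send: advance the counter.
        self.time += 1
        return self.time

    def receive(self, msg_time: int) -> int:
        # Jump past the sender's timestamp, then count the receive event itself.
        self.time = max(self.time, msg_time) + 1
        return self.time


class VectorClock:
    """One counter per process; supports exact causality comparisons."""

    def __init__(self, node_id: int, n_nodes: int) -> None:
        self.node_id = node_id
        self.clock = [0] * n_nodes

    def tick(self) -> list[int]:
        self.clock[self.node_id] += 1
        return list(self.clock)  # copy, so the caller can attach it to a message

    def receive(self, msg_clock: list[int]) -> list[int]:
        # Element-wise max merges what the sender knew, then count the receive.
        self.clock = [max(a, b) for a, b in zip(self.clock, msg_clock)]
        self.clock[self.node_id] += 1
        return list(self.clock)


def happened_before(a: list[int], b: list[int]) -> bool:
    # a -> b iff every entry of a is <= the matching entry of b, and they differ.
    return all(x <= y for x, y in zip(a, b)) and a != b


def concurrent(a: list[int], b: list[int]) -> bool:
    return not happened_before(a, b) and not happened_before(b, a)


p0, p1 = VectorClock(0, 2), VectorClock(1, 2)
send = p0.tick()          # p0 sends a message stamped [1, 0]
other = p1.tick()         # unrelated event on p1: [0, 1]
recv = p1.receive(send)   # p1 receives p0's message: [1, 2]
print(happened_before(send, recv), concurrent(send, other))
```

With vector clocks, ordering questions are answered by comparing the vectors directly; a Lamport timestamp alone can only rule out "b happened before a" when a’s number is smaller.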
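Consensus protocols (Paxos, Raft) get nodes to agree on a single value or a sequence of values, and keep making progress as long as a majority of nodes can communicate. They are essential for leader election, replicated state machines, and distributed transactions. The cost is coordination overhead and latency: every decision needs at least one round of messages to a majority.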
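The core safety idea behind majority-based consensus can be sketched with one rule Raft uses for leader election: each node grants at most one vote per term, and a candidate needs a majority to win. Because any two majorities overlap, two candidates cannot both win the same term. This is a heavily simplified sketch, assuming an in-memory cluster with no timeouts, no log up-to-date check, and no replication; `Voter`, `majority`, and `run_election` are hypothetical names, not part of any Raft library.

```python
from dataclasses import dataclass
from typing import Optional


def majority(n_nodes: int) -> int:
    # Smallest group guaranteed to overlap any other majority.
    # A cluster of 2f + 1 nodes therefore tolerates f crashed nodes.
    return n_nodes // 2 + 1


@dataclass
class Voter:
    current_term: int = 0
    voted_for: Optional[int] = None

    def request_vote(self, term: int, candidate_id: int) -> bool:
        # Reject candidates from stale terms.
        if term < self.current_term:
            return False
        # Seeing a newer term resets this node's vote.
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
        # At most one vote per term (repeating a vote for the same candidate is fine).
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False


def run_election(voters: list[Voter], term: int, candidate_id: int) -> bool:
    # `voters` models the whole cluster; the candidate wins with a majority of grants.
    votes = sum(v.request_vote(term, candidate_id) for v in voters)
    return votes >= majority(len(voters))


cluster = [Voter() for _ in range(5)]
print(run_election(cluster, term=1, candidate_id=1))  # first candidate in term 1
print(run_election(cluster, term=1, candidate_id=2))  # rival in the same term
```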
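Two-phase commit (2PC) coordinates a transaction across nodes. The coordinator asks every participant to prepare (vote yes or no), then tells everyone the outcome: commit only if all voted yes, otherwise abort. It’s simple, but it blocks if the coordinator fails after participants have voted yes: they can neither commit nor abort on their own until they learn the decision, typically by waiting for the coordinator to recover.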
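The two phases can be sketched as a single-process simulation, assuming in-memory participants that never crash and no durable logging. `Participant` and `two_phase_commit` are hypothetical names; a real implementation would persist each vote and decision before sending it.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Participant:
    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> Vote:
        # Phase 1: a real participant would durably log its vote before replying.
        if self.can_commit:
            self.state = "prepared"  # from here on it may not abort unilaterally
            return Vote.YES
        self.state = "aborted"
        return Vote.NO

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: list[Participant]) -> bool:
    # Phase 1 (prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    decision = all(v is Vote.YES for v in votes)
    # Blocking window: if the coordinator crashed here, "prepared" participants
    # could neither commit nor abort until they learned the decision.
    # Phase 2 (commit/abort): broadcast the single decision to everyone.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision


print(two_phase_commit([Participant("a"), Participant("b")]))
print(two_phase_commit([Participant("a"), Participant("b", can_commit=False)]))
```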
Gossip protocols spread information probabilistically: each node tells a few randomly chosen peers, those peers tell a few of theirs, and state eventually converges. Scalable, but it only guarantees eventual consistency.
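A small simulation shows how quickly push gossip reaches everyone. It assumes synchronous rounds, uniformly random peer choice, and no message loss, which real networks don’t provide; `gossip_rounds` and its `fanout` parameter are hypothetical names for this sketch.

```python
import random


def gossip_rounds(n_nodes: int, fanout: int, seed: int = 0) -> int:
    """Simulate push gossip; return rounds until every node has the update."""
    if not 1 <= fanout <= n_nodes - 1:
        raise ValueError("fanout must be between 1 and n_nodes - 1")
    rng = random.Random(seed)  # seeded so runs are reproducible
    informed = {0}             # node 0 starts with the update
    rounds = 0
    while len(informed) < n_nodes:
        rounds += 1
        reached = set()
        for node in informed:
            # Pick `fanout` distinct peers other than `node`, uniformly at random
            # (they may already be informed, which is wasted but harmless).
            picks = rng.sample(range(n_nodes - 1), fanout)
            reached.update(p + 1 if p >= node else p for p in picks)
        # Nodes informed this round only start pushing next round.
        informed |= reached
    return rounds


for n in (100, 1_000, 10_000):
    print(n, gossip_rounds(n, fanout=3))
```

Comparing cluster sizes shows the round count growing far more slowly than the number of nodes (roughly logarithmically), which is where gossip’s scalability comes from.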
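CRDTs (conflict-free replicated data types) sidestep the problem by designing data structures whose concurrent updates commute, so replicas that have seen the same set of updates converge to the same state regardless of the order in which they arrived.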
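The grow-only counter (G-Counter) is one of the simplest state-based CRDTs and shows why order stops mattering: each replica increments only its own slot, and merging takes the element-wise maximum, an operation that is commutative, associative, and idempotent. This is a minimal sketch assuming replicas exchange their full state; the class name `GCounter` and its methods are illustrative, not a specific library’s API.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter only supports increments")
        # A replica only ever writes its own slot, so writes never conflict.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Max is commutative, associative and idempotent: merges may arrive
        # in any order, any number of times, and still converge.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)


a, b = GCounter("a"), GCounter("b")
a.increment()
a.increment()
b.increment(5)
a.merge(b)
b.merge(a)
print(a.value(), b.value())
```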
The fundamental tradeoff: stronger synchronization guarantees require more coordination, which means more latency and reduced availability when nodes can’t communicate. Systems like Spanner use tightly synchronized clocks (GPS receivers and atomic clocks, exposed through its TrueTime API as a time interval with bounded uncertainty) to reduce this coordination cost, but that requires expensive infrastructure.
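Spanner’s TrueTime reports the current time as an interval guaranteed to contain true time, and a transaction’s commit is not made visible until its timestamp is certainly in the past (“commit wait”). The sketch below mimics that rule with a hypothetical `BoundedClock` over the local clock, assuming a fixed uncertainty `epsilon` that the local clock actually honors; it is not Spanner’s API. The wait comes out to roughly 2 × epsilon, which is why tighter clock bounds translate directly into lower commit latency.

```python
import time
from dataclasses import dataclass


@dataclass
class Interval:
    earliest: float
    latest: float


class BoundedClock:
    """Hypothetical TrueTime-like clock: true time assumed within +/- epsilon of local time."""

    def __init__(self, epsilon: float) -> None:
        self.epsilon = epsilon

    def now(self) -> Interval:
        t = time.time()
        return Interval(t - self.epsilon, t + self.epsilon)

    def after(self, ts: float) -> bool:
        # True only once ts is definitely in the past.
        return self.now().earliest > ts


def commit_with_wait(clock: BoundedClock) -> tuple[float, float]:
    # Choose a commit timestamp no earlier than any possible current true time.
    commit_ts = clock.now().latest
    start = time.monotonic()
    # Commit wait: don't acknowledge until commit_ts has certainly passed.
    while not clock.after(commit_ts):
        time.sleep(clock.epsilon / 10)
    return commit_ts, time.monotonic() - start


clock = BoundedClock(epsilon=0.005)  # assumed 5 ms uncertainty
ts, waited = commit_with_wait(clock)
print(f"waited {waited * 1000:.1f} ms")
```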
Most systems pick a point on the spectrum based on what they can afford to get wrong.