Ordering in Distributed Systems
Ordering is one of the hard problems in distributed systems. When events happen on different machines, what does “before” and “after” even mean?
The core issue: there’s no global clock. Each node has its own clock, and they drift. Network messages arrive out of order or get delayed. Two events that seem simultaneous might not be.
Why it matters: Many things depend on order. Database transactions, cache invalidation, event processing. If node A updates a value and node B reads it, we need to know which happened first.
Strategies:
Lamport clocks. Each node keeps a counter. Increment on local events, update to max(local, received)+1 on message receipt. Gives you a partial order - if event A’s timestamp is less than B’s, A might have happened before B. But equal or greater timestamps don’t tell you much.
Vector clocks. Each node tracks a counter per node. More expensive (O(n) space per event) but captures causality precisely. If A’s vector is strictly less than B’s in all components, A happened before B. Otherwise they’re concurrent.
Total order broadcast. Use consensus (Paxos, Raft) to agree on a single order for all events. Expensive but gives you strong guarantees - everyone sees the same sequence.
Sequencers. Designate one node to assign sequence numbers. Simple but creates a bottleneck and single point of failure.
Accept ambiguity. CRDTs and similar techniques design data structures that don’t need strict ordering - operations can be applied in any order and converge to the same result.
The tradeoff: stronger ordering guarantees cost more coordination, which means more latency and less availability. Pick based on what your application actually needs.