
The binary search of distributed programming - antirez
http://antirez.com/news/102
======
eternalban
Alright man, I think you need a public push to go and embrace TLA :)

Here is the bakery algorithm from the tlaps examples:

[https://github.com/cartazio/tlaps/blob/master/examples/Bakery.tla](https://github.com/cartazio/tlaps/blob/master/examples/Bakery.tla)
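
For flavor, here is a rough Python translation of the same bakery algorithm
(thread count and names are mine, and it leans on CPython's GIL standing in
for sequential consistency; the TLA+ spec above is the rigorous version):

    import threading

    N = 3                   # number of threads (arbitrary for this sketch)
    choosing = [False] * N  # thread i is currently picking a ticket
    number = [0] * N        # ticket numbers; 0 means "not in line"
    counter = 0             # shared state to demonstrate mutual exclusion

    def lock(i):
        # Take a ticket larger than any ticket currently held.
        choosing[i] = True
        number[i] = 1 + max(number)
        choosing[i] = False
        for j in range(N):
            if j == i:
                continue
            # Wait until thread j has finished picking its ticket.
            while choosing[j]:
                pass
            # Wait while j holds a smaller ticket (ties broken by index).
            while number[j] != 0 and (number[j], j) < (number[i], i):
                pass

    def unlock(i):
        number[i] = 0

    def worker(i):
        global counter
        for _ in range(1000):
            lock(i)
            counter += 1  # critical section
            unlock(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)  # expect N * 1000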

------
throwaway41597
> Usually it is hard with distributed computations to say what happened before
> and what happened after. Using those IDs you always know the order of
> certain events.

Which "certain events"? The typical case I can think of wouldn't work:

    - at time t0, event e0 occurs on client c0, and an ID is requested
    - at time t1 > t0, event e1 occurs on client c1, and an ID is requested
    - id1 = 62 is generated and returned to client c1
    - id0 = 63 is generated afterwards, because of network latency

The IDs say that event e1 reached the ID generation servers before event e0,
but I don't see when this would be useful: e0 still happened first, and it
may even be causally older than e1.

Am I missing something?

~~~
random42
My understanding is that only server-side events are considered. So from an
ordering perspective, if e0 reaches the server later than e1 due to network
latency, id0 > id1 is the correct behaviour.
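
A toy sketch of that reading (everything here is invented for illustration):
the generator hands out IDs in arrival order, so the IDs define a total order
over requests as the server sees them, not over the client-side events that
triggered them:

    import itertools

    # Hypothetical single-node generator: IDs order arrivals, nothing else.
    _next_id = itertools.count(62)

    def generate_id():
        return next(_next_id)

    # e0 happens on c0 before e1 happens on c1, but c1's request arrives first:
    id1 = generate_id()  # c1's request -> 62
    id0 = generate_id()  # c0's request, delayed by the network -> 63

    # id0 > id1 reflects arrival order at the generator, which is the only
    # order a generator can promise; the real-time order of e0/e1 is lost.
    assert id0 > id1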

~~~
throwaway41597
By "client" I meant client of the ID generation servers, may they be web
servers that run DB clients, or cellphones. When someone uses a distributed ID
generation cluster, they presumably want large throughput, like millions of
IDs per second, and latency will likely be a problem.

------
fernly
"Here fsync at every write is mandatory because if nodes go down and restart,
they MUST have the latest value of the “current” key."

Why couldn't a restarting node simply run the algorithm and set its own
"current" from the $NEXTID it gets? In other words, a single
failed-and-restarting node does not need to rely on having a recent disk
value for "current", as long as it can call on a majority of non-failed
nodes.

Only if _all_ nodes failed (a system-wide crash) and had to restart might the
consensus value of "current" be less than some (possibly all, if it reset to
0) of the IDs that were issued prior to the crash.

However, there could be a protocol for recovery from a system-wide crash,
requiring all nodes, before they go on-line, to scan their most recent ID-
stamped transactions and start their "current" at the maximum ID seen. Then,
following this disaster, you might issue IDs that were less than some issued
pre-crash, but that would not matter as long as none of the "lost" IDs was
ever recorded in a transaction.
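
A sketch of that recovery rule (the log format and names are mine, assuming
transactions are durably stamped with the IDs they used): on restart after a
total outage, each node seeds "current" from the largest ID it can find in
its own log before rejoining:

    def recover_current(transaction_log):
        """Seed the counter from the largest ID stamped on any durable
        transaction; 0 if the log is empty. The log is assumed to be an
        iterable of (id, payload) records."""
        return max((tx_id for tx_id, _ in transaction_log), default=0)

    # After a system-wide crash, a node replays its local log:
    log = [(61, "debit"), (63, "credit")]  # 62 was issued but never recorded
    current = recover_current(log)         # -> 63; the "lost" 62 is harmless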

~~~
skybrian
The problem is that a single node doesn't know whether it can safely skip
fsync, because it doesn't know what's going on with the other nodes.

In the case of three nodes receiving an update and dying at once in a
five-node system, the first node to recover needs to tell the other two what
happened.

------
epberry
I always look forward to reading these - it typically leads to a good hour or
two of Wikipedia and learning.

------
pjc50
fsync()ing on every increment is kind of required, but makes me wince; I feel
that a bit of NVRAM holding the incrementing counter would allow you to
greatly increase the throughput of this system.

~~~
ggreer
I agree. With custom hardware, one could drastically increase the performance
of many distributed operations. For example, Google's Spanner[1] uses GPS
receivers and atomic clocks to ensure global consistency.

The current state of distributed systems makes me think of 3D games before
Nvidia's GeForce 256. Without dedicated hardware, there was only so much you
could do.

1\. [https://en.wikipedia.org/wiki/Spanner_(database)](https://en.wikipedia.org/wiki/Spanner_\(database\))

------
siliconc0w
Seems like you could use a combination of time and unique machine IDs to
generate ordered sequences in a distributed system without the need for nodes
to talk to each other.

~~~
jdp
That's the intuition behind systems like Flake
([https://github.com/boundary/flake](https://github.com/boundary/flake)) and
its predecessors.

In systems like Flake you get a useful roughly-ordered property, which is
good for generating IDs and organizing things like activity feeds. Things
will be mostly sorted by time, because the time is encoded in the most
significant bits, but the order will be fuzzy because of the machine and
sequence IDs in the LSBs. You get monotonically increasing IDs per process,
because each one spinlocks locally against backwards clock drift, but you're
not guaranteed monotonically increasing IDs globally across all processes.
One process's time might drift forward, or you might have two IDs with the
same timestamp where the instance with the higher machine ID responded first.
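
A minimal sketch of that layout (the field widths are illustrative,
Snowflake-style 64-bit ones; Flake itself uses a wider 128-bit format):
timestamp in the MSBs, machine ID and a per-process sequence in the LSBs,
spinning locally if the clock runs backwards:

    import time

    class FlakeLikeGenerator:
        """64-bit IDs: 42 bits of milliseconds | 10 bits machine | 12 bits seq."""

        def __init__(self, machine_id):
            assert 0 <= machine_id < 1024
            self.machine_id = machine_id
            self.last_ms = -1
            self.seq = 0

        def next_id(self):
            now = int(time.time() * 1000)
            while now < self.last_ms:      # spin against backwards clock drift
                now = int(time.time() * 1000)
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF
                if self.seq == 0:          # sequence exhausted this millisecond
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.seq = 0
            self.last_ms = now
            return (now << 22) | (self.machine_id << 12) | self.seq

    gen = FlakeLikeGenerator(machine_id=7)
    a, b = gen.next_id(), gen.next_id()
    assert b > a  # monotonic per process; only roughly ordered across machines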

