
Building a Distributed Log from Scratch: Data Replication - bhattisatish
https://bravenewgeek.com/building-a-distributed-log-from-scratch-part-2-data-replication/
======
majidazimi
Regarding Kafka replication: What if the following scenario happens?

1\. Leader replicates to followers

2\. Followers get the messages and send ACK to leader

3\. Leader gets the confirmation from the followers, increments the high
watermark (HW) and replies to the client that the message is committed.

4\. NOW: Leader fails before it could piggyback to followers that HW has been
incremented.

5\. The question is: since the potential leader is not aware that the HW has
been incremented, it becomes the new leader and truncates the log to the old
HW, which means we have lost an entry that the previous leader confirmed.

As a result, the client was told that the message was successfully written,
but it has been lost.
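To make the timing concrete, here is a toy sketch of steps 1-3 (illustrative Python, not Kafka's actual API; all names are made up): the leader advances its HW as soon as all followers ACK, but the followers only learn the new HW on their next fetch.

```python
# Toy model of the commit path described above; not Kafka code.

class Replica:
    def __init__(self):
        self.log = []   # replicated records
        self.hw = 0     # high watermark: offsets below hw are committed

def commit(leader, followers, record):
    """Returns True once the record is acknowledged by all followers."""
    leader.log.append(record)       # step 1: leader replicates to followers
    for f in followers:
        f.log.append(record)        # step 2: each follower ACKs the record
    leader.hw = len(leader.log)     # step 3: leader advances its HW and
    return True                     # confirms the write to the client;
                                    # followers still hold the OLD hw until
                                    # their next fetch (step 4)
```

If the leader dies between step 3 and the followers' next fetch, every follower has the record in its log but still carries the stale HW, which is exactly the window the scenario describes.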

~~~
teraflop
There's a more accurate and detailed explanation of the Kafka protocol here:
[https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Repl...](https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Replication)

In step 5 of your example, the new leader _does not_ truncate its log; only
the new followers do.

Basically, every message that was committed is guaranteed to be in every in-
sync replica's log. (Some of these may be after the replica's HW.) But the
converse is not true: some replicas' logs will contain messages that may not
have been committed.

So the new leader has to keep its entire log -- which includes all
acknowledged messages, plus some unacknowledged ones that become committed as
a result of the leadership transition. But the new followers have to discard
messages above their HW, because those messages might not be present on the
new leader.
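To make the asymmetry concrete, here is a toy sketch of the transition (illustrative Python, not Kafka code; the function names are made up): the new leader keeps everything, while a new follower truncates to its own HW and re-fetches the rest from the leader.

```python
# Illustrative model of the leadership transition described above.

def become_leader(log, hw):
    """The new leader keeps its entire log; its log end offset (LEO)
    becomes the new high watermark, so the whole log is now committed."""
    return log, len(log)

def become_follower(log, hw, leader_log, leader_hw):
    """A new follower discards records above its own HW (they might not
    exist on the new leader) and re-fetches that suffix from the leader."""
    kept = log[:hw]                  # the committed prefix is always safe
    refetched = leader_log[hw:]      # re-copy everything else from leader
    return kept + refetched, leader_hw
```

For example, a follower at HW 1 holding an extra unacknowledged record drops it and converges on the new leader's log, while the new leader's own unacknowledged suffix becomes committed.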

~~~
majidazimi
The doc says:

> The replica that registers first becomes the new leader. The new leader
> chooses its LEO as the new HW.

Isn't it a better idea to elect the process that has the largest offset?
Basically, during leader election a candidate should get confirmation from a
majority of correct processes. If another process observes that it has a
larger offset in its log, it rejects the election and starts another round.
Eventually a correct process with the largest offset will become leader.
Regarding correctness, this sounds to me like a safer election.
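The rule you're describing can be sketched roughly like this (closer to Raft's "most up-to-date log wins" vote than to anything Kafka actually does; names are illustrative):

```python
def grant_vote(voter_leo, candidate_leo):
    """A voter grants its vote only if the candidate's log end offset is
    at least as large as its own; otherwise it rejects, forcing another
    election round."""
    return candidate_leo >= voter_leo

def wins_election(leos, candidate):
    """leos maps node id -> log end offset; a candidate needs a majority
    of votes to win."""
    votes = sum(grant_vote(leo, leos[candidate]) for leo in leos.values())
    return votes > len(leos) // 2
```

Only a node whose log is at least as long as a majority of voters' logs can win, so a node with the largest offset eventually becomes leader.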

~~~
teraflop
Yeah, I think that would be a reasonable alternate approach. It's not really
"safer", though.

None of the replicas are guaranteed to have all of the messages that reached
the old leader (some of those might never have been replicated anywhere). So
clients that didn't receive acknowledgements might need to retry on the new
leader, no matter what. Since the clients have to have retry logic anyway,
there's no safety issue with discarding some of the messages that weren't yet
acknowledged.

Your suggestion would try to minimize the number of necessary retries, but at
the cost of adding delay and complexity to the leader election process.
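For what it's worth, the client-side retry logic in question is roughly this (a hypothetical client sketch, not Kafka's producer API); note that blindly retrying an unacknowledged write can duplicate it, which is the usual at-least-once caveat:

```python
def produce_with_retry(send, record, attempts=3):
    """send(record) returns True on ACK and raises ConnectionError if the
    leader fails mid-request. An unacknowledged write may or may not have
    actually landed, so retrying can produce duplicates."""
    for _ in range(attempts):
        try:
            if send(record):
                return True     # acknowledged: the record is committed
        except ConnectionError:
            continue            # leader died: retry against the new leader
    return False                # exhausted retries; surface the failure
```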

------
yazaddaruvala
Are there any distributed logs which operate more as a framework than a
service/platform?

Imagine Kafka but I am able to run my processors within the same JVM. You
might even call them servlets for Kafka.

~~~
fizx
Plenty of things come up when you google "embedded kafka". Still, it's
mostly a bad idea.

------
chrisweekly
Outstanding material. Bookmarked.

~~~
Confiks
Please try to be more substantive in your replies. I would love to hear why
you think this is such outstanding material, or what your own connection to
the subject is.

------
jaequery
I'd like to hear whether a blockchain could be used here to simplify all of
this. Anyone?

~~~
jteppinette
Blockchains support untrusted nodes, which is typically not a requirement in
the distributed commit log use case. Also, this wouldn't be a
simplification: it would increase your costs and decrease performance.

~~~
jaequery
I was thinking that if you roll your own fork of a blockchain and set the
costs to zero, you could run your own miners as the distributed nodes, and it
could work as distributed storage that takes care of the consensus.

~~~
teraflop
On top of the extra complexity, a blockchain has much weaker consistency
properties than Paxos/Raft/Kafka. Because of finite propagation delays and
network packet loss, there are normally multiple valid chains at any given
moment in time.

Even if you assume that all nodes are benevolent, and try their best to
converge on the single longest chain without needing mining fees as an
incentive, you still can't know whether the chain you're querying is the
"winner" unless you have an omniscient view of the entire system. So you only
get eventual consistency, as opposed to sequential consistency.

~~~
brian_cloutier
This comment is correct; I've heard the property described as "probabilistic
consensus".

However, this part is wrong:

> Because of finite propagation delays and network packet loss, there are
> normally multiple valid chains at any given moment in time.

How often forks occur is a property of a specific system, not of "blockchain"
in general. Bitcoin, for instance, usually does not have multiple chains:
forks do sometimes happen, but the orphan rate is generally quite low.
Ethereum has a much higher block rate, so propagation delays matter more and
forks happen there much more often.

There's an improvement called Bobtail [1] which makes the network much better
at estimating how much work has been done. A blockchain which adopted Bobtail
would still have probabilistic consensus but would almost never have forks.

Casper, Ethereum's proposed Proof of Stake system, is designed to provide
"finality", what the rest of us call "consensus". Once adopted, and assuming
it works, we'll have a blockchain where you can be sure that every block you
see which you determine is valid is a block which the rest of the network will
also decide is valid (assuming less than 1/3 of the network is byzantine).

[1] [https://arxiv.org/abs/1709.08750](https://arxiv.org/abs/1709.08750)

