
Bizur: A Key-Value Consensus Algorithm for Scalable File-Systems - blopeur
https://arxiv.org/abs/1702.04242v1
======
rystsov
> Paxos-like algorithms which are used by existing distributed file systems,
> can have artificial contention points due to their dependence on a
> distributed log.

It's half true. Paxos is an ambiguous term: it covers both Multi-Paxos and
Single-Decree Paxos (aka Synod). The first is a log-based algorithm, while
the second is just a register. So your criticism is valid only for the first
part of the Paxos family.

Actually, Bizur is very similar (almost indistinguishable) to Single-Decree
Paxos - just compare it with the algorithm described in the video [1] by Heidi
Howard or in my post "In search of a simple consensus algorithm" [2].

[1]
[https://www.youtube.com/watch?v=s8JqcZtvnsM](https://www.youtube.com/watch?v=s8JqcZtvnsM)

[2] [http://rystsov.info/2017/02/15/simple-
consensus.html](http://rystsov.info/2017/02/15/simple-consensus.html)

~~~
ezrahoch
Single decree is a write-once algorithm: you can't use the same instance
again for another update of the associated value.

So just replacing Bizur's "register" implementation with Single-Decree Paxos
won't be sufficient; you'd need something more elaborate (creating more
instances as you go along).

This kind of instance management isn't required in Bizur.

I'm also not sure what the overhead of all that is w.r.t. memory footprint.
Since the overhead is paid for each bucket, it might accumulate to quite a
bit.

~~~
rystsov
Not really :) Please follow the links. "Paxos Made Simple" describes the
write-once variant, but it's possible to choose a new value in the next
ballot cycles, which makes the register rewritable.

Proof of the algorithm: [http://rystsov.info/2015/09/16/how-paxos-
works.html](http://rystsov.info/2015/09/16/how-paxos-works.html)

Moreover, I built a system on top of it and extensively tested it with fault
injections: [https://github.com/gryadka/js](https://github.com/gryadka/js).

It seems that GoshawkDB[1] uses a similar idea; Treode[2] used it too.

[1] [https://goshawkdb.io/](https://goshawkdb.io/)

[2] [http://treode.github.io/design/](http://treode.github.io/design/)
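
A minimal in-memory sketch of the rewritable register idea (my own
illustration of what the linked post describes; names like `Acceptor` and
`change` are mine, not from any of the cited systems): each update runs a
fresh ballot through the usual prepare/accept phases, seeding the new value
from the highest one previously accepted.

```python
# Sketch of a rewritable single-decree Paxos register. Every update is a
# new ballot cycle over the SAME acceptors -- no new Paxos instance is
# created per write.

class Acceptor:
    def __init__(self):
        self.promised = 0          # highest ballot promised so far
        self.accepted = (0, None)  # (ballot, value) last accepted

    def prepare(self, ballot):
        if ballot <= self.promised:
            return None            # reject stale ballots
        self.promised = ballot
        return self.accepted

    def accept(self, ballot, value):
        if ballot < self.promised:
            return False
        self.promised = ballot
        self.accepted = (ballot, value)
        return True

def change(acceptors, ballot, update):
    """One ballot cycle: read the latest value from a majority,
    apply `update`, and write the result back to a majority."""
    quorum = len(acceptors) // 2 + 1
    replies = [r for r in (a.prepare(ballot) for a in acceptors)
               if r is not None]
    if len(replies) < quorum:
        raise RuntimeError("prepare phase lost")
    _, current = max(replies, key=lambda r: r[0])  # highest-ballot value
    new = update(current)
    if sum(a.accept(ballot, new) for a in acceptors) < quorum:
        raise RuntimeError("accept phase lost")
    return new
```

Ballot 1 can set the register and ballot 2 can overwrite it through the
same acceptors, which is the "rewritable" part.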

~~~
ezrahoch
Very nice work! Indeed we've done similar stuff :) I haven't read it all yet,
I'll comment back once I do.

------
rystsov
> We avoided reads (get operations) since the current version of etcd performs
> reads directly from the leader, without contacting the cluster, which
> doesn’t preserve the same consistency level as Bizur and ZooKeeper do.

It's wrong. Etcd provides a "?quorum=true" option which makes reads go
through the Raft consensus algorithm, achieving linearizability; see
[https://github.com/coreos/etcd/issues/741](https://github.com/coreos/etcd/issues/741)

~~~
ideal0227
etcd v3 reads are linearizable by default.

Also, I believe the authors of the paper tested against the old etcd v2
API/backend even though they use etcd3.

The new API/backend has significantly higher performance.

I work on etcd.

~~~
rystsov
I mean that without the ?quorum flag, stale reads are possible. I tested
etcd v3 (HTTP API) and the reads were incredibly fast, but when I set the
flag, the read latency became the same as the write latency.

~~~
ideal0227
v3 does not have a native HTTP API; I suspect you were still using the v2
API. You need the gateway to make v3 work with HTTP:
[https://github.com/coreos/etcd/blob/master/Documentation/dev...](https://github.com/coreos/etcd/blob/master/Documentation/dev-
guide/api_grpc_gateway.md)

But that might impact performance quite a bit depending on your workload.

~~~
rystsov
Got it - I was using etcd v3.1.0 but with the v2 API.

------
sriram_malhar
Unless I have missed something, I don't believe Bizur is "linearizable at the
bucket level" (as the paper claims).

Suppose a server X was the leader, and subsequently a new leader was elected
without X's knowledge. X does not learn of the new election unless it
attempts a write. Until then, the read side may not detect the situation
either, since EnsureRecovery may return true (the bucket being read has the
same elect_id as X.elect_id). Thus reads may return stale values. What we
have here is "sequential consistency at the bucket level", which technically
is cache consistency.

Separately, because the consistency claims are at the key/bucket level, and
since there is no claim of global ordering (between different keys), a
comparison to ZooKeeper is not quite fair. Bizur cannot be used as a
coordination service because it isn't sequentially consistent overall.

------
tzury
"Bizur" in Hebrew means distributing. I see all four authors[1] are Israeli,
so I guess this is where the name comes from.

[1]: Ezra N. Hoch, Yaniv Ben-Yehuda, Noam Lewis, Avi Vigder

------
rystsov
What do you mean by "queue depth"? Is it the number of concurrent clients?

If so, it looks very suspicious that avg. latency decreases as the number of
concurrent clients goes up. How do you measure avg. latency?

Let me illustrate what the avg. latency usually means.

Imagine that we have two concurrent users: Alice and Bob. During a test
lasting one second, Alice managed to send 3 requests: the first took 0.5
seconds, while the rest were executed in a quarter of a second each. In the
same time, Bob made only 2 requests, each taking 0.5 seconds.

So the average latency is the total execution time of all requests divided
by the number of requests; in our case it's
(0.5+0.25+0.25+0.5+0.5) / 5 = 0.4 seconds.
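
In code, that definition is just the mean over all per-request durations,
pooled across clients (a trivial sketch, my own naming):

```python
# Average latency = total time spent in requests / number of requests,
# regardless of which client issued them.
def avg_latency(durations):
    return sum(durations) / len(durations)

# Alice's three requests plus Bob's two, as above:
avg_latency([0.5, 0.25, 0.25, 0.5, 0.5])  # 0.4
```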

Did you use the same definition?

~~~
mhf8o9a
Latency does increase with concurrency. Latency decreases with the number of
keys, as long as there are fewer keys than concurrent clients. Once the
number of keys exceeds the concurrency, there is no additional benefit and
latency remains flat.

~~~
rystsov
Thanks, I didn't notice that the clients operate over the same key set, so I
didn't think about contention. It makes sense.

------
unboxed_type
I am no academic, but I read a lot of papers and am going to write one this
year. I skimmed the paper without getting into the details of the protocol
itself, but made some observations about the paper in general. My personal,
very subjective opinion on how the paper might be strengthened is presented
below.

1\. Add a description of the system model under which your protocol is
expected to operate.

2\. Add a description of the fault-tolerance properties of the protocol.

3\. Do you plan to prove the correctness of the algorithm more formally?
Maybe you could model the algorithm before publishing; proving a safety
property via model checking could considerably reduce doubt about the
algorithm's correctness.

~~~
mhf8o9a
Thanks for the feedback.

As for #3, unfortunately we don't currently have plans to publish a formal
proof (we're building a product). Some of our internal documentation is more
formal, but we decided on a more readable presentation for this paper.

~~~
Taek
You will probably find it valuable to write that proof. Sometimes an edge case
pops up or you end up proving that it doesn't work as expected. I've done this
a few times in my work.

Unless you already have it but don't plan to publish it, which seems a
little odd to me. I get that you want readable documentation, but you could
at least link to the proof somewhere for people who go digging.

~~~
unboxed_type
Personally, I would rather suggest writing a model, not a rigorous proof. A
formal proof of any real protocol is a long and tedious journey; production
engineers will hardly find it a useful thing to do.

On the other hand, a logical model can be encoded in a rather short amount
of time by someone who does it regularly.

That hypothetical person could benefit from the activity by co-authoring the
paper, and the original authors would benefit by strengthening both the
protocol and the paper.

------
daviesliu
After reading the paper, I felt disappointed that: 1) Bizur is used for
building scalable object storage rather than a filesystem, and it does not
even support atomic rename for a single key (the hash of the key changes, so
it is moved to another bucket without any transactional guarantee); 2) Bizur
itself looks like a sharded Raft group, which is already used in TiDB (one
Raft group per region).

Please correct me if I read it wrongly.
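
The rename concern can be seen in a toy sketch of hash-based bucket
placement (my own illustration, not code from the paper): a rename is a
delete in one bucket plus an insert in another, with nothing making the pair
atomic.

```python
# Toy illustration: keys are placed in buckets by hash, so renaming a key
# changes its hash and can move it to a different bucket. The delete and
# the insert are two independent single-bucket operations.
NUM_BUCKETS = 16

def bucket_of(key):
    return hash(key) % NUM_BUCKETS

def rename(buckets, old, new):
    value = buckets[bucket_of(old)].pop(old)  # operation on bucket A
    buckets[bucket_of(new)][new] = value      # operation on bucket B
    return bucket_of(old) != bucket_of(new)   # True when the key moved
```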

~~~
ideal0227
This (like a sharded Raft group) is somewhat correct :P.

If you can sacrifice consistency (causal instead of external), you can get
EPaxos. If you can sacrifice more, you can get Single Paxos. Both of them
can increase concurrency to some degree and can improve availability over a
leader-based RSM system like Multi-Paxos or Raft.

I believe there are other ways to make the trade-off, even making the
trade-off dynamically. Hopefully, I can talk about it more soon.

I work on etcd/raft, which is a shared Raft implementation used by
TiDB (TiKV), CockroachDB, etcd, and many more.

~~~
rystsov
> If you can sacrifice more, you can get Single Paxos

What do you mean? Single-Decree Paxos is linearizable, so it's consistent.
When multi-key updates are necessary, they can be implemented with
client-side transactions such as 2PC, RAMP, or Percolator.

Even with Raft we're in the same situation: once the data outgrows one
machine, we need to shard it. As a result, some keys can live on different
shards, and we need cross-shard transactions to do multi-key updates :)

~~~
ideal0227
Well... why do you need 2PC if nothing is sacrificed? :)

If you choose something like Raft (or another RSM), you get a consistent key
space to do txns over keys.

For a metadata store like etcd, one RSM group is OK. Performance and
availability are better than in something that requires 2PC.

For database systems, quite a few txns actually fall into one shard and can
bypass 2PC.

Of course, it highly depends on the workload and use cases.

------
skyde
It seems that the performance gain of not having all updates go through a
single log could also be achieved by having one log per bucket. My
understanding is that with Bizur all updates to the same bucket also need to
proceed sequentially.

Also, with most systems like Raft, a failed machine trying to rejoin the
cluster does not need to download the complete transaction log; it can
instead download a snapshot of the state machine and then replay the log
entries created after that snapshot.

Another aspect: what if the bucket contains a large number of keys? From the
Bizur paper, it seems that the complete "key->value" set needs to be sent to
replicas each time the bucket is updated, instead of only the entries that
changed. This would produce much larger network traffic compared to Raft or
Zab when all replicas are mostly in sync.

~~~
ezrahoch
1) Keeping a log mechanism per bucket would be expensive, since a log is a
heavy mechanism: it costs memory and network traffic. The additional
information Bizur needs per bucket is minimal - basically just a version.

2) In Bizur, rejoining the cluster includes the equivalent of copying the
snapshot. However, a node can start participating in Bizur even before it
has finished reading the entire snapshot, because of the key-value data
model: once a bucket has been copied, the node can start serving it. That is
much faster than log-based algorithms.

3) You are correct about the size of a bucket. In our implementation we aim
to have up to 8-10 keys per bucket, to avoid the overhead you mentioned.
That's why it's especially important to be very efficient per-bucket (see
point #1).
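
Point #1 can be pictured with a minimal sketch (my own reading of the paper;
field names are approximate): the only consensus state carried per bucket is
an id/version pair, and a write ships the whole small key->value map.

```python
# Minimal sketch of per-bucket replicated state: an elect_id/version pair
# plus the bucket's (small) key->value map -- no per-bucket operation log.
from dataclasses import dataclass, field

@dataclass
class Bucket:
    elect_id: int = 0   # id of the election that produced this write
    version: int = 0    # bumped on every bucket write
    kv: dict = field(default_factory=dict)

def write_bucket(replica, bucket_id, new_bucket):
    """A replica accepts a whole-bucket write only if it is not stale."""
    old = replica.get(bucket_id, Bucket())
    if (new_bucket.elect_id, new_bucket.version) <= (old.elect_id, old.version):
        return False                 # stale write: reject
    replica[bucket_id] = new_bucket  # whole-bucket replacement
    return True
```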

~~~
skyde
Thanks a lot for clarification.

------
xanton
Am I right that in a system implementing the algorithm explained in the
paper there is no way to atomically execute operations on two key-value
pairs stored in different buckets? For example, two CAS operations that
either both succeed or both fail?

~~~
solidsnack9000
Their application is file systems, and no file system offers that
capability. To provide transactionality for multi-file operations, you have
to have a file that represents "the transaction" and move it, or
create/delete it.

In most file systems you can't even update a file atomically - what you can
update atomically is a disk page, 4-16 KB. Database servers might therefore
maintain a separate, small file that contains only a transaction token,
which is atomically updated to name the last transaction written to disk in
full.
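
The transaction-token trick can be sketched like this (a generic POSIX-style
illustration, not any particular database's code): write the new token to a
temp file, fsync it, then atomically rename it over the old one.

```python
# Sketch: atomically publish "the last fully written transaction" via a
# small token file. rename/replace is atomic on POSIX file systems, so a
# reader sees either the old token or the new one, never a torn write.
import os

def publish_txn(dirpath, txn_id):
    tmp = os.path.join(dirpath, "txn.tmp")
    final = os.path.join(dirpath, "txn")
    with open(tmp, "w") as f:
        f.write(str(txn_id))
        f.flush()
        os.fsync(f.fileno())   # make sure the bytes hit disk first
    os.replace(tmp, final)     # atomic swap of the token file

def last_txn(dirpath):
    with open(os.path.join(dirpath, "txn")) as f:
        return int(f.read())
```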

~~~
xanton
I don't judge them; I was confused by the choice of systems to compare with
and thought that I hadn't understood the article. I think a comparison with
MongoDB or RethinkDB would be more appropriate in this case.

~~~
mhf8o9a
Keep in mind that Bizur is being used for services _internal_ to the file
system, so the use case is more like etcd's or ZooKeeper's; for example, in
one role it is a cluster-configuration database (with high availability,
fault tolerance, consistency, etc.).

------
ccleve
Have you published the software anywhere? Do you have an open source
implementation? (The Elastifile Github account doesn't appear to have one.)

~~~
ezrahoch
It is not an open-source project, but rather a proprietary implementation.

