
Jepsen: RethinkDB 2.2.3 reconfiguration - emidln
https://aphyr.com/posts/330-jepsen-rethinkdb-2-2-3-reconfiguration
======
coffeemug
Slava @ RethinkDB here.

It's been a pleasure working with Kyle on doing an exhaustive analysis of
RethinkDB and tracking down bugs revealed in the second test. Our lead
engineer wrote a detailed analysis of our findings (linked from the blog post,
but I wanted to link it again):
[https://github.com/rethinkdb/rethinkdb/issues/5289#issuecomm...](https://github.com/rethinkdb/rethinkdb/issues/5289#issuecomment-175394540).
For anyone interested in the details, this is a good read!

------
GordyMD
It is very encouraging for me, as the CTO of a company that uses RethinkDB, to
see the effort RethinkDB puts into ensuring quality in their product, and
their willingness to let Aphyr potentially uncover problems lurking under the
surface.

Thanks again to Aphyr and the team at RethinkDB for going to such lengths to
discover, document and fix such failure conditions. Going to have to read
through this thoroughly tonight.

------
emidln
As someone employed in building complex systems, I appreciate the takeaways
listed during the discussion:

    
    
        RethinkDB’s engineers noted three takeaways from tracking
        down this bug. First, fuzz-testing—at both the functional
        and integration-test level—can be a powerful tool for
        verifying systems with complex order dependence. Second,
        runtime invariant assertions were key in identifying the
        underlying cause. A test like Jepsen can tell you that the
        cluster can exhibit split-brain behavior, but can’t tell
        you anything about why. The error messages from those
        leader and log order assertions were key hints in tracking
        down the bug. Rethink plans to introduce additional runtime
        assertions to help identify future problems. Finally, they
        plan to devote more attention to issues which suggest—even
        tangentially—consistency errors.
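The runtime-invariant-assertion takeaway can be sketched as a guard that fails loudly, with context, the moment an ordering invariant is violated. This is an illustrative sketch, not RethinkDB's actual code; the class and error names below are made up:

```python
# Hypothetical sketch of a runtime invariant assertion: verify that
# entries appended to a replicated log arrive in strictly increasing
# index order, and raise a descriptive error the moment they do not.

class LogOrderInvariantError(AssertionError):
    pass

class ReplicatedLog:
    def __init__(self):
        self.entries = []  # list of (index, payload)

    def append(self, index, payload):
        # Invariant: log indices must be strictly increasing.
        if self.entries and index <= self.entries[-1][0]:
            raise LogOrderInvariantError(
                f"out-of-order append: got index {index} "
                f"after {self.entries[-1][0]}"
            )
        self.entries.append((index, payload))

log = ReplicatedLog()
log.append(1, "a")
log.append(2, "b")
try:
    log.append(2, "c")  # violates the invariant
except LogOrderInvariantError as e:
    print("invariant caught:", e)
```

The point of the error message carrying both indices is exactly the one made above: a black-box test can show *that* something went wrong, but an assertion like this tells you *where* and *why*.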

------
thomcrowe
Compose released RethinkDB 2.2.4 today.
[https://www.compose.io/articles/rethinkdb-2-2-4-available/](https://www.compose.io/articles/rethinkdb-2-2-4-available/)

------
overcast
Fantastic work on this. Seriously, I LOVE RethinkDB. I've used it for my last
three projects, and it's exactly what I've been waiting for, blending a
relational database with a NoSQL one. So glad this is getting exhaustively
reviewed for quality purposes.

------
codemac
Whoaaaa. Membership is where things get nasty; this is awesome. Punting on
dynamic membership due to its extreme difficulty to get right is usually a
good idea for many projects.

I'm still struggling to understand what happens in a RethinkDB "reconfigure",
especially around this epoch timestamp. Where does the wall-clock time come
from, the Raft leader of the shard?[0] I also need to read more to understand
why these values are necessary over any other generation-ID/sequence-number
type of thing; I'm not sure I follow why they are used, given the linked bug
report.[1]

Also, after embarrassing myself earlier in the week[2], I'm trying my best to
read every link in a post before making absurd claims. It's my much-deserved
penance. There seems to be one broken link in the post that I was curious to
see, given my ignorance about Rethink's epoch timestamp mechanism.[3]

[0]:
[https://github.com/rethinkdb/rethinkdb/blob/a90bb08051603621...](https://github.com/rethinkdb/rethinkdb/blob/a90bb0805160362125a03024eb9f309e578408c0/src/clustering/table_manager/table_metadata.hpp#L50)

[1]:
[https://github.com/rethinkdb/rethinkdb/issues/5289#issuecomm...](https://github.com/rethinkdb/rethinkdb/issues/5289#issuecomment-175394540)

[2]:
[https://lobste.rs/s/cnthlp/the_verification_of_a_distributed...](https://lobste.rs/s/cnthlp/the_verification_of_a_distributed_system/comments/poc7h7#c_poc7h7)

[3]:
[https://aphyr.com/data/posts/330/link](https://aphyr.com/data/posts/330/link)

~~~
danielmewes
A new epoch timestamp is generated when you either first create a new table,
or when you use the "emergency repair" operation to manually recover from a
dead Raft cluster (usually one that has lost a majority of servers).

The wall-clock time comes from the server that processes that query.

Whenever the epoch timestamp changes, replicas will get a fresh set of Raft
member IDs, and it's expected that they start with an empty Raft log.

Where exactly the epoch timestamps come from is not really relevant to this
bug. With the bug fixed, any given node will only accept multi_table_manager_t
actions that have a strictly larger epoch timestamp than the one it currently
has. That is enough to guarantee that nodes never go back to a previous
configuration, and never rejoin a Raft cluster with the old member ID, but a
wiped Raft log.
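That acceptance rule can be sketched in a few lines. This is an illustrative model, not RethinkDB's actual C++ code; the class and method names are mine:

```python
# Illustrative sketch: each node tracks the epoch timestamp of its
# current configuration and rejects any action whose epoch is not
# strictly newer, so it can never fall back to an older configuration
# or rejoin an old Raft cluster with a wiped log.

class Node:
    def __init__(self, epoch_ts):
        self.epoch_ts = epoch_ts

    def apply_action(self, action_epoch_ts, action):
        if action_epoch_ts <= self.epoch_ts:
            return False  # stale or duplicate epoch: reject
        self.epoch_ts = action_epoch_ts
        # ... apply the new configuration, reset Raft state ...
        return True

n = Node(epoch_ts=100)
assert n.apply_action(101, "join new cluster")  # strictly newer: accepted
assert not n.apply_action(100, "same config")   # not strictly newer: rejected
assert not n.apply_action(99, "old config")     # older: rejected
```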

~~~
codemac
I guess I'm just confused about how this clock becomes a trusted source of
truth for forward progress. Is there a way of asserting that the clock makes
forward progress that I don't understand?

EDIT: Or is it that it's not required to show forward progress? Still reading
the rethinkdb source & docs, thanks for the information so far.

~~~
timmaxw
During normal operation, RethinkDB uses the standard Raft protocol for
managing configuration operations. The Raft protocol only uses the clock for
heartbeats and election timeouts. So we're not using the clock as a trusted
source of truth.

However, a Raft cluster will get stuck if half or more of the members are
permanently lost. RethinkDB offers a manual recovery mechanism called
"emergency repair" in this case. When the administrator executes an emergency
repair, RethinkDB discards the old Raft cluster and starts a completely new
Raft cluster, with an empty log and so on. However, some servers might not
find out about the emergency repair operation immediately. So we would end up
with some servers that were in the new Raft cluster and some that were still
in the old Raft cluster. We want the ones in the old Raft cluster to discard
their old Raft metadata and join the new Raft cluster.

The process of having those servers join the new Raft cluster is managed using
the epoch_t struct. An epoch_t is a unique identifier for a Raft cluster. It
contains a wall-clock timestamp and a random UUID. When the user runs an
emergency repair, the wall-clock time is initialized to max(current_time(),
prev_epoch.timestamp+1). When two servers detect that they are in different
Raft clusters, the one that has the lower epoch timestamp discards its data
and joins the other one's Raft cluster. The UUID is used for tiebreaking in
the unlikely event that the timestamps are the same.
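The epoch scheme described above can be modeled in a short sketch. The field and function names here are illustrative, not RethinkDB's actual identifiers:

```python
import time
import uuid
from dataclasses import dataclass

# Illustrative model of the epoch scheme: an epoch pairs a wall-clock
# timestamp with a random UUID used only for tiebreaking.

@dataclass(frozen=True)
class Epoch:
    timestamp: int
    id: uuid.UUID

def new_epoch_after(prev: Epoch) -> Epoch:
    # Emergency repair: take max(current_time, prev.timestamp + 1) so
    # the new epoch sorts after the old one even if the wall clock
    # has gone backwards.
    ts = max(int(time.time()), prev.timestamp + 1)
    return Epoch(ts, uuid.uuid4())

def supersedes(a: Epoch, b: Epoch) -> bool:
    # The server with the lower epoch discards its data and joins the
    # other's Raft cluster; the UUID breaks (unlikely) timestamp ties.
    return (a.timestamp, a.id.int) > (b.timestamp, b.id.int)

old = Epoch(timestamp=1_000, id=uuid.uuid4())
new = new_epoch_after(old)
assert new.timestamp > old.timestamp
assert supersedes(new, old)
```

The `max()` is the interesting design point: it means correctness of the ordering never depends on the clock actually moving forward, only on the previous epoch being known at repair time.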

So the clock isn't being used as a trusted source of truth; it's being used as
part of a quick-and-dirty emergency repair mechanism for the case where the
Raft cluster is permanently broken. The emergency repair mechanism isn't
guaranteed to preserve any consistency invariants (as the documentation
clearly warns).

~~~
codemac
Ahh, ok! An interesting approach. For some reason I'm uncomfortable with it,
but I need to spend more time reasoning about which failure modes RethinkDB
users care about.

I'm still not sure I fully understand the cases where this could occur, as I'm
used to a very different approach to consistent operations that depends highly
on physical infrastructure.

------
VeejayRampay
What fantastic write-ups. Thanks to aphyr for all the hard work, it's an
invaluable service to the community.

------
headconnect
That was a beautiful and thoroughly enjoyable read, pitched at a level where
moderately technical decision makers can start to understand the underlying
complexity of maintaining trustworthy distributed systems (and the lengths to
which one needs to go in order to truly start trusting something). Just
because you implement Raft doesn't necessarily mean leadership is always
guaranteed under all circumstances: there are always considerations
influencing the system, and all parts are intertwined :)

------
nojvek
Rethink has been my go-to DB since last year. Absolutely love it. Just waiting
for the Windows release.

