

The Minitransaction: An Alternative to Multi-Paxos and Raft - topher-the-geek
http://blog.treode.com/minitransaction/

======
mjb
The Sinfonia paper
([http://www.sosp2007.org/papers/sosp064-aguilera.pdf](http://www.sosp2007.org/papers/sosp064-aguilera.pdf))
is interesting, and section 4 does a better job of explaining how this
'minitransaction' model works than this blog post does.

The part of this to pay the most attention to is the way it handles node
failure:

> The traditional way to avoid blocking on coordinator crashes is to use
> three-phase commit, but we want to avoid the extra phase. We accomplish this
> by blocking on participant crashes instead of coordinator crashes.

That seems sensible, and is later justified by the fact that the memory nodes
can be replicated, and can be highly available in their own right. This brings
a significant limitation:

> When a participant memory node crashes, the system blocks any outstanding
> minitransactions involving the participant until the participant recovers.

So, in Sinfonia, the memory nodes HAVE to be highly available, or you've built
a system with many single points of failure instead of just one.

Sinfonia is actually a pretty clever system, and despite some interesting
edges (crash recovery and log compaction), is not particularly complicated.
Directly comparing its memory model to multi-Paxos, though, rings a little bit
hollow for me. One of these things is a non-blocking atomic commit protocol
which allows transactions across reliable nodes, the other is a consensus
protocol which replicates a log across unreliable nodes. They aren't really
solving the same problem.
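For anyone who hasn't read section 2 of the paper: a minitransaction bundles compare, read, and write sets that commit atomically in the 2PC round itself. Here's a minimal single-node simulation of that structure in Python; the names and builder-style API are my own illustration, not Sinfonia's actual interface:

```python
# Hypothetical sketch of Sinfonia's minitransaction structure; the class
# and method names are illustrative, not the paper's actual API.

class Minitransaction:
    def __init__(self):
        self.compare_items = []  # (addr, expected): all must match to commit
        self.read_items = []     # addr: values returned if compares succeed
        self.write_items = []    # (addr, new_value): applied if compares succeed

    def cmp(self, addr, expected):
        self.compare_items.append((addr, expected))
        return self

    def read(self, addr):
        self.read_items.append(addr)
        return self

    def write(self, addr, value):
        self.write_items.append((addr, value))
        return self

def execute(txn, memory):
    """Single-node simulation: commit only if every compare matches."""
    if any(memory.get(a) != v for a, v in txn.compare_items):
        return False, {}
    results = {a: memory.get(a) for a in txn.read_items}
    for a, v in txn.write_items:
        memory[a] = v
    return True, results

# Example: an atomic compare-and-swap expressed as a minitransaction.
mem = {"x": 1}
ok, _ = execute(Minitransaction().cmp("x", 1).write("x", 2), mem)
```

In the real system the items are spread across memory nodes, and the compare/read/write sets piggyback on the first phase of the commit protocol, which is where the efficiency comes from.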

~~~
topher-the-geek
> So, in Sinfonia, the memory nodes HAVE to be highly available, or you've
> built a system with many single points of failure instead of just one.

Yes, in Sinfonia that's true since an item resides on only one node. The
Scalaris project augments Sinfonia by performing operations on a majority of
replicas.
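The majority idea can be sketched like so; this is a simplified quorum read/write in Python to illustrate why a minority of crashed replicas no longer blocks progress, not Scalaris's actual protocol:

```python
# Simplified majority-quorum replication over N copies of each item; an
# illustration of the idea only, not Scalaris's actual protocol.

N = 5
MAJORITY = N // 2 + 1  # 3 of 5

# Each replica maps key -> (version, value).
replicas = [dict() for _ in range(N)]

def quorum_write(key, value, version, up):
    """Apply the write on every reachable replica; succeed on a majority."""
    acks = 0
    for i in up:
        replicas[i][key] = (version, value)
        acks += 1
    return acks >= MAJORITY

def quorum_read(key, up):
    """Collect a majority of votes and return the highest-versioned value."""
    votes = [replicas[i][key] for i in up if key in replicas[i]]
    if len(votes) < MAJORITY:
        raise RuntimeError("no quorum for " + key)
    return max(votes)[1]

# With two of five replicas down, a majority of three still makes progress.
wrote = quorum_write("k", "v1", 1, up=[0, 1, 2])
value = quorum_read("k", up=[0, 1, 2])
```

Because any two majorities intersect, a read quorum always sees the latest committed write, so losing a minority of replicas neither loses data nor blocks the system.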

> the other is a consensus protocol which replicates a log across unreliable
> nodes

Interesting. I hadn't thought of it like that. Indeed, when you frame it that
way, it doesn't seem right to compare a minitransaction to multi-Paxos. I had
thought of multi-Paxos as a mechanism to serialize updates on the replicas of
a key-value store. In that framing, the comparison makes more sense.

------
monstrado
> The NameNode is a single point of failure.

The NameNode service supports high availability out of the box, using a QJM
(Quorum Journal Manager) to share edits between the active and standby nodes.
To say the NameNode is a SPOF is not only misleading, it's incorrect.
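For reference, NameNode HA with the QJM is enabled through configuration along these lines (a minimal hdfs-site.xml excerpt; the nameservice id and hostnames are placeholders):

```xml
<!-- Minimal hdfs-site.xml excerpt for NameNode HA with the Quorum
     Journal Manager; "mycluster" and the jn* hosts are placeholders. -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>
```

The active NameNode writes edits to a majority of the JournalNodes, and the standby replays them, so a failover loses no committed metadata.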

> the NameNode is also a scaling bottleneck. Services which build on Hadoop

What's indicating that the NameNode is a scaling bottleneck? There are
production Hadoop clusters with PBs of data, and the NameNode has not been an
issue for scalability. Just so we're clear, data written to HDFS does not flow
through the NameNode.

> As a result, many enterprises that use Hadoop and HBase do so only in non-
> critical services, such as supporting the business analytics group

This is where I point to the hundreds of companies that rely on Hadoop to run
mission-critical applications (Facebook, Yahoo, eBay, Box, ...). Just google
around... look at the customers of Cloudera, Hortonworks, Pivotal, etc.

~~~
spullara
The name node is a significant vertical scaling challenge. It is one of the
reasons that Yahoo limits their cluster sizes.

~~~
ithkuil
That's right. It should be noted, however, that the need to horizontally scale
the equivalent of the NameNode kicks in only when you have a very large
storage system, where large can be defined as:

"You know you have a large storage system when you get paged at 1 AM because
you only have a few petabytes of storage left." [1]

Even below that size you may have a large system, and even if a single
metadata master node is a sound solution, you still have a lot of interesting
problems to solve. Don't make systems more complex than necessary.

More pointers to Google's public material on Colossus in [2].

[1]
[http://static.googleusercontent.com/media/research.google.co...](http://static.googleusercontent.com/media/research.google.com/it//university/relations/facultysummit2010/storage_architecture_and_challenges.pdf)

[2] [http://www.highlyscalablesystems.com/3202/colossus-successor...](http://www.highlyscalablesystems.com/3202/colossus-successor-to-google-file-system-gfs/)

------
batbomb
So the acceptor is a replica-read-only machine and a client-propose-commit-only
machine.

Also, the minimum production setup for this cluster is 6+1 machines?

~~~
topher-the-geek
Ah, I see how you could interpret the diagram that way. The client and server
are clearly different machines. However, the replicas and acceptors can be on
the same machine, in fact in the same process, and the server can be on the
same machine as a replica or acceptor. So only three machines are necessary.

------
fasteo
In Aphyr's "Call me maybe" we trust

~~~
glibgil
topher-the-geek, will you be writing Jepsen tests for Treode?

~~~
topher-the-geek
Having Jepsen tests would be a good thing. The community trusts them, and it's
wise to have tests from someone other than the author of the main-line code.
If I'm so lucky as to find funding--my savings will only go so far--then I or
someone I hire will eventually apply Jepsen to Treode. In the meantime, there
are already tests a bit like Jepsen in the repository. They simulate dropping
and reordering messages, and crashing and recovering nodes. They have proven
their value as they have flushed out lots of early bugs.

My immediate concerns are finding help and finding a sponsor (investor or
adventurous customer). To that end, I'm coding less right now and promoting
more. Getting the word out will help me find developers interested in
improving or trying it.

When I do get to code, I may focus on load and performance testing. I have not
done enough work along those lines yet. Or, I may focus on an example that
sketches how the read/write side of the API integrates easily with Etags, CDNs
and web-clients, and how the scan API integrates with Spark or Map-Reduce.

~~~
fasteo
Maybe you can try contacting aphyr. Some of his "Call me maybe" articles have
been funded (by Comcast, if I remember correctly).

