

Reliability, availability and scale – an interlude - flapjack
http://afeinberg.github.com/2011/06/25/reliability-availability-scale-interlude.html

======
jorangreef
I have this idea in mind of a system where nodes are symmetrical and each node
contains a vertically integrated stack, all within a single, evented process.

So instead of having asymmetrical services (database, web server, workers,
logging, messaging) spread across clusters of symmetrical nodes, one would
have all of these services within a single process on a single core machine.

Each node would be responsible for itself: from the time it boots to the time
it dies, continuously integrating from the master trunk, accepting data
partitions from other nodes, checking up on the health of other nodes etc.

One advantage would be that data and workload are evenly distributed, for
example each node would have a few workers running which would only operate on
local data. So all services would automatically be sharded, not just the
database. If each node's workers only operate on local data, this would
simplify locking across workers throughout the system.

There would be no master nodes, rather replica sets, all multi-master similar
to Dynamo. Expanding capacity across all services would mean booting a new
instance in EC2.

Also, having the services vertically integrated within a single evented
process would eliminate network overhead, deserialization/serialization
overhead, parsing overhead, queue contention, locking etc.

Everything could be written in the same language, to make for easier
extensibility, more in-house understanding of core components, and more
opportunities to reduce waste.
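The single-process, evented node described above could be sketched roughly like this, assuming Python's asyncio as the event loop. All names here (LocalStore, worker, etc.) are illustrative, not from any real system; the point is just that each service is a coroutine in one process, and workers touch only local data.

```python
import asyncio

class LocalStore:
    """This node's shard of the data; workers never reach beyond it."""
    def __init__(self, items):
        self.queue = asyncio.Queue()
        for item in items:
            self.queue.put_nowait(item)
        self.processed = []

async def worker(store):
    # Background worker: operates only on local data, so no
    # cross-node locking is needed.
    while True:
        item = await store.queue.get()
        store.processed.append(item.upper())
        store.queue.task_done()

async def health_check(peers, interval=0.01):
    # Each node checks up on its peers itself; the ping is simulated.
    while True:
        for peer in peers:
            pass  # would ping the peer here
        await asyncio.sleep(interval)

async def node_main():
    store = LocalStore(["a", "b", "c"])
    tasks = [
        asyncio.create_task(worker(store)),
        asyncio.create_task(health_check(["10.0.0.2", "10.0.0.3"])),
    ]
    await store.queue.join()          # wait until local work is done
    for t in tasks:
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return store.processed

result = asyncio.run(node_main())
print(result)  # → ['A', 'B', 'C']
```

Everything shares one event loop, so there is no serialization or queue contention between the "services"; the trade-off noonat raises below (different parts of the stack having different resource needs) still applies.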

~~~
noonat
I think the main problem with this approach is that different parts of your
stack have different requirements. One part of your stack might require large
amounts of RAM for efficient response times, but may be used infrequently, for
instance. With a separate server, you can scale that part of the stack
independently. If the whole stack lives on every server, you must sacrifice
performance or money for that uniformity.

------
IgorPartola
Somewhat OT: I have lately been toying with the idea of removing part of the
cluster logic from the DB and moving it into the client's drivers instead.
Basically, it would work something like this:

You have 3 database nodes. Your application requests write access. The DB
driver opens an event loop and tries to connect to all three at once. As soon
as it has two live connections, it starts a distributed transaction (I know,
these are unreliable, but bear with me, and assume that they work for now.
This is the same thing as used on the server anyways). The application
completes the work, and it is committed to two servers, which is the majority.
The third server will replicate the changes as soon as it is able, verifying
that the majority of the servers agree that this is in fact a valid changeset.
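The driver-side majority write described here could be sketched with in-memory stand-ins for the three nodes. The class and method names (Node, MajorityDriver, catch_up) are hypothetical; a real driver would use sockets and a proper distributed-transaction protocol rather than this toy commit.

```python
class Node:
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.log = []          # committed changesets

class MajorityDriver:
    def __init__(self, nodes):
        self.nodes = nodes
        self.majority = len(nodes) // 2 + 1

    def write(self, changeset):
        # "Connect to all three at once": here we just collect
        # whichever nodes are currently reachable.
        live = [n for n in self.nodes if n.alive]
        if len(live) < self.majority:
            raise RuntimeError("no majority available, refusing write")
        # Commit to the live majority.
        for n in live:
            n.log.append(changeset)
        return len(live)

    def catch_up(self, node):
        # The recovering node replicates whatever the live majority
        # agrees on before accepting connections again.
        reference = max((n.log for n in self.nodes if n.alive), key=len)
        node.log = list(reference)
        node.alive = True

a, b, c = Node("a"), Node("b"), Node("c", alive=False)
driver = MajorityDriver([a, b, c])
driver.write({"set": ("x", 1)})     # succeeds: 2 of 3 nodes are up
driver.catch_up(c)                  # c replays the agreed changesets
print(a.log == b.log == c.log)      # → True
```

Raising the threshold from `majority` to `len(nodes)` is the CAP knob mentioned below: the driver then refuses writes unless every node is reachable.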

I think that this approach, combined with a good locking system (e.g. the
client would declare all locks upfront), could result in a robust system. It
also makes it easy to change where your system lies on the CAP triangle: just
tune your driver to require a connection to all servers, not just the
majority, to lean more towards consistency, and make downed nodes that are
recovering refuse connections until they have caught up.

Does anyone have any thoughts on this?

~~~
Robin_Message
I'm pretty sure this has been tried (generalised to getting a write lock on
more than N/2 of the servers), but I can't find a citation right now.

~~~
IgorPartola
Any idea if there was any success with it?

~~~
Robin_Message
I've found a reference to it in Jean Bacon's _Concurrent Systems_. She calls
it "quorum assembly", but I can't find any other references using that name
(the slides on cl.cam.ac.uk are by her, or influenced by her work). I'm afraid
it is probably an idea that has been had many times.

The write quorum (number of nodes you must talk to to do a write) must be >
n/2. The read quorum plus the write quorum must be > n (otherwise you can read
from outside the write quorum).

So the n=3 case is simple: RQ=WQ=2, as you suggested. I think it wasn't very
successful in larger cases because n/2 isn't much better than n, and you have
to do some kind of synchronisation between the nodes which is tricky to get
right (consider how the write nodes know that the other write nodes have done
the write).
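The two quorum conditions can be checked mechanically: WQ > n/2 guarantees any two write quorums overlap, and RQ + WQ > n guarantees every read quorum intersects every write quorum, so a read always sees the latest write. A quick brute-force confirmation for n=5:

```python
from itertools import combinations

def quorums_overlap(n, q1, q2):
    # In the worst case the two quorums are chosen as disjointly as
    # possible, so they must overlap exactly when q1 + q2 > n.
    return q1 + q2 > n

n = 5
wq = n // 2 + 1        # smallest legal write quorum: 3
rq = n - wq + 1        # smallest legal read quorum: 3

assert quorums_overlap(n, wq, wq)   # concurrent writes see each other
assert quorums_overlap(n, rq, wq)   # reads see the latest write

# Exhaustive check: every RQ-sized subset of the nodes shares at
# least one node with every WQ-sized subset.
nodes = range(n)
for r in combinations(nodes, rq):
    for w in combinations(nodes, wq):
        assert set(r) & set(w)
print("overlap holds for n=5, RQ=WQ=3")
```

This also shows why larger n doesn't help much: the minimal write quorum stays above n/2, so writes never get cheaper than "talk to more than half the cluster".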

In practice, hierarchical schemes where you are reading from a slave which
might be behind the master are more common. There you can have 1 write node
and n read nodes. Since reads are generally more common than writes, this is
great.

~~~
IgorPartola
Thanks for the details and the explanation. Makes a lot of sense.

