
Call Me Maybe: MariaDB Galera Cluster - akerl_
https://aphyr.com/posts/327-call-me-maybe-mariadb-galera-cluster
======
teraflop
I don't find it surprising that so many distributed databases fail at ensuring
consistency; it's a very hard problem. What really gets me is that they
_deliberately, knowingly_ sacrifice consistency for performance, and then
claim the opposite in their documentation/marketing.

See
[https://github.com/codership/galera/issues/336#issuecomment-...](https://github.com/codership/galera/issues/336#issuecomment-136635018)
(linked from the article)

~~~
acconsta
>I don't find it surprising that so many distributed databases fail at
ensuring consistency; it's a very hard problem

It's not hard — it's slow. Sending reads and writes through Paxos or Raft
gives you sequential consistency. But not surprisingly, touching a quorum of
nodes for every operation is too slow to be practical for many workloads.

And that's usually _fine_ — most data aren't bank account balances.
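To see where the slowness comes from in miniature: in a Raft- or Paxos-style system, every write (and every linearizable read) must be acknowledged by a majority quorum before it commits, which means a network round trip to most of the cluster on every operation. A toy sketch (the function names are mine, not from any particular implementation):

```python
def majority(cluster_size: int) -> int:
    """Smallest quorum size such that any two quorums must overlap
    in at least one node -- the property that prevents split-brain."""
    return cluster_size // 2 + 1

def committed(acks: int, cluster_size: int) -> bool:
    """An operation is durable only once a majority has acknowledged it."""
    return acks >= majority(cluster_size)

# A 5-node cluster needs 3 acks per operation; that round trip to a
# majority on every read and write is where the latency goes.
```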

~~~
superuser2
Banks are eventually consistent.

Online transaction processing creates "pending" transactions, and the data is
often inconsistent. Your charge may exist in the merchant's database but not
post to your online banking for several hours. Or it may be for a wildly
different amount - e.g. gas stations will place a $100 hold on your debit card
and it will stay that way for days until it's settled for the actual (lesser)
amount.
If you were accidentally double-charged, then rather than processing a
separate refund, the merchant may simply not settle the duplicate charge, and
it will drop off your pending transactions... eventually. The lag may be
several days or weeks. If it's a debit card, you can't spend the money during
that time, and you may be temporarily broke because of it.

If you make an ACH transfer, money will disappear from your account one night,
spend a business day in the aether, and then post to the recipient's account
on the third day. The system is in an inconsistent state (i.e. money is
missing) for at least a day, possibly a whole weekend.

The actual transaction is settled and goes into a "posted" state with a lag of
roughly 2.6 * 10^8 ms - i.e. 3 business days. That's if you're lucky. Banks do
need strong consistency, but not in anything approaching realtime.

Even ZooKeeper could probably handle the U.S.'s financial transactions faster
than current infrastructure.

~~~
baudehlo
And what would shock you most is some of the methods used to send those
transactions (e.g. FTPing a text file, and if it gets corrupted, someone opens
it in vi to fix it - I have a friend who used to do exactly that).

~~~
kalleboo
For anyone who wants to read more, this blog post series goes into quite a lot
of detail on ACH (including the FTPing and the fixed-field file format)
[http://engineering.zenpayroll.com/how-ach-works-a-developer-perspective-part-4/](http://engineering.zenpayroll.com/how-ach-works-a-developer-perspective-part-4/)

------
mysql_cass_dba
Thanks to aphyr for the testing. Certainly it will help focus efforts on
improving things.

I am a MariaDB contributor and operate Percona Cluster in production, so I can
talk a bit about Galera.

It's recommended that writes go to one master, rather than be distributed
across the nodes. That will help with isolation issues.
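One common way to enforce a single writer (a hedged sketch; the hostnames, ports, and addresses are placeholders, not from this thread) is to put a TCP proxy in front of the cluster and mark all but one node as `backup`, so writes land on one node and only fail over when it dies. For example, with HAProxy:

```
# Hypothetical HAProxy front end for a 3-node Galera cluster:
# all traffic goes to node1; node2/node3 are used only if node1 fails.
listen galera-writes
    bind *:3307
    mode tcp
    option tcpka
    server node1 10.0.0.1:3306 check
    server node2 10.0.0.2:3306 check backup
    server node3 10.0.0.3:3306 check backup
```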

Also, some commenters have complained about year-old releases. PXC has
improved significantly in the past year regarding manageability, so you may
want to try again. For example, the startup script has a bootstrap option now.

For most people, vanilla async MySQL replication works best, esp. 5.6. But
Galera gives you another option when you need something else.

Having said that, it takes 5-10 years for a database or filesystem to mature,
so anybody using Galera now is an early adopter.

------
no1youknowz
I flirted with MariaDB Galera Cluster, on Amazon EC2. Frequently MariaDB
itself would just go down - not the server, mind you. Frequently the whole
cluster would go down as well, and to bring it back up I'd have to bootstrap
the whole cluster.

Essentially, what a pain in the backside to use.

This did not fill me with confidence; thankfully it did not go into production,
and we later went with Postgres/CitusDB.

The difference is day and night!

~~~
ZeWaren
Had a similar experience a year ago. The only difference being that I did go
into production. Man, what a mistake.

~~~
shaftoe
We've been using PXC, which is still Galera, but with Percona's variant of
MySQL instead. No problems at all.

I've been strongly considering MariaDB on Galera for encryption-at-rest. Is
there something about MariaDB that was not working well with Galera?

------
bsaul
Every time I read one of these posts (or others about MongoDB failures), I
keep thinking about this talk:
[http://youtu.be/4fFDFbi3toc](http://youtu.be/4fFDFbi3toc)

And why no one has tried to repeat their strategy for building a robust DB
system: start by building an extremely robust failure-simulation and testing
facility, then build your product.

Actually, I think what those guys at FoundationDB did was so exceptional that,
by buying the company and killing the product, Apple harmed the software
industry for the next 10 years. The fact that FoundationDB is mentioned in the
OP as the only distributed DB system one could recommend makes me more
confident in that statement.

~~~
jhugg
This might be overdoing things a bit. FoundationDB was a rigorously tested KV
store with strong consistency (uncommon for such a product). Yes, that talk
from StrangeLoop was great, but there are two common misconceptions here:

(What follows is my opinion / guesswork as a VoltDB employee)

1\. FoundationDB wasn't killed by Apple; it was rescued by Apple. The product
couldn't compete on just being a KV store and wasn't doing well in the market.
Apple saw a very bright and now experienced team and scooped them up for a
song.

2\. Before this happened, FoundationDB realized they needed a way to query
their system to compete, so they bought Akiban (a failing SQL DB company) to
add SQL to their system. But they assumed they could do this without deep
integration, which was wrong. They added a SQL "layer" on top of the KV store,
and it was way too slow to be practical. The benchmarks they published were
embarrassing.

I wrote a blog post about this: [http://voltdb.com/blog/foundationdbs-lesson-fast-key-value-store-not-enough](http://voltdb.com/blog/foundationdbs-lesson-fast-key-value-store-not-enough)

SUMMARY: FoundationDB had great testing and great engineering, but not a
particularly good product...

~~~
bsaul
Very interesting pov, thanks for sharing.

Actually, the thing I find most impressive in their tech stack is the approach
they took to building it, starting with the simulator + C++ extension. Those
are the technologies that I think would benefit the whole community, if they
were ever open sourced.

As a VoltDB engineer, how do you ensure your implementation doesn't compromise
the theoretical correctness of your system?

~~~
jhugg
I'm actually going to be discussing this at Strangeloop next month.

[http://www.thestrangeloop.com/2015/all-in-with-determinism-for-performance-and-testing-in-distributed-systems.html](http://www.thestrangeloop.com/2015/all-in-with-determinism-for-performance-and-testing-in-distributed-systems.html)

VoltDB is actually a bit simpler in what it promises: full serializable ACID
for all transactions. This is much easier to understand and verify than lesser
isolation levels.

We think what we've done is pretty clever too. We've built a determinism
checker into our replication engine, so that we can verify that each replica
has the same state at each logical point in time, and each operation makes
identical changes to that state.

Then we built test patterns that are designed to be as co-dependent as
possible and run them against a replicated VoltDB cluster. That cluster goes
through one or more kinds of failure, including multiple simultaneous
failures, and then a checker ensures no data is ever lost, corrupted, or run
in the wrong order.

It's different from the FDB thing. The simulation they did is certainly easier
to run on a pure KV store, but keep in mind we also have to test SQL that
queries millions of rows, along with upserts, materialized aggregations,
etc...

We're working on some blog content on this in addition to my talk. Stay tuned.

~~~
bsaul
Great, I'm looking forward to seeing that talk. To me, it seems the only way
you can make sure an implementation is correct is either via extensive testing
(and when talking about ACID properties over a distributed system, I think the
kind of simulator FDB built is a requirement), or by running a theorem prover
over your source code, a la Coq. But I've never heard of any code analysis
tool that is able to guarantee properties over a distributed system.

But, since I'm absolutely not working in the field, I'm really looking forward
to seeing what professionals like you come up with to tackle those issues.

------
steveklabnik
I wish more companies supported people like Stripe does for Aphyr. Research is
so valuable, but there's so little incentive to be in academia. There's a
certain irony that undergrad left me with so much debt I couldn't really
consider joining a research university...

~~~
yid
> I wish more companies supported people like Stripe does for Aphyr. Research
> is so valuable, but there's so little incentive to be in academia.

If you're talking about incentives, don't forget that Kyle's research is some
of the most commercially useful research out there at the moment, and Kyle is
also skilled at implementing his ideas (a rare combination among researchers).
I'm actually quite pleasantly surprised that he's putting so much effort into
these blog posts rather than writing a string of journal papers.

~~~
panic
Is there a point to writing journal papers if you're not playing the academic
publication / citation game? Web sites are easier to publish and have wider
distribution than academic journals.

~~~
Jweb_Guru
The point is to receive peer review from domain experts (including those in
academia). But aphyr already gets that anyway :P

~~~
Rapzid
But there has been damning evidence coming out recently that many of the
historical peer-reviewed studies are in fact wrong or have outright flaws.
Further, I'd say that with the tools being provided for free and the blog
posts, Kyle's findings are most likely peer reviewed... and company
reviewed... and product-team reviewed...

~~~
Jweb_Guru
> But there has been damning evidence coming out recently that many of the
> historical peer-reviewed studies are in fact wrong or have outright flaws.

Sure, I agree (well, I'd probably replace "much" with a less strong word, and
skip the "damning", but that's just semantics). Plenty of studies haven't been
demonstrated to be wrong, and plenty of others were wrong despite being
correctly designed experiments (something neither the peer reviewers nor the
scientists performing them could have caught). Moreover, many flawed studies
_haven't_ been accepted by peer reviewers, which is in fact the purpose of
peer review. The biggest problem with peer review is probably publication bias
against negative results, without which I suspect most demonstrated scientific
fraud wouldn't exist, but that doesn't mean the whole process needs to be
thrown out.

> Further, I'd say that with the tools being provided for free and the blog
> posts Kyle's findings are most likely peer reviewed... And company
> reviewed... And product team reviewed..

Yes, Kyle does get that (hell, he's been cited in academic papers). The
average person publishing blog posts does not, though.

------
StavrosK
Is there _any_ data store that has come out of these tests looking good? Every
Aphyr post I've read had a pretty big failure at some point, even for
datastores I considered solid, like Riak or Cassandra.

~~~
acconsta
Zookeeper came out looking good:

[https://aphyr.com/posts/291-call-me-maybe-zookeeper](https://aphyr.com/posts/291-call-me-maybe-zookeeper)

~~~
helper
Also Riak with 'allow_mult' turned on:

[https://aphyr.com/posts/285-call-me-maybe-riak](https://aphyr.com/posts/285-call-me-maybe-riak)

------
cpr
It's no mystery that the kinds of systems with massive reach--Google search,
Facebook, Twitter--are really _not_ data-consistency-critical applications.

Who'll ever know if a search result isn't perfectly up to date or perfectly
accurate?

Who'll ever know if you missed a Facebook feed entry because it "wasn't
relevant" or simply wasn't seen due to DB vagaries?

And who'll ever know about a few tweets going astray here or there?

In all cases, they're all likely "eventually consistent" (or close to it), but
it's no accident that it doesn't ultimately matter in those massive scale
examples.

And maybe that's the secret to massive scale--it can't ultimately matter.

~~~
vidarh
One of the startups I worked on was a classifieds aggregation engine pulling
data from external feeds.

We basically queued all retrieved items for processing with no attempts at
avoiding data loss whatsoever - including using in memory queues for lots of
things.

Our reasoning was that if a machine crashed, the worst case was that a few
listings would take up to 24 hours to update, but generally much less (we
adapted the crawling rate for each source based on its change frequency, so
large sources of listings got re-indexed far more frequently; if a feed didn't
update for 24 hours, it was because it wasn't a source of much data), and we
could force refreshes of the data.

Some people were horrified at the approach because the idea of ensuring
consistency and not losing data is so ingrained. But the reality is that you
need to weigh the cost of consistency against the value it provides. And
often it's not very valuable, _especially_ when there is an authoritative
source of the data to recover from and when the data will be outdated quickly
anyway.

A lot of the time, any notion of consistency is an illusion anyway - by the
time the page has been returned, the results are outdated - and what matters
is maintaining the illusion (e.g. ensuring that if a user makes an update,
it's reflected in the page that's returned).
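One standard way to maintain that illusion is a read-your-own-writes session guarantee: track the position of your last write and only read from a replica once it has replayed past it. A toy in-memory sketch (every class and name here is invented for illustration, not from any particular system):

```python
class Primary:
    """Minimal stand-in for the authoritative store."""
    def __init__(self):
        self.data, self.pos = {}, 0
    def write(self, key, value):
        self.pos += 1
        self.data[key] = value
        return self.pos          # position of this write in the log
    def read(self, key):
        return self.data.get(key)

class Replica:
    """Lags the primary until catch_up() is called."""
    def __init__(self, primary):
        self.primary = primary
        self.data, self.replayed = {}, 0
    def catch_up(self):
        self.data = dict(self.primary.data)
        self.replayed = self.primary.pos
    def read(self, key):
        return self.data.get(key)

class Session:
    """Read-your-writes: serve a read from the replica only once it has
    replayed past this session's last write; otherwise hit the primary."""
    def __init__(self, primary, replica):
        self.primary, self.replica = primary, replica
        self.last_write = 0
    def write(self, key, value):
        self.last_write = self.primary.write(key, value)
    def read(self, key):
        if self.replica.replayed >= self.last_write:
            return self.replica.read(key)
        return self.primary.read(key)  # replica still behind our write
```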

The key is that you need to know the tradeoffs and apply them consciously,
rather than get caught out by tradeoffs that components you rely on make
without telling you.

~~~
spdionis
I wonder what kind of people were those horrified by this very simple and
completely fine approach.

~~~
vidarh
People used to purely transactional systems where every piece of data they
received were important who were not used to think about data as potentially
disposable... Nothing they didn't pick up quickly enough, but a bit of a
mental adjustment (for my part, my job right before that one was running a
billing system; I was very happy not to have to worry about that any more...)

------
Karunamon
Can I point out that I'm utterly impressed and astounded by Kyle's ability to
remain detached and professional in the face of what appear to be _outright
falsehoods_ stated by some of these companies?

If I had gone to his lengths to critically evaluate the safety of a database
system, and it then came out that the marketing materials or the words of the
developers were... _significantly misleading_ at best, my first response would
likely be a profanity-laden rant, not a cool recounting of how and why they're
wrong.

------
Erwin
Despite the reliability issues, that product sounds nice -- is there anything
for Postgres with similar ease of setup for replication etc. that someone can
recommend? E.g. EnterpriseDB?

~~~
mrmondo
PostgreSQL replication isn't hard these days; since 9.x it's just a couple of
lines of config. If you want active/active, that's a different story for any
ACID-compliant database - I'd look at CitusDB / Postgres-XL.
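For that era of PostgreSQL (9.x streaming replication with a hot standby), the setup really is only a handful of settings. A hedged sketch (the hostname, network, and user are placeholders; `recovery.conf` applies to releases before PostgreSQL 12):

```
# postgresql.conf on the primary:
wal_level = hot_standby
max_wal_senders = 3
hot_standby = on          # ignored on the primary; needed on the standby

# pg_hba.conf on the primary: let the replication user connect
host replication replicator 10.0.0.0/24 md5

# recovery.conf on the standby:
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com port=5432 user=replicator'
```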

------
gregwebs
What about AWS Aurora? I have not used it, but I definitely would if I was
using MySQL.

~~~
samstave
I predict in 5 years... this will be the standard.

It's not so much about wanting to bolt onto Amazon's ecosystem -- it's about
allowing an org to focus.

AWS allows for an ops team to completely have no concern for hardware. Lovely.

I also don't want an ops team doing DB cluster mgmt.

I want to deploy and delve into my data.

I want my data/eng/dev/ops teams to be pummeling the shit out of my data
without concern for the instances/hw/cluster.

Aurora adds to the cloud fabric.

One direction Amazon is moving in is defining compute capacity rather than
defining instances at all.

Their vision of abstraction is amazing.

~~~
vidarh
> AWS allows for an ops team to completely have no concern for hardware.
> Lovely.

I split my time between managing a few racks worth of servers at one company,
and an AWS setup at another client. The amount of ops work per server/instance
is _higher_ for the AWS client than it is for the company I manage physical
racks for. That's despite the fact I do physical cabling and racking of
equipment as necessary.

The issue is that there are so many aspects of the AWS infrastructure that
take extra effort because we don't have full control. On our own hardware, for
example, we can put whatever disk subsystem we want in the servers, rather
than working around the lack of any truly fast disk subsystem in EC2. So for
every day I don't spend physically moving servers, I spend 3-5 days working
around AWS limitations.

I agree with you with respect to what you _want_, but we're taking baby steps
today - AWS is way too expensive for most people to move to, even before you
factor in the ops complexities.

~~~
MichaelGG
Same on Azure. Their networking stack is... strange and undocumented. It's the
worst IP network I've used in a while. Dealing with IP/TCP-level issues should
not be something that takes up any noticeable amount of time on a modern
network. With Azure, it is.

