Mnesia and LevelDB: Liberating Mnesia from the limitations of DETS

rdtsc · on July 3, 2015

Klarna (the European payment processor) and WhatsApp architectures are built on Mnesia. Mnesia is Erlang's built-in distributed database. Due to its age it has a few bad corner cases, that could be fixed if it just had a better backend (the front-end API is nice, well designed, supports transactions, is built-in to the lanauge).

So this effort brought in the ability to have a better backend, and make Mnesia a better option as a general purpose distributed database.

Here is an talk on how WhatsApp uses Mnesia:

Erlang Factory 2014 - That's 'Billion' with a 'B': Scaling to the Next Level at WhatsApp

https://www.youtube.com/watch?v=c12cYAUTXXs

Here is an example of using Mnesia:

https://en.wikibooks.org/wiki/Erlang_Programming/Using_mnesi...

scurvy · on July 3, 2015

Raise your hand if you're ever actually run mnesia on a busy server in a distributed system. OK, show of hands of how many people will do that again? Thought so.

mnesia should not be in the OTP. Not by a longshot.

I love Erlang. I hate mnesia.

derefr · on July 4, 2015

I think people are trying to use mnesia for things it wasn't made for. mnesia basically has exactly one purpose:

• in a 1:1 master:hot-spare setup,

• where the nodes contain their own data in process-memory-space rather than relying on a separate "database" node,

• and you need to be able to fail-over to the hot-spare and promote it to master, without business logic being aware of this,

• and your system has time tolerances allowing you to manually fix the old master and bring it up as the new hot-spare,

then mnesia is perfect.

You know what system I'm describing?

Call switching!

You know what system I'm not describing?

Most things!

seiji · on July 4, 2015

You know what system I'm describing? Call switching!

Exactly right!

The "distributed" part of Erlang (including mnesia) was designed to run in a blade-like system where the networking was provided by a physical common backplane among all the compute cards in the chassis.

So, a lot of distributed erlang and mnesia falls apart when dealing with network partitions and resyncs and real-world scenarios that wouldn't really happen on a common physical substrate.

That's why most sane erlang people won't run distributed erlang (gotta love that epmd), they'll run their own TCP servers connecting to external DBs.

mononcqc · on July 4, 2015

Distributed Erlang on its own is fine to run in production (particularly as a control plane for metadata between nodes) because few assumptions about reliability of the network are baked in; most of them can be made by the application designer.

Distributed OTP applications (with the automated takeover/failover mechanism) see very little use in the real world because of their set of assumptions that network failures are rarer than software or hardware failure, which result in perceiving all netsplits as nodes going down (a great way to get split brains!)

derefr · on July 4, 2015

Yep. It feels like people want to use Erlang's distribution mechanism for communication between machines which have no pre-existing relationship, which is not what it's for. It's what regular sockets are for—and Erlang is great at everything to do with regular sockets.

I do wonder if you could get an interesting boost in fault-tolerance by writing an Erlang application to run distributed between, say, several EC2 instances in the same Placement Group. That gets you the analogous "backplane guarantee" in virtualized network-space, AFAIK.

lostcolony · on July 4, 2015

Actually, a lot of internal business style applications fit Mnesia's model too (if you're hosting them in a single datacenter). While it means you'll need to manually deal with netsplits (or write your own code to address them, or try to borrow https://github.com/uwiger/unsplit), if you're hosting it internally on your own servers, that might be a rare enough event (and with sufficiently minimal likelihood/consequence of things going down, coming back up, etc, such as to introduce problematic inconsistencies rather than downtime/ignorable inconsistencies in the event of major network/system thrashing) as to be worth it for certain things.

Something, like, say, file processing. You have a watch directory, you want to be able to process everything that lands in that directory in a scalable manner, but don't want to re-process the same thing (but it's okay if you do, just inefficient). Mnesia is probably fine to keep tabs of what you've processed already; in the event of a netsplit you can just let all sides of it keep going, until you get around to fixing the cluster. Your inconsistencies just lead to inefficiencies, rather than real data loss, and you have a clear path to fixing them (just dump the data on the partitioned nodes, and rejoin them to the cluster). As such, you have a more resilient, scalable system than you would if you just used a centralized database, while not having to configure and manage a separate DB.

That said, I like the idea of being able to swap Mnesia out for something a little less warty, if it's pretty seamless in operation.

angersock · on July 3, 2015

Why the hate for mnesia?

scurvy · on July 3, 2015

You'll get a lot of mnesia partitions on even the best network with a slightly busy system. It's not very sturdy, and resynchs are really slow.

toast0 · on July 4, 2015

I only see mnesia partitions when the network is having issues. Maybe my network is better than your best network? Maybe network issues a few times a year is a lot?

Resynchs are really slow, because "resynch" is copy data from the network. Mostly this goes close to the line speed for your network though; be sure to move the data files out of the way on the node that will be receiving the copies: otherwise mnesia first loads those, then throws them away sigh. It would be pretty nice if there was support for logging changes when a node is down, though.

lostcolony · on July 4, 2015

Yeah; running Mnesia in production, albeit for pretty low volume systems (maybe 3000 writes a day on average), in multiple clusters, in different locations, running from anywhere between half a year to 2 years, I think I've seen...two netsplits, total, both of them caused by a VM snapshotting process that caused the socket to hang. But we've got 24/7 ops keeping an eye on things in any case, and the data being stored is maybe ~1 gig, so resyncing is fast. It fits our use case.

scurvy · on July 4, 2015

>I only see mnesia partitions when the network is having issues. Maybe my network is better than your best network?<

Doubtful on the network thing. mnesia will partition then resync when the server gets really busy. As others have mentioned, it might have nothing to do with anything that Erlang is doing. It might be something external. It could just be a lot of traffic that causes it to fall behind. Either way, one missed message and then you're forced to fully resync.

>Resynchs are really slow, because "resynch" is copy data from the network. Mostly this goes close to the line speed for your network though<

Which means your other node has its interface maxed out, causing more service disruptions. I've never run mnesia on a 10gig network, but that definitely was the case with 1gig. I'm not really willing to test or run mnesia in a 10gig/40gig environment. Been burned by it too many times.

>Mostly this goes close to the line speed for your network though; be sure to move the data files out of the way on the node that will be receiving the copies: otherwise mnesia first loads those, then throws them away sigh. It would be pretty nice if there was support for logging changes when a node is down, though.<

Which again, is another sign that it's not robust enough for Internet application usage. Probably OK for some 1990's phone switching, but not for how distributed systems are built today. How are you going to manually move the files out of the way in today's world of systemd automatically restarting failed daemons? Manual operator intervention? Thought so, and this is why ops teams hate mnesia.

Ixiaus · on July 4, 2015

Plumtree is an excellent alternative to Mnesia: https://github.com/helium/plumtree

istvan__ · on July 3, 2015

Thanks for sharing this. Mnesia is a little bit quirky at the first glance but it is a pretty cool system. LevelDB also battle tested lots of systems build on it. I am curious how good is combining the two.