Call Me Maybe: MongoDB Stale Reads (aphyr.com)
605 points by llambda on April 21, 2015 | 144 comments



From 2009 to 2012 I had a distributed database startup that competed with MongoDB. We used Paxos for replication and built the database with on-disk consistency guarantees --- like the ones this article looks for and rightly obsesses over --- in mind.

https://github.com/scalien/scaliendb

Outcome: you've never heard of ScalienDB; MongoDB brilliantly won by winning the hearts and minds of hackers and coders who don't care about such issues, but were able to get started quickly with Mongo (and got cool free cups at meetups). It turns out that's most engineers out there, definitely the initial critical mass to target for a database startup like Mongo.

Btw. the story behind Oracle is similar: early versions were basically write-only; read Ellison's book 'Softwar'. Of course there are other ways to get started: for example DBs coming out of academic research like Vertica seem to avoid this problem; in that case initial funding is basically provided by the gov't and when they create the company to commercialize they're already shooting for Enterprise contracts, skipping the opensource/community building phase of Mongo.


It's very true that MongoDB was able to target developers who didn't know or care about a lot of issues in the databases world. However, I wouldn't attribute Vertica's success to coming out of the academic world and being funded by the government.

MongoDB targets a part of the market that is well served. There are loads of other options that work well, like PostgreSQL. If you were building a truly distributed transactional database, you might have been building something with order-of-magnitude-better characteristics, but MongoDB doesn't seem like that.

Vertica, on the other hand, really targeted an area that wasn't well served and a decade after C-Store this area still isn't well served by anyone other than Vertica. ParAccel/Redshift isn't really comparable, though many go with it because Amazon will manage it for them.

For MongoDB, the question is "how can we win over developers' hearts and minds?" Why use Mongo? Well, you like the query DSL and get annoyed by joins, and they have a great website and documentation, and we all know that joins are slow so without them everything must scale nicely. But realistically, there are a bunch of other options that could be substituted for it.

And you're right that Vertica skipped the community aspect. The reason why is that something like Vertica isn't interesting if you don't have a lot of data. MongoDB targeted people who might have GBs of data that they wanted to store. There are a lot of those people, and community relations are going to matter, especially because there are lots of battle-tested alternatives. Since Vertica is most interesting when you have more data than can fit on a single machine's hard drives and might take hours to run analytics queries against, they're going to be targeting a much different market.

Vertica also had the advantage of being associated with the biggest names in databases. Stonebraker has a Turing award. Even beyond that, the names you'd think of when you think databases were involved in C-Store and Vertica.

If you're really creating something that has order-of-magnitude-better characteristics, you probably shouldn't target people that have 10GB of data to store. Someone once said something to the effect of: once there were more than 200 websites on the internet, it made more sense to target web frameworks at the websites that weren't the top 100 than solving for the problems of the top 100 sites. Most sites aren't going to need a massively distributed database. They might have dreams that the same codebase will scale from 100,000 visits/day to Facebook-scale, but they're just dreams. The fact is that for 10GB worth of data, your database probably wouldn't be better simply because that's already well-handled. In fact, distributed consensus on such small data would probably make your database worse. You really need to target people that you're solving a problem for.

Vertica just isn't interesting technology for the vast majority of users and so they don't sell to the vast majority of users. MongoDB is interesting for those who want an alternative DSL and such.


I don't know, I got started with Mongo because ... it was so easy to start with, but come deployment time I got bitten by a LOT of issues (which essentially negated any advantage in using Mongo).

As a result of this experience I almost exclusively use PostgreSQL, and I've never EVER been burned by taking this approach.

Sometimes I do use another DB but there has to be a seriously good reason for it.


Same thing. I was a software dev, then I went to the back end with MongoDB and NodeJS, and I learnt what ACID means...

Also, about MySQL: many people forget it isn't ACID compliant either and just look at benchmarks or use MySQL "because Facebook uses it". I am sure that any startup that used PostgreSQL would avoid many problems down the road.


I think that PostgreSQL is, in almost every aspect, a superior product to MySQL.

But MySQL not being ACID compliant is flat out wrong (assuming that you're using InnoDB).


Yes with InnoDB it is.


We actually evaluated ScalienDB extensively (even contributed patches) but decided against it because it was immature and notoriously unstable.

Somewhat later, the open-source ScalienDB repo got pulled, only to resurface years later. What happened?


Do you mean Keyspace, the tech we had before ScalienDB?

Anyway, the reason it was unstable is that we were early in the lifecycle, and as I described in the post, we couldn't get traction, so we also couldn't get investment, so it was just the two co-founders working on it. That's not enough; you need more manpower to do extensive testing of a distributed database which has its own storage engine, esp. when all of it is written in C++ for high performance. E.g. we only had money to buy 3 servers, so we used virtualization to test, but a virtualized environment is _not_ a proper way to test a database.


If you are a database author and you get a bug report from Kyle, spend a long time thinking about it before closing the issue as invalid.


Especially since many people will judge the competence of the engineering based on the response. It really makes me nervous when a database company doesn't even understand the problem for many days and then pretends like it's expected behavior afterwards even though it flies in the face of their documentation, advertising, and tech talks.


For context (for those who didn't read the whole article): https://jira.mongodb.org/browse/SERVER-17975



Actually Elasticsearch took their Jepsen results pretty seriously and even maintain an ongoing status page on their resiliency. See my post on this:

https://news.ycombinator.com/item?id=9418318


It looks like that specific issue was closed because the other problems reported are being tracked in other issues.


Indeed, database vendors should aim to have their software "Jepsen certified".


Agreed!

As an example, HashiCorp included Jepsen test results in their Consul documentation: https://consul.io/docs/internals/jepsen.html


As Kyle himself mentioned (I think it might even be in relation to this document), Jepsen cannot prove that your software is safe, only prove that it isn't.

He even called them out for modifying the timeout value to pass the tests [1].

[1] https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul


> Jepsen cannot prove that your software is safe, only prove that it isn't.

This doesn't just apply to Jepsen but to all software tests. You only ever test a finite set of scenarios, so you can't really ever guarantee your software is 'safe'/bug-free - only that it does not fail in common/expected scenarios.


Depending on how broadly you define "software tests", this can easily not be the case.

Kyle often mentions the tool TLA+, which enables complete formal analysis of certain systems. Such an analysis can be a test which is exhaustive and therefore a positive proof of correctness.

It's also not difficult to test smaller components of your system if you note that their state space is small enough to be exhausted. This is a very important thing to look into for serious, important components of a software system.
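To make that concrete, here is a toy sketch (in Python, all names made up) of exhausting the state space of a tiny component: a test-and-set spinlock shared by two processes. Because the search visits every reachable state, the passing assertion is a positive proof of mutual exclusion for this very small model:

    # Toy sketch: enumerate every reachable state of a two-process
    # test-and-set spinlock and check the mutual-exclusion invariant.
    from collections import deque

    def steps(state):
        pcs, lock = state
        for i in (0, 1):
            def with_pc(new_pc):
                return tuple(new_pc if j == i else pcs[j] for j in (0, 1))
            if pcs[i] == "idle":
                yield (with_pc("trying"), lock)
            elif pcs[i] == "trying" and not lock:     # atomic test-and-set
                yield (with_pc("critical"), True)
            elif pcs[i] == "critical":                # release the lock
                yield (with_pc("idle"), False)

    initial = (("idle", "idle"), False)
    seen, queue = {initial}, deque([initial])
    while queue:
        state = queue.popleft()
        assert state[0] != ("critical", "critical"), "mutual exclusion violated"
        for nxt in steps(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    print("invariant holds in all", len(seen), "reachable states")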

Finally, though it's still futuretech today, dependent types offer a general system for embedding proofs of correctness directly into code, thus elevating positive proof to a computational artifact just like any other.


Type systems and proofs have their limitations, if only because for a proof to be worth anything, I need a perfect description of what I want, and that doesn't make any sense outside of trivial cases.

I once worked with a relatively well-known, prove-everything developer whom you might know by name. He's written books and everything. He built an algorithm to make a distributed system of equal peers figure out when it needed to start more nodes, or could shut some of them down. He wrote a proof, in code. He wrote a paper. In production, the system would not work as advertised, and he blamed it on other pieces, because the algorithm was proven correct! So the problem stayed there for months.

After he left the company, I decided to figure the problem out, so I read through the proofs, the paper, and the code: All the single letter variables you could possibly want. I figured out that yes, the algorithm was flawless, as long as every operation in the system was atomic and instantaneous. Instead of the proof, I built a small simulator that didn't have such flawed assumptions, and got the exact same behavior as the production system. So the proof was perfect, as long as we made assumptions that are impossible in our universe. And the entire algorithm was less than 200 lines of code.
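To illustrate the kind of simulation described, here is a made-up toy in Python (not the actual system from the story): two controllers each want the cluster at two nodes, and "read the count" and "start a node if it's low" are modelled as separate steps rather than one atomic one. Exploring every interleaving shows the overshoot that an atomic-and-instantaneous model would never exhibit:

    # Hypothetical toy model. Under an atomic read-then-act assumption the
    # final count is always 2; with the steps split, some schedules reach 3.
    def interleavings(a, b):
        if not a or not b:
            yield list(a) + list(b)
            return
        for rest in interleavings(a[1:], b):
            yield [a[0]] + rest
        for rest in interleavings(a, b[1:]):
            yield [b[0]] + rest

    TARGET = 2

    def run(schedule):
        count, observed = 1, {}
        for who, step in schedule:
            if step == "read":
                observed[who] = count
            elif step == "act" and observed[who] < TARGET:
                count += 1                    # starting a node is itself atomic
        return count

    proc = lambda name: [(name, "read"), (name, "act")]
    print({run(s) for s in interleavings(proc("A"), proc("B"))})   # {2, 3}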

So whenever we have a reality that is difficult to model (and let me tell you, distributed systems fit the bill), dependent types will not save you, Haskell or no Haskell. Proofs will always be limited by your assumptions.

So while the tools you mention are nice, they hit the same limits as everything else we build. Whether to write a proof in Idris, use generative testing, or just some example tests is really all a tradeoff, but you will never escape from bad specification, as all specifications are bad.


Sure, never indicated otherwise I hope!

But that said, the style of analysis is totally different in each case. It's not true that you can never prove correctness. It's just a certain option with certain tradeoffs.

And there are success stories! Tons of older ones, back when proof was a major component of computer programming. More modern ones like verified C kernels and compilers.


I don't know of any case in industry where TLA+ has been used to prove a spec correct. AFAIK it's only been used for model checking. Read the "Formal Methods at AWS" paper for details.


The most interesting lessons from the Jepsen series:

* You should never trust, and always verify, the claims made by database manufacturers.

* Especially when those claims relate to data integrity.

* Super-especially when every safety level provided by the manufacturer that includes the word "SAFE" is actually unsafe.


It seems easier to me to just go with a reputable database vendor.


That island of safety is a rapidly shrinking one. Jepsen covers more ground with each new post from Kyle. :)


Has any CP system passed a Call Me Maybe test the first time, apart from ZooKeeper [1]?

[1] https://aphyr.com/posts/291-call-me-maybe-zookeeper


FoundationDB. Kyle didn't bother running Jepsen against FDB because FoundationDB's internal testing was much more rigorous than Jepsen. The FoundationDB team ran it themselves and it passed with flying colors:

http://blog.foundationdb.com/call-me-maybe-foundationdb-vs-j...

Sadly, FDB has been bought by Apple[1], and you can't download it anymore. I sincerely hope FoundationDB gets open-sourced or something.

[1] http://techcrunch.com/2015/03/24/apple-acquires-durable-data...


I believe the real reason was that FDB was not open sourced.

Jepsen tests cannot be used to prove the system is safe, only to prove it isn't.

It looks like he very often looks into the source code to figure out how a system operates; that way he can find weaknesses and write tests for his framework that demonstrate the issue.

I wouldn't trust any company that uses Jepsen to show that their product is safe.


[deleted]


He has said [1] that each test takes literally months to do. I am not surprised that he picks the most popular databases, given the amount of work.

[1] https://github.com/rethinkdb/rethinkdb/issues/1493#issuecomm...


Actually, look at some of the things that @aphyr has tweeted about FoundationDB:

https://twitter.com/aphyr/status/542792308492484608

https://twitter.com/obfuscurity/status/405016890306985984



Postgres can survive a network partition? I wasn't aware that master-master or sharding+replication was in the box yet?


Did you read the article? It's not about PostgreSQL in a distributed setup:

> Even though the Postgres server is always consistent, the distributed system composed of the server and client together may not be consistent. It’s possible for the client and server to disagree about whether or not a transaction took place.


To me that's not surprising. If the client connection drops, you might not know whether the transaction committed or not... the only way to know is to reconnect and inspect to see what happened.

If you want to avoid that kind of problem, use 2PC.

Do you (or the author) see this as a bug, or just something that might surprise people who haven't thought through the guarantees?


It looks like you didn't read the article either -- it describes how Postgres is using 2PC, but 2PC doesn't stop the server from committing something the client is unaware of because the final ACK was missed.

Quoting the article:

> The 2PC protocol says that we must wait for the acknowledgement message to arrive in order to decide the outcome. If it doesn’t arrive, 2PC deadlocks. It’s not a partition-tolerant protocol. Waiting forever isn’t realistic for real systems, so at some point the client will time out and declare an error occurred. The commit protocol is now in an indeterminate state.


What is the significance of that though? As you say, it's not partition-tolerant, so while partitioned, the system is down. As soon as the network issue is resolved, you can determine the state of your transaction.

It would be foolish for a client to issue a COMMIT and then assume the transaction aborted because of a connection drop. The client should wait until the connection can be reestablished and determine the real transaction state before making a decision based on it.

It's the same issue as with a power failure during fsync. The durability of that transaction is indeterminate, but it doesn't matter because the system is down. Before the system comes back, it will go through recovery, and either find the commit record or not, thus getting back to a determinate state.
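A minimal sketch of that reconnect-and-determine-state approach, assuming psycopg2; the DSN and the "transfers" table keyed by a client-generated id are hypothetical:

    # Toy sketch: tag each transaction with a client-generated id, and if the
    # connection dies around COMMIT, reconnect and look for that id instead of
    # assuming the transaction aborted.
    import uuid
    import psycopg2

    DSN = "dbname=app host=db.example.com"    # hypothetical
    txn_id = str(uuid.uuid4())

    def transfer():
        conn = psycopg2.connect(DSN)
        with conn.cursor() as cur:
            cur.execute("INSERT INTO transfers (id, amount) VALUES (%s, %s)",
                        (txn_id, 100))
        conn.commit()                          # the ack for this is what can get lost

    def committed():
        conn = psycopg2.connect(DSN)           # reconnect (retry as needed in real code)
        with conn.cursor() as cur:
            cur.execute("SELECT 1 FROM transfers WHERE id = %s", (txn_id,))
            return cur.fetchone() is not None

    try:
        transfer()
    except (psycopg2.OperationalError, psycopg2.InterfaceError):
        # Outcome is indeterminate: don't assume it aborted, go look.
        print("committed" if committed() else "not committed; safe to retry")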


HBase. With the caveats that it's piggybacking off the success of ZooKeeper, and the tests were not run by Kyle himself:

https://eng.yammer.com/call-me-maybe-hbase/


Note that Robert Yokota's addendum[1] points out that HBase "cannot achieve both consistency and availability." His earlier results did not take into account that HBase clients continuously retry failed ops. (According to Nicolas Liochon, upon its death a server's regions are moved to another node.) Failures start rolling in once network partition(s) extend beyond the configured timeout. Kyle [2] was not impressed:

"During the network partition, no requests are successful" is not the best result for a CP system, IMO."

HBase should provide partial availability in the face of partitions.

[1] http://eng.yammer.com/call-me-maybe-hbase-addendum/

[2] https://twitter.com/aphyr/status/509841011816665088


No system can achieve 100% consistency and 100% availability in the presence of partitions. It's kind of wacky that Aphyr compared those replicated consensus tools and eventually consistent stores with HBase. HBase does not use a consensus protocol for replication; it uses HDFS. HBase is not eventually consistent. There is a single authoritative server for reads and writes of a lexicographic range of keys (a region), which writes immutable store files and a WAL to HDFS.

Partial availability may be achievable for reads, with a significant amount of effort and latency and a limit on total cluster size, by allowing non-authoritative regionservers to read the HDFS WAL + storefiles, but this really isn't realistic. I've personally been burned by the client retry thing though, and there's not really a better solution when you consider the types of workloads HBase is actually used for and its incredibly variable latency (it aims only for consistency and very high AVERAGE throughput, at the cost of extremely high WORST-CASE latency).

One solution here could be the use of a configurable filesystem queue for clients. This is how you build resilient high-throughput pipelines. HBase is used in some places for OLTP, but only when the read load is very, very low. So the effort to make reads more highly available would be in vain.


No database can "achieve both consistency and availability" during a partition.

Also, if you follow the rest of the twitter conversation you may realize, as they did, that only requests to the minority partition are unsuccessful - which is exactly what you want from CP.


I wonder why the Jepsen test for Cassandra was deleted (or moved) https://aphyr.com/posts/294-call-me-maybe-cassandra/


That link works for me--what's supposed to be missing?


My link works, but many of the links in the article itself no longer exist


From the looks of it, they're now under a branch called 'old'

https://github.com/aphyr/jepsen/tree/old


Is there actually that much difference between MongoDB and e.g. Cassandra? The tone of the Call Me Maybe posts about both is quite different, but the bottom line seems to be the same: use CRDTs for everything or you will lose data.


Actually the broader lesson would be to assume the worst in your application layer and try and remediate/verify wherever possible.

If you look at his articles, Redis, PostgreSQL, Cassandra, Elasticsearch, etc. all had data consistency errors. And none of those have vendors making any claims.

It's pretty sobering to say the least.


Um, this is the postgres article:

https://aphyr.com/posts/282-call-me-maybe-postgres

There were no acknowledged writes lost. The only unacked-but-successful writes resulted from the connection dropping while a commit ack was in flight. That doesn't qualify as a data-consistency error; it means the client has to check whether the data is present after reconnecting.

But in no cases would the client reconnect to find that there were acknowledged-as-committed records that were missing or stale. In no cases would the client find that responded-as-rolled-back data was actually committed. This is very, very different than what is seen with MongoDB.


I don't think the results of Aphyr's MongoDB and postgres experiments are directly comparable. In the OP, MongoDB was run in a 5 node replicated configuration. In the post you reference, the experiment was run against a single postgres node.

Furthermore, the postgres experiment only checked that no writes were lost. As Aphyr acknowledges, MongoDB did not lose any writes with "majority" write concern. The postgres experiment did not include the verification of a linearizable history of reads, which is what the bulk of the OP is about.

I'd like to see a similar experiment run against a replicated postgres configuration with auto-failover.


Now try it again with PostgreSQL's built in sharding or replication functionality... oh, wait.


The parent post correctly pointed out that including Postgres in that list is misleading at least.


I was talking about data consistency more broadly. "However, two writes (215 and 218) succeeded, even though they threw an exception claiming that a failure occurred". This obviously isn't ideal behaviour from the perspective of a developer, but it isn't necessarily incorrect.

My broader point was that you need to assume the worst from your database in the application layer. Which I think you missed.


I read your post as FUD regarding Postgres.

That article has various issues; for example, calling Postgres's commit protocol a special case of two-phase commit is not really correct. Postgres has 2PC: http://www.postgresql.org/docs/9.2/static/sql-prepare-transa... but that was not tested.

The described behavior is "expected" and "understood". Saying "you should assume the worst from your database" is not something I would ever say about a DB with ACID semantics.


I wasn't spreading FUD about anything. From the article, there were issues with every database's expected behaviour. My point, again, was that you should expect and manage failure in your application layer. It's what sensible architecture looks like.

And ACID does NOT guarantee that you will not lose data. It is a theory, not an implementation. I have lost data with both Oracle and Teradata due to bugs.


Well, you said that Postgres had data consistency errors (referencing Aphyr's articles). This is not true (at least regarding that article).

Aphyr's article about Postgres could be renamed call-me-maybe-acid-db-over-the-network and remain the same.


You misunderstand the results. PostgreSQL behaves as expected, and indeed, the only way it can behave. That's the Two Generals' Problem (http://en.wikipedia.org/wiki/Two_Generals%27_Problem). There is no way to solve it. PostgreSQL does as well as theoretically possible.

I understand you lost data with Oracle and Teradata.

1. Most big corporate vendors do not have systems which are very well designed. (1) The sale is made at a business level. (2) Most Oracle customers are not very technical companies, and have mixed-quality employees. As a result, there is little pressure to build a robust, correct product rather than one which meets a feature checklist. In addition, Oracle doesn't really recruit smart people (I know people who work there). It's just not very robust compared to something like PostgreSQL, which was written by Stonebraker, a legendary computer science professor and entrepreneur.

2. Still, more likely, the reason you lost data is because you didn't know what you were doing. Terms like Eventual Consistency, ACID, Two Generals' Problem, etc. are not just abstract. They have strict, formal meanings, and you need to understand what they do and do not guarantee. Otherwise, you will lose data again.


> the reason you lost data is because you didn't know what you were doing

Missed this reply and thought it was funny. I work for one of the world's largest retailers and we are one of both Oracle's and Teradata's most loved customers. We have 4 Teradata DBAs provided BY Teradata amongst a team of 20 SQL Developers. We aren't messing around.

What YOU don't seem to understand is that bugs in your database can cause data loss. ACID or Strong Consistency will not save you.


That's not exactly the sort of team where you'd expect MIT Ph.D.s to work. It's precisely big teams of mediocre people who run into issues, based on not knowing exactly what the database is doing and how it's supposed to work, that lead to data corruption due to misuse. It's almost always a boring problem (e.g. retail business software). It's almost always a clunky commercial "enterprise" solution (e.g. Oracle). It's almost always a big team. I'm not sure if I even need to go into application engineers -- you put your best and brightest into the core product, and solutions work typically gets those that can't quite cut it there.

Regardless, your comment was about PostgreSQL, not Oracle. Oracle is a giant piece of software written by a corporation with over 100,000 employees. Something like that is bound to have bugs, and it has bugs indeed. Data corruption with Oracle certainly happens. PostgreSQL is written by a small, ultra-elite team. It's a much smaller codebase, so an expert developer can understand how the whole system works. There's a big difference in robustness between the two.

Of course database bugs can cause corruption. I've certainly had MongoDB eat my data. But the level of robustness of different databases is very different. There are many databases which are essentially bug-free. If you're losing data with PostgreSQL, odds are you're the one losing the data, not PostgreSQL.


Actually, just use Riak. Those people know distributed databases.

https://aphyr.com/posts/285-call-me-maybe-riak

Now I hear they support consistency as well.


I think at one point the employees of Basho knew how to write distributed DBs - Riak is the most advanced AP DB from a distributed systems theory perspective. However, in recent months, their CEO, CTO and Chief Architect have left, as well as many of their prominent engineers. Worryingly, the new CTO seems content to make inane comments about "Data Gravity" [0].

[0] - http://www.kdnuggets.com/2015/03/interview-dave-mccrory-bash...


I think Riak's theory is fine, but theory isn't enough. And they may have succumbed to the Osborne effect with Riak 2.0.

Here's what I mean. Think of all the nice things you expect to come out of Riak's theoretical basis -- bulletproof distributed writes, for example.

Well, the default last-write-wins writes aren't bulletproof. They clearly fail Jepsen [1]. And if you turned off last-write-wins, then you'd have to handle siblings -- and how you were supposed to handle them without introducing inconsistency was quite unspecified. Riak clients gave you no help there.

Or, as the Jepsen article says, you could use CRDTs. Before 2.0, Riak had exactly one CRDT, the counter.
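For anyone unfamiliar, a toy illustration (Python, not Riak's actual implementation) of why even that one counter CRDT makes sibling merges safe: each replica increments only its own slot, and merging is an element-wise max, so the result is deterministic no matter what order replicas sync in.

    # Toy grow-only counter CRDT sketch; replica ids and values are made up.
    def increment(counter, replica_id, n=1):
        c = dict(counter)
        c[replica_id] = c.get(replica_id, 0) + n
        return c

    def merge(a, b):
        return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

    def value(counter):
        return sum(counter.values())

    # Two siblings that diverged during a partition:
    left = increment({}, "replica-a", 3)
    right = increment({}, "replica-b", 2)
    assert value(merge(left, right)) == 5
    assert merge(left, right) == merge(right, left)   # order-independent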

Something that might be appealing to some is built-in MapReduce. On the forums, people would warn you not to actually use it unless you didn't actually want availability after all. I don't know if they ever sorted that out.

Another supposedly nice thing was Riak Search. A distributed DB with full-text search out of the box -- that sounds great, right? But there were two things called Riak Search, and the first one just plain didn't work. They deprecated it before it had a replacement, but the replacement was supposed to come in 2.0.

So, Osborne effect. When people gradually discovered that Riak 1.x was bad, and the response was "but Riak 2.0 will be great!", that's a great reason not to use 1.x. 2.0 took a very, very long time, enough time for customers and potential customers, including us, to find other solutions.

[1] https://aphyr.com/posts/285-call-me-maybe-riak


Ya, you have to handle siblings if you don't use LWW. If you would rather Riak execute a pre-defined merge strategy, use Riak's CRDT features.

The original Riak Search in 1.x was replaced with integrated Solr in Riak 2.x, and it works.

I'd love to hear more about your use case. Contact info in profile. I'm this username in all the usual suspects.

Disclaimer: I work at Basho.


FWIW I did write a bunch of documentation on sibling resolution in Riak: http://docs.basho.com/riak/latest/dev/using/conflict-resolut...


So it looks like the conflict resolution strategies being recommended are "pick one arbitrarily", or in an advanced section, "keep the longer list"?

Not that it matters to me anymore, but it does sound like having all data in CRDTs is the only way to pass Jepsen. (Or to have your data be immutable, in which case Jepsen doesn't apply.)


Indeed. We're evaluating Riak CS and Swift. In a vacuum, I'd choose Riak CS any day. Knowing all of the drama at Basho, we're in a holding pattern at best and leaning (reluctantly) towards Swift.

Basho needs to sell damn fast or just call it a day and open source the enterprise version.


This is kind of surprising to hear. I had no idea.

Well in that case have you heard about LeoFS? Wonder if it overlaps with any of the Riak CS features for you.

http://leo-project.net/leofs/


The drama has died down. Drama is generally isolated to pitched battles amongst engineers for feature implementations ;)

Contact info in profile, this username at all the usual suspects.

Disclaimer: I work for Basho.


Drama aside, Riak and Swift are not in the same league.


Riak CS is. I said Riak CS, not Riak.


That was a year ago and they seem to be doing fine since.


I had a bad experience with riak v2. I think something went quite wrong in the process when going from v1 to v2, especially around Riak Search.


Riak Search in v1.x and Riak Search in v2.x are completely different. Riak 2.x tightly integrates Solr.

If you wanna talk about it I'm this username at all the usual online places.

Disclaimer: I work for Basho


oooh, that is scary.


We now offer a strong consistency option.

I'm this username in the usual places online if you wanna talk about it.

Disclaimer: I work for Basho


Apache Solr's Jepsen tests have had very good results.


Mongo absolutely nailed creating a database that is easy to get started with, one where even traditionally 'hard' things such as replication are easy. It is still super attractive for me to pick it up for small projects, even after dealing with its (many) pain points in both development and operational settings.

Given this, it is so tragic to see how dismissive they have been about the consistency issues that have plagued the DB since the early days. Whether it was the stupidity of bad driver defaults that didn't confirm writes, or easily corruptible data in the 1.6 days, or now not seriously looking at the results of Jepsen, the MongoDB organization has never taken the issues head-on. It would be so refreshing to see more transparency and admitting to the faults rather than wiggling around them until eventually pushing a fix buried in patch notes.

I often feel like a MongoDB apologist when I admit that I don't mind using Mongo for small (and unimportant) projects, and while the MongoDB hate can be a bit extreme at times, the company's treatment of these sorts of issues may justify some of it.


I'm with you on this. I have a product that is based on MongoDB, and though customers haven't complained about these issues and it's easy to stick your head in the sand when you haven't had any issues, the response by MongoDB is troublesome.

For instance, compare MongoDB's response to elastic's response:

Initial response to "Call me maybe: Elasticsearch":

https://www.elastic.co/blog/resiliency-elasticsearch/

Their ongoing status on resiliency:

http://www.elastic.co/guide/en/elasticsearch/resiliency/curr...

That is how you respond to a negative Jepsen test. It's particularly illuminating since Elasticsearch doesn't actually bill itself as primary storage yet takes resiliency seriously, as opposed to MongoDB, which does consider itself primary storage and does not.


The reason why Mongo was able to "solve" those hard replication problems is that they simply ignored all the hard parts. That's the reason why Mongo is not reliable and most likely never will be. The issues they have are due to fundamental design choices.


After MongoDB published their write speed benchmarks based entirely on unacknowledged writes (e.g. how fast can you write to a socket?), it's been a long downhill ride with an immense amount of inexplicable ignorant support.


If they are making money, why would they care? There's lots of shitty software raking in huge license fees based on misplaced reputation.


Can you post a link to these unacknowledged write benchmarks? I can't find them.


Need to find an archived version but it caused a lot of arguments in 2009/2010: e.g. http://rethinkdb.com/blog/the-benchmark-youre-reading-is-pro... references similar benchmarks

Can also link simply to the HN discussion from back then too: https://news.ycombinator.com/item?id=1496035

> Full disclosure: I work for 10gen.

> We did this to make MongoDB look good in stupid benchmarks.


From my own memory, I don't recall 10gen ever posting misleading benchmarks. There were a bunch of other people who did so, and 10gen did little or nothing to try to shut that publicity down.

That said, I think that the 'how fast can you write to a socket?' default setting was probably intentionally put in place to make benchmarks look good.


Hey, it worked for MySQL...


There's a lot going on here, but the summary is: "What Mongo actually does is allow stale reads: it is possible to execute a WriteConcern=MAJORITY write of a new value, wait for it to return successfully, perform a read with ReadPreference=PRIMARY, and not see the value you just wrote."

https://jira.mongodb.org/browse/SERVER-17975?focusedCommentI...
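A sketch of that access pattern, using pymongo with a made-up database and collection; per the report, even this majority-write / primary-read pairing can return a stale value around a partition or failover:

    # Sketch of the pattern quoted above (names are hypothetical).
    from pymongo import MongoClient, ReadPreference
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://node1,node2,node3/?replicaSet=rs0")
    accounts = client.test.get_collection(
        "accounts",
        write_concern=WriteConcern(w="majority"),   # WriteConcern=MAJORITY
        read_preference=ReadPreference.PRIMARY,     # ReadPreference=PRIMARY
    )

    accounts.update_one({"_id": 1}, {"$set": {"balance": 100}}, upsert=True)
    doc = accounts.find_one({"_id": 1})
    # Around a partition/failover, doc["balance"] may still be the old value.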


I'm so glad to see the Jepsen series reinstated. Thank you so much, Stripe.


Question: How do I actually run Kyle's tests to see this for myself? (Not that I don't believe him, I just want to play around a bit.)

When I run `lein install` and then `lein test`, I get:

    ╰─▶ ψ lein test
    Exception in thread "main" java.io.FileNotFoundException:
    Could not locate jepsen/db__init.class or jepsen/db.clj on classpath: ,
    compiling:(mongodb/core.clj:1:1)
        at clojure.lang.Compiler.load(Compiler.java:7142)
        at clojure.lang.RT.loadResourceScript(RT.java:370)
        at clojure.lang.RT.loadResourceScript(RT.java:361)


Update the version of the "jepsen" dependency in project.clj to 0.0.3. You'll need to do this for each of the test projects you want to run.

FWIW, this was filed the other day: https://github.com/aphyr/jepsen/issues/52


Ah, that explains it! Thanks.


Can't answer your question, but I'm curious how you managed to include an image in your comment. I didn't think embedded HTML was possible?


That's from my command prompt that I wrote -- the arrows are just Unicode characters. You can see it here if you like:

https://github.com/fj/dotfiles/blob/master/home/.config/shel...

Edit: On further reading, I think you're under the impression that the paste of my console is an image. It is just text. To make fixed-width text on HN, indent each line by four spaces, like this:

    This text has four spaces at the beginning of its line.


What image? The arrow and psi are characters.


People really underestimate the value of Occasional Consistency. Occasionally Consistent databases, like MongoDB, are great for approximation algorithms, sublinear time algorithms, and similar applications.


The issue isn't that MongoDB is eventually consistent, it's that the documentation claims that in some cases it's strictly consistent[1] while Kyle found that:

"MongoDB, even at the strongest consistency levels, allows reads to see old values of documents or even values that never should have been written."

1. http://docs.mongodb.org/manual/reference/glossary/#term-stri...


I didn't say Eventual Consistency. I said Occasional Consistency. MongoDB has hard Occasional Consistency. Indeed, it is the most occasionally consistent database I know of. I once wrote a few million records into Mongo. It was consistent before the write, but never again after.

Great for sub-linear time algorithms! At that point, all my algorithms ran at less than O(n) on the size of the data I had written in.

From a business perspective, Occasional Consistency is also a very nice property if you are storing audit data for certain types of organizations. It gives complete plausible deniability about rule compliance.


Heh, Poe's Law, etc :)

I guess you could also call it Quantum Consistency.


Occasional Consistency sounds like a made-up term.

I can't even google it without it auto-correcting to Eventual Consistency.

The wiki article, https://en.wikipedia.org/wiki/Consistency_model, doesn't even have such a term.

Doing an exact-phrase search on Google reveals nothing.


It was sarcasm...


Since Postgres added a JSON type and Docker made running it simple in development, I haven't had a need for anything else. Call me old school, but I prefer starting with a relational database and changing when it's no longer appropriate.


Old school? It's the best thing to do, in my opinion. An ACID relational database that can do even more than that! I think it's one of the best DBs for startups.


So what should users of MongoDB do? I'm asking because it is the main database used in Meteor and I'm very interested in Meteor.

Should the general advice just be "store in MongoDB everything that doesn't require consistency and use Postgresql for everything else"?


The general advice should be: use PostgreSQL if you are uncertain what to use. Watch some YouTube videos with Michael Stonebraker (2014 Turing Award winner) and start getting disillusioned with the NoSQL hype.

Then, try to understand the mess Edgar Codd tried to fix in the '60s and '70s.


Do you have any specific videos we should watch?


In the top hit on YouTube, Michael starts discussing the solution space at 31m15s (https://www.youtube.com/watch?feature=player_detailpage&v=OY...)


You can write a DDP backend which is backed by Postgres. In such a case it ought to feel like Mongo is just serving materialized views of the genuine, consistent data. If you treat the Minimongo data that way—just consistent enough to show an image once—and verify all the writes on the server then you ought to be able to get by.


I think Meteor is great, except for that one thing. I won't be touching it again until there is full support for one of the SQL technologies.


Have a look at ToroDB (https://github.com/torodb/torodb). It's an open-source, MongoDB-compatible database that uses PostgreSQL to store data, in a relational way (i.e., no jsonb, no blobs). It's still under heavy development, but worth a look. (ToroDB dev here.)


Simply put, make sure your app can handle inconsistent data.


I still don't get it. MongoDB can't possibly call itself a database. I can understand MongoScratchStorage or MongoProbabilisticDataEngine, but not MongoDB.


MongoWeakReference


Another instance of Kyle's amazing research! You may want to catch him on stage with other great minds at dotScale on June 8: http://dotscale.io


This article is too technically advanced for me. As a casual MongoDB user, how do these problems affect me?


In replicated Mongo scenarios higher write volume increases the probability of inconsistent reads. What this means is that there's a chance that on some data—no matter how safely you attempt to write it—you'll end up in a totally inconsistent state for your system.

The actual impact of an inconsistent state is very hard to judge. It could be as minimal as having two different users just see something weird on their screen for a moment and then it goes away. It could even be totally avoided if your application handles inconsistent data well.

At the same time, it could also cause complete and nearly untraceable corruption of all data in your system. Who knows?

It'd be a bit like building a bridge using metal with a known defect. It'll probably work fine for a long time and depending on how and where that metal was used you might be alright.

Or you might have a complete structural integrity failure at any moment once stress starts ramping up and you'll just have to blame it wholesale on using bad materials.


In very rare cases, you could see confirmed writes being rolled back, or reads returning data from before a confirmed write.

How these problems will actually affect you depends on your application. It could, for example, allow two users to be created with the same email address.


At several points real world scenarios are described, so just skim ahead to those.

Users could end up seeing private information in each others' accounts, for example.


I seem to remember from a FoundationDB talk that they first spent two years building a simulation environment to control everything from network to persistence for testing scenarios.

Does anyone know of any open-source project that aims at doing the same, so that future NoSQL DBs can finally be built on strong foundations?


I knew something was funny with Mongo when all the API calls defaulted to writes not being guaranteed to sync to disk. Maybe for a use case like aggregate statistics gathering it would be OK to risk missing a few updates in a crash for the sake of speed, but to make that the default??


Would you want to miss the most important post-recovery data: when your system is under duress?


You know this default was changed in November 2012.

Do you think it's still relevant to be bringing this up?


Actually, the defaults are still unsafe, just the marketing language has changed. Take the Node.js driver for example, it defaults to w=null and j=false. http://mongodb.github.io/node-mongodb-native/2.0/api/Db.html
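For comparison, here's what opting into acknowledged, journaled writes explicitly looks like, shown with pymongo purely for illustration (the Node driver exposes the same w/j options); hosts and collection names are made up:

    # Hedged sketch: request majority acknowledgement and journaling explicitly
    # rather than relying on driver defaults.
    from pymongo import MongoClient

    client = MongoClient(
        "mongodb://node1,node2,node3/?replicaSet=rs0",
        w="majority",   # wait for a majority of replica set members
        j=True,         # and for the write to reach the journal
    )
    client.test.events.insert_one({"type": "signup", "user": "alice"})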


I think it would be great to see one of these done for RethinkDB :)


RethinkDB engineer here. RethinkDB currently doesn't support automatic failover, so this test couldn't be performed for RethinkDB yet. But when we implement automatic failover we're planning to test it against Jepsen. That will probably be sometime in the next few months.


DB engineers living in mortal fear of Kyle is where we want to be.


I must admit, I always feel like I am missing something in these discussions. Like I didn't get some memo... I just don't expect a DB like MongoDB to guarantee consistency. The whole story around NoSQL and the likes was to enable crazy horizontal scaling needed for the web. Phrases like "eventual consistency" flew around. It seems so logical - you lose consistency, gain scalability.

But somehow, people simply started using them everywhere? Assuming that these DBs are just like any other? And now, we're all bashing on MongoDB because it is - not consistent? What happened here? :)

NB that I do not wish to attack the OP - if MongoDB now claims to be consistent in any way, that deserves scrutiny. And these analyses are always a really interesting read. But the general tone in the developer community about MongoDB seems a bit irrational.


> It seems so logical - you lose consistency, gain scalability. ... And now, we're all bashing on MongoDB because it is - not consistent? What happened here? :)

There are ways to do "eventual consistency" responsibly. Mind you, it's obnoxiously tricky to do it right, even when someone has provided an underlying implementation that works exactly as promised. But if you design your data access patterns in the right way, the system can provide guarantees so that even if it doesn't have all your data at the moment, you can still ask questions about the the state of the data that is available, and get meaningful responses back that conform to a certain set of guarantees.

What happened here -- why we make fun of MongoDB -- is that it doesn't provide many promises like that, and even when it does, its implementation does a very, very bad job of delivering them ... and it doesn't even do a good job of delivering scalability. (It's basically a mmap()'d series of b-trees of BSON documents, so as soon as you run out of RAM, you're at risk of having the kernel swap out all your indices instead of your data, whereupon performance craters. Oh, and the much-mocked global write lock has finally been replaced with a per-database write lock in recent versions.)

In short, you sacrifice everything and gain... a modestly convenient API for document-storage, maybe.


"Eventual consistency" has a very particular meaning (when it's not being used as a buzzword). "Read uncommitted" doesn't even come close to the sorts of guarantees that people expect from an AP database. More importantly, MongoDB doesn't advertise itself as an AP database, it advertises itself as something you can use as the primary datastore for important information. Kyle has analyzed AP databases like Cassandra and Riak as well, and evaluates them according to their claims.


> if MongoDB now claims to be consistent in any way, that deserves scrutiny. And these analyses are always a really interesting read. But the general tone in the developer community about MongoDB seems a bit irrational.

It has been sold that way. People believe with write-concern majority, they can use mongo replicas for primary data stores and not lose anything. Embed your relational data and use it for everything.


So, now I'm wondering: why is Stripe using Mongo at all? Maybe they are planning to migrate to another DBMS?


Does anyone have any references on how you could write a distributed database that met all ACID properties? Surely there's an academic paper that says that if you do A then B then C, you are guaranteed a certain level of consistency.

We've developed a type of distributed database at my company, and I think it's pretty solid, but I need a broader familiarity with the available theory.


Aside from reading papers, it's a good idea to look through the syllabus of a distributed systems course to get a broad idea of what the problem space looks like.

Academic papers will talk about "minimal" problems like consensus, or desirable properties like sequential consistency, and expect you to already know why those concepts are important. If your experience is mostly hands-on, it may not be obvious how it all applies to real-world systems.

Say you have a complex distributed database. Forget all the bells and whistles: can it solve the problem of allowing a set of processes to reliably agree on a single Boolean value? If so, then you're trying to provide the same consistency guarantees as Paxos/Raft. So if your architecture is substantially simpler than Raft, then either you've come up with something really ingenious, or you've missed some edge cases.
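To make "agree on a single Boolean value" concrete, here is a tiny checker (hypothetical names, not a consensus implementation) for the two safety properties such a system has to uphold; the hard part, which Paxos/Raft exist to solve, is producing decisions that satisfy it despite partitions and crashes:

    # Toy checker: agreement (no two nodes decide different values) and
    # validity (any decided value was actually proposed by some node).
    def check_consensus(proposals, decisions):
        """proposals: {node: bool}; decisions: {node: bool or None if undecided}."""
        decided = [v for v in decisions.values() if v is not None]
        assert len(set(decided)) <= 1, "agreement violated"
        assert all(v in proposals.values() for v in decided), "validity violated"

    check_consensus({"n1": True, "n2": False, "n3": True},
                    {"n1": True, "n2": True, "n3": None})   # passes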


Or you are obviously violating proven theory...


Consensus and atomic broadcast. Paxos, raft, zab. Formal methods.


For the record calling something a "distributed database" is not nearly enough. What part of it is distributed? On what operations do you want to provide the ACID guarantees? What do you promise in the face of partition?

I would be very skeptical of any database that was written by someone who didn't have a sound foundational understanding of distributed systems theory. This is quite simply one of the places in software engineering where subtle differences in promise, protocol and expectation can make a huge difference.


As well you should be. Fortunately, we're relying on some third-party libraries that are built on a solid theoretical foundation, and what we promise is limited. Our goal is to improve over time.


Consensus is the main hurdle - if you have multiple nodes that can be read from, then any values that are successfully written to the system must guarantee that those same values can be read sequentially.

Issues arise when network partitions interrupt communication between nodes; even if you require all nodes to send acks when writing, how do you deal with those acks not being received?


Would the use of WiredTiger as a storage engine affect these results?


My guess is no. This is more about the behavior of the database as a distributed system and not just a storage engine.


Hmm, I can't think of any reason why it would make a difference in this case.


Definitely not


Apache Solr has done very well at Jepsen tests.


Maybe we should listen to Larry Ellison when he says "gimme my money!"


Has Oracle been run through this test battery?

Or would publishing the results of doing so bring down an army of Larry's lawyering henchmen? "You violated the EULA, now you must pay! You will only wish you were dead when we are done with you, bwahahhahahaha!"


I guess the first rule of "Benchmark Club" is that we don't talk about Benchmark Club?

Now off to deface a piece of corporate art... :-)


upvoted for the Look Around You link alone.


Shame this wasn't done with the latest version, 3.0. Although given that improvements are scheduled for 3.1, I would imagine it might still be an issue.

Nice writeup either way though. Would like to see a similar article for Couch* and MySQL.





