NoSQL Benchmark Compares PostgreSQL, MongoDB, Neo4j, OrientDB and ArangoDB (arangodb.com)
70 points by sachalep on Oct 13, 2015 | hide | past | favorite | 53 comments

I really, really distrust these kinds of evaluations when they come from someone whose product is included in the comparison. Even if everything is above-board, they're not going to publish if it shows their product just completely sucks at it. That kind of publication bias makes these kinds of results a lot less trustworthy than independent benchmarks even if you assume the best of intentions from the people putting them out there.

Ingo from ArangoDB here. I agree that vendor tests are always biased; of course you want to show that your product is competitive.

But since there is no independent institution that has compared our product, and we wanted to know where we stand with ArangoDB, Claudius ran his own tests. And since the work was already done, why not share it?

We tried our best to make it as open as possible. PostgreSQL performed very well, and we have a problem with memory consumption - have a look at the charts; we will try to improve there.

- Every database configuration is public

- All test scripts are available on Github

- We publish updates if we get pull-requests or comments with suggestions for improvements

We did that before, and after the last test some database vendors sent us improved snapshots of their databases, which found their way into the latest products (OrientDB and Neo4j).

If you have suggestions for improvements, please let us know.

> PostgreSQL performed very well

Despite the fact that you crippled it by not using jsonb columns.

If they weren't going jsonb and the gin jsonb-path-opts index, they shouldn't have even included postgres.
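(For anyone following along: the suggestion is roughly a `jsonb` column plus `CREATE INDEX ... USING gin (doc jsonb_path_ops)`. Since Postgres itself can't run in a comment, here's a loose, runnable stand-in using SQLite's built-in JSON functions to show the "index the extracted field" idea; the table and field names are made up for illustration.)

```python
# Sketch of "index the JSON field you actually query on".
# SQLite's json_extract + an expression index stands in for
# Postgres's jsonb + GIN here; names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (doc TEXT)")  # JSON stored as text
conn.executemany(
    "INSERT INTO profiles VALUES (?)",
    [('{"name": "alice", "age": 30}',),
     ('{"name": "bob", "age": 25}',)],
)

# Expression index over an extracted field, so lookups on that field
# no longer have to scan and re-parse every document.
conn.execute(
    "CREATE INDEX idx_name ON profiles (json_extract(doc, '$.name'))"
)

row = conn.execute(
    "SELECT json_extract(doc, '$.age') FROM profiles "
    "WHERE json_extract(doc, '$.name') = ?",
    ("alice",),
).fetchone()
print(row[0])  # 30
```

The Postgres GIN approach indexes the whole document at once; the expression-index pattern above is the per-field variant, which both systems support.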

Why so hostile? Personally, I assume this is just an instance of them not being PostgreSQL experts. So, before jumping to conclusions (namely that they deliberately skewed their test, despite being very open in the way they described the setup), I'll instead wait and see -- perhaps they'll explain why they used json instead of jsonb, or perhaps (better!) they can update the test to also include jsonb.

I think his hostility comes from the fact that if you google just "postgresql json" the second hit is the official documentation that goes over jsonb right in the second paragraph.


You can also see that the very first link outside of the official documentation explains the benefits of jsonb and how to use indexes.

If you're going to include a database, at the very least do 20 minutes of research on it. If that can't be given, then just don't include something.

That's like saying "I've just installed MongoDB and read the intro documentation page, I'm now going to benchmark it against a MySQL cluster, which I have years of experience with, and that I helped develop." (E.g.: they developed ArangoDB, so they should be experts in at least that, right?)

> I assume this is just an instance of them not being PostgreSQL experts.

This illuminates the need for "experts" in the different components of the stack. It's intellectually dishonest to claim that a technology is not up to the task when it hasn't been properly treated in the first place.

If they're not Postgres experts then they shouldn't be including Postgres in the line-up. Or they should ask someone who knows better.

That's why they welcome pull-requests.

> Why so hostile? Personally, I assume this is just an instance of them not being PostgreSQL experts.

Postgres' official documentation on json and jsonb is rather concise: http://www.postgresql.org/docs/9.4/static/datatype-json.html

The difference between the types is described in paragraph two. So, they either benchmarked Postgres while having no idea whatsoever what they were doing, or they were deliberately crippling the competition.

Neither option is confidence inspiring.

(And the first option seems sketchy, seeing how they then went and re-created the whole benchmark as classical RDBMS setup for the second postgres test.)

They wanted to play with the other kids, but no one invited them, so they threw their own party despite not knowing how to throw one.

Biggest problem of (small) German tech companies IMO.

I worked for one myself, and they had really good core technology, but that was it. The Anglo-Saxon companies always outplayed them because they're just so much better at PR. Even their developers are better at this than most marketeers who burn money on a daily basis in Germany...

The particular benchmark seems bullshit, too. For postgres they seem to intentionally use the less performant json column type instead of jsonb.

Not that I can verify it, because the code in the linked public "No magic, no tricks – check the code and make your own tests!" repository doesn't match the published results and doesn't even work at all with postgres…

EDIT: Okay, they pushed a new version containing the Postgres data now. They ARE using the cripplingly slow json columns, not jsonb columns recommended by the documentation.

And despite that Postgres destroys all the other solutions at everything other than the non-sync (lol) single write case and graph traversal.

If anything it just proves even after almost a decade of these "NoSQL" solutions being around they still can't compete even on basic queries with Postgres which is a fairly conservative SQL solution.

I think this is a classic case of "use the right tool for the right purpose". If you want to do "classic database" stuff, use a classic SQL database. But if, e.g., graph traversal is a crucial operation for your application, then looking at a NoSQL solution seems quite interesting.

In other words: I wouldn't call a screwdriver a bad tool just because it's not as good at driving nails into wood as a hammer.

I think the issue is how many NoSQL solutions over recent years have billed themselves as the default start-with-us data store, and SQL as the more complicated niche product...when it's really the other way around.

Just to be a pedant, the JSONB format as I understand is marginally slower at inserts and orders of magnitude faster at everything else...

It's fractionally slower, true, because of the serialization hit (string to binary). The real juice comes from the GIN index - and if you apply it to specific columns instead of a complete document, you have a rocket ship on read.

> It's fractionally slower, true, because of the serialization hit (string to binary).

And with JSON columns you have to serialize on accesses, which is a lot slower in the read-mostly tests.

Stupid question: isn't that the typical result of indexing? Slower writes for (much) faster lookups?
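Yes, that's the usual deal: every write pays a little extra to keep the index current, and reads get the payoff. A minimal sketch with SQLite (stdlib, runs anywhere; the table and index names are invented):

```python
# Demonstrate the classic index trade-off: writes also maintain the
# index; reads can use an index SEARCH instead of a full table SCAN.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plain (k INTEGER, v TEXT)")
conn.execute("CREATE TABLE indexed (k INTEGER, v TEXT)")
conn.execute("CREATE INDEX idx_k ON indexed (k)")

rows = [(i, f"row{i}") for i in range(1000)]
# Each insert into `indexed` also has to update idx_k -- that's the
# write-side cost of having the index at all.
conn.executemany("INSERT INTO plain VALUES (?, ?)", rows)
conn.executemany("INSERT INTO indexed VALUES (?, ?)", rows)

def plan(sql):
    """Return SQLite's query plan description as one string."""
    return " ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Read-side payoff: the planner searches the index instead of
# scanning the whole table.
print(plan("SELECT v FROM plain WHERE k = 500"))    # SCAN ...
print(plan("SELECT v FROM indexed WHERE k = 500"))  # SEARCH ... USING INDEX idx_k ...
```

The same trade-off applies to jsonb's GIN index in Postgres, just at document scale.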

This is more constructive than flooding the communication channels with hidden facts, like Neo4j does.

WTF with VCs investing in all this crap? There are very few scenarios where you would be better off with a NoSQL solution, and there are established players serving those niches already.

I'm Claudius, author of the tests. I've been asked to include a lot of different databases in the test runs. The most requested databases were Postgres/JSON and RethinkDB. I started with Postgres. The Postgres manual states that JSONB might be faster, but some StackOverflow answers indicate that it takes more space than JSON, while JSON might be slightly more compatible with legacy code. I've shown the queries and setup to some local Postgres users. They did not point out that JSONB would be much faster for the kinds of requests used in the test setup. For instance, we do not use special indexes apart from the primary one, by choice.

I wanted to move on to RethinkDB next, but I see your point that a comparison between the different JSON formats of Postgres can also be very enlightening. This should replace guessing with hard facts. As always, I will update the blog post and add these tests as well - as we did in the past, see https://www.arangodb.com/nosql-performance-blog-series/.

If you have any improvements concerning the configuration of Postgres or the SQL queries, I will be more than happy to include them in the update as well. I will also push the configuration used to GitHub.

Please refer to the #postgresql channel on irc.freenode.net for any postgres inquiries, you will receive an answer from experts and core developers on the correct processes within minutes for almost any question. It is a very active channel full of knowledgeable folks.

> For instance, we do not use special indexes apart from the primary one by choice.

For instance, we didn't use the index that makes the database go fast to make our own database look good.

I am just going to say: have a try with the LDBC social benchmark http://ldbcouncil.org/ and http://ldbcouncil.org/benchmarks. Where you can even have audited results.

These are also graph database benchmarks that are synthetic, designed to look like real data and are quite hard to do well on.

As someone responsible for a public, free-to-use deployment of a graph database with more than 2 billion nodes and 15 billion edges (sparql.uniprot.org), I must say this looks like a SPARQL benchmark from 10 years ago.

I wonder why there's not the equivalent to the Frameworks Benchmark[1] for databases. It seems we could all really benefit from that. Ideally it would get to a place where they would be able to simulate real-world worst case scenarios and test for problems. Each database would likely want multiple entries with different configs, but if you have some engineered failure scenarios and tests in the results it becomes obvious what the trade-off is. Sure, a specific setting may reduce consistency in the event of a failure for speed, but sometimes that's what you might want, and if the failure cases clearly show the problem, at least you aren't going in blind.

1: https://www.techempower.com/benchmarks/

Having benchmarks for the different storage models (Relational/Document/Graph/Object/XML) would be a better solution.

Clicking the link got me "Error establishing a database connection." :/

I was kind of shocked how well PostgreSQL did.

I still think PostgreSQL and MariaDB are better tools for most jobs considered "big data".

Postgres was actually somewhat crippled in these tests, since they used json rather than jsonb for storage; jsonb stores the JSON in a binary format that doesn't need to be parsed on reads.

That's not quite correct. jsonb requires that reads convert the binary form back into textual JSON, whereas the json type can be sent directly to the client with no processing.

jsonb is superior when:

1. You want to use any of the built-in JSON functions, e.g. for extracting fields from the document.

2. You want to index the JSON (either the entire thing via GIN, or individual fields via ordinary B-tree indexes).

3. You want to save space; jsonb strips whitespace.

jsonb incurs an overhead on both reads and writes since it must serialize to/from textual JSON.
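The trade-off in that list can be illustrated outside of Postgres. Very roughly, a json column is like keeping the raw string and re-parsing it on every field access, while jsonb is like paying the parse cost once at write time. A loose analogy in Python (not Postgres internals; the document and helpers are invented):

```python
# Loose analogy for json vs jsonb storage in terms of when parsing
# happens. Not how Postgres is implemented internally.
import json

raw = '{"name": "alice", "age": 30}'

# json-column style: keep the original text verbatim; every field
# access has to re-parse the whole document.
def get_field_json_style(text, key):
    return json.loads(text)[key]

# jsonb-style: parse once at write time into a structured form;
# reads are cheap, but sending it back out as text means
# re-serializing (and original whitespace/key order is gone).
stored = json.loads(raw)            # write-time cost, paid once
def get_field_jsonb_style(doc, key):
    return doc[key]                 # no parsing on read

print(get_field_json_style(raw, "age"))     # 30
print(get_field_jsonb_style(stored, "age")) # 30
print(json.dumps(stored))  # re-serialized; formatting may differ from raw
```

This is why json wins the "ship the document straight back to the client" case, and jsonb wins anything that looks inside the document.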

This is not a cluster test. NoSQL databases are generally optimized for scaling horizontally on commodity hardware; that's trickier with an RDBMS.

Most people I have spoken to are using NoSQL on single nodes.

I think a lot of people jump on NoSQL solutions when scale out isn't needed because of the permeating "do everything in the application / client" mentality these days.

Looking around, it seems that different graph engines pull ahead depending on the use case.


> Error establishing a database connection

Like others mention here, I'm skeptical of these types of comparisons. If I compare myself to my competitors, I won't publish results if they're better than me.

I tried ArangoDB about a year ago, I think I still have the branch that I tried it on. After spending a weekend porting some stuff from MongoDB to Arango, I ended up regretting doing that by Sunday evening. It'd be nice to fire things up, update the branch's code and see how it performs.

No RethinkDB?

Comparison of X1, X2, ... , Xn, Y, written by Y

=> suspicion

Hugged to death.


No, running on XXXXX Cloud. :(

We are currently looking into it. Thanks for the mirrored page.

and now a 10-node cluster

Would love to see the results for CouchDB in comparison to these.

Would like to see Titan with a Cassandra backend here.

Out of interest, which version of Titan are you on? I see that 1.0 was released recently, with little apparent fanfare.

or with http://www.scylladb.com/ backend. "ScyllaDB: world's fastest NoSQL column store database; Fully compatible with Apache Cassandra at 10x the throughput and jaw dropping low latency"

Why not include redis or rethinkdb?

redis and rethinkdb are not ACID across documents. So it's not the same use case at all.

Are you implying that MongoDB and friends are ACID across documents?

The graph dataset is too small in size. It makes little sense for real-world usage.

Ingo from ArangoDB: Still, it's the whole dataset of a real-world use case. :)


But of course, you need to test and decide on basis of your individual requirements and use cases.

Ingo - SNAP has a bunch of other "real-world use case" graphs available for free, many of which are larger than this 1M-node, 30M-edge toy.

I've done a bunch of related benchmarking, and the smallest real-world dataset I've used is the largest one on SNAP: orkut.

I have looked at ArangoDB and really hope it takes off; it has some pretty nifty features. Just at this point, the lack of integration with frameworks like Meteor.js is holding me back.
