
Performance Comparison Between ArangoDB, MongoDB, Neo4j and OrientDB
https://www.arangodb.com/2015/06/performance-comparison-between-arangodb-mongodb-neo4j-and-orientdb/
======
ThePhysicist
I did a lot of research on graph database technologies recently and read a lot
of these "let's compare X to Y" articles. What I found is that most benchmarks,
especially those done by people affiliated with a given product, tend to paint
a distorted and sometimes plain wrong picture.

For example, concerning the performance and scalability of graph databases,
the main argument of proponents of this technology is the "join bomb"
argument, which states that you can't efficiently store a graph in a
relational database since it will require O(log n) time to look up neighboring
nodes from the index when crawling the graph. However, this is of course only
true for B-tree indexes; hash-based indexing would give you basically the same
O(1) performance on a graph implemented in a relational database.
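
To sketch the data-structure point in plain Python (a toy illustration, not
any particular database; in PostgreSQL the relational analogue would be
`CREATE INDEX ... USING hash` on the edge table's source column):

```python
from bisect import bisect_left
from collections import defaultdict

# Edge table for a graph stored relationally: rows of (src, dst).
edges = [(1, 2), (1, 3), (2, 3), (3, 1)]

# "Hash index": src -> list of neighbors, average O(1) per lookup.
hash_index = defaultdict(list)
for src, dst in edges:
    hash_index[src].append(dst)

# "B-tree index": edges kept sorted by src, binary search is O(log n).
btree_index = sorted(edges)
srcs = [src for src, _ in btree_index]

def neighbors_hash(node):
    return hash_index[node]        # O(1) average

def neighbors_btree(node):
    i = bisect_left(srcs, node)    # O(log n) binary search
    out = []
    while i < len(btree_index) and btree_index[i][0] == node:
        out.append(btree_index[i][1])
        i += 1
    return out

assert neighbors_hash(1) == neighbors_btree(1) == [2, 3]
```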

Additional features like documents and deep indexes are nice of course but can
be (and often are) implemented using relational databases as well, so in the
end there really isn't such a large advantage to be gained from using a graph
database, especially when taking into account the immaturity of many solutions
in that space.
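
For instance, here is a minimal sketch of documents plus a "deep" index on
top of a relational engine, using SQLite and assuming its JSON1 functions are
available (PostgreSQL's jsonb with expression or GIN indexes is the more
common production route):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")

# Store a nested "document" in an ordinary relational column.
doc = {"name": "alice", "address": {"city": "Cologne"}}
conn.execute("INSERT INTO docs (body) VALUES (?)", (json.dumps(doc),))

# A "deep index": an expression index on a path inside the document.
conn.execute(
    "CREATE INDEX idx_city ON docs (json_extract(body, '$.address.city'))"
)

# Query by the nested attribute; the planner can use idx_city.
row = conn.execute(
    "SELECT id FROM docs WHERE json_extract(body, '$.address.city') = ?",
    ("Cologne",),
).fetchone()
print(row)  # (1,)
```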

~~~
mcphilip
> Additional features like documents and deep indexes are nice of course but
can be (and often are) implemented using relational databases as well, so in
the end there really isn't such a large advantage to be gained from using a
graph database, especially when taking into account the immaturity of many
solutions in that space.

I've worked with graph data stored in an RDBMS in the medical informatics
space. As you say, there are ways to correctly handle complex graph data in an
RDBMS.

I've also used Neo4j as the backend for a Wall Street analytics app that's in
production. Could it have been done in an RDBMS? Sure, but the ad hoc queries
that needed to be run against the data were much easier to express as graph
traversals than as SQL.
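
To make the contrast concrete, here is a hypothetical friend-of-friend query
(table and relationship names made up, SQLite standing in for the RDBMS): the
bounded traversal is a one-line pattern in Cypher but needs a recursive CTE in
SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE knows (src INTEGER, dst INTEGER);
    INSERT INTO knows VALUES (1, 2), (2, 3), (3, 4);
""")

# In Cypher this is roughly:  MATCH (a)-[:KNOWS*1..3]->(b)
#                             WHERE a.id = 1 RETURN DISTINCT b.id
# In SQL it takes a recursive common table expression:
rows = conn.execute("""
    WITH RECURSIVE reachable(id, depth) AS (
        SELECT dst, 1 FROM knows WHERE src = 1
        UNION
        SELECT k.dst, r.depth + 1
        FROM knows k JOIN reachable r ON k.src = r.id
        WHERE r.depth < 3
    )
    SELECT DISTINCT id FROM reachable
""").fetchall()
print(rows)  # [(2,), (3,), (4,)] (order may vary)
```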

There are some obvious downsides to using a graph database, mainly that it's
practically impossible to find programmers with non-trivial production
experience, but it's been a great fit at the two startups where I used it,
since I got to implement it from the ground up and didn't need a large team.

That being said, database pragmatism is the main lesson to be learned here.
Use the right tool(s) for the right jobs.

~~~
neunhoef
(Disclaimer: Max from ArangoDB here) I am all for database pragmatism.
Fortunately, the choice between a graph database and a non-graph database is
no longer binary. We at ArangoDB are convinced that graphs have their merits
in data modelling (namely when you need "graphy" queries), but you do not want
to be locked into the graph data model. Therefore we argue for multi-model
databases, which can give you graphs but do not force you to use graphs for
everything. By "graphy" I mean queries that involve paths in a graph whose
length is not known a priori (e.g. a shortest path).
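
As a tiny illustration of what "length not known a priori" means, here is a
plain-Python breadth-first search standing in for what a graph engine does
natively; the traversal cannot be written as a fixed number of joins because
the depth of the answer is only discovered while searching:

```python
from collections import deque

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}

def shortest_path(start, goal):
    # Breadth-first search: the path length is unknown until found.
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path("A", "E"))  # ['A', 'B', 'D', 'E']
```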

------
NathanKP
Rant warning:

After working with Neo4j for about six months, I would definitely NOT
recommend it to anyone. In my experience it has been the least reliable and
most buggy database solution I've ever worked with.

From the engineering point of view it has a host of core issues that make it
really hard to write code against, such as constant deadlock exceptions. These
are caused by the fact that the database is largely incapable of handling two
simultaneous upserts that touch the same node. Where most mature, decent DBs
handle this completely transparently, Neo4j just panics and returns an error.
This means writing Neo4j queries ends up requiring tons of boilerplate to wait
for exclusive locks on nodes and/or retry upserts until they succeed.
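
The boilerplate looks roughly like this; note this is a generic sketch, with
the exception class and the run_upsert call standing in for whatever your
driver actually raises and does, not Neo4j's real API:

```python
import random
import time

class TransientDeadlockError(Exception):
    """Stand-in for the deadlock exception a driver might raise."""

def with_retries(operation, max_attempts=5):
    # Retry with exponential backoff plus jitter until the upsert
    # stops deadlocking or we give up.
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientDeadlockError:
            if attempt == max_attempts - 1:
                raise
            time.sleep((2 ** attempt) * 0.05 + random.random() * 0.05)

# Usage: wrap every write that may touch a contended node, e.g.
# with_retries(lambda: run_upsert(session, node_id, properties))
```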

From the devops perspective, managing a cluster is also extremely painful, as
there are frequent issues with replicas falling behind on syncing due to the
server's pathetically slow write performance, even with SSD storage volumes.
We tried everything, but the bottleneck was the server processes themselves,
not the storage volumes, network connection, CPU, etc. We threw some really
nice hardware at our Neo4j cluster, but it still struggled to keep up with
write loads in the range of 500-2000 writes per minute.

The final straw was their latest version, 2.2, which they were advertising as
a massive improvement in speed and reliability. When we upgraded, it turned
out to be the exact opposite. A few of our queries got faster, but overall
most of them got an order of magnitude slower. Their support basically told us
that we'd need to rewrite many of our queries or manually set a flag to use
their older query engine (and therefore miss out on the speed of the new one).
Needless to say, we decided that if we needed to rewrite queries, we were
going to rewrite them for a different storage engine entirely.

In my experience, Neo4j was little more than a six-month waste of time and dev
resources.

~~~
don71
Hi,

I'm Claudius, the author of the blog post. The intent of the blog was not to
show that a particular product is not performing well. There are thousands of
different use cases, and each database has its strengths and weaknesses. For a
different scenario the results might be different. Neo4j is a solid product
and is doing a good job. The aim of the blog was to show that multi-model can
compete with specialized solutions: a multi-model approach per se does not
carry a performance penalty.

~~~
rspeer
> Neo4j is a solid product

Do you really believe this, in your expert opinion, or are you just trying not
to step on toes?

~~~
don71
We are a multi-model database, which is not in a strict sense a competitor but
is competing with Neo4J in some areas. Therefore I'm definitely not a Neo4J
expert. However, I'm now working in the field for over 15 years, developing
in-memory solutions, databases and application servers. Developing ArangoDB
for almost three years and I have talked to a lot of people in that area and
to people who are using Neo4J. There are always obstacle when moving to new
products. But most of the people who I met are quite happy with Neo4J.

------
mangeletti
IMHO, the community needs a set of specific tasks that can be achieved with
all databases (just like
[http://benchmarksgame.alioth.debian.org/](http://benchmarksgame.alioth.debian.org/)
has a series of algorithms for testing different memory/CPU strengths of
languages). Then, proponents of each database (e.g., their sponsors,
evangelists) can create code and config for running the tests on their
database. This could all be open source, and the tests could all be run on the
same host (or hosts) for comparison.

This seems to make sense, and is more akin to what
[https://www.techempower.com/benchmarks/](https://www.techempower.com/benchmarks/)
has done, IIRC.

~~~
bhauer
You have recalled correctly! That approach is precisely what we have taken
with the TechEmpower benchmarks.

------
bhouston
When I compare databases, I also search out the performance comparisons
created or sponsored by my preferred database provider; then I know that I can
trust the results to be complete and unbiased. /sarcasm

Seriously, why is this one of the top stories on HN? These types of tests are
so easy to tweak in favor of a preferred database that they are completely
unreliable. Even neutral comparisons by third parties are rife with errors,
like not adding proper indices to all the DBs, using query formulations that
avoid the indices on some DBs, or other configuration issues. (DBs are
unfortunately tricky.)

I think the only way to do this objectively is to have a test and then give
each DB vendor an opportunity to tweak the DB and queries to optimize
performance. Seeing how DB vendors optimized performance would actually be
very informative to potential users. Everything else is just a comedy of
errors (or worse), as people usually only have good expertise in one of the
DBs in question, if that.

~~~
porker
> I think the only way to do this objectively is to have a test and then give
> each DB vendor an opportunity to tweak the DB and queries to optimize
> performance.

AFAIK they are: the test is open source, the raw results are there, and
contributions are welcome. Hopefully the OrientDB team will step up and show
how theirs can perform.

~~~
lvca
Hey all, we sent a Pull Request 2 days ago to the author of the benchmark, as
they used OrientDB incorrectly. Now OrientDB is the fastest in all the
benchmarks except for "singleRead" and "neighbors2", and we know why we're
slower there.

We are still waiting for the Arango team to update the results...

~~~
porker
> Hey all, we sent a Pull Request 2 days ago [...] We are still waiting for
> the Arango team to update the results...

Let's see:

1. You sent it at the weekend.
2. The Arango team have a life.
3. It took you 9 days to send the PR.
4. The start of the week is always busy, regardless of which company you work at...

Give over and stop trying to make out there is something suspicious in
whatever the Arango team do. It makes you (look like) a jerk.

------
segmondy
I would love to see postgres in the mix.

~~~
Roboprog
Likewise. I started some stuff at work recently comparing Mongo, Orient and PG
(using a JSON column). Alas, all the test suite does so far is insert small
"documents" (5000 docs w/ 3 name-vals, excluding PK/ID) and time that. No
read-back tests of any kind yet, so no indices in place either.
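
The timing loop is nothing fancier than this sketch (SQLite standing in here;
the real test drove Mongo, Orient and PG through their Java drivers):

```python
import json
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")

start = time.perf_counter()
for i in range(5000):
    # Small "document": three name-value pairs, PK/ID excluded.
    doc = {"a": i, "b": str(i), "c": i % 7}
    conn.execute("INSERT INTO docs (body) VALUES (?)", (json.dumps(doc),))
conn.commit()
print(f"5000 inserts in {time.perf_counter() - start:.3f}s")
```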

For this little test (on my MacBook), Mongo was the fastest. PG took 1.5 times
as long, and Orient took 4 times as long as Mongo. All 3 were driven by a Java
client connected via a socket to the DB on localhost. (Orient could have been
"in process", but I wanted it external, as if on a server.)

Of course, the main use case for Orient is reading back graph chains, so it's
a horrible test for it. However, what we need is a supplemental store to dump
some flat junk as the app runs.

------
matts9581
Worked with Orient back in 2012, had some issues regarding performance and
switched. Following the news about them, I thought they had made great
progress. This benchmark kind of shows the exact opposite.

------
jsc123
The conversation is continuing here:
[https://groups.google.com/forum/#!topicsearchin/neo4j/time/neo4j/E5LsTfwROb8](https://groups.google.com/forum/#!topicsearchin/neo4j/time/neo4j/E5LsTfwROb8)

------
harunurhan
Are you sure that you optimized/tuned Neo4j or MongoDB as much as you did with
ArangoDB?

Also, I don't like it when a company posts a comparison between its product
and others. Although some of these posts are arguably informative and
objective, I consider them marketing/ads.

~~~
neunhoef
(Disclaimer: Max from ArangoDB) We have invested considerable effort to
optimize each database. Obviously, we know our own product better than the
others. However, we have asked people who know the other products better, and
we keep this investigation open for everybody to contribute and suggest
improvements. As you can see from last week's post, there have been very good
contributions; we have tried them out and published the improved results.

------
lvca

For anyone interested in running the tests themselves, just clone this
repository:

[https://github.com/maggiolo00/nosql-tests](https://github.com/maggiolo00/nosql-tests)

------
nevi-me
Will be interesting to see the 'alpha' test results. I'm interested in seeing
how the MongoDB 3.2 series would perform there.

------
lobster_johnson
Anyone using ArangoDB in production who can speak about it? It looks
interesting, but like many of the newer databases coming out (Aerospike,
Blazegraph, Hyperdex etc.) there is precious little public information from
third parties.

