ArangoDB: Multi-model highly available NoSQL database (arangodb.com)
95 points by giancarlostoro 37 days ago | 69 comments



I like Arango but these days it's very hard for a database to compete with Postgres which is incredibly powerful.

Every time I've bet against Postgres and used some other data storage mechanism, I always come back to Postgres.


Disclaimer: I am one of the core developers of ArangoDB and I work for the company ArangoDB.

Yes, ArangoDB is young (6 years) in comparison to PostgreSQL (30 years).

Yes, PostgreSQL is a fantastic database with an amazing open source community, and it is not going away any time soon.

Yes, PostgreSQL is a good choice for a project in which you need a relational single server database.

However, ArangoDB actually has a different value proposition, so in a sense, it is not a direct competitor.

ArangoDB is native multi-model, which means it is a document store (JSON), a graph database and a key/value store, all in one engine and with a uniform query language that supports all three data models and lets you mix them, even in a single query.

Furthermore, ArangoDB is designed as a fault-tolerant distributed and scalable system.

In addition, it is extensible with user-defined JavaScript code running in a sandbox in the database server.

Finally, we do our best to make this distributed system devops friendly with good tooling and k8s integration.

Last but not least, ArangoDB is backed by a company which offers professional support.

Therefore, any well-informed decision for a project needs to look at the value propositions and capabilities, and not only at age and experience, which is of course a big argument, since people are (rightfully so) conservative with their databases.


>is designed as a fault-tolerant distributed and scalable system

What is the consistency model and have you validated that it actually works as designed (for example with Jepsen)? I didn't find anything detailed on your website.


The main thing to compete on is distributed scale-out and high-availability which PostgreSQL does not have a good answer for.


That's absolutely true. However, most systems that think they need scale-out are either worrying about a problem they don't have yet, suffering performance problems due to poor database design, or have not fully explored logical (dynamic or hash-based) sharding as a solution to their problems.
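To make "logical sharding" concrete, here is a minimal sketch of hash-based shard routing in plain Postgres (the key and the 16-shard count are invented for illustration; hashtext() is a real, if undocumented, Postgres function):

    -- Route a key to one of 16 logical shards
    SELECT abs(hashtext('user-1234')) % 16 AS shard_number;

The application (or a thin routing layer) then sends the query to whichever schema or server owns that shard, so you get most of the headroom of scale-out without distributed writes.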

A huge proportion of modern software businesses can run on single-writer RDBMS instances, properly engineered, at a fraction of the operational and implementation cost of a scale-out solution. That applies to hosted and self-managed solutions equally, in my experience.


Sure, but scale isn't everything. Operations are far more important, especially when you're a small team.

Modern distributed databases scale better, but they also have better replication, high availability with automatic failover and no downtime, easier upgrades, easier backups, and generally less maintenance. Removing the single point of failure with efficient distribution, while being able to run easily on Docker/Kubernetes, makes a big difference over a single monolithic database server.


The desire is generally for high availability now and the ability to scale out in the future; while a single-writer DB is generally more than enough performance-wise, it doesn't give you high availability, which is the main desire.


CitusDB for sharding & replication. Patroni & Stolon for automated replication. They are not native solutions, but they are an answer.


CitusDB has a ton of limitations on your schema/table structure right now. Over time it has gotten better, so hopefully in the future it will be a true drop-in scale-out solution. Specifically, some types of relationships and indexes are not supported.


True, but as you said, the compatibility is quickly increasing. They actually did a great job, and there is fairly little that is not compatible, usually on the very advanced side (which is exactly what we would love to have). The team is especially good at what they are doing, and I believe they will get there sooner rather than later.


Oh yeah totally; it's a slow march forward. MySQL Galera took some time before it was ready for serious use, but the fact that it's semi-usable is awesome.


Citus only does sharding, not replication. Either way, this is a fundamental design limitation with traditional single-master relational systems.



Yea, that's just normal Postgres replication, nothing to do with Citus.


Citus does sharding and replication: Citus will shard your data and place every shard on e.g. 3 servers, so one server dying is no problem because the data is replicated twice over.
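For illustration, this is roughly how that is configured in Citus; create_distributed_table and the citus.shard_replication_factor setting are real Citus interfaces, but the table and column names here are hypothetical:

    -- Keep 3 copies of every shard (illustrative value)
    SET citus.shard_replication_factor = 3;

    -- Shard the events table across the cluster by tenant_id
    SELECT create_distributed_table('events', 'tenant_id');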


Ah I see that now. However that's not by default and it still only applies to "distributed" tables, not all the data in the database. Combined with a single-master, it's nowhere near the replica functionality provided by natively distributed storage systems.


Do they have ACID transactions around writing to all 3 servers? Doesn't that cause bad performance?

You might also end up with a LOT of servers unless you start sharing servers between different shards (but still with different sets of 3 servers for every shard).


What about Postgres-XL? It is a release version behind Postgres, but it has ACID and clustering. I do not have personal experience with it, but I would be interested to hear what folks say.

Overall, I agree -- PG with JSONB is an extraordinarily powerful system, covering document-oriented and typical row-oriented use cases.

PG's ecosystem, with foreign data wrappers and in-memory streaming solutions (like PipelineDB), must be the envy of any other DB ecosystem.


Valid point: for 99% of startups, scaling on one machine (today's SSD-based, multi-core, many-GB-of-memory hardware is very powerful) with read replicas and failover (Patroni) works just fine. Many I meet as a consultant, though, want to scale out when they have 100 customers.


My two wishes for Postgres (as a fanboy):

1. Unify JSON and SQL syntax: instead of user->>'email', let me write user.email (see the sketch after this list)

2. Support sub tables, so I can decide if I want a sub table (schema) or JSONB (schema-less)
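A minimal sketch of wish #1, contrasting Postgres' real JSONB operators with the wished-for dotted form (the table and column names are invented, and the second query is hypothetical syntax that does not run today):

    -- Today: JSONB access via operators
    SELECT payload->>'email' AS email
      FROM users
     WHERE payload->'address'->>'city' = 'Berlin';

    -- Wish: dotted paths (hypothetical)
    -- SELECT payload.email FROM users WHERE payload.address.city = 'Berlin';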


The problem is that JSON property names can contain '.', so it would be ambiguous.
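A quick illustration of the ambiguity; both of these are valid JSONB documents, and a dotted path like user.email cannot distinguish between them:

    -- Nested key: what user.email would naturally mean
    SELECT '{"user": {"email": "a@example.com"}}'::jsonb #>> '{user,email}';

    -- But a top-level key may itself contain a dot
    SELECT '{"user.email": "a@example.com"}'::jsonb ->> 'user.email';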


Postgres' inability to handle multiple concurrent queries on a single connection makes it a pain when using it for multi-tenant services. With all the SaaS companies out there, I'm surprised this issue doesn't get brought up more often.


I'm not sure it's that critical. You can use connection pooling on the client side, and/or you can outsource it to a proxy/bouncer that will do this for you.

We have tens of thousands of connections and no problems. We also forked pgbouncer to use multiple cores [0], which allows us to properly utilize our servers.

[0] https://github.com/Pexeso/pgbouncer-smp


PgBouncer doesn't do anything to magically prevent connections to Postgres from being locked and idle for a pending query.

It's just an external connection pooler for when your native language driver doesn't have a good one.


PgBouncer can time out connections and close them if they are idle for too long, which is usually the way to deal with stale connections.
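For reference, a minimal pgbouncer.ini sketch of the relevant knobs (the values are illustrative, not recommendations):

    [pgbouncer]
    pool_mode = transaction
    ; close server connections idle for more than 10 minutes
    server_idle_timeout = 600
    ; cancel queries running longer than 2 minutes (0 disables)
    query_timeout = 120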

I'm not sure what solution you are imagining would be the right one.


Examine any other distributed system that implements pipelining. It allows for multiple pending requests on a connection (simply by putting an ID on every request), which makes for efficient use of connections.


I do understand where you are coming from, but there is a price to everything.

Most distributed systems have eventual consistency; very few can achieve strong consistency. Postgres sacrifices a lot for strong consistency and predictable performance.

I'm not saying that PG has the best model, but for the vast majority of users, running out of connections is just not an issue. Most clients now have pooling built in, and they are very easy to set up. There are a few who need more connections, but there are poolers for them.

What you gain by separating connections is a much easier monitoring and debugging process. You can examine each connection, its impact on resources, its state, what locks it needs to acquire, ...

Additionally, because each connection opens a file descriptor, you gain a lot of operational safety from the underlying OS and its kernel.

PG contributors have spent almost 3 decades building on this system. I'm pretty sure the gain would be minuscule in comparison to the effort that would have to be put into rebuilding the connection model.

I, for one, would appreciate server-side pooling built in, and better HA including simple discoverability, handoffs, ...


That's really cool! But why don't you contribute back this code to pgbouncer? :)


Oh, we tried. Believe me, we tried many times. The original pgbouncer creator didn't bother to respond to any of my emails.


That's because connections (sessions, really, but in practice most things map those 1:1) are the primitives on which Postgres and most other RDBMSes divide their units of transactional consistency--one of their key guarantees. Connections are expensive to create and maintain largely (but not entirely) because of those properties.


This isn't a required design for achieving transactional consistency. See, for example, how this would be done in Erlang: a single connection dispatching to isolated Erlang processes.

Fundamentally just needs isolation to be done at a different layer.


Why not just open another connection?


Each open Postgres connection requires about 10 MB of memory overhead, IIRC.


That's why you need a connection pool. That's true for all database systems.


Not all database systems require a connection pool, though it is often a good idea.


I disagree. I've faced incidents on Oracle/SQL Server/Postgres/Sybase instances when several hundred connections were spawned.

You definitely need to think about pooling from the beginning of the design of your application.


Yes, all of those databases allocate query memory at connection time, and consequently don't scale efficiently for a large number of idle clients. That's not all databases; some don't do that, and some don't even use connections, period.

Connection pools are a necessary workaround for a specific design decision in the system software. They can be helpful for other problems as well, but they aren't necessarily a requirement.


Postgres is powerful, but I want a row in a table to have a relationship to one or more rows in _any other table_ in the database, and Postgres can't help with that.

That to me is a compelling reason to use something other than Postgres.


There are a few solutions to what you're proposing which might help.

One solution is to have a generic reference property paired with another indicating which table to reference:

    reference_id: '1234',
    reference_table: 'posts'

    reference_id: '4321',
    reference_table: 'comments'
You can still benefit from indexes for joins in the same way that you would with actual foreign key constraints, but the downside with this is that you can't actually apply a foreign key constraint.
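For what it's worth, a composite index over both columns is the usual way to make those joins cheap (a sketch; the likes table name is borrowed from the queries further down the thread):

    CREATE INDEX likes_reference_idx
        ON likes (reference_table, reference_id);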

Another solution you can look into, which I don't have much experience with myself, is table inheritance: https://stackoverflow.com/questions/3074535/when-to-use-inhe...

Another way of modeling the data is called an Exclusive Arc: you simply put all of the possible keys you might reference on the table, and add a foreign key constraint on each of them. Then, when you need to make a reference, you leave all but the one in use on that row as NULL. If a "like" can go on a "post", a "comment" or an "image", you would just have all 3 foreign key columns/constraints on the "like" table.
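A minimal DDL sketch of that exclusive-arc shape, reusing the post/comment/image example (names are illustrative; num_nonnulls() is available since Postgres 9.6):

    CREATE TABLE likes (
        id         bigserial PRIMARY KEY,
        post_id    bigint REFERENCES posts(id),
        comment_id bigint REFERENCES comments(id),
        image_id   bigint REFERENCES images(id),
        -- enforce "exactly one reference set" per row
        CHECK (num_nonnulls(post_id, comment_id, image_id) = 1)
    );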

And lastly, and probably best of all, is to simply use 1-2 tables for your entire data model for vertices/edges, and just treat Postgres as a graph database.
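A bare-bones sketch of that vertices/edges layout (all names invented for illustration):

    CREATE TABLE nodes (
        id   bigserial PRIMARY KEY,
        kind text  NOT NULL,
        data jsonb NOT NULL DEFAULT '{}'
    );

    CREATE TABLE edges (
        src   bigint NOT NULL REFERENCES nodes(id),
        dst   bigint NOT NULL REFERENCES nodes(id),
        label text   NOT NULL,
        PRIMARY KEY (src, dst, label)
    );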


> One solution is to have a generic reference property paired with another indicating which table to reference

> You can still benefit from indexes for joins in the same way that you would with actual foreign key constraints, but the downside with this is that you can't actually apply a foreign key constraint.

Thanks, I had not thought JOINs possible on this - how would a query look? I'm guessing it works only for one type of reference_table.

> Another solution you can look into, which I don't have much experience with myself, is table inheritance: https://stackoverflow.com/questions/3074535/when-to-use-inhe....

Been there, tried it, banged my head on most of these (particularly the FK limitation): https://www.postgresql.org/docs/11/ddl-inherit.html :)

If it were fixed in Postgres, it'd be awesome. I've lost the wiki page that was around on the topic, but it had been unchanged for ~10 years, so it really should've been marked 'wontfix'.

> Another form of modeling the data in such a manner is called Exclusive Arc,

> And lastly, and probably best of all, is to simply use 1-2 tables for your entire data model for vertices/edges, and just treat Postgres as a graph database.

That's what I'm doing, with the 'node' table using JSON to store the node-specific data. I feel sooooooo dirty, but it works. I've yet to figure out schema evolution of the JSON in a controlled manner :)


> Thanks, I had not thought JOINs possible on this - how would a query look?

So this is for the structure with the two columns

    reference_id: '1234',
    reference_table: 'posts'

    reference_id: '4321',
    reference_table: 'comments'
So let's say that this is the `likes` table that can like either `comments` or `posts`, here are some queries:

    -- Get comments and their respective likes
    SELECT * FROM comments
      JOIN likes ON likes.reference_id = comments.id
       AND likes.reference_table = 'comments';

    -- Get posts and their respective likes
    SELECT * FROM posts
      JOIN likes ON likes.reference_id = posts.id
       AND likes.reference_table = 'posts';

    -- Get likes and their referenced comments
    SELECT * FROM likes
      JOIN comments ON comments.id = likes.reference_id
     WHERE likes.reference_table = 'comments';
And so on.


Any resources you'd like to recommend to a beginner that comprehensively cover all aspects of Postgres?


PostgreSQL Tutorial covers many fundamentals:

http://www.postgresqltutorial.com/


I trialled ArangoDB for a short while approximately 2 years ago. I was looking for a graph database to tinker with and learn about.

I was pleasantly surprised with ArangoDB. It was really user-friendly, and I liked how easy it was to set up compared to other multi-model databases. Definitely consider it for a hobby project.


The entire offering looks very compelling. Unfortunately, I couldn't get a sense of the operational efficiency or performance of ArangoDB in the wild, and I couldn't convince our CTO that, if we adopted this product, we wouldn't be trading the extra development agility of the mixed models (graph, AQL and KV) for an overly complex babysitting problem.

Can anyone here speak about their experience using ArangoDB in a multi-tenant SaaS product? How is it to manage your own cluster, backups, etc.?


I led our database team for a while and picked ArangoDB. We started to regret it pretty soon thereafter. While I was there, we found that our cluster would just die. I think they put a fix in for that, though my memory is hazy.

After I left, they came to really regret it. My team was working at scale, and basically found themselves doing QA work, troubleshooting with the Arango team. To their credit, the Arango crew was extremely responsive and helpful. Maybe they've fixed things up since then; it's been a year and a half.

At this point, I would hard pass on any database whose name doesn't start with PostgreSQL. Just got burned too hard.

Graph databases ostensibly let you write queries that would otherwise be unwieldy, but it turns out PostgreSQL's `recursive` keyword lets you achieve roughly the same things, sans having to learn a whole new query language.
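For the curious, a hedged sketch of such a recursive query, assuming the nodes/edges table shape sketched earlier in the thread:

    -- All nodes reachable from node 1 (UNION deduplicates,
    -- so the recursion terminates even on cyclic graphs)
    WITH RECURSIVE reachable AS (
        SELECT dst FROM edges WHERE src = 1
        UNION
        SELECT e.dst FROM edges e
          JOIN reachable r ON e.src = r.dst
    )
    SELECT * FROM reachable;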


Yes, we had cluster stability issues 1.5 years ago. That has changed: cluster stability and performance have top priority, and we invest a lot to improve the developer and devops experience with every release. Now, e.g. with K8s deployments or the ArangoDB starter, it's much easier to run and maintain clusters. I hope you find the time to give it a second try.


Thanks for the insight, that was my fear. Our decision came down to the notion of "no one ever got fired for recommending SQL Server or PostgreSQL"


Do you mind telling us the issues you faced with ArangoDB? We are evaluating a few options currently, and Arango is one of them.


I mentioned that the cluster would freeze. The other issue I remember vividly (less important, but irksome) was that the join query `FOR i IN collection1 FOR j IN collection2 FILTER ...` executed way, way slower than the equivalent query written as a subquery: `LET col1 = (FOR i IN collection1 FOR j IN collection2 FILTER ...)`.

I kept in touch with one of the directors, and after I left he mentioned a couple of things - they found that doing joins returning large amounts of data (maybe 10k records, IIRC) was prohibitively slow. They also found that under certain conditions, with a certain amount of data in the database, it would crash. He didn't ever describe the conditions.

They switched to Couchbase, and have reported being happy with it.


Being one of the developers of ArangoDB, I would like to use the chance to reply to this as well.

I think there were various issues with cluster stability 1.5 years ago, and since then we have put great effort into making the database much more robust and faster. Many man-years have been dedicated to this since 2017.

1.5 years ago we were shipping release 3.1, which is out of service already. Since then, we have released

* ArangoDB 3.2: this release provided the RocksDB storage engine, which improves parallelism and memory management compared to our traditional mostly-memory storage engine

* ArangoDB 3.3: a new deployment mode (active failover), plus ease-of-use and replication improvements (e.g. cross-datacenter replication)

* ArangoDB 3.4: the latest release, for which we put great emphasis on performance improvements, namely for the RocksDB storage engine (which is now also the default engine in ArangoDB)

In all of the above releases we also worked on improving AQL query execution plans, in order to make queries perform faster in both single-server and cluster deployments. Working on the query optimizer and query execution plans is obviously a never-ending task, and not only have we achieved a lot here since 2017, we still have plenty of ideas for further improvements in this area. So there are more improvements to be expected in the following releases.

All that said, my intention is to show that things should have improved a lot compared to the situation 1.5 years ago, and that we will always be working hard to make ArangoDB a better product.


What release will be stable, if version 3.1 had serious clustering issues? In the old, bad days, version 1.0 was considered stable :)


Thanks, that's kind of reassuring.


> PostgreSQL's `recursive` keyword lets you achieve roughly the same things

Or you could run GraphQL on top of PostgreSQL. Just because you can write anything in SQL doesn’t mean you should.


GraphQL has nothing to do with graph queries though.

GraphQL is called that because it was created as a query language for Facebook's "social graph". It actually doesn't provide any graph operations or recursion (i.e. you explicitly tell it how many levels deep you want to go).

You can provide a GraphQL interface on top of any backend or database, though.


I used Arango for my hobbyist project and enjoyed it immensely. AQL is an absolute joy. I am hoping that they improve Foxx (the web server which runs on top of the DB), because that would make Arango the best tool for my kind of use cases.


Why not use Nginx or Node.js alongside ArangoDB? The problem with integrating a full web server is that you don't want the database to wait until some external request carried out by the web server finishes (no blocking).

If there was an integrated web server, but fully decoupled to not block the database, then there would probably be no real benefit over running a separate web server on the side.


Hi, I'm part of the Foxx team. Out of personal interest: what improvements to Foxx are you looking for?


Same question I always ask: who is using it in prod, and how large is the database? I've used PostgreSQL with a single table of 200+ GB and it works like a breeze! What caveats and traps should I be aware of?


Yes. I'm also curious about what kind of tradeoffs I would be making by choosing ArangoDB over Postgres, etc. For example, with Postgres you have ACID but not much in the way of horizontal scale-out. With Redis you have speed (single-threaded, though), but the data must fit in RAM. What am I giving up by choosing this DB?


It's a broad-spectrum antibiotic: it's meant to be applicable to all-purpose databasing. Horizontal scaling is a given, but you need to reshard your collections. Great overall performance (no exaggeration), and in some special cases better than specialized databases. My 2 cents.


Just take it and bombard it with 1 TB of generated data. Are there good test data sets out there, too?



This is where I went after RethinkDB shut down.

The built-in UI is really nice.

If you use documents and graphs, take a look.


I've been using it for personal projects and it's been quite fun!

The comments about Postgres (and database X) vs. ArangoDB are always the same; this is a choice that depends on the use case. If your scenario is vague, ArangoDB has good performance on a wide array of use cases.

On a personal note, the main downside is not having any ORM for Golang, for example; Node.js doesn't have one worth considering either.

Python seems to have arango-orm, which makes it simpler for small projects to integrate it.

Another improvement could be the graph visualization; simplifying the setup to use another solution would be nice.


At Sabre we're currently taking a really hard look at https://www.memsql.com/. I'm not seeing enough strength in the NoSQL paradigm, at least for our use case, but I'm definitely looking for better and faster management of an arbitrary number of leaf nodes and consumer nodes.


The cool thing about Arango is that you can use all data models in a cluster setting as well, and combine the models as you need. There's a pretty nice and helpful community around the project, too.


Is it connection based like MySQL and such?


no



