ArangoDB (arangodb.org)
94 points by majidazimi on Dec 6, 2013 | hide | past | favorite | 67 comments

> As a relational database user we are used to treat the database as stupid and mainly use it to save and retrieve data. ArangoDB lets you extend the database using Javascript (production ready) and Mruby (experimental).

?!? A common complaint against relational databases is that people put "too much" logic in the database. (I clearly don't agree, being a user of stored procedures and custom extensions ;P.)

Personally I loathe stored procedures because they often include a lot of logic that shouldn't be in the database, and they generally involve SQL extensions that are pretty terrible for the general-purpose computing you see them used for.

But if that layer is primarily used for providing, controlling, and optimizing access to the data I can see the appeal. And in that case? Being able to write the procedural parts in a language that's less-terrible than typical SQL extensions would be really nice.

"Being able to write the procedural parts in a language that's less-terrible than typical SQL extensions would be really nice."

Postgres allows writing functions in many languages, including C, Python, and JavaScript.

Exactly. The problem with stored procedures isn't that they exist, but that once a programmer is working in a given environment he won't change to another one until forced. The result is that once someone starts writing stored procedures, they tend not to stop until the whole app is in the database. If there were a clear mechanism to keep app logic out of the SPs, they'd have a much better reputation.

Yet, if you have to roll out both a database server and an application server, that can be quite a lot of overhead for certain kinds of applications. I think this data layer on top of the data store is especially interesting for, e.g., backends in a networked world where your data is distributed, or when you need to aggregate data from multiple sources.

Or, what use case is your comment about?

IMHO nowadays quite a lot of logic is moved into the front-end (for example in single-page web applications). In these cases you only need to provide some API in the back-end. But normally you need a bit more than "just" the data:

- some access control (e.g. sessions)
- computing some fields (e.g. "age" derived from "birthday")
- combining and filtering data before transferring it to the client (e.g. for graph traversals)

I have a feeling, tell me if you share it.

"Wouldn't it be cool to have a multipurpose database which we would be able to query with a language, but not SQL, because SQL sucks, for some reason.".

Put differently, what do ArangoDB, MongoDB, whateverDB bring that relational databases didn't bring 30 years ago?

Some things just aren't relationally shaped. You can model them relationally, but it can be a pain.

For instance, graphs are totally doable with a traditional RDBMS; however, it is painful. You end up joining a table against itself (or via an edges table) multiple times, or alternatively bouncing many queries off the table as you iterate the graph. One common type of NoSQL db is the graph database, which is designed with graphs in mind so you don't even have to think about this access pattern. It is nice.

Another case that you can handle with a traditional RDBMS, but which is annoying, is loose user-defined fields such as "tags", where you have to create a tags table and a join table to make it work, with a lot of potential inefficiency (even with indexes). Or even worse, user-defined attributes: lots of custom table creation per user, or big joins against a star topology to do it properly. (Or, if not, you end up with something that merely looks like an SQL database to a bunch of frontend code.)

Of course, other times you'll find yourself doing something in a NoSQL database that is effectively a bunch of lookups against a table followed by combining the results (at which point switching to an RDBMS is the solution)...

I guess what I'm saying is that lately the data store is being looked at as a component more like a library than a subsystem. I'm not sure if this is good or bad, but it certainly has helped a few cases I've dealt with nicely - ripping out a horribly complex data layer and replacing it with a NoSQL solution where the data model is shaped like my data.

Basically it's a matter of the right tool for the job.

To name just a few:

- being relaxed about schemas: no more long-running ALTER TABLE commands, no more up-front schema definitions that waste time when doing proofs of concept, etc.
- being friendly to variable and hierarchical data: no more entity-attribute-value patterns or needing to store JSON etc. as BLOBs
- integration of scripting languages such as JavaScript, so you can have one language for the full stack if you want
- embracing web standards (HTTP, JSON)
- no object-relational mismatch (there are no relations), as you can more easily map a single programming-language object to a document

Relational databases partly offer solutions for this, too. But in a relational database, these things are (often clumsy) extensions and not well supported.

I do all of what you mention using SQL Express (I don't use JSON though, because binary serialisation is faster). Abstraction means I save my data as documents/blobs, can still do joins, don't have to alter tables (since de-serialised entities are self-describing), and get fully indexed entity content.

It means thinking about what you're going to do and how before you start coding (design up front). It means creating a throw-away proof of concept before you start coding. But it is flexible and extensible, and changes to the schema, as it were, do not impact upstream dependencies.

SQL Express guarantees consistency, but most NoSQL databases guarantee availability. Although many SQL databases can be used to implement the same functionality as a NoSQL database, that doesn't mean it's as easy. And easiness is what matters, because if something is easier, you may spend less time working on it and that saves money.

The term "NoSQL database" is a bit problematic, because the definition only says that the database is not relational, but the average NoSQL database has other differences from the average SQL database: using JSON and JavaScript, and perhaps queries over HTTP and so on.

The point is not "what is possible", but choosing the best tool for the job.

Indeed. Although I'd say most NoSQL databases guarantee partition tolerance (I can guarantee availability using a SQL database).

SQL Express runs as a single instance on a client so you get all three - consistency, availability and partition tolerance[1]. When running something else on more than one node you can choose two of those three. Availability and consistency are usually chosen because of business drivers. If partition tolerance is required (I've yet to encounter a scenario that makes a compelling case for it [2]) then eventual consistency is the price.

There are databases out there that will suit any combination of the three. Unfortunately the equation is somewhat more complex, because databases that offer consistency and partition tolerance (for example) don't typically offer JOIN-like functionality.

Everything's a trade-off. Other considerations are tried and tested v. bleeding edge; painful v. painless; ideal v. affordable; and so forth.

[1] http://en.wikipedia.org/wiki/CAP_theorem

[2] I'm not saying there aren't scenarios where it makes sense (Google, for example), just that I've not encountered one. I do run into many people who want Mongo and who, not understanding what they're asking for, would be better off with a relational database.

The MemSQL relational database has extensive support for JSON and online ALTER TABLE.

Indexes can be created on fields within JSON documents for faster access.

Online ALTER TABLE doesn't take any long running lock, so it can run in the background without affecting your queries.

That's not to say that no relational database has any of these features, but I think they're more the exception than the rule. Relational databases are primarily designed for relations, and saving your objects as documents is simply a different model.

That doesn't mean one is better than the other, it's two alternative ways of achieving things.

Sometimes it is just a lot easier and more query efficient to store a bunch of related data as just a hash or even with embedded hashes. Like, if you want to store a list of key/value pairs alongside a bunch of other data. Yes, you could do that via a bunch of tables and relations and joins and things, but conceptually if all that data can be seen as one self contained record, why spread it across a bunch of separate tables?

Sometimes a document collection is just a whole lot easier to reason about.

I'd say that one of the biggest things many of these NoSQL data stores bring is auto-sharding and being able to query/map-reduce a distributed data source. Relational dbs work pretty well so long as you can vertically scale your data and/or don't need to query across databases. Horizontal scaling is the tricky part for relational dbs, and it's what NoSQL data stores market as their big selling point.

Indeed, and AQL looks interesting for the use cases where Mongo fails, e.g. http://www.sarahmei.com/blog/2013/11/11/why-you-should-never...

I remember reading that article. It read like the lessons of an inexperienced developer, not necessarily problems with the technology they were complaining about. BTW, what do you mean by AQL?

The idea of AQL (ArangoDB Query Language) is to bring an SQL query-like language to a document-store, most notably enabling joins between documents.

Being able to join documents in a query would be very useful in Mongo. It's promising to hear that ArangoDB has that feature.

NOSQL databases in general (and multi-model databases like ArangoDB in particular) offer a greater flexibility in the choice of your data structures than traditional relational databases do. Furthermore, you can configure exactly the right compromise for your application between ACID and eventual consistency, and consistency/scalability.

"Transactions in ArangoDB are atomic, consistent, isolated, and durable (ACID)." "Collections consist of memory-mapped datafiles...". "by default, ArangoDB uses the eventual way of synchronization...synchronizes data to disk in a background thread."

So it's not ACID by default, and it's practically not usable with immediate sync turned on (huge number of seeks due to the use of mmap), just like Mongo.

As in many databases, ArangoDB allows some choices regarding durability. Immediate disk synchronisation is turned off by default in ArangoDB. Synchronisation is then performed by a background thread, which is frequently executing syncs. By the way, several other NoSQL databases have immediate synching turned off by default, e.g. CouchDB, MongoDB.

In ArangoDB you turn on immediate synchronisation on a per collection level, or use it for specific operations only. So it's up to you how you want to use it. This gives the database user a fine-grained choice.

I remember using some relational databases in the past where we turned immediate synchronisation off as well to get more throughput. So it's probably not entirely uncommon, but I understand the expectation of relational users that everything is fully durable by default.

Memory-mapped files don't have anything to do with ACID. It's just a detail of the internal organisation of buffers. You can have full durability with memory-mapped files. You just have to use msync instead of fsync/sync.

There is a huge difference in the sophistication between what you describe and what a traditional SQL system like Postgres does.

The problem with just mmapping files is that, to sync, you have to do a bunch of random writes. To commit two transactions, you have to jump two places in the disk and do two writes. You can defer them, but then your transaction commit latency goes up dramatically. So the user is between a rock and a hard place: uncertain durability for extended periods of time, or long commit latency.

Compare that to a system based on a Write-Ahead Log (WAL). The log is 100% sequential (and often preallocated in large chunks), and a transaction is durable if the log is flushed up to some certain point. All transactions go into the same log, so under high concurrency, one flush to disk might commit several transactions. And even if you flush for each transaction, at least you don't have to jump around on disk (and, if using a controller with battery-backed cache to reduce latency, you can make do with a fairly small cache).

The writes to the main data area can be deferred for a long time (30 minutes might be normal), and syncing those is called a checkpoint. You can spread the checkpoint out over time (it's a continuous process, really) so that they don't cause transaction latency spikes. Deferring the writes for so long allows the writes to be scheduled more efficiently without sacrificing durability at all.

On top of that, if you are OK with small windows of time before the commits are durable, postgres allows you to choose on a per-transaction basis not to wait for the WAL flush before returning to the client. If you crash, you are guaranteed to be consistent still, and if a normal transaction comes along, it will of course force a WAL flush. You can control the window of time before postgres will force a WAL flush, where 200 milliseconds might be normal. It can be a small number and still gain you a lot because it's just writing a sequential log, so there's no need to defer it for multiple seconds.

In other words: Mmapping files gives you a choice between very short commit latency and long periods of uncertain durability; or long commit latency. WAL gives you a choice between very short commit latency and short periods of uncertain durability; or short commit latency.

I understand ArangoDB is new. The description sounds interesting, and I like some things about the goals. But I think it's way off the mark to tout the durability as offering a nice trade-off, when a much better method (at least for OLTP) has been known for two decades[1].

[1] Basic idea introduced by ARIES paper in 1992. Couldn't find a link to a PDF, but it is a well-known paper.

I'm pretty sure you can yank the power cord from CouchDB as soon as you get a (positive) response and the data will be saved.

Unlike Mongo, Couch has a sophisticated append-only B-tree format for storing data that is almost impossible to corrupt.

Saved (durable) and hard to corrupt are different properties of a database. For example, Elasticsearch uses the Lucene index format in the background. It's write-once per segment: once a segment is written, the data is safe and (apart from disk corruption) impossible to corrupt, since the file is never opened for writing again. However, segments are not written immediately after a document is received, so if you yank the power cord right after a write to the cluster, you'll lose data, but without any danger of corruption, since the last, partially written segment is discarded. CouchDB behaves in a similar fashion: if the last bit of the storage file contains corrupt data, it is discarded. I'm not absolutely certain atm about the default durability settings in Couch, so I can't say if the write happens before or after the "ack" from the server. However, since disk controllers cheat and sometimes a "flush" to disk doesn't actually flush, you can get data loss regardless of the promises your database makes.

Durability is indeed hard to achieve: as you pointed out, disk controllers sometimes simply tell you a lie.

With respect to corruption, ArangoDB behaves similarly: it uses an append-only log file with CRC checksums. So, if the last bit of storage contains corrupt data, it is discarded.

> "In typical applications with "complex" database operations there is often no clean API to the persistence layer when to or more database operations are executed one after each other which belong together from an architectural perspective."

Is that even a real sentence?

> Put it differently, what does ArangoDB, MongoDB, whateverDB bring that relational databases didn't bring 30 years ago?

(Let's leave MongoDB out here ;-) What I really love and what the relationals do not have are:

* Graphs as first-class citizens! (Try to view them in the web GUI :-)
* The tight V8 & JavaScript integration (Foxx is more than cool. Hope I will be able to use it from ClojureScript)

What you might find in earlier databases but not completely in others today (my personal hitlist :-):

* The incredible range of indices, even skiplist and n-gram!
* Multi-core ready
* Durability tuning (already mentioned by Jan)
* AQL covering KV, JSON and graphs! (Martin Fowler was quite sceptical that this model integration could work...)
* An MVCC design that makes it SSD-ready
* Capped collections
* Availability on tons of OS versions: Windows, iOS, all UNIXes and even Travis-CI (how cool is that?!)

Try it. Might be fun in production compared to other famed NoSQL DBs.... (at least to me)

> There are driver for all major language like Ruby, Python, PHP, JavaScript, and Perl.

I chuckled at the absence of the most widely deployed language on the planet.

I dare say that there is a driver for Java - didn't look, because after browsing through a reasonable portion of their site, I still couldn't get a simple explanation of what this DB allegedly does and doesn't do.

There is a Java driver and even an object mapper based off jackson - https://www.arangodb.org/drivers

This seems to be a MongoDB clone with some extra features added on to make it a bit closer to a relational DB, I guess. Looks interesting, but it likely suffers from the same problems MongoDB suffers from (data safety, scaling difficulties, etc.)

EDIT: Have to say, the idea of a mongodb database with graph operations built in is pretty attractive for small network oriented problems...

I jumped on arango, looking for a small graphdb I could run locally on my machine. They've since moved it to a much wider scope. Turns out I abandoned that project idea (like most... :D), but I think Arango is sufficiently unique to merit more investment.

I have tried to run ArangoDB under Node.js and had some small successes, but in general their claim to have drivers for many platforms is far from reality. Also, why not release Node.js binary drivers instead of pushing people to use Foxx? Simple browsable docs would also be nice, instead of chunked documents covering who knows what. Finding the link to their query language was a big hassle :) Their graph traversal is also still in its infancy, as I understand it.

I agree somewhat on the documentation part. The getting-started guide is nice, but the gap between reading about the concepts and the actual reference on how to use them is a bit large at times. Best would be reference-based documentation, e.g. as in http://underscorejs.org/

I would call https://www.arangodb.org/manuals/current/UserManual.html a browsable doc (3 clicks from the main page). And a further click fires up the chapter about AQL...

I would prefer more or less API-style docs with short general examples, rather than a lot of unusable text.

There is a Node.js driver and even an integration into JugglingDB. See https://www.arangodb.org/drivers. And in the end, it's all basically HTTP calls that you are making to query the database.

I would not call that a driver. Also, all of them are third-party developed, seemingly by single developers, so no standardization is possible. Database adoption only goes as far as your ability to use it in your project. And I found that the Node.js drivers in particular were outdated on npm; I had to nudge one of the developers to update the npm repo. Also, a REST access library is called a driver???

I like that they're sufficiently ignorant of MongoDB's implementation to misattribute the primary cause of excessive space usage.

Gives me confidence to trust them with my data.

The big advantage of ArangoDB with respect to memory/disk usage is that despite the Schema-less-ness, the database automatically recognises common "shapes" of the documents in a collection and thus usually does not have to store all attribute names many times. In addition, the possibility of transactions makes it less necessary to keep many old revisions of documents, in comparison to for example MongoDB.

I'd be more interested in a document store that was historical and didn't mutate.

Failing that, I'll just keep using RethinkDB for this sort of thing.

What do you think is the primary cause of excessive space usage in MongoDB?


Well known and understood. Even well-documented by 10gen themselves, nothing to do with the bling you added to Arango.

> In ArangoDB, a transaction is always a server-side operation, and is executed on the server in one go, without any client interaction.

It doesn't seem to support interactive transactions. That means only simple batch reads & writes, no complex transactions. It seems to sit somewhere between CAS and a real generic transaction, which doesn't seem very useful.

No - it goes way beyond simple batches of operations. Basically, you write your transaction as a JavaScript program, so you can do anything you could do on the client side, with the exception of waiting for another source (i.e. user interaction). You could read a document from one collection and choose different actions based on its attributes, or change documents in multiple collections. I think the PHP driver uses some kind of abstraction to hide the JavaScript from the developer.

This will be even cooler if they add 'turn-key' scaling. Their scaling approach is still a work in progress.


Anyhow, good job so far to ArangoDB team.

Thanks, and: you are right, scaling by sharding is important, and that is why we have made this our top priority for the coming three months.

Good to hear. I'd love to consider ArangoDB for an analytics project at that time.

By the way, if someone wants to give this a try on a VM, I wrote a blog post about it here: http://thinkingonthinking.com/A-Data-Platform-in-15-minutes/

I'm really excited about the look of this. Being a big fan of Mongo et al. for the structurelessness, I also sometimes miss the graph-like structure that you can easily create with SQL. Arango looks cool. I shall try it :)

There is a screencast by McHacki about the graph explorer he wrote: https://www.arangodb.org/2013/11/29/visualize-graphs-screenc...

This is awesome. I like the way the vertices are automatically moved in a way such that they do not overlap very much and that the edges are easily visible. It is also good that "similar" vertices are automatically collapsed into a "multi-vertex". I think this is very useful functionality to inspect a big graph locally.

What I particularly like is the functionality to process graphs and explore them interactively in the browser. This has been added in some recent version, and it makes working with graphs a lot easier than before.

Quickstart link doesn't exist: http://www.arangodb.org/quickstart

Looking forward to learning more when it's online!

There is a blog post on how to get started on a VM here: http://thinkingonthinking.com/A-Data-Platform-in-15-minutes/

What I like about ArangoDB is the speed of development, as well as its native support for building RESTful interfaces.

Last but not least, it is open-source!

Building a REST interface with access to your data is easy thanks to ArangoDB's Foxx framework. You can implement all your backend code in JavaScript and upload it to the server. Thus you can do any sort of preprocessing on the server and make the result available to frontends. And it's easy to integrate with a front-end, because it's all about passing JSON around via HTTP.

And the tree v. table debate continues...

I wonder how it compares with RethinkDB?

Interesting! Haven't heard about RethinkDB - would be nice to get a post somewhere like this: http://www.rethinkdb.com/docs/rethinkdb-vs-mongodb/

Also, the query language of RethinkDB looks interesting: http://www.rethinkdb.com/docs/rethinkdb-vs-mongodb/ - compare with https://www.arangodb.org/manuals/current/Aql.html

Wrong link for ReQL, for anyone checking. ReQL reference is here: http://rethinkdb.com/api/javascript/

Is this named after Juan Arango?

Arango is a special sort of avocado. But Moenchengladbach (where Juan Arango currently is under contract) is next to Cologne, ArangoDB's headquarters :-)

Oh, he's so good...! And I say that coming from Cologne. You need to know that the local club really dislikes Arango's current club (based 50 km away from here).

The project was initially named AvocadoDB, but we had to rename it last year because someone else claimed the same name (for whatever reason).
