> As relational database users we are used to treating the database as stupid and mainly use it to save and retrieve data. ArangoDB lets you extend the database using JavaScript (production ready) and MRuby (experimental).
?!? A common complaint against relational databases is people having "too much" logic in the database. (I clearly don't agree, being a user of stored procedures and custom extensions ;P.)
Personally I loathe stored procedures because they often include a lot of logic that shouldn't be in the database, and they also generally involve SQL extensions that are pretty terrible for the general-purpose computing you see them used for.
But if that layer is primarily used for providing, controlling, and optimizing access to the data I can see the appeal. And in that case? Being able to write the procedural parts in a language that's less-terrible than typical SQL extensions would be really nice.
Exactly. The problem with stored procedures isn't that they exist but that once a programmer is working in a given environment, they won't change to another one until forced, with the result that once someone starts writing stored procedures they tend not to stop until the whole app is in the database. If there were a clear mechanism to keep app code out of the SPs, they'd have a much better reputation.
Yet if you have to roll out both a database server and an application server, that can be quite some overhead for certain kinds of applications. I think this data layer on top of the data store is especially interesting for e.g. backends in a networked world, where your data is distributed, or when you need to aggregate data from multiple sources.
IMHO quite a lot of logic is moved into the front-end nowadays (for example in "single-page applications"). In these cases you only need to provide some API in the back-end. But normally you need a bit more than "just" the data:
- some access control (e. g. sessions)
- compute some fields (e. g. "age" derived from "birthday")
- combine and filter data before transferring it to the client (e. g. for graph traversals)
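A derived field like "age" is a one-liner in that layer. A minimal sketch in plain JavaScript (the document shape and function name are made up for illustration):

```javascript
// Compute a derived "age" field from a stored "birthday" before the
// document goes out to the client. Assumed document shape:
// { name: "...", birthday: "YYYY-MM-DD" }.
function withAge(user, now = new Date()) {
  // Parse "YYYY-MM-DD" by hand to avoid timezone surprises.
  const [y, m, d] = user.birthday.split("-").map(Number);
  let age = now.getFullYear() - y;
  // If the birthday hasn't happened yet this year, subtract one.
  const hadBirthday =
    now.getMonth() + 1 > m ||
    (now.getMonth() + 1 === m && now.getDate() >= d);
  if (!hadBirthday) age -= 1;
  return { ...user, age };
}
```

In a data layer such a function runs server-side, so the client never has to recompute (or even see) the raw birthday logic.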
"Wouldn't it be cool to have a multipurpose database which we could query with a language, but not SQL, because SQL sucks, for some reason."
Put differently, what do ArangoDB, MongoDB, whateverDB bring that relational databases didn't bring 30 years ago?
Some things just aren't relationally shaped. You can model them relationally, but it can be a pain.
For instance, graphs are totally doable with a traditional RDBMS, but it is painful. You end up joining a table against itself (or via an edges table) multiple times, or alternately bouncing many queries off the table as you iterate the graph. One common type of NoSQL DB is the graph database, which is designed with graphs in mind, so you don't even have to think about this access pattern. It is nice.
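To make the "bouncing many queries" pattern concrete, here is roughly what application code ends up doing against an edges table, sketched with an in-memory array standing in for the table (no particular driver API assumed):

```javascript
// Breadth-first traversal over an edge list, the way an application
// iterates a graph stored in an RDBMS: one query per frontier.
// edges: array of [from, to] pairs standing in for an edges table.
function reachable(edges, start) {
  const visited = new Set([start]);
  let frontier = [start];
  while (frontier.length > 0) {
    // In a real app each loop iteration would be one SQL round trip:
    // SELECT "to" FROM edges WHERE "from" IN (...frontier)
    const next = edges
      .filter(([from]) => frontier.includes(from))
      .map(([, to]) => to)
      .filter((to) => !visited.has(to));
    next.forEach((to) => visited.add(to));
    frontier = next;
  }
  return visited;
}
```

A graph database runs this whole loop inside the store; the application just asks for the traversal.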
Another case that you can handle with a traditional RDBMS, but which is annoying, is loose, user-defined fields such as "tags", where you have to create a tags table and a join table to make it work, with a lot of potential inefficiency (even with indexes). Or even worse, user-defined attributes: lots of custom table creation per user, or big joins against a star topology to do it properly. (Or, if not, you end up with something that looks like an SQL database to a bunch of frontend code.)
Of course, other times you'll find yourself doing something in a NoSQL database that is effectively doing a bunch of lookups against a table and then combining the results (at which point switching to an RDBMS is the solution)...
I guess what I'm saying is that lately the data store is being looked at as a component more like a library than a subsystem. I'm not sure if this is good or bad, but it certainly has helped a few cases I've dealt with nicely - ripping out a horribly complex data layer and replacing it with a NoSQL solution where the data model is shaped like my data.
Basically it's a matter of the right tool for the job.
To name just a few:
- being relaxed about schemas: no more long-running ALTER TABLE commands, no more up-front schema definitions that waste time when doing proofs of concept etc.
- being friendly to variable and hierarchical data: no more entity-attribute-value patterns and necessity to store JSON etc. as BLOBs
- integration of scripting languages such as JavaScript, so you can have one language for the full stack if you want
- embracing web standards (HTTP, JSON)
- no object-relational mismatch (there are no relations), as you can more easily map a single programming language object to a document
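As a concrete illustration of the last two points: the kind of record that needs extra join tables or an EAV pattern in a relational schema is just one document here (the field names are made up):

```javascript
// One self-contained document: nested data and per-record custom
// fields live inline, with no extra tables or joins needed.
const product = {
  _key: "p42",
  name: "Espresso machine",
  tags: ["kitchen", "coffee"],          // no tags table + join table
  dimensions: { w: 30, h: 40, d: 25 },  // hierarchical data inline
  customAttributes: {                   // user-defined fields, no EAV
    voltage: "230V",
    warrantyMonths: 24
  }
};
```

The object a program works with and the document the database stores have the same shape, which is exactly the "no object-relational mismatch" point.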
Relational databases partly offer solutions for this, too. But in a relational database, these things are (often clumsy) extensions and not well supported.
I do all of what you mention using SQL Express (I don't use JSON though, because binary serialisation is faster). Abstraction means I save my data as documents/blobs, can still do joins, don't have to alter tables (when de-serialised entities are self-describing), and get fully indexed entity content.
It means thinking about what and how you're going to do something before you start coding (design up front). It means creating a throw-away proof of concept before you start coding. But it is flexible, extensible, and changes to the schema, as it were, do not impact up-stream dependencies.
SQL Express guarantees consistency, but most NoSQL databases guarantee availability. Although many SQL databases can be used to implement the same functionality as a NoSQL database, that doesn't mean it's as easy. And ease is what matters: if something is easier, you spend less time working on it, and that saves money.
The term "NoSQL" database is a bit problematic, because the definition only says that the database is not relational, whereas the average NoSQL database has other differences from the average SQL database: using JSON and JavaScript, and perhaps queries over HTTP and so on.
The point is not "what is possible", but choosing the best tool for the job.
Indeed. Although I'd say most NoSQL databases guarantee partition tolerance (I can guarantee availability using a SQL database).
SQL Express runs as a single instance on a client so you get all three - consistency, availability and partition tolerance[1]. When running something else on more than one node you can choose two of those three. Availability and consistency are usually chosen because of business drivers. If partition tolerance is required (I've yet to encounter a scenario that makes a compelling case for it [2]) then eventual consistency is the price.
There are databases out there that will suit any combination of the three. Unfortunately the equation is somewhat more complex, because databases that offer consistency and partition tolerance (for example) don't typically offer JOIN-like functionality.
Everything's a trade-off. Other considerations are tried and tested v. bleeding edge; painful v. painless; ideal v. affordable; and so forth.
[2] I'm not saying there aren't scenarios where it makes sense (Google, for example) - just that I've not encountered one. I do run into many people who want Mongo and, not understanding what they're asking for, would be better off with a relational database.
That's not to say that no relational database has any of these features.
But I think they're more the exception than the rule.
Relational databases are primarily designed for relations. And saving your objects as documents simply is a different model.
That doesn't mean one is better than the other, it's two alternative ways of achieving things.
Sometimes it is just a lot easier and more query efficient to store a bunch of related data as just a hash or even with embedded hashes. Like, if you want to store a list of key/value pairs alongside a bunch of other data. Yes, you could do that via a bunch of tables and relations and joins and things, but conceptually if all that data can be seen as one self contained record, why spread it across a bunch of separate tables?
Sometimes a document collection is just a whole lot easier to reason about.
I'd say that one of the biggest things many of these NoSQL data stores bring is auto-sharding and being able to query/map-reduce a distributed data source. Relational DBs work pretty well as long as you can scale your data vertically and/or don't need to query across databases. Horizontal scaling is the tricky part for relational DBs, and it's what NoSQL data stores market as their big selling point.
I remember reading that article. It read like the lessons of an inexperienced developer, not necessarily problems of the technology they were complaining about. BTW, what do you mean by AQL?
NoSQL databases in general (and multi-model databases like ArangoDB in particular) offer greater flexibility in the choice of your data structures than traditional relational databases do. Furthermore, you can configure exactly the right compromise for your application between ACID and eventual consistency, and between consistency and scalability.
"Transactions in ArangoDB are atomic, consistent, isolated, and durable (ACID)." "Collections consist of memory-mapped datafiles...". "by default, ArangoDB uses the eventual way of synchronization...synchronizes data to disk in a background thread."
So it's not ACID by default, and it's practically not usable with immediate sync turned on (a huge number of seeks due to the use of mmap), just like Mongo.
As in many databases, ArangoDB allows some choices regarding durability.
Immediate disk synchronisation is turned off by default in ArangoDB. Synchronisation is then performed by a background thread, which frequently executes syncs.
By the way, several other NoSQL databases have immediate synching turned off by default, e.g. CouchDB, MongoDB.
In ArangoDB you can turn on immediate synchronisation at the collection level, or use it for specific operations only. So it's up to you how you want to use it.
This gives the database user a fine-grained choice.
I remember using some relational databases in the past where we also turned immediate synchronisation off to get more throughput. So it's probably not entirely uncommon, but I understand the expectation of relational users that everything is fully durable by default.
Memory-mapped files don't have anything to do with ACID. It's just a detail of the internal organisation of buffers. You can have full durability with memory-mapped files. You just have to use msync instead of fsync/sync.
There is a huge difference in the sophistication between what you describe and what a traditional SQL system like Postgres does.
The problem with just mmapping files is that, to sync, you have to do a bunch of random writes. To commit two transactions, you have to jump two places in the disk and do two writes. You can defer them, but then your transaction commit latency goes up dramatically. So the user is between a rock and a hard place: uncertain durability for extended periods of time, or long commit latency.
Compare that to a system based on a Write-Ahead Log (WAL). The log is 100% sequential (and often preallocated in large chunks), and a transaction is durable if the log is flushed up to some certain point. All transactions go into the same log, so under high concurrency, one flush to disk might commit several transactions. And even if you flush for each transaction, at least you don't have to jump around on disk (and, if using a controller with battery-backed cache to reduce latency, you can make do with a fairly small cache).
The writes to the main data area can be deferred for a long time (30 minutes might be normal), and syncing those is called a checkpoint. You can spread the checkpoint out over time (it's a continuous process, really) so that they don't cause transaction latency spikes. Deferring the writes for so long allows the writes to be scheduled more efficiently without sacrificing durability at all.
On top of that, if you are OK with small windows of time before the commits are durable, postgres allows you to choose on a per-transaction basis not to wait for the WAL flush before returning to the client. If you crash, you are guaranteed to be consistent still, and if a normal transaction comes along, it will of course force a WAL flush. You can control the window of time before postgres will force a WAL flush, where 200 milliseconds might be normal. It can be a small number and still gain you a lot because it's just writing a sequential log, so there's no need to defer it for multiple seconds.
In other words: Mmapping files gives you a choice between very short commit latency and long periods of uncertain durability; or long commit latency. WAL gives you a choice between very short commit latency and short periods of uncertain durability; or short commit latency.
I understand ArangoDB is new. The description sounds interesting, and I like some things about the goals. But I think it's way off the mark to tout the durability as offering a nice trade-off, when a much better method (at least for OLTP) has been known for two decades[1].
[1] Basic idea introduced by ARIES paper in 1992. Couldn't find a link to a PDF, but it is a well-known paper.
Saved (durable) and hard to corrupt are different properties of a database. For example, Elasticsearch uses the Lucene index format in the background, which is write-once per segment. Once a segment is written, the data is safe and (apart from disk corruption) impossible to corrupt, since the file is never opened for writing again. However, segments are not written immediately after a document is received, so if you yank the power cord right after a write to the cluster, you'll lose data - but without any danger of corruption, since the last, partially written segment is discarded. CouchDB behaves in a similar fashion: if the last bit of the storage file contains corrupt data, it is discarded. I'm not absolutely certain about the default durability settings in Couch at the moment, so I can't say whether the write happens before or after the "ack" from the server. However, since disk controllers cheat and a "flush" to disk sometimes doesn't actually flush, you can get data loss regardless of the promises your database makes.
Durability is indeed hard to achieve - as you pointed out, disk controllers sometimes simply lie to you.
With respect to corruption, ArangoDB behaves similarly: it uses an append-only log file with CRC checksums. So, if the last bit of storage contains corrupt data, it is discarded.
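Tail-discarding recovery over such a checksummed append-only log can be sketched like this; the record layout and the simple additive checksum are made up for illustration (ArangoDB uses CRC):

```javascript
// Records are [length, checksum, payload]; recovery scans forward and
// stops at the first record that is truncated or fails its checksum,
// so a partially written tail is dropped instead of corrupting data.
function checksum(buf) {
  let sum = 0;
  for (const byte of buf) sum = (sum + byte) >>> 0;
  return sum;
}

function appendRecord(log, text) {
  const payload = Buffer.from(text);
  const header = Buffer.alloc(8);
  header.writeUInt32LE(payload.length, 0);
  header.writeUInt32LE(checksum(payload), 4);
  return Buffer.concat([log, header, payload]);
}

function recover(log) {
  const records = [];
  let offset = 0;
  while (offset + 8 <= log.length) {
    const len = log.readUInt32LE(offset);
    const sum = log.readUInt32LE(offset + 4);
    if (offset + 8 + len > log.length) break;  // truncated tail
    const payload = log.subarray(offset + 8, offset + 8 + len);
    if (checksum(payload) !== sum) break;      // corrupt tail
    records.push(payload.toString());
    offset += 8 + len;
  }
  return records;
}
```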
> "In typical applications with "complex" database operations there is often no clean API to the persistence layer when two or more database operations are executed one after the other which belong together from an architectural perspective."
> Put it differently, what does ArangoDB, MongoDB, whateverDB bring that relational
> databases didn't bring 30 years ago?
(Let's leave MongoDB out here ;-)
What I really love, and what the relational databases do not have:
* Graphs as first class citizens! (try to view them in the web gui :-)
* The tight V8 & JavaScript integration (Foxx is more than cool. Hope I will be able to use it from ClojureScript)
What you might find in earlier databases, but not all together in others today (my personal hitlist :-)):
* The incredible number of indices, even with skiplist and n-gram!
* Multi-core ready
* Durability tuning (already mentioned by Jan)
* AQL covering KV, JSON and Graphs! (Martin Fowler was quite sceptical that this model integration could work...)
* And an MVCC that makes it SSD-ready.
* Capped Collections
* Availability on tons of OSes such as Windows, iOS, all the UNIXes, and even Travis CI (how cool is that?!)
Try it. Might be fun in production compared to other famed NoSQL DBs.... (at least to me)
> There are drivers for all major languages like Ruby, Python, PHP, JavaScript, and Perl.
I chuckled at the absence of the most widely deployed language on the planet.
I dare say that there is a driver for Java - didn't look, because after browsing through a reasonable portion of their site, I still couldn't get a simple explanation of what this DB allegedly does and doesn't do.
This seems to be a MongoDB clone with some extra features added on to make it a bit closer to a relational DB, I guess. Looks interesting, but it likely suffers from the same problems MongoDB suffers from (data safety, scaling difficulties, etc.)
EDIT: Have to say, the idea of a mongodb database with graph operations built in is pretty attractive for small network oriented problems...
I jumped on arango, looking for a small graphdb I could run locally on my machine. They've since moved it to a much wider scope. Turns out I abandoned that project idea (like most... :D), but I think Arango is sufficiently unique to merit more investment.
I have tried to run ArangoDB under Node.js and had some little successes, but in general their claim that they have drivers for many platforms is far from reality. Also, why not release Node.js binary drivers instead of pushing people to use Foxx? Simple browsable docs would be nice too, instead of chunked documents covering hell knows what. Finding the link to their query language was a big hassle :) Their graph traversal is still in its infancy as I understand it.
I agree somewhat about the documentation. The getting-started material is nice, but the gap between reading about the concepts and the actual reference on how to use them is a bit large at times. Best would be reference-based documentation, e.g. as in http://underscorejs.org/
There is a Node.js driver and even an integration into JugglingDB. See https://www.arangodb.org/drivers. And in the end, it's all basically HTTP calls that you are making to query the database.
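For example, an AQL query travels as a POST to ArangoDB's documented /_api/cursor endpoint. A sketch of building that request (the base URL is a placeholder, and actually sending it is left to whatever HTTP client you use):

```javascript
// Build the HTTP request for an AQL query against ArangoDB's cursor
// API (POST /_api/cursor). Sending it is up to your HTTP client.
function aqlRequest(baseUrl, query, bindVars = {}) {
  return {
    method: "POST",
    url: baseUrl + "/_api/cursor",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query, bindVars })
  };
}
```

A "driver" is then essentially a convenience wrapper around requests like this one.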
I would not call that a driver. Also, all of them are developed by third parties, seemingly by single developers, so no standardization is possible. The adoption of a database goes only as far as your ability to use it in your project. And I found that the Node.js drivers in particular were outdated in npm - I had to nudge one of the developers to update the npm repo. Also, a REST access library is called a driver???
The big advantage of ArangoDB with respect to memory/disk usage is that despite the schema-less-ness, the database automatically recognises common "shapes" of the documents in a collection and thus usually does not have to store all attribute names many times. In addition, the possibility of transactions makes it less necessary to keep many old revisions of documents, in comparison to for example MongoDB.
> In ArangoDB, a transaction is always a server-side operation, and is executed on the server in one go, without any client interaction.
It doesn't seem to support interactive transactions. That means only simple batched reads & writes, no complex transactions. It seems to sit between CAS and real generic transactions. Doesn't seem to be very useful.
No, it goes way beyond simple batches of operations. Basically, you write your transaction as a JavaScript program. So you can do anything you could do on the client side, with the exception of waiting for another source (i. e. user interaction). You could read a document from one collection and choose different actions based on an attribute, or change documents in multiple collections. I think the PHP driver uses some kind of abstraction to hide the JavaScript from the developer.
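The shape of such transaction logic, written against in-memory stand-ins for collections so the control flow is visible (the Map-based collection API here is a simplification, not ArangoDB's actual one):

```javascript
// Sketch of a transaction as a plain JavaScript function: read a
// document, branch on one of its attributes, and change documents in
// multiple collections. Collections are modeled as Maps.
function transferAction(accounts, auditLog, fromKey, toKey, amount) {
  const from = accounts.get(fromKey);
  const to = accounts.get(toKey);
  if (from.balance < amount) {
    // Branch on document contents inside the transaction.
    auditLog.set("fail-" + fromKey, { reason: "insufficient funds" });
    return false;
  }
  accounts.set(fromKey, { ...from, balance: from.balance - amount });
  accounts.set(toKey, { ...to, balance: to.balance + amount });
  auditLog.set("ok-" + fromKey, { amount });
  return true;
}
```

Server-side, the whole function would run atomically as the transaction body, touching two collections in one go.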
I'm really excited about the look of this. Being a big fan of Mongo et al. for the structurelessness I also sometimes miss the graph-like structure that you can easily create with SQL. Arango looks cool. I shall try it :)
This is awesome. I like the way the vertices are automatically moved in a way such that they do not overlap very much and that the edges are easily visible. It is also good that "similar" vertices are automatically collapsed into a "multi-vertex". I think this is very useful functionality to inspect a big graph locally.
What I particularly like is the functionality to process graphs and explore them interactively in the browser. This has been added in some recent version, and it makes working with graphs a lot easier than before.
Building a REST interface with access to your data is easy thanks to ArangoDB's Foxx framework. You can implement all your backend code in JavaScript and upload it to the server.
Thus you can do any sort of preprocessing on the server and make that available to frontends. And it's easy to integrate with a front-end because it's all about passing JSON around via HTTP.
Arango is a special sort of avocado - but Mönchengladbach (where Juan Arango is currently under contract) is next to Cologne, ArangoDB's headquarters :-)
Oh, he's so good...!
And I say that coming from Cologne. You have to know that the local club really dislikes Arango's current club (based 50 km away from here).