That said, I think picking the right database is something you can do only with a lot of work. Picking good technologies for your project is hard: you have to try one, then another, and so forth, and even reconsider the state of things after a few years (or months?), given how fast the DB landscape has evolved in recent years.
While I'm at it, I'd like to share that these very days I'm working on a Redis disk back end. I already have a prototype working after a few days of full immersion (I like to use vacation time to work on completely new ideas for Redis).
The idea is that everything is stored on disk, in what is a plain key-value database (complex values are serialized when on disk), and the memory is instead used as an object cache.
It is like taking the current Redis Virtual Memory and inverting the logic completely. The result is the same (working set in memory, the rest on disk), but this implementation means that there are no limits on the data you can put into a single instance, that you don't have slow restarts (data is not loaded into memory unless requested), and that there is no need to fork() to save. Keys marked as "dirty" (modified) are transferred to disk asynchronously as needed, by IO threads.
If everything works as I expect (and initial tests are really encouraging), this means that Redis 2.4 will be out in a few months, completely killing the current Virtual Memory implementation in favor of the new "two back ends" design, where you can select whether you want to run an in-memory DB or an on-disk DB where memory is just an LRU cache for the working set.
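To make the inverted design concrete, here is a minimal sketch of the idea as described above, not the actual Redis code. The class name DiskBackedStore, the plain dict standing in for the on-disk database, and the synchronous flush() are all assumptions made for illustration; the real design hands dirty keys to IO threads.

```python
from collections import OrderedDict


class DiskBackedStore:
    """Toy key-value store: 'disk' holds everything, memory is an LRU cache."""

    def __init__(self, disk, cache_size):
        self.disk = disk              # stand-in for the on-disk key-value DB (assumption)
        self.cache = OrderedDict()    # in-memory object cache, kept in LRU order
        self.cache_size = cache_size
        self.dirty = set()            # keys modified in memory, not yet on disk

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)   # mark as most recently used
            return self.cache[key]
        value = self.disk[key]            # cache miss: load from disk on demand
        self._cache_put(key, value)
        return value

    def set(self, key, value):
        self.dirty.add(key)               # written back later by flush() or eviction
        self._cache_put(key, value)

    def flush(self):
        """Write every dirty key to disk (synchronous here, for simplicity)."""
        for key in list(self.dirty):
            self.disk[key] = self.cache[key]
        self.dirty.clear()

    def _cache_put(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        while len(self.cache) > self.cache_size:
            old_key, old_value = self.cache.popitem(last=False)  # evict LRU entry
            if old_key in self.dirty:     # never drop an unsaved write
                self.disk[old_key] = old_value
                self.dirty.discard(old_key)
```

Nothing is loaded at startup, so a restart costs nothing up front, and only the working set ever occupies memory.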
The new inverted logic for the VM you describe seems very interesting; I'm very much looking forward to seeing 2.4!
Redis is already more than perfect for what we use it for -- keeping track of stock price data, and distributing it. The size of the DB is known in advance (the number of stocks does not grow very fast), and the performance is perfect.
Keep up the good job! (And have a nice new year)
I think the main business of Redis is still as an in-memory DB / cache / messaging system and so forth. We have a decent implementation from this point of view, so the next logical step is making it work in a cluster.
On the other hand, it's really interesting to see what people can do with the Redis data model if much larger datasets can be used without problems (at the cost of performance of course... it can't be as fast as memory). VM was my first idea, but I have to admit I don't like the design at this point. This new design can be much better, and we can have it production ready in a few months. So I'm curious about what will happen in 2011! :)
Thank you and have a nice new year as well.
This is probably the most important and relevant point I've seen in a while. Architects should take note...
Much of the below is stolen from their overview page (all of it needs to be confirmed): http://hbase.apache.org/
WRITTEN IN: Java
MAIN POINT: Hadoop Database
PROTOCOL: A REST-ful Web service gateway
This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.
HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase includes:
Convenient base classes for backing Hadoop MapReduce jobs with HBase tables
Query predicate push down via server side scan and get filters
Optimizations for real time queries
A high performance Thrift gateway
A REST-ful Web service gateway that supports XML, Protobuf, and binary data encoding options
Cascading, hive, and pig source and sink modules
Extensible jruby-based (JIRB) shell
Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX
HBase 0.20 has greatly improved on its predecessors:
No HBase single point of failure
Rolling restart for configuration changes and minor upgrades
Random access performance on par with open source relational databases such as MySQL
FOR EXAMPLE: Facebook Messaging Database
BEST USE: Use it when you need random, realtime read/write access to your Big Data.
There's also a region master re-election/recovery period that depends on the size of the database, network bandwidth, load, etc. It can be anywhere from 30 seconds to tens of minutes. An outage of a region node makes its key range inaccessible. While that might not be a problem for some, especially in read-only situations, I can think of many applications where that would effectively translate into a total outage.
Does HBase now do a good job of random access? I was always under the impression that it did random access adequately, but its real strength was with scans (based on ordering of keys).
I did benchmarks on a previous version, and the results were pretty miserable (600ms/lookup on a table with several million columns). It certainly sounds like they've improved.
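The scan-versus-lookup distinction is easier to see with a toy model. This is only a sketch of a store that keeps rows sorted by key, which is the property that makes range scans cheap; SortedTable and the "user#msg" key layout are made up for illustration and are not HBase's API.

```python
import bisect


class SortedTable:
    """Toy table that keeps rows sorted by key, like a single sorted region."""

    def __init__(self, rows):
        items = sorted(rows.items())
        self.keys = [k for k, _ in items]
        self.values = [v for _, v in items]

    def get(self, key):
        """Random access: one binary search per lookup."""
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

    def scan(self, start, stop):
        """Range scan: one binary search, then a sequential walk over start <= key < stop."""
        i = bisect.bisect_left(self.keys, start)
        while i < len(self.keys) and self.keys[i] < stop:
            yield self.keys[i], self.values[i]
            i += 1


table = SortedTable({
    "user2#msg1": "x",
    "user1#msg2": "b",
    "user1#msg1": "a",
})
# All rows sharing the prefix "user1#" sit next to each other, so one scan fetches them.
user1_rows = list(table.scan("user1#", "user1$"))
```

Choosing keys so that related rows end up adjacent is what turns many random lookups into one scan.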
It would be great to have a "more general" for-example, since no one outside Facebook meets the problem of "let's build Facebook's messaging database" :) Any suggestions?
Apples usually stay crispy unless baked. Good in pies.
Oranges can be sour (or sweet). Do not bake.
Strawberries are red. Good in pies, advise against baking.
Pineapples are rough on the outside. Good fresh, baked, grilled, fried, debatable on pizza.
Grapes come in many colors and sizes. Great fresh or turned into alcoholic beverages.
(Not the worst introduction to fruit, but perhaps superficial? Amirite?)
Pineapple Guava (Acca sellowiana) -- Small green fruit. Seeds soft and edible, skin optional. Turpentine flavor signals overripe. Cold hardy and grown in many parts of the US as an evergreen ornamental. Delicious eaten raw.
Strawberry Guava (Psidium cattleianum) --- Tasty small soft red fruit with very fragrant aroma and many small hard seeds. Skin edible, but seeds best avoided. Can be eaten out of hand, but low commercial use. Frost tolerant in mild climates.
White Guava (Psidium guajava) --- True tropical guava, thus barely if at all frost-tolerant. Large fragrant fruit with inedible hard seeds. Usually used for juice or puree, rarely eaten out of hand. Wonderful strong aroma increases with ripeness.
While obviously not of use to a producer of guavas, this sort of cheat sheet might be helpful to someone who happens to encounter one of these varieties in a grocery store or tree nursery. At the least, it might keep someone from breaking their teeth on the inedible hard seeds!
Proverbial Apples to Oranges problem
For example, CouchDB having a "Main Point" of "DB consistency" might be the case, as it is for Redis, when there is no replication. In replicated configurations, it is definitely not true. Further, the MVCC is weaker in many ways than in a Dynamo system like Riak as you have no way to influence or discover consistency between replicas.
I'm sure folks expert in other systems can identify similar errors in the rest of the post. Can someone explain to me who the target audience is for all these NoSQL comparison articles? They are universally poor, yet universally popular.
However, there is a fairly nice way of storing older versions of documents - hold older versions as file attachments on the document. See:
Where couchdb has some immense possibilities is in distributed applications, not only server side, but also mobile phones and browsers. Since you can write and contain an entire webapp inside of couchdb, you can technically replicate the entire app to your mobile phone, and it'd work offline or online. And if you need your app on another platform--as long as it has couchdb, you can just replicate it there.
I never see this mentioned in any overviews of comparison for couchdb.
The sticking point right now, though, is that couchdb isn't on very many mobile platforms. There have only been experiments with writing couchdb on top of HTML5's localStorage, and jChris et al are working on Couchdb for android.
CouchDB does not version, period.
As I recall, the id field is just a string. It's just common to let it do the automatic "#-hash" representation.
It's been a while since I played with CouchDB though, so I could be off.
Keep up the good work!
If you need versioned records, you are likely better off identifying your versioning requirements and building to those than trying to piggyback off of something else poorly suited.
Just my $0.02 :)
This shit, AGAIN? Really? No, they are not.
SQL is being replaced in niches that strain its model. Elsewhere, it remains steadfast.
I think nobody expects SQL's "market share" to fall to low levels, especially with noSQL requiring much deeper understanding of the data and its planned use. NoSQL practically operates on a lower layer than SQL does.
Still, it's nice to see people thinking about data storage choices and not going blindly to MySQL/Oracle/etc!
I also don't understand the 15 years figure. Is that a reference to when MySQL was initially released? I hope the original poster understands that SQL is older than that.
I'm old enough to remember somebody being threatened with firing for using SQL, because of its bad performance compared with seeking. :)
I mean, seriously, Zabbix keeping monitoring data in MySQL? Also Piwik? That's a sick solution, IMHO. :)
(BTW, I love Zabbix and Piwik. Use them both. Only I think that having no good alternate data store at the time of their writing, their data storage is suboptimal.)
E.g. I have found out that deploying Tokyo Tyrant in a Rails project requires you to write some scripts to ensure that things run properly. Also the db size has to be set in configuration in advance.
MongoDB OTOH is not designed for a single server environment, has a very small max document size, easily gets corrupted if the process is stopped, etc.
Both are schema free datastores. For me, this is the biggest, most useful difference between them and traditional SQL databases, because it makes things easy that are very, very hard (or inefficient) on an SQL database.
It's probably also worth noting that other NoSQL solutions don't share this advantage. For example, Cassandra requires all nodes to be restarted to apply a schema change, which can be quite a big deal.
That's no longer true. In 0.7, keyspaces and column families may be created, altered, or dropped live.
Anyway, you still need a schema with Cassandra.
A SQL query goes into a bar, walks up to two tables and asks,
"Can I join you?"
"No, but you can enjoy the view."
Also, Redis's main selling point is its extensive data structure/operations support. "Blazingly fast" really depends on what your workload is and what you're comparing it against.
A few other notable Riak features include a JS or Erlang MapReduce API and full-text search.
Blazing fast: I mean compared to the other four.
In-memory operations are fast in many databases. Redis's default configuration (vm-enabled no) does operations only in memory (with an occasional sync to disk). That's terrible durability but fantastic performance. Most databases, including Redis, can be configured for either that sort of high-performance/low-durability or the opposite. It's just that their default settings/behaviors vary widely.
- Facebook designed it for inbox feature
- SOLR/Lucene is being integrated
- recently Sequoia backed Riptano - see http://www.riptano.com/
One major feature differentiator is something it doesn't really talk about, though - how conducive is each system to Massive Data?
For example, he kind of has a bone to pick with Cassandra, which is probably justified. But from what little I know, one of the features of Cassandra is that it's designed to scale pretty much to infinity. That may be true of a couple of the others, but for some (like CouchDB) it isn't a design goal at all.
Actually it is; you might take a peek at BigCouch, it puts the C in CouchDB.
But sure thing, "infinite" scaling is probably best done with the Dynamo-like stuff like Cassandra and/or Riak.
Using it in a recent project and it's been working great for us.
Does anyone have any usage numbers for the different no-sql databases? Or even just for the two most popular ones? I guess some of them will rise above the others in the following years and some will drop. Usage numbers would indicate which ones have the most potential to stay around and be accepted as standard no-sql databases.
What is your suggestion otherwise? Any distributed database that is going to be inexpensive, performant, scalable, and durable will need to use some kind of quorum read repair system. Riak, Voldemort, and Dynamo all use read repair with high levels of production success.
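For anyone unfamiliar with the term, this is roughly what quorum read repair means. It is a deliberately simplified sketch: each replica is a plain dict, versions are bare integers rather than the vector clocks Dynamo-style systems use, and quorum_read is a made-up name, not any product's API.

```python
def quorum_read(replicas, key, r):
    """Read key from the first r replicas and repair any stale copies found."""
    contacted = replicas[:r]
    # Each replica maps key -> (version, value); a missing key counts as version 0.
    answers = [replica.get(key, (0, None)) for replica in contacted]
    newest = max(answers, key=lambda answer: answer[0])
    for replica, answer in zip(contacted, answers):
        if answer[0] < newest[0]:
            replica[key] = newest   # read repair: push the newest copy to the stale replica
    return newest[1]


fresh = {"k": (2, "new")}
stale = {"k": (1, "old")}
value = quorum_read([fresh, stale], "k", r=2)
```

The read itself is what heals the stale replica, so no separate background sync is needed for keys that are actually being read.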
Clarified it, thanks!
Also of note, if the edits that caused the conflict are replicated to other nodes, each node will independently choose the same revision to use as the 'primary' document response.
Bottom line, the choice is deterministic and is guaranteed to be preserved, but the choice may not be the last written revision.
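A rough sketch of why every node lands on the same answer: the winner is a pure function of the set of conflicting revisions, not of the order in which they arrived. The rule used here (highest generation number, ties broken by comparing the hash part of an "N-hash" revision string) is an assumption for illustration, and pick_winner is a made-up name; check the CouchDB documentation for the exact rule.

```python
def pick_winner(revisions):
    """Pick one revision deterministically from a set of conflicting revisions."""
    def sort_key(rev):
        generation, _, digest = rev.partition("-")
        return int(generation), digest   # compare generation first, then hash
    return max(revisions, key=sort_key)


# Two nodes that received the same conflicting edits in a different order.
node_a = pick_winner(["2-abc", "2-def", "1-zzz"])
node_b = pick_winner(["1-zzz", "2-def", "2-abc"])
```

Both nodes pick "2-def" without talking to each other, even though neither knows which edit was written last.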
Also, bugs that result in corrupt dbs are treated as major bugs as opposed to a part of the design. I've also not seen reports of index corruption under load. If you have logs or any more information, we'd definitely appreciate it if you could put that info into a ticket on JIRA or even just mail email@example.com with details.
Thanks for the feedback!