On the other hand, if your workload is update/delete-heavy, they can be much slower than row-based databases.
Just imagine that you're updating a dozen fields in a single record. In a row database that's one search; in a column database it's 12 searches.
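A toy sketch of that difference (pure illustration, not a real storage engine - the "seek" counts just tally how many separate regions an update touches in each layout):

```python
# Toy model: the same 4x12 table stored row-wise and column-wise.
NUM_COLS = 12
NUM_ROWS = 4

# Row layout: each record's 12 fields sit together.
row_store = [[0] * NUM_COLS for _ in range(NUM_ROWS)]

# Column layout: each column is contiguous, so one record's fields
# are scattered across 12 separate arrays.
col_store = [[0] * NUM_ROWS for _ in range(NUM_COLS)]

def update_record(row_id, new_values):
    # Row store: one seek to the record, then a contiguous write.
    row_store[row_id] = list(new_values)
    regions_row = 1

    # Column store: one seek per column touched.
    for col_id, v in enumerate(new_values):
        col_store[col_id][row_id] = v
    regions_col = NUM_COLS

    return regions_row, regions_col

print(update_record(2, list(range(12))))  # → (1, 12)
```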
They're complementary to each other - that's why you often see, say, banks using row-based databases for the day-to-day workload (lots of updates) and column databases for analytics (lots of data being aggregated).
Column-Stores vs. Row-Stores: How Different Are They Really?
This is a very interesting paper in which the researchers compare the performance of column-store databases against plain row-store databases, as well as against row stores emulating column stores through vertical partitioning and other techniques.
There is also A Comparison of C-Store and Row-Store in a Common Framework [http://pages.cs.wisc.edu/~alanh/tr.pdf], where the researchers demonstrate that the gains column-store databases offer can be achieved in row stores by other techniques.
One of the main disadvantages (particularly on more naive implementations) is that incremental inserts can be slow, due to the number of disk seeks required to insert a row.
I do recommend the c-store paper, it's very well written: http://people.csail.mit.edu/tdanford/6830papers/stonebraker-...
Cassandra's implementation provides many insights into this question.
Not only data locality from the entity's POV, but also from the machine's POV.
Cassandra also stores the data on disk in a forward-only data structure (to avoid disk seeks), which is periodically compacted (deletes, updates) and employs bloom filters to ease up on cache/disk hits and misses.
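Cassandra's actual Bloom filter lives in its Java codebase and is more involved, but a minimal sketch of the idea - a small in-memory structure that answers "definitely absent" or "possibly present" before any disk read happens - might look like this (sizes and hash scheme are made up for illustration):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch (not Cassandra's implementation).
    A lookup that returns False means the key is definitely not in the
    SSTable, so the disk read can be skipped entirely."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into one big int

    def _positions(self, key):
        # Derive k bit positions per key by salting a hash.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # True = "possibly present" (false positives are possible);
        # False = "definitely absent" (never a false negative).
        return all(self.bits >> p & 1 for p in self._positions(key))

bf = BloomFilter()
bf.add("row:42")
print(bf.might_contain("row:42"))   # True
print(bf.might_contain("row:999"))  # almost certainly False
```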
I guess one more advantage of a practical column store like Cassandra is that you are forced to shape your data to your queries (no relational calculus/algebra as in an RDBMS), which requires much more thought.
I think this is wrong twice over. First, there are commercial column-store RDBMSs already, starting with Vertica. And secondly, in any large RDBMS, you have to shape your data to your queries and your queries to your data. There's a reason that people with very large databases want very skilled DBAs and DB developers. The notion that row stores allow you to build enormous data warehouses without giving any thought to how you lay out data on disk, what sort of indexing you use, how to normalize, etc. is just absurd.
This doesn't seem like an advantage to me at all. I'd like the system to be able to evolve in the sorts of queries it can answer, and relational systems seem to be very good in this regard.
From my work, there are 2 key advantages that using a column-oriented approach brings:
1. There is far less storage required to represent data in a column-oriented database, due to the ability to compress like data. This storage benefit is really important when you get to large tables, as it means you can hold more of your data set in memory at any one time.
2. For tables that have a lot of columns but queries that only touch a few at a time, it means less work for the query to do, since there is no jumping from row pointer to column pointer.
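A toy sketch of both advantages (the `status` column and the 100-column table are made-up examples):

```python
# 1. Like data compresses well: run-length encode a sorted,
#    low-cardinality column into (value, count) runs.
def rle_encode(column):
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

status = ["active"] * 500 + ["churned"] * 300 + ["trial"] * 200
runs = rle_encode(status)
print(len(runs))  # → 3 runs instead of 1000 stored values

# 2. Column pruning: to aggregate one column, a column store reads
#    only that column's array; a row store walks every full row.
columns = {"id": list(range(1000)), "revenue": [10] * 1000}
# ...imagine 98 more columns here that are never touched...
total = sum(columns["revenue"])  # reads 1 column out of 100
print(total)  # → 10000
```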
MSSQL also allows you to create an index and include other columns with the index data while leaving them out of the index key itself. I don't have time to check if this is standard SQL or an MSSQL extension.
"This simplistic view leads to the assumption that one can obtain the performance benefits of a column-store using a row-store: either by vertically partitioning the schema, or by indexing every column so that columns can be accessed independently. In this paper, we demonstrate that this assumption is false."
"We conclude that while it is not impossible for a row-store to achieve some of the performance advantages of a column-store, changes must be made to both the storage layer and the query executor to fully obtain the benefits of a column-oriented approach."
It is also noted in one of the papers (I forget which one) that column stores work better for read-intensive use cases, probably because there would be fewer disk seeks (appends/changes go to one section of a table and not 5 sections).
The two problems I keep running into are that run-length encoding means you have to read the entirety of the bitmap to get to the value you want, and that looking up a particular value for a particular row means scanning all the bitmaps for that column - which, as you state, makes high-cardinality columns slow. I would be very interested in learning about ways to overcome these issues.
Hit me up at terence [at] siganakis [dot] com if you would like to talk more, or on twitter at @siganakis (since I can't find any contact information in your profile).
Some people here still think that row-oriented databases can compete in aggregation speed, so if I do an MVP it will require a lot of teaching.
Also what's the difference between the bitmap part and the Bitmap Index (http://en.wikipedia.org/wiki/Bitmap_index)?
Also, the article describes a word-aligned bitmap data structure, which uses RLE, so it will skip runs of all-zeros or all-ones.
Basically the patent covers the method of run length encoding. Each bitmap is stored as an array of words (32/64 bit ints), with each word being either a "fill word" telling it that the next x words are all 0/1 or a "literal word" containing an actual sequence of 0's and 1's.
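A rough sketch of that scheme (the word layout is simplified to tagged tuples rather than packed 32-bit ints, and the 31-bit payload size is the usual WAH-style convention, not taken from the patent text). Note the lookup: even with fills skipped in O(1) each, finding one bit still means walking the words from the start, which is the random-access problem mentioned above.

```python
WORD = 31  # payload bits per encoded word (one bit reserved for the tag)

def wah_encode(bits):
    """Encode a bit list into words: ("fill", bit, n) says the next n
    31-bit groups are all `bit`; ("literal", group) holds raw bits."""
    bits = bits + [0] * (-len(bits) % WORD)  # pad to a group boundary
    groups = [tuple(bits[i:i + WORD]) for i in range(0, len(bits), WORD)]
    words = []
    for g in groups:
        if g == (0,) * WORD or g == (1,) * WORD:
            fill_bit = g[0]
            if words and words[-1][0] == "fill" and words[-1][1] == fill_bit:
                # Extend the previous fill word's run length.
                words[-1] = ("fill", fill_bit, words[-1][2] + 1)
            else:
                words.append(("fill", fill_bit, 1))
        else:
            words.append(("literal", g))
    return words

def wah_get(words, i):
    """Read bit i: a linear scan over words, skipping fills in O(1)."""
    pos = 0
    for w in words:
        span = WORD * (w[2] if w[0] == "fill" else 1)
        if i < pos + span:
            return w[1] if w[0] == "fill" else w[1][i - pos]
        pos += span
    return 0

bitmap = [0] * 100 + [1] + [0] * 100
words = wah_encode(bitmap)
print(len(words))           # → 3 (fill, literal, fill) for 201 bits
print(wah_get(words, 100))  # → 1
```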
It's a pretty sorry state of affairs when core, relatively simple data structures are patentable. Imagine if someone had patented the linked list or binary search tree!
Interestingly, FastBit (which uses this IP) is released under the LGPL.
[Edit: improved clarity]
Indexing is also improved with a column store. If you have a row-based storage mechanism, particularly one with variably sized entries, you're going to need a more complicated indexing scheme.
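One way to see this: with fixed-width column values, locating any value is plain arithmetic, with no index structure at all. A sketch (assuming a hypothetical 4-byte int column):

```python
import struct

WIDTH = 4  # assume a fixed 4-byte little-endian int column

# A column of values packed into one contiguous buffer.
col = b"".join(struct.pack("<i", v) for v in [7, 11, 13, 17])

def col_get(buf, row_id):
    # O(1) direct addressing: offset = row_id * value width.
    # Variable-sized row records would need an offset table or
    # B-tree here instead.
    return struct.unpack_from("<i", buf, row_id * WIDTH)[0]

print(col_get(col, 2))  # → 13
```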