They're faster - often by factors - than row based database if your workload is mainly based on aggregates. That's why they're so effective for datawarehousing.
On the other hand if your workload is update/delete heavy then they can be much slower than row-based databases.
Just imagine that you're updating a dozen rows at a single record. In a row database that's one search, in a column database it's 12 searches.
They're complimentary to each other - that's why you often see say banks using row-based databases for the day-to-day workload (lots of updates) and column databases for analytics (lots of data being aggregated).
This is a very interesting paper where the researchers have compared performance of Column Store databases with plain Row Store databases as well as Row Store databases emulating column stores by using vertical partitioning and other techniques.
There is also A Comparison of C-Store and Row-Store in a Common Framework [http://pages.cs.wisc.edu/~alanh/tr.pdf] where researchers demonstrate that the gain that Column Store databases give can be easily achieve in row stores by but by other techniques.
From what I remember the advantage of Column-Oriented DB is its data locality - the data of a column are stored together so that more of them can be loaded into memory in one shot, less disk seeks. They are good when doing aggregation (sum,count,groupby) on a few columns out of a table with many columns since the rest of the columns are not touched at all. When the full rows have to be retrieved, there's no advantage or may be even slower than row-based db.
They also tend to have more effective compression. C-store (vertica) is also interesting because it takes advantage of compressed data inside the query engine to make queries faster, rather than taking the more traditional step of uncompressing the information and then processing it.
One of the main disadvantages (particularly on more naive implementations) is that incremental inserts can be slow, due to the number of disk seeks required to insert a row.
It can also be on the same _machine_, when you look at distributed data stores. One of the pains of "scanning" vertically is that you have to go through all of the machines in your cluster, where each takes a shard of your data.
Cassandra's implementation provides many insights into this question.
Not only data locality from the entity's POV, but also from the machine's POV.
Cassandra also stores the data on disk in a forward-only data structure (to avoid disk seeks), which is periodically compacted (deletes, updates) and employs bloom filters to ease up on cache/disk hits and misses.
I guess one more advantage of a practical column store like Cassandra is that you are being forced to shape your data to your queries (no relational calculus/algebra as in RDBMS) which requires much more thought.
you are being forced to shape your data to your queries (no relational calculus/algebra as in RDBMS)
I think this is wrong twice over. First, there are commercial column store RDBMS systems already, starting with Vertica. And secondly, in any large RDBMS, you have to shape your data to your queries and your queries to your data. There's a reason that people with very large databases want very skilled DBAs and DB developers. The notion that row stores allow you to build enormous data warehouses without giving any thought to how you layout data on disk, what sort of indexing you use, how to normalize, etc is just absurd.
This doesn't seem like an advantage to me at all. I'd like the system to be able to evolve regarding the sorts of queries that it answers to, and relational systems seem to be very good in this regard.
From my work there are 2 key advantages that using a column oriented approach bring:
1. There is far less storage required to represent data in a column oriented database due to the ability to compress like data. This storage benefit is really important when you get to tables, as it means you can hold more of your data set in memory at any one time.
2. For tables that have a lot of columns, but queries that only touch a few at a time, it means less work for the query to do, since there is no jumping from row pointed to column pointer.
For case 2, a good index in a normal RDBMS also gives the same benefit. If your query only touched the columns in the index, the query would be blasting fast. The full rows are not used at all. One trick to improve query performance is to create composite index that contains all the columns needed in the query, which is like a mini-table with only the columns in the query.
Yes but each additional index adds to the storage requirement of the data. With column oriented designs each column is often indexed with bitmap indexes which are very small (often the index + raw data is smaller than the raw data stored in an array thanks to compression). Index intersection is also much faster with bitmap indexes rather then b-trees.
One trick to improve query performance is to create composite index that contains all the columns needed in the query, which is like a mini-table with only the columns in the query.
MSSQL also allows you to create an index and include other columns with the index data while leaving them out of the index itself. I don't have time to check if this is standard SQL or an MSSQL extension.
Note that the first paper you cited is at least partially in response to the second paper you cited.
"This simplistic view leads to the assumption that one can obtain the performance benefits of a column-store using a row-store: either by vertically partitioning the schema, or by indexing every column so that columns can be accessed independently. In this paper, we demonstrate that this assumption is false."
"We conclude that while it is not impossible for a row-store to achieve some of the performance advantages of a column-store, changes must be made to both the storage layer and the query executor to fully obtain the benefits of a column-oriented approach."
Yes, there are some obvious benefits of column stores and simple vertical partitioning cannot serve the purpose. Compression and cache locality would still require a lot of changes to be made on lower layers. I think there is ample scope for a DB Engine that allows some tables to be column stores when required.
It is also noted in one of the papers (I forgot which one) that Column stores work better for read intensive use cases, probably because disk seeks would be less (appends/changes to one one section of a table and not 5 sections)
For a project where I needed to quickly partition a data tale of millions of rows at multiple levels, I wrote a column database with bitmap index for data storage. It is 100% c# code.
The queries and data manipulation is lighting fast the data loading is a little bit slower than traditional db because all the index that needs to be created on the fly, but once is done, is a sweet ride after.
Things that I learned to consider, the quality of the compression algorithm, there is a trade off between compressing and some specifics type of queries.
Also, the cardinality of the column ( number of different values in the column ) will impact the indices that need to be created and the query time response, if the cardinality is too high, b-tree might be a better alternative.
I was working towards making a product out of it, but then I saw SQL Server new column index and I drop it. I don't think anybody would be interested in a little dll, when the big boy is offering the same functionality.
I would be very interested in hearing more about that project, as it sounds very similar to what I am working on for my cancer research.
The two problems I keep running into are that run length encoding means you have to read the entirety of the bitmap to get to the value you want, and looking up a particular value for a particular row (where you have to scan all the bitmaps for that column) - which as you state makes high cardinality columns slow. I would be very interested in learning about ways to overcome these issues.
Hit me up at terence [at] siganakis [dot] com if you would like to talk more, or on twitter at @siganakis (since I can't find any contact information in your profile).
I'm pretty sure the new column store indexing in SQL server is enterprise edition only (not even in the new bi license) which in my experience prices out a lot of small businesses. Might be a niche in there somewhere to be served.
One question. Isn't this how B+ Tree index on a column work? A BT index is just the sorted values of a column with pointers to the row id of the actually rows. Note that BT index are often compressed as well since the sorted values have many similarity to its neighbors.
The difference is in the operations. With a b tree the software operates on 1 row at a time, but when represented as a series of bit, then in a 32 bit CPU you can process 32 rows in 1 instruction. Which leads to bitmap index over GPU...
Yes, the only problem I found when writing my own bitmap index is they become cumbersome when asking the question: what value has this column for this row. RLE does not help there.
But they are extremely fast when asking the question: what rows has this value on this column. And that is the whole concept, the ability to flip the question.
Basically the patent covers the method of run length encoding. Each bitmap is stored as an array of words (32/64 bit ints), with each word being either a "fill word" telling it that the next x words are all 0/1 or a "literal word" containing an actual sequence of 0's and 1's.
Its a pretty sorry state of affairs when core, relatively simple data structures are patentable. Imagine if someone had patented the linked-list or binary search tree!
Interestingly, FastBit (which uses this IP) is released under the LGPL
Column-oriented databases are faster because computers are built for arrays and not rows. It is a lot easier to pass through a simple column in memory or read a simple column from the disk. The pass through the memory benefits from all the caching functionality on the chip, and the pass through the disk benefits from the arm not having to bounce around. Essentially you move at the maximum speed allowed by the equipment. Row-based storage defeats all that.
Indexing is also improved with a column store. If you have a row-based storage mechanism, particularly one with variably sized entries, you're going to need a more complicated indexing scheme.
I have no direct experience with column oriented DBs, but I'd expect if you have a wide table and are getting all the columns in a query, then a column oriented DB's performance would suffer relative to a row oriented one's. So it's a trade-off, sometimes it's slower, sometimes it's faster, depends on the use-case.
Not accurate, you don't need to get all the columns, actually you don't need to operate over all the values of the column. If the data oes not change frequently, then column oriented is faster. The trade off is how many structures needed to be maintained when rows updates... Then row oriented is faster.
Column-stores are a big part of the technology described in Google's Dremel paper (http://research.google.com/pubs/pub36632.html); they used a column-oriented format to build a system to support interactive queries to big datasets.
If you consider column oriented databases as relational database normalisation taken to the extreme, this makes sense.
Apart from the semantic benefits, Codd and Date have argued normalisation gives an efficiency benefit.
If the physical benefit of normalization is related to sharing the representation of repeated values, then perhaps this reduces the physical benefit of normalization, because shared values are automatically captured just once.