

Why are column oriented databases so much faster than row oriented databases? - siganakis
http://siganakis.com/using-bitmap-indexes-in-query-processing

======
radicalbyte
They're faster - often by factors - than row based database if your workload
is mainly based on aggregates. That's why they're so effective for
datawarehousing.

On the other hand if your workload is update/delete heavy then they can be
much slower than row-based databases.

Just imagine that you're updating a dozen rows at a single record. In a row
database that's one search, in a column database it's 12 searches.

They're complimentary to each other - that's why you often see say banks using
row-based databases for the day-to-day workload (lots of updates) and column
databases for analytics (lots of data being aggregated).

------
mayanksinghal
There is a paper on a same topic:

Column-Stores vs. Row-Stores: How Different Are They Really?
[<http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf>]

This is a very interesting paper where the researchers have compared
performance of Column Store databases with plain Row Store databases as well
as Row Store databases emulating column stores by using vertical partitioning
and other techniques.

There is also A Comparison of C-Store and Row-Store in a Common Framework
[<http://pages.cs.wisc.edu/~alanh/tr.pdf>] where researchers demonstrate that
the gain that Column Store databases give can be easily achieve in row stores
by but by other techniques.

~~~
ww520
From what I remember the advantage of Column-Oriented DB is its data locality
- the data of a column are stored together so that more of them can be loaded
into memory in one shot, less disk seeks. They are good when doing aggregation
(sum,count,groupby) on a few columns out of a table with many columns since
the rest of the columns are not touched at all. When the full rows have to be
retrieved, there's no advantage or may be even slower than row-based db.

~~~
jondot
It can also be on the same _machine_, when you look at distributed data
stores. One of the pains of "scanning" vertically is that you have to go
through all of the machines in your cluster, where each takes a shard of your
data.

Cassandra's implementation provides many insights into this question. Not only
data locality from the entity's POV, but also from the machine's POV.

Cassandra also stores the data on disk in a forward-only data structure (to
avoid disk seeks), which is periodically compacted (deletes, updates) and
employs bloom filters to ease up on cache/disk hits and misses.

I guess one more advantage of a practical column store like Cassandra is that
you are being forced to shape your data to your queries (no relational
calculus/algebra as in RDBMS) which requires much more thought.

~~~
MichaelSalib
_you are being forced to shape your data to your queries (no relational
calculus/algebra as in RDBMS)_

I think this is wrong twice over. First, there are commercial column store
RDBMS systems already, starting with Vertica. And secondly, in any large
RDBMS, you have to shape your data to your queries and your queries to your
data. There's a reason that people with very large databases want very skilled
DBAs and DB developers. The notion that row stores allow you to build enormous
data warehouses without giving any thought to how you layout data on disk,
what sort of indexing you use, how to normalize, etc is just absurd.

------
jorgeleo
For a project where I needed to quickly partition a data tale of millions of
rows at multiple levels, I wrote a column database with bitmap index for data
storage. It is 100% c# code. The queries and data manipulation is lighting
fast the data loading is a little bit slower than traditional db because all
the index that needs to be created on the fly, but once is done, is a sweet
ride after. Things that I learned to consider, the quality of the compression
algorithm, there is a trade off between compressing and some specifics type of
queries. Also, the cardinality of the column ( number of different values in
the column ) will impact the indices that need to be created and the query
time response, if the cardinality is too high, b-tree might be a better
alternative. I was working towards making a product out of it, but then I saw
SQL Server new column index and I drop it. I don't think anybody would be
interested in a little dll, when the big boy is offering the same
functionality.

~~~
baconner
I'm pretty sure the new column store indexing in SQL server is enterprise
edition only (not even in the new bi license) which in my experience prices
out a lot of small businesses. Might be a niche in there somewhere to be
served.

~~~
jorgeleo
Actually I am observing this news... and how the licencing for SQL Server will
end up panning out.

Some people here still thinks that row oriented databases can compete in
aggregation speed, so if I do a MVP it will require a lot of teaching.

------
ww520
One question. Isn't this how B+ Tree index on a column work? A BT index is
just the sorted values of a column with pointers to the row id of the actually
rows. Note that BT index are often compressed as well since the sorted values
have many similarity to its neighbors.

Also what's the difference between the bitmap part and the Bitmap Index
(<http://en.wikipedia.org/wiki/Bitmap_index>)?

~~~
jorgeleo
The difference is in the operations. With a b tree the software operates on 1
row at a time, but when represented as a series of bit, then in a 32 bit CPU
you can process 32 rows in 1 instruction. Which leads to bitmap index over
GPU...

~~~
nivertech
It will be 64bit at a time using plain CPU instruction or 128-512bits at a
time using SIMD or GPU.

Also article describes a word-aligned bitmaps data structure, which uses RLE,
so it will skip areas of all-zeros or all-ones.

~~~
jorgeleo
Yes, the only problem I found when writing my own bitmap index is they become
cumbersome when asking the question: what value has this column for this row.
RLE does not help there. But they are extremely fast when asking the question:
what rows has this value on this column. And that is the whole concept, the
ability to flip the question.

------
jmah
Thanks, FastBit looks quite interesting, but the site is so... academic. So
much great research is lurking behind "unassuming" web presences. 'Someone
should do something!'

<https://sdm.lbl.gov/fastbit/>

~~~
siganakis
It is also worth noting that some of the technology underpinning FastBit is
actually patented (the Word Aligned Hybrid method of run length encoding).

<http://www.freepatentsonline.com/6831575.html>

Basically the patent covers the method of run length encoding. Each bitmap is
stored as an array of words (32/64 bit ints), with each word being either a
"fill word" telling it that the next x words are all 0/1 or a "literal word"
containing an actual sequence of 0's and 1's.

Its a pretty sorry state of affairs when core, relatively simple data
structures are patentable. Imagine if someone had patented the linked-list or
binary search tree!

Interestingly, FastBit (which uses this IP) is released under the LGPL

[Edit: improved clarity]

~~~
justincormack
That, er, seems obvious, and I would have thought has prior art...

~~~
jorgeleo
In the US prios art does not count anymore. There are papers since the 60s
about bitmap indices, but now is first to file not first to invent... Politics
screw us up again

~~~
mcherm
Legally, prior art still invalidates a patent. But that only matters if you
can afford to litigate.

~~~
jorgeleo
Link to proof this? I am very curious...

------
kcl
Column-oriented databases are faster because computers are built for arrays
and not rows. It is a lot easier to pass through a simple column in memory or
read a simple column from the disk. The pass through the memory benefits from
all the caching functionality on the chip, and the pass through the disk
benefits from the arm not having to bounce around. Essentially you move at the
maximum speed allowed by the equipment. Row-based storage defeats all that.

Indexing is also improved with a column store. If you have a row-based storage
mechanism, particularly one with variably sized entries, you're going to need
a more complicated indexing scheme.

~~~
Devilboy
wat

------
Maro
I have no direct experience with column oriented DBs, but I'd expect if you
have a wide table and are getting all the columns in a query, then a column
oriented DB's performance would suffer relative to a row oriented one's. So
it's a trade-off, sometimes it's slower, sometimes it's faster, depends on the
use-case.

~~~
jorgeleo
Not accurate, you don't need to get all the columns, actually you don't need
to operate over all the values of the column. If the data oes not change
frequently, then column oriented is faster. The trade off is how many
structures needed to be maintained when rows updates... Then row oriented is
faster.

------
ruggeri
Column-stores are a big part of the technology described in Google's Dremel
paper (<http://research.google.com/pubs/pub36632.html>); they used a column-
oriented format to build a system to support interactive queries to big
datasets.

------
emmelaich
If you consider column oriented databases as relational database normalisation
taken to the extreme, this makes sense. Apart from the semantic benefits, Codd
and Date have argued normalisation gives an efficiency benefit.

~~~
artsrc
If the physical benefit of normalization is related to sharing the
representation of repeated values, then perhaps this reduces the physical
benefit of normalization, because shared values are automatically captured
just once.

------
Jach
If you'd like to play with a SQL-standard column-store, check out the open
source LucidDB. <http://www.luciddb.org/html/main.html>

------
tantalor
See also, <http://en.wikipedia.org/wiki/Postings_file>

------
ThaddeusQuay2
<http://en.wikipedia.org/wiki/Column-oriented_DBMS>

