

Relational database pioneer says technology is obsolete - nickb
http://www.computerworld.com/action/article.do?command=printArticleBasic&articleId=9034619

======
brlewis
He advocates throwing out the _internal storage mechanisms_ of relational
databases and instead having a wide variety of mechanisms suited to different
ways data is used. Storing columnar data together rather than storing rows
together is one example. This does not make the database look any different to
an application programmer.

<http://www.vertica.com/product/overview>

Q: How are Vertica databases accessed?

A: Just like any other relational database, using SQL and/or JDBC and ODBC.
Vertica has partnerships with BI/Reporting software vendors such as Business
Objects, JasperSoft, Informatica, and Talend to help ensure Vertica is
compatible with popular reporting and ETL software.

------
alec
I interviewed at Vertica. They have some really cool stuff going. You can get
a lot of it in C-Store, a BSD-licensed academic project. The main C-Store
paper is very informative if academic papers are your thing.

The big advantage of changing the storage mechanism from row-based to column-
based is that queries over large amounts of mostly-static data go much faster
and take up much less space. MySQL, Postgres, and Oracle are all more or less
row-based - each row or set of rows is stored whole - while in a column-based
store, each column is stored individually, and the rows are held together in
some fashion.

Say you have a lot of columns and few tables, and that all your normal queries
heavily use a few columns - if you're doing marketing maybe that's zip code
and age range, or if you're doing inventory prediction maybe it's item id and
sales date. Pre-sort those columns. If you want to get addresses or sales
records of everyone in a certain zip code of a certain age - say, the
20-somethings that bought at the local Wal-Mart - that's now very quick,
because you can sort a table first on the zip code column, then on the age
column, and then on whatever else. It's like addressing an array, as opposed
to processing each row and testing a few columns.

That's not a big win on your PHP/MySQL website, but that's a huge win in your
10TB data warehouse.

Once you sort, you can do really simple compression like run-length encoding
to get the size down. Maybe you have a million people per zip code, but if
you've sorted on zip code, you only need to say "entries 1 through 1 million
in the zip code column have value XXXXX". You don't get as good compression on
the column used for the next-most-important sort, but it could still be pretty
good.

Vertica also does some cool things with replication and redundancy.

Is column-oriented revolutionary? C-Store has been out for a few years, and
there are a few commercial and free software implementations out there, and
it's in the textbooks, so no.

~~~
alec
I should add that I don't work for them (if that's not already clear) and that
you can read more on their website if you ask them nicely through some form.

------
trekker7
Correct me if I'm wrong, but if you tilt your head to the left by 90 degrees
and look at a database sideways, don't all the rows turn into columns?

Edit: But seriously, does anyone know if this column-oriented stuff is really
revolutionary, and if the article is correct in praising the technology?

~~~
mattculbreth
For datawarehouse applications you're (obviously) doing many more reads than
writes. The number of columns you return for any given ad hoc query is
typically low. Take something like "give me the names and salaries of
employees who missed more than 3 days last quarter and who also work in
California".

Since a single datamart might have 200 columns spread across dimension and
fact tables, and since this query is only touching 5-6 of them, it pays to
make it easy to get to individual columns, instead of reading entire rows. In
a row-store database you'd have to read the main fact table rows and the
Employee and Office dimension rows, and that might add up to 50 columns. If
you can only touch the exact columns you need you're better off.

All that said, I've never done it. I really need to get an open source column
store and whack at it for a bit.

~~~
Tichy
So "column based db" means just to denormalize the database schema?

~~~
brlewis
No, nothing about the schema changes.

What changes is the physical storage of the data. Instead of a disk read
picking up all the columns including unused ones, a given disk read gets many
values for a single column.

------
davidw
This interview with him is more in-depth:

<http://news.ycombinator.com/item?id=49540>

