
Design and Implementation of Modern Column-Oriented Databases (2012) [pdf] - behoove
http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
======
pfarnsworth
Do SSDs or completely in-memory databases obviate the need for column-oriented
databases?

~~~
jandrewrogers
No, the benefit of columnar organization is locality for query evaluation,
which remains valid regardless of the storage media. However, SSDs and
in-memory operation do affect which type of columnar representation is
optimal -- "columnar-oriented" covers very diverse organizational structures.
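
As a rough illustration of the locality point, here is a minimal Python sketch
contrasting a row layout with a column layout for a single-attribute
aggregate. The table, field names, and function names are all hypothetical,
and Python does not expose cache behaviour directly; the sketch only shows
which data each layout has to touch.

```python
from array import array

# Hypothetical toy table of (user_id, age, balance) records.
rows = [(1, 34, 120.5), (2, 27, 80.0), (3, 45, 310.25)]


def sum_balance_row_store(rows):
    # Row layout: each record's fields sit together, so the scan walks
    # over every record even though only one field is needed.
    return sum(record[2] for record in rows)


# Column layout: each attribute lives in its own contiguous typed array.
columns = {
    "user_id": array("q", (r[0] for r in rows)),
    "age": array("q", (r[1] for r in rows)),
    "balance": array("d", (r[2] for r in rows)),
}


def sum_balance_column_store(columns):
    # Only the "balance" array is read; the other columns are never touched.
    return sum(columns["balance"])
```

Both functions return the same total; the difference is that the column
version reads one densely packed array, which is the property that carries
over to disk, SSD, and RAM alike.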

The classical "turn every column into an index" columnar-oriented model has
some serious drawbacks for many workloads. If you are using SSDs, you have the
ability to do random access, sub-page reads, etc. very efficiently. This
allows you to use a _different_ family of columnar-oriented database designs,
particularly the various vector-structured storage models, which retain almost
all of the benefits of traditional columnar-oriented databases without most of
the drawbacks (given that you are using SSDs). So SSDs really just allow you
to use a better columnar model with fewer constraints.

In-memory is not a database architecture per se, but the benefits of locality
(e.g. reducing cache line misses) still apply. Most database engines operate
almost entirely in-memory for many workloads in any case.

~~~
nkurz
Are there any papers, descriptions, or examples you could point to that
exemplify the architecture that you think makes the most sense for SSDs?

Also, I was sad to see that SpaceCurve appears to be gone! I was about to
recommend it to a friend as a potential solution to her spatial database
needs. Can we hope that you are trying to continue the concept under a
different structure?

~~~
jasonwatkinspdx
The papers from MonetDB/x100 have a lot of interesting info. Here's one to
start with:
[http://www.vldb.org/conf/1999/P5.pdf](http://www.vldb.org/conf/1999/P5.pdf)

~~~
nkurz
Great paper, but I was wondering particularly about the "optimizing for SSD"
aspect. The paper you cite seems to deal only with optimizing for "in RAM"
databases. While one might hope that the same techniques would work for both,
I think Andrew may have important insights into the differences.

~~~
jasonwatkinspdx
There's overlap, and the papers that cite that one are a good place to start
running things down. There's plenty of recently published work on
SSD-specific database engine approaches, and I'd specifically suggest the
"don't stack your log on my log" paper for a really basic intro [1].

Andrew can speak for himself but I'll say I've not found anything substantial
that is similar to the marketing claims of SpaceCurve in open published work.
Some of the comments about memory sharding cut against the grain of what I see
as the trend in high-performance published work, which generally sticks to a
single writer via CoW and/or log structure but leans heavily on shared reading
of memory.
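
To make the "single writer via CoW, shared reads" pattern concrete, here is a
minimal Python sketch. It is not taken from any of the systems or papers
discussed in this thread; the class and method names are hypothetical, and it
assumes that rebinding an attribute is enough to publish a new version, which
a real engine would handle with far more care.

```python
import threading


class CowColumn:
    """Toy copy-on-write column: one writer at a time, lock-free readers."""

    def __init__(self, values=()):
        self._snapshot = tuple(values)       # immutable, so safe to share
        self._write_lock = threading.Lock()  # serializes writers

    def snapshot(self):
        # Readers take a reference to the current immutable version and
        # can keep using it no matter what the writer does afterwards.
        return self._snapshot

    def append(self, value):
        with self._write_lock:
            # Build a new version, then publish it by swapping the reference.
            self._snapshot = self._snapshot + (value,)


col = CowColumn([1, 2])
old_view = col.snapshot()
col.append(3)
new_view = col.snapshot()
```

Here `old_view` still sees the two original values after the append, while
`new_view` sees all three. Copying the whole tuple on every write is only
acceptable in a toy; the point is that readers never take a lock.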

It may be unfair, but given several years and tens of millions in funding,
I'd expect them to be able to present something more convincing, or at
least some benchmarks. I suspect the approach is something smart people have
found promising when doing diligence on it, but that is proving difficult to
turn into a product due to drawbacks not covered in the marketing materials,
blog posts, and comments here.

[1]:
[https://www.usenix.org/system/files/conference/inflow14/infl...](https://www.usenix.org/system/files/conference/inflow14/inflow14-yang.pdf)

