Design and Implementation of Modern Column-Oriented Databases (2012) [pdf] (csail.mit.edu)
90 points by behoove on May 30, 2016 | 9 comments

Do SSDs or completely in-memory databases obviate the need for column-oriented databases?

No. The benefit of columnar organization is locality during query evaluation, which holds regardless of the storage medium. The storage medium does, however, affect which columnar representation is optimal: "column-oriented" covers a very diverse set of organizational structures.

The classical "turn every column into an index" column-oriented model has some serious drawbacks for many workloads. If you are using SSDs, you can do random access, sub-page reads, etc. very efficiently. This lets you use a different family of column-oriented database designs, particularly the various vector-structured storage models, which retain almost all of the benefits of traditional column-oriented databases without most of the drawbacks (given that you are using SSDs). So SSDs really just allow you to use a better columnar model with fewer constraints.

In-memory is not a database architecture per se, but the benefits of locality (e.g. reducing cache line misses) still apply. Most database engines operate almost entirely in-memory for many workloads in any case.

Can you explain a bit more? I'd like to know what the serious drawbacks are of the "turn every column into an index" model, how the random access properties of SSDs change the situation, and what you mean by "vector-structured storage".

Are there any papers, descriptions, or examples you could point to that exemplify the architecture that you think makes the most sense for SSDs?

Also, I was sad to see that SpaceCurve appears to be gone! I was about to recommend it to a friend as a potential solution to her spatial database needs. Can we hope that you are trying to continue the concept under a different structure?

The papers from MonetDB/x100 have a lot of interesting info. Here's one to start with: http://www.vldb.org/conf/1999/P5.pdf

Great paper, but I was wondering particularly about the "optimizing for SSD" aspect. The paper you cite seems to deal only with optimizing for "in RAM" databases. While one might hope that the same techniques would work for both, I think Andrew may have important insights into the differences.

There's overlap, and the papers that cite that one are a good place to start running things down. There's plenty of recently published work on SSD-specific database engine approaches, and I'd specifically suggest the "Don't Stack Your Log on My Log" paper for a really basic intro [1].

Andrew can speak for himself, but I'll say I've not found anything substantial in openly published work that resembles the marketing claims of SpaceCurve. Some of the comments about memory sharding cut against the grain of what I see as the trend in published high-performance work, which generally sticks to a single writer via CoW and/or a log structure but leans heavily on shared reading of memory.

It may be unfair, but given several years and tens of millions in funding, I'd expect them to be able to present something more convincing, or at least some benchmarks. I suspect the approach is something smart people found promising when doing diligence on it, but that it is proving difficult to turn into a product due to drawbacks not covered in the marketing materials, blog posts, and comments here.

[1]: https://www.usenix.org/system/files/conference/inflow14/infl...

No, since the same principles apply. An in-memory database searching for a value or aggregating is easily constrained by the bandwidth between memory and processor, so column-oriented layout still makes sense if your workload means you are often scanning through a column as fast as possible.
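To make the bandwidth point concrete, here is a minimal sketch in plain Python. The three-column table (id, price, qty) and all function names are made up for illustration and do not come from any particular engine. It only counts how many bytes a single-column sum has to pull through memory under each layout; it does not measure real cache or bus behavior.

```python
import struct
from array import array

# Hypothetical table schema: (id: int64, price: float64, qty: int64).
ROW_FORMAT = "<qdq"
ROW_SIZE = struct.calcsize(ROW_FORMAT)  # 24 bytes per row


def build_row_store(rows):
    """Pack every row contiguously: id, price, qty, id, price, qty, ..."""
    buf = bytearray()
    for row in rows:
        buf += struct.pack(ROW_FORMAT, *row)
    return bytes(buf)


def build_column_store(rows):
    """Keep each column in its own contiguous typed array."""
    return {
        "id": array("q", (r[0] for r in rows)),
        "price": array("d", (r[1] for r in rows)),
        "qty": array("q", (r[2] for r in rows)),
    }


def sum_price_row_store(buf):
    """Sum price; the scan has to walk over every byte of every row."""
    total = 0.0
    for _id, price, _qty in struct.iter_unpack(ROW_FORMAT, buf):
        total += price
    return total, len(buf)


def sum_price_column_store(cols):
    """Sum price; the scan only touches the price array."""
    col = cols["price"]
    return sum(col), col.itemsize * len(col)


rows = [(i, float(i), i * 2) for i in range(1000)]
row_total, row_bytes = sum_price_row_store(build_row_store(rows))
col_total, col_bytes = sum_price_column_store(build_column_store(rows))
print(row_total == col_total, row_bytes, col_bytes)
```

With this made-up schema the row layout moves three times as many bytes to answer the same query, and the gap grows with the number of columns the query does not need.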

No. Using a columnar format means you can get compression and vectorization that blow a row-oriented format out of the water for scans.
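As a rough illustration of the compression half of that claim, here is a sketch of run-length encoding (RLE), one of the simpler schemes that works well on a column because neighboring values share a type and often repeat. The function names are hypothetical, and this is not how any specific engine implements it. Pure Python also cannot show SIMD vectorization, so that half is left out.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def count_equal(runs, target):
    """Count matching rows by visiting each run once, not each row."""
    return sum(length for value, length in runs if value == target)


def rle_sum(runs):
    """Sum a numeric column directly on its compressed form."""
    return sum(value * length for value, length in runs)


country = ["US"] * 5 + ["DE"] * 3 + ["US"] * 2
country_runs = rle_encode(country)
print(country_runs)
print(count_equal(country_runs, "US"))

amount_runs = rle_encode([3, 3, 3, 7, 7])
print(rle_sum(amount_runs))
```

The scan does work proportional to the number of runs rather than the number of rows, which is why sorted or low-cardinality columns compress and scan so well. A row-oriented layout interleaves values of different columns, so runs like these rarely form.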
