
If you're interested in reading more about the general design of column stores, I can recommend [0] as a great place to start. It also compares design decisions made in MonetDB to other column stores.

If you want to look at simple code after the reading, I'm slowly implementing a naive column store over at [1] in golang. The overall structure isn't flexible but the individual columns are quite portable. It has a number of tests and benchmarks using a few months of MTG price data I had sitting around.

[0] http://db.csail.mit.edu/pubs/abadi-column-stores.pdf

[1] https://github.com/Everlag/naive-columstore




I'm working on something like this, but only in memory (because I'm building a relational language).

I've been wondering how much better a row or column store could be when used only for the in-memory data a program typically builds up during execution. I wonder if any data exists on this - I'm at too early a stage to have meaningful benchmarks of my own. My intuition is that a column store could be better in the long term, because searching over a few columns dominates everything else (and joins could be better?), but other aspects, like iterating, don't look as good.

BTW, this is in F#


The 'optimal' store really depends on the type of data and queries you're targeting. [0] is probably a good place to start for comparisons. I'd also recommend the other reading in my previous post, but that is much longer.

Sequential scans form the core of most column store queries, so why would iteration be an issue? Do you mean iterations that touch multiple fields of a structure, or similar? In that case, you should be able to break that into a series of efficient, per-column queries which can be lazily materialized (rough sketch below).

[0] http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf
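For what it's worth, here's roughly what I mean by lazy materialization, as a Go sketch (made-up names, not the actual code in [1]): filter one column to get matching positions, then pull values from other columns only for those positions.

    package main

    import "fmt"

    // Column-major layout: each field lives in its own slice.
    type Customers struct {
        Name    []string
        Country []string
        Spend   []float64
    }

    // FilterCountry scans a single column and returns the matching row positions.
    func (c *Customers) FilterCountry(want string) []int {
        var sel []int
        for i, v := range c.Country {
            if v == want {
                sel = append(sel, i)
            }
        }
        return sel
    }

    // MaterializeSpend pulls values from another column only for the selected
    // rows, so untouched columns are never read.
    func (c *Customers) MaterializeSpend(sel []int) []float64 {
        out := make([]float64, 0, len(sel))
        for _, i := range sel {
            out = append(out, c.Spend[i])
        }
        return out
    }

    func main() {
        c := Customers{
            Name:    []string{"a", "b", "c"},
            Country: []string{"CA", "US", "CA"},
            Spend:   []float64{10, 20, 30},
        }
        sel := c.FilterCountry("CA")
        fmt.Println(c.MaterializeSpend(sel)) // [10 30]
    }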


Yeah, normal iterations that select most/all columns like

   for row in customers
and similar...

However, I see that kdb+ is a column store (http://www.timestored.com/kdb-guides/kdb-database-intro) and supposedly its performance is OK?


Is there anything special about the 'store' part of column stores? As in, is their store better in some way than for example splitting your PgSQL tables into key-value tables?

And with regard to constructing SQL queries over column stores, as MonetDB/SQL does: is it more performant than what I described on PgSQL, i.e. doing the query with a bunch of JOINs between the various attribute tables?

I know a lot has been written about column stores and there's some very interesting stuff out there (I particularly like MonetDB/XQuery, but sadly it was discontinued); perhaps a short answer to those questions might inspire some more interest in them.


The store part: A columnstore enables some pretty awesome compression to be put in place, especially in data warehousing workloads. For large, low-cardinality fields, a dictionary encoding can give great compression. Another common data pattern is large runs of repeated values (thousands of entries on the same date), which can be compressed with run-length-encoding very well.

These are two methods utilized in Microsoft's Vertipaq engine (behind SQL Server columnstore indices and SSAS Tabular models). I call it out specifically because this is the technology I am familiar with, but I expect similarities elsewhere.
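To make dictionary encoding concrete, here's a minimal sketch in Go (purely illustrative; it has nothing to do with how Vertipaq actually lays data out): a low-cardinality string column is replaced by small integer codes plus a dictionary.

    package main

    import "fmt"

    // dictEncode maps a low-cardinality string column onto small integer codes.
    // The codes plus the dictionary are usually far smaller than the raw strings.
    func dictEncode(col []string) (codes []uint32, dict []string) {
        index := make(map[string]uint32)
        for _, v := range col {
            code, ok := index[v]
            if !ok {
                code = uint32(len(dict))
                index[v] = code
                dict = append(dict, v)
            }
            codes = append(codes, code)
        }
        return codes, dict
    }

    func main() {
        col := []string{"red", "red", "blue", "red", "blue"}
        codes, dict := dictEncode(col)
        fmt.Println(codes, dict) // [0 0 1 0 1] [red blue]
    }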

Decreasing the memory footprint decreases the hardware requirements and decreases the amount of time required to process the data, which is a boon when many of your queries are typified by a full table scan.

Typically we see 7x-10x compression of Vertipaq data compared to uncompressed SQL tables.

The use case where columnstore engines shine is not OLTP, so a lot of traditional RDBMS use cases don't see a lot of benefit.

I'll repeat one item - if you can describe your access patterns as consisting mostly of table scans (or at least large index scans), rather than seeks, you will likely be able to find benefit from a columnstore engine. This is not to say that there is no other use case.

Edit: I think Everlag summarized it better in a sibling post, but it also sparked another thought about differences between row-based and column-based storage.

Adding columns to a logical SELECT statement in a columnstore imposes a much higher load than in a row-based table. In a traditional RDBMS, once you've gotten to the row's page on disk/in memory, it is essentially free to get 1 or N fields into your resultset. I.e., accessing N fields is O(1).

Since a columnstore does not necessarily store each field contiguously on disk/in memory, it is closer to O(N) in the size of the field list in the resultset.

Don't quote me on the algorithmic complexity of the two, but it's enough for intuition.
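To sketch that intuition (made-up Go types, not any engine's internals): reconstructing a single row from a column layout costs one access per requested field, whereas a row layout hands you the whole row at once.

    package main

    import "fmt"

    // Row layout: one struct per row. Once you've located the row, every field
    // is sitting right there, so reading 1 or N of them costs about the same.
    type OrderRow struct {
        ID       int64
        Customer string
        Amount   float64
    }

    // Column layout: one slice per field. Reconstructing row i means one lookup
    // per requested field, i.e. work proportional to the width of the SELECT.
    type OrderColumns struct {
        ID       []int64
        Customer []string
        Amount   []float64
    }

    func (c *OrderColumns) Row(i int) OrderRow {
        return OrderRow{
            ID:       c.ID[i],       // access 1
            Customer: c.Customer[i], // access 2
            Amount:   c.Amount[i],   // access 3
        }
    }

    func main() {
        cols := OrderColumns{
            ID:       []int64{1, 2},
            Customer: []string{"alice", "bob"},
            Amount:   []float64{9.99, 5.00},
        }
        fmt.Println(cols.Row(1)) // {2 bob 5}
    }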


>>if you can describe your access patterns as consisting mostly of table scans (or at least large index scans), rather than seeks, you will likely be able to find benefit from a columnstore engine.

I think this is very accurate for first generation column stores, but some of the newer research systems are addressing this issue with technologies like behind-the-scenes indexing.


This is interesting to me. What is the indexing strategy that is utilized?

To me the challenge is that the values making up a row are stored non-contiguously. The best case scenario is that a row of N fields requires N data accesses. This is assuming that you can index directly into an array or similar structure with constant access time.

Row-based storage means that you get all fields for "free" when accessing a single row, since the DB engine accesses a page at a time.

This is my understanding of the problem - I am very interested to learn what I'm missing if you have any good articles to share.


HBase + Phoenix is the closest I know of, though the cost of covered indexes is duplication. Phoenix also provides local/global indexes, giving an optimization choice between read-heavy and write-heavy patterns.


Interesting. Do you have any articles to point to explaining the strategy? If not, I can go digging myself now with direction.

Thanks!


Thanks for sharing your thoughts. What's the difference in use-case between OLAP and tabular?


For reference for other readers, OLAP (On-Line Analytical Processing) is both the name for a type of database workload and a former designation of a Microsoft product, SQL Server Analysis Services, designed to fill that role.

As of SQL Server 2012 (arguably 2008R2), SSAS offers two modes of operation, Multidimensional (formerly OLAP), and Tabular. Both of these modes are designed to fit the OLAP workload.

SSAS Multidimensional is the same engine and query language that has been around forever in SSAS - a dimensional data store on-disk, scripted in MDX.

SSAS Tabular is the newer engine built on top of Vertipaq. The query and expression language is DAX.

Presently, either language can be used to query a cube/model operating in either mode, though development is still Multidimensional-MDX and Tabular-DAX.

The way I like to characterize the two is that Multidimensional mode is what your pedantic uncle would put together if he read and loved the Kimball books and the only language he knew was SQL. It looks like SQL, but is semantically VERY different. Tabular is designed to capture the Excel analyst market. DAX is designed to look like Excel formulas, but semantically DAX and the Tabular engine are very similar to SQL.

Neither of these is intended as a negative description.

Multidimensional is excellent for huge datasets, and works great for anything that can map cleanly onto the abstraction of dimensions and measure groups.

Tabular, like I mentioned, maintains many semantic similarities to SQL. The primary abstractions are tables and relationships, with the addition of the idea of filter context that can be automatically propagated. Since the primary operations are on tables and relations, the model is much more general, and I find that I'm often able to map DAX to SQL in my head in real time as I write code. There are, of course, things that are easier in one language than in the other, and vice versa.

DAX is not a general-purpose query language, though. It is designed to support analytical queries, and so it has a great deal of nuance around manipulating filter context and aggregating data.

Multidimensional allows much more fine-grained specification of cube behavior, but consequently demands a slower development cycle. For enterprise scale BI and semantic modeling, I think it is absolutely best in class.

Tabular is a very powerful engine that allows you to get pretty close to Multidimensional in some dimensions, with a much lower development overhead. The Tabular storage engine does demand that all data be resident in memory. There is a pass-through query mode called DirectQuery, which acts essentially as a DAX->SQL translation layer and can remove this requirement, but it typically demands an absolute beast of an RDBMS (not limited to only MS SQL Server) on the back end.

Even though Tabular offers a more flexible semantic model, it still works best with something approximating a dimensional model. The further you depart from a dimensional model, the more expertise you need in DAX to handle it. It's not bad, but the population of people comfortable enough to work at that level is vanishingly small.

This has been a collection of random thoughts. There isn't a single better mode between the two. Both have strengths, but without a specific use case or set of constraints it is difficult to give concrete advice.

The two modes fit the same general use case, but have nuances.


> Tabular is designed to capture the Excel analyst market. DAX is designed to look like Excel formulas

And, in fact, Excel's "Power Pivot" feature (which uses DAX for calculations) is powered by Excel running its own instance of SSAS Tabular.


Yes. Power Pivot is the reason that I said Tabular arguably came to SQL Server 2008R2, which can run Analysis Services in a mode to support a Power Pivot workbook hosted on SharePoint. There was no standalone SSAS Tabular, but the Vertipaq engine and DAX were run on a SQL Server instance.

Power BI Desktop (and hosted data models on the PowerBI.com cloud service) is also powered by the Tabular engine and utilizes DAX as a query and modeling language.

Microsoft is pushing Tabular hard. It's probably the right choice for most organizations, and at worst serves as a great prototyping tool for a Multidimensional model (when we have clients who need Multidimensional for whatever reason, we often still do initial development and pilots on Tabular).


Thanks for the answers. Really appreciated! I am an analyst and I deal with databases, but in a very specific way. The way I tried to explain the difference to myself is that OLAP is used if you have a lot of fine-grained hierarchies accessed by a large number of people. In this scenario it makes sense to compute all those aggregations ahead of time, because a columnar store would have to calculate them each time somebody slices, and that would be waay too expensive, even if you have state-of-the-art caching.


That is a useful distinction to make, but as in all programming tasks, profiling will get you further than intuition when tuning for performance.


(As someone who was pretty familiar with PgSQL internals)

1. I think from the 'storage' point of view, a column-store will use much less space than a collection of key/value PgSQL tables:

A) Each PgSQL table comes with several system columns, so each key/value pair will have the large overhead of (key size + system column size), which will most likely be larger than the actual size of the data. This defeats one of the goals of columnar stores, which is to load less data into memory (to achieve less disk I/O, assuming the more frequently used columns eventually get cached), unless you have a great many columns and use only a very few of them.

B) PostgreSQL doesn't compress each column. Column stores usually use some kind of lightweight compression to use less disk space. When I worked on cstore_fdw, this was one of the desirable features that attracted users.

2. I haven't done PostgreSQL benchmarking in the last ~10 months, but unless your query is highly selective, even the fastest joins won't come close to a sequential table scan. If the query is highly selective and only needs to touch a small subset of rows, then joins + indexes should be faster, but then again you have the overhead of the indexes.

One method I found useful in PostgreSQL is to create an index on the columns used by very frequent queries, and then tune the system to use index-only scans for those queries.


"and then tune the system to use index-only-scan for these queries."

Caaareful there. I hope you're monitoring your query times for outliers.


The 'store' portion buys you a lot of raw performance because it's friendly to modern architectures. tl;dr: frame it as the row store being an array of structures while the column store is a structure of arrays.

Then, your basic way of running a query is a sequential scan over columns. That means queries are going to be more cache friendly while also spending much more time in a trace.
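In Go terms, the two layouts look roughly like this (an illustrative sketch, not anyone's real schema); summing one field walks a single dense slice in the column layout, but drags every field of every row through the cache in the row layout.

    package main

    import "fmt"

    // Row store: an array of structures. Scanning one field still pulls every
    // other field of each row through the cache.
    type TradeRow struct {
        Time   int64
        Symbol string
        Price  float64
        Size   int32
    }

    // Column store: a structure of arrays. Scanning Price walks one densely
    // packed slice, which is what makes sequential scans so cache-friendly.
    type TradeColumns struct {
        Time   []int64
        Symbol []string
        Price  []float64
        Size   []int32
    }

    func main() {
        rows := []TradeRow{{1, "AAPL", 100, 10}, {2, "AAPL", 101, 5}}
        cols := TradeColumns{
            Time:   []int64{1, 2},
            Symbol: []string{"AAPL", "AAPL"},
            Price:  []float64{100, 101},
            Size:   []int32{10, 5},
        }

        var a, b float64
        for _, r := range rows { // row store: whole struct per iteration
            a += r.Price
        }
        for _, p := range cols.Price { // column store: just the one slice
            b += p
        }
        fmt.Println(a, b) // 201 201
    }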

If you sort your data into so-called projections, then the columns you sorted by become the equivalent of sorted arrays, which are extremely compressible. At that point, you can cheaply run-length or delta-encode the sorted columns to achieve incredible compression ratios. Even further, any query using those sorted columns can multiply the effective storage bandwidth available to it by the compression ratio.
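A rough sketch of why that works (illustrative Go, not the repo's actual encoder): on a sorted column, long runs of identical values collapse into (value, count) pairs.

    package main

    import "fmt"

    type run struct {
        value int64
        count int
    }

    // rle run-length encodes a column; on a sorted (or mostly-constant) column,
    // thousands of identical values collapse into a single (value, count) pair.
    func rle(col []int64) []run {
        var out []run
        for _, v := range col {
            if n := len(out); n > 0 && out[n-1].value == v {
                out[n-1].count++
                continue
            }
            out = append(out, run{value: v, count: 1})
        }
        return out
    }

    func main() {
        // A sorted "date" column: many rows share the same day.
        dates := []int64{20160101, 20160101, 20160101, 20160102, 20160102}
        fmt.Println(rle(dates)) // [{20160101 3} {20160102 2}]
    }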

I'm not a researcher, just an interested developer, so please feel free to yell at me if anything seems wrong :)

EDIT: Turns out I was actually working on an unpushed branch; it's been merged to master. The repo now actually has some interesting work.



