Hacker News new | past | comments | ask | show | jobs | submit login
MonetDB – Column-store database (monetdb.org)
113 points by PeCaN on June 13, 2016 | hide | past | web | favorite | 41 comments

If you're interested in reading more about the general design of column stores, I can recommend [0] as a great place to start. It also compares design decisions made in MonetDB to other column stores.

If you want to look at simple code after the reading, I'm slowly implementing a naive column store over at [1] in golang. The overall structure isn't flexible but the individual columns are quite portable. It has a number of tests and benchmarks using a few months of MTG price data I had sitting around.

[0] http://db.csail.mit.edu/pubs/abadi-column-stores.pdf

[1] https://github.com/Everlag/naive-columstore

I'm working in something like this but only in memory (because I'm building a relational language).

I have wonder how much better could be row or columns store if only for the in-memory data build typically in a program execution. I wonder if exist some data about this - because I'm too early stage to have meaningful data for benchmarks and my intuition is that column-store could be better long-term because searching for few columns dominate all other things (and joins could be better?), but in other aspects, like iterating, things not look good.

BTW, this is in F#

The 'optimal' store really depends on the type of data and queries you're targeting. [0] is probably a good place to start for comparisons. I'd also recommend the other reading in my previous post, but that is much longer.

Sequential scans form the core of most column store queries, why would iteration be an issue? Do you mean iterations considering multiple fields of a structure or similar? In that case, you should be able to break that into a series of efficient, per-column queries which can be lazily materialized.

[0] http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf

Yeah, normal iterations that select most/all columns like

   for row in customers
and similar...

However I see that kdb+ is column-store (http://www.timestored.com/kdb-guides/kdb-database-intro) and supposedly the performance is ok?

Is there anything special about the 'store' part of column stores? As in, is their store better in some way than for example splitting your PgSQL tables into key-value tables?

And with regards to the constructing of SQL queries over the column stores as MonetDB/SQL does, is it more performant than what I described on PgSQL and doing the query with a bunch of JOINs between the various attribute tables?

I know there's a lot been written about column stores and there's some very interesting stuff out there (I particularly like MonetDB/XQuery, but sadly it was discontinued), perhaps a short answer to those questions might inspire some more interest in them.

The store part: A columnstore enables some pretty awesome compression to be put in place, especially in data warehousing workloads. For large, low-cardinality fields, a dictionary encoding can give great compression. Another common data pattern is large runs of repeated values (thousands of entries on the same date), which can be compressed with run-length-encoding very well.

These are two methods utilized in Microsoft's Vertipaq engine (behind SQL Server columnstore indices and SSAS Tabular models). I call it out specifically, because this is the technology I am familiar with, but I expect similarities elsewhere.

Decreasing the memory footprint decreases the hardware requirements and decreases the amount of time required to process the data, which is a boon when many of your queries are typified by a full table scan.

Typically we see 7x-10x compression of Vertipaq data compared to uncompressed SQL tables.

The use case where columnstore engines shine is not OLTP, so a lot of traditional RDBMS use cases don't see a lot of benefit.

I'll repeat one item - if you can describe your access patterns as consisting mostly of table scans (or at least large index scans), rather than seeks, you will likely be able to find benefit from a columnstore engine. This is not to say that there is no other use case.

Edit: I think Everlag summarized it better in a sibling post, but also sparked another thought about differences between row-based and column-based storage.

Adding columns to a logical SELECT statement in a columnstore imposes a much higher load than in a row-based table. In a traditional RDBMS, once you've gotten to the row's page on disk/in memory, it is essentially free to get 1 or N fields in your resultset. I.e. Accessing N fields is O(1).

Since a columnstore does not necessarily store each field contiguously on disk/in memory, it is closer to O(N) in the size of the field list in the resultset.

I wouldn't quote me on the algorithmic complexity of the two, but it's enough for intuition.

>>if you can describe your access patterns as consisting mostly of table scans (or at least large index scans), rather than seeks, you will likely be able to find benefit from a columnstore engine.

I think this is very accurate for first generation column stores, but some of the newer research systems are addressing this issue with technologies like behind-the-scenes indexing.

This is interesting to me. What is the indexing strategy that is utilized?

To me the challenge is that the values making up a row are stored non-contiguously. The best case scenario is that a row of N fields requires N data accesses. This is assuming that you can index directly into an array or similar structure with constant access time.

Row-based storage means that you get all fields for "free"when accessing a single row, since the DB engine accesses a page at a time.

This is my understanding of the problem - I am very interested to learn what I'm missing if you have any good articles to share.

HBase + Phoenix is the closest I know of. Though, the cost of covered indexes is duplication. Phoenix also provides local/global indexes, giving an optimization choice b/w read heavy vs write heavy patterns

Interesting. Do you have any articles to point to explaining the strategy? If not, I can go digging myself now with direction.


Thanks for sharing your thoughts. What's the difference in use-case between OLAP and tabular?

For reference for other readers, OLAP (On-Line Analytical Processing) is both the name for a type of database workload and a former designation of a Microsoft product, SQL Server Analysis Services, designed to fill that role.

As of SQL Server 2012 (arguably 2008R2), SSAS offers two modes of operation, Multidimensional (formerly OLAP), and Tabular. Both of these modes are designed to fit the OLAP workload.

SSAS Multidimensional is the same engine and query language that has been around forever in SSAS - a dimensional data store on-disk, scripted in MDX.

SSAS Tabular is the newer engine built on top of Vertipaq. The query and expression language is DAX.

Presently, either language can be used to query a cube/model operating in either mode, though development is still Multidimensional-MDX and Tabular-DAX.

The way I like to characterize the two is that Multidimensional mode is what your pedantic uncle would put together if he read and loved the Kimball books and the only language he knew was SQL. It looks like SQL, but is semantically VERY different. Tabular is designed to capture the Excel analyst market. DAX is designed to look like Excel formulas, but semantically DAX and the Tabular engine are very similar to SQL.

Neither of these is intended as a negative description.

Multidimensional is excellent for huge datasets, and works great for anything that can map cleanly onto the abstraction of dimensions and measure groups.

Tabular, like I mentioned, maintains many semantic similarities to SQL. The primary abstractions are tables and relationships, with an addition of the idea of filter context that can be automatically propagated. Since the primary operations are on tables and relations, the model is much more general and I find that I'm often able to map DAX to SQL in my head in real time as I write code. There are of course things that are easier in one language than the other, and more difficult.

DAX is not a general-purpose query language, though. It is designed to support analytical queries, and so it has a great deal of nuance around manipulating filter context and aggregating data.

Multidimensional allows much more fine-grained specification of cube behavior, but consequently demands a slower development cycle. For enterprise scale BI and semantic modeling, I think it is absolutely best in class.

Tabular is a very powerful engine that allows you to get pretty close to Multidimensional in some dimensions, with a much lower development overhead. The Tabular storage engine does demand that all data be resident in memory. There is a pass-through query mode which acts essentially as a DAX->SQL translation layer, called DirectQuery, which can remove this requirement, but typically it demands an absolute beast of a RDBMS (not limited to only MS SQL Server) on the back end.

Even though Tabular offers a more flexible semantic model, it still works best with something approximating a dimensional model. The further you depart from a dimensional model, the more expertise you need in DAX to handle it. It's not bad, but the population of people comfortable enough to work at that level is vanishingly small.

This has been a collection of random thoughts. There isn't a single better mode between the two. Both have strengths, but without a specific use case or set of constraints it is difficult to give concrete advice.

The two modes fit the same general use case, but have nuances.

> Tabular is designed to capture the Excel analyst market. DAX is designed to look like Excel formulas

And, in fact, Excel's "Power Pivot" feature (and uses DAX for calculations) is powered by Excel running its own instance of SSAS Tabular.

Yes. Power Pivot is the reason that I said Tabular arguably came to SQL Server 2008R2, which can run Analysis Services in a mode to support a Power Pivot workbook hosted on SharePoint. There was no standalone SSAS Tabular, but the Vertipaq engine and DAX were run on a SQL Server instance.

Power BI Desktop (and hosted data models on the PowerBI.com cloud service) is also powered by the Tabular engine and utilizes DAX as a query and modeling language.

Microsoft is pushing Tabular hard. It's probably the right choice for most organizations, and at worst serves as a great prototyping tool for a Multidimensional model (when we have clients who need Multidimensional for whatever reason, we often still do initial development and pilots on Tabular).

Thanks for the answers. Really appreciated! I am analyst and I deal with databases, but in a very specific way. The way I tried to explain to myself the difference is that OLAP is used if you have a lot of fine-grained hierarchies accessed by a big number of people. In this scenario it would make sense to make all those aggregations ahead of time because a columnar store would have to calculate them each time somebody slices and that would be waay to expensive, even if you have state of the art caching.

That is a useful distinction to make, but as in all programming tasks, profiling will get you further than intuition when tuning for performance.

(As someone who was pretty familiar with PgSQL internals)

1. I think from the 'storage' point of view, a column-store will use much less space than a collection of key/value PgSQL tables:

A) Each PgSQL table comes with several system columns, so each key/value pair will have the large overhead of (key-size + system column size), which will most likely be larger than the actual size of data. This will defeat one of the goals of columnar stores which is to load less data in-memory (to achieve less disk I/O, assuming more frequent columns will get eventually cached), unless you have so many columns and use only very few of them.

B) PostgreSQL doesn't compress each column. Column-stores usually use some kind of light-weight compression to use less disk. When I worked on cstore_fdw, this was one of the desirable features that attracted users.

2. I haven't done PostgreSQL benchmarking in last ~10 months, but unless selectivity of your query is high, fastest of joins won't be even close to a sequential table scan. If the selectivity is low and your query only needs to scan small sub-set of rows, then joins+indexes should be faster, but then again you have overhead of indexes.

One method I found useful in PostgreSQL is to create an index of columns of very frequent queries, and then tune the system to use index-only-scan for these queries.

"and then tune the system to use index-only-scan for these queries."

Caaareful there. I hope you're monitoring your query times for outliers.

The 'store' portion buys you a lot of raw performance because its modern architecture friendly. tl;dr, frame it as the row store being an array of structures while the column store is a structure of arrays.

Then, your basic way of running a query is a sequential scan over columns. That means queries are going to be more cache friendly while also spending much more time in a trace.

If you sort your data into so-called projections, then the columns you sorted by become the equivalent of sorted arrays, which are extremely compressible. At that point, you can cheaply run-length or delta-encode the sorted columns to achieve incredible compression ratios. Even further, any query using those sorted columns can multiply the effective storage bandwidth available to it by the compression ratio.

I'm not a researcher, just an interested developer, so please feel free to yell at me if anything seems wrong :)

EDIT: Turns out I was actually working on an unpushed branch, its been merged to master. The repo now actually has some interesting work.

I think one of the most interesting parts of this project is that the query language is actually Monet Assembly Language, and their SQL implementation just compiles down to that. There's enough documentation to write your own query language for it—including other paradigms like K and Datalog.

We used this in production at my previous role and it was fantastic. There are a few rough edges but it became markedly more reliable over the 2 years or so we used it. It handled huge (well over 100KB) queries generated by SQLAlchemy with ease and was lightening fast.

We were using it as part of a custom BI tool for portfolio risk analysis. The data was mastered in postgresql and loaded into MonetDB for OLAP stuff.

The developers are pretty responsive to questions and bug reports.

Do you know if it's possible to expose an MDX or similar interface to MonetDB for a BI purpose?

I see many references to being able to define your own language, so I wonder if this has been done. I don't see anything explicit when looking through the site.

I ask because many popular visualization and reporting tools can natively generate MDX queries.

It can be done through Mondrian OLAP. See mondrian-rest [0] for an example implementation.

[0] https://github.com/jazzido/mondrian-rest-demo

Cool! Thanks.

I had heard of Mondrian, but didn't think to check from that end if it could use MonetDB as a back end.

SQLAlchemy dialect maintainer here. I'm happy to hear somebody is using it, I've been working on it for a while but in the last 2 years I only received one bug report. And I don't think that is because the code is flawless.

If anybody else has been searching for the license, it's Mozilla Public License 2.0. See http://dev.monetdb.org/hg/MonetDB/file/tip/license.txt

I tried doing a proof of concept with this and some production data a couple years ago. Reading in a CSV file gave error after error after error. I eventually gave up.

Sorry to hear that! We have recently overhauled the CSV importer and added a "BEST EFFORT" flag to it. Perhaps its time to try again?

I tried it a couple of years ago. They were some rough edges, but something I was really eager to see working was their sub projects Datacell ( you can see a pretty cool demo of it there https://www.youtube.com/watch?v=Z36ftaItX1c ). Unfortunately it was not at all ready and this sub project was abandoned. A distributed version of MonetDB with Datacell working would be an absolute killer in the OLAP oriented databases.

Also checkout Actian vector, a super fast analytics database built by a team under Peter Boncz

I got drawn to it due to integration with R, but that area is still fairly lacking and buggy. Installation was unsuccessful. Would be nice to see improvements on this.

Could you be more specific? What went wrong?

I was drawn specifically to the integration. It's been a while, but I tried to get db integration with some files I was loading. Really didn't pan out as the tutorials showed. Will look into my code at work to see if I have examples.

Yeah, I'm also interested because I got very excited about the R integration too. Serverless instance for development and full in-database processing sounds very nice.\

I used this as part of an Analytics BI project years ago. The biggest problem for me was dealing with the optimistic concurrency model.

We ran into enough issues with partial writes during rollbacks due to failed transactions that we eventually had to abandon it in favor of Infobright. Having said that queries were ridiculously fast for OLAP workflows involving millions of records.

Normally for OLAP workflows you want your updates to either be idempotent or be tagged with an id that identifies the ETL job they generated then (very helpful for auditing and related functions), or have a kind of "build and swap" model to ensure availability. That, and performance concerns, mean a lot of OLAP servers consequently often have very limited support for transactions.

"column store technology as pioneered in MonetDB" ???

wtf? MonetDB pioneered column store technology? what else have they pioneered?

Martin Kersten recently got the SIGMOD systems award for MonetDB: http://sigmod.org/sigmod-awards/people/martin-kersten-2016-s...

Are you disputing their claims?

Setting all the details aside, there are about 30 years between the first column-oriented database and MonetDB. Databases are a really well and early researched area of computer science and many ideas existed at least in the academic area for decades before they saw widespread adoption. Another example is what we now label NoSQL databases, they also already existed in the 1960s.

MonetDB in its current form was first created in 2002 by doctoral student Peter Alexander Boncz and professor Martin L. Kersten as part of the 1990s' MAGNUM research project at University of Amsterdam. [1]

TAXIR was the first application of a column-oriented database storage system with focus on information-retrieval in biology in 1969. [2]

[1] https://en.wikipedia.org/wiki/MonetDB

[2] https://en.wikipedia.org/wiki/Column-oriented_DBMS

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact