
MonetDB – Column-store database - PeCaN
https://www.monetdb.org/Home
======
Everlag
If you're interested in reading more about the general design of column
stores, I can recommend [0] as a great place to start. It also compares design
decisions made in MonetDB to other column stores.

If you want to look at simple code after the reading, I'm slowly implementing
a naive column store over at [1] in golang. The overall structure isn't
flexible but the individual columns are quite portable. It has a number of
tests and benchmarks using a few months of MTG price data I had sitting
around.
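To make the idea concrete, a single column in such a naive store can be little more than a typed slice plus a full-scan aggregate. This is a sketch of the concept only, not code from [1]:

```go
package main

import "fmt"

// PriceColumn is a naive column: just a typed slice.
// Scans touch one contiguous array, which is the core
// win of column-oriented storage.
type PriceColumn struct {
	values []float64
}

// Push appends one value to the column.
func (c *PriceColumn) Push(v float64) {
	c.values = append(c.values, v)
}

// Sum is a full-column scan aggregate.
func (c *PriceColumn) Sum() float64 {
	total := 0.0
	for _, v := range c.values {
		total += v
	}
	return total
}

func main() {
	var prices PriceColumn
	for _, p := range []float64{1.5, 2.25, 0.75} {
		prices.Push(p)
	}
	fmt.Println(prices.Sum()) // 4.5
}
```

A real store would add typed operators (filters, joins on positions) per column, but the layout above is the essential difference from a row store.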

[0] [http://db.csail.mit.edu/pubs/abadi-column-stores.pdf](http://db.csail.mit.edu/pubs/abadi-column-stores.pdf)

[1] [https://github.com/Everlag/naive-columstore](https://github.com/Everlag/naive-columstore)

~~~
tinco
Is there anything special about the 'store' part of column stores? As in, is
their store better in some way than for example splitting your PgSQL tables
into key-value tables?

And with regard to constructing SQL queries over column stores as MonetDB/SQL does: is it more performant than doing what I described on PgSQL, i.e. running the query with a bunch of JOINs between the various attribute tables?

I know a lot has been written about column stores and there's some very interesting stuff out there (I particularly like MonetDB/XQuery, but sadly it was discontinued); perhaps a short answer to those questions might inspire some more interest in them.

~~~
greggyb
The store part: A columnstore enables some pretty awesome compression to be
put in place, especially in data warehousing workloads. For large, low-
cardinality fields, a dictionary encoding can give great compression. Another
common data pattern is large runs of repeated values (thousands of entries on
the same date), which can be compressed with run-length-encoding very well.
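As a rough illustration (not Vertipaq's actual on-disk format), run-length encoding a sorted date column might look like this in Go:

```go
package main

import "fmt"

// A run is one value plus how many times it repeats.
type run struct {
	value string
	count int
}

// rleEncode compresses a sorted column: thousands of rows
// on the same date collapse into a single (value, count) pair.
func rleEncode(col []string) []run {
	var runs []run
	for _, v := range col {
		if n := len(runs); n > 0 && runs[n-1].value == v {
			runs[n-1].count++
		} else {
			runs = append(runs, run{value: v, count: 1})
		}
	}
	return runs
}

func main() {
	col := []string{"2016-07-01", "2016-07-01", "2016-07-01", "2016-07-02"}
	fmt.Println(rleEncode(col)) // [{2016-07-01 3} {2016-07-02 1}]
}
```

Dictionary encoding is analogous: replace each distinct string with a small integer id and keep one side table of id→string, which pays off exactly when cardinality is low.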

These are two methods utilized in Microsoft's Vertipaq engine (behind SQL
Server columnstore indices and SSAS Tabular models). I call it out
specifically, because this is the technology I am familiar with, but I expect
similarities elsewhere.

Decreasing the memory footprint decreases the hardware requirements and
decreases the amount of time required to process the data, which is a boon
when many of your queries are typified by a full table scan.

Typically we see 7x-10x compression of Vertipaq data compared to uncompressed
SQL tables.

The use case where columnstore engines shine is not OLTP, so a lot of
traditional RDBMS use cases don't see a lot of benefit.

I'll repeat one item - if you can describe your access patterns as consisting
mostly of table scans (or at least large index scans), rather than seeks, you
will likely be able to find benefit from a columnstore engine. This is not to
say that there is no other use case.

Edit: I think Everlag summarized it better in a sibling post, but also sparked
another thought about differences between row-based and column-based storage.

Adding columns to a logical SELECT statement in a columnstore imposes a much
higher load than in a row-based table. In a traditional RDBMS, once you've
gotten to the row's page on disk/in memory, it is essentially free to get 1 or
N fields in your resultset. I.e. Accessing N fields is O(1).

Since a columnstore does not necessarily store each field contiguously on
disk/in memory, it is closer to O(N) in the size of the field list in the
resultset.

I wouldn't quote me on the algorithmic complexity of the two, but it's enough
for intuition.

~~~
ljw1001
>>if you can describe your access patterns as consisting mostly of table scans
(or at least large index scans), rather than seeks, you will likely be able to
find benefit from a columnstore engine.

I think this is very accurate for first generation column stores, but some of
the newer research systems are addressing this issue with technologies like
behind-the-scenes indexing.

~~~
greggyb
This is interesting to me. What is the indexing strategy that is utilized?

To me the challenge is that the values making up a row are stored non-
contiguously. The best case scenario is that a row of N fields requires N data
accesses. This is assuming that you can index directly into an array or
similar structure with constant access time.

Row-based storage means that you get all fields for "free" when accessing a
single row, since the DB engine accesses a page at a time.

This is my understanding of the problem - I am very interested to learn what
I'm missing if you have any good articles to share.

~~~
neeleshs
HBase + Phoenix is the closest I know of, though the cost of covered indexes
is duplication. Phoenix also provides local/global indexes, giving an
optimization choice between read-heavy and write-heavy patterns.

~~~
greggyb
Interesting. Do you have any articles to point to explaining the strategy? If
not, I can go digging myself now with direction.

Thanks!

------
PeCaN
I think one of the most interesting parts of this project is that the query
language is actually Monet Assembly Language, and their SQL implementation
just compiles down to that. There's enough documentation to write your own
query language for it—including other paradigms like K and Datalog.

------
yangyang
We used this in production at my previous role and it was fantastic. There are
a few rough edges but it became markedly more reliable over the 2 years or so
we used it. It handled huge (well over 100KB) queries generated by SQLAlchemy
with ease and was lightning fast.

We were using it as part of a custom BI tool for portfolio risk analysis. The
data was mastered in postgresql and loaded into MonetDB for OLAP stuff.

The developers are pretty responsive to questions and bug reports.

~~~
greggyb
Do you know if it's possible to expose an MDX or similar interface to MonetDB
for a BI purpose?

I see many references to being able to define your own language, so I wonder
if this has been done. I don't see anything explicit when looking through the
site.

I ask because many popular visualization and reporting tools can natively
generate MDX queries.

~~~
jazzido
It can be done through Mondrian OLAP. See mondrian-rest [0] for an example
implementation.

[0] [https://github.com/jazzido/mondrian-rest-demo](https://github.com/jazzido/mondrian-rest-demo)

~~~
greggyb
Cool! Thanks.

I had heard of Mondrian, but didn't think to check from that end if it could
use MonetDB as a back end.

------
perlgeek
If anybody else has been searching for the license, it's Mozilla Public
License 2.0. See
[http://dev.monetdb.org/hg/MonetDB/file/tip/license.txt](http://dev.monetdb.org/hg/MonetDB/file/tip/license.txt)

------
bjt
I tried doing a proof of concept with this and some production data a couple
years ago. Reading in a CSV file gave error after error after error. I
eventually gave up.

~~~
hfmuehleisen
Sorry to hear that! We have recently overhauled the CSV importer and added a
"BEST EFFORT" flag to it. Perhaps it's time to try again?

------
julienmarie
I tried it a couple of years ago. There were some rough edges, but something I
was really eager to see working was their sub-project DataCell (you can see
a pretty cool demo of it here:
[https://www.youtube.com/watch?v=Z36ftaItX1c](https://www.youtube.com/watch?v=Z36ftaItX1c)
). Unfortunately it was not at all ready and the sub-project was abandoned. A
distributed version of MonetDB with DataCell working would be an absolute
killer among OLAP-oriented databases.

------
AtlasLion
Also check out Actian Vector, a super-fast analytics database built by a team
under Peter Boncz.

------
IndianAstronaut
I got drawn to it due to integration with R, but that area is still fairly
lacking and buggy. Installation was unsuccessful. Would be nice to see
improvements on this.

~~~
hfmuehleisen
Could you be more specific? What went wrong?

~~~
IndianAstronaut
I was drawn specifically to the integration. It's been a while, but I tried to
get db integration with some files I was loading. Really didn't pan out as the
tutorials showed. Will look into my code at work to see if I have examples.

------
matburt
I used this as part of an Analytics BI project years ago. The biggest problem
for me was dealing with the optimistic concurrency model.

We ran into enough issues with partial writes during rollbacks due to failed
transactions that we eventually had to abandon it in favor of Infobright.
Having said that queries were ridiculously fast for OLAP workflows involving
millions of records.

~~~
cbsmith
Normally for OLAP workflows you want your updates to either be idempotent or
be tagged with an id identifying the ETL job that generated them (very
helpful for auditing and related functions), or to use a kind of "build and
swap" model to ensure availability. That, and performance concerns, mean a lot
of OLAP servers have very limited support for transactions.

------
shitgoose
"column store technology as pioneered in MonetDB" ???

wtf? MonetDB pioneered column store technology? what else have they pioneered?

~~~
blahi
Are you disputing their claims?

~~~
danbruc
Setting all the details aside, there are about 30 years between the first
column-oriented database and MonetDB. Databases are a thoroughly and early
researched area of computer science, and many ideas existed at least in
academia for decades before they saw widespread adoption. Another example is
what we now label NoSQL databases; they too already existed in the 1960s.

 _MonetDB in its current form was first created in 2002 by doctoral student
Peter Alexander Boncz and professor Martin L. Kersten as part of the 1990s
MAGNUM research project at the University of Amsterdam._ [1]

 _TAXIR was the first application of a column-oriented database storage system
with focus on information-retrieval in biology in 1969._ [2]

[1]
[https://en.wikipedia.org/wiki/MonetDB](https://en.wikipedia.org/wiki/MonetDB)

[2] [https://en.wikipedia.org/wiki/Column-oriented_DBMS](https://en.wikipedia.org/wiki/Column-oriented_DBMS)

