
The design and implementation of modern column-oriented database systems - olooney
https://blog.acolyer.org/2018/09/26/the-design-and-implementation-of-modern-column-oriented-database-systems/
======
manigandham
Modern column-oriented databases are rather incredible at what they do, and
I'm surprised by how little they are used or known about. Redshift, BigQuery,
Snowflake, Azure SQL DW, SQL Server, MonetDB, Vertica, Greenplum, MemSQL,
kdb+, Clickhouse, even Druid are all column-stores that can do sub-second
queries on massive amounts of data.

I also want to note that time-series databases are basically obsolete at this
point because time-series data is very well handled by these OLAP column-
stores. Create a table with time as a primary or sort key and you'll get fast
queries with full SQL and joins.
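To sketch why a time sort key helps (toy Python, not any particular database's implementation): once a column is sorted on time, a time-range predicate reduces to two binary searches plus a contiguous scan.

```python
import bisect
from datetime import datetime, timedelta

# Hypothetical time-sorted column at one-minute granularity. Because the
# timestamps are sorted, a range predicate is two binary searches plus a slice.
timestamps = [datetime(2018, 1, 1) + timedelta(minutes=i) for i in range(100_000)]
values = list(range(100_000))

def range_query(start, end):
    """Return values whose timestamp falls in [start, end)."""
    lo = bisect.bisect_left(timestamps, start)
    hi = bisect.bisect_left(timestamps, end)
    return values[lo:hi]

# One hour of data, found without scanning the whole column:
recent = range_query(datetime(2018, 1, 1, 1), datetime(2018, 1, 1, 2))
```

Real column stores do this with per-block min/max metadata rather than a literal binary search, but the effect is the same: only the blocks overlapping the time range get read.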

~~~
alexhutcheson
> Create a table with time as a primary or sort key and you'll get fast
> queries with full SQL and joins.

Be careful with this. You can easily end up with pathologically bad
performance, resource usage, contention etc. if most of your requests end up
being directed to the same workers or shards in a distributed system.
Partitioning by time can cause this because most queries tend to be for recent
time windows.

Time can be a decent primary key within a shard, but it's usually a bad choice
for any key that's used to partition data between shards. Make sure you
understand how your system partitions data so you don't fall into this trap.

Apologies if that came across as pedantic - you probably already know this,
but hopefully some other readers can be saved from learning this through
painful experience.

~~~
georgewfraser
Column stores typically don’t use the primary key for anything except
metadata. I think you’re thinking of distributed row stores that use the
primary key to shard.

In a column store, you would apply the sort/cluster key to your time column,
which would partition the table within nodes, not across nodes. This will
speed up queries that specify a narrow time range, and will not cause any of
the pathological behavior you describe.

Some column stores allow you to choose a distribution key, which specifies how
data is distributed across nodes. It can be used to optimize joins, but it's
dangerous: it can cause exactly the problem you describe.
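A toy simulation of the skew (made-up numbers, not any specific system): distribute the same rows by time range and by a non-time key, then count which nodes a "last 10%" query would hit.

```python
from collections import Counter

NODES = 4
# Synthetic rows: ts is a monotonically increasing timestamp surrogate.
rows = [{"ts": t, "user": t % 997} for t in range(10_000)]

# A query over the most recent 10% of the data:
recent = [r for r in rows if r["ts"] >= 9_000]

# (a) Distribution key = time range: node = ts // (total_rows / NODES).
by_time = Counter(r["ts"] // 2_500 for r in recent)

# (b) Distribution key = a non-time attribute: node = user % NODES.
by_user = Counter(r["user"] % NODES for r in recent)

# by_time lands every recent row on a single node (a hot shard);
# by_user spreads the same query across all four nodes.
```
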

~~~
alexhutcheson
Good clarification, thanks. I'm not familiar with the exact partitioning
schemes of the various systems.

------
walterbell
[http://www.timestored.com/time-series-data/column-oriented-databases](http://www.timestored.com/time-series-data/column-oriented-databases)

 _> Open Source Column-Oriented Databases: There are three available open
source column databases, all were based on works of research groups that later
saw commercial spinoffs. C-Store produced vertica, MonetDB spawned Vectorwise
and LucidDB was DynamoBI. Each project has stagnated as the team around them
moved on to commercial endeavours. Only MonetDB appears to be still actively
developed, it's also the one that seems most feature complete as we'll see
later._

Is the above an accurate characterization of OSS column-oriented databases?

There is also MapD, which uses GPUs: [https://github.com/mapd/mapd-core](https://github.com/mapd/mapd-core)

 _> MapD Core is an in-memory, column store, SQL relational database that was
designed from the ground up to run on GPUs._

~~~
threeseed
Cassandra also counts as a column-oriented database.

~~~
manigandham
It's actually not.

Cassandra is a (badly named) wide-column database, which is more accurately
described as an advanced key/value store: specifically, a sorted hash map with
multiple levels of values, where the values can also be part of the primary
and secondary keys.
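A rough sketch of that "sorted hash map with multiple levels" model (toy Python, names made up, glossing over SSTables, replication, etc.):

```python
from bisect import insort

class WideColumnTable:
    """Toy wide-column model: hash map on the partition key, and within each
    partition a list of (clustering key, columns) kept sorted for range scans."""

    def __init__(self):
        self.partitions = {}

    def put(self, pkey, ckey, columns):
        rows = self.partitions.setdefault(pkey, [])
        insort(rows, (ckey, columns))  # maintain clustering-key order

    def slice(self, pkey, lo, hi):
        """Range scan over clustering keys within a single partition."""
        return [cols for ck, cols in self.partitions.get(pkey, []) if lo <= ck < hi]
```

The point of the sketch: lookups hash on the partition key, and only *within* a partition do you get sorted, range-scannable data, which is a very different shape from a column store's per-column storage.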

------
amelius
So what kinds of operations does a column store typically allow, and which of
them is it optimized for?

- Appending cells to a column

- Removing a range of cells from a column

- Searching for data in a column (exact search or complicated text search)

- Performing a map over a column (applying a user-defined function to each
cell, producing a new column or making changes in place)

- Filtering a column using a user-defined function

- Combining two or more columns through some user-defined function

- Sorting

- more?

~~~
kwillets
He didn't clarify it in the article, but deletes are basically flagged by a
"deleted" column, and updates are implemented as delete + insert. The
overriding idea is that a row is never overwritten; storage is append-only.
Containers are periodically merged into larger containers (an LSM tree), and
deleted rows are purged at that time.
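A minimal sketch of that delete-flag-plus-merge scheme (toy Python; real LSM trees add sorted runs, compaction levels, and so on):

```python
# Toy LSM-style segments: each row is (key, value, deleted). A delete appends
# a tombstone instead of touching old data; a merge keeps only the newest
# version of each key and drops the tombstones.
def merge(*segments):
    """Merge segments ordered oldest -> newest; purge deleted rows."""
    latest = {}
    for seg in segments:
        for key, value, deleted in seg:
            latest[key] = (value, deleted)   # newer segment wins
    return sorted((k, v) for k, (v, d) in latest.items() if not d)

old = [("a", 1, False), ("b", 2, False)]
new = [("a", 9, False), ("b", None, True)]   # update "a", delete "b"
```
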

------
toolslive
Slightly different topic, but there is value in using a column-based ORM
strategy where each attribute has its own table and objects are dispersed
over the tables. It's very flexible, and most of the time you just want a few
objects with a certain attribute having a certain value anyway.
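In toy form (hypothetical table names, dicts standing in for attribute tables keyed by object id):

```python
# Attribute-per-table layout: each attribute maps object id -> value.
tables = {
    "name":  {1: "alice", 2: "bob", 3: "carol"},
    "color": {1: "red",   2: "blue", 3: "red"},
}

def find(attr, value):
    """Ids of objects whose attribute equals value; touches only one table."""
    return sorted(oid for oid, v in tables[attr].items() if v == value)

def materialize(oid):
    """Reassemble a whole object by probing every attribute table."""
    return {attr: tbl[oid] for attr, tbl in tables.items() if oid in tbl}
```

The trade-off mirrors the column-store one: the attribute filter is cheap, while reassembling full objects costs one lookup per attribute.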

~~~
phaedrus
Sounds a lot like the array-of-structs vs struct-of-arrays design decision for
game engines. One of the features Jonathan Blow put into the programming
language he has been working on (JAI) is better support for this dichotomy.
IMO most programming languages have optimized the language only for array of
structs and historically all but ignored the struct of arrays case.

~~~
toolslive
Yes, and it has the same performance implications: struct-of-arrays performs
better because it yields better locality when iterating over the fields.
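For readers unfamiliar with the dichotomy, here it is in toy Python (the locality benefit is only really visible in a language with flat arrays, but the structural difference is the same):

```python
# Array-of-structs: one record per element; summing one field still walks
# every whole record.
aos = [{"x": i, "y": 2 * i, "z": 3 * i} for i in range(1000)]
sum_y_aos = sum(r["y"] for r in aos)

# Struct-of-arrays: one contiguous sequence per field; summing "y" touches
# only the "y" data, exactly like a column store scanning one column.
soa = {"x": list(range(1000)),
       "y": [2 * i for i in range(1000)],
       "z": [3 * i for i in range(1000)]}
sum_y_soa = sum(soa["y"])
```
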

------
neonate
Does anyone know how much of this analysis does or doesn't apply to kdb? I
wish it had been included.

~~~
manigandham
All of it applies. kdb+ is one of the most advanced and has unique features
like special "as of" joins and complex math expressions because it's designed
for heavy fintech uses.

------
olooney
I'm curious about this comment:

> The ultimate performance boost from compression comes when operators can act
> directly on compressed values without needing to decompress. Consider a sum
> operator and RLE encoded values. It suffices to multiply the value by the
> run length.

It sounds very impressive and I can see in theory how it could benefit
COUNT(*), COUNT(DISTINCT ...), SUM(), AVG(), MIN(), and MAX()... in other
words, pretty much all of the common aggregation functions. For example,
COUNT(DISTINCT ...), MIN(), and MAX() can literally just ignore the run
length, since the repeats within a run are known duplicates, while the others
simply need to use the run length as a factor.
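The idea in toy form (Python over made-up runs, not any engine's actual operator):

```python
# RLE-encoded column as (value, run_length) pairs. Aggregates operate on the
# runs directly, never materializing the decompressed column.
runs = [("US", 4), ("DE", 2), ("US", 3), ("FR", 1)]

count    = sum(r for _, r in runs)            # COUNT(*): run lengths only
weighted = sum(len(v) * r for v, r in runs)   # SUM-style: value * run length
distinct = len({v for v, _ in runs})          # DISTINCT: ignores run lengths
minimum  = min(v for v, _ in runs)            # MIN: ignores run lengths
```
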

Does anyone know if this specific optimization is actually implemented in any
real-world database?

~~~
kwillets
From a Vertica query plan:

    
    
        explain select country, count(*) from tablename group by 1
    
        ...
        +-GROUPBY PIPELINED (GLOBAL RESEGMENT GROUPS) [Cost: 79, Rows: 1] (PATH ID: 1)
        |  Aggregates: count(*)
        |  Group By: tablename.country
        |  Execute on: All Nodes
        | +---> STORAGE ACCESS for tablename [Cost: 78, Rows: 14M (1 RLE)] (PATH ID: 2)
        ...
    

Note the row count at the top level.

This type of thing only works when the group column is the first in the
projection sort order (or for multiple columns, a prefix of it). Generally
GROUPBY PIPELINED indicates that the sort order is being used (including RLE).
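What "pipelined" buys you, sketched in Python (a streaming group-by over sorted input; the real operator additionally consumes RLE runs rather than individual rows):

```python
from itertools import groupby

# When the input arrives sorted on the group column, each group is contiguous,
# so counts can be emitted one group at a time with constant memory and no
# hash table.
sorted_countries = ["DE", "DE", "FR", "US", "US", "US"]
counts = [(country, sum(1 for _ in grp))
          for country, grp in groupby(sorted_countries)]
```

With RLE storage the per-group work shrinks further: a run ("US", 14_000_000) contributes its count in one step, which is where the "14M (1 RLE)" row estimate above comes from.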

------
amelius
What is the difference between a column-oriented database, and storing your
data in a regular database where all tables have only a single column?

~~~
wenc
Columnar databases have optimizations for columns, like RLE compression, Bloom
filters, etc. that speed up filtering, lookups and aggregations. Row-stores
don't typically implement these optimizations.
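As one concrete example, here is a minimal Bloom filter sketch (toy Python, parameters arbitrary): a column store can keep one per segment and skip any segment whose filter says the looked-up value cannot be there.

```python
import hashlib

class Bloom:
    """Tiny Bloom filter: false positives possible, false negatives impossible,
    so a negative answer lets a scan skip the whole segment."""

    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, item):
        # Derive k positions by salting a cryptographic hash (simple, not fast).
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.bits

    def add(self, item):
        for p in self._positions(item):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```
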

That said, some traditional row-store databases like SQL Server support
columnar indices, which give row-oriented databases similar performance
characteristics to columnar. The downside is lower write speed (the columnar
indices have to be maintained, and the writes are still done row-wise).

~~~
manigandham
SQL Server actually does have full column-oriented storage. They just call it
a columnstore index, but it's not a true index: it turns the whole table into
column-oriented compressed data files. It originally didn't support any
updates in the 2012 version because of this, but it has since evolved and can
even be combined with the in-memory option to support in-memory columnstores
that handle constant real-time updates.

~~~
wenc
Right, but I would argue that it is a form of an index. An index is simply an
internal derived data structure that has to be kept in sync with the source
dataset.

All indices are derived views of the original dataset that are optimized with
certain characteristics.

~~~
manigandham
There might be confusion here. The feature is called columnstore indexes but
it has 2 modes: clustered or nonclustered. In clustered mode, the table itself
is converted to column-oriented data format and is not a derived dataset.
Adding a nonclustered columnstore index though will create a separate copy of
the table to be maintained.

A column-oriented table can also have regular b-tree indexes. Perhaps it's
unfortunate that they used indexes as the main interface for managing these
table types instead of more explicit syntax.

~~~
wenc
Thanks for the clarification -- I wasn't aware that clustered and non-
clustered modes were stored differently on disk. I stand corrected.

