
The Rise and Fall of the OLAP Cube - shadowsun7
https://www.holistics.io/blog/the-rise-and-fall-of-the-olap-cube/
======
th0ma5
I used to admin OLAP infrastructure for a large Fortune 100. I think the
article makes a lot of great points, but ultimately there are many paths to
the same result. I'm not entirely sure what that group is up to these days,
though I'm sure column stores are a big part of it; ultimately, the big
choice is deciding between precalculated vs. dynamically calculated pivots.

Whatever system you have can do that, but the real work IMHO is understanding
the org well enough to cover the 80/20 of what people will want to see.
Ideally you want to reach a higher level of abstraction, continuously
codifying your methods of analysis, or your pivots back into traditional
tables where possible, for maximum repeatability.

Sometimes I wonder if graphs or trees really make more sense, though, and
whether OLAP being a tree of sorts is just a symptom of RDBMS ubiquity; but
this is just a meandering notion, perhaps.

------
quantified
This piece reiterates a bunch of the usual misunderstandings of OLAP. Likely
because it’s following the DBMS community’s redefinition of OLAP towards
cross-tab group-by, and away from the higher end of the OLAP market. The
“cube” is just as much a logical construct as a SQL table is, and just because
row stores were bad at certain analytics didn’t mean that SQL was.

Compressed column-stores hurt OLAP, because update throughput (the “on-line”
in OLAP) is relatively bad. Uncompressed / array stores are quite good.

The quantity and complexity of SQL necessary for a non-trivial OLAP view is
daunting: time-series views that mix YTD and current-period calcs for
transactional and balance accounts, ratio measures, and the joins necessary
for calculating and aggregating across the many (easily dozens of) fact
tables that have different dimensionality and granularity and need to be
joined at their finest detail. The user will want to see a set of metrics
for a pair of organizational units, and also as a percentage difference
between the two org units. SQL does not have inter-row calculations, so
quite a bit of work goes into re-shaping the SQL cursor’s results into
something the user wants to see.

All that query generation and result transformation is part of the value-add
of the “cube” server.

So the OLAP cube as a logical construct is definitely not fallen. Just the bad
ones. They’re less flexible than SQL but provide way more productivity within
the query space they’re built for.
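
A toy sketch of that re-shaping step, the pivot a cube server does between
the flat SQL result and what the user sees. All table/column names here are
invented for illustration:

```python
# Rows come back from SQL flat, as (org_unit, metric, value); the user
# wants the two org units side by side plus a percent-difference column.
rows = [
    ("North", "revenue", 1200.0),
    ("North", "headcount", 40.0),
    ("South", "revenue", 1000.0),
    ("South", "headcount", 50.0),
]

def pivot_with_pct_diff(rows, unit_a, unit_b):
    # Group values by metric, then compute the inter-row calculation
    # (percent difference) that plain SQL result sets don't carry.
    by_metric = {}
    for unit, metric, value in rows:
        by_metric.setdefault(metric, {})[unit] = value
    out = {}
    for metric, units in by_metric.items():
        a, b = units[unit_a], units[unit_b]
        out[metric] = {unit_a: a, unit_b: b,
                       "pct_diff": 100.0 * (a - b) / b}
    return out

print(pivot_with_pct_diff(rows, "North", "South"))
```

The cube server's value-add is doing this (and the query generation feeding
it) so the user never sees the flat cursor.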

~~~
davedx
> Compressed column-stores hurt OLAP, because update throughput (the “on-line”
> in OLAP) is relatively bad. Uncompressed / array stores are quite good.

But why do you want high update throughput when you're doing mostly reads?
Every definition I read about OLAP says this is one of the fundamental
differences, or am I misunderstanding something?

~~~
alexhutcheson
Your users want their dashboards and reports to contain data that's as up-to-
date as possible. No one is happy about waiting 24 hours or more for a batch
update to run if there are decisions they need to make right now. The
difference vs. OLTP is that it might be acceptable for events to take minutes
or even hours to be reflected in the database, vs. ~seconds or less for a
transaction where the user is interacting and waiting for a confirmation.

~~~
amylene
Most of the users I work with, when asked why they want more than daily
batches, respond with something emotional rather than a business case. I had
a request for real-time updates for a massive metric that is based around
monthly usage.

Ya just gotta push back, and help people understand what the trade-offs are
for real-time. Most people don’t need it.

~~~
quantified
Absolutely true. The ones that do, do. For example, multiuser planning systems
absolutely do. And those were what OLAP was originally intended to support.

------
willvarfar
I'm conflicted; it's a nice write-up, and probably generally true right now,
for most stuff.

However, I still live with databases big enough to still need cubes, although
these cubes can afford to be less refined these days. Saying 'Bigtable can do
a regex on 30M rows per second' isn't saying it can't be done cheaper and
quicker without paying Google etc., if you just have some cubes.

And I think it's going to track the normal sine wave: over time, data sets
get bigger, and we keep oscillating between needing to cube and being able
to have the reporting tool 'cube on the fly' behind the scenes.

I think there's a similar cycle, not mentioned in the article, as data lakes
become faster, then data outstrips them, and so on.

The strength will be tooling that transparently cubes on demand. I wish there
were efficient statistics and CDC that tracked metadata so tools could say
'this mysql table has been written to since I last snapshotted it', and, even
better, 'this materialized view that I have in this database is now out of
date because of writes affecting the expression it is derived from on that
other database over there', etc. Basic classic data sources can do a lot of
new things to make downstream tools able to cache better.
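
A rough sketch of the kind of staleness metadata I mean. The API here is
entirely invented; no real database exposes exactly this, but the idea is
just a version counter bumped on write:

```python
class Table:
    """A source table that bumps a version counter on every write."""
    def __init__(self):
        self.version = 0
        self.rows = []

    def write(self, row):
        self.rows.append(row)
        self.version += 1  # cheap CDC-style metadata

class Snapshot:
    """A downstream cache that remembers which version it copied."""
    def __init__(self, table):
        self.table = table
        self.refresh()

    def refresh(self):
        self.rows = list(self.table.rows)
        self.seen_version = self.table.version

    def is_stale(self):
        # "Has the source been written to since I last snapshotted it?"
        # becomes a single comparison instead of a rescan.
        return self.table.version != self.seen_version

t = Table()
t.write({"id": 1})
snap = Snapshot(t)
print(snap.is_stale())   # False right after snapshotting
t.write({"id": 2})
print(snap.is_stale())   # True: the source moved on
```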

I have a slight problem with the terminology in the middle of the article, as
I'm so far down the rabbit-hole that I think of cubes _as_ databases; I suffer
cognitive dissonance when I read about shifts from cubes to databases etc. To
me, a cube is just a fancy term for a table/view for a particular use-case.

One tool that I'm terribly excited about these days is presto.
[https://prestosql.io/](https://prestosql.io/) allows you to take a
constellation of different normal databases and query them as though they were
one big database. And you can just keep on adding data-sources. Awesome!

~~~
ozborn
Thanks for your comment, I'm not familiar with presto at all - but I did do a
bit of reading of an older article:
[https://www.slideshare.net/frsyuki/prestogres-
internals](https://www.slideshare.net/frsyuki/prestogres-internals)

Would you view presto in its current state as a replacement for vanilla
Postgres with FDW for standard data analysis queries? I don't fully understand
the Postgres/Presto relationship.

~~~
willvarfar
Hmm, presto is not Postgres.

In a way, presto is like a bunch of FDWs on steroids, plus a query planner
with an above-average cost model for Hive etc.

There are plenty of things that presto isn’t, such as a good replacement for
Postgres in classic oltp workloads.

------
buremba
Columnar database engines are powerful enough to answer ad-hoc questions, so
you often don't need to materialize the summary data somewhere else or use
BI tools such as Tableau that fetch the data into their own servers and let
you run queries on their platform.

ELT tools such as Airflow and DBT let you materialize the data in your
database with (incremental) materialized views, similar to the way OLAP
cubes work, but inside your database and using only SQL. That way, you won't
get stuck with vendor lock-in (looking at you, Tableau and Looker); instead
you manage the ELT workflow easily using these open-source tools.

These tools target analysts and data engineers, not business users, though.
Your data team needs to model your data, manage the ETL workflow, and adopt
a BI tool for you. When you want to get a new measure into a summary table,
you need to contact an analyst in your company and have them change the data
model. As someone working in this industry, I can say that we still have a
way to go, but BI workflows will be much more efficient in a few years
thanks to columnar databases.

Shameless plug: We're also working on a data platform: you model your data
(dimensions, measures, relations, etc.) and build ad-hoc analytics
interfaces for business users. If a business user wants to optimize a
specific set of queries (OLAP cubes), they simply select the
dimension/measure pairs and the system automatically creates a DBT model
that builds a summary table in your database, similar to OLAP cubes, thanks
to the GROUPING SETS feature in ANSI SQL. Here are some of the public models
if you're interested:
[https://github.com/rakam-io/recipes](https://github.com/rakam-io/recipes)
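
For a rough feel of what a GROUPING SETS-style summary table contains, here
is a toy emulation in Python (column names are invented, and a real system
would of course do this in SQL): each selected grouping contributes its own
set of rows, with None standing in for the dimensions not in that grouping,
the way NULL does in the SQL output.

```python
from collections import defaultdict

# Fact rows: (channel, month, sales)
facts = [("web", "2020-01", 3), ("web", "2020-02", 4), ("app", "2020-01", 5)]

# Equivalent of: GROUP BY GROUPING SETS ((channel), (month))
grouping_sets = [("channel",), ("month",)]
col_index = {"channel": 0, "month": 1}

summary = []
for gs in grouping_sets:
    agg = defaultdict(int)
    for row in facts:
        agg[tuple(row[col_index[c]] for c in gs)] += row[2]
    for key, total in agg.items():
        vals = dict(zip(gs, key))
        # None plays the role of NULL for the columns outside this grouping.
        summary.append((vals.get("channel"), vals.get("month"), total))

print(summary)
# [('web', None, 7), ('app', None, 5), (None, '2020-01', 8), (None, '2020-02', 4)]
```

One scan per grouping here; the point of GROUPING SETS in the database is
that it produces all of these row sets from a single pass over the data.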

------
_Codemonkeyism
After maintaining an OLAP cube system for some years, I'm not that sure after
reading the article.

The nice thing about an OLAP cube is the UI and how easily business users
can drag and drop items to explore data (standard reports are best created
automatically and don't need an OLAP layout/setup).

If the UI (Tableau, Excel Power Pivot) is the same, then yes, OLAP cubes are a
thing of the past. Otherwise not.

~~~
tomrod
It's basically the same.

------
simo7
I think the article is a bit simplistic.

It's true that _often_ OLAP cubes are not needed. That's simply because the
amount of data and the latency requirements are _often_ not too demanding.

Also, materialized views don't solve the major issue with OLAP cubes: the
need to maintain data pipelines.

I wonder if a solution to this problem could come from a different way of
caching result sets: new queries that would produce a subset of a previously
cached result could be run against the cached result itself. Of course this
opens up a new set of problems, cache invalidation etc..
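
A toy sketch of the idea (shapes and names are invented): if we've cached the
result of a fine-grained GROUP BY, a coarser query over the same data can be
answered by re-aggregating the cached result instead of rescanning the base
table.

```python
from collections import defaultdict

# Cached result of a hypothetical query:
#   SELECT region, month, SUM(sales) FROM facts GROUP BY region, month
cached = {
    ("EU", "2020-01"): 100,
    ("EU", "2020-02"): 150,
    ("US", "2020-01"): 200,
}

def rollup(cached, keep_index):
    """Answer a coarser GROUP BY by summing over the dropped dimension."""
    out = defaultdict(int)
    for key, total in cached.items():
        out[key[keep_index]] += total
    return dict(out)

print(rollup(cached, 0))  # totals by region alone
print(rollup(cached, 1))  # totals by month alone
```

This only works for queries that are genuinely derivable from the cached
grain, and, as noted, invalidating the cache on writes is its own problem.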

------
simo7
Two false statements in this article:

> ...Amazon, Airbnb, Uber and Google have rejected the data cube...

Airbnb uses Druid which is essentially an OLAP cube.

> BigQuery, for instance, doesn’t allow you to update data at all

That hasn't been the case for several years now.

~~~
lkcubing
Amazon uses Apache Kylin, which is a modern OLAP cube technology.

~~~
shadowsun7
Amazon also uses Redshift, which is a modern columnar database.

An example does not an argument make.

------
beefield
Sorry, could someone ELI5 what OLAP is? And while you're at it, what is a
Tabular Model? As background, I have worked with SQL and relational
databases, and occasionally keep hearing these terms, but nobody has ever
explained to me what they are and why I should be interested. So far I have
just shrugged and figured my workloads/data models/whatnot just don't need
these fancy things, but every time I see them, there's something nagging at
the back of my head saying maybe you should have a look...

~~~
slumdev
OLAP: Online Analytical Processing. Cranking through large amounts of data
with a focus on aggregations like sums, averages, medians, etc. Measures
(numbers) are defined by dimensions (attributes with usually discrete
domains). Aggregations are frequently precomputed on many (or all) dimension
axes so that they are immediately available. Models can be built from
something as simple as a wide CSV file or as complicated as a snowflake
schema.
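
A minimal toy illustration of the precomputation part (dimension and measure
names are invented): aggregate the measure over every subset of the
dimension axes up front, so any group-by in that space becomes a lookup.

```python
from itertools import combinations
from collections import defaultdict

dims = ("country", "product")
facts = [
    {"country": "DE", "product": "bike", "sales": 10},
    {"country": "DE", "product": "car",  "sales": 20},
    {"country": "FR", "product": "bike", "sales": 5},
]

# Precompute SUM(sales) for every subset of the dimension axes:
# (), (country,), (product,), and (country, product).
cube = {}
for r in range(len(dims) + 1):
    for axes in combinations(dims, r):
        agg = defaultdict(int)
        for f in facts:
            key = tuple(f[d] for d in axes)
            agg[key] += f["sales"]
        cube[axes] = dict(agg)

print(cube[("country",)])   # per-country totals
print(cube[()])             # the grand total, keyed by the empty tuple
```

Real engines are smarter about which of the exponentially many axis subsets
to actually materialize, but the trade is the same: storage and build time
up front for instant answers later.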

The kind of relational database you're familiar with is probably OLTP (Online
Transactional Processing).

~~~
beefield
So, when my aggregate queries/views are too slow even after indexing and
tuning, then I should start to google what OLAP is? My go-to tool for this
has been materialized views (or in some cases simply a new table that is
refreshed every now and then). What would be the cases where OLAP is
better/worse than a materialized view? (Based on the main article, it sounds
like OLAP has pretty much no advantage other than smaller resource
requirements.)

~~~
slumdev
If your needs are well-defined in advance, you are fine with a materialized
view. What I mean by "needs" is, "What questions are you trying to answer?"

OLAP's strength is that the platforms that implement it can precompute
aggregations across all of your data and let you quickly answer questions that
you might not have known you had.

------
iblaine
For anyone wondering, OLAP != OLAP Cubes

OLAP = A category of databases meant for analyzing data. These are eventually
consistent db's, and not OLTP db's. OLAP db's include Redshift, Teradata,
Snowflake, BigQuery, and others. Generally what makes a database an MPP
database is partitioning compute and storage. Generally what differentiates
one MPP db from another is whether or not data and compute are colocated.

OLAP Cubes = A feature built into SQL Server that has its own dialect of SQL
called MDX. OLAP Cubes are decreasing in popularity because you can achieve
the same results through other means with less effort.

~~~
bsg75
Not unique to MS SQL Server - Oracle and IBM (InfoSphere) have cube engines,
and there have been independent implementations as well:
[https://en.wikipedia.org/wiki/Comparison_of_OLAP_servers](https://en.wikipedia.org/wiki/Comparison_of_OLAP_servers)

~~~
iblaine
That’s a valid point. Thanks.

------
lukehan
Well, I would say the OLAP cube is just re-rising now. 1000+ companies have
deployed Apache Kylin (an OLAP engine for big data) in the past 5 years,
handling 100B+ rows and 100+ concurrent users; many different use cases are
built on that technology. It works very well with BI tools and is friendly
to the analysts who have used this "old-fashioned" approach every day for
over a decade (how hard would it be for them to become data scientists?).
Check out more here:
[http://kylin.apache.org/community/poweredby.html](http://kylin.apache.org/community/poweredby.html)

~~~
Merick
+1 for Apache Kylin, it's a great project and awesome open source community.
If anyone is curious about what modern OLAP is capable of, check it out.

If you want an alternative take on where OLAP is today and what it is capable
of, I usually recommend this article: [https://kyligence.io/blog/olap-
analytics-is-dead-really/](https://kyligence.io/blog/olap-analytics-is-dead-
really/)

~~~
shadowsun7
There are huge problems with that piece. For starters, OLAP != OLAP cube.
Columnar databases and OLAP cubes are both designed for OLAP workloads. They
are simply different architectures. Therefore, it is impossible to argue that
'OLAP is dead' — it cannot be dead, because OLAP is simply a type of database
usage.

At this point you might say, "oh, OLAP cubes refer to an abstraction; they
can be implemented using columnar stores!" — and I would point you to 40
years' worth of academic research stretching back to the early 80s. The OLAP
cube or data cube refers to a specific type of data structure. It just so
happens that vendors like to use the term 'OLAP cube' even when they are
using a columnar engine under the hood, because it sells well.

------
liyang-kylin
Cost is a big factor that the author underestimated in this big-data era. A
precalculated cube is not only faster but also many times cheaper in the
cloud, thanks to the reuse of precalculated results.

Dynamic query services in the cloud, like Google BigQuery and Amazon
Redshift/Athena, basically charge by processed data volume. For small and
medium datasets, this works well. But for big data close to or above
billions of rows, the cost will make you reconsider.

At the recent Apache Kylin meetup in Berlin, OLX Group shared a comparison
between an OLAP cube and dynamic queries in a real case. Given 0.1 billion
rows, cube technology (Apache Kylin and SSAS) easily prevails over
MPP+columnar (Redshift). In particular, Apache Kylin is 3.8x faster and 4.4x
cheaper than Redshift
for their business. ([https://www.slideshare.net/TylerWishnoff/apache-kylin-
meetup...](https://www.slideshare.net/TylerWishnoff/apache-kylin-meetup-
berlin-with-olx-group))

For me, a mix of precalculation (80%) and dynamic calculation (20%) should
hit the sweet spot between cost effectiveness and query flexibility.

------
lkcubing
Thanks for sharing. Interesting write up.

While this article accurately captures the issues with traditional OLAP
cubes, it fails to recognize the latest developments in this domain.

Projects like Apache Kylin, and its commercial version Kyligence, leverage
modern computing architectures such as columnar storage, distributed
processing, and AI optimization to build cubes over hundreds of billions of
rows of data covering hundreds of dimensions. The resulting performance is
unprecedented for either traditional OLAP cubes or today's MPP data
warehouses. That's why the world's largest banks, retailers, insurance
companies, and manufacturers are turning to Kylin/Kyligence for their most
challenging analytical problems.

Not to mention the rich semantic layer that modern OLAP cube technology
provides, which greatly simplifies analytics architecture in the enterprises.

And, comparing columnar stores to OLAP cubes is like comparing apples to
oranges. The former is a storage format and the latter is an analytical
pattern. Modern OLAP cube technology like Kylin/Kyligence stores cubes in
columnar stores anyway.

~~~
shadowsun7
> And, comparing columnar stores to OLAP cubes is like comparing apples to
> oranges. The former is a storage format and the latter is an analytical
> pattern. Modern OLAP cube technology like Kylin/Kyligence stores cubes in
> columnar stores anyway.

This is mistaken. I went back and read most of the academic literature on
OLAP cubes while working on this piece (literature which, unlike vendor
marketing, has used the term consistently since the early 80s). OLAP cubes
or data cubes refer specifically to the data structure that grew out of
nested arrays. An OLAP cube may be materialized from a column store, but a
column store isn't an OLAP cube.

Relevant sources are included at the bottom of the piece.

------
markus_zhang
I have one question for you guys. If my company is focused on Spark and
Vertica, and I want to learn data modelling on top of those, does Kimball
still make sense? The article says yes in general but I'd like to know your
opinions.

Currently the BI team doesn't do much dimensional modelling as far as I can
see. Everything is taken from Kafka and dumped into some wide tables with
all the columns that we analysts need. Actually there is no data modelling
at all.

------
ovi256
If you're a hacker interested in SQL and OLAP, you might enjoy Greenspun's
(he of Greenspun's tenth rule fame) writings on these subjects:

[https://philip.greenspun.com/wtr/data-
warehousing.html](https://philip.greenspun.com/wtr/data-warehousing.html)
[https://philip.greenspun.com/sql/](https://philip.greenspun.com/sql/)

Although they're technically obsolete (as in, talking about 90s database
systems that have since bitten the dust), that's a minor defect. He spends
most of his effort on teaching timeless principles.

------
inshadows
What's some concrete example of OLAP cube? What does Alice, a data analyst,
actually do when she gets to work at her computer? What does she interact with
on the screen? Does she use some specialized software to project the data cube
into two dimensions (contingency tables) to find hidden meaning in the data?
There's a lot of abstract talk and no actual examples on the Internet, except
for SQL Server tutorials which always end up with some kind of E-R diagram.

------
huy
Interesting perspective. What do you say to someone who's been using OLAP cube
for their entire BI implementation? What would be the transition plan to
adopting MPP databases?

~~~
Spasnof
Background: someone who worked in the old Microsoft Analysis Services stack
8 years ago and has since moved to columnar data formats like Parquet and
row-based MPPs.

For a transition plan, I would say find the dataset that is exploding in
size or complexity and start your POC there. I ran customer-service datasets
on OLAP, so mine only grew at a pretty small scale and the data model didn't
change that much. So OLAP was fine, except for the fact that nobody else
knew how to maintain it.

The main growing pain is finding what your front-end developer flow will be.
It will be the same governance as with OLAP, but more democratized, so be
ready to make your back end more front-end-facing. For MSAS it was Excel,
but more modern systems are also more wide open. The article suggests just
SQL, but that can get out of control. How do you reduce reinventing the
wheel? How do you prevent a lineage mess of derivative on top of derivative
if you give users write access? Etc. IMO Tableau is a great product that
allows OLAP-like exploration but can use SQL as an input. Just make sure the
SQL behind it gets under some kind of source control and governance.

From the data model perspective, I think the main difference is to make the
tables wider and "pre-join" in your immutable dimensions with higher
cardinality (i.e. customer). Just be careful with highly mutable data, and
keep it in separate tables, because it is very painful to rewrite columnar
data. I.e., if you partition by date, then to update a single record you
rewrite the entire date partition.
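
A toy model of that rewrite cost (the partition paths and record shapes are
invented): treating each date partition as immutable means a one-row change
rewrites everything in that partition.

```python
# Each key stands for an immutable partition file on disk.
partitions = {
    "dt=2020-01-01": [{"id": 1, "score": 10}, {"id": 2, "score": 20}],
    "dt=2020-01-02": [{"id": 3, "score": 30}],
}

def update_record(partitions, dt, rec_id, new_score):
    # Columnar partitions can't be patched in place, so we rebuild the
    # whole partition just to change one record.
    old = partitions[f"dt={dt}"]
    rewritten = [dict(r, score=new_score) if r["id"] == rec_id else dict(r)
                 for r in old]
    partitions[f"dt={dt}"] = rewritten
    return len(rewritten)  # rows rewritten for a one-row update

rows_touched = update_record(partitions, "2020-01-01", 1, 99)
print(rows_touched)  # 2: the whole partition, not just the changed row
```

Scale the partition up from two rows to a day's worth of events and the
asymmetry becomes the pain point described above.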

(About governance:) I mean more passive governance, not gatekeeping. Pretend
each end user and dataset costs you money. How do you track them passively
with some thin yet easily trackable logging? Business unit and unique job
are the bare minimums.

------
js8
It seems that the article makes a category error in arguing that OLAP cubes
were replaced by columnar data stores. I always understood the OLAP cube as
an abstract concept that can have various technical implementations, while a
column store is a kind of optimization within a particular technical
implementation.

~~~
endorphone
The article is generalizing, as such articles do. Necessarily so. There are
always exceptions. It is unfortunate that when such articles appear the
dominant response is nitpicking or exception pointing (I'd say 80%+ of the
comments thus far).

In the overwhelming bulk of cases firms have a column store that they generate
cubes from. OLAP is run against the cubes. Some put warehousing in between,
though that changes little. Cubes are fundamentally a form of caching because
historically it was prohibitive to do large-scale aggregations in real-time.
With massive memory servers, and more importantly flash storage that improves
aggregate performance at the enterprise scale by many magnitudes -- a million
times faster analysis and aggregation is entirely possible -- that historic
caching step becomes a hindrance and maintenance/timeliness issue. So it's
discarded.

That's all 100% true. It has happened in many orgs.

Not all, of course. But as a general trend.

------
sgt
How does something like Tableau fit in? I know of people using Tableau with
a Postgres connector, but I am not sure whether that gives you the same kind
of performance as you'd get with OLAP or even a columnar DB.

~~~
arethuza
I don't know about Tableau, but Microsoft Power BI does use a columnar
engine - I believe it's the same VertiPaq/xVelocity component used in Excel
Power Pivot and SSAS Tabular Models.

~~~
greggyb
Correct. It also shares a fair bit of implementation with columnstore indices
in MSSQL Server.

There's also a proprietary relational query language shared among Power BI,
Power Pivot, and SSAS Tabular, called DAX.

------
dexterzz
The author doesn't know today's OLAP. Look at the Tabular Model in SQL
Server Analysis Services (Power BI), an in-memory column-oriented analytics
engine.

------
lou1306
> Codd got called out for his conflict of interest and was forced to retract
> his paper … but without much fallout, it seems: today, Codd is still
> regarded as ‘the father of the relational database’

I found this passage confusing. He is regarded as such because of his work on
the relational algebra, and the shady OLAP backstory is unrelated to that.

