
Ask HN: OLAP and Reporting that feels like 2012 - endymi0n
Hi Guys,<p>after researching how to continuously aggregate and mine our tracking data (200GB and growing fast) for almost a week, I'm still stuck. Is it just me or didn't I just find the right product yet? I must admit that I'm a generalist developer, no DBA - but to me it looks like all the products I've looked into just don't "feel" right. JasperReports, InfiniDB, Pentaho, just to name a few... it's all - how can I say - crusty, unintuitive, 1999. All the products look very corporate, there are basically no howto's, no prices, no shiny, bold Open Source products that fit my bill. I wouldn't even mind using a good commercial product that does what I want, but even the advertisements are "from DBA to DBA". Lots of termdropping like ETLs, M/ROLAP, BI, Case-Based-Reasoning - but nothing that looks reasonably simple and straightforward. Maybe I'm spoiled by the DX (Developer Experience) that Backbone and Rails give me, but is there really nobody that has done something simpler and more straightforward? Like "these are my dimensions, these are my facts - now generate my Cube here so I can go datamining"? Now I know this is a huge field and I might sound naive but I'd like to know if there are others sharing my pain or having something closer to a solution.<p>Some more data: I'm lead architect for a Berlin startup - currently we're storing our data in MySQL but moving to Cassandra soon. I'm open to any technologies that could simplify our lives here.<p>Thanks in advance,
Dom
======
troels
Yeah, I've been wondering the same thing. Here's what I found so far, but
really nothing definitive:

<http://demo.analytical-labs.com/>

<https://github.com/drkreso/cube>

<http://databrewery.org/cubes.html>

<http://exposedata.com/parallel/veggie/>

<http://square.github.com/crossfilter/>

Some saas's:

<http://www.metricly.com>

<https://www.google.com/fusiontables/Home/>

<http://chart.io>

<http://www.gooddata.com>

Another option is to use Excell. With an odbc data source and some
materialised views, you can pull stuff into a pivot table and get pretty far
on the BI front. Of course, you have to use Windows on the client and mssql on
the server side (I currently just mirror my MySql database into mssql on a
daily basis).

Interested in hearing if you found other options?

~~~
oscilloscope
The most recent parallel coordinates examples are here:

<http://exposedata.com/parallel/>

<http://bl.ocks.org/2420080> \- iPad and FF friendly

Also check out:

<http://vis.stanford.edu/projects/datavore/>

<https://github.com/mleibman/SlickGrid/wiki>

<http://novus.github.com/nvd3/>

More promising saas:

<http://polychart.com/>

<http://www.plot.io/>

------
bsg75
Is the greatest source of your pain at the data storage end (ETL, RDBMS),
aggregations (OLAP cube), or presentation (reports, visualizations)?

I ask because other than issues with the traditional RDBMS+cube approach,
mentioned by some in this thread as outdated, one of the issues I find is that
open source ETL and BI tools do not have the maturity of open source
databases.

Having moved from all-Microsoft shops (SQL Server, SSIS, SSRS), to PostgreSQL
for data, and a mix of OSS and commercial ETL and presentation tools, my
greatest source of frustration is the effort in working with tools that feel
"unfinished". Cumbersome processes to accomplish what should be simple tasks
can make any project frustrating.

This is not to say that the folks working on OSS OLAP / BI tools are not doing
good work. But these tools have not been in development as long as something
like PostgreSQL, so they simply don't have the maturity.

Another issue is that many of those who are consumers of open source OLAP
tools are data people, not as adept in languages like Java or C. Thus non-
developers can't contribute back to the projects at the same rate that
developers can.

None of these points solve your issue, but as it stands, I find a mix of
mature open source with commercial front ends suitable. I am mixing PostgreSQL
with Tableau with some success. Tableau means license costs and Windows, but
from an ease of development perspective, I have not found a suitable open
source replacement. Saiku is one that has my interest.

------
daemon13
I can confirm your findings - this domain [simple OLAP] really sucks.

As CFO I implemented [all including planning
database/dimensions/forms/reports/calculations/rules/etc except for
programming] 3 OLAP based projects in three multinational companies.

Closed source In 2 projects we used Hyperion Essbase, in 1 project we used SAP
Business Objects. In all cases we used consultants for programming the cubes
and all else per spec. Also, there is MS solution - have not used it.

There are a number of OpenSource solutions, including Mondrian, etc - most of
them use SQL database (PostGre et all) for storing data and building OLAP
cubes.

Check SO on OLAP, most Q&A are garbage, but found some good gems. You can
start from here

[http://stackoverflow.com/questions/3705011/what-should-i-
hav...](http://stackoverflow.com/questions/3705011/what-should-i-have-in-mind-
when-building-olap-solution-from-scratch)

I also saw post of Cubes implementation in Python some time back on HN -
Google for Stiivi-cubes-v0.7 and Stiivi-brewery-v0.7.

pls keep posted if you'll find smth cool, i.e. easy and simple DIY OLAP. After
spending close to several days on Googling, I gave up.

------
chris_dcosta
I've been in this field for more than a decade and I couldn't agree more. The
tools, and even the underlying theory is stuck in the past, and every project
I work on is hampered by the technology not meeting the expectations of
(business) users who demand a much more intuitive way of working - because
that's what the experience every day on consumer devices.

There are issues though with getting consistent and accurate measures that are
not easily solved by just looking at the data - albeit at a very high level of
detail) for example when customers move or change details, it's easy to lose
track of where you should be reporting the facts. And depending on the volume
of data it's not going to be a long term solution to recalculate _everything_
every day - which is the mistake a lot of companies make when trying to cope
with this.

The problem is in your question. The Cube. This is a construct that assumes
that the underlying data is typed allowing it to be re-arranged into facts,
characteristics and attributes, and unless you have the foresight to think of
your db like this before you build your web app, it's hard going to squeeze it
into the dimensional model.

In fact I'd go as far as saying that the dimensional model is no longer a
suitable model for dealing with event based data generation, and that may be
why you ask the question.

It's one of my personal goals to solve this, I have some thoughts but it's not
quite my priority at the moment, but I'd love to hear what others have to say
on this.

~~~
siganakis
I too have spent quite a few years working in this area. I have primarily been
involved in building tools to make this kind of analysis easier - but it is
very difficult.

As you mention, the primary issue is that almost all business data is _very_
complex. There are very few areas (in my experience), where simple queries
offer much value, because there are so many niggling business rules that lead
to exceptions and counter exceptions to rules.

This means that "self service" analytics are pretty much impossible. There are
often very few people in an organization who actually know how to define even
seemingly simple measures like profit or revenue.

This means you need consultants who can learn the business rules, and who also
understand the data platform. This makes these projects expensive, and makes
it hard for platforms to win in the market.

For the most part in my experience, the best way forward is to choose a
database engine (MySQL, Postgress, Oracle, DB2, SQL Server), get the data into
it (itself difficult!), then write a bunch of SQL to incorporate the business
rules. Then if you are feeling fancy (and still have budget!), built a nice
HTML front end on it.

Its not very sexy, but it does work. I would love it if there was an easier
way, but I am yet to find it.

~~~
viandante
Agree. But usually business rules and exceptions boil down to filters and look
ups on the dataset. It is difficult to keep up with that mess from outside,
but if you deliver a way to make look ups and filters efficiently, maybe you
got a solution.

------
dantiberian
We use Qlikview at work and SQL Server Reporting Services. I'm not clear on
what exactly you are trying to do but Qlikview could be worth a look. There is
a full featured demo which is only limited in that you can't share your
documents with others.

Two neat features of Qlikview are that it compresses everything into memory
(depending on the data up to 100:1 compression) so it is quick to manipulate
your data. The second feature is that it links everything and you can change
your filters on the fly. The net result is that you can explore your data on
the fly and get instant feedback to your filter changes.

~~~
gadders
+1 for QlikView. A lot of banks, amongst other companies, use it and it's very
good for an interface for end users (so IT people are out of the loop for
tweaking reports).

------
Consultuing
Almost all OLAP products provide this simple functionality, but it's almost
impossible to find where it is because it is hidden under layers and layers of
"added value" in the form of tools to extract, transform and present that data
in flashy executive dashboards.

My first advice would be, don't try to build the fancy executive level
dashboards, that's the component of the system with the least ROI, in spite of
being the most visible for upper level management and thus likely to be the
one that is easier to get funding for. The dashboard should be the top of the
pyramid, but without a solid foundation it will be an exercise in frustration
each time anyone wants to go deeper into any of the numbers.

The problem with these tools is twofold: first, they are targetted at big
enterprise buyers that are not the end users. They are strong on fancy and
simple features that do not scale well when your analysis becomes real-world
complex. Second, all those fancy tools break down under heavy data loads. So
in the end "these are my dimensions, these are my facts - now generate my Cube
here so I can go datamining" works only during the executive demo.

You're usually better cherry picking up components from these suites that work
well enough for you, and building the missing pieces on your own than trying
to fully try to shoehorn one of these suites into your needs. Or maybe your
needs are simple enough that one of these is good enough for you.

And finally there is the problem with data quality. Perhaps in the startup
web-based application world this problem is less relevant, but it's not
unusual for large organizations to invest heavily in large OLAP efforts only
to find later that they should first have invested heavily in large proces,
application and data cleanups. Garbage in, garbage out, no matter how fancy or
cube structured the garbage out is.

Which by the way, I think that the cube is a fairly good structure for end
users to work with. The only caveat is that have to make sure that they are
able to link the data in the cube with the underlying transaction-level
entries that make it up.

------
viandante
Well, at work I use Cognos. But it is clumsy and it forces you in its
paradigms. Instead, when I dream or code (I am learning to code, I am a
finance guy), I am making demos for things that should provide a better way to
explore your data: <http://andemo.ep.io/>.

Basically, my issue with corporate analytics is that there are no apis.
Everything has to be accessed from its software. SAP data stays in SAP or goes
in some 90' data warehause system.

Another issue is that data is stored in multiple versions in different files
across many users locations (Excel files madness). This drives people nuts and
I can't understand why don't we already have a BaseCamp for finance and
analytics.

A third issue is that people are mixing data and views. This makes usually
data in, i.e. in spreadsheets, completelly unuseful for other analysis, you
basically have to copy and paste data from the fancy templates to a clean file
to use it. Views should be web based too, not to enter the 'update this file
periodically manually' hell.

Once you have apis, it's a matter of building apps to consume them and
eventually to make snapshots of your data (you don't want to recalculate
monthly billings at every page request). Your specific need, for example,
could be solved with a fancy web based pivot table (the sort you have in
spreadsheets) that accesses json data...

So, finally, if you find or start an interesing project of this type, let me
know as I am interested to join!

~~~
dakoller
RE. issue 2) and 3) I just thought that data cubes an open, but standardized
format could help - not only in the financial area: I found RDF data cubes (
[http://publishing-statistical-
data.googlecode.com/svn/trunk/...](http://publishing-statistical-
data.googlecode.com/svn/trunk/specs/src/main/html/cube.html) ) perhaps being a
part of the solution there, with the value add that you could easily connect
different data source using their semantics.

(liked the "basecamp for analytics" point)

~~~
viandante
Yes, an RDF format looks like a good solution. It appears that RDF is quite
supported in python and that it also abstracts away persistance
([http://readthedocs.org/docs/rdflib/en/latest/persistence.htm...](http://readthedocs.org/docs/rdflib/en/latest/persistence.html))
in a way. I ll look into that, thanks.

If you like the basecamp idea for analytics, I am kfk on github, I am doing
some demos to get an idea.

------
pstoellberger
"there are basically no howto's, no prices, no shiny, bold Open Source
products that fit my bill"

as the developer of the first one that got mentioned here (saiku)
<http://demo.analytical-labs.com/>

i can say that the problem is that everybody asks for a shiny, nice, intuitive
open source project.. but there is very little will to contribute to such a
project.

we really try to create something thats easy to use, fun to play with the
data, and provides the necessary functionality. but unfortunately developing
something that looks simple needs time, a lot of time. so if anyone is
interested, i always appreciate any kind of contribution (not necessarily just
coding)

if more people would work together, there would already be such a tool that we
are all dreaming of

------
andrewkerr
Open source BI software is generally poor.

For intuitive front ends look at: neutrinoBI tableau software tibco spotfire
qlikview panoptical advizor solutions

Some (e.g. qlikview) insist on taking a dump of all data onto the workstation,
so aren't suitable for big data.

These may still need a relational database to actually store the data. We use
Vectorwise. Infobright (MySQL engine) is good if you're on a tight budget,
particularly their open source community edition. There are many players here:
Vertica, ParAccel, Exasol, Teradata, Aster Data, Netezza, Kognitio, InfiniDB
(also MySQL), probably others.

------
m-ic3
I guess <http://www.icCube.com> (an in-memory OLAP server) is quite easy to
setup (in JAVA following Web standards). Then, a quite simple JSON/HTTP
interface is provided in addition to the standard XMLA. There's currently a JS
client API and an early JAVA API as well. The JS client API is used by their
visualization library / dashboards.

I'm not sure that OLAP cubes are good for 200G / week. I would go for
something in between. Use a map/reduce algo. to mine your data and generate
the relevant data from which you will build cubes for later and different
analysis.

As somebody said, I guess OLAP cubes concepts are still good nowadays and MDX
is a very powerful language that goes far beyond SQL.

My 2 cents (disclaimer I work for icCube)

------
thibaut_barrere
I _think_ you'll like this:

<https://github.com/rsim/mondrian-olap>

As for the ETL part, you may like that (which I maintain):

<http://www.activewarehouse.info/>

~~~
thibaut_barrere
Also adding this:

<https://eazybi.com/>

which is a SaaS run (I think) by rsim.

~~~
rsim
Yes, I can confirm that <https://eazybi.com> is built and run by me.

The goals of eazyBI are as described in the question - simple way to upload
your data or import from other applications and easy to use OLAP reports,
charts and dashboards. And it is built using Ruby on Rails and Backbone.js as
well :) In the backend it is using Pentaho Mondrian OLAP engine which is
embedded in JRuby open-source library <https://github.com/rsim/mondrian-olap>

~~~
Cloven
eazyBI looks very interesting, but your site and sales pitch seem pretty
unpolished. How big is your team and where are you situated?

~~~
rsim
Currently it is primarily just me and I am from Latvia (Eastern Europe). And
trying to improve site and sales pitch as well, I appreciate any feedback :)

------
vtuulos
I feel your pain. I started <http://discoproject.org> back in 2007, trying to
make MapReduce more easily approachable.

If vanilla MapReduce is too low-level for you, you might want to take a look
at <http://bitdeli.com> which provides a cloud-based scripting platform for
real-time data (disclaimer: I am a co-founder of Bitdeli).

Neither Disco nor Bitdeli provides OLAP out of the box but if you are ok with
writing a screenful of Python to process your data, you might find them
useful.

------
me_bx
I know no solution doing all we need, only some interesting components. Here
are some more links:

storing data

* couchdb <http://couchdb.apache.org/>. Good if you already know the analysis you want to do. drawback is that indexes you create (= materialized views) are monodimensional. a multidimensional indexing on top of it would simply be great. Not great if you need real real time reporting and your data change a lot.

* mongodb: more flexible than couchdb for ad-hoc querying

* graph databases: neo4j, orientdb, ... might be a good alternative to OLAP

* ElasticSearch <http://www.elasticsearch.org/> lets you have faceted indexes (= index and retrieve facts according to dimensions values). Pluggable to a many storage solutions

reporting / visualizations

* d3.js <http://mbostock.github.com/d3/> far above other charting libraries

* Simile Exhibit <http://simile-widgets.org/exhibit3/> Great for faceted browsing (facet = dimension)

* Cube <http://square.github.com/cube/> Time series visualizations

* mind gapper <https://github.com/RandomEtc/mind-gapper-js> visualizations la Hans Rosling's Gapminder

* dejavis <http://beta.dejavis.org/> looks offline for the moment

Big data

* Hadoop + data-warehousing / Querying solutions Apache Pig and/or hive

ETL (nothing exceptional here)

* Yahoo Pipes

* Python flow

* Google Refine

Saas

* Bime <http://bimeanalytics.com/>

* Druid [http://metamarkets.com/2011/druid-part-i-real-time-analytics...](http://metamarkets.com/2011/druid-part-i-real-time-analytics-at-a-billion-rows-per-second/)

------
We64
Try Tableau, or Omniscope, both are pretty feature rich.

------
grundprinzip
Hey Dom,

you could try looking at

* SAP Business Objects Explorer - [http://www.sap.com/germany/solutions/sapbusinessobjects/sme/...](http://www.sap.com/germany/solutions/sapbusinessobjects/sme/index.epx)

They used to provide a feature that will directly load your data and let's you
simply specifying your fact and dimension tables and than you can directly run
your reports.

Maybe it's worth a try.

------
clarkevans
We've developed a high-level query language, <http://htsql.org>, for complex
reporting of medical informatics data sets stored in a relational database
(Sqlite, MySQL or PostgreSQL). I'm not sure how this might fit your
requirements, but it feels like 2012 to us -:)

------
factorialboy
Few years ago I felt that Adobe Flex has excellent client side OLAP
capabilities. But then that runs on Flash and you'll have to weigh in the pros
and cons.

Also for large data sets, wouldn't make sense to do things on the client.

------
philwise
Have you looked at SQL Azure? Microsoft have really good BI tools. I think
200GB of SQL Azure will cost a couple of hundred bucks a month.

