Hacker News new | past | comments | ask | show | jobs | submit login
Ask HN: OLAP and Reporting that feels like 2012
51 points by endymi0n on April 19, 2012 | hide | past | favorite | 28 comments
Hi Guys,

after researching how to continuously aggregate and mine our tracking data (200GB and growing fast) for almost a week, I'm still stuck. Is it just me or didn't I just find the right product yet? I must admit that I'm a generalist developer, no DBA - but to me it looks like all the products I've looked into just don't "feel" right. JasperReports, InfiniDB, Pentaho, just to name a few... it's all - how can I say - crusty, unintuitive, 1999. All the products look very corporate, there are basically no howto's, no prices, no shiny, bold Open Source products that fit my bill. I wouldn't even mind using a good commercial product that does what I want, but even the advertisements are "from DBA to DBA". Lots of termdropping like ETLs, M/ROLAP, BI, Case-Based-Reasoning - but nothing that looks reasonably simple and straightforward. Maybe I'm spoiled by the DX (Developer Experience) that Backbone and Rails give me, but is there really nobody that has done something simpler and more straightforward? Like "these are my dimensions, these are my facts - now generate my Cube here so I can go datamining"? Now I know this is a huge field and I might sound naive but I'd like to know if there are others sharing my pain or having something closer to a solution.

Some more data: I'm lead architect for a Berlin startup - currently we're storing our data in MySQL but moving to Cassandra soon. I'm open to any technologies that could simplify our lives here.

Thanks in advance, Dom




Yeah, I've been wondering the same thing. Here's what I found so far, but really nothing definitive:

http://demo.analytical-labs.com/

https://github.com/drkreso/cube

http://databrewery.org/cubes.html

http://exposedata.com/parallel/veggie/

http://square.github.com/crossfilter/

Some saas's:

http://www.metricly.com

https://www.google.com/fusiontables/Home/

http://chart.io

http://www.gooddata.com

Another option is to use Excell. With an odbc data source and some materialised views, you can pull stuff into a pivot table and get pretty far on the BI front. Of course, you have to use Windows on the client and mssql on the server side (I currently just mirror my MySql database into mssql on a daily basis).

Interested in hearing if you found other options?



Is the greatest source of your pain at the data storage end (ETL, RDBMS), aggregations (OLAP cube), or presentation (reports, visualizations)?

I ask because other than issues with the traditional RDBMS+cube approach, mentioned by some in this thread as outdated, one of the issues I find is that open source ETL and BI tools do not have the maturity of open source databases.

Having moved from all-Microsoft shops (SQL Server, SSIS, SSRS), to PostgreSQL for data, and a mix of OSS and commercial ETL and presentation tools, my greatest source of frustration is the effort in working with tools that feel "unfinished". Cumbersome processes to accomplish what should be simple tasks can make any project frustrating.

This is not to say that the folks working on OSS OLAP / BI tools are not doing good work. But these tools have not been in development as long as something like PostgreSQL, so they simply don't have the maturity.

Another issue is that many of those who are consumers of open source OLAP tools are data people, not as adept in languages like Java or C. Thus non-developers can't contribute back to the projects at the same rate that developers can.

None of these points solve your issue, but as it stands, I find a mix of mature open source with commercial front ends suitable. I am mixing PostgreSQL with Tableau with some success. Tableau means license costs and Windows, but from an ease of development perspective, I have not found a suitable open source replacement. Saiku is one that has my interest.


I can confirm your findings - this domain [simple OLAP] really sucks.

As CFO I implemented [all including planning database/dimensions/forms/reports/calculations/rules/etc except for programming] 3 OLAP based projects in three multinational companies.

Closed source In 2 projects we used Hyperion Essbase, in 1 project we used SAP Business Objects. In all cases we used consultants for programming the cubes and all else per spec. Also, there is MS solution - have not used it.

There are a number of OpenSource solutions, including Mondrian, etc - most of them use SQL database (PostGre et all) for storing data and building OLAP cubes.

Check SO on OLAP, most Q&A are garbage, but found some good gems. You can start from here

http://stackoverflow.com/questions/3705011/what-should-i-hav...

I also saw post of Cubes implementation in Python some time back on HN - Google for Stiivi-cubes-v0.7 and Stiivi-brewery-v0.7.

pls keep posted if you'll find smth cool, i.e. easy and simple DIY OLAP. After spending close to several days on Googling, I gave up.


I've been in this field for more than a decade and I couldn't agree more. The tools, and even the underlying theory is stuck in the past, and every project I work on is hampered by the technology not meeting the expectations of (business) users who demand a much more intuitive way of working - because that's what the experience every day on consumer devices.

There are issues though with getting consistent and accurate measures that are not easily solved by just looking at the data - albeit at a very high level of detail) for example when customers move or change details, it's easy to lose track of where you should be reporting the facts. And depending on the volume of data it's not going to be a long term solution to recalculate everything every day - which is the mistake a lot of companies make when trying to cope with this.

The problem is in your question. The Cube. This is a construct that assumes that the underlying data is typed allowing it to be re-arranged into facts, characteristics and attributes, and unless you have the foresight to think of your db like this before you build your web app, it's hard going to squeeze it into the dimensional model.

In fact I'd go as far as saying that the dimensional model is no longer a suitable model for dealing with event based data generation, and that may be why you ask the question.

It's one of my personal goals to solve this, I have some thoughts but it's not quite my priority at the moment, but I'd love to hear what others have to say on this.


I too have spent quite a few years working in this area. I have primarily been involved in building tools to make this kind of analysis easier - but it is very difficult.

As you mention, the primary issue is that almost all business data is very complex. There are very few areas (in my experience), where simple queries offer much value, because there are so many niggling business rules that lead to exceptions and counter exceptions to rules.

This means that "self service" analytics are pretty much impossible. There are often very few people in an organization who actually know how to define even seemingly simple measures like profit or revenue.

This means you need consultants who can learn the business rules, and who also understand the data platform. This makes these projects expensive, and makes it hard for platforms to win in the market.

For the most part in my experience, the best way forward is to choose a database engine (MySQL, Postgress, Oracle, DB2, SQL Server), get the data into it (itself difficult!), then write a bunch of SQL to incorporate the business rules. Then if you are feeling fancy (and still have budget!), built a nice HTML front end on it.

Its not very sexy, but it does work. I would love it if there was an easier way, but I am yet to find it.


Agree. But usually business rules and exceptions boil down to filters and look ups on the dataset. It is difficult to keep up with that mess from outside, but if you deliver a way to make look ups and filters efficiently, maybe you got a solution.


We use Qlikview at work and SQL Server Reporting Services. I'm not clear on what exactly you are trying to do but Qlikview could be worth a look. There is a full featured demo which is only limited in that you can't share your documents with others.

Two neat features of Qlikview are that it compresses everything into memory (depending on the data up to 100:1 compression) so it is quick to manipulate your data. The second feature is that it links everything and you can change your filters on the fly. The net result is that you can explore your data on the fly and get instant feedback to your filter changes.


+1 for QlikView. A lot of banks, amongst other companies, use it and it's very good for an interface for end users (so IT people are out of the loop for tweaking reports).


Almost all OLAP products provide this simple functionality, but it's almost impossible to find where it is because it is hidden under layers and layers of "added value" in the form of tools to extract, transform and present that data in flashy executive dashboards.

My first advice would be, don't try to build the fancy executive level dashboards, that's the component of the system with the least ROI, in spite of being the most visible for upper level management and thus likely to be the one that is easier to get funding for. The dashboard should be the top of the pyramid, but without a solid foundation it will be an exercise in frustration each time anyone wants to go deeper into any of the numbers.

The problem with these tools is twofold: first, they are targetted at big enterprise buyers that are not the end users. They are strong on fancy and simple features that do not scale well when your analysis becomes real-world complex. Second, all those fancy tools break down under heavy data loads. So in the end "these are my dimensions, these are my facts - now generate my Cube here so I can go datamining" works only during the executive demo.

You're usually better cherry picking up components from these suites that work well enough for you, and building the missing pieces on your own than trying to fully try to shoehorn one of these suites into your needs. Or maybe your needs are simple enough that one of these is good enough for you.

And finally there is the problem with data quality. Perhaps in the startup web-based application world this problem is less relevant, but it's not unusual for large organizations to invest heavily in large OLAP efforts only to find later that they should first have invested heavily in large proces, application and data cleanups. Garbage in, garbage out, no matter how fancy or cube structured the garbage out is.

Which by the way, I think that the cube is a fairly good structure for end users to work with. The only caveat is that have to make sure that they are able to link the data in the cube with the underlying transaction-level entries that make it up.


Well, at work I use Cognos. But it is clumsy and it forces you in its paradigms. Instead, when I dream or code (I am learning to code, I am a finance guy), I am making demos for things that should provide a better way to explore your data: http://andemo.ep.io/.

Basically, my issue with corporate analytics is that there are no apis. Everything has to be accessed from its software. SAP data stays in SAP or goes in some 90' data warehause system.

Another issue is that data is stored in multiple versions in different files across many users locations (Excel files madness). This drives people nuts and I can't understand why don't we already have a BaseCamp for finance and analytics.

A third issue is that people are mixing data and views. This makes usually data in, i.e. in spreadsheets, completelly unuseful for other analysis, you basically have to copy and paste data from the fancy templates to a clean file to use it. Views should be web based too, not to enter the 'update this file periodically manually' hell.

Once you have apis, it's a matter of building apps to consume them and eventually to make snapshots of your data (you don't want to recalculate monthly billings at every page request). Your specific need, for example, could be solved with a fancy web based pivot table (the sort you have in spreadsheets) that accesses json data...

So, finally, if you find or start an interesing project of this type, let me know as I am interested to join!


RE. issue 2) and 3) I just thought that data cubes an open, but standardized format could help - not only in the financial area: I found RDF data cubes ( http://publishing-statistical-data.googlecode.com/svn/trunk/... ) perhaps being a part of the solution there, with the value add that you could easily connect different data source using their semantics.

(liked the "basecamp for analytics" point)


Yes, an RDF format looks like a good solution. It appears that RDF is quite supported in python and that it also abstracts away persistance (http://readthedocs.org/docs/rdflib/en/latest/persistence.htm...) in a way. I ll look into that, thanks.

If you like the basecamp idea for analytics, I am kfk on github, I am doing some demos to get an idea.


"there are basically no howto's, no prices, no shiny, bold Open Source products that fit my bill"

as the developer of the first one that got mentioned here (saiku) http://demo.analytical-labs.com/

i can say that the problem is that everybody asks for a shiny, nice, intuitive open source project.. but there is very little will to contribute to such a project.

we really try to create something thats easy to use, fun to play with the data, and provides the necessary functionality. but unfortunately developing something that looks simple needs time, a lot of time. so if anyone is interested, i always appreciate any kind of contribution (not necessarily just coding)

if more people would work together, there would already be such a tool that we are all dreaming of


Open source BI software is generally poor.

For intuitive front ends look at: neutrinoBI tableau software tibco spotfire qlikview panoptical advizor solutions

Some (e.g. qlikview) insist on taking a dump of all data onto the workstation, so aren't suitable for big data.

These may still need a relational database to actually store the data. We use Vectorwise. Infobright (MySQL engine) is good if you're on a tight budget, particularly their open source community edition. There are many players here: Vertica, ParAccel, Exasol, Teradata, Aster Data, Netezza, Kognitio, InfiniDB (also MySQL), probably others.


I guess http://www.icCube.com (an in-memory OLAP server) is quite easy to setup (in JAVA following Web standards). Then, a quite simple JSON/HTTP interface is provided in addition to the standard XMLA. There's currently a JS client API and an early JAVA API as well. The JS client API is used by their visualization library / dashboards.

I'm not sure that OLAP cubes are good for 200G / week. I would go for something in between. Use a map/reduce algo. to mine your data and generate the relevant data from which you will build cubes for later and different analysis.

As somebody said, I guess OLAP cubes concepts are still good nowadays and MDX is a very powerful language that goes far beyond SQL.

My 2 cents (disclaimer I work for icCube)


I think you'll like this:

https://github.com/rsim/mondrian-olap

As for the ETL part, you may like that (which I maintain):

http://www.activewarehouse.info/


Also adding this:

https://eazybi.com/

which is a SaaS run (I think) by rsim.


Yes, I can confirm that https://eazybi.com is built and run by me.

The goals of eazyBI are as described in the question - simple way to upload your data or import from other applications and easy to use OLAP reports, charts and dashboards. And it is built using Ruby on Rails and Backbone.js as well :) In the backend it is using Pentaho Mondrian OLAP engine which is embedded in JRuby open-source library https://github.com/rsim/mondrian-olap


eazyBI looks very interesting, but your site and sales pitch seem pretty unpolished. How big is your team and where are you situated?


Currently it is primarily just me and I am from Latvia (Eastern Europe). And trying to improve site and sales pitch as well, I appreciate any feedback :)


I feel your pain. I started http://discoproject.org back in 2007, trying to make MapReduce more easily approachable.

If vanilla MapReduce is too low-level for you, you might want to take a look at http://bitdeli.com which provides a cloud-based scripting platform for real-time data (disclaimer: I am a co-founder of Bitdeli).

Neither Disco nor Bitdeli provides OLAP out of the box but if you are ok with writing a screenful of Python to process your data, you might find them useful.


I know no solution doing all we need, only some interesting components. Here are some more links:

storing data

* couchdb http://couchdb.apache.org/. Good if you already know the analysis you want to do. drawback is that indexes you create (= materialized views) are monodimensional. a multidimensional indexing on top of it would simply be great. Not great if you need real real time reporting and your data change a lot.

* mongodb: more flexible than couchdb for ad-hoc querying

* graph databases: neo4j, orientdb, ... might be a good alternative to OLAP

* ElasticSearch http://www.elasticsearch.org/ lets you have faceted indexes (= index and retrieve facts according to dimensions values). Pluggable to a many storage solutions

reporting / visualizations

* d3.js http://mbostock.github.com/d3/ far above other charting libraries

* Simile Exhibit http://simile-widgets.org/exhibit3/ Great for faceted browsing (facet = dimension)

* Cube http://square.github.com/cube/ Time series visualizations

* mind gapper https://github.com/RandomEtc/mind-gapper-js visualizations la Hans Rosling's Gapminder

* dejavis http://beta.dejavis.org/ looks offline for the moment

Big data

* Hadoop + data-warehousing / Querying solutions Apache Pig and/or hive

ETL (nothing exceptional here)

* Yahoo Pipes

* Python flow

* Google Refine

Saas

* Bime http://bimeanalytics.com/

* Druid http://metamarkets.com/2011/druid-part-i-real-time-analytics...


Try Tableau, or Omniscope, both are pretty feature rich.


Hey Dom,

you could try looking at

* SAP Business Objects Explorer - http://www.sap.com/germany/solutions/sapbusinessobjects/sme/...

They used to provide a feature that will directly load your data and let's you simply specifying your fact and dimension tables and than you can directly run your reports.

Maybe it's worth a try.


We've developed a high-level query language, http://htsql.org, for complex reporting of medical informatics data sets stored in a relational database (Sqlite, MySQL or PostgreSQL). I'm not sure how this might fit your requirements, but it feels like 2012 to us -:)


Few years ago I felt that Adobe Flex has excellent client side OLAP capabilities. But then that runs on Flash and you'll have to weigh in the pros and cons.

Also for large data sets, wouldn't make sense to do things on the client.


Have you looked at SQL Azure? Microsoft have really good BI tools. I think 200GB of SQL Azure will cost a couple of hundred bucks a month.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: