
Ask HN: Create your own open-source stack for product usage analytics and BI? - kochhar
Hi everyone,

I'm an architect for a small team trying to put together an analysis and
reporting tool for our usage data (500GB/mo). Googling around for a week plus
and considering a bunch of open-source tools (Pentaho Suite, BIRT, JasperSoft,
SpagoBI) left me underwhelmed.

Can't get anything more than OLAP queries. No dynamic cohorts or retention
analysis. No data mining or modelling. We could even live with just OLAP, but
like this guy the interfaces and the reports make me think I got time-warped
back to 2005.

So now we want to glue together existing tools. I haven't come across
anything that made me sit up and go wow. At present the best options I've
found are:
  - Talend or Kettle for ETL,
  - Pentaho Mondrian or Druid for OLAP processing, and 
  - nada for reporting

So my question is: if you had to choose best-of-breed open-source components
for an analytics stack, what would you pick? The stack should be able to:

  - integrate data from 3-5 sources (click-stream, metadata, PII)
  - join across keys from each source (about 10-15 total)
  - after joining, transform from normalised to denormalised form to speed up queries

  - place this into a DB which provides
    - OLAP-style queries through Java, Python, Ruby (multi-dimensional aggregation, slice/dice)
    - API access via a standard query language to extract data in XML/JSON/RDD format (use in R/Jupyter)
    - dynamic cohort creation & analysis
    - retention analysis
    - drill-down into individual rows of the incoming data
  
  - From here I'd like to integrate reporting tools for
     - damn sexy visualisation out of the box
     - interactive charts
     - easy-to-use report and dashboard builders (don't have to be drag-and-drop)
     - scheduled report emails
     - export of reports and dashboards to PDF and JPG
     - export of tabular data to CSV
Am I just dreaming? Or is this even remotely possible? Thanks in advance!
======
ramiamar
You're not dreaming, but you might just be underestimating the complexity of
building an analytics stack.

Here's my suggestion:

\- Start with picking your data warehouse. 500 GB/mo can exceed what regular
SQL databases handle well, so the next cheapest option is a cloud-based
analytical datastore: BigQuery, Snowflake, Redshift, Athena.
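
To show the OLAP-style access you asked about, here's a minimal sketch of a
slice-and-dice aggregation from Python against BigQuery. The
my_project.analytics.events table and its columns are made up for
illustration:

    # pip install google-cloud-bigquery pandas pyarrow
    from google.cloud import bigquery

    client = bigquery.Client()  # picks up your default GCP credentials

    # Hypothetical clickstream table; substitute your own dataset/columns.
    query = """
        SELECT country, DATE(event_time) AS day, COUNT(*) AS events
        FROM `my_project.analytics.events`
        WHERE event_time >= TIMESTAMP('2017-01-01')
        GROUP BY country, day
        ORDER BY day, country
    """

    # Results arrive as a pandas DataFrame, ready for Jupyter -- one of
    # the access paths you listed.
    df = client.query(query).to_dataframe()
    print(df.head())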

\- Next, work on your ETL. There are some open-source tools, but they all
require a large amount of maintenance (see the sketch after this list):

      - https://github.com/snowplow/snowplow
      - Talend
      - https://www.singer.io
      - http://www.fluentd.org/
      - Kafka & Kafka Connect
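
Whatever tool you pick, the core loop is the same: extract from each source,
join and denormalise, load into the warehouse. A hand-rolled sketch of that
loop in Python -- the connection string and table names are invented, and a
real pipeline would add batching, retries, and schema management:

    # pip install pandas sqlalchemy psycopg2-binary google-cloud-bigquery pyarrow
    import pandas as pd
    import sqlalchemy
    from google.cloud import bigquery

    # Extract: pull clickstream rows and user metadata (hypothetical tables).
    src = sqlalchemy.create_engine("postgresql://user:pass@source-db/app")
    events = pd.read_sql("SELECT user_id, event, event_time FROM events", src)
    users = pd.read_sql("SELECT user_id, country, plan FROM users", src)

    # Transform: join on the shared key and denormalise into one wide table --
    # the normalised-to-denormalised step that speeds up later queries.
    wide = events.merge(users, on="user_id", how="left")

    # Load: append the denormalised batch into the warehouse.
    bq = bigquery.Client()
    bq.load_table_from_dataframe(wide, "analytics.events_denorm").result()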

\- Retention, cohorts, and other analyses - you would probably need to write
your own queries, and schedule them with a cron-like tool (a sketch follows
the list):

      - cron ( :) )
      - Airflow
      - Luigi
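
For example, a week-over-week retention query scheduled with Airflow. The SQL
and the events_denorm table are hypothetical, and the wiring follows
Airflow's classic PythonOperator API -- a starting point, not a drop-in:

    # pip install apache-airflow google-cloud-bigquery
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator
    from google.cloud import bigquery

    # Of the users active 7-14 days ago, what share came back in the last 7 days?
    RETENTION_SQL = """
        WITH prev AS (
          SELECT DISTINCT user_id FROM `analytics.events_denorm`
          WHERE event_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 14 DAY)
                               AND TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
        ),
        curr AS (
          SELECT DISTINCT user_id FROM `analytics.events_denorm`
          WHERE event_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
        )
        SELECT COUNT(DISTINCT c.user_id) / COUNT(DISTINCT p.user_id) AS wow_retention
        FROM prev p LEFT JOIN curr c ON p.user_id = c.user_id
    """

    def run_retention():
        rate = list(bigquery.Client().query(RETENTION_SQL).result())[0].wow_retention
        print("week-over-week retention: %.1f%%" % (rate * 100))

    dag = DAG(
        dag_id="retention_weekly",
        default_args={"owner": "analytics", "retries": 1,
                      "retry_delay": timedelta(minutes=10)},
        start_date=datetime(2017, 1, 1),
        schedule_interval="@weekly",
    )

    PythonOperator(task_id="compute_retention",
                   python_callable=run_retention,
                   dag=dag)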

\- For visualization and charting, I know of two good open-source tools:

      - Redash.io
      - http://airbnb.io/superset/

One thing you mentioned that might be hard to do is joining across keys from
each source. Analytical datastores are not very good at that: tables are
partitioned, and joining them requires that you plan ahead what you will be
joining on and partition your tables accordingly. As for joining records in
your ETL process, I'm not sure there are easy ways to implement it (let me
know if you find one, we are working on it too :) ). You can look at Kafka
Streams or Apache Storm as a starting point; a rough sketch of the idea is
below.
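
The gist of a streaming join, independent of framework: buffer records from
each source keyed by the join key, and emit a combined record once both sides
have arrived. A toy illustration in plain Python (the "clicks" and "meta"
sources are hypothetical; real systems add windowing, eviction, and
persistence):

    from collections import defaultdict

    # Per-key buffers holding whichever side of the join arrived first.
    pending = defaultdict(dict)

    def emit(row):
        print("joined:", row)  # e.g. hand off to the warehouse load queue

    def on_record(source, key, record):
        """Buffer a record; emit a joined row once both sides are present."""
        pending[key][source] = record
        sides = pending[key]
        if "clicks" in sides and "meta" in sides:
            joined = {"user_id": key, **sides["clicks"], **sides["meta"]}
            del pending[key]
            emit(joined)

    # Records can arrive from either stream, in any order.
    on_record("clicks", "u42", {"event": "signup"})
    on_record("meta", "u42", {"country": "NL"})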

