
What we learned from building custom analytics over Amazon Redshift - itamarwe
https://www.alooma.com/blog/custom-analytics-amazon-redshift
======
LukaAl
Probably the biggest challenge I had in my last startup was setting up the
analytics platform. Obviously building an analytics platform wasn't the top
priority (probably the last).

So you put Google Analytics on your website, then you need the tracking code
from Facebook for the ads. And you add Flurry because it provides different
analytics (e.g. funnels and cohort analysis). We also logged in the DB some
actions that were difficult to capture in analytics. Then you need mobile ads,
so you add to your app the framework to track installs. And, I forgot, you
have another platform to track your mailing lists.

The final result: for every business question you have data in at least 3
different platforms that give you three different answers. Even worse, while
you are still struggling with the product, your "hard tech" co-founder can't
understand why you push to re-implement the analytics framework (not
necessarily building it in house), while your business co-founder decides not
to ask questions and doesn't understand why you want better data. If it looks
like a recipe for disaster, it's because it is.

We really need a better analytics platform: as easy to use as Google Analytics
and Mixpanel at the beginning, but with the ability to build custom analyses
for the growing business.

~~~
jon-wood
I'm a big fan of [Segment](https://segment.com) who do
a great job of putting an abstraction layer on top of the massive selection of
analytics platforms you're likely to end up using. A particularly nice touch
is some of their plans having a feature where you can replay your entire event
history into new analytics tools you add, allowing you to almost immediately
start analysing all your data with them.

~~~
LukaAl
That's not quite what I have in mind. It solves part of the issue, the
maintainability of 10+ integrations, but that is not really the biggest pain
point.

The biggest pain point is having 10 different services that give you roughly
the same answer, but:

1) they cover different needs and you need a patchwork to customize them to
what you need (and when you are small you don't even necessarily know what you
need, so the patchwork is bigger than needed).

2) they measure things in slightly different ways, making comparisons
difficult.

3) they have vastly different interfaces, thus requiring a lot of time to
learn them (I like learning new things, but my time is valuable and better
spent on product features).

4) they create a mess in your code. This is the problem Segment tries to
solve. It is relevant but not top priority. Actually, I'm not totally sure
that inserting a level of abstraction doesn't make points 1 and 2 harder to
solve (it's not a criticism, just a real problem I faced when I tried to
abstract the analytics calls in the app with a few functions).

~~~
WCJohnson11
Segment actually provides you access to the raw data, so if you layer a BI
tool on top of it, you have solved your main pain point.

------
greggyb
This is not, as billed, about building a custom analytics solution.

This is SQL as BI, which is great for small technical teams, but hardly a good
or scalable solution.

If anything, this is a vague argument to have a data warehouse. Data
warehousing is only an intermediate step in any analytics or BI solution and
the hard problems are not in moving data or writing one-off queries.

~~~
itamarwe
What kind of solution would you suggest? What is missing?

~~~
greggyb
BI or an analytics solution would be more all-encompassing than an ETL tool
(Alooma) and a database platform (Redshift).

ETL is much more than just moving data around. I haven't dug too deeply into
Alooma's offerings, but it does appear to have workflows for more than just
data from system A to destination B. Some of the important things to look at
in ETL are change-tracking for slowly changing dimensions, business logic
transformations, and mashing up data from disparate sources against conformed
dimensions. Master Data Management is its own practice, but is intimately
involved in ETL. Some ETL tools include Alooma, SSIS, Kettle.

Beyond ETL there is the data warehouse platform, which usually consists of an
RDBMS, sometimes with an OLAP layer as well. OLAP is distinguished from
something like Redshift, which is a columnstore RDBMS, by including a metadata
layer with the data. MDX is probably the granddaddy of OLAP languages. Think
something that a pivot table can speak to directly. The data warehouse can
really be any RDBMS product.

At the data warehouse layer, there's much more than just flattened source
tables. A good deal of effort needs to go into the dimensional modelling for
end-user consumption and reporting. We could take an aside for Inmon vs
Kimball here, but it's beside the point, because Inmon recommends dimensional data
marts for end user consumption, just like Kimball - the data warehouse is
never exposed in the Inmon methodology and dimensional data modelling is
required for either.

If we are utilizing an OLAP engine, then there is a lot of measure definition
to be done, encapsulating a lot of the logic that is displayed in the single-
purpose queries of the article's examples. The data model will be based on the
dimensional model from the data warehouse, but there's a lot more metadata in
an OLAP database, which helps make self-service reporting much easier. Many
more people are comfortable building a pivot table than even a simple SELECT
statement in SQL. Some OLAP engines are SSAS and Mondrian.

Regardless of whether we have an OLAP solution, we need a presentation layer
(typically several, as there are different types of reporting needs, and
different strata of sophistication among end users). Tableau, Qlik,
Quicksight, Cognos, and Power BI (all of which purport to cover varying
degrees of the data ETL and modelling process in addition to providing a
presentation layer), plus Crystal Reports and SSRS, are just some samples of
many in the commercial space (I'm not as familiar with open source
presentation layers).

This is just a high level overview of the major components of a BI or
analytics solution, and fairly barebones. The article spoke to ETL a little,
and at best hit tangentially on data warehousing and presentation.

------
melted
Was going to suggest Google BigQuery (faster, cheaper, more scalable) but then
looked at the rest of your workflow, and it looks like ETL would get more
complicated, and the visualization options wouldn't be as extensive (although
BQ does seem to support Tableau). On balance, not worth the hassle.

Looks pretty neat, job well done!

~~~
fhoffa
Hi melted! I have to agree with you that BigQuery is probably faster, cheaper
and more scalable than Redshift for this and other use cases. Especially given
the "join multiple datasets" angle.

But why would you say that ETL would be more complicated? Especially since
BigQuery is the best way to get raw Google Analytics data (for premium
customers).

Re: Visualization options - All of the options mentioned by Itamar work well
with BigQuery, except one (Quicksight) - and I know for sure how much re:dash,
Mode, Looker and Tableau love BigQuery.

Anyways, great article - and I love the real use case queries.

~~~
melted
Because the actual analytics (the part that BigQuery provides) is maybe 20% of
this solution, and judging by the slides their ETL is very easy to use. What
would I even use on Google Cloud to do ETL? Dataflow? JavaScript UDFs?
Something else? All of that seems clunky compared to what these guys are
offering. And they have a bunch of data sources available "out of the box"
that would be a hassle to deal with manually.

Another issue with BigQuery seems to be unpredictability of cost. One typo
somewhere and you can easily run up a bill in tens of thousands of dollars
because your dashboard isn't caching something. In a similar situation
Redshift will merely get slow.

~~~
fhoffa
Clunky ETL: Please correct me if I'm wrong, but what I saw about ETL in the
article was "One example we encounter quite often is that Mixpanel stores
timestamps in seconds, while Redshift expects timestamps in milliseconds" -
that kind of transformation I would much rather run inside BigQuery in a
couple of seconds than put through a whole pipeline. Other things I could do
outside, just as they are doing now - but I didn't see any Redshift-specific
advantages for transformations?
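
For concreteness, the seconds-to-milliseconds fix quoted from the article can be sketched in a few lines (a hypothetical helper, not code from the article or either product):

```python
# Hypothetical sketch: normalizing Mixpanel-style timestamps (epoch
# seconds) into the epoch-millisecond values a warehouse column expects.
def to_millis(event):
    """Return a copy of the event with its 'time' field converted
    from seconds to milliseconds."""
    fixed = dict(event)
    fixed["time"] = int(event["time"]) * 1000
    return fixed

events = [{"event": "signup", "time": 1461110400}]
print([to_millis(e) for e in events])
```

Whether this runs inside the warehouse as SQL or in a pipeline step before loading, the transformation itself is this trivial; the debate above is about where it should live.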

Cost: Cost should be way below other solutions - and to prevent problems
BigQuery now has cost controls at the user and project levels:
[https://cloud.google.com/bigquery/cost-controls](https://cloud.google.com/bigquery/cost-controls)

Thanks for your comments!

------
supermatt
Nothing here about "what you learned". This is a straight up tutorial on how
to use your product!

~~~
yairwein
A lot of this does showcase how to use Alooma's product, true.

However, I hope that the SQL queries provided would be helpful to anyone who
would like to build it on their own.

FD: I am Alooma's CTO

~~~
supermatt
So, what did you learn? That bit is completely missing from the article.
Surely not how to write a (very simple) SQL query?

~~~
itamarwe
Which data warehouse to choose, what the different visualization tools
available are, which events to send, how to set up Redshift, how to choose the
keys for the tables, and yes, how to recreate popular dashboards with simple
SQL queries.

Of course, we also have more complex queries for analysis, which we will share
in a future post.

------
bourneagain
It's actually hilarious that your SDKs are just modified versions of
Mixpanel's...

[https://www.alooma.com/docs/integration/ios-sdk](https://www.alooma.com/docs/integration/ios-sdk)

~~~
itamarwe
Makes integration a breeze :)

------
sandGorgon
So if I use the Alooma Android SDK, I can push event data straight into Redshift?

I'm not very clear about how to use Alooma.

~~~
itamarwe
Yes, exactly. You'll push events from Alooma's Android SDK and they'll land in
your Redshift in seconds. You can have a look at the documentation here:
[https://www.alooma.com/docs/integration/android-sdk](https://www.alooma.com/docs/integration/android-sdk)

~~~
buremba
I couldn't really understand how you replaced Mixpanel and other analytics
services with Redshift. AFAIK Redshift is optimized for bulk inserts, so the
micro-batch approach (every few seconds) is not optimal and in fact quite
expensive. Also, the data in Mixpanel is not structured, and if you're using
JSON in Redshift, you can't take advantage of the columnar storage format,
which is actually the main point of using Redshift. Care to explain how you
solve these issues?

~~~
yairwein
Hi, Alooma uses a batch size appropriate to the data volume: the bigger the
data volume, the bigger the batch size.

Alooma does not load into a JSON column; it provides a mapping UI and a
transformation layer to structure and normalize the data into appropriate
columns, thus leveraging the columnar storage properties.
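
As a rough illustration of that mapping step (a hypothetical sketch, not Alooma's actual implementation; the mapping table and field names are invented):

```python
# Illustrative sketch: flattening a semi-structured event into fixed
# columns according to a mapping, so each field lands in its own
# warehouse column instead of a single JSON blob.
MAPPING = {
    "event": "event_name",
    "properties.user_id": "user_id",
    "properties.plan": "plan",
}

def get_path(obj, dotted):
    # Walk a dotted path like "properties.user_id" into nested dicts.
    for key in dotted.split("."):
        obj = obj.get(key) if isinstance(obj, dict) else None
        if obj is None:
            return None
    return obj

def flatten(event):
    # Produce a flat row dict keyed by target column names.
    return {col: get_path(event, path) for path, col in MAPPING.items()}

row = flatten({"event": "upgrade",
               "properties": {"user_id": 42, "plan": "pro"}})
print(row)  # {'event_name': 'upgrade', 'user_id': 42, 'plan': 'pro'}
```

Once events are rows of typed columns like this, the columnar engine can compress and scan each column independently, which is the property buremba was asking about.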

FD: I am Alooma's CTO

