
Building Analytics at 500px - emre
https://medium.com/@samson_hu/building-analytics-at-500px-92e9a7005c83?hn
======
snotrockets
A free (as in speech and beer) alternative to Periscope.io would be
[http://redash.io](http://redash.io). You’ll have to host it yourself, but it
supports more than just SQL, and is centered around the idea of sharing
queries, not reports, which helps shift the discussion from “how did you get
those numbers” to “what do those numbers mean”.

~~~
arikfr
re:dash author here if anyone has any questions.

Our company (EverythingMe) had a very similar journey to the one Samson
describes (Splunk, MySQL, logs, ... -> Redshift). At the time Periscope wasn't
there (or we didn't find it), so we built re:dash instead.

------
garthz
Airflow by airbnb is a nice alternative to luigi. We've been using it for our
ETL and it's been working well so far.
[http://airbnb.io/projects/airflow/](http://airbnb.io/projects/airflow/)

~~~
mycelium
Can you write a little more about your evaluation of the two?

We've been using the python celery task queue library to coordinate python
ETL-esque jobs, but as our setup gets more heterogeneous language- and job-wise,
we're looking for something built more explicitly for the purpose. Luigi and
Airflow are both contenders.

~~~
garthz
I think Luigi doesn't have a central scheduler which allows you to run tasks
hourly/daily. In the docs it suggests that you can use cron to trigger tasks
periodically. This was the major reason why we didn't pick Luigi. Airflow
provides that capability and also has a nice UI where you can manage the DAGs
(visualize the task instances, set task states, etc). It's a young project and
the codebase is clean and easy to understand. Airbnb is also actively
developing on it as far as I know.
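The central idea both tools share can be sketched in plain Python (no Luigi or Airflow required): tasks declare their upstream dependencies, and a scheduler walks the DAG in dependency order. The task names below are made up for illustration, and the sketch skips cycle detection and actual scheduling.

```python
# Toy DAG runner illustrating the core idea behind Luigi/Airflow:
# tasks declare upstream dependencies and run in dependency order.
# (No cycle detection -- just the happy path.)

def run_dag(deps):
    """deps maps task -> list of upstream tasks; returns execution order."""
    order, done = [], set()

    def visit(task):
        if task in done:
            return
        for upstream in deps.get(task, []):  # run prerequisites first
            visit(upstream)
        done.add(task)
        order.append(task)

    for task in deps:
        visit(task)
    return order

# Hypothetical ETL pipeline: extract, load, then report.
deps = {
    "extract_logs": [],
    "load_redshift": ["extract_logs"],
    "build_report": ["load_redshift"],
}
print(run_dag(deps))  # ['extract_logs', 'load_redshift', 'build_report']
```

What Airflow adds on top of this ordering is the scheduler that triggers the walk hourly/daily and the UI for inspecting each run.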

~~~
mycelium
Thank you for your response!

------
akoumjian
One of my main takeaways from this article is that there is no such thing as
easy analytics. In particular, you have to put thought and energy into
designing your analytics data stores and your ETL process.

It's a bit of a relief to see this after wondering why I've been unable to
find an analytics solution that felt complete.

------
georgewfraser
For anyone setting up a similar system, our product
[https://fivetran.com](https://fivetran.com) automates the most annoying part
of this: getting the data into Redshift! We support MySQL, Postgres,
Salesforce, and lots of other data sources.

~~~
Dramatize
Sorry if I'm getting this mixed up.

Snowplow is shown on your pricing page. Doesn't Snowplow already connect
directly with Redshift? Or was Fivetran built before Snowplow finished their
new setup?

~~~
georgewfraser
Yes this is a common point of confusion. There are three options:

1\. Run your own Snowplow collector that writes to your Redshift.

2\. Use Snowplow Inc's hosted collector that writes into your Redshift.

3\. Use Fivetran's Snowplow collector that writes to your Redshift.

We have customers running each of these configurations. In the case of #1 & 2,
they're just using us to sync other data into the same Redshift cluster as
their snowplow collector.

------
jboggan
Great post.

I didn't have this to read several months ago so I did end up writing my own
ETL solution. It has been fun though and now we're into the analytics phase
and using Chart.io - did you compare that to Periscope? It's a tad pricey but
the charts are pretty easy to generate using their query builder if you don't
want to get dirty with SQL (or have team members who don't have SQL skills
yet).

I do think that rolling your own ETL can be rewarding though - especially if
you are wrapping each attempt at an ETL process in a class and storing long
term data about the monthly/daily/hourly/irregular processes for internal
analysis, forecasting, bug-reporting, and providing fodder for visuals to sell
the rest of the team on what you're doing.

------
uberneo
A little bit more about Luigi would be handy. I believe the same can be
achieved using Pentaho Kettle, as that has a complete workflow structure and
is open source as well. However, are there any more examples or use cases of
Luigi?

------
omouse
This seems to echo something I said in another thread; the hackers are further
away from their users:

 _Even for technical people, having access to analytics is helpful. Being
informed about product metrics helps developers see the ‘why’ behind their
work — a key ingredient to high performance_

You mean that the technical people who actually implement the software are so
alienated from their users that they have to wait for a royal decree to spur
them to work?

I wonder how few of their ideas flow back up to the top.

I like the article; it's nice to see one use case of ETL. Too often people
directly hit the production database to get reports instead of ETLing into a
read-only db that won't affect site performance.

------
chrisacree
Is there a good guide to getting started using logfiles as the basis for
analytics? Best practices on using client vs server-side signals (and if both,
not double-counting), etc. I'm interested in learning more about it, but a
google search doesn't turn up anything relevant. Snowplow seems to be using a
setup like this, but not sure how they'd fit in with what the author was
discussing.

I'm using Django, but I imagine this kind of system would be mostly platform-
agnostic.

------
burnout1540
How much of this could be accomplished using something like Mixpanel? Honestly
curious.

I know a true ETL solution like this will be more powerful and flexible, but
how do you decide when it's worth the investment? Are there key use-cases or
reports that only this type of solution can provide? Is the main issue with
external analytics (like Mixpanel) that you need to know what data to collect
before you can report on it?

~~~
jdwyah
MixPanel is nice, but its support for multiple sources of data is not very
good. You can slice and dice events, but if you want to join that to your user
database, things get awkward/not-possible pretty quickly.

The second issue is that MixPanel is write-once, read-forever: if you misspell
something or change definitions, you end up in a bad place. These ETL
processes are much more capable of adapting and normalizing a schema over
time.
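As a minimal illustration of that last point (plain Python; the event names are invented for the example), an ETL layer can remap a misspelled or legacy event name on every load, something a write-once store can't do retroactively:

```python
# Sketch of the kind of schema fix an ETL layer makes easy: renaming
# misspelled/legacy events before they hit the warehouse. Names are
# hypothetical examples, not from any real schema.

CANONICAL_EVENTS = {
    "photo_uplaod": "photo_upload",   # old typo, remapped on every load
    "photoUpload": "photo_upload",    # legacy camelCase definition
}

def normalize(event):
    """Rewrite legacy/misspelled event names to their canonical form."""
    event = dict(event)  # don't mutate the caller's row
    event["name"] = CANONICAL_EVENTS.get(event["name"], event["name"])
    return event

rows = [{"name": "photo_uplaod", "user": 1}, {"name": "signup", "user": 2}]
print([normalize(r)["name"] for r in rows])  # ['photo_upload', 'signup']
```

Because the mapping runs on each ETL pass, fixing a definition is a one-line change followed by a backfill, rather than a permanently polluted event stream.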

------
gesman
Excellent writeup. I love Splunk for many reasons (built a custom security App
for Splunk as a pretty successful fraud detection tool at an enterprise
financial firm).

Did you ever look into the Splunk DB Connect app?

It allows you to use a structured DB (such as MySQL and others) as a data
source for Splunk. I haven't used it yet but am interested in feedback.

------
kristianp
There's a grammar error in the first sentence. "an quick" should be "a quick".

Great, detailed article though.

------
mrandrewandrade
Great post by Samson! I sometimes wish I could tell stories and explain like
he can.

------
uberneo
Can Airflow be used to load data from a file in the filesystem (not in Hive)
to MySQL? Also, instead of calling a Bash script, can we call a PHP script?
Any pointers will be helpful.

~~~
mistercrunch
A PythonOperator using the MySqlHook should do the trick. Oh and you can call
a PHP script from BashOperator.
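A rough sketch of the python_callable such a PythonOperator might wrap, assuming a tab-delimited file and a DB-API cursor (in Airflow the connection would come from MySqlHook's `get_conn()`; the file path and table name below are hypothetical):

```python
# Sketch of a loader function a PythonOperator could wrap. It takes any
# DB-API cursor, so it works with the cursor a MySqlHook connection
# provides. Table name "events" is a made-up example.

def load_file_to_mysql(path, cursor, table="events"):
    """Read a tab-delimited file and insert each line as one row."""
    with open(path) as f:
        for line in f:
            values = line.rstrip("\n").split("\t")
            placeholders = ", ".join(["%s"] * len(values))
            # Parameterized insert; MySQL's DB-API drivers use %s markers.
            cursor.execute(
                f"INSERT INTO {table} VALUES ({placeholders})", values
            )

# The PHP step would be a BashOperator, roughly:
#   BashOperator(task_id="php_step", bash_command="php /path/to/script.php")
```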

------
lenguyenthedat
Nice post! We went through a very similar process back at Zalora
([http://www.zalora.com](http://www.zalora.com)), building our own Data
Warehouse and Analytics function for over 300 internal users at the time.

Back when Redshift was in its initial beta release (2012), there were almost
no ETL / charting tools available for us, so we ended up building most of the
tools and libraries ourselves. A few of them were open source too:

\- [https://github.com/zalora/redsift](https://github.com/zalora/redsift) A
web-interface SQL tool for Redshift, letting users run queries and export
results to S3 (with an email alert to the user once done).

\- [https://github.com/zalora/postgresql-user-
manager](https://github.com/zalora/postgresql-user-manager) for managing user
privileges

\- [https://github.com/lenguyenthedat/aws-redshift-to-
rds](https://github.com/lenguyenthedat/aws-redshift-to-rds) for copying Tables
from Redshift to RDS (postgresql)

\- [https://github.com/zalora/kraken](https://github.com/zalora/kraken) a bit
similar to luigi or airflow.

Things have changed quite a lot since then; there are a lot of great solutions
to our problems that are either free or very cheap and production-ready:

\- redash.io
([https://github.com/EverythingMe/redash](https://github.com/EverythingMe/redash))
- web-interface SQL tool with visualization - FREE

\- redshift_console
([https://github.com/EverythingMe/redshift_console](https://github.com/EverythingMe/redshift_console))
- redshift ops tool - FREE

\- flydata ([https://www.flydata.com/resources/flydata-sync/sync-rds-
mysq...](https://www.flydata.com/resources/flydata-sync/sync-rds-mysql-to-
redshift/)) sync live data from MySQL to Redshift - subscription-based

\- dreamfactory
([https://aws.amazon.com/marketplace/pp/B00GXYDK18?sr=0-3&qid=...](https://aws.amazon.com/marketplace/pp/B00GXYDK18?sr=0-3&qid=1385831342361))
for providing REST API interface to your database (supports redshift and other
databases) - FREE

We didn't use Tableau in Zalora either (due to pricing and the number of users
that we have in-house), and ended up building our own customized data
dashboards with d3js and a few other different frameworks.

However, as long as you are ok with the price, Tableau is pretty good. It's
used widely in my current company
([http://commercialize.tv](http://commercialize.tv)):

\- You can minimize processing on the Tableau server by creating another data
mart layer from Redshift with your ETL tools / scripts and having Tableau
connect directly to it.

\- The visualization / chart creation process is pretty much straightforward.
The end result looks exceptional compared to other solutions that we have
tried.

\- They also have a really good and active community.

------
misthop
I thought this would be about analytics visualization dashboard with little
real estate (500px). Had no idea that 500px was a company name.

~~~
Splendor
They're an interesting company IMO. They've done a good job of 'gamifying'
photo sharing. I have at least one relative that's addicted to it and pays for
an annual subscription.

I'm not sure how well they're doing business-wise. I did notice that they
recently pivoted one aspect of their business. They used to facilitate users
selling prints of their works; now they've dropped that service and are
building a stock photo service instead.

~~~
detaro
What do you mean by "gamifying"?

~~~
Splendor
Instead of just letting people 'like' or 'favorite' users' photos, 500px
assigns a score to each photo, from 0-100. They have an algorithm that factors
in all the activity on a user's photo to assign that score. As the score
increases the photo moves to different sections of the site (Fresh, Upcoming,
Popular).

My relative posts a photo almost every day and is always trying to figure out
how to get a higher score; different photo subjects, posting at different
times of the day, etc.

------
danvoell
Nice Post!

------
palmeida
For anyone looking for a reasonably priced visualisation solution where you
don't need to write SQL: [http://www.viurdata.com](http://www.viurdata.com).
We will be launching support for Amazon Redshift very soon.

------
taylozac85
Nice post.

Looker ([http://looker.com](http://looker.com)) is an alternative to Periscope
that a ton of venture-backed tech companies are using (mostly in conjunction
with Redshift). Here are two posts from Buffer and SeatGeek on their stacks
using Redshift, Luigi and Looker. Buffer:
[https://overflow.bufferapp.com/2014/10/31/buffers-new-
data-a...](https://overflow.bufferapp.com/2014/10/31/buffers-new-data-
architecture/). SeatGeek: [http://chairnerd.seatgeek.com/building-out-the-
seatgeek-data...](http://chairnerd.seatgeek.com/building-out-the-seatgeek-
data-pipeline/)

Looker does not require every member of the team to know SQL to explore the
data or create reports. It also has a text-based modeling language that is a
thin abstraction layer over SQL. It makes SQL modular and reusable, making it
far more efficient to support a wide range of analyses. It's more expensive
than Periscope, but it's way more powerful.

~~~
jonbishop
Hey Zack, it’s probably a good idea to disclose that you work at Looker if
you’re going to talk about them on HN.
[https://www.linkedin.com/profile/view?id=145490726](https://www.linkedin.com/profile/view?id=145490726)

I run marketing for Periscope. Like Chartio, you guys have a great product. We
routinely send people your way when they have more business user needs and
aren’t a good fit for Periscope.

For those evaluating tools like Periscope/Looker/Chartio:

Whereas Looker focuses on business users, we’ve built Periscope around serving
SQL analysts better than anyone else.

If you don't have someone who knows SQL on your team, Looker (or Chartio) is
probably a better fit. (though I should mention that some customers have
learned SQL through our customer support :)

We’re focused on the SQL analysts who are building dashboards and running
reports, day in, day out. With that said, every one of our customers has
employees who don’t know SQL, but need to dive into their data. So we’ve built
a ton of features to make this easy as well.

If you want to find out more, visit
[https://www.periscope.io](https://www.periscope.io). If Periscope looks like
a good fit, sign up and we'll get you set up with a free trial.

~~~
mistermann
Periscope question: "Periscope plugs directly into your databases...."

As I understand it, Periscope is cloud based, so could you elaborate on how it
might plug into large _locally hosted_ databases, or is the caveat that my
databases must also be in the cloud?

~~~
hglaser
(Periscope co-founder here.)

There needs to be a route from the internet to your DB, yes. Sometimes this
works because the DB is in the cloud, or sometimes a port or SSH tunnel is
opened into a local network.

~~~
mistermann
Ok, so you don't have to physically move your data to the cloud; it can reside
locally and just be queried on demand, right?

~~~
hglaser
yep

