Hacker News new | past | comments | ask | show | jobs | submit login
Building Analytics at 500px (medium.com/samson_hu)
173 points by emre on June 22, 2015 | hide | past | favorite | 42 comments

A free (as in speech and beer) alternative to Periscopre.io would be http://redash.io. You’ll have to host it yourself, but it supports more than just SQL, and is centered around the idea of sharing queries, not reports, which helps shifts the discussion from “how did you get those numbers” to “what do those numbers mean”.

re:dash author here if anyone has any questions.

Our company (EverythingMe) had a very similar journey to the one Samson describes (Splunk, MySQL, logs, ... -> Redshift). At the time Persicope wasn't there (or we didn't find it), so we built re:dash instead.

It also has a nice forking model that lets you take other people's work and extend it for your needs, which is super useful if like me you're not part of the analytics team and don't remember the structure of the analytics tables by heart. You just take a report that resembles what you want, and fork it.

Airflow by airbnb is a nice alternative to luigi. We've been using it for our ETL and it's been working greatly so far. http://airbnb.io/projects/airflow/

Can you write a little more about your evaluation of the two?

We've been using the python celery task queue library to coordinate python ETL-esque jobs, but as our setup gets more heterogenous language and job wise we're looking for something built more explicitly for the purpose. Luigi and Airflow are both contenders.

I think Luigi doesn't have a central scheduler which allows you to run tasks hourly/daily. In the docs it suggests that you can use cron to trigger tasks periodically. This was the major reason why we didn't pick Luigi. Airflow provides that capability and also has a nice UI where you can manage the DAGs (visualize the task instances, set task states, etc). It's a young project and the codebase is clean and easy to understand. Airbnb is also actively developing on it as far as I know.

Thank you for your response!

One of the major difference is how tasks are defined. In Luigi, you have to derive a base task class to create a task. In Airflow you instantiate a task by calling an operator (you can think of an operator as a task factory). If generating tasks dynamically is important to you, Airflow is a better option in that regard since you'd have to do meta-programming on the Luigi side.

Data engineering is moving in a direction where pipelines are generated dynamically. "Analysis automation", "analytics as a service", "ETL frameworks" require dynamic pipeline generation. Providing services around aggregation, A/B testing and experimentation, anomaly detection and cohort analysis require metadata-driven pipeline generation.

Airflow is also more state aware where the job dependencies are kept of for every run independently and stored in the Airflow metadata database. The fact that this metadata is handled by Airflow makes it much easier to say- rerun a task and every downstream tasks from it for a date range. You can perform very precise surgery on false positive/ false negative and rerun sub sections of workflow in time easily.

I'm not sure if this would fit your needs yet, but Pachyderm.io might be interesting if language agnostic is meaningful to you and you're using containers. Disclosure: I'm one of the founders and we don't have any of the UI features of Luigi or Airflow yet.

One of my main takeaways from this article is that there is no such thing as easy analytics. In particular, you have to put though and energy into designing your analytics data stores and your ETL process.

It's a bit of a relief to see this after wondering why I've been unable to find an analytics solution that felt complete.

For anyone setting up a similar system, our product https://fivetran.com automates the most annoying part of this: getting the data into Redshift! We support MySQL, Postgres, Salesforce, and lots of other data sources.

yeaaaaaahh but is that really the most annoying? dimensional modeling, getting all the joins right, scripting all the transforms (the T in ETL) is pretty annoying. Getting data into redshift is like, just the beginning. Not to mention, skips the transform part, at least.

Sorry if I'm getting this mixed up.

Snowplow is shown on your pricing page. Doesn't Snowplow already connect directly with Redshift? Or was Fivetran built before Snowplow finished their new setup?

Yes this is a common point of confusion. There are three options:

1. Run your own Snowplow collector that writes to your Redshift. 2. Use Snowplow Inc's hosted collector that writes into your Redshift. 3. Use Fivetran's Snowplow collector that writes to your Redshift.

We have customers running each of these configurations. In the case of #1 & 2, they're just using us to sync other data into the same Redshift cluster as their snowplow collector.

Side note - you have caching disabled on the html, css and images of the site, any reason why?

How do you deal with events with varying schema?

Assume that any one event type has a consistent schema and create a separate table with the custom properties for each event type.

Great post.

I didn't have this to read several months ago so I did end up writing my own ETL solution. It has been fun though and now we're into the analytics phase and using Chart.io - did you compare that to Periscope? It's a tad pricey but the charts are pretty easy to generate using their query builder if you don't want to get dirty with SQL (or have team members who don't have SQL skills yet).

I do think that rolling your own ETL can be rewarding though - especially if you are wrapping each attempt at an ETL process in a class and storing long term data about the monthly/daily/hourly/irregular processes for internal analysis, forecasting, bug-reporting, and providing fodder for visuals to sell the rest of the team on what you're doing.

Little bit more about luigi would be handy .. I believe same can be achieved using Pentaho Kettle as that has a complete workflow structure and its Opensource as well .. however any more examples or use cases of luigi?

This seems to echo something I said in another thread; the hackers are further away from their users:

Even for technical people, having access to analytics is helpful. Being informed about product metrics helps developers see the ‘why’ behind their work — a key ingredient to high performance

You mean that the technical people who actually implement the software are so alienated from their users that they have to wait for a royal decree to spur them to work?

I wonder how few of their ideas flow back up to the top.

I like the article; it's nice to see one use case of ETL. Too often people directly hit the production database to get reports instead of ETLing into a read-only db that won't affect site performance.

Is there a good guide to getting started using logfiles as the basis for analytics? Best practices on using client vs server-side signals (and if both, not double-counting), etc. I'm interested in learning more about it, but a google search doesn't turn up anything relevant. Snowplow seems to be using a setup like this, but not sure how they'd fit in with what the author was discussing.

I'm using Django, but I imagine this kind of system would be mostly platform-agnostic.

How much of this could be accomplished using something like Mixpanel? Honestly curious.

I know a true ETL solution like this will be more powerful and flexible, but how do you decide when it's worth the investment? Are there key use-cases or reports that only this type of solution can provide? Is the main issue with external analytics (like Mixpanel) that you need to know what data to collect before you can report on it?

MixPanel is nice, but its support for multiple sources of data is not very good. You can slice and dice events, but if you want to join that to your user database, things get awkward/not-possible pretty quickly.

The 2nd thing is that MixPanel is write-once, read-forever. aka if you misspell something or change definitions you end up in a bad place. These ETL processes are much more capable of adapting and normalizing a schema over time.

I recently worked on a similar project (with Privy.com), where Mixpanel was one of the external providers we tracked with. Mixpanel handles (AFAIK) an arbitrary number of properties. It was a lot easier than say Google Analytics, where custom dimensions are limited.

Excellent writeup. I love Splunk for many reasons (built a custom security App for Splunk as a pretty successful fraud detection tool at an enterprise financial firm).

Did you ever looked into Splunk DB connect app?

It allows to use structured DB (such as MySQL and others) as a data source for Splunk. I haven't used it yet but interested in feedback.

There's a grammar error in the first sentence. "an quick" should be "a quick".

Great, detailed article though.

Great by post by Samson! I sometime wish I could tell stories and explain like he can

Can Airflow be used to load data from a file in filesystem( not in Hive ) to MySql .. also instead of calling a Bashscript can we call a PHP script.. Any pointers will be helpful.

A PythonOperator using the MySqlHook should do the trick. Oh and you can call a PHP script from BashOperator.

Nice post! We went through a very similar process back in Zalora (http://www.zalora.com) building our own Data Warehouse and Analytics function for over 300 internal users at the time.

Back when Redshift was on its initial beta release (2012), there almost wasn't any ETL / charts tool available for us, so we ended up building most of the tools and libraries ourselves. A few of them were open source too:

- https://github.com/zalora/redsift A web-interface SQL tool for Redshift, letting the user to query and also export them to S3 (and send an email alert to user once done.)

- https://github.com/zalora/postgresql-user-manager for managing user privileges

- https://github.com/lenguyenthedat/aws-redshift-to-rds for copying Tables from Redshift to RDS (postgresql)

- https://github.com/zalora/kraken a bit similar to luigi or airflow.

Things has changed quite a lot since then, there are a lot of great solutions to our problems that are either free or very cheap and production-ready:

- redash.io (https://github.com/EverythingMe/redash) - web-interface SQL tool with visualization - FREE

- redshift_console (https://github.com/EverythingMe/redshift_console) - redshift ops tool - FREE

- flydata (https://www.flydata.com/resources/flydata-sync/sync-rds-mysq...) sync live data from MySQL to Redshift - Subscription base

- dreamfactory (https://aws.amazon.com/marketplace/pp/B00GXYDK18?sr=0-3&qid=...) for providing REST API interface to your database (supports redshift and other databases) - FREE

We didn't use Tableau in Zalora either (due to pricing and the number of users that we have in-house), and ended up building our own customized data dashboards with d3js and a few other different frameworks.

However, as long as you are ok with the price, Tableau is pretty good. It's being use widely in my current company (http://commercialize.tv) :

- You can minimize processing from Tableau server by just create another data mart layer from Redshift with your ETL tools / scripts, having Tableau connecting directly to it.

- Visualizations / Charts creation process is pretty much straightforward. The end-result will look exceptionally comparing to other solutions that we have tried.

- They also have a really good and active community.

I thought this would be about analytics visualization dashboard with little real estate (500px). Had no idea that 500px was a company name.

They're an interesting company IMO. They've done a good job of 'gamifying' photo sharing. I have at least one relative that's addicted to it and pays for an annual subscription.

I'm not sure how well they're doing business-wise. I did notice that they recently pivoted one aspect of their business. They used to facilitate users selling prints of their works; now they've dropped that service and are building a stock photo service instead.

What do you mean by "gamifying"?

Instead of just letting people 'like' or 'favorite' users' photos, 500px assigns a score to each photo, from 0-100. They have an algorithm that factors in all the activity on a user's photo to assign that score. As the score increases the photo moves to different sections of the site (Fresh, Upcoming, Popular).

My relative posts a photo almost everyday and is always trying to figure out how to get a higher score; different photo subjects, posting at different times of the day, etc.

Nice Post!

For anyone looking for a visualisation solution reasonably priced were you don't need to write SQL http://www.viurdata.com We will be launching support for Amazon Redshift very very soon.

Nice post.

Looker (http://looker.com) is an alternative to Periscope that a ton of venture-backed tech companies are using (mostly in conjunction with Redshift). Here are two posts from Buffer and SeatGeek on their stacks using Redshift, Luigi and Looker. Buffer: https://overflow.bufferapp.com/2014/10/31/buffers-new-data-a.... SeatGeek: http://chairnerd.seatgeek.com/building-out-the-seatgeek-data...

Looker does not necessitate each member of the team to know SQL to explore the data or create reports. It also has a text-based modeling language that is a thin abstraction layer of SQL. It makes SQL modular and reusable making it far more efficient to support a wide range of analysis. It's more expensive than Periscope, but it's way more powerful.

Hey Zack, it’s probably a good idea to disclose that you work at Looker if you’re going to talk about them on HN. https://www.linkedin.com/profile/view?id=145490726

I run marketing for Periscope. Like Chartio, you guys have a great product. We routinely send people your way when they have more business user needs and aren’t a good fit for Periscope.

For those evaluating tools like Periscope/Looker/Chartio:

Whereas Looker focuses on business users, we’ve built Periscope around serving SQL analysts better than anyone else.

If you don't have someone who knows SQL on your team, Looker (or Chartio) is probably a better fit. (though I should mention that some customers have learned SQL through our customer support :)

We’re focused on the SQL analysts who are building dashboards and running reports, day in, day out. With that said, every one of our customers has employees who don’t know SQL, but need to dive into their data. So we’ve built a ton of features to make this easy as well.

If you want to find out more, visit https://www.periscope.io. If Periscope looks like a good fit, sign up and we'll get you set up with a free trial.

Periscope question: "Periscope plugs directly into your databases...."

As I understand it, Periscope is cloud based, so could you elaborate on how it might plug into large locally hosted databases, or is the caveat that my databases must also be in the cloud?

(Periscope co-founder here.)

There needs to be a route from the internet to your DB, yes. Sometimes this works because the DB is in the cloud, or sometimes a port or SSH tunnel is opened into a local network.

Ok, so you don't have to physically move you data to the cloud, it can reside locally and just be queried on demand, right?


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact