The Twitch Statistics Pipeline (ossareh.posthaven.com)
76 points by dayjah on April 4, 2014 | 12 comments



Of note: Twitch's Data Science team is hiring their 5th engineer http://science.twitch.tv/

I'm a designer who worked with Mike on a short project for the science team. The four of them are a huge pleasure to work with :) Very scrappy and analytical when it's time to get things done, but playful and generally funny fellows to be around. There are interesting problems to work on when you're #4 in US peak traffic, after much bigger players like Netflix, Google, and Apple (and an underdog story, beating out Hulu, FB, and Valve) [1]

Twitch's office is very thoughtfully designed with fan art and game-themed interior and murals. It's a nice testament to the community that they're built around. The office manager, Ashley, has done an equally thoughtful job with amenities -- makes it easy to be productive.

[1] http://s.jtvnw.net/jtv_user_pictures/hosted_images/wallstree...


Any love for interns?


won't know if you don't ask : ) datascience@twitch.tv


"Our clients send base64 encoded JSON objects to our edge server using HTTP GETs."

Encoded JSON in GET parameters? That's not what GET requests are for. Use POST requests for that, or you'll quickly be limited by the max size of a GET request, somewhere around 8KB.


Yes, totally agree on that part! In fact, we've had bad experiences with URLs over 4KB. Fortunately, most of our stats requests come in at about 1KB.

We're in the process of moving over to POSTs. The second part of this series will go into more detail as to the "whys". The primary reason we use GETs is backwards compatibility; we wanted our new team to be a success, and to ensure that we thought carefully about which battles were worth fighting. The Mixpanel stats client is a good one, so we opted to be "Mixpanel Protocol"-compatible; they ship base64-encoded JSON blobs using an HTTP GET, and so do we.
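
For context, the request style described here is just a JSON event that gets base64-encoded and stuffed into a GET query parameter. A minimal Python sketch, assuming a hypothetical endpoint and event fields (not Twitch's actual schema):

    import base64
    import json
    import urllib.parse
    import urllib.request

    # Hypothetical event payload; the "event + properties" shape mirrors the
    # Mixpanel-style blob described above, not Twitch's real schema.
    event = {
        "event": "video_play",
        "properties": {
            "channel": "somechannel",
            "quality": "source",
            "time": 1396600000,
        },
    }

    # base64-encode the JSON blob and ship it as a GET query parameter.
    blob = base64.b64encode(json.dumps(event).encode("utf-8")).decode("ascii")
    url = "https://stats.example.com/track?data=" + urllib.parse.quote(blob)

    # URLs much past the ~4-8KB range start breaking, which is the limit
    # discussed in the comment above.
    print(len(url), "bytes")
    urllib.request.urlopen(url)  # fire-and-forget GET to the (hypothetical) edge server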


If you are sending over 8KB of JSON, something is fucked


I could see a bulk metric insert [e.g. a modified version of StatsD] being >8KB.
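
As a rough illustration of how a bulk insert blows past that limit, here's a hypothetical sketch; the metric names and batch shape are made up for the example:

    import base64
    import json

    # Illustrative only: a StatsD-like bulk insert where many counters are
    # batched into a single request.
    batch = [
        {"metric": f"video.playback.buffer_empty.region{i}", "type": "count", "value": 1}
        for i in range(200)
    ]

    payload = base64.b64encode(json.dumps(batch).encode("utf-8"))
    print(len(payload))  # a couple hundred metrics already lands well past 8KB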


The post says that latency is an issue. In this case I would look at using Kinesis (AWS's hosted Kafka equivalent) which can barf data directly into Redshift (as I understand it; haven't used this functionality.)
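
For the producer side of that idea, here's a minimal, hypothetical sketch of pushing events into a Kinesis stream with boto3; the stream name, region, and event shape are assumptions, and the Kinesis-to-Redshift load would happen downstream:

    import json
    import boto3

    # Minimal sketch: write one event to a (hypothetical) Kinesis stream.
    kinesis = boto3.client("kinesis", region_name="us-east-1")

    event = {"event": "minute_watched", "channel": "somechannel", "time": 1396600000}

    kinesis.put_record(
        StreamName="stats-events",            # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["channel"],        # spread load across shards by channel
    )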


Our latency concern is "can our product managers make decisions quickly?" To that end, we're OK with a 24-hour latency for stats. The fact that we're at a 5-hour latency gives us a lot of breathing room.

Kinesis looks really interesting, and is definitely something that we're going to look at once we work out our ETL process.

The order of priority for us has been:

1 - Get a pipeline up and running
2 - Make it robust
3 - Make it fast

Pipeline v3, our current one, satisfies (2). We expect to be working on (3) in the near future. ETL is the latter part of (2). (3) is what powers dashboards; we expect those to contain a lot of joined data, and a robust ETL process is pretty key to that.


Yes - Kinesis is awesome. This is already a little out of date but shows how we're now porting our open source event pipeline to on top of Kinesis: http://snowplowanalytics.com/blog/2014/02/04/snowplow-0.9.0-...


Kinesis was released in mid-November 2013 so this team's effort predates that service.

I think Kinesis could replace the first three boxes in their diagram, and do it in real-time. (I haven't used kinesis so I could be wrong.)

It's amazing how fast big data infrastructure is evolving. For anybody looking to build something it seems your chosen solution will be obsolete by the time you release.


Nice to see lots of similarities with how we do things at Snowplow (https://github.com/snowplow/snowplow)...



