
The Twitch Statistics Pipeline - dayjah
http://ossareh.posthaven.com/the-twitch-statistics-pipeline
======
donuts
Of note: Twitch's Data Science team is hiring their 5th engineer
[http://science.twitch.tv/](http://science.twitch.tv/)

I'm a designer that worked with Mike on a short project for the science team.
The 4 of them are a huge pleasure to work with :) Very scrappy and analytical
when it's time to get things done, but playful and generally funny fellows to
be around. Interesting problems to work on when they're #4 in the US's peak
traffic count, behind much bigger players like Netflix, Google, and Apple (and,
in an underdog story, beating out Hulu, FB, and Valve) [1]

Twitch's office is very thoughtfully designed with fan art and game-themed
interior and murals. It's a nice testament to the community that they're built
around. The office manager, Ashley, has done an equally thoughtful job with
amenities -- makes it easy to be productive.

[1]
[http://s.jtvnw.net/jtv_user_pictures/hosted_images/wallstree...](http://s.jtvnw.net/jtv_user_pictures/hosted_images/wallstreet_deepfield_graph.jpg)

~~~
nrmn
Any love for interns?

~~~
donuts
won't know if you don't ask : ) datascience@twitch.tv

------
jvehent
"Our clients send base64 encoded JSON objects to our edge server using HTTP
GETs."

Encoded JSON in GET parameters? That's not what GET requests are for. Use POST
requests for that, or you'll quickly be limited by the maximum size of a GET
request's URL, somewhere around 8KB.
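To make the size concern concrete, here's a minimal sketch (the endpoint and event fields are hypothetical, not Twitch's actual schema) showing how base64-encoding JSON into a query string inflates the payload and blows past a ~8KB URL limit that a POST body wouldn't hit:

```python
import base64
import json

# Hypothetical batch of tracking events; field names are illustrative.
events = [{"channel": "somestream", "event": "minute-watched", "n": i}
          for i in range(200)]
body = json.dumps(events).encode()

# The GET approach: base64 the JSON and stuff it into the query string.
# base64 inflates the payload by ~33%, so a ~8KB URL cap is hit even sooner.
qs = base64.b64encode(body).decode()
url = "https://edge.example.com/track?data=" + qs  # hypothetical endpoint

print(len(body), len(url))  # the URL is well past 8KB for a batch this size

# With POST, the same JSON rides in the request body, which has no such cap:
#   requests.post("https://edge.example.com/track", json=events)
```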

~~~
kcbanner
If you are sending over 8KB of JSON, something is fucked

~~~
opendais
I could see a bulk metric insert [e.g. a modified version of StatsD] being
>8KB.

------
noelwelsh
The post says that latency is an issue. In this case I would look at using
Kinesis (AWS's hosted Kafka equivalent), which can barf data directly into
Redshift (as I understand it; I haven't used this functionality).
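For context, Kinesis ingests events as records keyed by a partition key that picks the shard. A minimal sketch of batching events for it, assuming boto3 and a hypothetical stream name and event schema (the AWS call itself is left commented out):

```python
import json

# Hypothetical pipeline events; names are illustrative, not Twitch's schema.
events = [{"channel": f"stream-{i}", "minutes_watched": i} for i in range(5)]

# Kinesis PutRecords accepts up to 500 records per call; each record carries
# a partition key that determines which shard it lands on.
records = [
    {"Data": json.dumps(e).encode(), "PartitionKey": e["channel"]}
    for e in events
]

# With boto3 this batch would be shipped like so (not run here):
#   import boto3
#   kinesis = boto3.client("kinesis")
#   kinesis.put_records(StreamName="stats-events", Records=records)
print(len(records))
```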

~~~
dayjah
Our latency concern is "can our product managers make decisions quickly?"; to
that end, we're OK with a 24-hour latency for stats. The fact that we're at
5-hour latency gives us a lot of breathing room.

Kinesis looks really interesting, and is definitely something that we're going
to look at once we work out our ETL process.

The order of priority for us has been:

1 - Get a pipeline up and running

2 - Make it robust

3 - Make it fast

Pipeline v3, our current one, satisfies (2). We expect to be working on (3) in
the near future. ETL is the latter part of (2). (3) will result in powering
dashboards; we expect those to contain a lot of joined data, and having a
robust ETL process is pretty key to that.

~~~
alexatkeplar
Yes - Kinesis is awesome. This is already a little out of date, but it shows
how we're now porting our open source event pipeline to run on top of Kinesis:
[http://snowplowanalytics.com/blog/2014/02/04/snowplow-0.9.0-...](http://snowplowanalytics.com/blog/2014/02/04/snowplow-0.9.0-released-with-beta-kinesis-support/)

------
alexatkeplar
Nice to see lots of similarities with how we do things at Snowplow
([https://github.com/snowplow/snowplow](https://github.com/snowplow/snowplow))...

