
Building Analytics at 500px - titanas
https://medium.com/@samson_hu/building-analytics-at-500px-92e9a7005c83
======
potatosareok
Thanks a lot for a detailed post. 20 GB of logs does not seem like much, though?
I'm not really sure of 500px's scale. Fully conceding that my company is
probably logging too much, I'm wondering what makes the ETL pipeline take 4
hours to run over that. Is there just no need to optimize it, since there's no
benefit to having these metrics in real time? Or am I missing some really
data-heavy part? Again, I'm not trying to be negative, just wondering.

Luigi seems cool. Can anyone comment on how it compares to Airflow or Spring
XD, or are those just different products?

Periscope looks like Kibana for relational stores, also cool to see.

Thanks again for the great post!

~~~
Wagthesam
It's not a lot of data, but the biggest constraints for me were cost and time
to implementation. You could actually get a memory-intensive Amazon server and
do it all in memory, but I didn't have those resources. ETL + a Redshift
server in the end cost me around $5000 a year, which is TINY compared to the
value we got out of it and the cost that most companies pay.

~~~
potatosareok
Makes sense; I'm used to physical servers with tons of RAM at work. If you
don't mind, and I assume you're on Amazon entirely then, what size instances,
and how many, are you using for the ETL (Luigi) nodes? And is there any
infrastructure involved besides S3 (MySQL dump, logs) => Amazon instances
(Luigi) => Redshift?

~~~
Wagthesam
That is all there is: S3, one Amazon instance for Luigi, which also holds the
MySQL read replica, and Amazon Redshift. There wasn't any heavy ETL in Luigi.
Luigi mostly just extracted/dumped data. All the heavy lifting was in EMR.

------
lexicography
This is such an awesome post. Samson shared the details of his data
engineering work at 500px and Wish during a Keen IO event last night. His
slides are here ->
[https://keen.io/blog/130230045601/analytics-startups-and-lau...](https://keen.io/blog/130230045601/analytics-startups-and-laughs-at-keen-hq)

------
jasoon
Great post, and a nicely detailed read. This also came up on HN a few months
ago
([https://news.ycombinator.com/item?id=9760606](https://news.ycombinator.com/item?id=9760606)),
with some interesting alternatives to some of the presented tech in the
comments.

------
hanklazard
Wow, this post is incredible--I can't thank the author enough for writing it!
It's so thorough and has so many pieces of wisdom to offer in building a
complete analytics solution from top to bottom. I'll be returning to this
often.

------
Wagthesam
Author here

I'll follow up and say that my first three months in SF/Bay area have been
amazing. This is truly the capital of technology here. The level of talent and
the one-ness of the community goes beyond anything that exists in Canada. I'm
excited to take what I learn back one day.

Email me at shanzhen.hu at gmail.com if you have any questions about the
process. I'm happy to help.

~~~
Jgrubb
Just wanted to say thanks for this writeup. I read it the last time on here
and it was definitely inspirational.

------
gedrap
Previous discussion 3 months ago:
[https://news.ycombinator.com/item?id=9760606](https://news.ycombinator.com/item?id=9760606)

------
pboutros
I really appreciate the amount of time - and honesty - that went into this
post. The author really sounds like they have their head firmly on their
shoulders. Thanks, Samson!

------
gricardo99
Very good post. So great to get this kind of detailed insight into the whole
process: requirements, trade-offs, architecture and implementation decisions,
impact across the whole business, and the end result and added value.

------
warbon123
It is quite an interesting post. I have been working in a scenario where the
company is moving away from a traditional Oracle-based BI stack, and this post
just highlights how opex for BI can be reduced drastically.

------
steveax
What a terrific article. Well organized, clear, just enough detail to support
the important points and on top of all that, a fun read. Thanks!

