
Facebook’s Petabyte Scale Data Warehouse using Hive and Hadoop - paulsb
http://www.infoq.com/presentations/Facebook-Hive-Hadoop
======
ajross
I don't have time to watch the whole thing right now, but I don't buy 12TB/day
for a second. Facebook has what, on the order of 1e8 users, doing a handful of
updates per day which are text snippets of a dozen bytes each. Not even close.

Maybe they're including things like posted image data? That's schemaless and
static, and not really a "data warehouse" problem as such. Someone enlighten
me, because this doesn't add up.
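
Spelled out, the back-of-envelope estimate looks roughly like the sketch below. The user count and the 12TB/day figure are the ones quoted above; "a handful" is taken to be 5 and "a dozen bytes" to be 12, both of which are assumptions for illustration, and decimal units are assumed throughout.

```python
# Back-of-envelope check of the status-update estimate.
# Assumed values: "a handful" = 5 updates, "a dozen bytes" = 12 bytes.
USERS = 1e8                     # "on the order of 1e8 users"
UPDATES_PER_USER = 5            # assumption, not a measured figure
BYTES_PER_UPDATE = 12           # assumption, not a measured figure
CLAIMED_BYTES_PER_DAY = 12e12   # 12 TB/day, decimal units assumed


def estimated_bytes_per_day(users, updates, size):
    """Daily volume if the only data were short text updates."""
    return users * updates * size


estimate = estimated_bytes_per_day(USERS, UPDATES_PER_USER, BYTES_PER_UPDATE)
ratio = CLAIMED_BYTES_PER_DAY / estimate
implied_per_user = CLAIMED_BYTES_PER_DAY / USERS

print(f"status updates only: {estimate / 1e9:.0f} GB/day")
print(f"claimed / estimated: {ratio:.0f}x")
print(f"implied by 12 TB/day: {implied_per_user / 1e3:.0f} KB per user per day")
```

Under those assumptions the text updates come to a few GB a day, so the claimed figure would need on the order of a hundred KB per user per day from some other source.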

~~~
patio11
Depending on the resolution of your log files, I could buy it.

I recently set BCC to log AJAX requests individually to try to reproduce an
issue in production, and forgot to turn that off after I was done. "Whoops."
Thank God for logrotate -- I was chewing through 300 MB a day, or roughly the
same on a per user basis as you just calculated, from a site which has two
pages which are heavily AJAXed.

If you're capturing context for actions for later analysis, that is going to
chew through space like craaaaaaazy. (For example, Facebook obviously doesn't
want people leaving the service. They'll show you photos of 10 friends to make
you stay. I suggest that if the leaving customer is male then the more ladies
you have in that lineup, to a point, the less likely he is to leave. If we
thought ahead when designing that feature, to capture the friend IDs of the
photos we show, we can have some analyst quickly knock out whether prior data
suggests that to be true prior to having engineering run an A/B test on it. So
that is an extra 10 numbers to keep track of for every access of that action.
Multiply that sort of contextual information times all the actions FB is
interested in and allow them to snapshot the state of your social graph so
that they're using the proper context for you back then, and they really do
have some hefty, hefty needs.)

~~~
sailormoon
Woah, not 300MB per day! At that rate, it'll only take 9 years to fill up a
1TB drive! Oh noes!

~~~
patio11
Or under three months to fill up my VPS' 20 GB allocation for disk storage and
cause further attempts by customers to create bingo cards to throw write
errors, bringing my business to a crunching halt.

Like I said, yay for logrotate.
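
Both fill-time figures in this exchange can be checked with a quick sketch. It assumes a constant 300 MB/day, decimal units, an otherwise empty disk, and no log rotation; the function name is made up for illustration.

```python
# How long a constant log volume takes to fill a disk.
# Assumes decimal units, an empty disk, and no rotation or compression.
LOG_BYTES_PER_DAY = 300e6   # 300 MB/day


def days_to_fill(capacity_bytes, bytes_per_day=LOG_BYTES_PER_DAY):
    """Days until capacity_bytes is consumed at a constant daily rate."""
    return capacity_bytes / bytes_per_day


tb_days = days_to_fill(1e12)    # 1 TB drive
vps_days = days_to_fill(20e9)   # 20 GB VPS allocation

print(f"1 TB drive: {tb_days:.0f} days (~{tb_days / 365:.1f} years)")
print(f"20 GB VPS:  {vps_days:.0f} days (~{vps_days / 30:.1f} months)")
```

The same rate that takes about nine years to fill the 1TB drive fills the 20 GB allocation in a little over two months.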

------
sailormoon
I heard WoW has 1.3PB online.

