
Facebook, Hadoop, and Hive - paulsb
http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/
======
clofresh
Pretty interesting article. My RSS reader caught the text before the site went
down:

A few weeks ago, I posted about a conversation I had with Jeff Hammerbacher of
Cloudera, in which he discussed a Hadoop-based effort at Facebook he
previously directed. Subsequently, Ashish Thusoo and Joydeep Sarma of Facebook
contacted me to expand upon and in a couple of instances correct what Jeff had
said. They also filled me in on Hive, a data-manipulation add-on to Hadoop
that they developed and subsequently open-sourced.

Updating the metrics in my Cloudera post,

Facebook has 400 terabytes of disk managed by Hadoop/Hive, with a slightly
better than 6:1 overall compression ratio. So the 2 1/2 petabytes figure for
user data is reasonable.
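The arithmetic behind that claim checks out (a quick Python sanity check; the 400 TB and 6:1 figures are from the post itself):

```python
# Sanity-check the post's storage math: 400 TB of compressed disk at
# slightly better than 6:1 implies roughly 2.4 PB of uncompressed user data.
disk_tb = 400
compression_ratio = 6          # "slightly better than 6:1"
user_data_pb = disk_tb * compression_ratio / 1000  # TB -> PB (decimal units)
print(user_data_pb)            # -> 2.4, in line with "2 1/2 petabytes"
```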

Facebook’s Hadoop/Hive system ingests 15 terabytes of new data per day now,
not 10. Hadoop/Hive cycle times aren’t as fast as I thought I heard from Jeff.
Ad targeting queries are the most frequent, and they’re run hourly. Dashboards
are repopulated daily.

Nothing else in my Cloudera post was called out as being wrong.

In a new-to-me metric, Facebook has 610 Hadoop nodes, running in a single
cluster, due to be increased to 1000 soon. Facebook thinks this is the second-
largest* Hadoop installation, or else close to it. What’s more, Facebook
believes it is unusual in spreading all its apps across a single huge cluster,
rather than doing different kinds of work on different, smaller sub-clusters.

*Apparently, Yahoo is at 2000 nodes (and headed for 4000), 1000 or so of which are operated as a single cluster for a single app.

Facebook decided in 2007 to move what was then a 15 terabyte big-DBMS-vendor
data warehouse to Hadoop — augmented by Hive — rather than to an MPP data
warehouse DBMS. Major drivers of the choice included:

License/maintenance costs. Free is a good price.

Open source flexibility. Facebook is one of the few users I’ve ever spoken
with that actually cares about modifying open source code.

Ability to run on cheap hardware. Facebook runs real-time MySQL instances on
boxes that cost $10K or so, and would expect to pay at least as much for an
MPP DBMS node. But Hadoop nodes run on boxes that cost no more than $4K, and
sometimes (depending e.g. on whether they have any disk at all) as little as
$2K. These are “true” commodity boxes; they don’t even use RAID.

Ability to scale out to lots of nodes. Few of the new low-cost MPP DBMS
vendors have production systems of >100 nodes even today. (Actually, I’m not
certain that any except Netezza do, although Kognitio in a prior release of
its technology once built a 900ish-node production system.)

Inherently better performance. Correctly or otherwise, the Facebook guys
thought that Hadoop had performance advantages over DBMS, due to the lack of
overhead associated with transactions and so on.

One option Facebook didn’t seriously consider was sticking with the incumbent,
which Facebook folks regarded as “horrible” and a “lost cause.” The daily
pipeline took more than 24 hours to process. Although aware that its big-DBMS-
vendor warehouse could probably be tuned much better, Facebook didn’t see that
as a path to growing its warehouse more than 100-fold. (But based on my
discussion with Cloudera, I gather that vendor’s DBMS is indeed used to run
some reporting today.)

Reliability of Facebook’s Hadoop/Hive system seems to be so-so. It’s designed
for a few nodes at a time to fail; that’s no biggie. There’s a head node
that’s a single point of failure; while there’s a backup node, I gather
failover takes 15 minutes or so, a figure the Facebook guys think they could
reduce substantially if they put their minds to it. But users submitting long-
running queries don’t seem to mind delays of up to an hour, as long as they
don’t have to resubmit their queries. Keeping ETL up is a higher priority than
keeping query execution going. Data loss would indeed be intolerable, but at
that level Hadoop/Hive seems to be quite trustworthy.

There also are occasional longer partial(?) outages, when an upgrade
introduces a bug or something, but those don’t seem to be a major concern.
Facebook’s variability in node hardware raises an obvious question — how does
Hadoop deal with heterogeneous hardware among its nodes? Apparently a fair
scheduling capability has been built for Hadoop, with Facebook as the first
major user and Yahoo apparently moving in that direction as well. As for
inputs to the scheduler (or any more primitive workload allocator) — well,
that depends on the kind of heterogeneity.

Disk heterogeneity — a distributed file system reports back about disk.

CPU heterogeneity — different nodes can be configured to run different numbers
of concurrent tasks each.

RAM heterogeneity — Hadoop does not understand the memory requirements of each
task, and does not do a good job of matching tasks to boxes accordingly. But
the Hadoop community is working to fix this.
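The CPU point above amounts to sizing each node's concurrent-task maximum by hand (in Hadoop of that era, knobs like mapred.tasktracker.map.tasks.maximum were set per node). A small illustrative sketch, with made-up node specs that echo the $2K/$4K/$10K boxes mentioned earlier — this is not Hadoop's actual scheduler logic:

```python
# Illustrative only: size per-node concurrent task slots from hardware,
# the way an admin might tune Hadoop's per-node task maximums on a
# heterogeneous cluster. Node specs below are hypothetical.
def task_slots(cores: int, ram_gb: int, ram_per_task_gb: int = 2) -> int:
    """One slot per core, capped by available RAM per task."""
    return max(1, min(cores, ram_gb // ram_per_task_gb))

nodes = {"cheap-2k": (4, 4), "mid-4k": (8, 16), "beefy-10k": (16, 64)}
for name, (cores, ram) in nodes.items():
    print(name, task_slots(cores, ram))  # cheaper boxes get fewer slots
```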

Further notes on Hive

Without Hive, some basic Hadoop data manipulations can be a pain in the butt.
A GROUP BY or the equivalent could take >100 lines of Java or Python code, and
unless the person writing it knew something about database technology, it
could use some pretty sub-optimal algorithms even then. Enter Hive.
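To make the pain concrete, here is the skeleton of a hand-rolled MapReduce GROUP BY/COUNT, condensed into a few lines of Python (a sketch of the pattern, not real Hadoop job code — in Java, with job setup, serialization, and I/O handling, the same logic sprawls far wider):

```python
from collections import defaultdict

# What `SELECT key, COUNT(*) ... GROUP BY key` forces you to spell out
# by hand in raw MapReduce: a map phase, a shuffle, and a reduce phase.
records = ["ads", "home", "ads", "profile", "ads", "home"]

# Map: emit (key, 1) for every record.
mapped = [(r, 1) for r in records]

# Shuffle: group intermediate pairs by key (normally the framework's job).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: sum the values for each key.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # -> {'ads': 3, 'home': 2, 'profile': 1}
```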

Hive sets out to fix this problem. Originally developed at Facebook (in Java,
like Hadoop is), Hive was open-sourced last summer, by which time its SQL
interface was in place, and now has 6 main developers. The essence of Hive
seems to be:

An interface that implements a subset of SQL

Compilation of that SQL into a MapReduce configuration file.

An execution engine to run same.

The SQL implemented so far seems, unsurprisingly, to be what is most needed to
analyze Facebook’s log files. I.e., it’s some basic stuff, plus some timestamp
functionality. There also is an extensibility framework, and some ELT
functionality. Known users of Hive include Facebook (definitely in production)
and hi5 (apparently in production as well). Also, there’s a Hive code
committer from Last.fm.
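The "basic stuff plus timestamp functionality" flavor boils down to aggregations like counting log events per hour. A sketch of that computation in Python (the epoch timestamps are made up; in Hive this would be a short GROUP BY query rather than hand-written code):

```python
from collections import Counter
from datetime import datetime, timezone

# Illustrative: bucket log events by hour — the kind of timestamp-based
# aggregation the post says Hive's SQL subset was built to cover.
events = [1242000000, 1242000500, 1242003700, 1242007300]  # made-up epochs

def hour_bucket(epoch: int) -> str:
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:00")

per_hour = Counter(hour_bucket(e) for e in events)
for hour, n in sorted(per_hour.items()):
    print(hour, n)
```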

Other links about huge data warehouses:

eBay has a 6 1/2 petabyte database running on Greenplum and a 2 1/2 petabyte
enterprise data warehouse running on Teradata.

Wal-Mart, Bank of America, another financial services company, and Dell also
have very large Teradata databases.

Yahoo’s web/network events database, running on proprietary software, sounded
about 1/6th the size of eBay’s Greenplum system when it was described about a
year ago.

Fox Interactive Media/MySpace has multi-hundred terabyte databases running on
each of Greenplum and Aster Data nCluster.

TEOCO has 100s of terabytes running on DATAllegro. To a probably lesser
extent, the same is now also true of Dell.

Vertica has a couple of unnamed customers with databases in the 200 terabyte
range.

