

Building a Data Intensive Web App with Hadoop, Hive, & EC2 - pskomoroch
http://www.cloudera.com/hadoop-data-intensive-application-tutorial

======
pskomoroch
Related blog post here: [http://www.cloudera.com/blog/2009/07/31/tracking-trends-with...](http://www.cloudera.com/blog/2009/07/31/tracking-trends-with-hadoop-and-hive-on-ec2/)

Full source code on Github:
<http://github.com/datawrangling/trendingtopics/tree/master>

Dataset on Amazon Public Data Sets:
[http://developer.amazonwebservices.com/connect/entry.jspa?ex...](http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2596)

------
loganfrederick
I have a friend interning at Cloudera over the summer. I'm particularly
interested in how their business model will work. I believe their distribution
of Hadoop is being offered for free, and they're basically cloud computing
consultants with the goal of customizing Hadoop for different
industries/clients. It's a huge market opportunity.

~~~
neilc
Running a consulting firm around Hadoop and related technologies makes a lot
of sense, but taking VC funding to do so is surprising, IMHO (Cloudera are
backed by Accel and Greylock). A consulting firm has relatively small capital
requirements, but is fundamentally less scalable than a "product" firm: to
make 2x revenue, you need ~2x the staff. I'm curious to see whether they'll be
able to achieve the sort of returns that a typical VC expects if they stick to
a purely-consulting business model.

Cloudera have taken ~$11M in VC funding so far[1]; is there really ~$110M in
profit to be made off Hadoop consulting, training and support in the medium
term? I wonder.

One possibility is that they're using consulting to build revenue and
mindshare in the short-term, and using the capital they've raised to launch
something more substantial in the longer-term (say, running their own
cloud/hosted Hadoop service).

[1] [http://ostatic.com/blog/hadoop-centric-cloudera-gets-6-milli...](http://ostatic.com/blog/hadoop-centric-cloudera-gets-6-million-in-series-b-funding)

------
idefine
It was great to have Pete Skomoroch speak about this at the Hadoop meetup in
DC. I am really glad that it is being shared with the rest of the community
now. Cloudera is collecting good use cases and providing innovative ideas on
their blog. Thanks again for sharing, Pete.

------
mrlebowski
This is almost exactly what I have been working on, and it will be a lot of
help. Thanks!

~~~
mrlebowski
Why are you loading the processed data into MySQL tables? I am not sure how
well MySQL would scale, given that Wikipedia has ~3 million articles. Like I
said, I am working on a similar problem right now and we are trying to avoid
MySQL. Did you guys consider HBase or other BigTable-like implementations?

HN insights will be valuable, thank you!

~~~
pskomoroch
The live site trendingtopics.org is using MySQL for all 3 million articles and
it handles them pretty well with the right indexing, bulk loads, and memcached.
I built the initial demo in 10 days, so I chose Rails w/ MySQL mostly for
simplicity, with the intention of adding Solr or Sphinx search later. The way
the data is stored (key-value style w/ JSON timelines) was actually intended to
lend itself to replacing MySQL with another fast big-table-like datastore.
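A minimal sketch of that key-value style storage (table and column names are hypothetical, and sqlite3 stands in for MySQL): each article row carries its whole pageview timeline as a JSON blob, so rendering a page is one indexed lookup plus a JSON decode.

```python
import json
import sqlite3

# In-memory stand-in for the MySQL presentation layer (schema is hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pages (
        id INTEGER PRIMARY KEY,
        title TEXT UNIQUE,      -- indexed key for lookups
        monthly_trend REAL,     -- precomputed trend score for ranking
        dates TEXT,             -- JSON array of dates
        pageviews TEXT          -- JSON array of daily counts
    )
""")

# One row per article: the timeline is an opaque JSON value, not normalized rows.
conn.execute(
    "INSERT INTO pages (title, monthly_trend, dates, pageviews) VALUES (?, ?, ?, ?)",
    ("Hadoop", 42.0,
     json.dumps(["2009-07-29", "2009-07-30", "2009-07-31"]),
     json.dumps([1200, 1500, 2100])),
)

# Serving a page is a single lookup on the indexed title column.
row = conn.execute(
    "SELECT dates, pageviews FROM pages WHERE title = ?", ("Hadoop",)
).fetchone()
timeline = dict(zip(json.loads(row[0]), json.loads(row[1])))
print(timeline["2009-07-31"])  # -> 2100
```

Because each article is effectively one key mapped to a JSON value, swapping MySQL for a big-table style store would mostly mean changing this one lookup.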

~~~
mrlebowski
Thanks for the quick reply. How many machines are running MySQL for you?

I was reading this website: [http://www.metabrew.com/article/anti-rdbms-a-list-of-distrib...](http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/)

I have not tried HBase and HyperTable myself yet, but the blog post says that
they still have latency issues. What are your views?

~~~
pskomoroch
We're just using a single c1.medium instance for the database right now.
Trendingtopics.org is a relatively low-traffic, read-only site, and most of
the reads are for a handful of URLs on the front page which can be cached.

Also, after processing the raw log data with Hadoop, we only need to
store/lookup 3M records in the MySQL presentation layer, which is well within
the capabilities of a tuned RDBMS. Many Rails sites are backed by MySQL, so I
thought linking Hadoop/Hive to a common data workflow would make for a good
example.
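The shape of that workflow can be sketched as follows (the sample lines are hypothetical, in the project/title/count/bytes layout of the Wikipedia traffic dumps, and a Counter stands in for the Hadoop/Hive aggregation): millions of raw hourly log lines collapse to one summary row per article before anything is loaded into MySQL.

```python
from collections import Counter

# Hypothetical raw hourly pageview log lines: "project article_title count bytes".
raw_log = [
    "en Hadoop 120 0",
    "en Hadoop 95 0",
    "en MySQL 80 0",
    "en Hadoop 210 0",
    "en MySQL 60 0",
]

# Stand-in for the Hadoop/Hive step: group by article title, sum the counts.
totals = Counter()
for line in raw_log:
    project, title, count, _ = line.split()
    totals[title] += int(count)

# Only these aggregated rows (one per article) get bulk-loaded into MySQL.
summary_rows = sorted(totals.items())
print(summary_rows)  # -> [('Hadoop', 425), ('MySQL', 140)]
```

The point is the reduction in scale: the RDBMS never sees the raw logs, only ~3M precomputed summary rows, which a tuned MySQL handles comfortably.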

I've been hearing that recent improvements in HBase 0.20 could make it a
contender: [http://stackoverflow.com/questions/1022150/is-hbase-stable-a...](http://stackoverflow.com/questions/1022150/is-hbase-stable-and-production-ready)
and some high volume sites like Mahalo are already using it.
That said, there are other alternative data stores (Cassandra, Voldemort,
Tokyo Tyrant) that might be worth exploring if a database isn't cutting it for
you.

------
maolson
Awesome stuff. Nice to see a deep practical discussion on building working
systems.

------
neilkod
Brilliant tutorial - Very in-depth and covers quite a lot of ground.

------
miked98
Practical, extensive, and timely piece on the nuts and bolts of weaving Hadoop
and EC2.

------
christofd
Intense, thorough tutorial. Nice to see a pragmatic Hadoop walk-thru.

