

Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive - novicecoder
http://www.cloudera.com/blog/2009/11/11/hadoop-world-rethinking-the-data-warehouse-with-hadoop-and-hive-from-ashish-thusoo/

======
physcab
I've been using Hive for the past month or so. It has significantly reduced my
development time in doing analytics. If you're interested in this tool, let me
give some advice:

1) Hive shines when your dataset is truly massive (i.e. > 100MB). Anything
less should probably be done in MySQL, unless your backend processing is
considerable.

2) Don't come into the project thinking "Oh, I can do SQL on huge datasets and
get immediate results." Hive's main advantage is reducing the need to write
custom MapReduce scripts--so your processing time is still the same.

3) Hive has its quirks. You have to structure your joins and WHERE clauses
carefully. It's not as forgiving as typical SQL.
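
To give a concrete (hypothetical) example of what I mean: Hive only does
equi-joins, and it wants the join condition in the ON clause, not the WHERE
clause the way most SQL engines will let you get away with:

```sql
-- Works in Hive: equi-join condition expressed in the ON clause
SELECT u.name, o.total
FROM users u JOIN orders o
  ON (u.id = o.user_id)
WHERE o.total > 100;

-- Typical SQL engines accept this; Hive rejects it, since the
-- join condition is buried in the WHERE clause instead of ON
SELECT u.name, o.total
FROM users u, orders o
WHERE u.id = o.user_id AND o.total > 100;
```

(Table and column names are made up, obviously -- the point is the shape of
the query.)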

~~~
tom_b
I'm assuming you meant 100GB? not MB?

I get a lot of mileage out of standard ANSI SQL. Given that Hive isn't quite
as forgiving, can you give some specifics on things you lose in Hive that you
miss the most?

~~~
physcab
No, I meant 100MB. It really depends on how strained your databases are. For
example, we have about 6 people constantly running jobs on MySQL and running
it into the ground, so if I ever want to do some data manipulation I pretty
much have to wait until 11pm. But yes, if your situation isn't as bad, you can
do a lot with MySQL.

HiveQL is less forgiving in the sense that sometimes it doesn't return what
you'd expect. When I run DISTINCTs and GROUP BYs, the output isn't always what
SQL on MySQL would give. For example, DISTINCTs on strings don't work very
well, and GROUP BYs lose a little functionality--you have to read the
documentation to figure out what you can do.
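
One concrete case of the GROUP BY strictness (hypothetical table): MySQL will
happily let you select a column that isn't grouped or aggregated, and just
picks an arbitrary value; Hive rejects the query outright:

```sql
-- MySQL tolerates this (city is neither grouped nor aggregated);
-- Hive throws an error instead of silently picking a value
SELECT user_id, city, COUNT(*)
FROM visits
GROUP BY user_id;

-- The Hive-friendly version: every selected column is either
-- in the GROUP BY or wrapped in an aggregate
SELECT user_id, MAX(city), COUNT(*)
FROM visits
GROUP BY user_id;
```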

Other than that, queries in Hive are not as cheap as they are in a traditional
DB. Hive offers great flexibility, but you still have to plan your query
carefully because it's going to take a while to execute on your large dataset.
Even if you specify LIMIT, it isn't enforced until after all the MapReduce
jobs have run.
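
In other words, something like this (hypothetical table) still scans the
whole table--the LIMIT only trims the final output:

```sql
-- The full MapReduce job runs over all of clicks;
-- LIMIT 10 is applied only when the results are written out
SELECT user_id, page
FROM clicks
LIMIT 10;
```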

~~~
jhammerb
If you have bucketed tables, you may find the TABLESAMPLE operator useful to
iteratively construct complex queries and get query results back sooner:
<http://wiki.apache.org/hadoop/Hive/LanguageManual/Sampling>. Agreed that
Hive, and databases in general, could be better in this regard.
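
A quick sketch of what that looks like, assuming a table bucketed 32 ways on
user_id:

```sql
-- Reads only 1 of the 32 buckets, so the job touches
-- roughly 1/32 of the data instead of the whole table
SELECT COUNT(*)
FROM clicks TABLESAMPLE(BUCKET 1 OUT OF 32 ON user_id);
```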

------
mark_l_watson
I use Hadoop with custom map/reduce apps, but I am just experimenting a little
with Pig, and playing with Hive is on my to-do list. I may change my mind on
this, but I find the map/reduce model easy to work with, and development and
debugging are fairly quick and easy when running a single development Hadoop
node.

BTW, I don't view Hadoop as only being useful for very large datasets. It
seems reasonable to build automated Hadoop processing into a new application
that has smaller data size requirements. You don't give up much in
performance, and the extra development time is reasonable. Then you have lots
of flexibility for scaling up. Also, if you only have sporadic needs to
process large data sets, using Elastic MapReduce is very cost effective.

