

Search Big Data in Five Minutes with Pig, Wonderdog and ElasticSearch - rjurney
http://hortonworks.com/blog/search-data-at-scale-in-five-minutes-with-pig-wonderdog-and-elasticsearch/

======
ajays
FTA: "Note however, the ElasticSearch UDF returns nearly instantly, while the
FILTER is slooooooowwww… "

It's just 450MB of data (in Avro format). Why should the filter be
"slooooooooowwww" ?

~~~
villagefool
Agreed, the actual time measurements are missing from the comparison.

~~~
rjurney
I'll add them.

~~~
squarecog
450 Megs? Use pig -x local :)

~~~
rjurney
I did.

------
big_data
Pig may lack some syntactic sugar, but it really gets the job done! It's a
great way to get MR jobs going in a hurry.

------
rjurney
I'd like to see more YC companies using Pig.

~~~
virmundi
See, my problem with Pig is that it doesn't unit test well. It's easy, it's
quick, it's fairly powerful, but unit testing is horrible. I spent 3 days
extending PigUnit to make it easier to override the hard-coded reference files
in the script so you can provide mock test files. Even with that working, and
with under 5 megs of reference files for the test, the test takes 3 minutes
and about 900 megs of RAM for 7 simple filters and joins.

Compare this to Cascading which is far easier to unit test.

Also PIG is a hog when it comes to making jars. If you don't install pig on
EVERY node in the cluster and rather provide it via a submitted uber jar, the
jar is HUGE (50 or so megs). PIG's dependencies are ridiculous.

Again, compare to Cascading, which has a far smaller footprint.

I would like to see a strong comparison of HBase. Pig is useful for one-off,
ad-hoc stuff, but I don't think it's production ready.

~~~
rjurney
Let's be clear: Pig powers Yahoo and LinkedIn. Large enterprises. There's no
question whether it is production ready.

~~~
squarecog
And Twitter.

We unit test our Pig scripts; it's pretty straightforward given the MockStorage
class we contributed (you can find it in Pig trunk). Granted, we've long
been separating load statements from actual logic, which allows us to fairly
easily mock up data to feed into the Pig flows. It would be harder if your
loads and flows are in the same file.
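
A minimal sketch of that separation, written in Python rather than Pig Latin
purely for illustration. The schema and function names are hypothetical, and
this is not the MockStorage API or anyone's actual setup; it only shows why
keeping the load step apart from the logic makes mock data easy to feed in:

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical schema for illustration: (user, bytes_transferred).
Record = Tuple[str, int]


def load_records(path: str) -> Iterator[Record]:
    """Load step: the only part that touches storage."""
    with open(path) as handle:
        for line in handle:
            user, size = line.rstrip("\n").split("\t")
            yield user, int(size)


def large_requests_per_user(records: Iterable[Record], min_bytes: int) -> Dict[str, int]:
    """Logic step: filter, then group and count. It knows nothing about files."""
    counts = Counter(user for user, size in records if size >= min_bytes)
    return dict(counts)


# A test can skip load_records entirely and pass in-memory mock rows.
mock = [("ann", 10), ("bob", 500), ("ann", 900), ("ann", 700)]
result = large_requests_per_user(mock, min_bytes=500)
```

In production the same logic would be called as
`large_requests_per_user(load_records(path), min_bytes=500)`, so the test and
the real job exercise identical filter and grouping code.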

~~~
matterhayes
At LinkedIn we collected a bunch of UDFs we developed for Pig and
released them as the DataFu library. We used PigUnit to unit test each of the
UDFs:

[https://github.com/linkedin/datafu/tree/master/test/pig/data...](https://github.com/linkedin/datafu/tree/master/test/pig/datafu/test/pig)

