
Hadoop, Pig, and Twitter (NoSQL East 2009) - r11t
http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009
======
PStamatiou
Pics from Kevin's talk and NoSQL East (I was there):

<http://www.flickr.com/photos/pauls/sets/72157622686185710/>

[http://www.flickr.com/photos/pauls/4061368173/in/set-7215762...](http://www.flickr.com/photos/pauls/4061368173/in/set-72157622686185710/)

------
jbm
I had no idea that there was a higher-level language on top of Hadoop.

<http://research.yahoo.com/project/90>

Considering the number of venture capitalists who are swarming about in the
Hadoop space, it's definitely worth the effort to get to know Pig a little
better.

------
patio11
Pig continues to strike me as a beautiful little improvement in doing
analytics work. I don't need Twitter-scale (one of the benefits of charging
people money is that it means MySQL is probably adequate in terms of
performance for analytics) but I do like the idea of arbitrarily composed
questions which hurt my head less than SQL.

Hmm... There is something to think about...

~~~
cf
The problem with Pig is it is very much alpha. Schemas are not definable for
many things. Look at how the code example had to be truncated since he was
explicitly pulling so many fields. You need to write your own UDFs to do
anything mildly significant, and UDFs can be written only in Java. If that
doesn't work you have to resort to Pig Streaming and take the associated
performance hit.

Pig is still pretty cool, but it is not all there yet. A lot of the claims of
avoiding Java made by the presenter are misleading.

~~~
rjurney
You can avoid Java if you have a team of guys that do write Java, to make Pig
work for your dataset. Lots of shops have exactly this - and so Pig allows
regular developers across the organization to conduct ad hoc analysis of web
scale data - with Pig, you feel like you can 'touch the terabytes.' With the
SQL patch, this can extend outwards to regular analysts. Also, you can express
an awful lot with what is included in Pig and what's in the Piggybank. But
you're very right about alpha. A one page Pig analysis can take me all day -
way longer than it should. I am continuously discovering oddities and
undocumented behavior, and finding tricks that I have to pull to make my
scripts work that I really shouldn't have to.

For instance, once naming schemes get two deep, in Pig 0.6 (admittedly, this
is trunk, and it's wild out there), you can no longer access the elements. I
don't think this is supposed to be the case, but the result is that after
every operation that groups, etc. I have to do an extra FOREACH GENERATE name
AS name to make the script work. You also have to explicitly flatten all
GROUPed results, even if there is only a single item, and this is very
confusing at first.
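
To illustrate the flattening point, here is a rough sketch in plain Python (not Pig Latin) of the data shape you get from GROUP and why a FLATTEN step is needed afterwards. The helper names `group_by` and `flatten` and the sample rows are made up for illustration; this only mimics the nesting and is not how Pig implements it.

```python
from collections import defaultdict

def group_by(rows, key_index):
    """Mimic GROUP: emit one (group, bag) pair per distinct key."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key_index]].append(row)
    return [(key, bag) for key, bag in bags.items()]

def flatten(grouped):
    """Mimic FOREACH ... GENERATE group, FLATTEN(bag): un-nest every bag."""
    return [(key,) + row for key, bag in grouped for row in bag]

# Hypothetical (user, count) rows
rows = [("alice", 3), ("bob", 5), ("alice", 7)]

grouped = group_by(rows, 0)
# Even "bob", with a single row, comes back wrapped in a bag (a list here),
# so the row's fields are one level deeper than before the GROUP.
flat = flatten(grouped)
```

After `group_by`, the single "bob" row is still nested inside a one-element bag, which is the part that is confusing at first; only after `flatten` are the fields back at the top level.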

The documentation is also extremely sparse in real-world uses of the syntax.
Sometimes, it feels like coding in brainfuck, and I seriously think about
switching over to Java and Cascading. But when Pig works, it's really damned
elegant and fast, so I still do most of my work in Pig. There is no better
tool out there for simple to moderately complex analyses of terabytes of data
in Hadoop. Pig excels for relatively simple tasks, and even scales to very
complex tasks, with UDFs - but once you cross some line in terms of
complexity, you're better off using something like Cascading.

Oh, some of you may find this interesting... I built a GUI to sit on top of
Pig: <http://cloudstenography.com/> In practice, using this for real analysis
of data from multiple sources is more challenging than the videos show, but it
makes a point about the ease of accessing data - right out to Excel.

