

Pig: high level language to process big datasets via Hadoop - bayareaguy
http://www.scribd.com/doc/2027476/pig-webscale-processing-Yahoo-Research

======
bayareaguy
Christopher Olston from Yahoo Research presented Pig at the UCB DBMS lunch
talk today. It's a framework that translates a high-level data processing
specification into Hadoop jobs. Users only have to supply the functions for
their specific data parsing and computation; Pig and Hadoop do the rest.

Here is a sample program (taken from slide 11) that joins two big datasets
(Visits and Pages) to find sessions that end with the "best" page:

    
    
            Visits = load '/data/visits' as (user, url, time);
            Visits = foreach Visits generate user, Canonicalize(url), time;
    
             Pages = load '/data/pages' as (url, pagerank);
    
                VP = join Visits by url, Pages by url;
        UserVisits = group VP by user;
          Sessions = foreach UserVisits generate FindSessions(*);
      HappyEndings = filter Sessions by BestIsLast(*);
    
             store HappyEndings into '/data/happy_endings';
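
To make the dataflow concrete, here is a rough in-memory Python sketch of what I understand each step to compute. It is not how Pig executes anything (Pig compiles the script into Hadoop jobs), and the three user-supplied functions are my own guesses: the slides don't show the bodies of Canonicalize, FindSessions or BestIsLast, so the 30-minute session gap, the URL normalization rules, and the "best = highest pagerank" test are all assumptions. I also assume each url appears once in Pages and that sessions are flattened into one list.

```python
from collections import defaultdict
from urllib.parse import urlsplit

SESSION_GAP = 1800  # assumption: 30 minutes of inactivity ends a session


def canonicalize(url):
    """Hypothetical Canonicalize: lowercase scheme/host, drop query,
    fragment and trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}"


def find_sessions(rows):
    """Hypothetical FindSessions: split one user's visits into sessions
    wherever two consecutive visits are more than SESSION_GAP apart."""
    sessions, current = [], []
    for row in sorted(rows, key=lambda r: r["time"]):
        if current and row["time"] - current[-1]["time"] > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append(row)
    if current:
        sessions.append(current)
    return sessions


def best_is_last(session):
    """Hypothetical BestIsLast: the last page visited has the highest
    pagerank of the session."""
    return session[-1]["pagerank"] == max(r["pagerank"] for r in session)


def happy_endings(visits, pages):
    # Visits = foreach Visits generate user, Canonicalize(url), time;
    visits = [(user, canonicalize(url), time) for user, url, time in visits]

    # VP = join Visits by url, Pages by url;  (inner join via a lookup table)
    rank = dict(pages)
    vp = [
        {"user": user, "url": url, "time": time, "pagerank": rank[url]}
        for user, url, time in visits
        if url in rank
    ]

    # UserVisits = group VP by user;
    by_user = defaultdict(list)
    for row in vp:
        by_user[row["user"]].append(row)

    # Sessions = foreach UserVisits generate FindSessions(*);
    # HappyEndings = filter Sessions by BestIsLast(*);
    result = []
    for user, rows in by_user.items():
        for session in find_sessions(rows):
            if best_is_last(session):
                result.append((user, [r["url"] for r in session]))
    return result
```

Calling happy_endings(visits, pages) with visits as (user, url, time) tuples and pages as (url, pagerank) tuples returns a list of (user, [urls]) pairs, one per qualifying session. The point of Pig is that you write only the three small functions and the script; the join, grouping and parallelism are handled for you.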
    

Pig is open source and is currently being incubated as an Apache project.
More details here:

Apache Page: <http://incubator.apache.org/pig>

Subversion Repository: <http://svn.apache.org/repos/asf/incubator/pig>

PowerPoint slides: <http://www.cs.cmu.edu/~olston/pig.ppt>

