

Comparing Pig Latin and SQL for Constructing Data Processing Pipelines - erikeldridge
http://developer.yahoo.net/blogs/hadoop/2010/01/comparing_pig_latin_and_sql_fo.html

======
lsb
_"Notice how SQL forces the pipeline to be written inside-out, with operations
that need to happen first happening in the from clause sub-query."_

That temporary table is unnecessary: you can rewrite the query as

    insert into ValuableClicksPerDMA
    select dma, count(*)
    from users, clicks, geoinfo
    where users.name = clicks.user
      and clicks.value > 0
      and clicks.ipaddr = geoinfo.ipaddr
    group by dma;

Perhaps the reason the SQL runs slowly is that the join is on a string
(users.name = clicks.user) instead of on an integer id.
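
One common fix, sketched here with hypothetical `id`/`user_id` columns and an
index name that are not part of the original schema: join on an integer
surrogate key and index it.

    create index idx_clicks_user_id on clicks (user_id);

    select dma, count(*)
    from users, clicks, geoinfo
    where users.id = clicks.user_id
      and clicks.value > 0
      and clicks.ipaddr = geoinfo.ipaddr
    group by dma;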

Also, why the disregard for the optimizer? We trust optimizers to allocate
memory well, we trust optimizers to generate code well, and we can analyze the
database to see which indices partition the data best. Having an optimizer
means you just rerun ANALYZE and the query plan adapts when the data changes,
versus editing the procedure long after the fact to reorder tables. (And if
you do want a specific join order, you can use cross join or a similar
construct to keep the optimizer from reordering.)
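
In PostgreSQL-flavored syntax (a sketch; the exact statements differ across
databases), refreshing statistics and inspecting the resulting plan looks
like:

    analyze users;
    analyze clicks;

    explain
    select dma, count(*)
    from users, clicks, geoinfo
    where users.name = clicks.user
      and clicks.value > 0
      and clicks.ipaddr = geoinfo.ipaddr
    group by dma;

Rerunning ANALYZE after the data changes lets the planner pick a new join
order without anyone touching the query.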

------
jhammerb
I thought Microsoft did a good job of adding a procedural-like feel to SQL
with SCOPE's virtual views; see <http://icme2007.org/~jrzhou/pub/Scope.pdf>
for more.

PigLatin certainly is a nice approach to the problem as well, of course.

~~~
lsb
That domain is parked and the pdf link is broken.

------
btilly
For me the key point was this:

 _Of course this can be resolved with the use of intermediate or temporary
tables._

Other than waving his hands, he didn't really give any downsides to this
approach. And in my past experience, intermediate tables are a great way to go.
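
For concreteness, a minimal sketch of the intermediate-table approach
(temporary-table syntax varies by database, and `ValuableClicks` is a name
chosen here for illustration):

    create temporary table ValuableClicks as
    select clicks.user, clicks.ipaddr
    from users, clicks
    where users.name = clicks.user
      and clicks.value > 0;

    insert into ValuableClicksPerDMA
    select dma, count(*)
    from ValuableClicks, geoinfo
    where ValuableClicks.ipaddr = geoinfo.ipaddr
    group by dma;

Each step can then be inspected and indexed independently, which is much of
the appeal.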

------
gfodor
Also, check out Cascading.

