Hacker News
Google App Engine Pipeline API (code.google.com)
41 points by DocSavage on Dec 16, 2010 | 8 comments



I can't wrap my head around what exactly this does. How is this related to mapreduce? Can anybody explain?


Lemme give you a naive example.

Say you wanted to generate a heatmap using MapReduce. How would you do it? You'd probably need something like this:

  1. Map location data points to (region -> weight)
  2. Reduce (region -> weight) to (region -> sum of weights)
  3. Map data points to (region -> 1)
  4. Reduce (region -> 1) to (region -> sum of points in region)
  5. Shuffle output of #2 and #4
  6. Reduce (region -> sum of points) and (region -> sum of weights) to (region -> average weight)

The Pipeline API makes it easy to describe the dependencies between these separate MapReduce jobs, wait for each stage to complete before triggering the next, and reuse this logic as part of a larger computational workflow.
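
A rough sketch of the six steps above in plain Python, to make the data flow concrete (the sample points and function names are invented; the real Pipeline API would express the same dependencies declaratively and run the two map/reduce branches in parallel):

```python
from collections import defaultdict

def map_weights(points):
    # Step 1: emit (region -> weight) pairs
    return [(region, weight) for region, weight in points]

def map_counts(points):
    # Step 3: emit (region -> 1) pairs
    return [(region, 1) for region, _ in points]

def reduce_sum(pairs):
    # Steps 2 and 4: sum values per region
    totals = defaultdict(float)
    for region, value in pairs:
        totals[region] += value
    return dict(totals)

def join_average(weight_sums, counts):
    # Steps 5-6: join the two reduce outputs, emit per-region averages
    return {r: weight_sums[r] / counts[r] for r in weight_sums}

def heatmap(points):
    # join_average depends on both reduces; each reduce depends on its
    # map. Sequential here, but the branches are independent.
    weight_sums = reduce_sum(map_weights(points))
    counts = reduce_sum(map_counts(points))
    return join_average(weight_sums, counts)

points = [("a", 2.0), ("a", 4.0), ("b", 6.0)]
print(heatmap(points))  # {'a': 3.0, 'b': 6.0}
```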

The Mapper framework/MapReduce integration part is not ready yet, but we're getting there. Release early/often~

ps. For those of you who know how to do a heatmap in a single MR: I'm just trying to demonstrate why you may need to pass inputs/outputs between MR jobs.


When you break your application up into a set of tasks, which GAE pretty much requires, it's hard to execute them in any sort of order. Your program is not executing sequentially, so there aren't any while loops or if-then statements to make sure calculations happen in a certain sequence. You need a higher-level system that can order everything. Node.js, which is a similar sort of event-driven system (though not distributed), has a number of similar libraries under "Flow control / Async goodies" at https://github.com/ry/node/wiki/modules
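
The ordering problem can be made concrete: with independent tasks there is no program counter enforcing "run B after A", so something has to track dependencies explicitly. A minimal sketch using the standard library's topological sorter (the task names and dependency graph are made up):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Map each task to the set of tasks it depends on.
deps = {
    "resize_images": {"fetch_images"},
    "build_index":   {"resize_images", "fetch_metadata"},
}

# static_order() yields tasks so that every task comes after
# everything it depends on -- the scheduler's job in one line.
order = list(TopologicalSorter(deps).static_order())
print(order)
```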


Maybe this isn't what they're targeting, but I could see it functioning nicely as a workflow engine (google 'jBPM' for an example).


Yeah, I was thinking BPEL for a minute...


My original thought was shot down when I read this:

  The Pipeline API is not a data or stream processing engine.


What I mean by this is that we're not doing the same thing as Cascading (http://www.cascading.org/), which requires you to transform your problem into the tuple-space domain. Stream processing frameworks like Cascading are for green-field implementations that maximize incremental performance.

On the other hand, the Pipeline API is task oriented. Developers use it with a procedural approach. The focus is on parameter and return value passing and scheduling. It's easy to reuse your existing code in this framework. Think of it as something closer to a parallelizable Bash than a data processing framework.
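
The "parallelizable Bash" framing can be sketched with stdlib futures: each task is ordinary procedural code, and dependencies are expressed simply by passing one task's return value to the next. This is not the Pipeline API itself; the functions and values are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(n):
    # Stand-in for an existing task, e.g. downloading item n
    return n * n

def combine(values):
    # Stand-in for a dependent task that aggregates results
    return sum(values)

with ThreadPoolExecutor() as pool:
    # Independent tasks run in parallel...
    futures = [pool.submit(fetch, n) for n in range(4)]
    # ...and the dependent task waits on their return values.
    total = combine(f.result() for f in futures)

print(total)  # 14
```

The point is that the scheduling lives in the framework, while the tasks stay plain procedural code you already had.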


I'm impressed that Google continues to iterate on AppEngine and add features.



