

Key/value is dead. Long live tuples: Pangool for Hadoop - ivanprado
http://www.datasalt.com/2012/03/pangool-hadoop-api-made-easy/

======
jasonkolb
I just popped in to say that I'm tired of the "X is dead" linkbait headlines.
They demonstrate a myopic view of the world. Visual Basic and COBOL are still
around.

~~~
philwelch
Dead is relative. Dead usually means "dead to me".

~~~
kellenfujimoto
Or in the case of VB, "better off dead".

------
protomyth
A couple of links that came to mind with this:

<http://en.wikipedia.org/wiki/Tuple_space>

<http://en.wikipedia.org/wiki/Linda_(coordination_language)>

[http://www.amazon.com/Mirror-Worlds-Software-Universe-
Shoebo...](http://www.amazon.com/Mirror-Worlds-Software-Universe-Shoebox-
How/dp/019507906X/ref=sr_1_1?ie=UTF8&qid=1331144023&sr=8-1)

~~~
alatkins
And for a slightly more modern take on the tuple space, check out JavaSpaces
[1] or GigaSpaces [2]. There's still plenty of active research on the topic
too [3] (disclaimer: I did my PhD thesis on distributed tuple spaces).

I've long contended that a tuple space was basically a generalised key-value
store, so it's nice to see projects like this one crop up.

[1] <http://java.net/projects/jini/>

[2] <http://www.gigaspaces.com/>

[3] <http://eprints.utas.edu.au/9996/>
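
To make the "generalised key-value store" point concrete, here is a minimal in-memory sketch of a tuple space. It is only an illustration, not the JavaSpaces or GigaSpaces API: the class name `TupleSpace`, the wildcard `ANY`, and the non-blocking behaviour are assumptions. The operation names follow Linda's `out` and `rd`, with `take` standing in for Linda's `in`, which is a reserved word in Python.

```python
from typing import Any, List, Optional, Tuple

ANY = object()  # hypothetical wildcard marker used inside templates


class TupleSpace:
    """Toy tuple space: single-threaded, and lookups never block."""

    def __init__(self) -> None:
        self._tuples: List[Tuple[Any, ...]] = []

    def out(self, tup: Tuple[Any, ...]) -> None:
        """Write a tuple into the space."""
        self._tuples.append(tup)

    def _find(self, template: Tuple[Any, ...]) -> Optional[int]:
        # A tuple matches when it has the same length as the template and
        # every non-wildcard field is equal.
        for i, tup in enumerate(self._tuples):
            if len(tup) == len(template) and all(
                t is ANY or t == v for t, v in zip(template, tup)
            ):
                return i
        return None

    def rd(self, template: Tuple[Any, ...]) -> Optional[Tuple[Any, ...]]:
        """Return the first matching tuple without removing it, or None."""
        i = self._find(template)
        return None if i is None else self._tuples[i]

    def take(self, template: Tuple[Any, ...]) -> Optional[Tuple[Any, ...]]:
        """Return and remove the first matching tuple, or None."""
        i = self._find(template)
        return None if i is None else self._tuples.pop(i)


space = TupleSpace()
space.out(("user:42", "alice"))          # a key/value pair is just a 2-tuple
space.out(("order", 42, "book", 12.5))   # wider tuples live in the same space

kv_hit = space.rd(("user:42", ANY))      # key-value style lookup by first field
by_value = space.rd((ANY, "alice"))      # lookup by any other field works too
order = space.take(("order", ANY, "book", ANY))
```

A key-value `get` is the special case where the template fixes the first field of a 2-tuple and leaves the second as a wildcard; matching on any other field is what the tuple space adds.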

------
silssilsssil
I'm wondering what the need for this is when we already have Apache Pig, etc.

~~~
ivanprado
Hi, I'm one of the developers of Pangool. The idea of Pangool is not to be yet
another higher level API on top of Hadoop but rather to pose a replacement for
the low-level Hadoop Java MapReduce API. Pangool has the same performance and
flexibility as the Java MapReduce API, yet it makes several things a lot
easier and more convenient. There is no tradeoff, just advantages.
There will be cases where you'd want to use Pig or Cascading. There will be
some other cases where you'd want the flexibility and efficiency of MapReduce.
For those cases we conceived Pangool. Today only very advanced Hadoop users
can write efficient MapReduce jobs. Pangool hides all the
advanced boilerplate code needed for writing highly efficient MapReduce jobs,
making things like secondary sorting or reduce-side joins extremely easy.
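
For readers who have not met these two patterns, here is a rough in-memory sketch of a reduce-side join that relies on secondary sorting. It is plain Python, not Pangool or Hadoop code: the sample data, the tag constants, and the function names are all made up for illustration. Tuples are sorted by (join key, source tag) but grouped by the join key alone, so each reducer call sees the user record before that user's orders.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical inputs: (user_id, name) and (user_id, item).
USERS = [(1, "alice"), (2, "bob")]
ORDERS = [(2, "lamp"), (1, "book"), (1, "pen")]

USER_TAG, ORDER_TAG = 0, 1  # the user record must sort before its orders


def map_phase():
    """Emit (join_key, source_tag, payload) tuples from both inputs."""
    for user_id, name in USERS:
        yield (user_id, USER_TAG, name)
    for user_id, item in ORDERS:
        yield (user_id, ORDER_TAG, item)


def shuffle(tuples):
    """Sort by (key, tag, payload) but group by the key only."""
    ordered = sorted(tuples, key=itemgetter(0, 1, 2))
    return groupby(ordered, key=itemgetter(0))


def reduce_phase(groups):
    """Join each order with the user record that precedes it in the group."""
    for user_id, group in groups:
        name = None
        for _, tag, payload in group:
            if tag == USER_TAG:
                name = payload
            else:
                yield (user_id, name, payload)


joined = list(reduce_phase(shuffle(map_phase())))
```

In the raw Java API the same effect usually takes a composite key class, a custom partitioner, and separate sort and grouping comparators, which is the kind of boilerplate the comment above refers to.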

~~~
haberman
> There is no tradeoff, just advantages.

Though I don't have deep expertise in Hadoop, I find this claim highly
suspect. High-level APIs achieve user-friendliness by making
decisions/assumptions about the way a lower-level API will be used. I would be
very surprised if there were _no_ use case for which your API imposes a
trade-off vs. the low-level Hadoop API.

I feel much more confident using a high-level API if its author is up-front
about what assumptions it's making. If the claim is that there is no trade-off
vs. the low-level API, I generally conclude that the author doesn't understand
the problem space well enough to know what those trade-offs are.

I could be wrong, but this is my bias/experience.

~~~
ferrerabertran
Hi haberman, I'm one of the developers of Pangool. Let me try to clarify why
we stated that. I understand it may sound aggressive.

Pangool is based on an extension of the MapReduce model we suggest and call
"Tuple MapReduce". This is explained in detail in this post:
[http://www.datasalt.com/2012/02/tuple-mapreduce-beyond-
the-c...](http://www.datasalt.com/2012/02/tuple-mapreduce-beyond-the-classic-
mapreduce/)

What this means is that in Pangool, if you worked with 2-sized Tuples, you
would be able to do exactly what you do now with Java MapReduce. That includes
custom RawComparators and arbitrary business logic anywhere in the MapReduce
chain (Mapper, Combiner, Reducer). Using n-sized Tuples together with
Pangool's group-by, sort-by and reduce-side join API only means less and
simpler code, at no loss of performance or flexibility.

Note that Pangool is still a MapReduce API, so it doesn't add another level of
abstraction.

We designed Pangool with the aim of offering it as a replacement for the
current MapReduce API. Therefore we are not labelling it as a "higher-level
API" but as a comparable low-level API.

We are also benchmarking Pangool to show that it doesn't impose a performance
overhead: <http://pangool.net/benchmark.html>

~~~
scott_s
The tradeoff, then, is that if someone's current problem maps exactly to the
current API, then your API is more complex than needed.

~~~
tim_h
Pangool actually seems like a generalization of Hadoop. This doesn't
necessarily make it more complex. If a problem maps exactly to the Hadoop API,
then it should also map exactly to the Pangool API by setting m=2 (in the
extended map reduce model described at [http://www.datasalt.com/2012/02/tuple-
mapreduce-beyond-the-c...](http://www.datasalt.com/2012/02/tuple-mapreduce-
beyond-the-classic-mapreduce/)).
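
As a rough illustration of the "special case" argument, the sketch below runs a toy tuple-based MapReduce in memory. It is not the Pangool API: `tuple_mapreduce` and its `group_by` / `sort_by` parameters (counts of leading fields) are invented for this example, and it assumes the group-by fields are a prefix of the sort-by fields. Classic key/value word count falls out as 2-field tuples grouped by the first field, and the same function handles a 3-field tuple with a secondary sort.

```python
from itertools import groupby


def tuple_mapreduce(records, map_fn, reduce_fn, group_by, sort_by=None):
    """Map, sort by the first `sort_by` fields, group by the first `group_by`
    fields, then reduce each group. Returns the list of reducer outputs."""
    sort_by = group_by if sort_by is None else sort_by
    assert group_by <= sort_by  # group-by fields are a prefix of sort-by fields
    emitted = [t for record in records for t in map_fn(record)]
    emitted.sort(key=lambda t: t[:sort_by])
    out = []
    for key, group in groupby(emitted, key=lambda t: t[:group_by]):
        out.extend(reduce_fn(key, list(group)))
    return out


# Classic key/value word count: 2-field tuples, grouped by field 0.
lines = ["a b a", "b c"]
counts = tuple_mapreduce(
    lines,
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda key, tuples: [(key[0], sum(n for _, n in tuples))],
    group_by=1,
)

# 3-field tuples (url, timestamp, user): group by url, sort by (url, timestamp),
# so the reducer can read the earliest visitor from the head of each group.
visits = [("x.com", 5, "bob"), ("x.com", 2, "alice"), ("y.com", 1, "carol")]
first_visitor = tuple_mapreduce(
    visits,
    map_fn=lambda record: [record],
    reduce_fn=lambda key, tuples: [(key[0], tuples[0][2])],
    group_by=1,
    sort_by=2,
)
```

The word count call never mentions `sort_by`, which is roughly what "setting m=2" amounts to here; whether having the extra parameter available counts as added complexity is the question the replies below debate.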

~~~
scott_s
I agree with your first sentence, but disagree with the second. That you can
find an exact mapping does not prevent the underlying API from being more
complex than what you need. That you had to realize "Oh, m=2" is more
complexity.

I'm not arguing this is a terrible thing. In fact, I think this is an
acceptable level of additional complexity for the power it buys you. But if
we're going to make an honest evaluation of the trade-offs, I think we must
mention this.

It may be relevant to the discussion to point out that I work on a tuple-based
streaming system. Product:
<http://www-01.ibm.com/software/data/infosphere/streams/> Academic:
<http://dl.acm.org/citation.cfm?id=1890754.1890761>,
<http://dl.acm.org/citation.cfm?id=1645953.1646061>

------
rjurney
So it sounds like this slots in like so, in order of abstraction:

HIVE -> Pig -> Pangool -> Cascading -> MapReduce

Nice addition!

~~~
ferrerabertran
Hi rjurney. I would say "Hive, Pig, Cascading" are on the higher-level API
side and "Pangool, MapReduce" on the low-level side. Pangool is a MapReduce
API that aims to make MapReduce simpler. We explain this better in our FAQ:
<http://pangool.net/faq.html>

~~~
rjurney
HIVE -> Pig -> Cascading -> Pangool -> MapReduce ?

------
lightblade
Tuples remind me of RDBMS

~~~
kbob
Exactly. A tuple is exactly the same as a relation.

~~~
sixbrx
_Set_ of tuples (of like kind) is a relation.
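
A tiny sketch of that distinction, with a made-up `Employee` tuple type: each row is a tuple, and the relation is the set of rows sharing one heading, so duplicates collapse and order carries no meaning.

```python
from typing import NamedTuple


class Employee(NamedTuple):
    """One tuple; the field names play the role of the relation's heading."""
    name: str
    dept: str


# The relation is the set of tuples, not any single tuple.
employees = {
    Employee("alice", "eng"),
    Employee("bob", "eng"),
    Employee("alice", "eng"),  # duplicate row collapses, as in a relation
}

# Selection (dept == "eng") followed by projection onto `name`.
eng_names = {e.name for e in employees if e.dept == "eng"}
```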

