
Clojure/core — Functional Relational Programming with Cascalog - fogus
http://clojure.com/blog/2012/02/03/functional-relational-programming-with-cascalog.html
======
plam
What I particularly like about Cascalog: 1) composability, 2) unit testing,
and 3) no installation, since it is embedded in your jar.

------
amstr
The two things missing from Cascalog that would take it from great to godlike
are 1) an easy way to use the distributed cache and 2) a way to run Cascalog
jobs on the cluster without the compilation/hadoop jar cycle. I don't know if
#2 is even possible, but it would be ridiculously powerful.

~~~
plam
Could you elaborate on #1, please? Wouldn't a distributed cache defeat the
purpose of Hadoop's data locality? Regardless, I guess one could write a tap
to Avout to enable this?

~~~
amstr
Sorry, just saw this reply. Hadoop comes with a distributed cache that is
generally used for small files. A common example is doing a large join
against a small table that fits in memory. For example, if you wanted to
filter out stopwords, the currently accepted way is to put the stopword list
into the resources/ directory of your JAR, which is not really optimal for
data that might change frequently.
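
To make the "large join against a small table" idea concrete, here is a minimal sketch in plain Python. It is not Hadoop or Cascalog code: the function names and sample data are made up for illustration, and the stopword lines stand in for a small file that each task would read locally (for instance, one shipped through the distributed cache).

```python
# Illustrative sketch only: plain Python, not the Hadoop or Cascalog API.
# All names and sample data below are hypothetical.

def load_stopwords(lines):
    """Build an in-memory set from the small side of the join."""
    return {line.strip().lower() for line in lines if line.strip()}


def filter_stopwords(records, stopwords):
    """Stream the large side, dropping records whose word is a stopword."""
    for word, count in records:
        if word.lower() not in stopwords:
            yield word, count


# Small side: would normally come from a small file local to each task.
stopwords = load_stopwords(["the\n", "a\n", "\n", "of\n"])

# Large side: stands in for the records a mapper would stream through.
records = [("the", 10), ("Hadoop", 3), ("of", 7), ("cascalog", 2)]

kept = list(filter_stopwords(records, stopwords))
```

The small table is loaded once and held in memory, so the large dataset is only streamed and never shuffled for the join. Swapping the stopword source for a file that can be updated independently of the JAR is what the distributed cache would buy you here.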

[http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/...](http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/filecache/DistributedCache.html)

and for discussions related to Cascalog:
[http://groups.google.com/group/cascalog-user/browse_thread/t...](http://groups.google.com/group/cascalog-user/browse_thread/thread/fbf96e5c37d317b4) and
[https://groups.google.com/forum/#!topic/cascalog-user/l5SEW3...](https://groups.google.com/forum/#!topic/cascalog-user/l5SEW3vJheo)

I have not seen any info on using Cascalog alongside Avout, but the idea makes
sense.

