1. Type system mismatch. Ruby is dynamic and not type checked; Java is static and type checked; Cascading is somewhere in between and unfortunately doesn't seem to use Java's type system as well as it could.
2. User-defined functions in strings. For some reason Cascading lets you write user-defined functions as strings and compiles them dynamically during job execution. This was the way to write user-defined code in Cascading.JRuby.
Cascading requires a compilation step, yet since you're writing Ruby code, you get none of the benefits of static type checking. It was standard to discover a type issue only after kicking off a job on, oh, 10 EC2 machines, only to have it fail because of a type mismatch. And user code embedded in strings would regularly fail to compile – which you again wouldn't discover until after your job was running.
Each of these was bad individually; together, they were a fucking nightmare. The interaction between the code in strings and the type system was the worst of all possible worlds. No type checking, yet incredibly brittle, finicky and incomprehensible type errors at run time. I will never forget when one of my friends at Etsy was learning Cascading.JRuby and he couldn't get a type cast to work. I happened to know what would work: a triple cast. You had to cast the value to the type you wanted, not once, not twice, but THREE times.
Scalding fixes both problems since it is statically typed and lets you write real user-defined functions instead of stuffing them in strings. To me, the main moral is never, ever design an API that involves writing code in strings. It's just bound to be a disaster. Also, if you're going to have a compilation step anyway, you might as well get some static checking for obvious problems like type errors as part of the bargain.
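To make the contrast concrete, here is a minimal sketch in plain Scala. It is a toy model, not Cascading's or Scalding's API: `UntypedRow`, `evalStringUdf`, `Listing`, and `typedUdf` are hypothetical names invented for illustration, and the "string UDF" only understands expressions of the form `<field> * <integer>`. The point is just where each kind of mistake surfaces.

```scala
object StringVsTypedUdf {
  // Untyped row: field name -> value, checked only when a row is processed.
  // (Hypothetical stand-in; not Cascading's Tuple API.)
  type UntypedRow = Map[String, Any]

  // Toy "UDF in a string": parsed and type-checked only at run time.
  def evalStringUdf(expr: String, row: UntypedRow): Int =
    expr.trim.split("\\s+") match {
      case Array(field, "*", factor) =>
        // The cast throws ClassCastException if the field is not an Int.
        row(field).asInstanceOf[Int] * factor.toInt
      case _ =>
        throw new IllegalArgumentException(s"cannot parse: $expr")
    }

  // Typed row: field types are part of the program.
  final case class Listing(title: String, price: Int)

  // A real function value: a typo or a wrong type here fails compilation,
  // e.g. Listing("mug", "12") is rejected before anything runs.
  val typedUdf: Listing => Int = listing => listing.price * 100
}
```

With the string version, both a malformed expression such as `"price ** 100"` and a row whose `price` happens to be a `String` are only reported once the data is flowing. With the typed version, the equivalent mistakes are compile errors.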
But we probably could have erected a blast shield around that mess if only we could have written functions and aggregators in Ruby.
I think the thing to remember with the c.j stuff is at the time Cascading's Java API was pretty much state of the art, and we had to shore up a lot of what was in c.j in order to get to a viable DSL. And at that point, it was a huge leap forward from a readability standpoint. (Cascading's API had the same shortcomings as our c.j API wrt types etc, fwiw)
OTOH, Scalding has the edge in terms of community size and ecosystem. I haven't found this to be a big issue -- shimming an existing Hadoop input / output to work with either project is quite simple -- but YMMV.
- There are a couple of types for datasets in the Scalding API: TypedPipe, and KeyedList and subclasses. Scoobi subsumes both of these under DList; thanks to the usual Scala wizardry, this has all the methods to operate on key-value pairs without loss of type safety. This isn't a huge deal, but it removes the tiny pains of constantly converting back and forth between the two.
- Scoobi's other abstraction, DObject, represents a single value. These are usually created by aggregations or as a way to expose the distributed cache, and have all the operations you'd expect when joining them together or with full datasets. You can emulate this in Cascading / Scalding, but it's a bit less explicit and more error-prone.
- There's no equivalent to the compile-time check for serialization in Scalding, AFAICT.
- Scoobi has fewer opinions about the job runner itself... there are some helpers for setting up the job, but all features are available as a library. For some reason, I found the two harder to separate in Scalding?
- IIRC, Scalding did job setup by mutating a Cascading object that was available implicitly in the Job. In Scoobi, you build up an immutable datastructure describing the computation and hand that to the compiler. This suits my sense of aesthetics better, I suppose...
* Also, thanks to you guys for Algebird! That's a really fantastic little project, and I use it all the time.
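The difference between the two job-setup styles described in the list above (mutating an implicitly available object versus building an immutable description and handing it to a compiler) can be sketched in plain Scala. This is a toy model under stated assumptions: `MutablePlan`, `Plan`, `Read`, `Write`, and the `run*` functions are hypothetical names, not Scalding or Scoobi APIs.

```scala
import scala.collection.mutable.ListBuffer

object PlanStyles {
  // Style 1: each step registers itself on a mutable plan passed implicitly.
  final class MutablePlan {
    val steps: ListBuffer[String] = ListBuffer.empty[String]
  }

  def read(path: String)(implicit plan: MutablePlan): Unit = {
    plan.steps += s"read $path"
    ()
  }

  def write(path: String)(implicit plan: MutablePlan): Unit = {
    plan.steps += s"write $path"
    ()
  }

  // "Running" the mutable plan just lists whatever was registered so far.
  def runMutable(plan: MutablePlan): List[String] = plan.steps.toList

  // Style 2: the computation is an immutable value handed to an interpreter.
  sealed trait Plan
  final case class Read(path: String) extends Plan
  final case class Write(input: Plan, path: String) extends Plan

  def runImmutable(plan: Plan): List[String] = plan match {
    case Read(path)         => List(s"read $path")
    case Write(input, path) => runImmutable(input) :+ s"write $path"
  }
}
```

In the first style the plan is a side effect of calling `read` and `write`, so what gets run depends on what happened to be registered. In the second, `Write(Read("in.tsv"), "out.tsv")` is an ordinary value that can be inspected, passed around, or tested before anything executes.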
1) Scalding has a DObject-like type: ValuePipe[+T].
2) The reason you must explicitly call .group to go to a keyed type is that it is costly to do a shuffle; this makes it clear to people when they do trigger a shuffle. If you don't like that, make an implicit def from TypedPipe[(K, V)] to Grouped[K, V].
3) You can easily use Scalding as a library, but most examples use our default runner. We use it as a library in Summingbird. But you are right, a nice doc showing people what to do might help (hint: set up an implicit FlowDef and Mode, do your Scalding code, then call a method to run the FlowDef).
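For point 2, the shape of such an opt-in implicit conversion can be sketched with toy stand-ins. `ToyPipe` and `ToyGrouped` below are hypothetical in-memory classes, not Scalding's `TypedPipe` and `Grouped`; with the real library the conversion body would call the real `.group`, and its exact implicit requirements are not shown here.

```scala
import scala.language.implicitConversions

object ImplicitGroupSketch {
  // Toy stand-in for a typed dataset; NOT Scalding's TypedPipe.
  final case class ToyPipe[T](items: List[T]) {
    // Explicit step marking where a (costly) shuffle would happen.
    def group[K, V](implicit ev: T <:< (K, V)): ToyGrouped[K, V] =
      ToyGrouped(items.map(ev).groupBy(_._1).map {
        case (k, kvs) => k -> kvs.map(_._2)
      })
  }

  // Toy stand-in for a keyed dataset; NOT Scalding's Grouped.
  final case class ToyGrouped[K, V](groups: Map[K, List[V]]) {
    def sum(implicit num: Numeric[V]): Map[K, V] =
      groups.map { case (k, vs) => k -> vs.sum }
  }

  // Opt-in conversion: only active where it is imported.
  object AutoGroup {
    implicit def pipeToGrouped[K, V](pipe: ToyPipe[(K, V)]): ToyGrouped[K, V] =
      pipe.group[K, V]
  }

  val sales: ToyPipe[(String, Int)] =
    ToyPipe(List("mug" -> 2, "bowl" -> 1, "mug" -> 3))

  // Default style: the shuffle is visible in the code.
  val explicitTotals: Map[String, Int] = sales.group.sum

  // With the conversion imported, keyed methods appear directly on the pipe.
  val implicitTotals: Map[String, Int] = {
    import AutoGroup.pipeToGrouped
    sales.sum
  }
}
```

The trade-off is the one described above: the implicit version is shorter, but the grouping step no longer shows up at the call site, so it is harder to see where a shuffle is triggered.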
I thought Etsy was test-buying some of the kitchen items listed there and finding them painfully unfit for use.
(No, really. I'm not trying to be clever. That honestly was my first thought.)
This is a problem I have and often find bemusing but sometimes alarming. At least a few times a week I encounter text that I initially grossly misread by automatically inserting my own words instead of what was actually written.
I am inclined to think it may have something to do with doing a lot of complicated programming. Basically after years of programming and having to deal with detail, you survive by being good at applying triage to problems, i.e. knowing what to ignore and what to pay attention to. Perhaps misreading text is your brain attempting to apply this type of triage.