

Scalding at Etsy - mcfunley
http://mcfunley.com/scalding-at-etsy

======
StefanKarpinski
I was the one at Adtuitive who chose Cascading.JRuby. It was a pre-existing
DSL on top of Cascading. It needed some work, but it was a pretty nice concise
way to generate Cascading jobs. However, with half a dozen to a dozen people
hacking features into it over time at Etsy, and with no real design
coordination, things got pretty out of hand. There were a couple of
fundamental problems:

1. Type system mismatch. Ruby is dynamic and not type checked; Java is static
and type checked; Cascading is somewhere in between and unfortunately doesn't
seem to use Java's type system as well as it could.

2. User-defined functions in strings. For some reason Cascading lets you
write user-defined functions as strings and compiles them dynamically during
job execution. This was _the_ way to write user-defined code in
Cascading.JRuby.

Cascading requires a compilation step, yet since you're writing Ruby code, you
get none of the benefits of static type checking. It was standard to
discover a type issue only after kicking off a job on, oh, 10 EC2 machines,
only to have it fail because of a type mismatch. And user code embedded in
strings would regularly fail to compile – which you again wouldn't discover
until after your job was running.

Each of these was bad individually; together, they were a fucking nightmare.
The interaction between the code in strings and the type system was the worst
of all possible worlds. No type checking, yet incredibly brittle, finicky and
incomprehensible type errors at run time. I will never forget when one of my
friends at Etsy was learning Cascading.JRuby and he couldn't get a type cast
to work. I happened to know what would work: a triple cast. You had to cast
the value to the type you wanted, not once, not twice, but THREE times.

Scalding fixes both problems since it is statically typed and lets you write
real user-defined functions instead of stuffing them in strings. To me, the
main moral is never, ever design an API that involves writing code in strings.
It's just bound to be a disaster. Also, if you're going to have a compilation
step anyway, you might as well get some static checking for obvious problems
like type errors as part of the bargain.
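To make the contrast concrete, here is a minimal Scalding-style job in the
shape of the canonical typed word-count example (the field names and logic
are invented for illustration). The UDFs are ordinary Scala functions, so a
type mismatch fails at compile time, not minutes into a run on ten EC2
machines:

```scala
import com.twitter.scalding._

// Hypothetical job: counts total length of each word in the input.
class WordLengths(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap { line => line.split("\\s+") } // a real function, not a string
    .map { word => (word, word.length) }    // (String, Int), checked statically
    .sumByKey                               // needs a Semigroup[Int], resolved at compile time
    .write(TypedTsv[(String, Int)](args("output")))
}
```

Passing, say, a String where the Int is expected would be rejected by the
compiler before the job ever reaches the cluster.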

~~~
mcfunley
One other unforced error with c.j was having almost every function take a
dictionary of args instead of a sane argument list. That left "grep the
codebase" and "read the entire function definition and god help you if it
passes on the argument dict" as the two horrible options for figuring out how
to even call most of the functions.
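In Scala terms, the contrast being described is roughly the following (these
signatures are hypothetical, not the actual c.j interface):

```scala
object SignatureStyles {
  // Dictionary-of-args style: the valid keys and their expected types are
  // invisible at the call site, so you learn them by grepping the source.
  def sortBy(options: Map[String, Any]): Unit = ()

  // Plain argument list: the signature documents itself, and a bad call
  // fails to compile.
  def sortBy(field: String, descending: Boolean = false): Unit = ()
}
```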

But we probably could have erected a blast shield around that mess if only we
could have written functions and aggregators in ruby.

~~~
gfodor
Yeah that right there was our biggest error I think. We should have had a
"compilation pass" that used JRuby's ruby parser to do program transformation
on the original source, extracting the inline operator code and doing
"something" with it to generate fast JVM bytecode (probably using something
like Mirah) and then transforming the call site into something that would use
this. It would have been a few months of nightmarish debugging but would have
also provided the leverage we needed to do data flow-level type
checking/annotation as well, in addition to not having to write our UDFs in
another language.

I think the thing to remember with the c.j stuff is at the time Cascading's
Java API was pretty much state of the art, and we had to shore up a lot of
what was in c.j in order to get to a viable DSL. And at that point, it was a
huge leap forward from a readability standpoint. (Cascading's API had the same
shortcomings as our c.j API wrt types etc, fwiw)

~~~
StefanKarpinski
At that point you're just implementing a crappy statically typed language
inside a Ruby DSL. Using a real statically typed JVM language that's more
expressive than Java – i.e. Scala – would be much better. So, that's basically
Scalding. Of course, Scalding wasn't even close to existing when we chose
Cascading.JRuby, so yeah.

~~~
gfodor
Yeah, I'm just talking counterfactually. At the time there were no solid
Cascading DSLs whatsoever; Cascalog wasn't around for another two years or
so, iirc. Cascading itself was the new hotness vs. writing raw MapReduce, so
c.j was kind of on the edge of what was out there. The decision to keep UDFs
outside of Ruby was made at the outset (out of concern for performance, and
being icing on the cake, etc.) and not really revisited.

------
bkirwi
I'd like to get in a plug for my personal favourite Scala / Hadoop
productivity framework, Scoobi.[0] If you're already familiar with Scala,
Scoobi has the more familiar and intuitive API, and it leverages the type
system to provide stronger guarantees before the code is even run. (For
example, if Cascading / Scalding don't know how to serialize your data you'll
get a runtime error; in Scoobi, it complains at compile-time. This is _really_
useful when your compile / deploy / run / error loop may take several
minutes...) I've also run into a few bugs in Scalding, while I've found Scoobi
to be much more solid.

OTOH, Scalding has the edge in terms of community size and ecosystem. I
haven't found this to be a big issue -- shimming an existing Hadoop input /
output to work with either project is quite simple -- but YMMV.

[0] [https://github.com/NICTA/scoobi](https://github.com/NICTA/scoobi)
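A sketch of the compile-time serialization check being described, from
memory of Scoobi's published examples (helper names and exact signatures
vary a bit across Scoobi versions):

```scala
import com.nicta.scoobi.Scoobi._

// Hypothetical word count. Each DList element type must have an implicit
// WireFormat instance in scope; a type without one (e.g. something
// wrapping a raw java.io.InputStream) is rejected by the compiler rather
// than failing at runtime on the cluster.
object WordCount extends ScoobiApp {
  def run() {
    val counts: DList[(String, Int)] =
      fromTextFile(args(0))
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .groupByKey
        .map { case (word, ones) => (word, ones.size) }
    persist(toTextFile(counts, args(1)))
  }
}
```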

~~~
avibryant
Yeah, for those same reasons I much prefer using Scalding's typed API [1],
which feels very similar to Scoobi. The tuple API shown in these slides is
great for places like Etsy that already have a large investment in Cascading,
but otherwise you're better off getting the added type safety and similarity
to the standard Scala API.

[1] [https://github.com/twitter/scalding/wiki/Type-safe-api-reference](https://github.com/twitter/scalding/wiki/Type-safe-api-reference)

~~~
bkirwi
I'm familiar with the typed API, but it still doesn't quite bring me as far as
Scoobi does. I recognize your name from the Scalding code, so I'll say that
this is meant as helpful criticism and not a complaint.*

- There are a couple of types for datasets in the Scalding API: TypedPipe,
and KeyedList and its subclasses. Scoobi subsumes both of these under DList;
thanks to the usual Scala wizardry, it has all the methods to operate on
key-value pairs without loss of type safety. This isn't a huge deal, but it
removes the tiny pains of constantly converting back and forth between the
two.

- Scoobi's _other_ abstraction, DObject, represents a single value. These
are usually created by aggregations or as a way to expose the distributed
cache, and they have all the operations you'd expect when joining them
together or with full datasets. You can emulate this in Cascading /
Scalding, but it's a bit less explicit and more error-prone.

- There's no equivalent to Scoobi's compile-time check for serialization in
Scalding, AFAICT.

- Scoobi has fewer opinions about the job runner itself... there are some
helpers for setting up the job, but all features are available as a library.
For some reason, I found the two harder to separate in Scalding.

- IIRC, Scalding did job setup by mutating a Cascading object that was
available implicitly in the Job. In Scoobi, you build up an immutable data
structure describing the computation and hand that to the compiler. This
suits my sense of aesthetics better, I suppose...

* Also, thanks to you guys for Algebird! That's a really fantastic little project, and I use it all the time.

~~~
posco
Quick points:

1) Scalding has a DObject-like type: ValuePipe[+T].

2) The reason you must explicitly call .group to go to a keyed type is that
it is costly to do a shuffle; this makes it clear to people when they
trigger one. If you don't like that, make an implicit def from
TypedPipe[(K, V)] to Grouped[K, V].

3) You can easily use Scalding as a library, but most examples use our
default runner. We use it as a library in Summingbird. But you are right, a
nice doc to help people see what to do might help (hint: set up an implicit
FlowDef and Mode, do your Scalding code, then call a method to run the
FlowDef).
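The implicit def suggested in point 2 could be sketched like this (the
Ordering bound is an assumption, since Scalding's `.group` needs an
ordering on the key; note it hides exactly what the explicit `.group` call
was designed to surface, the shuffle):

```scala
import com.twitter.scalding.typed.{ Grouped, TypedPipe }

object AutoGroup {
  // Silently converts any keyed TypedPipe to a Grouped, so keyed methods
  // are available without an explicit .group -- at the cost of making the
  // resulting shuffle invisible at the call site.
  implicit def typedPipeToGrouped[K: Ordering, V](
      pipe: TypedPipe[(K, V)]): Grouped[K, V] =
    pipe.group
}
```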

~~~
bkirwi
1) Ah, the ValuePipe is (relatively) new; thanks for the pointer.

2) You have to explicitly `.group` in Scoobi as well; it transforms a
DList[(K, V)] to a DList[(K, Iterable[V])] or similar. You _don't_ have to
call `.toTypedPipe` to get map and friends, though, since it's just a DList.

3) I've actually written this exact integration, so I'm glad it's the
approved method! The global, mutable Mode made me nervous, IIRC.

~~~
posco
The global Mode is gone in 0.9.0. And there is an implicit from Grouped to
TypedPipe, so you don't need to call .toTypedPipe (that direction seems less
likely to cause problems, especially given we have mapValues and filter on
Grouped, so we should avoid needlessly leaving the Grouped representation).

~~~
bkirwi
Neat! It looks like things are changing fast; I'll have to do another
read-through.

------
morgante
Totally read this as "Scaling at Etsy." Still interesting, though. And
somewhat accurate.

~~~
krick
That's nothing: I read this as "Soldering is easy" and then spent about 5
seconds trying to work out why the material was different from what I
expected. I should probably sleep more.

~~~
Edmond
Is anyone aware of psychological research into why some people are prone to
this type of error?

This is a problem I have and often find bemusing but sometimes alarming. At
least a few times a week I encounter text that I initially grossly misread by
automatically inserting my own words instead of what was actually written.

I am inclined to think it may have something to do with doing a lot of
complicated programming. Basically, after years of programming and having to
deal with detail, you survive by being good at applying triage to problems,
i.e. knowing what to ignore and what to pay attention to. Perhaps misreading
text is your brain attempting to apply this type of triage.

~~~
stormbrew
I don't think it's likely to have anything to do with 'some people' or any
kinds of tasks you perform. It's probably more to do with word shapes [1],
which are probably at least part of how you read.

[1] [http://en.wikipedia.org/wiki/Bouma](http://en.wikipedia.org/wiki/Bouma)

