

Elastic MapReduce Streaming in Go - danso
http://open.blogs.nytimes.com/2014/07/10/emr-streaming-in-go/

======
knodi
Streams are something Go is very good at. I recently did a project where I was
downloading a large zipped CSV file (50 GB) and had to run analytics and
collect metrics on the data. With io.Buffer I was able to download, unzip,
parse and process the data all in real time with ease. Love it, very powerful.

P.S: without any of the data hitting disk. :)

~~~
sologoub
What kind of memory footprint would something like this have?

I'm assuming you didn't use 50GB of RAM and the processing was line-by-line,
which should be quite small.

~~~
jchavannes
I built an ETL that uses a Go stream for the transform. Once the data has been
processed it's written to disk on our analytics servers. Since it can keep up
with the stream, memory footprint is very small (which also means we never
have to hit disk on ETL servers).

Go outperformed our previous solution by over 60x, making our new bottleneck
disk I/O instead of CPU.

~~~
taschenbillard
What language was your previous solution in?

~~~
jchavannes
PHP. It was what I used initially so I could have access to our internal libs
since our main application is in PHP. After crashing the ETL server a couple
times though, I decided to try moving the heavy lifting to Go.

When we benchmarked them, what would take PHP 30 minutes, Go could do in less
than 30 seconds. Only the transform is written in Go, the rest of the ETL is
still in PHP.

~~~
lipoicacid
[http://jaxbot.me/articles/benchmarks_nodejs_vs_go_vs_php_3_1...](http://jaxbot.me/articles/benchmarks_nodejs_vs_go_vs_php_3_14_2013)

This bubble sort benchmark between PHP and Go supports your claims.

------
coreymgilmore
Go, while definitely more complicated to learn initially, has the benefit of a
very powerful and inclusive std lib. To me, this is one of its largest
benefits: minimal need for external libs and extra things to install.

~~~
tete
Go is about the easiest language to learn from my point of view, due to being
extremely small and side effect free, while having GC, etc. What are you
comparing with?

Other than LISP maybe, but I never really learned that.

~~~
res0nat0r
I'm assuming a scripting language like Python (like they started with in the
article) or Ruby.

~~~
tete
I found both Python and Ruby a lot harder to learn. Both have a lot of
context-dependent behavior, are inconsistent (print not being a function in
Python), have things that are hard to deal with (Python's unicode support or
the result of 3/2), weird syntax (__init__) and other hurdles (where to import
what from, dealing with lists, etc.).

Like, Hello World is probably simpler there, but building even a script-style
program seems easier in Go, and you never get stuck because of something you
don't know yet.

I can't really talk about Ruby. I learned Ruby after Perl and Smalltalk, so
putting those two together was too hard.

Perl was a weird thing. It's possibly the hardest language to learn, but the
fastest to program in. Kinda the opposite of Go in that sense. Everything
depends a lot on context: tons of tricks, special variables and varying
behavior, with default behavior you might not know you want and stuff like
that. It also requires a lot of self-discipline, while Go (kinda like Java)
forces you to write code the way it is meant to be written. In Perl you can
always see whether someone actually programs in Java/C/Shell/Python/... and
doesn't know the language well. In Go code one program/library feels like
the other.

------
stevejohnson
I'm surprised they didn't try mrjob (Python framework), which both has a more
concise interface and handles shipping their dependencies to EMR, which seemed
to be their primary complaint. Probably would have been easier than rewriting
a bunch of stuff in Go. It's supposed to solve the exact problems they have.

Of course, they might like Go for other reasons and have a better time in
general, but it's odd not to read a mention of the obvious and simple
solution.

------
julienchastang
Whoa. The nytimes is publishing code!? Times (no pun intended) really are
changing.

~~~
thecoffman
The NYTimes has published lots of code:
[https://github.com/nytimes](https://github.com/nytimes)

~~~
julienchastang
Yes, in fact they have several well-known developers. I just did not realize
they had a code blog.

------
alanleblanc
Is it really as simple as putting a binary on S3 to deploy? Does anyone know
where more detail (the exact steps to take) can be found?

~~~
2mur
You can essentially deploy with a cross compiled binary by scp:

[http://dave.cheney.net/2012/09/08/an-introduction-to-cross-c...](http://dave.cheney.net/2012/09/08/an-introduction-to-cross-compilation-with-go)

~~~
benmanns
This is how I've deployed Go code from OS X to a Linux server:

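    # Install Go via Homebrew with cross-compilers for common platforms
    brew install go --cross-compile-common
    # Build a 64-bit Linux binary from the current package
    GOOS=linux GOARCH=amd64 go build
    # Copy the resulting binary to the server
    scp thebinary user@host:thebinary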

------
philip1209
It sounds like they should investigate Heka.

------
billwilliams
Not sure I'm convinced the effort of switching streaming frameworks is worth
it as opposed to switching to a language that compiles to the JVM.

