Elastic MapReduce Streaming in Go

knodi · on July 18, 2014

Streams are something Go is very good at. I recently did a project where I was downloading a large zipped CSV file (50 GB) and had to run analytics and collect metrics on the data. With io.Reader I was able to download, unzip, parse and process the data all in real time with ease. Love it, very powerful.

P.S: without any of the data hitting disk. :)
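
A minimal sketch of that kind of pipeline, not the actual project code. It assumes the file is gzip-compressed (a .zip archive needs random access and cannot be read this way), and the two-column CSV layout and the names `process` and `sampleGzipCSV` are made up for illustration. The in-memory sample stands in for the download; an HTTP response body is also an io.Reader and could be passed to `process` in its place.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"strconv"
)

// sampleGzipCSV builds a tiny gzip-compressed CSV in memory.
// Hypothetical data: a name column and a numeric value column.
func sampleGzipCSV() io.Reader {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte("name,value\na,1.5\nb,2.5\nc,4\n")) // errors ignored in this demo helper
	zw.Close()
	return &buf
}

// process decompresses and parses the stream one record at a time, so
// memory use depends on the record size rather than the file size.
func process(src io.Reader) (rows int, sum float64, err error) {
	zr, err := gzip.NewReader(src)
	if err != nil {
		return 0, 0, err
	}
	defer zr.Close()

	cr := csv.NewReader(zr)
	if _, err := cr.Read(); err != nil { // skip the header row
		return 0, 0, err
	}
	for {
		rec, err := cr.Read()
		if err == io.EOF {
			return rows, sum, nil
		}
		if err != nil {
			return rows, sum, err
		}
		v, err := strconv.ParseFloat(rec[1], 64)
		if err != nil {
			return rows, sum, err
		}
		rows++
		sum += v
	}
}

func main() {
	rows, sum, err := process(sampleGzipCSV())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("rows=%d sum=%.1f\n", rows, sum) // prints "rows=3 sum=8.0"
}
```

Each reader wraps the previous one, so nothing is buffered beyond what the gzip and CSV readers need internally.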

sologoub · on July 18, 2014

What kind of memory footprint would something like this have?

I'm assuming you didn't use 50GB of RAM and the processing was line-by-line, which should be quite small.

jchavannes · on July 18, 2014

I built an ETL that uses a Go stream for the transform. Once the data has been processed it's written to disk on our analytics servers. Since it can keep up with the stream, memory footprint is very small (which also means we never have to hit disk on ETL servers).
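
Not our actual code, but a rough sketch of the shape of such a transform. The tab-separated input, the lower-casing step and the names `transform` and `transformString` are placeholders I made up for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"os"
	"strings"
)

// transform copies src to dst one line at a time, rewriting each line.
// Only the current line is held in memory. Note that bufio.Scanner
// rejects lines longer than 64 KB unless a larger buffer is configured.
func transform(dst io.Writer, src io.Reader) (int, error) {
	sc := bufio.NewScanner(src)
	w := bufio.NewWriter(dst)
	n := 0
	for sc.Scan() {
		fields := strings.Split(sc.Text(), "\t")
		if len(fields) < 2 {
			continue // skip malformed lines
		}
		// Placeholder transform: normalize the key to lower case.
		if _, err := fmt.Fprintf(w, "%s\t%s\n", strings.ToLower(fields[0]), fields[1]); err != nil {
			return n, err
		}
		n++
	}
	if err := sc.Err(); err != nil {
		return n, err
	}
	return n, w.Flush()
}

// transformString runs transform over an in-memory string (handy for tests).
func transformString(in string) (string, int, error) {
	var out strings.Builder
	n, err := transform(&out, strings.NewReader(in))
	return out.String(), n, err
}

func main() {
	// A small in-memory sample; os.Stdin could be passed as src instead.
	sample := strings.NewReader("Foo\t1\nbad line\nBAR\t2\n")
	n, err := transform(os.Stdout, sample)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Fprintf(os.Stderr, "wrote %d lines\n", n)
}
```

Because the function only sees an io.Reader and an io.Writer, the same code works on stdin/stdout, files or network connections.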

Go outperformed our previous solution by over 60x, making our new bottleneck disk I/O instead of CPU.

taschenbillard · on July 18, 2014

What language was your previous solution in?

jchavannes · on July 18, 2014

PHP. It was what I used initially so I could have access to our internal libs since our main application is in PHP. After crashing the ETL server a couple times though, I decided to try moving the heavy lifting to Go.

When we benchmarked them, what would take PHP 30 minutes, Go could do in less than 30 seconds. Only the transform is written in Go, the rest of the ETL is still in PHP.

lipoicacid · on July 18, 2014

http://jaxbot.me/articles/benchmarks_nodejs_vs_go_vs_php_3_1...

This bubble sort benchmark between PHP and Go supports your claims.

knodi · on July 19, 2014

Yes, the app only took up 150 MB max, as the data was processed in chunks.

coreymgilmore · on July 18, 2014

Go, while definitely more complicated to learn initially, has the benefit of a very powerful and inclusive std lib. To me, this is one of its largest benefits: minimal need for external libs and extra things to install.

tete · on July 18, 2014

Go is about the easiest language to learn from my point of view, due to being extremely small and side effect free, while having GC, etc. What are you comparing with?

Other than LISP maybe, but I never really learned that.

khyryk · on July 18, 2014

> side effect free

In which way(s)?

res0nat0r · on July 18, 2014

I'm assuming a scripting language like Python (like they started with in the article) or Ruby.

tete · on July 18, 2014

I found both Python and Ruby a lot harder to learn. Both have a lot of context-dependent stuff, are inconsistent (print not being a function in Python), have things that are hard to deal with (Python's unicode support or the result of 3/2), weird syntax (__init__) and other things (where to import what from), dealing with lists, etc.

Like, Hello World is probably simpler in those languages, but building even a script-style program seems easier in Go, and you never get stuck because of something you don't know yet.

I can't really talk about Ruby. I learned Ruby after Perl and Smalltalk, so putting those two together was too hard.

Perl was a weird thing. It's possibly the hardest language to learn, but the fastest to program in. Kinda the opposite of Go in that sense. Everything depends a lot on context, with tons of tricks, special variables and varying behavior, default behavior you might not know you want and stuff like that. Also it requires a lot of self-discipline, while Go (kinda like Java) forces you to write code the way it is meant to be programmed in. In Perl you can always see whether someone actually programs in Java/C/Shell/Python/... and doesn't know the language well. In Go code one program/library feels like the other.

aroman · on July 18, 2014

Go's standard library is vastly smaller than Python's. (though probably of higher quality)

ominous_prime · on July 18, 2014

Go's std lib is much more useful though (on top of being higher quality).

I've done a number of fairly complex Go projects without leaving the std library -- I can't say the same for python.

lgas · on July 18, 2014

Can either of you give any concrete examples of how Go's standard library is of higher quality than python's? I'm honestly curious.

ominous_prime · on July 18, 2014

Probably the most visible example is the http library. You would never use python's reference http server in production; Go's http server is not only rock solid, but very high performance.

Go's http client is really easy to use; python's http client story is complicated and incomplete, and the majority of people went with the `requests` library.
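
To make that concrete, here is a small sketch using only the standard library. The handler, the `/metrics` path and the helper names `fetch` and `roundTrip` are made up for the example; it uses httptest to start a throwaway local server, where a real service would typically call http.ListenAndServe.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
)

// handler responds to every request with a short plain-text greeting.
func handler(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintf(w, "hello from %s", r.URL.Path)
}

// fetch performs a GET and returns the status code and the body as a string.
func fetch(url string) (int, string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, "", err
	}
	return resp.StatusCode, string(body), nil
}

// roundTrip starts a temporary server on a local port, fetches one path
// from it with the standard client, and shuts the server down again.
func roundTrip(path string) (int, string, error) {
	srv := httptest.NewServer(http.HandlerFunc(handler))
	defer srv.Close()
	return fetch(srv.URL + path)
}

func main() {
	status, body, err := roundTrip("/metrics")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(status, body) // prints "200 hello from /metrics"
}
```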

lgas · on July 18, 2014

Thanks for your answer. Those examples make sense.

irskep · on July 18, 2014

I'm surprised they didn't try mrjob (Python framework), which both has a more concise interface and handles shipping their dependencies to EMR, which seemed to be their primary complaint. Probably would have been easier than rewriting a bunch of stuff in Go. It's supposed to solve the exact problems they have.

Of course, they might like Go for other reasons and have a better time in general, but it's odd not to read a mention of the obvious and simple solution.

julienchastang · on July 18, 2014

Whoa. The nytimes is publishing code!? Times (no pun intended) really are changing.

thecoffman · on July 18, 2014

The NYTimes has published lots of code: https://github.com/nytimes

julienchastang · on July 18, 2014

Yes, in fact they have several well known developers. I just did not realize they had a code blog.

alanleblanc · on July 18, 2014

Is it really as simple as putting a binary on S3 to deploy? Anyone know where more detail--exact steps to take--about that can be found?

2mur · on July 18, 2014

You can essentially deploy with a cross compiled binary by scp:

http://dave.cheney.net/2012/09/08/an-introduction-to-cross-c...

benmanns · on July 18, 2014

This is how I've deployed Go code from OSX to a Linux server:

    brew install go --cross-compile-common
    GOOS=linux GOARCH=amd64 go build
    scp thebinary user@host:thebinary

philip1209 · on July 18, 2014

It sounds like they should investigate Heka.

billwilliams · on July 18, 2014

Not sure I'm convinced the effort of switching streaming frameworks is worth it as opposed to switching to a jvm compiled language.