Elastic MapReduce Streaming in Go (nytimes.com)
80 points by danso on July 18, 2014 | 26 comments



Streams are something Go is very good at. I recently did a project where I was downloading a large zipped CSV file (50 GB) and had to run analytics and collect metrics on the data. With io.Buffer I was able to download, unzip, parse, and process the data all in real time with ease. Love it, very powerful.

P.S: without any of the data hitting disk. :)


What kind of memory footprint would something like this have?

I'm assuming you didn't use 50 GB of RAM and that the processing was line by line, so the footprint should be quite small.


I built an ETL that uses a Go stream for the transform. Once the data has been processed it's written to disk on our analytics servers. Since it can keep up with the stream, memory footprint is very small (which also means we never have to hit disk on ETL servers).

Go outperformed our previous solution by over 60x, making our new bottleneck disk I/O instead of CPU.


What language was your previous solution in?


PHP. It was what I used initially so I could have access to our internal libs since our main application is in PHP. After crashing the ETL server a couple times though, I decided to try moving the heavy lifting to Go.

When we benchmarked them, what would take PHP 30 minutes, Go could do in less than 30 seconds. Only the transform is written in Go, the rest of the ETL is still in PHP.


http://jaxbot.me/articles/benchmarks_nodejs_vs_go_vs_php_3_1...

This bubble sort benchmark between PHP and Go supports your claims.


Yes, the app only took up 150 MB at most, as the data was processed in chunks.


Go, while definitely more complicated to learn initially, has the benefit of a very powerful and inclusive std lib. To me, this is one of its largest benefits: minimal need for external libs and extra things to install.


Go is about the easiest language to learn from my point of view, due to being extremely small and side effect free, while having GC, etc. What are you comparing with?

Other than LISP maybe, but I never really learned that


> side effect free

In which way(s)?


I'm assuming a scripting language like Python (like they started with in the article) or Ruby.


I found both Python and Ruby a lot harder to learn. Both have a lot of context-dependent stuff, inconsistencies (print not being a function in Python), hard-to-deal-with stuff (Python's unicode support or the result of 3/2), weird syntax (__init__), and other things (where to import what from, dealing with lists, etc.).

Sure, Hello World is probably simpler in those, but building even a script-style program seems easier in Go, and you never get stuck because of something you don't know yet.

I can't really talk about Ruby. I learned Ruby after Perl and Smalltalk, so putting those two together was too hard.

Perl was a weird thing. It's possibly the hardest language to learn, but the fastest to program in. Kinda the opposite of Go in that sense. Everything depends a lot on context, with tons of tricks, special variables, and varying behavior, plus default behavior you might not know you want, and stuff like that. It also requires a lot of self-discipline, while Go (kinda like Java) forces you to write code the way it is meant to be written. In Perl you can always see whether someone actually programs in Java/C/Shell/Python/... and doesn't know the language well. In Go, one program or library feels like any other.


Go's standard library is vastly smaller than Python's (though probably of higher quality).


Go's std lib is much more useful though (on top of being higher quality).

I've done a number of fairly complex Go projects without leaving the std library. I can't say the same for Python.


Can either of you give any concrete examples of how Go's standard library is of higher quality than Python's? I'm honestly curious.


Probably the most visible example is the HTTP library. You would never use Python's reference HTTP server in production; Go's HTTP server is not only rock solid but also very high performance.

Go's HTTP client is really easy to use; Python's HTTP client story is complicated and incomplete, and the majority of people went with the `requests` library.


Thanks for your answer. Those examples make sense.


I'm surprised they didn't try mrjob (Python framework), which both has a more concise interface and handles shipping their dependencies to EMR, which seemed to be their primary complaint. Probably would have been easier than rewriting a bunch of stuff in Go. It's supposed to solve the exact problems they have.

Of course, they might like Go for other reasons and have a better time in general, but it's odd not to read a mention of the obvious and simple solution.


Whoa. The nytimes is publishing code!? Times (no pun intended) really are changing.


The NYTimes has published lots of code: https://github.com/nytimes


Yes, in fact they have several well known developers. I just did not realize they had a code blog.


Is it really as simple as putting a binary on S3 to deploy? Does anyone know where more detail (the exact steps to take) can be found?


You can essentially deploy with a cross compiled binary by scp:

http://dave.cheney.net/2012/09/08/an-introduction-to-cross-c...


This is how I've deployed Go code from OS X to a Linux server:

    brew install go --cross-compile-common
    GOOS=linux GOARCH=amd64 go build
    scp thebinary user@host:thebinary


It sounds like they should investigate Heka.


Not sure I'm convinced the effort of switching streaming frameworks is worth it, as opposed to switching to a JVM-compiled language.



