How and Why Mixpanel Switched from Erlang to Python (mixpanel.com)
139 points by ankrgyl 2321 days ago | 80 comments

I might get flamed/downvoted for this, but where's the content of this article? Apart from some vague tropes ("Erlang is bad at string processing!" "We need scalability, and that means async Python", "Pick the right Python libraries!"), there's really nothing interesting about the implementation details, especially considering such an attention-grabbing headline (to put it nicely).

What specifically was bad about the Erlang code? Isn't this just saying that if nobody in your company really understands a language, don't use it?

This is more about the technical competency of a specific company than general technical issues. Or, to put it bluntly, it's more "Mixpanel sucks at Erlang" than "Erlang sucks". Don't get me wrong, I'd be really, really interested in a good analysis of why Erlang was the wrong choice in this case, but this article didn't even get close to anything technically interesting (even with the ubiquitous requests/s graph).

The thing I learned from the article is that they return 0 for failure and 1 for success. And then, instead of fixing this meaningfully, they added a bit of sticky plaster that enables you to ask for verbose info when you get a 0.

When I spot such code smells, I can't help but think that perhaps the issue runs deeper. So when I then read about Erlang vs Python, I also can't help but bet that the language choice is not the problem.

I don't think I fully understood that I judge code at first sight like that, but I clearly do.

Author of the post here. I agree that 0 and 1 aren't the best choices for a response, but you have to keep backwards compatibility in mind. We wanted this change to be transparent to people who have production code that interacts with the API, so it wasn't the appropriate time to change that. On the other hand, for people developing future APIs, we've added extra feedback in a way that doesn't affect other customers' code. I agree that it's not the best solution, but you have to make these kinds of compromises when you have real customers.
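
For readers who want to picture this kind of compromise, here is a minimal sketch in Python. It is not Mixpanel's code: the `verbose` flag name and the JSON field names are assumptions made up for illustration. The idea is only that legacy clients keep getting the bare 0/1 they already parse, while clients that opt in get more detail.

```python
import json


def format_response(ok, error=None, verbose=False):
    """Return the response body for a tracking request.

    Legacy clients get the bare "1"/"0" string; clients that opt in via
    the (hypothetical) verbose flag get a JSON object instead.
    """
    status = 1 if ok else 0
    if not verbose:
        return str(status)
    # Field names "status" and "error" are illustrative assumptions.
    return json.dumps({"status": status, "error": error}, sort_keys=True)
```

Because the default path is unchanged, existing integrations would not need to know that the verbose form exists.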

Yes, that's a very odd thing to do. More so when you have a better model -- HTTP response codes -- right in front of you!

HTTP response codes don't cover all errors though. They're good at describing problems with HTTP communication, but aren't adequate for describing non-HTTP application errors.

Additional error indicators are fine (much better than overloading the meaning of HTTP codes), but they should be sane, meaningful, consistent with developer expectations, etc. It's that part where the implementation is weird.

I looked at their site and saw they're hiring. Being a problem solver I tend to think "I can fix your problems, let me help", so I looked to see if they need an architect or chief scientist or something... but when the architect position ( http://mixpanel.theresumator.com/apply/Eoh3qJ/Solutions-Arch... ) has the minimum experience of "Student (College)"... I just face palm and am not surprised by any of this - I bet their definition of solutions architect is really just sales support.

I think they just need to raise the calibre of their backend, the signals here (this article, the error codes, the job advert) are not things that inspire confidence in the product. As the main thing that they do is crunching data I'm taken aback that even their backend engineer position advert ( http://mixpanel.theresumator.com/apply/CiOzuu/Software-Engin... ) doesn't even have the word algorithm in it, but maybe they have a wizard who does that for them.

Am I being harsh? Probably... I don't know the guys there and I may just have a very narrow reading of too little info to make such assumptions, so to be a bit balanced (I'm British, it's what we do), they clearly "get" reporting needs.

Their interfaces show that they understand the kind of thing people want to see. I was impressed by those. I've done reporting a lot, and so many times you see reports that forget that the reader is trying to understand something. So it's refreshing to see reports that appear to remember that.

I usually return Bad Request with an explanation/verbose error in the body, for errors which don't fall under the other codes. It works pretty well.

Doesn't Bad Request usually mean malformed HTTP request data? I find myself using Forbidden when users send me bogus/invalid data, as it's a more generic 'No', and a body with more information never hurts.

It does, but I consider it more of an indicator that the request (including its content) was somehow invalid. I reserve Forbidden for cases where the client needs to authenticate.

EDIT: Actually, I see that the spec says that authentication should do nothing to fix a Forbidden error, so you might be right. Unauthorized is for authentication.

Perhaps this is what buro9 means about overloading error codes, but the rest map so cleanly onto the errors that it's hard to ignore them. You can't very well return 200 and say "There was error X" in the response.

Maybe we should use 418?

If in doubt, I'm a Teapot serves well ;)

What's the HTTP error code for "ambiguous input, please select from one of the following unambiguous suggestions"?

Just returning an error isn't helpful, you want to help them resolve it. So you want to return something, and this isn't success (in an application sense) as you didn't do what they asked of you, you're doing something else. But it IS success in HTTP terms, the request made it to the server, got processed fine, and came back fine.

Application codes != HTTP codes.

It's debatable whether the application is entirely separate from the RPC layer. For example, when writing a RESTful API, would you really want the server to respond with 200 for everything, even for "Resource not found" errors?

In web applications, HTTP codes and application codes are quite integrated. Of course, you do need some extra error info, as HTTP error codes can't possibly cover every error in your app, but 422 with extra info as a catch-all sounds like a reasonable compromise.

I follow Rails' example and use 422 (Unprocessable Entity - The request was well-formed but was unable to be followed due to semantic errors).
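
A rough sketch of how that split could look in Python, using only the standard library. The required "event" key and the error field names are invented for illustration and are not taken from any real API: unparseable input gets 400, while well-formed JSON that the application cannot act on gets 422 with an explanation in the body.

```python
import json


def check_event(raw_body):
    """Map a request body to an HTTP status line and a JSON error body.

    The required "event" key is a made-up rule for illustration.
    """
    try:
        payload = json.loads(raw_body)
    except ValueError as exc:
        # Not parseable at all: the request itself is malformed.
        return "400 Bad Request", json.dumps(
            {"error": "malformed JSON", "detail": str(exc)})
    if not isinstance(payload, dict) or "event" not in payload:
        # Well-formed JSON that the application cannot act on.
        return "422 Unprocessable Entity", json.dumps(
            {"error": "missing required field", "field": "event"})
    return "200 OK", json.dumps({"error": None})
```

The status code stays useful to generic HTTP tooling, and the body carries the application-level detail that the status codes cannot express.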

Ah, that sounds perfect, actually. Thank you.

I agree with you. The most important lesson I got from this post is that you'll get a bunch of hits if you choose a provocative title.

There isn't even a comparison on the graph between Erlang and Python versions!

I think it did make a technically interesting point--if your infrastructure is written in a language that nobody in your team has mastered, it might be worth the investment to port it to something else.

Erlang is widely regarded as a great platform for servers with lots of concurrent connections because threads are so lightweight (Facebook chat is built on Erlang). And yet Mixpanel decided to rewrite in a language that would be more maintainable for them in the future.

Yeah, I don't think anyone disagrees with you. It's just that this point was rather hidden beneath a misleading title and content!

I don't think the title is misleading at all. "How and Why We Switched from Erlang to Python".

How - by writing a WSGI app that uses gevent and simplejson

Why - because the Erlang code was old, no one knew Erlang well enough to maintain, and Python is the language used everywhere else at Mixpanel

The title wasn't "OMG PYTHON RULEZ ERLANG SUCKZ!!". The only real ambiguity in the title was that Mixpanel (as a whole codebase) didn't switch from Erlang to Python, just a small module switched.

What did you expect with this title? More language criticism?

On second reading, you're right. I'm now trying to decide if I should delete my original comment.

I agree with you, for what it's worth.

I just hope you're not suggesting the author should never have written it - not every blog post needs to be a dazzler. I'll assume you're angry about the upvotes instead.

Pretty harsh response to a post from an intern, dudes and dudettes.

Yeah, the erlang server code was pretty gross. It was one of the very first things written when we started Mixpanel over 2 years ago, and it's only been updated a few times since then.

I feel like the big thing you guys are missing is how little time we have. It's not like we don't know when we have bad code, or we don't realize that we made mistakes in the initial server design - we just have a million things on our collective plate. Fixing a very simple server - (accept request, validate json, put on queue) that is doing its job okay hasn't been a high priority thing.

When this code was written, Mixpanel had zero customers and we weren't sure what we were building yet. In that regard, Erlang has been a rock. We've barely had to touch it during the rampup from 0 to thousands of requests per second.

Now that we have the manpower, and we've learned what we really need, we can rewrite it to make things easier on ourselves. If we can get acceptable performance in python, there is no reason to use erlang.

I think there's some merit to the other complaints (error codes, etc), but that's another symptom of this thing being written so long ago. We want to improve things incrementally (and backwards-compatibly) for now, but it will be dramatically simpler for us to make changes to the server now that it's written in python.

Ultimately, we have skeletons in the closet, just like the rest of you - I'm sure all of you have some bad code in production somewhere. Now we're saying "Look, we're getting rid of our skeletons!" and you guys are like "OMG WHY YOU HAVE SKELETONS" instead of "sweet, no more skeletons".

Seriously, it would be interesting to see your code. As an Erlang inventor/developer it would be interesting to see how the language is actually used and how that relates to the problems people have.

I agree that not having Erlang competence in your company IS a good reason to change language.

I agree with this. It would be interesting to see how Erlang was used - I find engineers that have "issues" with their erlang programs aren't actually using OTP to its fullest (using behaviors and supervisors, packaging as an application, etc...).

  Finally, we use a few stateful, global data structures to
  track incoming requests and funnel them off to the right 
  backend queues. In Erlang, the right way to do this is to 
  spawn off a separate set of actors to manage each data 
  structure and message pass with them to save and retrieve 
  data. Our code was not set up this way at all, and it was 
  clearly crippled by being haphazardly implemented in a 
  functional style.

Seriously?! I have used only a little erlang, but this makes no sense to me - it's like you were writing some big java project and put everything in one huge class, with all methods and variables static. It's hard for me to imagine why and how someone would write a production erlang app with no actors, especially some kind of server. No wonder the thing sucked in the first place.

From the article:

   Because of these performance requirements, we originally wrote the
   server in Erlang (with MochiWeb) two years ago. After two years of
   iteration, the code has become difficult to maintain.  No one on
   our team is an Erlang expert, and we have had trouble debugging
   downtime and performance problems. So, we decided to rewrite it
   in Python, the de-facto language at Mixpanel.

My first impulse would have been to have one or more team members become Erlang experts. Was that considered?

"After two years of iteration (on an Erlang codebase) ...no one on our team is an Erlang expert,"

This sounds a little strange. How is this possible? High turnover on the team?

The API server described in this post is actually a very very small part of the Mixpanel codebase and was the only part written in Erlang (as far as I know). The Erlang server was written in the very early days of Mixpanel and worked well enough for a year and a half, mostly untouched. The statement about iteration is a bit ambiguous, because there was actually very little iteration on the Erlang server, but a ton on the rest of the (Python) product.

So, the lack of iteration wasn't due to high turnover, it was due to having something that worked and other problems to solve. (I'm an intern at Mixpanel)

By the way, you could update your job offers, i.e. the part saying that you write Erlang, which it seems is not true anymore. I don't get why you put that info there in the first place if you had only one piece of Erlang code, which you removed instead of refactoring.

... and guess who got the task to re-write the server?

...an intern...

Or why not point out the obvious? Erlang is hard to learn.

Anyone claiming to have mastered it in a couple of months, like most fanatics here, is (pardon my language) full of crap.

I've seen people master it quickly, writing their own behaviours etc..

I learned it on the tube to and from work in a relatively short space of time, writing a comet server for streaming updates to browsers with a colleague. After the initial hump it was easy and fun. I don't claim to be a genius programmer, just an interested dabbler.

To be fair though, the more average developer does struggle with it, and that was a reason my company didn't take it up widely.

Took me a week to be comfortable with it, but would take at least 2 years for me to grok the performance implications of each construct.

Erlang is the first functional language I've written, and I found it very easy to get going with. The syntax takes a bit of getting used to, but it's a compact language with not too many constructs to learn. I bought the Erlang and OTP in Action book to get my head around the OTP system, but the rest was fairly approachable.

Well, "Mastery" is a standard that I think, by definition, few achieve.

But competence in erlang is certainly achievable in a couple of months.

In my case, I started having never once written anything in a functional language. I started by reading Armstrong's book from the Pragmatic Programmers. After a couple of months I could write anything I wanted to in erlang, without too much trouble, and much more importantly:

I was in love with erlang

The idea of writing something in erlang was very exciting and pleasurable to do.

I think that's quite an accomplishment for a language with such a different syntax and with me only putting a couple months into it.

Now, obviously, I won't say I'd mastered it. But prior to that I had the fear of the unknown when it came to functional languages, and a distinct repulsion at erlang's syntax. The only reason I learned it was that I believed it to be the only language that had done concurrency right.

And I had a project to write in it that required concurrency. That might be a big help. I imagine it is hard to learn any language if you don't have a goal that it is well positioned to achieve to work towards while learning.

I suspect people who think erlang is hard to learn, haven't spent any time learning erlang. But in truth, no language, except the very first one, was hard for me to learn.

It seems in a lot of cases, Erlang is just used because of its reputation at being really good at concurrency, mostly in rather minimal API implementations -- or in other words, servers that could easily be done in almost any language that provides some decent event-handling functions. Ruby, Python, node, C/libev, etc.

So unless we're talking about thousands of lines of code, it really doesn't matter what library or language you'll choose for something like this. If this would be your only use of Erlang, it's probably not worth it. Erlang is pretty great at building distributed, high-concurrency applications that are good at coping with errors. For one out of those three, you have plenty of other options…

Out of curiosity, how long do you think it takes to become an expert in Erlang?

10,000 hours.

“Finally, we use a few stateful, global data structures to track incoming requests and funnel them off to the right backend queues. In Erlang, the right way to do this is to spawn off a separate set of actors to manage each data structure and message pass with them to save and retrieve data.”

Nope, that’s not the right way. The way you were doing it ended up making all calls sequential and bound to single processes that could lose state. That’s not right.

The best way to do it would have been to use ETS tables (which can be optimized either for parallel reads or writes), which also allow destructive updates, in order to have the best performance and memory usage possible. Note that you could then have had a memory-only Mnesia table (adding transactions, sharding and distribution on top of ETS) to do it.

As for string performance, I’m wondering if you used lists-as-strings, binary strings or io-lists to do your things. This can have a significant impact on performance and memory use.

Then again, if you had a bunch of Python and no Erlang experts, I can’t really say anything truly convincing against a language switch. Go for what your team feels good with.

> The biggest challenge for me was pushing the server from working 99.9% of the time to 99.99% of the time, because those last few bugs were especially hard to find.

Could you expand upon this some more? How do you know the server works 99.99% of the time (or 99.9%)? Do you run regression tests using actual past requests?


Bob's software sucks, let's switch to Bob's software.

That's pretty much all I got from that article.


I don't know if it's what he is referring to, but it would be an awesome coincidence if it wasn't.

Bob Ippolito wrote mochiweb (the erlang web server) and looks to be involved in eventlet http://eventlet.net/doc/history.html

Seems a reasonable assumption. Thanks.

It sounds like they're accepting really simple HTTP requests (event updates) and inserting a job in a queue.

Really simple + rarely changing + needs to scale to really high req/sec = perfect candidate for being written in C. Maybe as an nginx module?

This is only true if the queue isn't the bottleneck. If the queue can only handle 2,500 req/s and the Python program can send at 3,000 req/s, what use is it writing a C program that sends at 12,000 req/s?

1) Get a faster queue. 2) Create a QueueQueue that batch inserts.
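
To make option 2 concrete, here is a minimal sketch of a batching front for a queue, in Python. Everything in it is hypothetical: `sink` stands in for whatever bulk-insert call the real backend queue offers, and the batch size is arbitrary. It only pays off if the queue's limit is per-request overhead rather than total data volume.

```python
class BatchingQueue:
    """Buffer items in memory and hand them to a sink in batches.

    `sink` is any callable that accepts a list; in a real deployment it
    would wrap the backend queue's bulk-insert call (hypothetical here).
    Not thread-safe: intended for a single-threaded event loop.
    """

    def __init__(self, sink, batch_size=100):
        self._sink = sink
        self._batch_size = batch_size
        self._pending = []

    def put(self, item):
        """Buffer one item, flushing once a full batch has accumulated."""
        self._pending.append(item)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered as one batch; no-op when empty."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._sink(batch)
```

A real version would also flush on a timer so that a slow trickle of events does not sit in memory indefinitely, and it would need a story for what happens to buffered items if the process dies.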

The intern was told to rewrite the server for a first assignment? What were the other programmers doing? jquery?

Trying to decide whether P=NP, it looks like.

What was the original Erlang performance?

I mean, good enough is good enough, and local culture counts, no problem there, just curious.

This is exactly what I want to know. What were the numbers?

"The main difference is that eventlet can’t influence the Python runtime, but actors are built into Erlang at a language level, so the Erlang VM can do some cool stuff like mapping actors to kernel threads (one per core) and preemption. We get around this problem by launching one API server per core and load balancing with nginx."

The actor model is for concurrency, which is when your threads are communicating with one another, right? What about the task that the API server does requires inter-thread communication?

The author is wrong about simplejson performing 10x better than the json included with python.

Here is my proof: http://j2labs.tumblr.com/post/7305664569/python-vs-javascrip...

No, we ran an extensive benchmark against log data and found that simplejson was indeed 10x faster. Your benchmark assumes a different "shape" of json dictionary than ours, and I would recommend updating your methodology to use real data instead. I added ujson to our benchmark, and here are the results (seconds):

  $ python json_bench.py history.log.1
  json        106.270362854
  simplejson  11.336577177
  cjson       5.63336491585
  ujson       3.81600308418
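
The json_bench.py script itself isn't shown, so the following is only a guess at the general shape of such a benchmark, not the author's code. It times each parser over the same list of log lines; `simplejson` is optional and is only included if it happens to be installed. Results will depend heavily on the Python version and on the shape of the data, so no numbers are claimed here.

```python
import json
import time


def time_parser(loads, lines):
    """Return the seconds `loads` takes to parse every line once."""
    start = time.perf_counter()
    for line in lines:
        loads(line)
    return time.perf_counter() - start


def compare(lines):
    """Time the stdlib parser, plus simplejson if it is installed."""
    parsers = {"json": json.loads}
    try:
        import simplejson  # optional third-party package
        parsers["simplejson"] = simplejson.loads
    except ImportError:
        pass
    return {name: time_parser(loads, lines) for name, loads in parsers.items()}
```

Feeding it lines read from a real log file, rather than a synthetic dictionary, is the point the parent comment is making about methodology.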

There's not much about "why"; in fact, these are the only sentences that are at all relevant to "why":

After two years of iteration, the code has become difficult to maintain. No one on our team is an Erlang expert, and we have had trouble debugging downtime and performance problems.

Erlang is historically bad at string processing, and it turns out that string processing is very frequently the limiting factor in networked systems because you have to serialize data every time you want to transfer it. There’s not a lot of documentation online about mochijson’s performance, but switching to Python I knew that simplejson is written in C, and performs roughly 10x better than the default json library.

I was able to provide some important operations in constant time along with other optimizations that were cripplingly slow in the Erlang version.

The [Python] community is extremely active, so many of my questions were already answered on Stack Overflow and in eventlet’s documentation.

If string processing is a bottleneck in your system, either your system isn't doing anything else interesting to take up CPU time, or you've done something very, very wrong. Serialization is a damn-near solved problem.

Newcomers to Erlang tend to do string handling with a heavy Ruby/Java/whatever accent. That's the problem. The default Erlang string type is a linked list of ints (which can be pattern-matched on), but atoms (AKA "symbols" in Lisp, Ruby, etc.) and binaries (arrays of raw binary data) address situations that need more specific trade-offs.

In particular, redundant string concatenation and flattening tend to be a CPU hog, but IO-Lists automatically flatten all string types during transmission and have already been thoroughly optimized.

Of course, if you ignore the serialization solutions Erlang provides, there is a performance hit.

It's a web server that hands off to a queue. What do you think it's doing besides serialization?

Wow, this post made me feel like the world's most incompetent intern.

Don't beat yourself up.

I once met an intern who was very good at abstraction and at writing "OK"-designed OOP code (OK because it looked and sounded good minus the ability to unit-test, but other than that it was simple enough for other people to understand and quite flexible). On the flip side, he wasn't that good when it came to networking code (pretty much systems programming stuff). He could have been good, but at that time, software design (in an OOP environment) was his forte.

You might have your own pluses. Besides, we don't know what the code looks like or whether what this intern wrote is the truth. If you've been in this industry long enough you'll start to take a lot of things with a large grain of salt.

I'm pretty sure he must have received a lot of input from his senior peers so let's spare the kid ;)

It seems like the lesson here is that basically any language (Python, Ruby, ...) will perform about the same with non-blocking I/O.

Does this mean that Erlang and node.js are mostly compelling because of the prevalence of async versions of common libraries? Or are they not that compelling in web contexts in the first place?

A lot of the languages will probably perform similarly on non-blocking I/O because they are all leveraging epoll (or select or kqueue) underneath it all. There is great variation, however, in how the green threads are exposed. Node.js has callbacks, Python has yields, and Erlang has messages. Some of these approaches are easier to reason about and maintain than others.

I always found Haskell's take on parallelism interesting, and maybe it is faster. In Haskell you create a unit of work called a 'spark'. You can have billions of these, they get mapped to lightweight Haskell threads (powered by epoll) and those get mapped onto OS threads.

Erlang is compelling because it's been built from the ground up to support reliable distributed computing and heavily battle tested in incredibly high-volume applications. Non-blocking I/O is just the plumbing in a far more sophisticated machine.

Sophistication is often the ultimate enemy.

H. Thoreau once said "In proportion as he simplifies his life, the laws of the universe will appear less complex.." (Walden, Princeton University Press, 1971, p.323-324)

I'm a big believer in KISS. If you don't have the problems Erlang was designed to solve, Erlang is probably not a good choice. Rolling your own amateur version of Erlang on top of evented Python or Javascript is probably also not a good move though.

I think you are confusing sophistication and complexity. Part of the sophistication of Erlang is how it simplifies the complexities of concurrent programming and, more importantly, the handling of failures in concurrent programming.

Complexity is the enemy.

The most sophisticated solutions are often very simple, because they were written from a sophisticated perspective, not a naive perspective.

erlang has stood the test of time, and produced great results because, really, it is very simple.

It just comes from a sophisticated perspective.

I understand that this is meant just as an experience report, but I have to say this article didn't convince me in any way that this rewrite was a good idea. Obvious questions:

1. How does the performance of the new system compare to the old system?

2. What exactly were those maintenance issues with the Erlang server? Did just no-one in your team find the time to learn Erlang well enough? I know Erlang isn't the prettiest of languages, but async I/O isn't the only advantage of Erlang. A battle-tested concurrent runtime and built-in support for fault-tolerance are two obvious examples.

But, but, so pretty and elegant!

  %% Naive list-comprehension quicksort (builds new lists, not in-place).
  quick_sort([]) -> [];
  quick_sort([H | T]) ->
      quick_sort([X || X <- T, X < H]) ++ [H] ++ quick_sort([X || X <- T, X >= H]).

That's exactly the point we wanted to convey. Erlang and node are great, but we know Python really well and were able to write a performant server with the tools we're familiar with.

You failed. I blame title and tone.

After 2 years in production you have no one on the inside that knows the core part of your system? Duh! Start investing time in your core technology. Blaming Erlang for poor R&D management choices is not going to fly here.

Do they run the message queue on the same box as the gateway server in production? If not then the test he ran isn't a direct comparison (since network latency between the app server & queue isn't accounted for). Running both of those services on the same box isn't great either, since they could slow each other down, and you lose both if the box dies.

Still, very cool, congrats ankrgyl, it's awesome to be able to write stuff like that as an intern!

This sums it up:

    "No one on our team is an Erlang expert"

Regarding mochijson, we switched to jiffy [1] (a NIF-based native C parser).

Also, I would love to see a comparison between the two-year-old (probably badly written) Erlang server and the new Python/eventlet server.

[1] https://github.com/davisp/jiffy

related: a benchmark of mochiweb vs cowboy vs misultin (all Erlang) vs node.js vs Python Tornado:


Riak is a fairly large, open source, NoSQL database, written in erlang. I've looked at its source code on occasion knowing little about its internals, and found them to be really comprehensible. Sometimes it is shocking to see how elegant the code is.

At the same time, I have gone and looked at code I wrote back when I was first looking at erlang, that does much less and is much more verbose, confusing and sprawling.

I don't think erlang lacks maintainability. I think it just requires some discipline- like any language.

It sounds like your company has a culture of python hackers and erlang was chosen because you felt you needed to choose something "serious" for this bit of work, rather than because you loved erlang and would use erlang even if you needed to write something trivial. There's nothing wrong with that, but I don't see this article as revealing any hidden weaknesses in erlang.

Regarding the JSON parsing issue, erlang has excellent support for code written in other languages, specifically C, and you could wrap any C based JSON parser and use it, though I bet someone has already done this for you. I believed I was watching such a project on GitHub but can't for the life of me find it now.

Jiffy is awesome!
