
How and Why Mixpanel Switched from Erlang to Python - ankrgyl
http://code.mixpanel.com/2011/08/05/how-and-why-we-switched-from-erlang-to-python/
======
mhd
I might get flamed/downvoted for this, but where's the content of this
article? Apart from some vague tropes ("Erlang is bad at string processing!"
"We need scalability, and that means async Python", "Pick the right Python
libraries!"), there's really nothing interesting about the implementation
details, especially considering such an attention-grabbing headline (to put it
nicely).

What _specifically_ was bad about the Erlang code? Isn't this just saying that
if nobody in your company really understands a language, don't use it?

This is more about the technical competency of a specific company than general
technical issues. Or, to put it bluntly, it's more "Mixpanel sucks at Erlang"
than "Erlang sucks". Don't get me wrong, I'd be really, really interested in a
good analysis of why in this case Erlang was the wrong choice, but this article
didn't even get close to anything _technically_ interesting (Even with the
ubiquitous requests/s graph).

~~~
buro9
The thing I learned from the article is that they return 0 for failure and 1
for success. And then, instead of fixing this meaningfully, they added a bit
of sticky plaster that enables you to ask for verbose info when you get a 0.

I can't help but think when I spot such code smells that perhaps the issue
runs deeper. So when I then read about Erlang vs Python I also can't but think
that I bet it's not the language choice that is the problem.

I don't think I fully understood that I judge code at first sight like that,
but I clearly do.

~~~
noelwelsh
Yes, that's a very odd thing to do. More so when you have a better model --
HTTP response codes -- right in front of you!

~~~
buro9
HTTP response codes don't cover all errors though. They're good at describing
problems with HTTP communication, but aren't adequate for describing non-HTTP
application errors.

Additional error indicators are fine (much better than overloading the meaning
of HTTP codes), but they should be sane, meaningful, consistent with developer
expectations, etc. It's that part where the implementation is weird.

I looked at their site and saw they're hiring. Being a problem solver I tend
to think "I can fix your problems, let me help", so I looked to see if they
need an architect or chief scientist or something... but when the architect
position ( [http://mixpanel.theresumator.com/apply/Eoh3qJ/Solutions-
Arch...](http://mixpanel.theresumator.com/apply/Eoh3qJ/Solutions-
Architect.html) ) has the minimum experience of "Student (College)"... I just
face palm and am not surprised by any of this - I bet their definition of
solutions architect is really just sales support.

I think they just need to raise the calibre of their backend, the signals here
(this article, the error codes, the job advert) are not things that inspire
confidence in the product. As the main thing that they do is crunching data
I'm taken aback that even their backend engineer position advert (
[http://mixpanel.theresumator.com/apply/CiOzuu/Software-
Engin...](http://mixpanel.theresumator.com/apply/CiOzuu/Software-Engineer-
Backend.html) ) doesn't even have the word algorithm in it, but maybe they
have a wizard who does that for them.

Am I being harsh? Probably... I don't know the guys there and I may just have
a very narrow reading of too little info to make such assumptions, so to be a
bit balanced (I'm British, it's what we do), they clearly "get" reporting
needs.

Their interfaces show that they understand the kind of thing people want
to see. I was impressed by those, I've done reporting a lot, and so many times
you see reports that forget that the reader is trying to understand something.
So it's refreshing to see reports that appear to remember that.

~~~
StavrosK
I usually return Bad Request with an explanation/verbose error in the body,
for errors which don't fall under the other codes. It works pretty well.

~~~
mark_story
Doesn't Bad Request usually mean malformed HTTP request data? I find myself
using Forbidden when users send me bogus/invalid data, as it's a more generic
'No', and a body with more information never hurts.

~~~
StavrosK
It does, but I consider it more of an indicator that the request (including
its content) was somehow invalid. I reserve Forbidden for cases where the
client needs to authenticate.

EDIT: Actually, I see that the spec says that authentication should do nothing
to fix a Forbidden error, so you might be right. Unauthorized is for
authentication.

Perhaps this is what buro9 means about overloading error codes, but the rest
map so cleanly onto the errors that it's hard to ignore them. You can't very
well return 200 and say "There was error X" in the response.

Maybe we should use 418?

~~~
buro9
If in doubt, I'm a Teapot serves well ;)

What's the HTTP error code for "ambiguous input, please select from one of the
following unambiguous suggestions"?

Just returning an error isn't helpful, you want to help them resolve it. So
you want to return something, and this isn't success (in an application sense)
as you didn't do what they asked of you, you're doing something else. But it
_IS_ success in HTTP terms, the request made it to the server, got processed
fine, and came back fine.

Application codes != HTTP codes.

~~~
StavrosK
It's debatable whether the application is entirely separate from the RPC
layer. For example, when writing a RESTful API, would you really want the
server to respond with 200 for everything, even for "Resource not found"
errors?

In web applications, HTTP codes and application codes are quite integrated. Of
course, you do need some extra error info, as HTTP error codes can't possibly
cover every error in your app, but 422 with extra info as a catch-all sounds
like a reasonable compromise.
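
A minimal sketch of the compromise discussed in this subthread: the HTTP status carries the coarse category, and a JSON body carries the application-specific detail. The error names, the body fields and the `error_response` helper are all hypothetical, not Mixpanel's API.

```python
import json

# Hypothetical application error kinds mapped onto HTTP status codes.
STATUS_FOR_ERROR = {
    "not_found": 404,        # Not Found
    "unauthenticated": 401,  # Unauthorized
    "invalid_data": 422,     # Unprocessable Entity
}

# Anything without a closer match falls back to 422 plus a verbose body.
CATCH_ALL_STATUS = 422


def error_response(kind, detail):
    """Return (status_code, json_body) for an application error."""
    status = STATUS_FOR_ERROR.get(kind, CATCH_ALL_STATUS)
    body = json.dumps({"error": kind, "detail": detail})
    return status, body
```

With this shape, buro9's "ambiguous input" case would come back as 422, with the unambiguous suggestions listed in `detail`.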

------
trefn
Pretty harsh response to a post from an intern, dudes and dudettes.

Yeah, the erlang server code was pretty gross. It was one of the very first
things written when we started Mixpanel over 2 years ago, and it's only been
updated a few times since then.

I feel like the big thing you guys are missing is how little _time_ we have.
It's not like we don't know when we have bad code, or we don't realize that we
made mistakes in the initial server design - we just have a million things on
our collective plate. Fixing a very simple server - (accept request, validate
json, put on queue) that is doing its job okay hasn't been a high priority
thing.
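
For readers unfamiliar with the shape of such a server, here is a rough Python sketch of the three steps named above (accept request, validate json, put on queue). It is not Mixpanel's code: `handle_request` is a hypothetical name, the in-process `queue.Queue` stands in for whatever backend queue is really used, and the 0/1 return value simply mirrors the response convention discussed earlier in the thread.

```python
import json
import queue

# In-process stand-in for the real backend queue (an assumption).
events = queue.Queue()


def handle_request(raw_body):
    """Validate a JSON payload and enqueue it; return 1 on success, 0 on failure."""
    try:
        event = json.loads(raw_body)
    except ValueError:
        return 0  # malformed JSON
    if not isinstance(event, dict):
        return 0  # this sketch only accepts JSON objects
    events.put(event)
    return 1
```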

When this code was written, Mixpanel had zero customers and we weren't sure
what we were building yet. In that regard, Erlang has been a rock. We've
barely had to touch it during the rampup from 0 to thousands of requests per
second.

Now that we have the manpower, and we've learned what we really need, we can
rewrite it to make things easier on ourselves. If we can get acceptable
performance in python, there is no reason to use erlang.

I think there's some merit to the other complaints (error codes, etc), but
that's another symptom of this thing being written so long ago. We want to
improve things incrementally (and backwards-compatibly) for now, but it will
be dramatically simpler for us to make changes to the server now that it's
written in python.

Ultimately, we have skeletons in the closet, just like the rest of you - I'm
sure _all_ of you have some bad code in production somewhere. Now we're saying
"Look, we're getting rid of our skeletons!" and you guys are like "OMG WHY YOU
HAVE SKELETONS" instead of "sweet, no more skeletons".

~~~
rvirding
Seriously, it would be interesting to see your code. As an Erlang
inventor/developer it would be interesting to see how the language is actually
used and how that relates to the problems people have.

I agree that not having Erlang competence in your company IS a good reason to
change language.

~~~
Ixiaus
I agree with this. It would be interesting to see how Erlang was used - I find
engineers that have "issues" with their erlang programs aren't actually using
OTP to its fullest (using behaviors and supervisors, packaging as an
application, etc...).

------
rednum

      Finally, we use a few stateful, global data structures to
      track incoming requests and funnel them off to the right 
      backend queues. In Erlang, the right way to do this is to 
      spawn off a separate set of actors to manage each data 
      structure and message pass with them to save and retrieve 
      data. Our code was not set up this way at all, and it was 
      clearly crippled by being haphazardly implemented in a 
      functional style.
    

Seriously?! I have used only a little erlang, but this makes no sense to me -
it's like you were writing some big java project and put everything in one
huge class, with all methods and variables static. It's hard for me to imagine
why and how someone would write a production erlang app with no actors,
especially some kind of server. No wonder the thing sucked in the first place.

------
tzs
From the article:

    
    
       Because of these performance requirements, we originally wrote the
       server in Erlang (with MochiWeb) two years ago. After two years of
       iteration, the code has become difficult to maintain.  No one on
       our team is an Erlang expert, and we have had trouble debugging
       downtime and performance problems. So, we decided to rewrite it
       in Python, the de-facto language at Mixpanel.
    

My first impulse would have been to have one or more team members become
Erlang experts. Was that considered?

~~~
plinkplonk
"After two years of iteration (on an Erlang codebase) ...no one on our team is
an Erlang expert,"

This sounds a little strange. How is this possible? High turnover on the team?

~~~
codexon
Or why not point out the obvious? Erlang is hard to learn.

Anyone claiming to have mastered it in a couple of months, like most fanatics
here, is (pardon my language) full of crap.

~~~
zwischenzug
I've seen people master it quickly, writing their own behaviours etc..

I learned it on the tube to and from work in a relatively short space of time,
writing a comet server for streaming updates to browsers with a colleague.
After the initial hump it was easy and fun. I don't claim to be a genius
programmer, just an interested dabbler.

To be fair though, the more average developer does struggle with it, and that
was a reason my company didn't take it up widely.

~~~
mahmud
Took me a week to be comfortable with it, but would take at least 2 years for
me to grok the performance implications of each construct.

------
mononcqc
“Finally, we use a few stateful, global data structures to track incoming
requests and funnel them off to the right backend queues. In Erlang, the right
way to do this is to spawn off a separate set of actors to manage each data
structure and message pass with them to save and retrieve data.”

Nope, that’s not the right way. The way you were doing it ended up making all
calls sequential and bound to single processes that could lose state. That’s
not right.

The best way to do it would have been to use ETS tables (which can be
optimized either for parallel reads or writes), which also allows destructive
updates, in order to have the best performance and memory usage possible. Note
that you could then have had a memory-only Mnesia table (adding transactions,
sharding and distribution on top of ETS) to do it.

As for string performance, I’m wondering if you used lists-as-strings, binary
strings or io-lists to do your things. This can have significant impact in
performance and memory use.

Then again, if you had a bunch of Python and no Erlang experts, I can’t really
say anything truly convincing against a language switch. Go for what your team
feels good with.

------
breck
> The biggest challenge for me was pushing the server from working 99.9% of
> the time to 99.99% of the time, because those last few bugs were especially
> hard to find.

Could you expand upon this some more? How do you know the server works 99.99%
of the time (or 99.9%)? Do you run regression tests using actual past
requests?
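
To make the two figures concrete, a small sketch of how a replay-based success rate could be computed. It assumes each replayed request yields a simple pass/fail flag; both function names are hypothetical.

```python
def success_rate(results):
    """Fraction of replayed requests that produced the expected response."""
    results = list(results)
    if not results:
        raise ValueError("no requests replayed")
    return sum(1 for ok in results if ok) / len(results)


def failures_per_million(rate):
    """Expected number of failed requests per million at a given success rate."""
    return round((1 - rate) * 1_000_000)
```

Going from 99.9% to 99.99% means going from about 1000 failed requests per million to about 100, so the remaining bugs show up ten times less often in any test run.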

------
sayrer
:)

Bob's software sucks, let's switch to Bob's software.

~~~
spooneybarger
?

~~~
daleharvey
I don't know if it's what he is referring to, but it would be an awesome
coincidence if it wasn't

Bob Ippolito wrote mochiweb (the erlang web server) and looks to be involved
in eventlet <http://eventlet.net/doc/history.html>

~~~
roder
[https://bitbucket.org/which_linden/eventlet/src/f30a2fa65f30...](https://bitbucket.org/which_linden/eventlet/src/f30a2fa65f30/AUTHORS)

------
staunch
It sounds like they're accepting really simple HTTP requests (event updates)
and inserting a job in a queue.

Really simple + rarely changing + needs to scale to really high req/sec =
perfect candidate for being written in C. Maybe as an nginx module?

~~~
megaman821
This is only true if the queue isn't the bottleneck. If the queue can only
handle 2,500 req/s and the Python program can send at 3,000 req/s, what use is
it writing a C program that sends at 12,000 req/s?

~~~
staunch
1) Get a faster queue. 2) Create a QueueQueue that batch inserts.
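
A minimal sketch of the second suggestion, assuming the backend queue accepts a whole list of items in one call. `BatchingQueue` and `insert_batch` are hypothetical names, not part of any real queue library.

```python
class BatchingQueue:
    """Buffer items and hand them to the backend queue in batches."""

    def __init__(self, insert_batch, batch_size=100):
        self._insert_batch = insert_batch  # callable taking a list of items
        self._batch_size = batch_size
        self._buffer = []

    def put(self, item):
        """Add one item, flushing once the buffer reaches batch_size."""
        self._buffer.append(item)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered as a single batch."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._insert_batch(batch)
```

A real version would probably also flush on a timer, so that a slow trickle of events does not sit in the buffer indefinitely.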

------
mahmud
The intern was told to rewrite the server for a first assignment? What were
the other programmers doing? jquery?

~~~
super-serial
Challenge accepted? <http://memegenerator.net/instance/9240617>

------
jerf
What was the original Erlang performance?

I mean, good enough is good enough, and local culture counts, no problem
there, just curious.

~~~
theclay
This is exactly what I want to know. What were the numbers?

------
tigerthink
"The main difference is that eventlet can’t influence the Python runtime, but
actors are built into Erlang at a language level, so the Erlang VM can do some
cool stuff like mapping actors to kernel threads (one per core) and
preemption. We get around this problem by launching one API server per core
and load balancing with nginx."

The actor model is for _concurrency_, which is when your threads are
communicating with one another, right? What about the task that the API server
does requires inter-thread communication?

------
j2labs
The author is wrong about simplejson performing 10x better than the json
included with python.

Here is my proof: [http://j2labs.tumblr.com/post/7305664569/python-vs-
javascrip...](http://j2labs.tumblr.com/post/7305664569/python-vs-javascript-
the-json-race)

~~~
ankrgyl
No, we ran an extensive benchmark against log data and found that simplejson
was indeed 10x faster. Your benchmark assumes a different "shape" of json
dictionary than ours, and I would recommend updating your methodology to use
real data instead. I added ujson to our benchmark, and here are the results
(seconds):

    $ python json_bench.py history.log.1
    json        106.270362854
    simplejson  11.336577177
    cjson       5.63336491585
    ujson       3.81600308418
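
The `json_bench.py` script itself is not shown in the thread. As a rough sketch of how such a comparison could be timed, assuming one JSON document per log line: `bench` and `run` are hypothetical names, and simplejson is only measured if it happens to be installed.

```python
import json
import time


def bench(loads, lines, repeat=1):
    """Return the seconds taken to parse every line `repeat` times with `loads`."""
    start = time.perf_counter()
    for _ in range(repeat):
        for line in lines:
            loads(line)
    return time.perf_counter() - start


def run(lines):
    """Time the stdlib parser, plus simplejson when it is available."""
    parsers = {"json": json.loads}
    try:
        import simplejson
        parsers["simplejson"] = simplejson.loads
    except ImportError:
        pass  # optional third-party dependency
    return {name: bench(loads, lines) for name, loads in parsers.items()}
```

As the exchange above suggests, the result depends heavily on the shape of the documents, so `lines` should come from real log data rather than synthetic dictionaries.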

------
martincmartin
There's not much about "why," in fact, these are the only sentences that are
at all relevant to "why:"

 _After two years of iteration, the code has become difficult to maintain. No
one on our team is an Erlang expert, and we have had trouble debugging
downtime and performance problems._

 _Erlang is historically bad at string processing, and it turns out that
string processing is very frequently the limiting factor in networked systems
because you have to serialize data every time you want to transfer it. There’s
not a lot of documentation online about mochijson’s performance, but switching
to Python I knew that simplejson is written in C, and performs roughly 10x
better than the default json library._

 _I was able to provide some important operations in constant time along with
other optimizations that were cripplingly slow in the Erlang version._

 _The [Python] community is extremely active, so many of my questions were
already answered on Stack Overflow and in eventlet’s documentation._

~~~
carbonica
If string processing is a bottleneck in your system, either your system isn't
doing anything else interesting to take up CPU time, or you've done something
very, very wrong. Serialization is a damn-near solved problem.

~~~
silentbicycle
Newcomers to Erlang tend to do string handling with a heavy Ruby/Java/whatever
accent. That's the problem. The default Erlang string type is a linked list of
ints (which can be pattern-matched on), but atoms (AKA "symbols" in Lisp,
Ruby, etc.) and binaries (arrays of raw binary data) address situations that
need more specific trade-offs.

In particular, redundant string concatenation and flattening tends to be a CPU
hog, but IO-Lists automatically flatten all string types during transmission
and have already been thoroughly optimized.

Of course, if you ignore the serialization solutions Erlang provides, there is
a performance hit.
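
For readers who don't know Erlang, a rough Python analogy of the IO-list idea: keep output as a nested list of fragments and only walk it at write time, instead of concatenating into one big string first. This is only an illustration, not how the Erlang runtime implements it. `write_iolist` is a hypothetical helper, and it handles byte fragments only.

```python
def write_iolist(data, write):
    """Write a nested list of byte fragments without joining them first."""
    stack = [data]
    while stack:
        item = stack.pop()
        if isinstance(item, (bytes, bytearray)):
            write(item)
        else:
            # Push children in reverse so they come off the stack in order.
            stack.extend(reversed(item))
```

Building `[b'{"a":', [b'1', b','], b'"b":', [[b'2']], b'}']` costs only list appends, and the fragments reach `write` in order without an intermediate copy.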

------
mattdeboard
Wow, this post made me feel like the world's most incompetent intern.

~~~
hello_moto
Don't beat yourself up.

I once met an intern who is very good at abstraction and writing
"OK"-designed OOP code (OK because it looks and sounds good minus the ability
to unit-test, but other than that it was simple enough for other people to
understand and quite flexible). On the flip side, he's not that good when it
comes to networking code (pretty much system programming stuff). He could be
good, but at that time, software design (in OOP environment) was his forte.

You might have your own pluses. Besides, we don't know what the code looks
like or whether what this intern wrote is the truth. If you've been in this
industry long enough you'll start to take a lot of things with a large grain
of salt.

------
socratic
It seems like the lesson here is that basically any language (Python, Ruby,
...) will perform about the same with non-blocking I/O.

Does this mean that Erlang and node.js are mostly compelling because of the
prevalence of async versions of common libraries? Or are they not that
compelling in web contexts in the first place?

~~~
monopede
I understand that this is meant just as an experience report, but I have to
say this article didn't convince me in any way that this rewrite was a good
idea. Obvious questions:

1\. How does the performance of the new system compare to the old system?

2\. What exactly were those maintenance issues with the Erlang server? Did
just no-one in your team find the time to learn Erlang well enough? I know
Erlang isn't the prettiest of languages, but async I/O isn't the only
advantage of Erlang. A battle-tested concurrent runtime and built-in support
for fault-tolerance are two obvious examples.

~~~
MetaCosm
But, but, so pretty and elegant!

    
    
      quick_sort([]) ->
          [];
      quick_sort([H | T]) ->
          quick_sort([X || X <- T, X < H]) ++ [H] ++ quick_sort([X || X <- T, X >= H]).

------
Vitaly
After 2 years in production you have no one on the inside that knows the core
part of your system? Duh! Start investing time in your core technology.
Blaming Erlang for poor R&D management choices is not going to fly here.

------
gnubardt
Do they run the message queue on the same box as the gateway server in
production? If not then the test he ran isn't a direct comparison (since
network latency between the app server & queue isn't accounted for). Running
both of those services on the same box isn't great either, since they could
slow each other down, and you lose both if the box dies.

Still, very cool, congrats ankrgyl, it's awesome to be able to write stuff
like that as an intern!

------
nivertech
This sums it up:

    
    
        "No one on our team is an Erlang expert"
    

Regarding mochijson we switched to jiffy [1] (NIF-based native C parser).

Also I would love to get a comparison between 2-years old (probably badly
written) Erlang server and a new Python/eventlet server.

[1] <https://github.com/davisp/jiffy>

------
nikcub
related: a benchmark of mochiweb vs cowboy vs misultin (all Erlang) vs node.js
vs Python Tornado:

[http://www.ostinelli.net/a-comparison-between-misultin-
mochi...](http://www.ostinelli.net/a-comparison-between-misultin-mochiweb-
cowboy-nodejs-and-tornadoweb/)

------
nirvana
Riak is a fairly large, open source, NoSQL database, written in erlang. I've
looked at its source code on occasion knowing little about its internals, and
found them to be really comprehensible. Sometimes it is shocking to see how
elegant the code is.

At the same time, I have gone and looked at code I wrote back when I was first
looking at erlang, that does much less and is much more verbose, confusing and
sprawling.

I don't think erlang lacks maintainability. I think it just requires some
discipline- like any language.

It sounds like your company has a culture of python hackers and erlang was
chosen because you felt you needed to choose something "serious" for this bit
of work, rather than because you loved erlang and would use erlang even if you
needed to write something trivial. There's nothing wrong with that, but I
don't see this article as revealing any hidden weaknesses in erlang.

Regarding the JSON parsing issue, erlang has excellent support for code
written in other languages, specifically C, and you could wrap any C based
JSON parser and use it, though I bet someone has already done this for you. I
believed I was watching such a project on GitHub but can't for the life of me
find it now.

~~~
MetaCosm
Jiffy is awesome!

