
Safely Rewriting Mixpanel’s Highest-Throughput Service - dcu
https://engineering.mixpanel.com/2019/07/24/safely-rewriting-mixpanels-highest-throughput-service-in-golang/
======
ZeroCool2u
The author mentions finding the GCP Python PubSub API unreliable. That's
definitely not been my experience.

Could someone from Mixpanel elaborate maybe?

~~~
evpa1g
Hey, I'm the author. That migration was a while ago, so I had to look back at
what exactly we did. The gist is that our old Python servers used a custom fork
of eventlet ([https://eventlet.net/](https://eventlet.net/)), which didn't work
well with the Cloud Pub/Sub library. The core library itself is not unstable,
but when we introduced it into the existing code it became unreliable.

~~~
ZeroCool2u
Thanks for clarifying! Great write-up!

------
ma2rten
I haven't followed Mixpanel's development. I was surprised that a company
which went as far as writing its own time series database apparently decided
to go all in on Google Cloud.

~~~
i0exception
Disclaimer: Not the author, but I was on the team that migrated our
infrastructure to GCP.

As a startup with limited resources, it's important for us to invest all our
engineering strength into the things that create direct value for our
business. We'd rather pay Google to manage machines and run services like
Kubernetes, Spanner, Pub/Sub and others and free up the engineers to work on
our core analytics platform.

~~~
paladinxx
I think the parent may have meant the opposite of how you interpreted it:
that writing a TSDB from scratch doesn't match what you just stated about
investing engineering time where it makes the most sense.

~~~
i0exception
We don't run a TSDB. A TSDB doesn't work for the kinds of queries we run.
Specifically, TSDBs don't work if:

  * you want to analyze every datapoint you receive
  * the dimensional cardinality is high
  * you want to analyze behaviors over time (e.g. the output depends on the order in which events occurred, like creating a funnel report)

There's no off-the-shelf solution that does this at the scale at which we
operate - hence the need to write our own custom solution.
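
To illustrate why a funnel report depends on event order, here is a naive Go sketch. It is not Mixpanel's implementation; the `event` type, the `funnel` function, and the sample data are all hypothetical, and it ignores scale entirely. It only shows that the same set of events yields different funnel counts depending on the order in which each user performed them.

```go
package main

import (
	"fmt"
	"sort"
)

// event is a hypothetical record of one user action.
type event struct {
	User string
	Name string
	Time int64 // seconds since some epoch; only the ordering matters here
}

// funnel returns, for each step, how many users reached that step after
// completing all earlier steps in order.
func funnel(events []event, steps []string) []int {
	// Work on a copy sorted by time so the caller's slice is untouched.
	sorted := append([]event(nil), events...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Time < sorted[j].Time
	})

	progress := map[string]int{} // user -> index of the next step to match
	counts := make([]int, len(steps))
	for _, e := range sorted {
		next := progress[e.User]
		if next < len(steps) && e.Name == steps[next] {
			counts[next]++
			progress[e.User] = next + 1
		}
	}
	return counts
}

func main() {
	events := []event{
		{"alice", "signup", 1}, {"alice", "view", 2}, {"alice", "purchase", 3},
		// bob has the same three events, but the purchase comes first.
		{"bob", "purchase", 1}, {"bob", "signup", 2}, {"bob", "view", 3},
	}
	fmt.Println(funnel(events, []string{"signup", "view", "purchase"})) // prints [2 2 1]
}
```

Both users have identical event names, yet only one completes the funnel. A per-metric aggregate that discards ordering cannot tell the two apart.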

------
rwilson4
Very cool! Any stats you can share about the performance increase from
switching to Go?

~~~
mda
I would not be surprised if moving from Python to anything like Java, C#, or
Go ended up being at least 10x faster.

~~~
weberc2
It depends on how much of the “Python” implementation was actually C.

------
User23
I faced a similar challenge. The technique I used was to carefully read the
code of the program to be rewritten, and then I wrote a spec for it. The
format of the spec was that everything that was feasibly testable was written
as a unit test, and everything else was written as comments between the tests
and the text explaining them. Naturally the original code satisfied the test
suite/specification.

Then I used that test/spec to do a TDD type development of the new service. It
was the easiest rollout I've ever done. Everything just worked when it went
into production. I even ended up giving some internal presentations on the
process.

I also tested with logged input from the source program. It's neat to see this
technique is common.
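
A minimal Go sketch of the "replay logged input" idea, under stated assumptions: `handler`, `replay`, `legacyNormalize`, and `rewrittenNormalize` are hypothetical names, and the two normalizers are toy stand-ins for an old and a new implementation. The point is only the shape of the check: feed the same recorded inputs to both versions and collect every disagreement.

```go
package main

import (
	"fmt"
	"strings"
)

// handler is a hypothetical stand-in for one request-handling function.
type handler func(input string) (string, error)

// mismatch records one logged input on which the two versions disagree.
type mismatch struct {
	Input, Old, New string
}

// replay runs every logged input through both implementations and returns
// the inputs whose output or error status differs.
func replay(inputs []string, oldImpl, newImpl handler) []mismatch {
	var diffs []mismatch
	for _, in := range inputs {
		oldOut, oldErr := oldImpl(in)
		newOut, newErr := newImpl(in)
		if oldOut != newOut || (oldErr == nil) != (newErr == nil) {
			diffs = append(diffs, mismatch{Input: in, Old: oldOut, New: newOut})
		}
	}
	return diffs
}

// legacyNormalize plays the role of the original service's behavior.
func legacyNormalize(in string) (string, error) {
	return strings.ToUpper(strings.TrimSpace(in)), nil
}

// rewrittenNormalize plays the role of the rewrite; it forgets to trim.
func rewrittenNormalize(in string) (string, error) {
	return strings.ToUpper(in), nil
}

func main() {
	logged := []string{"click", " signup "}
	for _, m := range replay(logged, legacyNormalize, rewrittenNormalize) {
		fmt.Printf("input %q: old=%q new=%q\n", m.Input, m.Old, m.New)
	}
}
```

In practice the logged inputs would come from captured production traffic rather than a literal slice, and the comparison might need to tolerate fields that legitimately differ, such as timestamps.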

------
Exuma
Wow cool article. I learned about Envoy which looks really awesome.

Just out of curiosity, what were some of the bugs you found? Were they related
to semantics of Python not carrying over to Go? Or was it that you tried using
new Go features like goroutines and they didn't work as expected?

~~~
evpa1g
The bugs were typically related to general correctness or translating Python
semantics. The correctness-type bugs were due to the complex nature of the
API. The Python translation bugs were just about getting the correct behavior
of statements like

    if val:  # ...

in Go.
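
As a rough illustration of why that translation is easy to get wrong: Python's `if val:` is false for `None`, `0`, `0.0`, `""`, and empty containers, while Go has no implicit truthiness and every one of those checks has to be spelled out. The `truthy` helper below is hypothetical (it is not from the article) and only covers the types that appear when JSON-like data is held in `interface{}` values.

```go
package main

import "fmt"

// truthy approximates Python's `if val:` for a handful of dynamic types.
func truthy(val interface{}) bool {
	switch v := val.(type) {
	case nil:
		return false // Python: None
	case bool:
		return v
	case int:
		return v != 0
	case float64:
		return v != 0
	case string:
		return v != "" // note: "0" and "false" are truthy in Python
	case []interface{}:
		return len(v) > 0
	case map[string]interface{}:
		return len(v) > 0
	default:
		// Assumption: any other type counts as truthy, as most Python
		// objects do unless they define __bool__ or __len__.
		return true
	}
}

func main() {
	fmt.Println(truthy(nil), truthy(0), truthy(""), truthy("0"), truthy([]interface{}{}))
	// prints: false false false true false
}
```

A direct port that writes `if val != nil` or `if val != ""` captures only one of these cases, which is the kind of subtle behavior change a rewrite can introduce.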

------
welder
Did p95 latency change due to the migration?

~~~
evpa1g
Yes! You can see the p99 results here (we don't have an aggregated p95
measurement, so used p99):
[https://imgur.com/a/oQRbyBF](https://imgur.com/a/oQRbyBF)

Both max and avg p99 latency became much more stable. Max appears to have gone
down a little too.

------
marcrosoft
s/golang/go

