
MixPanel tracking API down - mef
http://status.mixpanel.com/
======
suhail
Hi everyone,

We believe we have figured out the problem and have already gone ahead with a
solution. The issue, at this time, appears to be that the traffic we normally
get to our API has increased substantially. Whether it's legitimate or not is
unknown but we're unfortunately at the mercy of waiting for DNS to update.
It's also possible we may see bottlenecks further down our infrastructure path
but we're actively thinking through what's next - fortunately, we have more
control there.

We'll have a full transparent write up once things are back to normal.

If you're using our JavaScript libraries, your website should not be greatly
affected by our downtime other than the data that will be lost. We're deeply
sorry and disappointed ourselves.

Our support team is happy to talk to you at status.mixpanel.com to give you
updates as they get them.

Suhail

~~~
rsbrown
Suhail, with regard to impact on installed sites, I agree that lost event
data is not the end of the world. However, it's currently taking 20-30 seconds
for requests (loading the mixpanel js) to time out. During the outage, it
would be much less painful if the requests failed immediately.

I just sent this same message to support (at) mixpanel.com.

EDIT: just to reiterate -- the major problem isn't the API being down. It's
the extreme timeouts being experienced trying to load a static JS file.

~~~
stevenou
I completely agree. I woke up to some crazy long page rendering time on
NewRelic and couldn't figure out what was going on... until I saw the huge
backlog of mixpanel events in my db. Not only is it making page rendering
slow; even worse, it's tying up my background workers for 30 seconds at a
time, causing a huge backlog in ALL my background tasks.

~~~
ksowocki
Same here. I wish we had an MP widget that insulated us from mixpanel API
downtime.

------
mrchess
This is unfortunate. I was about to implement MixPanel on my startup's site
(it's already laced into our development code), but their API being down
completely locks our application ... shipping MixPanel in production is not
looking so hot right now.

EDIT: As others have voiced, my issue was that it was taking up to 30 seconds
for requests to time out, disrupting various things in my JS code (it is a
Backbone app) -- it didn't actually "lock" it. I was using the JS API.

~~~
patio11
As somebody who consumes a few of these things, and got a page about it prior
to it being on HN, this is something that a) will inevitably happen and b)
should not block the application. Think very, very carefully before you ever
block the request/response cycle on an external API. (I'd say "Never do it"
but I think I could come up with conceivable apps and APIs where that makes
sense if I was fully awake.) Since analytics callbacks don't generate
immediate customer value and can fail totally without discomfiting anyone,
you should never block a request for them.

Instead, you deal with them asynchronously. There are a few ways to do it
mechanically. I offload them to my job queue and set the priority to "lowest
possible." When the job queue is otherwise empty, a worker process (that no
actual human is waiting on) slurps up a few of the events and fires them off.
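The offload-to-a-low-priority-queue idea can be sketched in a few lines of Python. This is only an illustrative stand-in for a real job queue -- `track_event`, `drain`, and the priority value are all assumptions, not Mixpanel's API or patio11's actual code:

```python
import itertools
import queue

# Sketch of the idea above: analytics calls go onto a queue at the lowest
# priority instead of blocking the request/response cycle; a worker that
# no human is waiting on drains them later. All names are illustrative.
ANALYTICS_PRIORITY = 10            # lowest; interactive work would use 0
_seq = itertools.count()           # tie-breaker keeps equal priorities FIFO
jobs = queue.PriorityQueue()

def track_event(name, properties):
    """Called from a request handler -- returns immediately."""
    jobs.put((ANALYTICS_PRIORITY, next(_seq), (name, properties)))

def drain(send):
    """Worker loop body: ship queued events via `send` (e.g. an HTTP POST).
    Returns the events it shipped, in the order they were enqueued."""
    sent = []
    while not jobs.empty():
        _prio, _n, event = jobs.get_nowait()
        send(event)
        sent.append(event)
    return sent
```

The key property is that `track_event` never touches the network, so a Mixpanel outage can slow down only the worker, never the page.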

Only downside, which I never bothered fixing: I have an "OMG the queue is
stuffed to overflowing... the queue worker must be non-responsive!" monitoring
test which, since that symptom has featured in most of my customer-visible
downtime, generates a red alert. (i.e. Immediate SMS followed by phone call
escalation, as opposed to an "FYI check this" email.) Any downtime at Mixpanel
or KissMetrics lasting longer than a few minutes reliably triggers this alert.

On the plus side, this means that when I tell you "Mixpanel really doesn't
fail at 2 AM Japan time all that often" you should trust that I'd have
noticed.

~~~
iheartmemcache
I (and I'm sure other HNers) would love it if you could go into a little more
detail as to how your job queue is constructed.

~~~
patio11
Code-wise: Delayed::Job <https://github.com/collectiveidea/delayed_job>

Design-wise: Two queue worker processes independent of web server processes.
Delayed::Job lets you assign tasks a priority and lets you restrict queue
workers to the priority levels they're allowed to look at. By convention,
priority 0 (highest) is interactive tasks in my applications (a user is at
their keyboard waiting for an answer) and priority 10 (lowest) is, well,
Mixpanel.
Not because I don't love them, just because if Mixpanel blocks for an entire
month my rent still gets paid and that isn't true of any other priority level.

Queue worker A only works on priorities 0 through 9. Queue worker B works on 0
through 10. This ensures that even if Mixpanel (or Kissmetrics, also on 10)
perpetually times out the higher-priority levels will never be totally
blocked.
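The two-worker partition can be sketched in Python (Delayed::Job itself is Ruby; this only illustrates the scheduling rule, with made-up job names and priorities):

```python
# Two workers share one backlog, but worker A is capped at priority 9 while
# worker B may also take priority 10 (analytics). A stuck priority-10 job
# can therefore never starve interactive work. Names are illustrative.
def claim_job(backlog, max_priority):
    """Return the best job this worker may run, or None (0 = highest)."""
    eligible = [j for j in backlog if j["priority"] <= max_priority]
    return min(eligible, key=lambda j: j["priority"]) if eligible else None

backlog = [
    {"name": "mixpanel_event", "priority": 10},
    {"name": "send_receipt", "priority": 0},
]

worker_a = claim_job(backlog, max_priority=9)    # never sees analytics
worker_b = claim_job(backlog, max_priority=10)   # still prefers priority 0
```

Both workers pick the interactive job first; only worker B will ever pick up the analytics job once nothing else is waiting.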

Delayed::Job worker processes are monitored by god, which resets them if they
fail, become bloated, etc. They're also separately monitored by Scout
monitoring's DJ plugin, which fires a yellow alert if a job is ever older than
an hour (shouldn't happen but can in event of e.g. an API outage) or a red
alert if there's ever more than X jobs in the queue (high probability that
this represents a crash god can't recover from).

Because even that setup was letting through about one outage every six months
to a year, I have one other ace-in-the-hole: my ajax polling actions which
check whether particular jobs are complete will, if they fail Y times in a
row, a) instantiate a queue worker within the web server process (this
degrades request processing but can't totally break the site) and b) fire off
an independent-from-everything-else phone call to my cell phone saying "Your
first, second, and third line of defense have failed. Get ye to an SSH
terminal."
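The degrade-then-page fallback amounts to a failure counter around the polling action. A Python sketch of that control flow (the threshold and both callbacks are invented, not the real implementation):

```python
# If polls for job completion fail enough times in a row, run the queue
# inline in the web process (degraded but alive) and page a human.
# The threshold and callback names are illustrative.
FAILURE_THRESHOLD = 3

class Fallback:
    def __init__(self, run_inline_worker, page_human):
        self.failures = 0
        self.run_inline_worker = run_inline_worker
        self.page_human = page_human

    def poll_result(self, check):
        """check() returns True when the polled job has completed."""
        if check():
            self.failures = 0
            return True
        self.failures += 1
        if self.failures >= FAILURE_THRESHOLD:
            self.run_inline_worker()  # degrades requests, keeps site up
            self.page_human()         # every other line of defense failed
        return False
```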

------
jhuckestein
This is why I queue all my mixpanel requests on the server. I'll just re-run
them once it's back up.

~~~
le_isms
This is a great idea! Do you mind letting us know what your setup's like?

~~~
wulczer
We just issue Celery jobs to track events in Mixpanel, so if their API is
failing, the jobs fail and get retried later on.

If we get too many messages accumulated in the queue where these jobs go, we
just purge it (better lose that data than let RabbitMQ die because of memory
exhaustion). There'd have to be a whole damn lot of events there for us to
actually notice, though.

~~~
le_isms
Really neat! I might be a newbie to this, but how do you get those client-side
javascript function calls over to the server side? I'm assuming the client's
browser is running the Mixpanel Javascript and that your events are
server-side.

~~~
wulczer
Yep, like rprime said, we're sending tracking events from the server (let me
plug <http://libsaas.net> here).

So instead of using mixpanel.track from Javascript, you do an AJAX call to
your own server and schedule a Celery job there.

There's an issue with passing Mixpanel's super properties to your server-side
handler, but that's the general idea.
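A framework-agnostic Python sketch of that endpoint (all names are invented; in a real setup the enqueue step would be the Celery `delay()` call rather than a list append):

```python
# The browser POSTs the event to your own endpoint instead of calling
# mixpanel.track directly; the handler only enqueues -- the actual Mixpanel
# call happens later, off the request path. Everything here is illustrative.
pending_jobs = []

def handle_track(payload):
    """Endpoint body: validate and enqueue; never call Mixpanel inline."""
    event = payload.get("event")
    if not event:
        return {"status": 400, "body": "missing event name"}
    pending_jobs.append({"event": event,
                         "properties": payload.get("properties", {})})
    return {"status": 202, "body": "queued"}   # accepted, processed later
```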

~~~
le_isms
Gotcha, I'm going to have to try this, thanks for the insight!

