Hacker News
MixPanel tracking API down (mixpanel.com)
48 points by mef on July 31, 2012 | 38 comments

Hi everyone,

We believe we have figured out the problem and have already deployed a solution. The issue, at this time, appears to be that the traffic we normally get to our API has increased substantially. Whether it's legitimate or not is unknown, but we're unfortunately at the mercy of waiting for DNS to propagate. It's also possible we'll see bottlenecks further down our infrastructure path, but we're actively thinking through what's next; fortunately, we have more control there.

We'll have a full, transparent write-up once things are back to normal.

If you're using our JavaScript libraries, your website should not be greatly affected by our downtime other than the data that will be lost. We're deeply sorry and disappointed ourselves.

Our support team is happy to talk to you at status.mixpanel.com and will give you updates as we get them.


Suhail, with regard to impact on installed sites, I agree that lost event data is not the end of the world. However, it's currently taking 20-30 seconds for requests (inserting the mixpanel js) to time out. During the outage, it would be much less painful if the requests failed immediately.

I just sent this same message to support (at) mixpanel.com.

EDIT: just to reiterate -- the major problem isn't the API being down. It's the extreme timeouts experienced while trying to load a static JS file.

I completely agree. I woke up to some crazy long page rendering time on NewRelic and couldn't figure out what was going on... until I saw the huge backlog of mixpanel events on my db. Not only is it making page rendering slow, even worse, it's tying up my background workers for 30 seconds at a time, causing a huge backlog on ALL my background tasks.

Same here. I wish we had an MP widget that insulated us from downtime from the mixpanel API.

That might also be a factor in the apparent request load: if stale, hung requests are ballooning the numbers, or tying up resources, the problem looks larger than it is. A bunch of elephants standing around blocking things is quite a different problem from a bunch of pigs running down the road.

I do use the JS api's and it was affecting my website. I was seeing a high number of JS errors and even worse, my users were seeing odd behavior because I was depending on your /track callbacks to fire and they weren't.

This is the second time you guys have had a major outage in recent memory, and your status page looks like a Christmas tree. I really do love mixpanel, but unfortunately I'm going to relegate your service in my mind to 'untrustworthy third-party service' every time I write code against your API.

Also, the fact that I found a pretty major bug in your loader code last week which prevents it from working correctly in IE (https://mixpanel.com/docs/integration-libraries/javascript), and that today you are still distributing the broken code, has me worried. Do you not take this stuff seriously?

Could you elaborate on the loader code bug?

Pretty-print their loader JS, which is quoted at the URL above. I like letting Chrome's Developer Tools do it for me: load up that JS, go to the Sources tab, and click the {} near the bottom of the screen to auto-format it.

The issue is that you can't insert b (the <script> tag) into the DOM until you've declared window.mixpanel because the code within mixpanel.2.js depends on window.mixpanel to be defined.

It is a subtle race condition that only seems to happen with IE, but when it would fail, it would cause one of those ugly IE dialogs to popup to the end user. Not good.

The correct code should look like this near the end... just move the insertBefore to be after the window.mixpanel:

    a.__SV = 1.1;
    window.mixpanel = a;
    d.parentNode.insertBefore(b, d);

"If you're using our JavaScript libraries, your website should not be greatly affected by our downtime other than the data that will be lost."

I wish that were true. But since the library doesn't handle service downtime at all gracefully, we're stuck with the callbacks not firing -- something we depended on.

Just an idea - but when your event collectors start to become unresponsive, why not redirect your load balancer to a SnowPlow-style "no moving parts" collector, e.g. CloudFront + S3? You would have to merge the S3 log data into your main data store after the outage was over, but at least there would be no data loss (at least, not for events logged by GETs - POSTs are not supported by CloudFront), and webpages wouldn't be hanging.

The SnowPlow architecture diagram is here: https://github.com/snowplow/snowplow/wiki/Technical-architec...
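The "merge afterwards" step could be a simple log replay. Here's a hedged sketch of recovering tracking GETs from CloudFront/S3-style access logs once the outage is over; the whitespace-separated log format and field positions are assumptions for illustration, not Mixpanel's or SnowPlow's actual formats:

```python
from urllib.parse import urlparse, parse_qs

def events_from_log_lines(lines):
    """Recover tracking events from access-log lines so they can be
    replayed into the main data store after an outage."""
    events = []
    for line in lines:
        # assume whitespace-separated fields with the request URI third
        fields = line.split()
        if len(fields) < 3:
            continue
        qs = parse_qs(urlparse(fields[2]).query)
        if "event" in qs:
            events.append({
                "event": qs["event"][0],
                "properties": {k: v[0] for k, v in qs.items()
                               if k != "event"},
            })
    return events
```

Since CloudFront only logs GETs, anything sent via POST during the outage would still be lost, as the comment above notes.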

Nice response, and glad to see a good one. We know how it goes, but I agree with timeout length being an issue for us as well. It would almost be better if we were 'errored out' so our app performance wouldn't lag.

Might I suggest something like Chaos Monkey? Real-time APIs are highly vulnerable to DDoS and other traffic conditions because they'll so readily accept new connections; might be something worth throwing on the crew regular-like. :)

In case you're not doing this already, it's generally a good idea to set a low TTL like 60 seconds on your DNS so that you can make huge configuration changes relatively quickly.

Looking forward to the postmortem.

This is unfortunate. I was about to implement MixPanel on my startup's site (it's already in place in our development code), but their API being down completely locked our application ... shipping MixPanel in production isn't looking so hot right now.

EDIT: As others have voiced, my issue was that it was taking up to 30 seconds for requests to time out, disrupting various things in my JS code (it is a Backbone app) -- it didn't actually "lock" it. I was using the JS API.

As somebody who consumes a few of these things, and got paged about it prior to it being on HN, this is something that a) will inevitably happen and b) should not block the application. Think very, very carefully before you ever block the request/response cycle on an external API. (I'd say "Never do it," but I think I could come up with conceivable apps and APIs where that makes sense if I were fully awake.) Since analytics callbacks don't generate immediate customer value and can fail totally without discomfiting anyone, you should never block a request for them.

Instead, you deal with them asynchronously. There's a few ways to do it mechanically. I offload them to my job queue and set the priority to "lowest possible." When the job queue is otherwise empty, a worker process (that no actual human is waiting on) slurps up a few of the events and fires them off.
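The offload pattern above can be sketched in a few lines. This is a minimal illustration, not the poster's actual setup (he uses a Ruby job queue); `send_to_mixpanel` is a stand-in for the real HTTP call, not Mixpanel's client API:

```python
import queue
import threading

analytics_queue = queue.Queue()
sent = []  # stands in for "delivered to the analytics service"

def track(event, properties):
    """Called from request handlers; returns immediately."""
    analytics_queue.put({"event": event, "properties": properties})

def send_to_mixpanel(payload):
    sent.append(payload)  # placeholder for the real HTTP request

def worker():
    # Background worker that no actual human is waiting on.
    while True:
        payload = analytics_queue.get()
        try:
            send_to_mixpanel(payload)
        except Exception:
            pass  # losing an analytics event must never crash anything
        finally:
            analytics_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

track("Signed Up", {"plan": "free"})  # never blocks the request
analytics_queue.join()                # test-only: wait for the drain
```

The key property is that `track` only touches an in-process queue, so a hung analytics API can back up the queue but never the request/response cycle.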

Only downside, which I never bothered fixing: I have an "OMG the queue is stuffed to overflowing... the queue worker must be non-responsive!" monitoring test which, since that symptom has featured in most of my customer-visible downtime, generates a red alert. (i.e. Immediate SMS followed by phone call escalation, as opposed to an "FYI check this" email.) Any downtime at Mixpanel or KissMetrics lasting longer than a few minutes reliably triggers this alert.

On the plus side, this means that when I tell you "Mixpanel really doesn't fail at 2 AM Japan time all that often" you should trust that I'd have noticed.

I (and I'm sure other HNers) would love it if you could go into a little more detail as to how your job queue is constructed.

Code-wise: Delayed::Job https://github.com/collectiveidea/delayed_job

Design-wise: Two queue worker processes independent of web server processes. Delayed::Job lets you assign tasks a priority and lets you assign queue workers to the priority levels they are allowed to look at. By convention, priority 0 (highest) is interactive tasks in my applications (a user is at their keyboard waiting for an answer) and priority 10 (lowest) is, well, Mixpanel. Not because I don't love them, just because if Mixpanel blocks for an entire month my rent still gets paid, and that isn't true of any other priority level.

Queue worker A only works on priorities 0 through 9. Queue worker B works on 0 through 10. This ensures that even if Mixpanel (or Kissmetrics, also on 10) perpetually times out the higher-priority levels will never be totally blocked.
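The scheme above can be modeled in a few lines: each worker has a ceiling on the priority levels it may take. This is an illustrative sketch of the idea, not Delayed::Job's actual API (which is Ruby):

```python
import heapq

jobs = []   # min-heap of (priority, seq, name); 0 = most urgent
_seq = [0]  # tie-breaker so equal priorities stay FIFO

def enqueue(priority, name):
    heapq.heappush(jobs, (priority, _seq[0], name))
    _seq[0] += 1

def next_job(max_priority):
    """Pop the most urgent job with priority <= max_priority, or None."""
    eligible = [j for j in jobs if j[0] <= max_priority]
    if not eligible:
        return None
    job = min(eligible)
    jobs.remove(job)
    heapq.heapify(jobs)
    return job[2]
```

Worker A polls with `next_job(9)` and worker B with `next_job(10)`: even if every priority-10 Mixpanel job hangs on a timeout, worker A never picks one up, so the higher-priority levels keep flowing.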

Delayed::Job worker processes are monitored by god, which resets them if they fail, become bloated, etc. They're also separately monitored by Scout monitoring's DJ plugin, which fires a yellow alert if a job is ever older than an hour (shouldn't happen but can in event of e.g. an API outage) or a red alert if there's ever more than X jobs in the queue (high probability that this represents a crash god can't recover from).

Because even that setup was letting through about one outage every six months to a year, I have one other ace-in-the-hole: my ajax polling actions which check whether particular jobs are complete will, if they fail Y times in a row, a) instantiate a queue worker within the web server process (this degrades request processing but can't totally break the site) and b) fire off an independent-from-everything-else phone call to my cell phone saying "Your first, second, and third line of defense have failed. Get ye to an SSH terminal."

I'm not the GP, but in my startup we just use Resque, and it works wonderfully. I've also had semi-good experiences with RabbitMQ via AMQP. There's nothing special about the structure: depending on the system, you'll use channels/queue names/routing keys/whatever when enqueuing a job, and then workers will pull the jobs they're in charge of out of the queue. In the case of mixpanel jobs, you should set it up so they stick around when they fail.

Speaking as someone who uses a ton of mixpanel in production (http://www.twitch.tv/) the way you handle this is to make all your calls asynchronous.

I am bummed that we are going to have missing data though :-(

This works fine for code where you are just firing a /track 'ping' back to the server on a page load. But if you are trying to do things like track logins, where you send an HTTP redirect right after /track, then you need to use the optional callback parameter to /track and do the redirect in the callback.

The problem with this is that if their JS doesn't load at all (which is the problem right now), then /track is never called and thus the callback is never fired.

So because you can't depend on MP to reliably serve up a simple .js file, you have to write your own wrapper around all of that code to do a timeout in case the callback never fires. What a mess.

I don't use Mixpanel frequently, but could you not just put the Javascript file somewhere more reliable? Why not pull their code down and put it on your own CDN? I've done this with Mixpanel in the past, I believe. This wouldn't solve their API endpoints not responding obviously, but it's an easy way to take the issue into your own hands for the time being.

Also, I don't think I would ever make your core logic depend on events from something as dispensable as a third party analytics service. It just sounds like a poor design choice to have some core site functionality occur after receiving a response from something you do not control. I'd try to make those Mixpanel requests more "fire and forget", if you will.

Edit in response to your response below:

I think you're missing the point here. The scope of their documentation probably does not include best Javascript practices. You made some poor implementation decisions and now you're getting burned. At least now you have experience using third party services and probably won't be so trusting of them, again. Trivializing service stability and other things does not make your point any more credible, though. Personally, I think it makes people overlook your point because it makes it sound as if you have no idea what you're talking about.

Yup, lesson learned. I definitely won't ever depend on Mixpanel again to do something as simple as reliably serving up a single .js file.

I don't want to host their JS file myself, because what happens if they make a change that I need?

That said, their documentation should account for this better, imho. There should be sample code which clearly illustrates that if you are going to rely on callbacks being fired, you had better also set up a timer to make sure they actually fire.

Edit in response to your response above:

Mixpanel sells themselves on being simple to use, from their homepage: "It takes less than 10 minutes and is incredibly simple." The reality is far from that.

Yes, I made the mistake of assuming that their JS would always load and their callback would always fire. I fully own up to that, and I've learned my lesson. That said, their documentation should also reflect that the callback may not fire and that you should be prepared for it by doing XYZ. Two lines in bold red text, and I would have thought harder about depending on that callback firing. It honestly didn't cross my mind that their JS wouldn't load, and it hasn't been an issue until now.

I'm also not trivializing service stability; yes, things go down, and that is a fact of life. That said, when you've got $10+ million in the bank, you can put a bit of that money towards reliable serving of a single file that doesn't go down for hours on end. There are services such as CloudFlare and CloudFront which are pretty damn reliable for exactly this purpose and, yes, are trivial to implement.

Shouldn't the API calls you make to a tracking service be done in such a way that they don't affect your application whether the service is available or not? Analytics is nice, but not having it shouldn't cause your whole app to lock up, right? Kind of like a caching infrastructure: if memcached goes down, your app just slows down but doesn't crash, right?

Their documentation is either wrong or misleading:

Basic Javascript integration

We load the library onto the page asynchronously, which keeps your website loading quickly even if placed in the <head> of the page.

In our case just commenting out the library load solved all our issues.

Yea, the problem is that making actual calls to their API servers is taking a while.

As others have pointed out, no SaaS provider will guarantee 100% uptime. You are blaming a very poor design on your part on a 3rd party service. Ironically one should be far more concerned about using your startup for anything than using MixPanel.

There is no sane reason why Mixpanel can't give 100% uptime for a single .js file. https://api.mixpanel.com/site_media/js/api/mixpanel.2.js is down right now and that is the biggest part of the issue.

There are plenty of sane reasons, DDoS included (which seems to be what happened here), since our digital world is built on physical systems that have limitations and suffer from external influences that can't be contained.

This is exactly why services like CloudFlare exist. You can get pretty much the same behavior using CloudFront too. You may get a bit of downtime, but nothing like the hour+ that is going on now. You can even chain them together... CloudFlare in front of your CloudFront. CloudFlare screws up, just turn it off, now everything goes to CloudFront.

This is why I queue all my mixpanel requests on the server. I'll just re-run them once it's back up.

The iOS SDK also desperately needs some sort of queuing (fortunately it's open source, so someone else (like me) can eventually build it in).

This is a great idea! Do you mind letting us know what your setup's like?

We just issue Celery jobs to track events in Mixpanel, so if their API is failing, the jobs fail and get retried later on.

If we get too many messages accumulated in the queue where these jobs go, we just purge it (better to lose that data than let RabbitMQ die of memory exhaustion), although there have to be a whole damn lot of events there for us to actually notice.

Really neat! I might be a newbie to this, but how do you get those client-side JavaScript function calls over to the server side? I am assuming the client's browser is running the Mixpanel JavaScript and that your events are server-side.

Yep, like rprime said, we're sending tracking events from the server (let me plug http://libsaas.net here).

So instead of using mixpanel.track from Javascript, you do an AJAX call to your own server and schedule a Celery job there.

There's an issue with passing Mixpanel's super properties to your server-side handler, but that's the general idea.

Gotcha, I'm going to have to try this, thanks for the insight!

I think he is not talking about the JavaScript events. They probably fire the events from the backend. [1]

[1] https://mixpanel.com/docs/integration-libraries/

I mentioned it in an answer to a different post in this thread. The short answer is I use Redis via Resque.
