We believe we've identified the problem and have already rolled out a solution. The issue, at this time, appears to be that traffic to our API has increased substantially. Whether it's legitimate or not is unknown, but we're unfortunately at the mercy of waiting for DNS to propagate. It's also possible we'll see bottlenecks further down our infrastructure path, but we're actively thinking through what's next - fortunately, we have more control there.
We'll have a full, transparent write-up once things are back to normal.
Our support team is happy to talk to you at status.mixpanel.com to give you updates as they get them.
I just sent this same message to support (at) mixpanel.com.
EDIT: just to reiterate -- the major problem isn't the API being down. It's the extreme timeouts being experienced trying to load a static JS file.
This is the second time you guys have had a major outage in recent memory, and your status page looks like a Christmas tree. I really do love Mixpanel, but unfortunately I'm going to relegate your service in my mind to "untrustworthy third-party service" every time I write code against your API.
The issue is that you can't insert b (the &lt;script&gt; tag) into the DOM until you've declared window.mixpanel, because the code within mixpanel.2.js depends on window.mixpanel being defined.
It is a subtle race condition that only seems to happen with IE, but when it fails, it causes one of those ugly IE dialogs to pop up for the end user. Not good.
The correct code should look like this near the end... just move the insertBefore to after the window.mixpanel assignment:

a.__SV = 1.1;
window.mixpanel = a;
c.parentNode.insertBefore(b, c); // only inject the script once the stub exists
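In case it helps anyone, here's a tiny runnable sketch of why the ordering matters. The names and stub shape here are illustrative, not Mixpanel's actual internals: the point is that the loader snippet builds a stub that queues calls, and the real library reads that global when its script executes, so the global must be assigned before the script is inserted.

```javascript
// Simulation of the stub-queue pattern. `globalScope` stands in for `window`.
var globalScope = {};

// The snippet builds a stub that queues any calls made before the library loads:
var stub = [];
stub.track = function () {
  stub.push(["track"].concat([].slice.call(arguments)));
};
stub.__SV = 1.1;
globalScope.mixpanel = stub; // this assignment must come FIRST

// Simulated body of the real library, which reads the global when it
// executes, the way mixpanel.2.js reads window.mixpanel:
function libraryBody(scope) {
  var queued = scope.mixpanel;
  if (!queued) {
    // This is the IE race: the script executes before the stub is assigned.
    throw new Error("window.mixpanel undefined");
  }
  return queued.splice(0, queued.length); // replay queued calls
}

// A call made before "load" is queued, then replayed when the script runs:
globalScope.mixpanel.track("signup");
var replayed = libraryBody(globalScope);
```

If the assignment and the insert are swapped, libraryBody sees undefined - that's the failure mode surfacing as the IE error dialog.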
I wish that was true. But since the library doesn't handle service downtime at all gracefully, we're stuck with the callbacks not firing -- something we depended on.
The SnowPlow architecture diagram is here: https://github.com/snowplow/snowplow/wiki/Technical-architec...
Might I suggest something like Chaos Monkey? Real-time APIs are highly vulnerable to DDoS and other traffic conditions because they'll so readily accept new connections, might be something worth throwing on the crew regular-like. :)
EDIT: As others have voiced, my issue was that it was taking up to 30 seconds for requests to time out, disrupting various things in my JS code (it is a Backbone app) -- it didn't actually "lock" it. I was using the JS API.
Instead, you deal with them asynchronously. There are a few ways to do it mechanically. I offload them to my job queue and set the priority to "lowest possible." When the job queue is otherwise empty, a worker process (that no actual human is waiting on) slurps up a few of the events and fires them off.
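For the mechanics, something like this sketch - the JobQueue class and priority numbers are illustrative, not any particular library's API. Analytics events go in at the lowest priority and are drained by a worker nobody is waiting on, with failures swallowed so they can't propagate:

```javascript
// Minimal in-memory job queue where 0 is the highest priority.
class JobQueue {
  constructor() { this.jobs = []; }
  enqueue(priority, fn) {
    this.jobs.push({ priority, fn });
    this.jobs.sort((a, b) => a.priority - b.priority); // keep highest priority first
  }
  // A background worker drains a few jobs at a time, highest priority first:
  drain(n) {
    const batch = this.jobs.splice(0, n);
    batch.forEach(job => {
      try { job.fn(); } catch (e) { /* an analytics failure must not propagate */ }
    });
    return batch.length;
  }
}

const queue = new JobQueue();
const sent = [];
// Interactive work at priority 0; analytics at 10, "lowest possible":
queue.enqueue(10, () => sent.push("mixpanel event"));
queue.enqueue(0, () => sent.push("render page"));
queue.drain(2); // interactive work runs first, analytics only afterwards
```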
Only downside, which I never bothered fixing: I have an "OMG the queue is stuffed to overflowing... the queue worker must be non-responsive!" monitoring test which, since that symptom has featured in most of my customer-visible downtime, generates a red alert. (i.e. Immediate SMS followed by phone call escalation, as opposed to an "FYI check this" email.) Any downtime at Mixpanel or KissMetrics lasting longer than a few minutes reliably triggers this alert.
On the plus side, this means that when I tell you "Mixpanel really doesn't fail at 2 AM Japan time all that often" you should trust that I'd have noticed.
Design-wise: two queue worker processes independent of web server processes. Delayed::Job lets you assign tasks a priority and lets you restrict each queue worker to the priority levels it is allowed to look at. By convention, priority 0 (highest) is interactive tasks in my applications (a user is at their keyboard waiting for an answer) and priority 10 (lowest) is, well, Mixpanel. Not because I don't love them, just because if Mixpanel blocks for an entire month my rent still gets paid, and that isn't true of any other priority level.
Queue worker A only works on priorities 0 through 9. Queue worker B works on 0 through 10. This ensures that even if Mixpanel (or Kissmetrics, also on 10) perpetually times out the higher-priority levels will never be totally blocked.
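The partition logic is simple enough to sketch - the data structures here are illustrative, not Delayed::Job itself. Worker A never considers priority 10, so a perpetually timing-out priority-10 job can tie up at most worker B:

```javascript
// A worker only picks jobs at or above its allowed priority cutoff (0 = highest).
function makeWorker(maxPriority) {
  return {
    pick(jobs) {
      let best = -1;
      for (let i = 0; i < jobs.length; i++) {
        if (jobs[i].priority <= maxPriority &&
            (best === -1 || jobs[i].priority < jobs[best].priority)) {
          best = i;
        }
      }
      return best === -1 ? null : jobs.splice(best, 1)[0];
    }
  };
}

const jobs = [
  { priority: 10, name: "mixpanel-event" },   // may hang on an API timeout
  { priority: 0,  name: "interactive-task" }, // a user is waiting on this
];

const workerA = makeWorker(9);   // priorities 0 through 9 only
const workerB = makeWorker(10);  // everything, including 10

const aJob = workerA.pick(jobs); // skips the potentially stuck priority-10 job
const bJob = workerB.pick(jobs); // takes whatever is left, including Mixpanel
```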
Delayed::Job worker processes are monitored by god, which resets them if they fail, become bloated, etc. They're also separately monitored by Scout monitoring's DJ plugin, which fires a yellow alert if a job is ever older than an hour (shouldn't happen but can in event of e.g. an API outage) or a red alert if there's ever more than X jobs in the queue (high probability that this represents a crash god can't recover from).
Because even that setup was letting through about one outage every six months to a year, I have one other ace-in-the-hole: my ajax polling actions which check whether particular jobs are complete will, if they fail Y times in a row, a) instantiate a queue worker within the web server process (this degrades request processing but can't totally break the site) and b) fire off an independent-from-everything-else phone call to my cell phone saying "Your first, second, and third line of defense have failed. Get ye to an SSH terminal."
I am bummed that we are going to have missing data though :-(
The problem with this is that if their JS doesn't load at all (which is the problem right now), then /track is never called and thus the callback is never fired.
So because you can't depend on MP to reliably serve up a simple .js file, you have to write your own wrapper around all of that code to do a timeout in case the callback never fires. What a mess.
Also, I don't think I would ever make your core logic depend on events from something as dispensable as a third party analytics service. It just sounds like a poor design choice to have some core site functionality occur after receiving a response from something you do not control. I'd try to make those Mixpanel requests more "fire and forget", if you will.
Edit in response to your response below:
I don't want to host a copy of their JS file myself, because what happens if they make a change that I need?
That said, their documentation should account for this better imho. There should be sample code which clearly illustrates that if you are going to rely on callbacks being fired, you had better also set up a timer to make sure that they actually get fired.
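Something like this is roughly the sample code I mean - a generic wrapper, not Mixpanel's API; `fn` is any callback-taking call (e.g. a track call). The callback fires exactly once, either when the call answers or when the deadline passes:

```javascript
// Run `fn`, but guarantee `callback` fires within `ms` milliseconds.
// callback(true)  -> the underlying call answered in time
// callback(false) -> the deadline hit first (e.g. the JS never loaded)
function withTimeout(fn, callback, ms) {
  let done = false;
  const finish = (ok) => {
    if (done) return; // whichever path fires first wins; the other is a no-op
    done = true;
    callback(ok);
  };
  const timer = setTimeout(() => finish(false), ms);
  fn(() => {
    clearTimeout(timer);
    finish(true);
  });
}

// Sync path for illustration: a call that answers immediately.
const results = [];
withTimeout(cb => cb(), ok => results.push(ok), 300);
```

Usage against an analytics call would look like `withTimeout(cb => mixpanel.track("signup", {}, cb), proceed, 300)`, so a hung script load costs you 300 ms instead of stalling the app.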
Edit in response to your response above:
Mixpanel sells themselves on being simple to use, from their homepage: "It takes less than 10 minutes and is incredibly simple." The reality is far from that.
Yes, I made the mistake of assuming that their JS would always get loaded and their callback would always get fired. I fully own up to that and I've learned my lesson. That said, their documentation should also reflect that the callback may not fire and that you should be prepared for that by doing XYZ. That's two lines in bold red text, and I would have thought harder about depending on that callback to fire. It honestly didn't cross my mind that their JS wouldn't load, and it hasn't been an issue until now.
I'm also not trivializing service stability; yes, things go down and that is a fact of life. That said, when you've got $10+ million in the bank, you can prioritize a bit of that money towards reliably serving a single file so it doesn't go down for hours on end. There are services, such as CloudFlare and CloudFront, which are pretty damn reliable for exactly this purpose and, yes, are trivial to implement.
We load the library onto the page asynchronously, which keeps your website loading quickly even if placed in the <head> of the page.
In our case just commenting out the library load solved all our issues.
If we get too many messages accumulated in the queue where these jobs go, we just purge it (better to lose that data than let RabbitMQ die of memory exhaustion). Although there have to be a whole damn lot of events there for us to actually notice.
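The policy boils down to something like this sketch (illustrative only; in production, RabbitMQ's own queue-length limits are the better mechanism). Past a cap, drop everything: losing tracking data beats exhausting broker memory.

```javascript
const MAX_QUEUED = 3; // tiny cap for illustration; real caps are much larger

// Enqueue an analytics event, purging the whole queue if it has grown past the cap.
function enqueueEvent(queue, event) {
  if (queue.length >= MAX_QUEUED) {
    queue.length = 0; // purge everything rather than grow without bound
  }
  queue.push(event);
  return queue;
}

const q = [];
["a", "b", "c", "d"].forEach(e => enqueueEvent(q, e));
// "d" arrives with the queue full, so a/b/c are purged and only "d" remains
```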
There's an issue with passing Mixpanel's super properties to your server-side handler, but that's the general idea.