I agree. That's why we don't do uptime monitoring, or any kind of monitoring really.
PagerDuty is an alerting system which plugs into any monitoring system (Pingdom, Nagios, Cloudkick, etc) and alerts your team via phone, SMS and email when problems are detected. We add advanced alerting features, like 2-way voice and SMS alerts, automatic alert escalation, and on-call duty scheduling to these existing tools.
You're right, though, that many people, at first glance, confuse us with a server monitoring or website pinging system. The "pitch" has gotten better over time, but it's still something we have to work on.
If you haven't done it then I'm going to suggest that you also work on mobile apps to make phones be as loud and annoying as possible. Yeah, it sounds stupid. But a random Blackberry can be as loud as a Skytel pager, even if it isn't by default. And it is worthwhile for someone on duty to make it so.
you know what I want? I want some sort of arm band to hold my cellphone (or, better, a giant Bluetooth vibrator) so that when I sleep, I can be woken by my pager without waking other people who may also be in the bed.
Pingdom doesn't provide our full range of alerting features, such as phone alerts, on-call scheduling, and two-way SMS (so you can hand off problems to other engineers straight from your phone).
Actually, Pingdom is one of the most common services used in conjunction with PagerDuty.
At a higher level, what we're trying to provide is an on-call management and alert dispatching tool. What PagerDuty does is let you control who, how, and when people are notified when problems occur. In contrast, monitoring tools like Pingdom and Nagios focus more on detecting problems. While they have some native alerting functionality, we think with PagerDuty's advanced alerting, they can function all the better.
PagerDuty also includes voice calls, which we've seen through experience are more reliably delivered than SMS messages (esp. SMSes through email-to-SMS gateways). As with the SMS messages, you can immediately acknowledge or escalate during the phone call using touch-tone.
I also think PagerDuty's ability to graphically define the on-call schedule and escalation rules is much nicer than mucking around with Nagios's configuration files, but I'm a bit biased :)
Main question: what uptime guarantees do you guys make? I saw your answer in your FAQ, but if I were selling this to the powers that be, I don't know if your answer would cut it.
Couple of other questions for the team:
Is a Zabbix plugin forthcoming? Do you have to respond to alerts in your interface, or can our monitoring software let PagerDuty know the alert has been handled?
Though we already have a lot of the functionality you provide through a few custom scripts, we don't have the scheduling of engineers, which I've been meaning to write for a while (doing it manually with a small team hasn't been enough of an issue). So it's certainly a service I would consider using, if not on this project, then on my next one.
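For a small team, the engineer scheduling mentioned above can start out as a fixed round-robin rotation. This is only a minimal sketch, assuming fixed-length shifts; the roster names, the start date, and the `on_call` function are all made up for illustration, and it says nothing about how PagerDuty itself implements scheduling.

```python
from datetime import date


def on_call(roster, rotation_start, day, shift_days=7):
    """Return who is on call on `day` in a fixed round-robin rotation.

    roster         -- ordered list of engineers (hypothetical names below)
    rotation_start -- first day of the first person's shift
    shift_days     -- length of each shift in days (assumed weekly)
    """
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    shifts_elapsed = (day - rotation_start).days // shift_days
    return roster[shifts_elapsed % len(roster)]


roster = ["alice", "bob", "carol"]  # hypothetical team
start = date(2010, 6, 7)            # arbitrary start date

for day in (date(2010, 6, 7), date(2010, 6, 16), date(2010, 6, 28)):
    print(day, on_call(roster, start, day))
```

Swaps, holidays and overrides are exactly where a hand-rolled version like this gets painful, which is the part a hosted scheduler is meant to take over.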
We've taken steps to minimize outages as much as possible. The system is distributed across 3 data centers, with fast automatic rollover in case of a data center outage. We've architected the system to ensure we never drop alerts. PagerDuty integrates with monitoring via email or API; if we receive the message on our end, we guarantee you will be alerted. We've had a few incidents where we have delayed sending out the phone call or SMS alert for a few minutes, but we've never dropped an alert.
In terms of setting a formal SLA, we haven't done so mainly because we're not sure how to go about implementing this. I've checked the SLAs of a few hosting and cloud providers including AWS, Rackspace, Linode and Slicehost, and I haven't found a compelling example to work from. Some of these guys don't have an SLA (they try their best) and the others give you only a portion of your money back.
The whole point of an SLA is to incentivize us to never go down. In our case, we know that if we ever go down, we will lose our customers; that's incentive enough :). Having said that, we may still add an SLA guarantee as part of a larger "enterprise" pricing plan.
We definitely plan on adding plugins for all the popular monitoring systems. We've also released an integration API to allow PagerDuty to integrate with any system that can make an HTTP API call (or call a command-line script that can do this).
I'm pretty sure Zabbix will work with PagerDuty right now, via the integration API. We'd love to work with you to set this up. Please send me an email at email@example.com.
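To make the "any system that can make an HTTP API call" idea concrete, here is a rough sketch of what a monitoring-side script might build. Everything specific in it is an assumption: the endpoint is a placeholder, and the field names (`service_key`, `event_type`, `incident_key`, `description`) are hypothetical rather than taken from the actual integration API. The thread also doesn't say whether the API accepts a "resolve" call, so treat that part as illustrative only.

```python
import json
import urllib.request

# Placeholder endpoint: hypothetical, NOT the real integration API URL.
ENDPOINT = "https://alerts.example.com/events"


def build_event(service_key, event_type, incident_key, description=""):
    """Build an event payload. All field names here are assumptions."""
    if event_type not in ("trigger", "resolve"):
        raise ValueError("event_type must be 'trigger' or 'resolve'")
    return {
        "service_key": service_key,
        "event_type": event_type,
        "incident_key": incident_key,  # ties a resolve back to its trigger
        "description": description,
    }


def build_request(event):
    """Wrap the payload in an HTTP POST request without sending it."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The monitoring tool detects a problem, then later sees it recover.
trigger = build_event("YOUR-KEY", "trigger", "db01/disk", "disk 95% full")
resolve = build_event("YOUR-KEY", "resolve", "db01/disk")

req = build_request(resolve)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would actually send it; it is left out here
# so the sketch runs offline.
```

A tool like Zabbix that can run a command-line script on an alert could call a script along these lines, once the real endpoint and fields are filled in from the API documentation.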
There are lies, damned lies, and SLAs. Personally I only find an SLA useful if it is worthwhile. Most of the SLAs out there aren't. And for good reason. You should probably offer one but, like any smart company, you shouldn't make the burden on yourself too heavy.
Suppose someone doesn't respond to a page. Is it because they were too far asleep to hear the paging device? Because the paging device didn't work? Because some other problem kept them from working on the page remotely? Because their carrier blocked the page? Because you broke down? Because the problems in their system kept them from sending you the information in the first place?
There are a lot of points of failure. And your service is not one of the more likely ones to break. Furthermore if there is a dispute, whose records win? They didn't respond to a page, your records say they never sent the page. They blame you, how do you resolve that?
Therefore I'd suggest offering an SLA, but make it something like, "If you missed a page and are convinced that it was our fault, we'll refund the last X months." From your point of view it is a no-questions-asked refund policy that carries with it the consequence that that person is not allowed to sign up for your service again. (Unless, of course, you're convinced it was your fault they didn't receive their page.) But whatever you do, be careful not to accept potential liability for something that likely was their problem.
I would also suggest that you share best practices. For instance an important one is that companies need to provide a well-defined escalation path. Recognize that humans fail (whether because of not waking up, being in the process of driving, etc) and so people are unreliable components that need a fall-back mechanism. The act of educating your clients about things like this will help them avoid problems that could cause them in an imperfect world (ie the one we live in) to become unhappy with you.
It's not meaningless? If "working" is dependent on several pieces working, and only some of them are under your control, you can be in a state of "not working" without being at fault.
I've had a server go down for a large group of users because of a malconfigured routing table between them and the server. If we'd had an expensive SLA, there would have been significant "what the heck is it we're paying for, then?" discontent.
right. my point is that if you are selling the customer a service, and you say 'I will get you network connectivity' and then, for reasons outside your control, you don't get them network connectivity, it doesn't make much difference to the customer if the network is broken because you did something dumb or if the network is broken because you are getting DDoS'd from China. the point is that the network is broken.
last month I paid out almost fourteen grand in SLA credits because I didn't stop a DDoS within my allowed 0.5% downtime. Was it my fault I got DDoS'd? no. However, I was the only one in a position to do something about it. (and really, if I wasn't tired and generally an idiot, we would have been down for an hour rather than 8.)
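For a sense of scale, here is the arithmetic behind a 0.5% downtime allowance, as a quick sketch. It assumes the allowance is measured over a 30-day month, which the comment doesn't actually state; the function name is made up for the example.

```python
def allowed_downtime_hours(allowance_pct, days_in_period=30):
    """Hours of downtime permitted by a percentage allowance.

    days_in_period is an assumption (30-day month); the actual
    measurement window isn't given in the thread.
    """
    return days_in_period * 24 * allowance_pct / 100


budget = allowed_downtime_hours(0.5)
print(f"allowed downtime: {budget:.1f} hours")

for outage_hours in (1, 8):
    overrun = max(0.0, outage_hours - budget)
    status = "within allowance" if overrun == 0 else f"over by {overrun:.1f} hours"
    print(f"{outage_hours}-hour outage: {status}")
```

Under that assumption the allowance works out to 3.6 hours a month, so a one-hour outage would have fit inside it while the eight-hour one overshoots by roughly 4.4 hours.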
You do need clear lines, though. if you need connectivity from point A to point B, that's easy, I can guarantee that. But defining connectivity to 'the internet' is harder. there are cases where I've got good connectivity to most places, but you can't get to some ISP in dallas, because they've hoarked up the routing table.
Right now, I play that sort of thing by ear. If only one customer is having the problem, I try to figure out where it is and if I can't figure it out, it's not that big of a deal to give them a credit. If many customers are having the problem, well, then I have a problem, and really, it's my job to figure out where that problem is and to work around it... even if that problem is a misconfigured router at some other ISP. I mean, really, what is the customer going to do about that sort of thing?
this is the point of having an SLA; it aligns the interests of the service provider with the interests of the customer.
However, what ultimately made the deal was not an SLA, but a new vendor that showed substantially deeper pockets than we expected. A meeting was organized with their sales guy and FOUR suits showed up in a black sedan. Most gangsta display of power, and our glass cubicles were gassed down with the musk of cigar, Brut and Drakkar Noirs.
Interesting. They don't include a free/freemium account, only paid ones with a free trial. I have been wondering about this.
I've always assumed the best business model is to offer everyone a free plan that is not limited in time but has fewer features or some other constraint, like number of users, amount of storage, etc.
I wonder how the two models compare, because I know a lot of people simply will not sign up for anything, even if there is a free trial. People just want something free they can start using without running into walls, a la Google Docs, Gmail, the Basecamp free account, etc.
The main reason we haven't offered a perpetually free account is because we're a bit different than other SaaS companies: hosting isn't our only cost, we also have to pay for each phone and SMS alert we send.
The other reason is that we see PagerDuty as solving a real "hair on fire" problem, and we think if you're one of the businesses that needs this, it's reasonable to pay a certain amount for the service. I'd like to hear your thoughts on this.
Might I suggest a free account that is limited to email alerts? It probably wouldn't cost you much, and it wouldn't cut into your 'business class' business... but it'd be a nice way for small timers to get a taste of your service monitoring their personal stuff (and then maybe recommend it to the boss)
It seems like more companies are moving from the freemium model to the free trial model. We recently switched over for http://www.theweddinglens.com/ and have been pleased with the increased number of conversions.
Just because someone signs up for a free service doesn't mean they're going to use and spread it. When you get them committed through paying, then there's an even higher usage rate as they figure out the best way to maximize it.
Under the original free plans we saw a lot of people just signing up for free to play around with it, only to end up paying a couple weeks later. By making it an actual free trial system, we're able to put a lot more pressure and messaging throughout the upgrade process.
Congrats Alex and Andrew!! That makes two UW SE2006 startups covered on TC :)
Though I think you may already have us beat by being YC funded, the jury is still out on that!
And I agree with the free trial model over freemium. Your service is worth paying for. Period. The trial is used for determining if your service actually works as expected. And you don't get network effects the more people that use your system. So there's really no point for freemium.
Thanks Danielle! Twilio plays a big role here at PagerDuty. You guys have been great. More importantly, on the few occasions where there have been service hiccups, we've never had problems getting hold of someone at Twilio. Can't say the same about other providers we've tried.
This is actually one of the big reasons we built PagerDuty. SMS is not as reliable as people think -- messages get dropped or delayed by hours all the time.
We've found the automated phone calls to be a much more reliable way of getting the alert out. We can tell right away that the message has been received by asking the listener to press a button on their phone, and repeat or escalate as needed if we don't hear the tone.
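The "repeat or escalate if we don't hear the tone" behaviour can be pictured as a simple loop over an escalation chain. This is a hedged sketch only: `dispatch`, `place_call` and the retry count are invented for illustration, the phone call is simulated, and nothing here is claimed to match PagerDuty's actual implementation.

```python
def dispatch(alert, escalation_chain, place_call, attempts_per_person=2):
    """Call each person in order until someone acknowledges the alert.

    place_call(person, alert) stands in for an automated phone call and
    should return True if the listener pressed the acknowledge key.
    Returns (person, attempt) for whoever acknowledged, or (None, 0)
    if the whole chain was exhausted.
    """
    for person in escalation_chain:
        for attempt in range(1, attempts_per_person + 1):
            if place_call(person, alert):
                return person, attempt
        # No tone after all attempts: fall through and escalate.
    return None, 0


calls = []  # record of simulated calls, in order


def fake_call(person, alert):
    """Simulated call: only the secondary on-call picks up and acknowledges."""
    calls.append(person)
    return person == "secondary"


chain = ["primary", "secondary", "manager"]  # hypothetical escalation path
result = dispatch("db01 is down", chain, fake_call)
print(result)
print(calls)
```

In this simulated run the primary is tried twice without an acknowledgement, then the alert escalates to the secondary, who acknowledges on the first call.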