PagerDuty (YC S10) Makes Sure Your Team Knows When A Server Goes Down (techcrunch.com)
94 points by alexsolo on July 17, 2010 | 52 comments

I'm usually pretty optimistic on new YC start-ups, but I'm not sure the world needs another uptime monitor.

There are so many free and cheap services like this that it's hard to imagine what a company might differentiate on to become the "next big thing."

I agree. That's why we don't do uptime monitoring, or any kind of monitoring really.

PagerDuty is an alerting system which plugs into any monitoring system (Pingdom, Nagios, Cloudkick, etc) and alerts your team via phone, SMS and email when problems are detected. We add advanced alerting features, like 2-way voice and SMS alerts, automatic alert escalation, and on-call duty scheduling to these existing tools.

You're right, though, that many people confuse us at first glance with a server monitoring or website pinging system. The "pitch" has gotten better over time, but it's still something we have to work on.

If you haven't done it then I'm going to suggest that you also work on mobile apps to make phones be as loud and annoying as possible. Yeah, it sounds stupid. But a random Blackberry can be as loud as a Skytel pager, even if it isn't by default. And it is worthwhile for someone on duty to make it so.

You know what I want? I want some sort of arm band to put my cellphone in (or, better, a giant Bluetooth vibrator) so that when I sleep, I can be woken by my pager without waking other people who may also be in the bed.

Ooh, thanks! that looks like it might solve my problem.

>> "PagerDuty is an alerting system which plugs into any monitoring system (Pingdom, Nagios, Cloudkick, etc) and alerts your team via phone, SMS and email when problems are detected."

Pingdom already does that. I'm not clear what you're offering here...

Pingdom doesn't provide our full range of alerting features, such as phone alerts, on-call scheduling, and two-way SMS (so you can hand off problems to other engineers straight from your phone).

Actually, Pingdom is one of the most common services used in conjunction with PagerDuty.

At a higher level, what we're trying to provide is an on-call management and alert dispatching tool. What PagerDuty does is let you control who, how, and when people are notified when problems occur. In contrast, monitoring tools like Pingdom and Nagios focus more on detecting problems. While they have some native alerting functionality, we think with PagerDuty's advanced alerting, they can function all the better.
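To make the "who, how, and when" part concrete, here is a minimal sketch of an on-call lookup, assuming a simple round-robin rotation with fixed-length shifts. The function `on_call`, the names, and the dates are all made up for illustration; nothing here describes how PagerDuty actually implements scheduling.

```python
from datetime import datetime, timedelta


def on_call(rotation, start, shift, when):
    """Return who is on call at `when` in a round-robin rotation.

    rotation: list of engineer names (hypothetical)
    start:    datetime when the first shift began
    shift:    timedelta length of each shift
    """
    if when < start:
        raise ValueError("schedule has not started yet")
    # timedelta // timedelta gives the number of whole shifts elapsed.
    shifts_elapsed = (when - start) // shift
    return rotation[shifts_elapsed % len(rotation)]


# Hypothetical three-person team with weekly shifts.
rotation = ["joe", "maria", "sam"]
start = datetime(2010, 7, 12, 9, 0)
week = timedelta(days=7)

# An alert at 3 a.m. on July 17 falls in the first week, so only joe is paged.
print(on_call(rotation, start, week, datetime(2010, 7, 17, 3, 0)))  # joe
```

A real schedule would also need overrides, time zones, and handoffs, which this sketch leaves out.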

OK, so PagerDuty's main focus/differentiation is duty and scheduling management? I do see the value of such a system: you want to wake only Joe, not the whole team, if he is on duty that night.

Maybe you should highlight that? My first impression was "what, yet another Pingdom?" That definitely needs a bit of fine-tuning.

Who will be your main target customer? Any website that has more than one sysadmin?

Yeah, we're going after companies with big uptime requirements that have already grown to the point that they have an operations team of more than one person.

Nagios does that too. You can define schedules to your heart's content, and it will send email anywhere. Given that basically all pagers/mobiles have email addresses, you're set.

The only thing that this does that is really new is two-way SMS.

PagerDuty also includes voice calls, which we've seen through experience are more reliably delivered than SMS messages (esp. SMSes through email-to-SMS gateways). As with the SMS messages, you can immediately acknowledge or escalate during the phone call using touch-tone.

I also think PagerDuty's ability to graphically define the on-call schedule and escalation rules is much nicer than mucking around with Nagios's configuration files, but I'm a bit biased :)

It's all about execution. Did we need another file storage service? Did we need another web site builder? There is a lot of room for companies to do existing things better.

Main question: what uptime guarantees do you guys make? I saw your answer in your FAQ, but if I were selling this to the powers that be, I don't know if your answer would cut it.

Couple of other questions for the team:

Is a Zabbix plugin forthcoming? Do you have to respond to alerts in your interface, or can our monitoring software let PagerDuty know the alert has been handled?

Though we already have a lot of the functionality you provide through a few custom scripts, we don't have the scheduling of engineers, which I've been meaning to write for a while (doing it manually with a small team hasn't been enough of an issue). So it's certainly a service I would consider using, if not on this project, then on my next one.

We've taken steps to minimize outages as much as possible. The system is distributed across 3 data centers, with fast automatic rollover in case of a data center outage. We've architected the system to ensure we never drop alerts. PagerDuty integrates with monitoring via email or API; if we receive the message on our end, we guarantee you will be alerted. We've had a few incidents where we have delayed sending out the phone call or SMS alert for a few minutes, but we've never dropped an alert.

In terms of setting a formal SLA, we haven't done so mainly because we're not sure how to go about implementing this. I've checked the SLAs of a few hosting and cloud providers including AWS, Rackspace, Linode and Slicehost, and I haven't found a compelling example to work from. Some of these guys don't have an SLA (they try their best) and the others give you only a portion of your money back.

The whole point of an SLA is to incentivize us to never go down. In our case, we know that if we ever go down, we will lose our customers; that's incentive enough :). Having said that, we may still add an SLA guarantee as part of a larger "enterprise" pricing plan.

We definitely plan on adding plugins for all the popular monitoring systems. We've also released an integration API to allow PagerDuty to integrate with any system that can make an HTTP API call (or call a command-line script that can do this).

I'm pretty sure Zabbix will work with PagerDuty right now, via the integration API. We'd love to work with you to set this up. Please send me an email at alex@pagerduty.com.
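For anyone wondering what "any system that can make an HTTP API call" looks like from the monitoring side, here is a rough sketch of a command-line-friendly script that builds such a call. The URL and the JSON field names (`source`, `status`, `message`) are placeholders invented for this example and are NOT the real PagerDuty API; check the actual API documentation for the real endpoint and payload.

```python
import json
import urllib.request

# Placeholder endpoint, invented for illustration. Not a real PagerDuty URL.
ALERT_URL = "https://alerts.example.invalid/events"


def build_alert_request(source, message, url=ALERT_URL):
    """Build (but do not send) an HTTP POST describing a detected problem.

    The payload fields are hypothetical; a real integration would use
    whatever fields the alerting service documents.
    """
    body = json.dumps(
        {"source": source, "status": "problem", "message": message}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_alert_request("zabbix", "web01: disk usage above 95%")
print(req.get_method(), req.full_url)

# To actually send it, a monitoring hook would do something like:
# urllib.request.urlopen(req, timeout=10)
```

The request is only constructed here, not sent, so the script can be run safely without a real endpoint.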

I'd love to hear what some of you think about SLAs. Is it worth implementing one?

There are lies, damned lies, and SLAs. Personally I only find an SLA useful if it is worthwhile. Most of the SLAs out there aren't. And for good reason. You should probably offer one but, like a smart company, you shouldn't make the burden on yourself too heavy.

Suppose someone doesn't respond to a page. Is it because they were too far asleep to hear the paging device? Because the paging device didn't work? Because some other problem kept them from working on the page remotely? Because their carrier blocked the page? Because you broke down? Because the problems in their system kept them from sending you the information in the first place?

There are a lot of points of failure. And your service is not one of the more likely ones to break. Furthermore if there is a dispute, whose records win? They didn't respond to a page, your records say they never sent the page. They blame you, how do you resolve that?

Therefore I'd suggest offering an SLA, but make it something like, "If you missed a page and are convinced that it was our fault, we'll refund the last X months." From your point of view it is a no-questions-asked refund policy that carries with it the consequence that the person is not allowed to sign up for your service again. (Unless, of course, you're convinced it was your fault they didn't receive their page.) But whatever you do, be careful not to accept potential liability for something that likely was their problem.

I would also suggest that you share best practices. For instance an important one is that companies need to provide a well-defined escalation path. Recognize that humans fail (whether because of not waking up, being in the process of driving, etc) and so people are unreliable components that need a fall-back mechanism. The act of educating your clients about things like this will help them avoid problems that could cause them in an imperfect world (ie the one we live in) to become unhappy with you.

SLAs with exceptions based on "fault" are meaningless. Either you guarantee you will keep your shit working, or you don't.

(Either way is fine, really... but arguing over "fault" is not a productive activity.)

It's not meaningless. If "working" depends on several pieces working, and only some of them are under your control, you can be in a state of "not working" without being at fault.

I've had a server go down for a large group of users because of a malconfigured routing table between them and the server. If we'd had an expensive SLA, there would have been significant "what the heck is it we're paying for, then?" discontent.

Right. My point is that if you are selling the customer a service, and you say "I will get you network connectivity" and then, for reasons outside your control, you don't get them network connectivity, it doesn't make much difference to the customer whether the network is broken because you did something dumb or because you are getting DDoS'd from China. The point is that the network is broken.

Last month I paid out almost fourteen grand in SLA credits because I didn't stop a DDoS within my allowed 0.5% downtime. Was it my fault I got DDoS'd? No. However, I was the only one in a position to do something about it. (And really, if I hadn't been tired and generally an idiot, we would have been down for an hour rather than 8.)

You do need clear lines, though. If you need connectivity from point A to point B, that's easy, I can guarantee that. But defining connectivity to "the internet" is harder. There are cases where I've got good connectivity to most places, but you can't get to some ISP in Dallas, because they've hoarked up the routing table.

Right now, I play that sort of thing by ear. If only one customer is having the problem, I try to figure out where it is and if I can't figure it out, it's not that big of a deal to give them a credit. If many customers are having the problem, well, then I have a problem, and really, it's my job to figure out where that problem is and to work around it... even if that problem is a misconfigured router at some other ISP. I mean, really, what is the customer going to do about that sort of thing?

This is the point of having an SLA; it aligns the interests of the service provider with the interests of the customer.
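As a back-of-the-envelope check on the 0.5% downtime allowance mentioned above, here is a small sketch of the arithmetic. It assumes the allowance is measured over a 30-day month, which the comment does not actually state, and the function names are invented for this example.

```python
def downtime_budget_hours(allowed_fraction, period_hours):
    """Hours of downtime permitted in a period before the SLA is breached."""
    return allowed_fraction * period_hours


def breaches_sla(outage_hours, allowed_fraction, period_hours):
    """True if a single outage of `outage_hours` exceeds the allowance."""
    return outage_hours > downtime_budget_hours(allowed_fraction, period_hours)


PERIOD_HOURS = 30 * 24  # assumption: a 30-day billing month
ALLOWED = 0.005         # the 0.5% figure from the comment

print(f"{downtime_budget_hours(ALLOWED, PERIOD_HOURS):.1f} hours allowed")  # 3.6 hours allowed
print(breaches_sla(8, ALLOWED, PERIOD_HOURS))  # True: the 8-hour outage
print(breaches_sla(1, ALLOWED, PERIOD_HOURS))  # False: the 1-hour case
```

Under that assumption, the one-hour outage would have stayed inside the allowance while the eight-hour one did not.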

At a former employer, it took weeks to negotiate the terms of an SLA with a solution provider and it ultimately fell through. We passed over several vendors because they lacked one (it's usually negotiated, not pre-written like a privacy policy or T&Cs).

However, what ultimately made the deal was not an SLA, but a new vendor that showed substantially deeper pockets than we expected. A meeting was organized with their sales guy and FOUR suits showed up in a black sedan. Most gangsta display of power, and our glass cubicles were gassed down with the musk of cigar, Brut and Drakkar Noir.

Nice. Good job guys.

Interesting. They don't include a free/freemium account, only paid ones with a free trial. I have been wondering about this.

I've always assumed the best business model is to offer everyone a free plan that is not limited in time but has fewer features or some other limit/constraint like number of users, amount of storage, etc.

I wonder how the two models compare, because I know a lot of people simply will not sign up for anything, even if there is a free trial. People just want something free that they can start using without running into walls, a la Google Docs, Gmail, Basecamp free account, etc...

Any thoughts?

The main reason we haven't offered a perpetually free account is because we're a bit different than other SaaS companies: hosting isn't our only cost, we also have to pay for each phone and SMS alert we send.

The other reason is that we see PagerDuty as solving a real "hair on fire" problem, and we think if you're one of the businesses that needs this, it's reasonable to pay a certain amount for the service. I'd like to hear your thoughts on this.

Ok, that's understandable. My default thinking would be: offer e-mail only notifications for free users, but you made a good point.

Your target audience/market is obviously not the casual user/blogger type so it makes perfect sense.

Might I suggest a free account that is limited to email alerts? It probably wouldn't cost you much, and it wouldn't cut into your 'business class' business... but it'd be a nice way for small timers to get a taste of your service monitoring their personal stuff (and then maybe recommend it to the boss)

I'd guess that anyone who needs to receive text messages already knows about the email-to-SMS gateways that their phone carriers provide.

But you don't really need this service for that. Nagios sends directly to those (just like any other email address).

Of course, the two-way SMS that lets you wake up the other guys if needed would break under this.

yeah, and many people disable those due to spam problems.

I think it's a no-brainer for b2b SaaS.

PagerDuty starts at $12/month. For a personal account that's a huge chasm for me to cross, but for business it feels like nothing. It's probably not even my money.

In business you're just not used to getting much for free, especially service - my bank charges me to write a check, my ISP charges me more to get business DSL to the office than home DSL, etc.

Maybe we're just resigned to that, but having a no-free-account policy just rides that wave and presumably increases profits (forces conversions to paid, ensures no loss-making free accounts).

It seems like more companies are moving from the freemium model to the free trial model. We recently switched over for http://www.theweddinglens.com/ and have been pleased with the increased number of conversions.

Interesting. Why did you make the switch? I'd love to hear details about that.

My thinking is that if more people are using the service, that also doubles as an advertisement if the users tell some of their friends, recommend to coworkers etc.

Just because someone signs up for a free service doesn't mean they're going to use and spread it. When you get them committed through paying, then there's an even higher usage rate as they figure out the best way to maximize it.

Under the original free plans we saw a lot of people just signing up for free to play around with it, only to end up paying a couple weeks later. By making it an actual free trial system, we're able to put a lot more pressure and messaging throughout the upgrade process.

Congrats Alex and Andrew!! That makes two UW SE2006 startups covered on TC :)

Though I think you may already have us beat by being YC funded, the jury is still out on that!

And I agree with the free trial model over freemium. Your service is worth paying for. Period. The trial is used for determining if your service actually works as expected. And you don't get network effects as more people use your system. So there's really no point in freemium.

Keep up the good work guys! Very exciting!

Thanks Omar. With Baskar, I think we might also have the distinction of being the first Comp Eng 2006 startup on TC, too :)

Congrats on the coverage you guys. Looking forward to seeing what else is up your sleeves.

Yay for UW!

I love PagerDuty, and it has already paid for itself many times over. The fact that it will phone my house if I miss the text messages has been a big win over AT&T's lousy coverage in my area.

Thanks very much! It's always great to hear that PagerDuty is working well for people.

These guys were the Twilio developer contest winners back in November http://contests.twilio.com/2009/11/outbound-notifications-al...

congrats Andrew, Alex, Baskar and the rest of the team!

Thanks Danielle! Twilio plays a big role here at PagerDuty. You guys have been great. More importantly, on the few occasions where there have been service hiccups, we've never had problems getting hold of someone at Twilio. Can't say the same about other providers we've tried.

You're welcome, and I'm happy to hear you've been satisfied with Twilio's service - we're really passionate about doing it right and we'll always be here to help

Btw, I've left you a "gift" in your Twilio account... whenever you happen to check your account balance :)

Thanks Danielle!

We just picked up a couple of real, physical pagers because we're unwilling to trust SMS delivery.

This is actually one of the big reasons we built PagerDuty. SMS is not as reliable as people think -- messages get dropped or delayed by hours all the time.

We've found the automated phone calls to be a much more reliable way of getting the alert out. We can tell right away that the message has been received by asking the listener to press a button on their phone, and repeat or escalate as needed if we don't hear the tone.
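The call, confirm, repeat-or-escalate loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `dispatch`, `fake_call`, the responder names, and the retry count are all hypothetical, and the phone call itself is replaced by a stand-in function rather than a real telephony API.

```python
def dispatch(escalation_chain, place_call, attempts_per_person=2):
    """Call each responder in order until someone acknowledges.

    place_call(responder) should return True if the listener pressed the
    button (acknowledged) and False otherwise. Returns the responder who
    acknowledged, or None if the whole chain was exhausted.
    """
    for responder in escalation_chain:
        # Repeat the call a few times before escalating to the next person.
        for _ in range(attempts_per_person):
            if place_call(responder):
                return responder
    return None


def fake_call(responder):
    """Stand-in for a real phone call: only 'secondary' presses the button."""
    return responder == "secondary"


print(dispatch(["primary", "secondary", "manager"], fake_call))  # secondary
```

A real dispatcher would also wait between attempts and record each call, which is omitted here to keep the sketch short.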

What a great name with a nice inside-joke component! The first question that went through my mind was "did they used to work at Amazon?" Was PagerDuty.com available or did you guys have to buy it?

It's a similar level of cleverness as Lobby7, if anyone remembers that one...

Amazingly enough, pagerduty.com was available. We were totally blown away by that -- thought for sure it would've been taken.

Nice! Do you have a link to the API details?

But can you send to my pager?

Congrats! PagerDuty rocks.

Congrats guys!

congrats guys - awesome work!

