
There's a large category of complex systems where you have to get it right in advance because failures are so costly: anything avionics or real-time, for example. Or you can lose a lot of money without blowing anything up, but still faster than the on-call humans can respond: https://dougseven.com/2014/04/17/knightmare-a-devops-caution...

(High-reliability engineering is very much a different, more expensive, less agile culture from software startups, and I worry that the startup culture is bleeding across in inappropriate ways. The "self-driving" car with a "safety driver" is an extreme example of this: an on-call human who is supposed to respond to operational problems in an extremely short timeframe, but who also provides an opportunity to blame the human rather than the software.)




High reliability and high availability are not the same thing, though. There are still problems in aviation, like the Dreamliner, which had to be restarted every x days or it would go into a full system shutdown. In these kinds of systems you often sacrifice availability for reliability. You also sacrifice progress for reliability, which is absolutely the right thing to do for these projects.

An on-call engineer shouldn't be the solution for bad reliability because, as you said, that doesn't help. The on-call engineer is primarily there for availability.

Instead of "high throughput", maybe I should have said "highly available systems".



