I'm reminded of the phrase: if your intern deleted the production database, you don't have a bad intern; you have a bad process.
Whether this was a process problem or a human one, we don't really get to judge, since we do expect more from an FTE.
I'll just say, putting myself in his shoes made me tear up as I read the dread and pangs of pain upon realizing what happened, then the return to life after the ray of hope failed. That weight... I've never had a project that so many people depended on.
At a major brokerage firm I accidentally hit prod with a testing script that did several millions of dollars of fake FX test trades.
The first thing mentioned in the post mortem call was: “No one is going to blame the guy who did those trades. It was an honest mistake. What we are going to do is discuss why a developer can hit the production trading API without any authentication at all.”
No, it was caught by our trading ops guys. A few minutes after I hit enter I got a rather chilling phone call from them. So that part of the system worked.
Back in school, my roommate's mom worked for a hedge fund and he did part-time work for them. He factored out a common trading engine from individual strategies, and one day the head of the fund asked him to run a strategy that had made a bunch of money in the past, but had been retired after failing to make money for a while. So, he put the strategy back in production without any testing, forgetting that he had recently done some minor refactoring of the trading engine. He typo'd one variable name for a similar variable name, so in the loop where it broke down large orders into small orders, it actually had an infinite loop. Luckily the engine had an internal throttle, so it wasn't trading as fast as it could send messages over the network.
I was chatting with him when he noticed the stock the strategy was trading (KLAC) was gradually declining linearly. He looked at the L2 quotes and saw that someone using his brokerage was repeatedly putting out small orders, and then he realized they were his orders.
The fund got a margin call and had to shift some funds between accounts to make margin, and they had to contact regulators and inform them of the bug, and they had to manually trade their way out of the massive short position they traded. However, they ended up making $60,000 that day off of his mistake.
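A hypothetical sketch of the class of bug described above: when the loop that splits a large order into small child orders decrements a similarly named but wrong variable, the loop condition never changes and the engine submits orders forever. All names and quantities here are invented for illustration.

```python
def split_order(total_qty, child_qty):
    """Split a parent order into child order sizes until it is filled."""
    remaining = total_qty
    orders = []
    while remaining > 0:
        size = min(child_qty, remaining)
        orders.append(size)
        # The bug described above: after refactoring, decrementing a
        # similarly named variable (say `total_qty` instead of
        # `remaining`) leaves `remaining` untouched, so this loop
        # never terminates and keeps emitting child orders.
        remaining -= size  # correct version shown here
    return orders

print(split_order(1000, 300))  # -> [300, 300, 300, 100]
```

An internal throttle downstream of such a loop, as in the story, limits the damage rate but not the damage itself, which is why the position kept growing until a human noticed.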
You should never blame the individual for organizational failures like this. I see two process issues:
1. The plug was allowed to be connected backwards. Either this should be impossible or this hazard should be identified and more than one human should verify orientation
2. In-use tools like multimeters should never be disconnected. At worst you get problems like this; at best you annoy whoever was using it
Blaming individuals only gets them fired and weakens the entire organization. You just fired the one person who learned an expensive lesson.
The only time an individual should be blamed is when they intended harm, in which case the law can kick in.
You can't apply process thinking here, where the scenario is custom testing a unique probe, and you don't know what other constraints are in play (for example, the reason for the plug design). If NASA were sending these things to Mars by the dozen, then you can start to formalize things like test procedures and look for places mistakes can happen. But in this scenario, you're just disempowering your staff by not letting them choose the most effective and low-risk way to do one-off, highly specialized testing work.
I can't say about NASA, but I can speak to my experience at ESA (European Space Agency), where I worked on Mars lander hardware. You have very, very formal procedures and detailed checks as soon as you approach any part which is going to fly.
The simplest task you can imagine takes incredible proportions (for good reasons).
Disconnect and reconnect that plug? Please inform persons X and Y, person Z must be present, only person W can touch that plug, and do perform a functional test according to the procedure in this document before and after and file these reports etc ...
Cleaning a part? Oh glob. Get ready for 3 months of adventure talking to planetary protection experts and book the cleanest room in the continent.
The Hacker News mic drop strikes again. I have nothing super substantive to add except to agree with your point, and to add that yes, it feels like work to put in the formal policies and procedures, but when the stakes are high enough (rocket to Mars? it's high enough), even the work that doesn't intuitively feel 'worth it' to someone is DEFINITELY worth it.
"It's a waste of time" is very often a fallacy, especially when the risk cannot be easily undone.
I (mostly mentally) complete the phrase "It's a waste of time" with "what's the worst that could happen?", and when I'm actually saying the phrase out loud, stare at whoever said that for 5 full seconds.
Exactly :). The funny part is, the thing actually crashed! [1]
Why? Bad error handling in the software (primarily). What is the worst that could happen? An instrument saturates, a variable gets stuck at a value but keeps being integrated, and the spacecraft computes a negative altitude and thinks it's below ground level, when in fact it is in full descent, 3+ km above the surface. Oopsie!
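A toy illustration of the failure mode described above (all numbers invented): if a saturated sensor reading is clamped and then integrated as if it were valid, instead of being flagged as bad data, the estimated altitude can go negative while the craft is still kilometers up.

```python
SATURATION = 500.0  # invented sensor ceiling; the stuck/clamped reading

def integrate_altitude(alt0, descent_rates, dt):
    """Naively integrate descent-rate readings into an altitude estimate."""
    alt = alt0
    for rate in descent_rates:
        # Failure mode: a saturated reading is clamped to the ceiling
        # and integrated as if it were real, rather than rejected.
        rate = min(rate, SATURATION)
        alt -= rate * dt
    return alt

# A transient spike saturates the sensor for ten cycles; the stuck
# value keeps being integrated and the estimate plunges below zero
# even though the craft only descended ~1.2 km in reality.
readings = [80.0] * 5 + [10000.0] * 10 + [80.0] * 5
print(integrate_altitude(3500.0, readings, 1.0))  # -> -2300.0
```

The fix, of course, is to treat a saturated channel as invalid and stop trusting the integrated state, which is exactly the kind of error handling the comment says was missing.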
Aerospace-grade connectors are specifically designed to support multiple keyings that prevent this kind of thing. It's definitely a problem preventable by careful design if the interface supports making this kind of mistake.
Can confirm. Source: I used to work for NASA, and I'm a private pilot. There are literally millions of electrical connections that get made on aircraft and spacecraft on a regular basis and I can't think of ever hearing of an incident caused by one of them being made backwards. (Now, mechanical connections getting made backwards is not unusual. That's why you check to make sure that the flight control surfaces move in the right direction as one of the last checklist items before you take off. Every. Single. Time.)
I can think of very few kinds of connectors for which this type of error is even possible. You would need two cable terminations which can connect to each other, for which either side can plug into the same jack.
So either the ends are literally the same (e.g. Anderson Powerpole), or there is some kind of weird symmetry or inadequate keying. Or maybe the two cables don’t connect directly and instead go through some kind of interface? The latter is fairly common in networking, e.g. “feed-through” patch panels, keystone jacks, and quite a few kinds of fiber optic connectors.
All of these seem like utterly terrible ideas in an application where you would take the thing apart after final assembly and where the person doing the disassembly or reassembly could possibly access the wrong side of the panel.
One guy in our workshop had to provide DC to a display with a round 4-pin connector. He soldered two neighboring pins to Gnd and the other two to Vcc. There were two chances to short the power supply, one to brick the display, and one to get it right. Guess what we had to replace until we found out.
A break out box could very sensibly have both sides of the connector on it and then have the various pins broken out into individual connections for flexibility.
In that case keying or whatever isn't going to prevent you from connecting to the wrong side, because both sides are present.
Looks like the author tried to double-power the motor, both through the spacecraft motor driver and through the breakout box that MITMs the driver and the motor. In such an event, the free-wheeling diode in the driver will allow reverse current to be fed back to the driver's power supply, up to a certain amount. This will absorb back-EMF, or "regen" energy from the motor.
I'm suspecting the breakout box wasn't literally sitting between the driver and the motor, but rather that all internal connections are broken out to the box for testing; and likely the author's mistake was not touching the spacecraft to temporarily disconnect the driver.
But I'm not sure I'd have "just" made the right call and done so nonchalantly on a Mars rover launching in a few weeks.
There have been several cases of the landing gear up/down lever getting wired backwards during maintenance. Not to worry, the gear has a 'squat switch' sensor that prevents the gear from being raised when the plane is on the ground. Unless you taxi over a bump and the switch decides it's now airborne. Crunch.
It depends on what you mean by "that". Getting control surfaces actually reversed is not very common, but it does happen, typically after maintenance when a mechanic inadvertently re-connects a control cable backwards.
Control cables also can and do break, but that too is fairly rare.
What is not rare is control mechanisms jamming. Here is an example:
The one process part that can be controlled, and that jumps out from the first paragraphs, is not letting people touch billion-dollar equipment at the tail end of a 12-hour shift.
If you are putting people in a situation with absolutely no safeguards, you can’t have them go into it fatigued.
I’m guessing the people working on that team also weren’t getting great sleep, judging by the discussion of high stress and long hours. Recipe for disaster.
Agree on the blame point, but not on the firing point. As a manager, sometimes you need to fire people; that's a necessary part of your job. And no, changing the hiring process cannot prevent that.
For one incidental mistake, of course not. For repeated inattention (like plugging a Mars rover's cables in wrong several times) at an attention-demanding job -- yes.
At my first real job as a web dev after school, I crashed the production website on my very first day. Tens of thousands of visitors were affected, and all our sales leads stopped.
Thankfully, we were able to bring it back up within a few minutes, but it was still a harrowing ordeal. The entire team (and the CEO in the next room) was watching. It ended up fine and we laughed about it after some minor hazing :)
But by the time I left that job a couple years later, we had turned that fragile, unstable website into something with automatic testing, multiple levels of backups and failover systems across multiple data centers, along with detailed training and on-boarding for new devs. (This was in the early days of AWS, and production websites weren't just a one click deploy yet.)
That one experience led to me learning proper version control, dev environments, redis, sharding and clustering, VMs, Postgres and MySQL replication, wiki, monit, DNS, load balancers, reverse proxies, etc. All because I was so scared of ever crashing the website again.
That small company took a chance on me, a high school dropout with some WordPress experience, and paid me $15/hour to run their production website, lol. But they didn't fire me after I screwed up, and gave me the freedom and trust to learn on the job and improve their systems. I'm forever grateful to them!
Not in this case. It's a one-off very custom-built rover, the first of its name. There's already all kinds of processes established, but no one can foresee everything. Yes, they probably fixed the process after that, but remember that it was their first time.
PS: Also, more rules and better processes are not necessarily a good thing. Sometimes there is just too much red tape and bureaucracy, which makes the already super-slow NASA even slower. In those first-of-their-kind missions, sometimes you need to take risks and depend on people, not processes.
I can't help but think about what would have happened if the rover had indeed been destroyed, though. It seems the only thing that stopped that from happening was sheer luck: they could (I guess) just as easily have connected to another wrong lead that lacked the protection required to survive the charge. That is, it was outside the author's actual ability to prevent it, and he could just as easily have been the destroyer of the rover, forever remembered for that fact, as he had feared he would be.
The fact that they were still being made to work after completing a 12-hour shift (which is already too long to be safe) means this was a process error.
All heroes in my book.