
How to lose $172k per second for 45 minutes (2013) - sunasra
https://sweetness.hmmz.org/2013-10-22-how-to-lose-172222-a-second-for-45-minutes.html
======
wcoenen
I'm amused by the tone. It's like the author doesn't realize that 99% of
software development and deployment is done like this, or much much worse.
Welcome to the real world.

We work in an incredibly immature industry. And trying to enforce better
practices rarely works out as intended. To give one example: we rolled out
mandatory code reviews on all changes. Now we have thousands of rubber-stamped
"looks good to me" code reviews without any remarks.

Managers care about speed of implementation, not quality. At retrospectives, I
hear unironic boasts about how many bugs were solved last sprint, instead of
reflection on how those bugs were introduced in the first place.

~~~
seanwilson
> I'm amused by the tone. It's like the author doesn't realize that 99% of
> software development and deployment is done like this, or much much worse.
> Welcome to the real world.

Agree with this, a lot of developers are in a filter bubble where they stick
to communities that advocate modern practices like automated testing,
continuous integration, containers, gitflow, staging environments etc.

As a contractor, I get to see the internals of lots of different companies -
forget practices even as basic as code reviews, I've seen companies with no
source control, no staging environments, and no local development
environments, where all changes are made directly on the production server via
SFTP on a basic VPS. A lot of the time there are no internal experts who are
even aware there are better ways of doing things; it's not that they lack the
resources to make improvements.

~~~
Reedx
> I've seen companies not using source control

> ...all changes are being made directly on the production servers via SFTP

I know this used to be common, but recently? Curious how often this is still
the case.

~~~
frosted-flakes
I have seen it recently. I did my best to change the practice before I left
the company, but was mostly unsuccessful. Given that they were still running
some spaghetti-code PHP scripts written in 1999 and still used PHP4 in _new_
development, they were stuck in the stone age. For a little perspective,
support for PHP4 ended in 2008, so they'd had almost a decade to update, but
didn't.

~~~
HenryBemis
"If it ain't broken, don't fix it". And then one day the server goes boom, the
backup was incomplete, and everyone is trying to find the usb flash disk with
Spinrite in it.

Meanwhile the CEO who was rejecting the €¥$£ in yh budget since 2000 is angry
at everyone!

Oh the times I have seen this!!!

~~~
frosted-flakes
Oh, now that you mention backups, that was a nightmare too. Thankfully, the
production database was backed up daily to magnetic tape and stored offsite,
but the code was generally edited live on the server, and backups consisted of
adding ".bak20190402" to the end of the file name. Needless to say, losing
code wasn't uncommon.

This was for a 100+ year old company with millions of dollars in annual
revenue that was owned by the government. So, yeah. 100% the IT director's
fault, who'd been there since the early 90s.

------
chollida1
discussed previously at:

[https://news.ycombinator.com/item?id=6589508](https://news.ycombinator.com/item?id=6589508)

I remember the week after this. Everyone I knew who worked at a fund was going
over their code and also updating their Compliance documents covering testing
and deployment of automated code.

As a side note, one of the biggest ways funds tend to get in trouble with
their regulators is by not following the steps outlined in their compliance
manual. It's been my experience that regulators care more that you follow the
steps in your manual than that those steps are necessarily the best way to do
something.

I came away from this thinking the worst part was that their system did send
them errors; it's just that when you deal with billions of events, emailed
errors tend to get ignored, because at that scale logging generates so many
false positives.

I still don't know the best way to monitor and alert users for large
distributed systems.

The other takeaway was that this wasn't just a software issue but a deployment
issue as well. There wasn't one root cause but a number of issues that built
up to cause the failure.

1) New exchange feature going live so this is the first day you are actually
running live with this feature

2) old code left in the system long after it was done being used

3) re-purposed command flag that used to call the old code, but now is used in
the new code

4) only a partial deployment leaving both old and new code working together.

5) inability to quickly diagnose where the problem was

6) you are also managing client orders and have the equivalent of an SLA with
them so you don't want to go nuclear and shut down everything
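
To make points 2-4 concrete, here's a minimal sketch (all names hypothetical,
not Knight's actual code) of how a repurposed flag plus a partial deployment
routes live orders into dead code:

    # Hypothetical illustration: the same flag means different things to old
    # and new code, so a partial deployment makes behaviour server-dependent.

    def run_power_peg(order):      # retired test logic, never meant for production
        print("buying without limit:", order["symbol"])

    def run_rlp(order):            # the new retail liquidity program logic
        print("routing via RLP:", order["symbol"])

    def handle_order_old(order):   # still live on the server that missed the deploy
        if order["flag"]:
            run_power_peg(order)

    def handle_order_new(order):   # live on the seven updated servers
        if order["flag"]:
            run_rlp(order)

    order = {"flag": True, "symbol": "XYZ"}
    handle_order_new(order)        # fine on the updated servers...
    handle_order_old(order)        # ...disaster on the one that was missed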

~~~
LeifCarrotson
> I came away from this thinking the worst part was that their system did
> send them errors; it's just that when you deal with billions of events,
> emailed errors tend to get ignored, because at that scale logging generates
> so many false positives.

I write apps that generate lots of logs too...I think an improvement lies in
some form of automated algorithmic/machine learning (to incorporate a buzzword
in your pitch) log analysis.

When I page through the log in a text editor, or watch `tail` if it's live,
there's a lot of stuff that looks like

    TRACE: 2019-04-01 09:45:03 ID A1D65F19: Request 1234 initiated
    ERROR: 2019-04-01 09:45:04 ID A1D65F19: NumberFormatException: '' is not a valid number in ProfileParser, line 127
    WARN : 2019-04-01 09:45:04 ID A1D65F19: Profile incomplete, default values used
    WARN : 2019-04-01 09:45:14 ID A1D65F19: Timeout: Service did not respond within 10 seconds
    TRACE: 2019-04-01 09:45:14 ID A1D65F19: Request 1234 completed. A = 187263, B = 1.8423, C = $-85.12, T = 11.15s

Visually (or through regex), you can filter out all the "Request initiated"
noise. Maybe the default value warning occurs 10% of the time, and is usually
accompanied by that number format exception (which somebody should address,
but it still functions, and there's other stuff to fix). But maybe the
"Timeout error" hasn't been seen in weeks, and the value of C has always been
positive - that is useful information!

Don't email me when there's a profile incomplete warning. Don't email me any
time there's an "ERROR" entry, because that just makes people reluctant to use
error level logging. Definitely don't email me when there's a unique request
complete string, that's trivially different every time. But do let me know
when something weird is going on!
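
Something like this is easy to prototype. A rough sketch of the
novelty-detection half of the idea (regexes and the threshold are invented for
illustration): strip the variable parts out of each line, then flag only lines
whose remaining "shape" is rare or brand new:

    import re
    from collections import Counter

    seen = Counter()  # template -> how often we've seen that shape before

    def template(line):
        # Replace timestamps, hex IDs and numbers so only the shape remains.
        line = re.sub(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", "<ts>", line)
        line = re.sub(r"\b[0-9A-F]{8}\b", "<id>", line)
        line = re.sub(r"\$?-?\d[\d.,]*", "<num>", line)
        return line

    def is_weird(line, rare_threshold=5):
        t = template(line)
        seen[t] += 1
        return seen[t] <= rare_threshold  # novel or rare shape: worth a look

The "Request initiated" noise collapses into a single template and stops
alerting after the first few occurrences; catching things like the sign flip
on C would need a numeric check on top.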

~~~
exelius
Don't mean to sound snarky, but there are tools that do this and have been for
years. If you've been grepping through logs for the last 3 years, you're doing
it wrong for the cloud era.

Oftentimes the answer is writing better alert triggers that take historical
activity into account to cut down on false positives. Other times it's simply
to reduce the number of alerts. In every case you need an alerting strategy
that balances stakeholder needs, and you need to realign on that strategy
quarterly. It's ultimately an operational problem, not a technical one.

Alas, back in the real world, logging is always the last thing teams have time
to think about...
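
To be concrete, the history-aware trigger can be as simple as comparing
today's count against a trailing baseline (window size and sigma multiplier
here are arbitrary, not from any particular tool):

    from statistics import mean, stdev

    def should_alert(history, today, sigmas=3.0):
        # history: error counts for the last N days; today: today's count.
        if len(history) < 2:
            return False  # not enough data to form a baseline yet
        return today > mean(history) + sigmas * stdev(history)

    print(should_alert([120, 98, 134, 110, 101], 450))  # True: clearly abnormal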

~~~
mattdodge
Care to share what types of tools do this? I'm genuinely interested. I haven't
come across a log management tool that uses AI to detect abnormal conditions
based on the log message contents like the OP describes. I stick to Papertrail
for the most part though so I'm likely out of the loop.

~~~
exelius
I’ve used and really liked DataDog in the past. It has some rudimentary ML
functionality for anomaly detection of certain fields, but it’s only getting
better.

I’ve also had clients in the past use Splunk with ML forecasting models that
inject fields as part of the ingest pipeline. I don’t know the details of that
implementation; I just know how the dev teams were using it.

------
ajuc
Deployment is where the really scary bugs happen most easily.

I worked on warehouse management software that ran on the mobile barcode
scanners each warehouse worker carried; as workers moved stuff around the
warehouse, they confirmed each step with the system by scanning barcodes on
shelves and products.

We had a test mode running on a test database and a production mode running on
the production database, and you could switch between them in a menu during
startup.

During testing/training users were running on the test database, then we
intended to switch the devices to production mode permanently, so that the
startup menu wouldn't show.

A few devices weren't switched for some reason (I suspect they were lost when
we did the switch and found later), and on these devices the startup menu
remained active.

Users were randomly taking devices each day in the morning, and most of them
knew to choose "production" when the menu was showing. Some didn't, and were
choosing the first option instead.

We started getting small inaccuracies in the production database. People were
directed by the system to take 100 units of X from shelf Y, but there were
only 90 units there. We looked at the logs on the (production) database and on
the application server, but everything looked fine.

We were suspecting someone might just be stealing, but later we found examples
where there was more stuff in reality on some shelves than in the system.

At that time we introduced a big change to pathfinding, and we thought the
system was directing users to put products in the wrong places. Mostly we were
trying to confirm that this was the cause of the bugs.

Finally we found the reason, by deploying a change to the thin-client software
running on the mobile devices to gather the log files from all the devices and
send them to the server.

~~~
hinkley
I bet you had one engineer who claimed that the real problem was that the
users were stupid and not that the deployment process was error prone.

I've heard about this case many times before, but somehow the other renditions
downplayed or neglected to mention that the _deployments were manual_. As this
story was first explained to me, one of the servers was not getting updated
code, but I was convinced by the wording that it was a configuration problem
in the deployment logic.

Performing the same action X times doesn't scale. Somewhere north of 5 you
start losing count. Especially if something happens and you restart the
process (did I do #4 again? or am I recalling when I did it an hour ago?)

~~~
sucrose
Was the deployment process in your parent post actually error-prone? From what
I gathered, the developers were unaware of the lost handheld scanners. I
imagine if they had known, they could've proactively taken them out of service
until found.

~~~
ajuc
We had automatic updates in the thin clients (that's how we were able to add
"logging to server" on all of them at once).

The problem was that the startup menu with the testing/production choice was
enabled independently of the autoupdate mechanism (a separate configuration
file ignored by autoupdates) for some technical reason (I think to allow a few
people to test new processes while most of the warehouse worked on the old
version against the production database).
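
In hindsight, the cheap fix would have been to make every device announce at
startup which database it had connected to, so stragglers show up on the
server. A rough sketch (the endpoint and fields are invented):

    import json
    import socket
    from urllib import request

    def report_environment(env, device_id):
        # Phone home at startup so anyone can list devices still on "test".
        payload = json.dumps({
            "device": device_id,
            "host": socket.gethostname(),
            "environment": env,  # "test" or "production"
        }).encode()
        req = request.Request(
            "http://inventory.internal/startup",  # hypothetical endpoint
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        request.urlopen(req, timeout=5)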

------
time0ut
My company's legacy system (which still does most revenue producing work) has
deployment problems like this. The deployment is fully automated, but if it
fails on a server it fails silently.

I rarely work on this system, but had to make an emergency change last summer.
We deployed the change at around 10 pm. A number of our tests failed in a
really strange way. It took several hours to determine that one of the 48
servers still had the old version. Its disk was full, so the new version's
rollout failed. The deployment pipeline happily reported all was well.

We got lucky in that our tests happened to land on the affected server. The
results of this making it past the validation process would have been
catastrophic. Not as catastrophic as this case, I hope, but it'd be bad.

We made a couple of human process changes, like telling the sysadmins to stop
ignoring full-disk warnings (sigh). We also fixed the rollout script to
actually report failures, but I still don't trust it.
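
The check we should have had from day one is trivial: after the rollout, ask
every server which version it is actually running and fail the pipeline loudly
on any mismatch. A sketch of the idea (the /version endpoint is hypothetical):

    from urllib import request

    def verify_rollout(servers, expected_version):
        stale = []
        for host in servers:
            with request.urlopen(f"http://{host}/version", timeout=10) as resp:
                if resp.read().decode().strip() != expected_version:
                    stale.append(host)
        if stale:
            # Fail loudly instead of happily reporting all was well.
            raise RuntimeError(f"servers still on the old version: {stale}")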

~~~
C1sc0cat
Ignoring a disk-full condition! Really?

Handling an out-of-space condition should be part of your test suite - it
certainly was back when I looked after a MapReduce-based billing system at BT,
and that was back in the day when a cluster of 17 systems was a really big
thing.

~~~
lifeisstillgood
I think the parent did well openly and honestly raising a personal example
where missing a "basic" check caused near career changing problems - I applaud
them for sharing a difficult situation.

I was concerned that it's possible to read your comment as if it was critical
of the parent - was that your intention?

~~~
time0ut
To clarify a little, I was responsible for monitoring neither the underlying
hardware nor the deployment systems in this case. I also didn't have access to
fix it myself. It took me a couple hours to go from "random weird test
results" to "full disk broke the deploy".

------
HenryBemis
> Knight did not design these types of messages to be system alerts, and
> Knight personnel generally did not review them when they were received

So they received these 90 minutes before they were executed, and, as so often
happens in organizations, the automated emails flew back and forth without
anyone paying attention.

Also... running new trading code and NOT having someone watch it LIVE at
kick-off is simply irresponsible and reckless.

------
hinkley
I bring up this story every time someone talks about trying to do something
dumb with feature toggles.

(Except I had remembered them losing $250M, not $465M, yeow)

The sad thing about this is that if the engineering team had insisted on
removing the old feature toggle first, deploying that code and letting it
settle, and only _then_ starting work on the new toggle, they might well have
noticed the problem before turning on the flag, and rolling back certainly
would not have caused the catastrophic failure they saw.

Basically they were running with scissors. When I say 'no' in this sort of
situation I almost always get pushback, but I can also find at least a couple
of people who are as insistent as I am. It's okay for your boss to be
disappointed sometimes. That's always going to happen (they're always going to
test boundaries to see if the team is really producing as much as it can).
It's better to have disappointed bosses than bosses who don't trust you.

------
alexeiz
I had a chance to get familiar with the deployment procedures at Knight two
years after the incident. And let me tell you, they were still atrocious. It's
no surprise this thing happened. In fact, what's more surprising is that it
didn't happen again and again (or perhaps it did, just not on such a large
scale).

Anyway, this is what the deployment looked like two years later:

* All configuration files for all production deployments were located in a single directory on an NFS mount. Literally, countless *.ini files for hundreds of production systems in a single directory, without any subdirectories (or any other structure) allowed. The *.ini files themselves were huge, as typically happens in a complex system.

* The deployment config directory was called 'today'. Yesterday's deployment snapshot was called 'yesterday'. That was as much revision control as they had.

* In order to change your system configuration, you'd be given write access to the 'today' directory. So naturally, you could wipe out every other configuration file with a single erroneous command. Stressful enough? That's not all.

* Reviewing config changes was hardly possible. You had to write a description of what you changed, but I never saw anybody attach an actual diff of the changes. Say you changed 10 files: in the absence of a VCS, manually diffing 10 files wasn't something anybody wanted to do.

* The deployment of binaries was also manual. Binaries were on the NFS mount as well. So theoretically, you could replace your single binary and all production servers would pick it up the next day. In practice, though, you'd have multiple versions of your binary, and production servers would use different versions for one reason or another. In order to update all production servers, you'd need to check which version each server used and update that version of the binary.

* There wasn't anything to ensure that changes to configs and binaries were made at the same time in an atomic manner. Nothing to check whether a binary used the correct config. No config or binary version checks, no hash checks, nothing.

Now, count how many ways you can screw up. This is clearly an engineering
failure. You cannot put more people or more process on top of this broken
system to make it reliable. On the upside, I learned more about reliable
deployment and configuration by analyzing the shortcomings of this system than
I ever wanted to know.
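
For contrast, even a minimal deploy manifest would have closed most of these
holes: pin the binary and every config file to a hash and refuse to start on
any mismatch. A rough sketch of the idea (file names and manifest format
invented for illustration):

    import hashlib
    import json

    def sha256(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def verify_deployment(manifest_path):
        # manifest: {"files": {"smars.bin": "<hash>", "smars.ini": "<hash>"}}
        with open(manifest_path) as f:
            manifest = json.load(f)
        for path, expected in manifest["files"].items():
            if sha256(path) != expected:
                raise SystemExit(f"{path} does not match the manifest; aborting")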

------
padseeker
I realize the consensus is that lots of companies do this kind of thing. I
don't know if it's 99%, but the percentage is pretty high.

However, what that neglects is the risk associated with a catastrophic
software error. If you are, say, Instagram, and you lose an uploaded image of
what someone ate for lunch, that is undesirable and inconvenient. The
consequences of that risk, should it come to fruition, are relatively low.

On the other hand, if you employ software developers who are literally the
lifeblood of your automated trading business, you'd think a company like that
would understand the consequences and treat this "cost center" as a critical
asset rather than just a commodity.

Unfortunately, you would be wrong. Nearly every developer I have ever met who
has worked for a trading firm has told me that the general attitude is to
treat nearly all employees who are not generating revenue as a disposable
commodity. It's not just developers but also research, governance,
secretarial, customer service, etc. This is a bit of a broad brush, but
generally the principals and traders of those firms are arrogant and greedy
and cut corners whenever possible.

In this case you'd think these people would be rational enough to know that
cutting corners on your IT staff could be catastrophic. This is where you
would be wrong. Friends of mine who have worked at small and mid-sized
financial firms have told me those firms generally treat their staff like
garbage and routinely push out people who want decent raises, bonuses, etc.
These people are generally greedy, egocentric, and egomaniacal, and they
believe all their employees are leeching directly from their yearly bonus.

This story is not a surprise to me in the least. What's shocking is no one in
the finance industry has learned anything. Instead of looking at this story as
a warning, most of the finance people hear this story and laugh at how stupid
everyone else is and that this would never happen to them personally because
they're so much smarter than everyone else.

~~~
user5994461
>>> Instead of looking at this story as a warning, most of the finance people
hear this story and laugh at how stupid everyone else is and that this would
never happen to them personally because they're so much smarter than everyone
else.

What if we're smarter than everyone else? When I was at a big bank, we had
mandatory source control, lint, unit tests, code coverage, code review,
automated deployment, etc... pretty good tools, actually. Not everybody is
stuck in the stone age.

Even at a small trading company before that, we had most of the tooling,
although not as polished. A very small company with a billion dollars a month
in executed trades. One could say amateur scale.

~~~
padseeker
Big bank is not the same as a small/mid sized trading firm. Banks have
regulations they need to meet, and typically do things by the book.

I'm not an expert here. Part of what I said is based on the 6 different people
I've met who have worked in the industry. I'm just saying that if you have
$400+ million to lose, and you rely on IT infrastructure to make that money,
then you can spend a few million on top-notch people and processes to prevent
this kind of thing. I worked at a relatively large media company, and every
deployment had a go/no-go meeting where knowledgeable professionals asked
probing questions and you defended your decisions. I'd love to know what they
did at Knight Capital. Re-using an existing variable for code that was out of
use strikes me as a terrible idea.

------
malux85
What baffles me is how they got this far into operations with such dreadful
practices. $100-200k could have got them a really solid CI pipeline with
rollbacks, monitoring, testing, etc.

But spend $200,000 on managing $460,000,000? No way!

~~~
free652
How would CI help in this case? It isn't even a software bug; it's a process
issue - they had old code running on one out of 8 servers. The monitoring was
triggered, but no action was taken.

~~~
werbel
I disagree that it isn't a software bug.

"The new RLP code also repurposed a flag" - this is the moment when a terrible
software development idea was executed, and it resulted in all of the mess.

Of course I don't know the full context, and maybe, just maybe, there was a
really solid reason to reuse the flag.

What I observe more often is something like this though:

    
    
      1. We need a flag to change behaviour of X for use case A, let's introduce enable_a flag.
      2. We want similar behaviour change of X also for use case B, let's use the enable_a flag despite the fact the name is not a good fit now.
      3. Turns out use case B needs to be a bit different so let's introduce enable_b flag but not change the previous code so basically we need them both true to handle use case B.
      4. Turns out for use case A we need to do something more but things should stay the same for B.
      5. At this point no one really knows what enable_a and enable_b really do. Hopefully at least someone noticed that enable_a affects use case B.
    

If you have a use case A, create a handle_a flag. If you have a use case B,
create a handle_b flag _even if they do exactly the same thing_, as more than
likely they only do exactly the same thing for now.

What would probably be even better is separate, properly named config flags
for each little behaviour change and just use all 5 of those to handle
different use cases.
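
In code the discipline is cheap. A minimal sketch (names invented) of one flag
per use case, even while the behaviour is still identical:

    # One flag per use case, even though both currently do the same thing.
    FLAGS = {
        "handle_a": True,  # use case A
        "handle_b": True,  # use case B: same behaviour *for now*
    }

    def transform(payload):
        return payload.upper()  # stand-in for the shared behaviour

    def process(case, payload):
        if case == "A" and FLAGS["handle_a"]:
            return transform(payload)
        if case == "B" and FLAGS["handle_b"]:
            return transform(payload)  # duplicating this call is the point:
        return payload                 # B can now diverge without touching A

    print(process("B", "order"))  # "ORDER"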

edit: formatting

~~~
reificator
> _If you have a use case A, create a handle_a flag. If you have a use case B,
> create a handle_b flag even if they do exactly the same thing, as more than
> likely they only do exactly the same thing for now._

A hard lesson to learn, and a hard rule to push for with others who have not
yet learned.

Imagine what our species could do if experience were directly and easily
transferable...

~~~
werbel
Hah exactly :)

Same goes for functions, classes, React components, DB tables and everything
else.

Just model it as closely as possible on the real world. The world doesn't
really change that often. What does change is how we interpret and behave
within it (the logic/behaviour/appearance on top).

If you have a Label and Subheader in your app, create separate components for
them. It doesn't matter that they look exactly the same now. Those are two
separate things and I guarantee you more likely than not at some point they
will differ.

My rule of thumb is: If it's something I can somehow name as an entity
(especially in product and not tech talk) it deserves to be its own entity.

~~~
reificator
It's funny though, because my experience has led me to the exact opposite
approach. Modeling based on real world understanding has been very fragile and
error prone, and instead modeling as data and systems that operate on that
data has been very fruitful.

------
neals
Makes me feel less bad about rm -rf'ing a production database and losing an
hour of client data the other week. Maybe I should show them this...

~~~
throwawaymath
I would argue you shouldn't have been able to do that in your organization
without bypassing (several) significant safeguards.

Did you forget a where clause while deleting data on a table, or were you
actually on the production server hosting the database?

Any code you write that interacts with a database (or really any production
code at all...) should be reviewed before being merged. And developers
shouldn't be running raw SQL commands on a production server. It's hard for me
to see this as anything other than an organizational failure rather than
yours.

EDIT: Based on the number of downvotes this has received, I can only imagine
we have a lot of devs on HN who cowboy SQL in production...holy hell how can
any of what I said be controversial.

~~~
penagwin
While I mostly agree, many companies have a tech department of half a dozen
people, and implementing and enforcing every devops best practice isn't always
realistic.

That said, I'd expect at least a backup of production; then again, he said he
lost 1 hour of data, so it was likely between backups.

~~~
Cthulhu_
If you haven't been able to invest the time to do database maintenance tasks
in a safe way, at the very least enforce a 4-eyes principle and write up a
checklist / script before hacking away in the production database.

I mean I get it, I've made mistakes like this as well knowing I shouldn't have
(we had test and prod running on the same server, about 40K people received a
test push notification). But the bigger your product gets, the less you can
afford to risk losing data.
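
Even without tooling, the checklist can live in the session itself: run the
destructive statement inside a transaction, sanity-check the affected row
count, and only then commit. A sketch with sqlite3 standing in for the real
database:

    import sqlite3

    conn = sqlite3.connect(":memory:")  # stand-in for the production database
    conn.execute("CREATE TABLE sessions (id INTEGER, last_seen TEXT)")
    conn.execute("INSERT INTO sessions VALUES (1, '2018-12-01')")
    conn.commit()

    cur = conn.cursor()
    cur.execute("DELETE FROM sessions WHERE last_seen < '2019-01-01'")
    print("rows that would be deleted:", cur.rowcount)  # eyeball it first
    if 0 < cur.rowcount < 1000:  # the blast radius you expected
        conn.commit()
    else:
        conn.rollback()          # surprised? keep the data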

~~~
penagwin
I totally agree, if feasible those steps should be done!

I was just trying to explain that many businesses like the one I'm at aren't
in the tech business (mine sells wholesale clothing) and have 6 people in the
tech department, so understandably there are limits on how far best practices
can go. While I would usually consider it a mistake, if you thought you were
just making a quick, should-be-read-only query, and it happened to hit some
random edge-case bug and crash a DB... sure, you should have tested that on
the test DB first, but I'd be kinda understanding of how that happened.

It depends on the business too: if you're a startup tech company then yeah,
get your -stuff- together! It's just that a lot of businesses only need their
website and some order management; their focus isn't on the tech side of
things.

------
phodge
Loosely related - this is what terrifies me about deploying to cloud services
like Google which have no hard limit on monthly spend - if background jobs get
stuck in an infinite loop using 100% CPU while I'm away camping, my fledgling
business could be bankrupt by the time I get phone signal back.

~~~
Bartweiss
Woah, how does Google Cloud _still_ not support budget capping?

It has budget alerting, so the capabilities are obviously there, but it's
never been added. Instead, there's just a vaguely insulting guide on writing a
script to catch the alert and trigger a shutdown...

~~~
shawabawa3
Pretty sure Google cloud does support it

Pretty sure aws still doesn't

~~~
Bartweiss
Google App Engine has spending cutoffs. Cloud allows API call cutoffs, but for
actual spend it only has alerts. Their 'controlling budget' page sends you to
a guide on writing your own triggers to respond to those alerts:

[https://cloud.google.com/billing/docs/how-to/budgets](https://cloud.google.com/billing/docs/how-to/budgets)
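
For anyone landing here, the gist of that guide is a Cloud Function subscribed
to the budget's Pub/Sub topic that detaches billing once spend passes the
budget. Roughly like this (project ID hardcoded and error handling omitted;
treat it as a sketch and check the linked docs before relying on it):

    import base64
    import json

    from googleapiclient import discovery

    def stop_billing(event, context):
        # Triggered by a Pub/Sub budget notification from Cloud Billing.
        data = json.loads(base64.b64decode(event["data"]).decode())
        if data["costAmount"] <= data["budgetAmount"]:
            return  # still under budget, nothing to do

        billing = discovery.build("cloudbilling", "v1", cache_discovery=False)
        # Detaching the billing account hard-stops all paid usage.
        billing.projects().updateBillingInfo(
            name="projects/my-project-id",  # hypothetical project
            body={"billingAccountName": ""},
        ).execute()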

------
pjc50
This is one of the classics of the genre. If you're interested in software
reliability/failure, you should read some of COMP.RISKS... and then stop
before you get too depressed to continue.

------
snotrockets
> This is probably the most painful bug report I've ever read

I suggest further reading, starting with Therac-25.

~~~
PhantomGremlin
An honest bug report for the recent Boeing fuckup would be even worse. They
deployed unconscionably shitty software (MCAS system) that killed a total of
346 people in two perfectly airworthy planes.

------
kylek
Totally unrelated, but the title made me think back to one of my previous
roles in the broadcast industry. If you're using a satellite as part of your
platform, every second that you aren't transmitting to your birds
(satellites), you're losing a massive amount of money. There are always a lot
of buffers and redundant circuits in those situations, but things can always
go wrong.

Funny tangent: the breakroom at that job was somewhat near the base stations.
Some days around lunchtime we'd have transmission interruptions. The root
cause turned out to be an old, noisy microwave.

------
vxNsr
Needs (2013) tag. As usual, human negligence is to blame.

~~~
geofft
Is there any case where human negligence is not to blame?

~~~
philipov
Pompeii?

~~~
folli
Building a city on a known lava field?

~~~
ineedasername
It had been "silent" for about 300 years when Pompeii was destroyed. And
before that, it had mainly produced series of small, low-level eruptions.
Basically, for much of known history up to that time, it was a safe place to
live.

------
brootstrap
Just popping in to say I believe the Equifax hack was also due to a 'bad
manual deployment' similar to this. They had a number of servers but didn't
patch one of them. Hackers were able to find this one server with outdated,
vulnerable software and took advantage of it.

I think deploys get better with time, but the initial blast of software
development at a startup is insane. You literally need to do whatever it takes
to get your shit running. Some of these details don't matter because initially
you have no users. But if your company survives for a couple of years and
builds a user base, you still have the same shitty code and practices from the
early days.

------
thanatos_dem
I have no sympathy for high frequency traders losing everything.

There are so many more interesting and meaningful uses of computing than
building a system to out-cheat other systems at manipulating the global market
for the express purpose of amplifying wealth.

~~~
shrimpx
A trading bot is a money-making machine and so is Facebook. What's worse, a
"headless" machine that is directly manipulating buy/sell orders to feed off
market inefficiencies, or a machine that lures humans in, then converts their
attention and time into money?

~~~
nilskidoo
I've thought before that transforming our soft grey matter into gold is
basically what the Rosicrucian alchemists were on about.

------
anonu
I watched the market go haywire on this day. Attentive people made a cool buck
or two as dislocations arose.

What's crazy is that there were already rules in place to prevent stuff like
this from happening - namely the Market Access Rule
[https://www.sec.gov/news/press/2010/2010-210.htm](https://www.sec.gov/news/press/2010/2010-210.htm)
which was in place in 2010.

When the dust settled, Knight sold the entire portfolio via a blind portfolio
bid to GS. It was a couple of $bn gross. I think they made a pretty penny on
this as well.

------
Havoc
>Knight relied primarily on its technology team to attempt to identify and
address the SMARS problem in a live trading environment.

Ah the good old "fk it we'll do it live" approach to managing billions.

------
spyspy
Could use a [2013] tag, but this story is fascinating and horrifying and I re-
read it every time it pops up. It's a textbook case of why a solid CI/CD
pipeline is vital.

~~~
java-man
... and a requirements tracker linked to the code.

------
nickthemagicman
Who did they hire to develop this software?

~~~
ukoki
Incidents such as these are rarely a people failure and nearly always a
process failure. People will always make mistakes — perhaps seniors will make
fewer mistakes than juniors, but no-one makes no mistakes.

~~~
randyrand
People are responsible for making the processes, no? It's still a people
problem.

~~~
ashelmire
Right, it's usually something like this:

A: I'd like to hire some people to improve our processes. It will take time
and money and prevent future problems, but you will never notice.

B: Time and money and no new features? No way, I won't approve that.

A: _tries to sell it some more even though they are technical and not a
salesperson_

B: No.

