
The woes of incremental resource drains in big systems - luu
http://rachelbythebay.com/w/2018/09/06/test/
======
outworlder
> Finally, there's the non-technical angle to this, particularly when
> organizations grow to be large and there's no longer the "we're all on the
> same team" ethos.

 _Sigh_

This is when the tug-of-war for resources starts, be it headcount, time, money,
servers or, in this case, db connections.

> Every time something broke and people got together to talk about it in a
> previous life, inevitably a question would come up: could we have caught
> this with a test, or with testing, or some other before-it-goes-to-the-world
> step? Many times, the question was just rote, since the answer was
> frequently "yes" or similar.

This is the kind of question I get all the time, after a multidisciplinary
team has spent hours debugging an issue. Yes, had we spent the required
man-hours looking into the system architecture and its actual behavior under
normal conditions, we possibly would have caught it, along with a whole host
of other issues that have yet to happen. Resources are rarely allocated to
look into things that are not yet (completely) breaking...

~~~
mmsimanga
> Resources are rarely allocated to look into things that are not yet
> (completely) breaking...

This is my experience too. The cynical person in me often entertains the
thought of letting a system crash so the powers that be give it attention and
resources. What is even more worrying is the hero worship that goes with
fixing the crash. Rarely do I see anyone recognised for preventing the system
from crashing.

------
JimboOmega
I've found there's a bit of a cycle to how this is handled.

We had some resource related issues and for a time we were required to write a
system impact doc (describing in detail any new resources used) with _every
change_. Even, say, fixing some copy.

We've backed off from that now, thankfully, but only informally. The next time
somebody puts a bunch of additional load on a cache server without telling
anyone, and it falls over, that requirement will return.

"You should take time to think about the resource implications of your change
and inform the appropriate teams" is of course a good idea. But it's not
easily fixed with process. No amount of process will prevent somebody from
calling the fast_get_the_thing function without realizing it's a terrible hack
somebody had hoped would never see the light of day, and that any attempt to
use it will open database connections like crazy because of the weird
implications of the obsolete driver that it (and your system) relies on.
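
One thing that can catch what process misses is a hard cap in the code itself.
A minimal sketch in Python, assuming every connection is opened through a
single guard; `ConnectionBudget`, `ConnectionBudgetExceeded` and `open_fn` are
hypothetical names for illustration, not anything from the article or from a
real driver:

```python
import threading
from contextlib import contextmanager


class ConnectionBudgetExceeded(RuntimeError):
    """Raised when a caller asks for more connections than the budget allows."""


class ConnectionBudget:
    """Hypothetical guard that caps how many connections are open at once."""

    def __init__(self, limit):
        self.limit = limit
        self._slots = threading.BoundedSemaphore(limit)

    @contextmanager
    def connection(self, open_fn):
        # Fail fast instead of queueing, so a runaway caller shows up as an
        # error in its own code path rather than as a dead database.
        if not self._slots.acquire(blocking=False):
            raise ConnectionBudgetExceeded(
                f"more than {self.limit} connections open at once"
            )
        try:
            # open_fn is whatever actually opens the connection (assumed).
            yield open_fn()
        finally:
            # Give the slot back whether the caller succeeded or not.
            self._slots.release()
```

It would not tell anyone that fast_get_the_thing is a hack, but the person
calling it would hit the limit in their own tests instead of in production.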

------
londons_explore
Rachel by the bay sounds like a Googler...

When you guys solve this problem, come tell us the answer please.

