
At Scale, Rare Events Aren’t Rare - mthurman
http://perspectives.mvdirona.com/2017/04/at-scale-rare-events-arent-rare/
======
CJefferson
One quote I've heard that I find helps non-experts:

You don't have to personally prepare for winning the lottery, but the lottery
has to prepare for somebody winning.

------
kyleschiller
As a pretty good rule of thumb, a system that fails 1/nth of the time and has
n opportunities to fail has a ~.63 probability of failing at least once, where
n is more than ~10.

Graph: [http://www.meta-
calculator.com/online/?panel-102-graph&data-...](http://www.meta-
calculator.com/online/?panel-102-graph&data-bounds-xMin=-10&data-bounds-
xMax=10&data-bounds-yMin=-7.28&data-bounds-yMax=7.28&data-
equations-0=%22y%3D1-\(1-\(1%2Fx\)\)%5Ex%22&data-rand=undefined&data-
hideGrid=false)
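
A quick numerical check of the rule of thumb above, as a minimal sketch. It assumes the n opportunities are independent and each fails with probability 1/n; the function name is made up for illustration.

```python
import math


def p_at_least_one_failure(n: int) -> float:
    """Chance of at least one failure in n independent tries, each failing with probability 1/n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


limit = 1.0 - math.exp(-1.0)  # 1 - 1/e, the value the expression approaches as n grows

for n in (2, 10, 100, 1_000, 1_000_000):
    print(f"n={n:>9,}  P(at least one failure) = {p_at_least_one_failure(n):.4f}")
print(f"limit 1 - 1/e = {limit:.4f}")
```

For n = 10 this already gives about 0.65, and larger n only moves it closer to 1 - 1/e, roughly 0.632.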

~~~
Rainymood
Very nice rule of thumb, honestly I did not expect it to (sort of) converge to
~63%. Does anyone have some intuition for this?

~~~
SamReidHughes
I'd say, hey, how do you calculate (1-h)^k? First take the natural log:
ln((1-h)^k) = k ln(1-h) ≈ -kh. And then exponentiate back up: e^(-kh). (For
small values of h, ln(1-h) ≈ -h by linear approximation.) (Edit: Wiped out
looong comment.)
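
Written out step by step, the argument above connects to the ~.63 figure roughly like this. It is a sketch that assumes h is small and plugs in h = 1/n, k = n from the rule of thumb upthread:

```latex
\begin{align*}
\ln\bigl((1-h)^k\bigr) &= k \ln(1-h) \approx -kh
  && \text{since } \ln(1-h) = -h - \tfrac{h^2}{2} - \dots \approx -h \\
(1-h)^k &\approx e^{-kh} \\
\Bigl(1-\tfrac{1}{n}\Bigr)^{n} &\approx e^{-1} \approx 0.368
  && \text{taking } h = \tfrac{1}{n},\ k = n \\
1 - \Bigl(1-\tfrac{1}{n}\Bigr)^{n} &\approx 1 - \tfrac{1}{e} \approx 0.632
\end{align*}
```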

~~~
StavrosK
I think by "intuition" the GP meant "for the non-mathematicians" :P

~~~
pc86
It's always amusing when someone asks for a layman/non-math/intuitive reason
why something works out and HN responds with a 3-paragraph long proof that
seems to always require university-level math. And it seems those comments
almost invariably start with "Oh, you just..."

~~~
TeMPOraL
'pedrosorio gave a nice one upthread[0].

Ultimately, it's hard to give a math-free explanation for something that comes
out straight from math. If you break down an explanation into small enough
steps, they should be comprehensible for anyone even if they have to take some
steps on faith.

\--

[0] -
[https://news.ycombinator.com/item?id=14040434](https://news.ycombinator.com/item?id=14040434)

~~~
StavrosK
He did, yes, I was just amused by the GP's answer!

------
relyio
Coincidentally, a couple months ago one of my professors told me about James
Hamilton, apparently, they met when he was studying at Waterloo. I started
reading his blog.

This guy is brilliant. And his comments are often gems. The articles are good
too but think of them as conversation openers. The real deal is in the
comments imho. I recommend it. A somewhat funny article is about Zynga, the
"prodigal kid" who left AWS for on-premises, only to come back later:
[http://perspectives.mvdirona.com/2015/05/the-return-to-
the-c...](http://perspectives.mvdirona.com/2015/05/the-return-to-the-cloud/)

In the comments, he debunks (or takes a shot at debunking) the idea that cloud
providers like AWS aren't a good fit for organizations which have massive, but
stable workloads.

This guy is so cool. I only wish he had more time to write.

------
jasonallen
It reminds me of an old story about Microsoft Windows. Back in the early
2000s, compiling and building Windows from source code took many hours on
very specialized build hardware. Meanwhile there were thousands of developers
who contributed to the full Windows stack. If any developer checked in a build
failure, it would cause the build to be delayed. Well, at that scale (of
thousands of developers), you can't compile Windows even if devs only commit 1
build break per year. Bad times...
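
To put rough numbers on the story above, here is a minimal sketch. The headcount of 3,000 is a hypothetical stand-in for "thousands of developers", and it assumes each developer breaks the build independently, once per year on average, on a uniformly random day.

```python
def build_break_stats(developers: int,
                      breaks_per_dev_per_year: float = 1.0,
                      days_per_year: int = 365) -> tuple[float, float]:
    """Return (expected build breaks per day, probability of a day with no break)."""
    # Chance that one given developer breaks the build on one given day.
    p_break = breaks_per_dev_per_year / days_per_year
    expected_breaks_per_day = developers * p_break
    # Every developer has to have a clean day for the build to survive.
    p_clean_day = (1.0 - p_break) ** developers
    return expected_breaks_per_day, p_clean_day


expected, p_clean = build_break_stats(developers=3_000)  # 3,000 is an assumed headcount
print(f"expected breaks per day: {expected:.1f}")
print(f"chance of a break-free day: {p_clean:.5f}")
```

Under those assumptions there are about eight breaks a day, and a break-free day is well under a one-in-a-thousand event.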

~~~
InclinedPlane
There was essentially instability and chaos in the big dev-heavy divisions at
MS when they all worked in one branch, but that led very rapidly to more
sophisticated models with more points of validation between the average dev
and the common code/builds that QA and everyone else shared and used.

~~~
lolive
The common pattern nowadays is to code review each commit, build each commit
on Jenkins & pass tests, git bisect, etc... What was the procedure back then?

~~~
tscs37
"don't break the build" was the procedure.

~~~
lolive
amen!

------
cookiecaper
"One in a million is next Tuesday" from Larry Osterman is another great
reflection on this from the software perspective. [1]

[1]
[https://blogs.msdn.microsoft.com/larryosterman/2004/03/30/on...](https://blogs.msdn.microsoft.com/larryosterman/2004/03/30/one-
in-a-million-is-next-tuesday/)

------
ath0
Love this article but think the headline actually makes the wrong point - this
is a product management issue, not a "it may never happen" issue. That it
takes someone like James to know two wildly different domains - both the
business-level details at risk (a $1M generator is worth potentially damaging
if the alternative is a guaranteed $100M revenue loss) and the details of
power engineering (overriding the switch only risks the generator, not a
datacenter fire or loss of life) - is a shame.

Could the power engineering team have made this tradeoff more clear to the
project managers doing the initial install? And yet, exposing a million little
configuration options to the end-user isn't the right approach either.

------
iNerdier
It may be silly but if something is unlikely I stop and remind myself: if
something is a 'one in a million' chance, then it happened to 7000 people
today.

~~~
cafebabbe
"one in a million" happens several times per second, in today's CPUs :)

Fun small experiment. Pick a random number between 1 and 15000000 (odds for
winning the lottery in my country). Loop until you pick it again (I know it's
pseudorandom and cyclic, but the period seems big enough for the experiment).

Watch how freaking fast it happens (sub-second), and how many iterations it
took (tens of millions).
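
The experiment above might look like this in Python, as a rough sketch. The function names are made up, and the demo is scaled down to 10,000 outcomes because pure Python is a lot slower than the sub-second run described; passing 15_000_000 works the same way but takes longer.

```python
import random


def draws_until_repeat(n_outcomes: int, rng: random.Random) -> int:
    """Pick a target in 1..n_outcomes, then count draws until it shows up again."""
    target = rng.randint(1, n_outcomes)
    draws = 0
    while True:
        draws += 1
        if rng.randint(1, n_outcomes) == target:
            return draws


def mean_draws(n_outcomes: int, trials: int, seed: int = 0) -> float:
    """Average number of draws over several independent runs of the experiment."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return sum(draws_until_repeat(n_outcomes, rng) for _ in range(trials)) / trials


# Scaled-down demo: 10,000 outcomes instead of 15,000,000 so it finishes quickly.
# The number of draws follows a geometric distribution, so its mean is n_outcomes.
print(mean_draws(10_000, trials=20))
```

Any single run can land far from the mean, which is why the sketch averages over a few trials.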

~~~
AstralStorm
Also why ECC memory is used... and chances to corrupt a bit of memory are much
lower than that.

------
kev009
Storage devices regularly go berserk in really novel and interesting ways when
you have a large enough pool. Most projects I've worked on, I've known enough
to fix the bugs, had a working theory of what the issues were and could fix if
I really needed to after higher priorities, or could somehow work around. With
storage devices, I'm frequently bewildered and stuck maybe to the last
category at best. There are times when I sit back and just think, wow, how
amazing is it that computers work at all knowing the things that do go wrong.

~~~
jacquesm
> There are times when I sit back and just think, wow, how amazing is it that
> computers work at all knowing the things that do go wrong.

There is a special category of bugs named for that kind of feeling: they're
called schrödinbugs. The idea is that once you've noticed that something
couldn't work it promptly stops working.

------
Avernar
The part about the switchgear vendor deciding to do something a certain way
that the customer didn't want because it can cause a rare failure reminded me
of something that happened to me. Way back I bought a 1500VA UPS that was not
an APC but still a known brand to protect my home server. The decision was
based on cost as it was significantly less money.

One night I was near the server when the power went out. So I sat there
waiting to see the auto shutdown. Soon enough the UPS told the server to
shut down and it was well on its way to power off. Just before it shut off the
power came back. And the UPS stopped beeping and went back to normal
operation... while the server completed shutdown.

And now I have a server that's off and if I wasn't around I would not have
known what happened. When I got the UPS I just did one test to make sure the
server shut off and the UPS shut itself off without draining its batteries.
This meant that I plugged it back in AFTER the UPS powered off. I never
considered that the manufacturer of the UPS would botch the sequence where
power is restored after the server has been told to shut down.

I contacted the manufacturer about this. I told them that after telling the
server to shut down there was only a brief window where a power restored
signal would maybe abort shutdown. Once the UPS monitoring program is
terminated during shutdown there's no turning back. Nothing came of that.

So now I buy only APC gear. They do the proper thing: if the AC power
comes back after a shutdown command is issued, the UPS continues the
shutdown sequence. Then when the UPS shuts off, it sees the power is back on,
restarts itself, and the server comes back online.

Other manufacturers may do it correctly and the one I dealt with might have
clued in and fixed it but I'm not willing to gamble anymore.

------
dredmorbius
Though it's not all that rare, there's the question of dealing with death in
large online social or identity networks.

With Google having some 3 billion Android / Chrome / Gmail profiles, and
Facebook roughly as many users, standard actuarial statistics suggest that,
even if allowing for multiple profiles per human, the number of newly dead
accounts per day probably runs to the tens of thousands.

(Globally, deaths run about 120,000/day, so the figure's within reason.)

Which means you probably want to consider your processes for such matters, as
well as various related issues, such as mistakenly presuming a user has died,
or being falsely informed, how to handle data assets after death, etc., etc.

Scale matters.
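
A back-of-the-envelope version of the estimate above, as a hedged sketch. It assumes account holders die at the global average rate (ignoring the age and geography of the user base), uses the 120,000 deaths/day figure from the comment, and takes a world population of about 7.5 billion and the profiles-per-person ratios as assumptions.

```python
def expected_daily_account_deaths(profiles: float,
                                  profiles_per_person: float,
                                  world_population: float,
                                  global_deaths_per_day: float) -> float:
    """Estimate how many account holders die per day, assuming the global average death rate."""
    people = profiles / profiles_per_person
    daily_death_rate = global_deaths_per_day / world_population
    return people * daily_death_rate


# 3 billion profiles and 120,000 deaths/day come from the comment above;
# 7.5 billion people and the profiles-per-person values are assumptions.
for per_person in (1, 2, 3):
    estimate = expected_daily_account_deaths(3e9, per_person, 7.5e9, 120_000)
    print(f"{per_person} profile(s) per person: ~{estimate:,.0f} deaths/day")
```

Even at three profiles per person the estimate stays in the tens of thousands per day.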

------
snowwrestler
Related: Wired profile of the author

[https://www.wired.com/2013/02/james-hamilton-
amazon/](https://www.wired.com/2013/02/james-hamilton-amazon/)

------
graphememes
Unfortunately most people don't realize this until they are at scale.

~~~
sametmax
It's ok, since most people never get "at scale". When you start getting these
problems, it means you reached the stars you were shooting for. Having those
issues is a good thing.

~~~
nadermx
Assuming you have the resources to cover these events. Sometimes you hit
scale, but not profitability.

~~~
sametmax
Yes but in that case the scaling technical issues are not your main problem.
Your main problem is your business model. Priorities, priorities...

------
mirimir
Reminds me of this: [http://spectrum.ieee.org/tech-talk/computing/it/nsa-data-
cen...](http://spectrum.ieee.org/tech-talk/computing/it/nsa-data-center-
electrical-problems-arent-that-shocking)

------
Turing_Machine
There was an article a while back that used the term "Walmart scale". If some
weird customer interaction happens one time in a million transactions, it
happens ten times a day at Walmart.
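
The arithmetic behind "Walmart scale" can be sketched as below. The daily volume of 10 million transactions is only what the "ten times a day" figure implies, not a number from the article, and the function names are illustrative. Events are assumed independent.

```python
import math


def expected_occurrences(p_event: float, opportunities: int) -> float:
    """Expected number of times an event with probability p_event occurs."""
    return p_event * opportunities


def p_at_least_once(p_event: float, opportunities: int) -> float:
    """1 - (1 - p)^n, computed via log1p/expm1 to stay accurate for tiny p."""
    return -math.expm1(opportunities * math.log1p(-p_event))


p = 1e-6                         # "one time in a million transactions"
daily_transactions = 10_000_000  # hypothetical volume implied by "ten times a day"

print(expected_occurrences(p, daily_transactions))
print(p_at_least_once(p, daily_transactions))
```

At that volume the event is expected about ten times a day, and a day without it would itself be the rare event.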

------
jzl
"You know, the most amazing thing happened to me tonight. I saw a car with the
license plate ARW 357. Can you imagine? Of all the millions of license plates
in the state, what was the chance that I would see that particular one
tonight? Amazing!" \-- Richard Feynman

------
oakridge
Well, Murphy's law. Any non-zero probability event will occur at least once
given infinite time.

~~~
throwaway2048
More like infinity times

x * ∞ = ∞ where x > 0

------
jmcdiesel
In an infinite universe, nothing is rare, by that logic?

Rare is relative, so just because something happens a trillion times it can
still be nearly nonexistently rare, given the data sample is trillions of
trillions?

Seems like a silly thought..

------
doug1001
I would have chosen a slightly different title, perhaps "at scale, improbable
events are frequent" or something like that

"at scale" of course just means much more frequent sampling; it's not some
sort of alternate reality where good becomes evil, rare becomes not rare, etc

------
juskrey
Rare events are self-similar.

------
ianai
And yet we still haven't observed proton decay.

~~~
AstralStorm
Partly because observing it requires very specific conditions. That makes the
scale really tiny actually.

------
kriro
It's a one in a million shot. But it might just work.

------
stcredzero
There is a market opportunity here for serving customers who would rather risk
broken generators but ensure constant power.

