
“A Windows 7 deployment image was accidently sent to all Windows machines” - slyall
http://it.emory.edu/windows7-incident/
======
perlgeek
[https://twitter.com/DEVOPS_BORAT/status/41587168870797312](https://twitter.com/DEVOPS_BORAT/status/41587168870797312)

"To make error is human. To propagate error to all server in automatic way is
#devops."

Frankly, I'm surprised things like this don't happen more often. Kudos for the
incident management. Also a big plus for having working backups, it seems.

~~~
IgorPartola
@DEVOPS_BORAT is actually very insightful in about 1 in 5 tweets. Snide for sure,
but there are quite a few good points in there if you read carefully:

"In devops we have best minds of generation are deal with flaky VPN client."

"Single point of failure in private cloud is of usually Unix guy with
neckbeard."

These are gold.

Edit: based on the above advice I once grew out a neckbeard while going
through a multi-month rollout of a large product. It itched like crazy, but I
did work much faster to get rid of it.

~~~
Havoc
>best minds of generation are deal with flaky VPN client

So true. I'm on the receiving side of this..."No you can't work on that multi
million deadline project of yours...the only way to fix the VPN is to re-image
the machine back at head office [an international flight away]". Me..."Could
you repeat that?" And that's a Cisco Enterprise VPN...(turns out IT was
right...re-image & avoid conflicting software is the only solution). So much
for Cisco...

~~~
philtar
Are you calling yourself one of the best minds of our generation?

~~~
Havoc
Hardly. No I meant on the receiving side of techs trying to fix VPNs.

~~~
c0nsumer
Professionally I deal with much of the fallout from problems such as yours,
and leading techs doing this kind of work. It really sucks, but for many
problems like this the choice becomes spend-four-hours-reimaging-the-machine
or spend-unknown-period-of-time-trying-to-fix-new-problem. The latter would be
great if it was less than four hours, but it's often not, and until that time
you / the user are without a machine.

After an hour or so of troubleshooting it's usually better to go with the
reimaging, since all you / the user wants is to get back to working.

Ideally I try to get the entire broken machine captured and the user issued a
new, fixed machine because then a fix can be developed and documented, but for
those who end up in a new failure mode, it sucks. And with something like the
Cisco VPN Agent? That's not uncommon at all...

~~~
Havoc
>spend-four-hours-reimaging-the-machine or spend-unknown-period-of-time-
trying-to-fix-new-problem

Definitely. In our case it's 8 hours minimum though for a re-image. Somehow the
FDE makes pulling the old data off the machine slow.

You've got my sympathies though - I'd not like to be the one doing the IT in
these cases. Can't be fun troubleshooting IT with that kind of time pressure.

~~~
c0nsumer
Thank you. It really, honestly is hard on our tech because they feel the
pressure from all sides. Eight hours sounds rough for a reimage. I think ours
are... maybe two or three? We've done a lot of work to get the reimage time
down, and Win7 (WIMs) have made this really nice.

If this is something that smells of a bigger problem (or has been seen
elsewhere) then I push for them to get the user a wholly new machine,
capturing the old one for analysis. If the user is given an upgraded machine,
then there is usually little resistance, even with the downtime that'll be
incurred.

On the upside, if the issue can be reproduced readily, from this we can almost
always get root cause and put a systemic fix in place. If it's sporadic...
Well... I'm sure you understand how it goes trying to fix something that you
can't yet reproduce. ;)

(I'd love to troubleshoot your slow data backup issue... That's the stuff I
rather enjoy.)

~~~
Havoc
>I'd love to troubleshoot your slow data backup issue... That's the stuff I
rather enjoy.

I'm not directly involved with the tech side so I don't know the details. I
gather they pull the old data off the disk using some offline low-level tool
though (like you would for harddrive damage recovery). Between that and the
encryption it's somehow very slow. No idea why it's like that though.

>get the user a wholly new machine

I wish it was the same here. They just give loan machines :/

------
miles
Snark and sarcasm aside, I am impressed with the level of detail that the IT
department is sharing; it is refreshing to see such a disaster being discussed
so openly and honestly, while at the same time treating customers like adults.

~~~
harrystone
It is impressive. I can't imagine dealing with that kind of nightmare.

I once worked as an admin on Solaris boxes at a big pharma company. There were
~77,000 users in their LDAP directory. I was very careful.

~~~
cfreeman
That sounds terrifying.

------
beloch
This reminds me of my undergrad CPSC days. The CPSC department had their own
*nix-based mainframe system that was separate from the rest of the University.
The sysadmin was a pretty smart guy who was making less than a third of what
he could get in industry. Eventually he got fed up and left. About a week or
two later the servers had a whole cascade of failures that resulted in
everyone losing every last bit of work they'd done over the weekend (This was
a weekend near the end of the semester when everyone was in crunch mode).

Long story short, the sysadmin was hired back and paid more than most of the
profs. Academia may tend to skimp on salaries for certain positions, but
sysadmins probably shouldn't be one of them.

~~~
gertef
The BOFH sabotaged his systems before departing? Didn't build a backup plan?
And they hired him back?

~~~
SDGT
You know what? Fuck them. Fuck higher education completely. Undervalued and
underpaid is the name of the game for any important IT roles in that shit hole
of an industry.

Disclaimer: dev at a university.

~~~
beloch
To be fair, a lot of the people working in support roles in academia are
pretty much unemployable in the real world. They show up at 10 am, take
constant smoke breaks all day, and leave at 3 pm. When I was in physics (which
used the main university servers for most things) we had a sysadmin who was in
charge of some printers and a couple of server boxes. He had inherited those
boxes from a former student who set them up, but he was functionally
illiterate in managing them. At one point I needed a package installed. Not
only could he not figure out how to install a package on an Ubuntu server on
his own, he couldn't do it with emailed instructions either. I had to go up
and physically stand over him telling him what to click on and what to type.
To make matters worse, he was so hard to actually catch "in the office", that
I had to have the department secretary (whose office he was next to) alert me
when he showed up. Not surprisingly, the functions of those servers were soon
moved to desktop machines in various offices. As far as I know he's still
working there though. He's a union employee and it would be a ride through
deepest, hottest, hell to get rid of him.

Note: I am not saying all university support staff are like this. Some
definitely are though, and they're probably the reason why good people
sometimes find it hard to be properly remunerated in academia.

~~~
misnome
> Note: I am not saying all university support staff are like this. Some
> definitely are though, and they're probably the reason why good people
> sometimes find it hard to be properly remunerated in academia.

Certainly not everyone is like him - but I'd wager every university has at
least a couple of people like him (we definitely had one, again, in physics)

------
Fomite
Reminds me of some emails that went out at my old university during a cluster
outage, and got progressively more informal as the night went on, detailing
people leaving dinners with extended families, a growing sense of desperation,
etc. The last email might as well have ended with "Tell my wife I love her."

It was both direct and funny enough that I was only mildly annoyed that the
cluster was down.

------
facorreia
> A Windows 7 deployment image was accidently sent to all Windows machines,
> including laptops, desktops, and even servers. This image started with a
> repartition / reformat set of tasks.

Wow. That is very unfortunate, to say the least...

~~~
spiantino
> As soon as the accident was discovered, the SCCM server was powered off –
> however, by that time, the SCCM server itself had been repartitioned and
> reformatted.

~~~
malkia
I feel bad, but I laughed at this for a few seconds...

~~~
__david__
I wouldn't feel bad. I guffawed at most of this story.

Not in a "haha, what a bunch of morons. Serves those jerks right!" kind of
way, but more in a "oh dear, that's the worst thing that can possibly happen!
Oh no it gets worse??". I've been through IT catastrophes (and caused a couple
myself) and I could easily see this happening to me. Still, it's funny as
anything.

------
Fuzzwah
I've just been hired to run a project using SCCM to upgrade ~5000 PCs from XP
to Win7.

This was amazing reading. Reading such a detailed wrap up of an IT team going
through my worst possible nightmare was enlightening.

~~~
sswaner
It must be reassuring that your new employer is so proactive in getting that
XP upgrade project running.

~~~
Sanddancer
Sometimes these projects get bogged down because of three important problems
outside of the control of IT. First off, you need to get in touch with all the
upstream vendors to get updates to any sort of custom software that has
compatibility problems with newer versions of Windows; Vista/7 got a lot more
strict about giving admin access, for example, which may cause problems in the
updates. Second, you've got to keep in mind the training costs. There are a
lot of users who may be brilliant financial minds that can make numbers dance
and bow to their whims, but get terribly locked up if an icon changes. Doesn't
make them horrible people by any means, but you've got to keep it in
consideration when planning a rollout. Finally, you have to keep in mind the
petty turf wars. If Joe in Accounting gets the upgrade to 7 before Bob in
Legal, Bob in Legal may feel slighted and start raising a holy shitstorm, even
if he's scheduled to be upgraded a week later. Upgrades are ugly, no matter
when they happen. Sometimes that proactive upgrade project takes many years
just because of all the moving parts involved.

~~~
lugg
I think you missed the joke.

------
stark3
There was a similar catastrophe at Jewel-Osco stores many years ago. Nightly,
items added to the store POS were merged back with the main item file at each
store location. The format of the merged data was exactly the same as loading
a new file, except the first statement would be /EDIT instead of /LOAD.

One of the programmers decided to eliminate some code by combining the two
functions, with a switch to control whether /LOAD or /EDIT was used for the
first statement.

There was a bug in the program, and the edits were sent down as loads.

A guy I knew, Barry, was the main operator that night. He started getting
calls from the stores after around 10 of them had been reloaded with 5 or 6
items.

Barry said it was the first time he got to meet the president of the company
that day.

~~~
pierlux
Failure should never be the only time you get to meet upper management :(

~~~
ceejayoz
A night IT operator for an organization with 176 locations is pretty unlikely
to ever meet the company president.

------
jonmrodriguez
Forgive my beginner question:

Since a reformat was done to the affected machines, does this mean that
researchers' datasets, drafts of papers, and other IP were lost? Or were
researchers' machines not affected?

~~~
wtallis
In my experience with campus networks, home directories are never stored
locally on any remotely-administered machines. Any specially-configured
researcher's machine that stored data locally would not have been subscribed
to get the automatically deployed OS images.

~~~
pbhjpbhj
>" _As soon as the accident was discovered, the SCCM server was powered off –
however, by that time, the SCCM server itself had been repartitioned and
reformatted._ " //

If the SCCM server was pushed the "update" then there doesn't seem much hope
for other machines? Surely no rule should be able to format the server running
the ruleset; seems like a failsafe failure there at least.
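The kind of failsafe described here could be sketched roughly as follows. This is only an illustration in Python, not how SCCM works: `check_targets`, `UnsafeDeployment`, and the host names are hypothetical, and the sketch assumes a destructive job has a plain list of target host names that can be inspected before it starts.

```python
import socket


class UnsafeDeployment(Exception):
    """Raised when a destructive job would target a protected host."""


def check_targets(targets, protected=None):
    """Return the target list unchanged, or raise if it includes a protected host.

    `protected` defaults to the machine running this code, standing in for
    the (hypothetical) deployment server itself.
    """
    if protected is None:
        protected = {socket.gethostname()}
    # Compare case-insensitively, since host names are not case-sensitive.
    protected_lower = {name.lower() for name in protected}
    hits = sorted(t for t in targets if t.lower() in protected_lower)
    if hits:
        raise UnsafeDeployment(f"refusing to wipe protected hosts: {hits}")
    return list(targets)
```

Calling `check_targets(all_hosts)` before a repartition step would make the job fail loudly instead of wiping the box that is running it.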

~~~
wtallis
None of the storage servers should have been storing the user data on the same
volume as the OS the way a client machine would. So the network-mounted home
directories should be intact and ready to use once the server OS is
reinstalled. And while I don't know how SCCM works, I'd be surprised if this
image push was affecting anything other than the primary physical drive (a
wipe-all, populate-one recipe would be too obviously wrong and dangerous,
right?).

------
8ig8
Mistakes are made. In related news...

Lawn care error kills most of Ohio college's grass

[http://www.wral.com/lawn-care-error-kills-most-of-ohio-
colle...](http://www.wral.com/lawn-care-error-kills-most-of-ohio-college-s-
grass/13650157/)

~~~
frogpelt
I can understand the Emory problem a lot more simply because the distance
between the decision and implementation of the decision is not that far.

How on earth do you "accidentally" load up enough weed killer to treat 54
acres of grass and never realize it's the wrong stuff?

~~~
dfc
How do you click/not-click the button in SCCM...? How are these two events any
different?

------
pling
Not quite as disastrous but when I was at university the resident
administrators configured the entire site's tftp server (everything was
netbooted Suns) to boot from the network. This was fine until there was a
site-wide power blip and it was shut down. When it came back it couldn't tftp
to itself to boot because it wasn't booted yet (feel the paradox!). Cue 300
angry workstation users descending on the computer centre with pitchforks and
torches because their workstations couldn't boot either...

Bad stuff doesn't just happen to Windows networks.
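A circular boot dependency like this can be caught ahead of time if the dependencies are written down. A minimal sketch, assuming a hand-maintained map of "host -> hosts it needs in order to boot" (the host names and `boot_order` are made up for illustration), using Python's standard `graphlib`:

```python
from graphlib import CycleError, TopologicalSorter


def boot_order(depends_on):
    """Return an order in which hosts can boot, given what each one needs.

    `depends_on` maps a host name to the set of hosts that must be up first.
    Raises graphlib.CycleError if no valid order exists, for example when a
    server needs itself in order to boot.
    """
    return list(TopologicalSorter(depends_on).static_order())
```

With `{"tftp": set(), "workstation": {"tftp"}}` this returns the server before the workstation; with `{"tftp": {"tftp"}}` it raises `CycleError`, which is the paradox above expressed as data.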

------
rfolstad
On the bright side they are no longer running XP!

------
rfrey
My nomination of the top bullet point of 2014:

* As soon as the accident was discovered, the SCCM server was powered off – however, by that time, the SCCM server itself had been repartitioned and reformatted.

------
chromaton
Reminds me of The Website Is Down, episode 4:
[https://www.youtube.com/watch?v=v0mwT3DkG4w](https://www.youtube.com/watch?v=v0mwT3DkG4w)

------
Fuzzwah
I was just watching the "What’s New with OS Deployment in Configuration
Manager and the Microsoft Deployment Toolkit" session from TechED and hit the
section on the "check readiness" option which MS have added to SCCM 2012 in R2. It
sounds like having this as part of the task sequence at Emory would have (at
the very least) stopped this OS push from hosing all the servers.

[http://channel9.msdn.com/Events/TechEd/NorthAmerica/2014/PCI...](http://channel9.msdn.com/Events/TechEd/NorthAmerica/2014/PCIT-B340)

------
randlet
Reading that just made me feel sick to my stomach and my heart goes out to the
poor gal/guy that pushed "Go" on that one. Shit happens, but a screw up that
big can be devastating to one's psyche.

------
grumblepeet
I _very_ nearly did this whilst working for a University back in the early
noughties. Luckily I managed to get to the server before the "advert"
activated and wiped out everything. It was so easy to do I am surprised that
it is still possible. I feel for their pain, but it does sound like they are
doing a good job of mopping up. I did allow myself a snort of laughter when I
read the bit about the server being reimaged as well. That is pretty darn
impressive carpet bombing the entire campus.

------
mehrdada
_As soon as the accident was discovered, the SCCM server was powered off –
however, by that time, the SCCM server itself had been repartitioned and
reformatted._

I guess that's what the robot apocalypse is gonna look like.

------
smegel
Automation can also mean automated disaster.

~~~
Already__Taken
Bigger tools have bigger, sharper edges. It's important that this message is
stressed as the developer and operator worlds merge in the middle.

------
sergiotapia
Isn't this more the fault of the system architect than the guy who
accidentally fired the bad deploys?

It's similar to a database firehose: If you accidentally start deleting all
data you should have a quick working backup ready to quickly bring the dead
box up to production.

~~~
fred_durst
I don't know. This could very well be a case of not much more than a bad drag
and drop in SCCM. It's not quite that simple, but I'm not sure this is some
custom process they set up.

~~~
not_kurt_godel
Any tool that allows you to easily perform the antithesis of its function
without making it abundantly clear what clicking the OK button will do is
fundamentally broken.

------
tbyehl
I've built a few systems for deploying Windows... and the last thing that
every one of them did before writing a new partition table and laying down an
image was to check for existing partitions and require manual intervention if
any were found.
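The guard described here boils down to a very small decision. A rough Python sketch, under the assumption that the imaging tool can already list the partitions on the target disk (how it does so is out of scope, and `may_repartition` is a hypothetical name):

```python
def may_repartition(existing_partitions, operator_confirmed=False):
    """Decide whether an imaging job may write a new partition table.

    An empty disk is safe to image unattended. A disk that already has
    partitions is only touched when a human has explicitly confirmed.
    """
    if not existing_partitions:
        return True
    return bool(operator_confirmed)
```

The point of the design is the default: an unattended job on a disk with data gets `False` unless someone deliberately passes `operator_confirmed=True`.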

------
imgur

      > As soon as the accident was discovered, the SCCM server was powered off – however, by that time, the SCCM server itself had been repartitioned and reformatted.
    

That made me laugh. Poor SCCM server :)

------
svec
With great power comes great responsibility.

------
deckar01
"As soon as the accident was discovered, the SCCM server was powered off –
however, by that time, the SCCM server itself had been repartitioned and
reformatted."

Unicast fail.

------
keehun
I asked my friend attending Emory right now, and he didn't even realize
anything was going on. He says that the Emory IT department has a notorious
distinction on campus as being regularly terrible, mostly with an unreliable
internet connection.

However, it looks like they handled this accident the best they could! Perhaps
this accident would not have happened at a more reliable IT department.

------
ww520
Disasters as well as mistakes are unavoidable, such is life. A hallmark of a
competent organization is how they handle the situation and recover from
disasters or mistakes.

So far all the signs have indicated they are doing great in recovering. I just
hope there won't be onerous processes and restrictions afterward out of a desire
to "make sure it won't happen again".

------
durkie
hah! delighted to see this here.

my roommate works at the emory library and has had a fun slow week there of
coming home early many days because no one could do work. they were apparently
also given laptops as an interim solution, but those somehow also wiped
themselves eventually (?).

poor IT people...just as they're starting to get a handle on the actual
situation it starts blowing up on the internet.

------
zacharycohn
I thought this "accident" may have been on purpose... until they mentioned the
servers.

In my days of university tech support.

------
sorennielsen
This happened at a former workplace too. Only the Solaris and Linux
servers were untouched.

It "mildly" amused the *nix operations guys to see all the "point and click"
colleagues panic.

------
k_sze
Funny how they mention iTunes as one of the "key components" that are restored
first, whereas Visio, Project, and Adobe applications are relegated to a second
round.

~~~
zachlipton
Presumably iTunes is part of their base system image for all workstations,
along with Office, Firefox, Adobe Reader, and the like. In other words, a
basic set of software to handle standard officework tasks. iTunes is free and
IT would probably rather distribute it everywhere than have people trying to
install it themselves (or calling the helpdesk to get someone with
administrator rights to do it). They then offer additional applications on an
as-needed basis to individuals and departments with specific tasks. So the
designers who do print publications and the faculty who teach digital art
might get the Adobe suite, while people in Facilities who plan construction
will get Project. This keeps licensing costs down and simplifies systems
according to their uses.

~~~
commandar
Generally you're going to keep your base images in SCCM limited to software
that's only infrequently updated. Otherwise somebody has to update the entire
image every time an update gets pushed out. Instead, you package them up and
deploy the apps on top of the base image at install time. It takes a little
longer to deploy, but it takes less admin time to manage since the actual
installs are automated anyway.

------
gojomo
"... to the cloud!"

[http://www.youtube.com/watch?v=jR6xbulUmsg](http://www.youtube.com/watch?v=jR6xbulUmsg)

"Yay, cloud!"

------
nissehulth
Next time I'm about to complain about a bad day at the office, I will read
this story again.

------
lucio
reads like a short dystopian novel

------
CamperBob2
stark3, you seem to be hellbanned.

~~~
taspeotis
What gives you that idea?

[https://news.ycombinator.com/item?id=7758203](https://news.ycombinator.com/item?id=7758203)
is a dupe of
[https://news.ycombinator.com/item?id=7758193](https://news.ycombinator.com/item?id=7758193)

~~~
CamperBob2
(Sorry, didn't see the dupe.)

------
mantrax5
You know how in movies you need at least two people to bring their special
secret keys, plug them in, and turn them at once to enable a self-destruct
sequence?

That is a real principle in interface design - if something would be _really,
really bad_ to activate unintentionally, make it _really, really hard_ to
activate.

If you design a nuclear missile facility, you don't put the "launch nukes"
button right next to "check email" and "open facebook".

Same way it shouldn't be easy for users to delete or corrupt their data by
accident due to some omnipotent action innocently shoved right in between
other trivial actions.

I wouldn't blame the person who triggered this re-imaging process. I'd blame
those who designed the re-imaging interface, to allow it to happen so easily
by accident.

~~~
kalleboo
In my experience, the key is that the UI makes it clear exactly what you're
doing. What I mean is, instead of a button that says "Start Imaging", it
should be "Start Imaging of the 12,600 computers this rule applies to". Of
course, that's a lot more work for the programmer, so it's never done.

~~~
lstamour
But then that leads to people automatically clicking "Yes" to the "Are you
sure..." dialog. Though even I would pause at "Are you sure you want to
reformat 12,500 machines including this one?" ;-)

~~~
rectangletangle
Having people type in some sort of auto-generated confirmation key might give
them pause, seeing as the technique is seldom used.

~~~
kalleboo
Or in my original example, have them enter in the # of computers it's going to
apply to.
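A minimal sketch of that idea in Python, assuming the tool already knows how many machines the rule matches (`confirm_by_count` is a made-up name, not part of any real product):

```python
def confirm_by_count(target_count, typed):
    """Return True only if the operator typed the exact number of targets.

    Thousands separators are tolerated, so "12,600" matches 12600.
    Anything that is not a number counts as a refusal.
    """
    try:
        entered = int(typed.strip().replace(",", ""))
    except ValueError:
        return False
    return entered == target_count
```

Usage would be something like `confirm_by_count(12600, input("Type the number of computers to reimage: "))`. Unlike a Yes/No dialog, the operator has to read the number before they can get past it.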

------
leccine
We accidentally re-imaged all of the Windows servers with Linux the other day.
Nobody noticed though...

~~~
jacalata
That probably means nobody is using the servers you maintain and your
department should be eliminated completely.

~~~
leccine
Or somebody was just joking... Who knows...

~~~
jacalata
maybe I was joking too!

~~~
zyx321
I've found that jokes tend to get downvoted on HN. Your post seems to be still
in the black.

~~~
leccine
No sense of humor is bad for you. But on HN it is normal.

------
filmgirlcw
I've never been prouder of my alma mater. /s

~~~
gt1337h
You should be. Emory is the 2nd best school in Georgia! :)

