
How is team-member-1 doing? - holman
https://about.gitlab.com/2017/03/17/how-is-team-member-1-doing/
======
mmjaa
I knew a kid once. He was a 'junior operator' in a computer room .. you know,
in the good ol' days, where the computers lived. (Before they escaped and
attached themselves to your wrists.)

He thought he was smart. And sometimes, he was.

One day, he overheard a team leader talk to his programmers about the newly-
minted database, sitting there in front of them on the table, on a brand new
.. amazing .. 640Meg hard drive.

This database had consumed the disk. It had cost the company a cool million
dollars to create. It was vital that we backed it up.

So, the new 640 Meg disk was on its way, onto which we'd back the database up.
The first thing we'll do, the leader said, is copy the database, sector by
sector.

"And only then, will we re-index the database!", he claimed. "Until then, the
indexes will remain un-sorted!"

Well, the kid overheard all of this, but only heard "the indexes will remain
un-sorted!".

Later that night, this kid thought he'd prove himself.

He re-indexed the database.

He didn't tell anyone.

The next day, a not-so-junior programmer came in, saw the database disk
attached to the operator machine, and thought that the backup had been done.
For reasons we shall not explain, he disconnected the disk from the operator
machine.

The re-indexing had not yet finished.

The database was gone.

The new disk arrived, but nobody could mount the old database disk. Much panic
ensued!

Operator logs were consulted. The computer room security cam tapes were
spooled.

Oh shit!

Epilogue: I made a lot of money off kids like that, writing a tool to recover
databases that had been corrupted by having their power removed mid re-index ..

~~~
catshirt
> He was a 'junior operator' in a computer room .. you know, in the good ol'
> days, where the computers lived. (Before they escaped and attached
> themselves to your wrists.)

i hope one day i can tell a story like you. that intro is a work of art. :)

~~~
mmjaa
Thanks. :)

------
Declanomous
What I think is particularly noteworthy is not that Gitlab recognized that
anybody could have made that mistake, but rather how supportive Gitlab was
about the whole thing.

When you make a big mistake, it is easy to place yourself in a mindset where
you feel like a disaster even though everyone is accepting. I call it the
"disappointing your parents" mindset, because it can feel a lot like people
are just being supportive because they love you, and what you did was indeed
inexcusable to a certain degree.

The feeling is made somewhat worse when you are an employee, because your
livelihood and your future depend on how other people perceive you. To that
point, I'm really impressed that GitLab addressed the fact that this employee
was still being promoted, and that the mistake hadn't affected that. In my
mind that is at least as important as all of the rah-rah stuff.

~~~
overcast
They really had no choice. If they had been jerks about it, they would have
made an already completely ridiculous scenario ten times worse. Spinning it
into lighthearted commentary every week since then is their PR move.

~~~
Declanomous
That's true, and I do find the semi-celebratory tone a bit self-
congratulatory. That's why I think team-member-1 still being promoted is the
most important factor at play.

~~~
sytse
It wasn't our intention to come across as self-congratulatory, but I can see
how it might come across as such. We take this incident really seriously, and
our production engineers especially are working very hard to improve our
infrastructure.

------
overcast
I really have to just start rolling my eyes at this point. I'm just waiting
for the official meme, and the cycle will be complete. Get to work making your
infrastructure resilient to a simple accidental deletion, and restoring some
faith in your product.

~~~
jeron
> Get to work making your infrastructure resilient to a simple accidental deletion

I'm sure they've been working on that since the deletion

~~~
aquabib
But where are the stories on this work? What improvements have been made?

Detailed posts on that are how you begin to restore confidence.

No one is just going to take their word that "stuff is in place now".

~~~
a3_nm
There is a list of issues in https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/ -- also in https://docs.google.com/document/d/1GCK53YDcBWQveod9kfzW-VCxIABGiryG7_z_6jHdVik/pub, see Recovery, 3, l.

I think it's great that they are being completely transparent about this.

That said, it's true that it's been almost two months, and it seems that some
important issues there are still open and don't look especially active.

~~~
sytse
The follow-up was pretty extensive and we'll be working on it for months to
come. Some issues that have already been addressed:

1. Update PS1 across all hosts to more clearly differentiate between hosts
and environments (a rough sketch of the idea follows this list):
https://gitlab.com/gitlab-com/infrastructure/issues/1094

2. Set PostgreSQL's max_connections to a sane value:
https://gitlab.com/gitlab-com/infrastructure/issues/1096

3. Move staging to the ARM environment:
https://gitlab.com/gitlab-com/infrastructure/issues/1100

4. Improve PostgreSQL replication documentation/runbooks:
https://gitlab.com/gitlab-com/infrastructure/issues/1103

5. Build Streaming Database Backup:
https://gitlab.com/gitlab-com/infrastructure/issues/1152

6. Assign an owner for data durability:
https://gitlab.com/gitlab-com/infrastructure/issues/1163
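
For anyone curious what issue 1 looks like in practice, here's a minimal
sketch of the idea, assuming bash and a hostname convention that marks
production machines - not our actual configuration:

    # Sketch: make it unmistakable which environment a shell is in.
    # Production prompts get a loud red [PRODUCTION] tag.
    if [[ "$(hostname)" == *prod* ]]; then
      PS1='\[\e[1;41m\][PRODUCTION]\[\e[0m\] \u@\h:\w\$ '
    else
      PS1='\u@\h:\w\$ '
    fi

A destructive command typed into the wrong terminal is much harder to miss
when the prompt itself screams at you.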

------
Pharylon
Several years ago, I was a junior dev at a 3PL (3rd-party logistics) company,
troubleshooting an issue with some overnight data imports.

I had it all loaded up in the test environment; I'd delete the import, make
some changes, and re-run it. Wash, rinse, repeat, trying to track down the
issue. As I'm sure you can guess, at one point I executed my script in the
wrong window and deleted last night's import from Production.

I immediately told my boss, who was very understanding, with an "everyone
does this kind of thing at some point" kind of shrug, and went over to our
DBA's office to ask him to re-load the last snapshot. But it turned out the
snapshots had been broken for two weeks and no one had noticed.

And it wasn't a simple issue of re-running the import. After all the orders
had been imported, humans had manually assigned orders to trucks and
dispatched them in the wee hours of the morning. Now there was no way to know
what packages were on what trucks.

I think the DBA ended up buying ApexSQL Log out of pocket to roll back the
deletion with the transaction log.

The result was that for several hours the delivery drivers for a national
office supply company in a certain state were completely unable to use their
handhelds or access their truck's inventory. That was my team-member-1 moment.

~~~
mulmen
What were the consequences to you, the DBA, and the organization?

~~~
keithpeter
I'm interested to know as well.

I imagine that getting the snapshots working properly would be quite an
important factor, otherwise the company would always be a typing error away
from chaos.

~~~
lightbritefight
Working snapshots/backups are system engineers number 1 concern, always. They
mitigate issues with every other system, of any scope. Its the first thing a
new hire sysadmin/eng should ask about coming through the door.

------
gbrindisi
"Bob Hoover, a famous test pilot and frequent performer at air shows, was
returning to his home in Los Angeles from an air show in San Diego. As
described in the magazine Flight Operations, at three hundred feet in the air,
both engines suddenly stopped. By deft maneuvering he managed to land the
plane, but it was badly damaged although nobody was hurt. Hoover’s first act
after the emergency landing was to inspect the airplane’s fuel. Just as he
suspected, the World War II propeller plane he had been flying had been fueled
with jet fuel rather than gasoline. Upon returning to the airport, he asked to
see the mechanic who had serviced his airplane. The young man was sick with
the agony of his mistake. Tears streamed down his face as Hoover approached.
He had just caused the loss of a very expensive plane and could have caused
the loss of three lives as well. You can imagine Hoover’s anger. One could
anticipate the tongue-lashing that this proud and precise pilot would unleash
for that carelessness. But Hoover didn’t scold the mechanic; he didn’t even
criticize him. Instead, he put his big arm around the man’s shoulder and said,
“To show you I’m sure that you’ll never do this again, I want you to service
my F-51 tomorrow.”"

------
sverige
I'm old enough to have made some serious mistakes on the job over the years.
Thankfully, the worst ones were when I was much younger, but I know that I
could still make a worse one before I retire.

Here's what impresses me about Gitlab: They not only say they're committed to
honesty and transparency, they actually practice it.

It's easy to see this as some cynical PR move, but to me it's refreshing that
they have addressed specifically what happened to the employee who made the
error. It makes me believe that they are working very hard on fixing their
practices to ensure this kind of failure won't happen again, and I trust they
will share (as @sytse said in this thread) what those changes are once they
have it sorted out.

"Oh, how naive you are!" some may say, to which I respond, "Oh, how cynical
and inexperienced you are!" Human failure is inevitable. Designing systems
(whether in code or in management practices) that tolerate this inevitable
failure is very difficult.

This sort of event can be the catalyst for tearing out what didn't work and
creating a much stronger foundation for the future, but only if blame is set
aside and honesty is allowed to prevail in the "after action" analysis. Call
it PR if you like, but I see a healthy desire to deal with what actually
happened and fix it rather than falling into the trap of pointlessly assigning
blame.

Consider that it took congressional hearings and someone with the chutzpah of
Richard Feynman for NASA to own up to the shuttle explosion. A far worse
event, with far worse consequences, but in the aftermath NASA's complete
unwillingness to hold itself accountable and deal with reality cost it a lot
of credibility.

Good on you, Gitlab.

------
impappl
They could support their team members by not continuing to make a circus out
of them.

------
Moter8
The first part of the post was interesting and, I guess, funny, but creating
and selling a T-shirt about the accident and all that? IMO this would have
been a fun joke inside the company, but to outsiders, eh. I don't want to
sound grumpy though :)

~~~
YorickPeterse
The shirt is internal only; we're not selling it to the public.

~~~
overcast
For now. Your next blog post will be "Due to popular demand, GitLab Team
Member 1 Shirts For Sale".

~~~
sytse
We won't do this because it would seem like we're not taking this incident
seriously and are trying to monetize an outage that affected our users. We're
also not giving them away.

------
oblio
I generally like the openness, but, GitLab marketing team, if you're
listening: stop spamming social media content on your blog; it seems cheesy
and lazy.

A few tweets or comments are more than enough to prove your point.

They did something similar with the storage post, which was full of Hacker
News opinions.

So, this, or at least post my comment on your blog, too :D

~~~
dchest
Looks like "spamming" completely lost its meaning.

~~~
halostatue
Or at least it’s going back to its original meaning from Monty Python.

------
sofaofthedamned
When I had my first IT job in ~1991, I caused millions of pounds' worth of
loss to my employer, a well-known retailer in the UK, due to a bug I wrote.

My boss covered my arse. I love that man, and I've never made a serious
mistake since, as it's made me risk-averse.

Gitlab did the right thing here by owning the situation and making it public.

~~~
pestaa
Now you'll just _have to_ share that story or else I won't get any sleep
tonight. Please.

~~~
sofaofthedamned
I only posted this a few weeks ago (check my comment history) but here goes
again:

I was a programmer in my first IT job, in 1992, for a large retailer in the
UK. I was working on some stock-related code for the branches, of which they
had thousands. They sold a lot of local goods, like books that were only sold
in a couple of stores each - think autobiographies of local politicians,
local charity calendars, that sort of thing. The problem with a lot of these
items was that they were not in the central database. This caused a problem
with books especially, as you don't pay VAT on books - but if you can't
identify the book, the company had to pay it. This makes sense, because some
books or magazines you DID pay VAT on, since they came with other stuff -
think computer magazines with a CD on the front. So my code looked at
different databases and historical info to work out the actual VAT portion
payable, which was usually nil.

I wrote the code (COBOL, kill me now), the testers tested it, and all went OK
until they deployed, on a Friday night. The first I knew of it was coming in
on Monday morning. All the ops had been working throughout the weekend, as
the entire stock status for each branch had been wiped. They had to pull a
previous week's backup from storage; this didn't work, as they didn't have
the space for both copies to merge, so IBM had to motorcycle-courier some
hardware from Amsterdam, etc. As this was an IBM mainframe with batch jobs,
we also had to stop subsequent jobs in case they made the fuckup worse, so
none of the stock/finance stuff could run at all.

The branches were royally fucked on Monday as, without any stock status to
know what to order, they got nothing - no newspapers, books, anything. We
even made it into the Daily Mail. I think it took at least 3 weeks before
ordering was automatic again. It cost the company literally millions in
overtime, lost sales, consultants, and reputational damage - it was big news
in the national newspapers.

The root cause? I processed data in one run per branch. I'd copy the branch's
data to a separate area, delete the main data, then stream it back. My SQL,
however, deleted the main data for ALL branches. It didn't get picked up in
QA because, like me, they only tested with a single branch's dataset at a
time.
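
To make the failure mode concrete, here's a hypothetical sketch in modern
terms - shell plus Postgres rather than the COBOL batch jobs of the actual
story, with made-up table and column names:

    #!/bin/bash
    # Hypothetical analogy: refresh stock data one branch at a time.
    for branch in 001 002 003; do
      # Copy this branch's rows to a working area...
      psql -c "CREATE TABLE work_$branch AS
               SELECT * FROM stock WHERE branch_id = '$branch';"
      # ...then clear them before streaming the corrected data back.
      # BUG: no "WHERE branch_id = '$branch'" here, so the very first
      # iteration deletes EVERY branch's rows, not just this one's.
      psql -c "DELETE FROM stock;"
    done

Testing with a single branch's dataset hides the bug completely: with only
one branch in the table, the scoped and unscoped deletes behave identically.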

~~~
bartvk
Wow.... Very interesting, thanks for sharing.

------
matt4077
The Gitlab PR team is certainly doing a much better job than their engineering
team did.

I actually appreciate their attitude towards errors by employees.

Unfortunately, the appearance this spectacle creates is that the same sort of
attitude should apply to them as a company, i.e. "Don't fire GitLab! You've
just invested 200MB of data into their education."

It's a very smart method to protect not just team-member-1, but also
employee-1.

~~~
mhink
I was actually about to make a comment to this effect. They took a legitimate
disaster and handled it _perfectly_; as the old saying goes, "No publicity is
bad publicity." They've taken the time over the past few weeks to constantly
release relevant blog posts, which is good for two reasons: 1) they reassure
customers that they're taking steps to prevent the problem in the future, and
2) they're capitalizing on natural curiosity to boost brand awareness.
(Although maybe a bit too much, according to the minor grumbling in this
comment section. ;) )

I've been a bit wishy-washy on GitLab for a while, but honestly, I'm thinking
I might give them a shot sometime soon.

------
madamelic
His page says: "Database (removal) Specialist at GitLab"

Love it.

------
anderber
I agree with the mentality that this is a team effort and, when it fails, a
team failure - and that when something goes wrong, what's important is to
understand it and put in place a way for it not to happen again. Kudos to
GitLab for their forward-thinking way of working.

------
jaz46
So who's team-member-1 for the Amazon S3 outage a few weeks back? I'm sure
they feel the same way, and we'd love to send them gifts to support them,
just like the community supported GitLab.

------
winteriscoming
Yet another GitLab post on how open and transparent they are and how they're
being praised for it. It's looking more and more like GitLab will be known
for being a transparent company rather than recognized for its product or
technical competency.

Like I said in another post a while back, it's fine to be transparent, but
GitLab has taken this to an extreme. It's important to be private about
certain details, just get real work done, and be known for that.

~~~
borplk
It's the same crap that Buffer pulled.

------
tschellenbach
Perfectly happy with GitHub, but seriously, can I hire you guys as a PR
agency? :)

~~~
jaz46
+1. Hats off to GitLab for being awesome!

------
ar-jan
Pedantic note: I'm sure under 1. Technical Skills, the "I think this is out of
the question here" should read "I think there's no question about this" or
something along those lines.

~~~
sytse
Yep, I think we won't update the blog post because we wanted to post it
verbatim.

------
nickpsecurity
They probably could've designed a good backup-and-restore strategy with the
time that was invested in this piece: a combo of full backups plus
append-only storage of changes going a certain amount of time into the past.
That worked for me for a long, long time despite my many screwups. Even when
I lost all my stuff to a triple storage failure, I still recovered a tiny bit
stored on my cheap, write-once solution: DVD-Rs. There was some bit rot, but
better than bit loss. I imagine their solution would be better done with a
filesystem or backup software.
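
For illustration, here's a minimal sketch of that full-plus-append-only-
increments idea using GNU tar, with made-up paths (/data, /backup) - one
possible implementation, not what GitLab or anyone here actually runs:

    #!/bin/bash
    # First run writes a full backup; each later run appends an
    # incremental archive holding only what changed since last time.
    SNAR=/backup/state.snar               # tar's incremental state file
    DEST=/backup/$(date +%Y-%m-%d_%H%M)
    mkdir -p "$DEST"
    if [ ! -f "$SNAR" ]; then
      tar --listed-incremental="$SNAR" -czf "$DEST/full.tar.gz" /data
    else
      tar --listed-incremental="$SNAR" -czf "$DEST/incr.tar.gz" /data
    fi

Because old archives are never rewritten, the history stays append-only;
restoring means unpacking the full archive and then each increment in order.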

Note: It was neat that much of the community was supportive. I see the article
as really a thank you to them.

~~~
kozak
Instead of DVD-Rs, I now use write-protected USB flash drives (Netac U335 or
similar) with their write-protect switches melted in place with a soldering
iron. I know this doesn't protect against hardware failure, but most data
loss is actually caused by user actions or software bugs. Store several of
them at different locations to protect against other threats.

~~~
YorickPeterse
We are considering using S4
([http://www.supersimplestorageservice.com/](http://www.supersimplestorageservice.com/)),
it's probably the best place to store your important data.

~~~
ludwigvan
No, too expensive and proprietary. I can't understand why people pay for
these tools when you can build the same thing using open-source technologies.

Here's a script I use at home for this:

    
    
      #!/bin/bash
      tar -cf - ~/Documents/ > /dev/null

------
imode
been there. crashed a client's website on a friday by updating the wrong
plugins for a Joomla site.

given a sufficiently complex dependency chain for presented problems, anybody
can be a 'team-member-1'.

mistakes happen at all levels.

