
Post-Mortem and Security Advisory: Data Exposure After travis-ci.com Outage - xtreak29
https://blog.travis-ci.com/2018-04-03-incident-post-mortem
======
cjbprime
Kudos for a thorough and transparent writeup, and (by the looks of things)
understanding that processes fail rather than individuals.

That said, I have to admit to having at least three eye-bulge WTF moments
while reading this.

I'm also surprised that there isn't a Remediation step of "firewall the
development machines away from the production database".

(And isn't the change to database_cleaner to make it throw when run against
remote databases by default a serious break of API compatibility? What if
someone's depending on that behavior?)

~~~
mk89
Indeed!

What a great way to see failure: what went wrong, how to improve without
blaming it on the poor guy. I wish all companies were like this - very forward
thinking!

Funny thing is... this morning on my commute I was watching
[https://www.oreilly.com/ideas/developing-a-chaos-
architectur...](https://www.oreilly.com/ideas/developing-a-chaos-architecture-
mindset) \- which I recommend to anyone!

~~~
twic
I think it is now widely understood that best practice is to blame processes,
not people.

However, I do have a niggling worry that letting people off the hook risks
stunting their personal growth. If you can make mistakes and not bear any
responsibility for them, what drives you to get better at your job?

~~~
IAmEveryone
I think when the circumstances actually show someone was grossly negligent,
it's very hard to convincingly pretend that only processes are to blame.
Remember that the current thinking is working against thousands of years of a
cultural pattern assigning responsibility.

It's also interesting how many seem to be taking this idea, which was meant to
be applied to _internal_ processes and individuals, and applying it to the
companies themselves.

The most significant example was Gitlab: they had something like four backup
mechanisms for their database, yet three(!) of them had not been working for
months, with nobody noticing. Then, someone actually dropped the production
database. The last working backup system also had some problems I can't
remember right now.

Yet in the comments here, there was barely a hint of criticism of Gitlab.
Instead, they were lauded for not lynching the poor guy who made the last of
what must have been dozens of terrible decisions leading to the incident.

Travis' incident here does actually seem like the sort of freak accident that
one can't rule out completely, at least not at a company of their size. But
forgiveness might not always be warranted.

~~~
solatic
> I think when the circumstances actually show someone was grossly negligent,
> it's very hard to convincingly pretend that only processes are to blame.

On the contrary - the entire reason why you transition from culture to process
as you grow is that when you measure by results, there's no difference
between unintentional negligence (leaving a production terminal open à la
Travis), intentional negligence (let me bypass this annoying check, let me
procrastinate on fixing the backups à la GitLab), and malice.

If you get large enough, you will have grossly negligent people, by
statistical inevitability. You can either accept this statistical
inevitability and design your process for it or you can continue to believe
that you (and everybody else) really actually do only ever hire the very best.

~~~
subway
GP is pretty clearly talking about blaming the company (processes) and not
individuals.

The entire post laments the fact that entities external to the company will
repeatedly forgive the company if it goes through the motions of publishing a
blameless post-mortem - even if those post-mortems repeatedly indicate the
company didn't really learn anything, or change the processes that led up to
the incident.

But by golly, they're a forward-thinking company, and therefore blameless.

------
geofft
So one of the ways of analyzing the root cause here is that the autoincrement
index of the user table in their database is security-sensitive, and
relatively normal DB operations like "Let's roll back the DB" have serious
security implications involving ID reuse. What are some ways to make this less
dangerous? (The rest of it was an operational failure, but it would have been
less trouble if it weren't a security failure.)

I can think of the following:

\- Don't use auth cookies that are signed messages consisting of a UID +
expiration date and other data, use auth cookies that are opaque keys into
some valid-auth database. This is significantly less efficient (every
operation needs a lookup into the DB before you can do anything; if you move
it into a cache you now risk the cache being out-of-date with your DB). AFAIK
using signed UIDs has no security downside other than this, right?

\- Identify users by usernames, not by UIDs. This makes renaming users (which
GitHub allows, so Travis is forced to allow) difficult and security-risky.

\- Use UIDs that are selected from a large random space to make collisions
unlikely, e.g., UUIDs or preferably 256-bit random strings. This seems fine
and probably preferable from a security point of view. Is this fine from the
DB point of view?

Anything else? Maybe a DB restore-from-backup option that preserves
autoincrement counters and nothing else - is that a standard tool?
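The trade-off in the first option can be sketched with a toy in-memory "valid-auth database" (the names and token formats below are illustrative assumptions, not Travis's actual scheme):

```python
import hashlib, hmac, secrets, time

SECRET = b"server-signing-key"  # hypothetical signing key

def signed_cookie(uid: int) -> str:
    # The status quo: uid + expiry + HMAC. The cookie is valid for
    # WHOEVER holds this uid later (e.g. after an ID-reusing restore),
    # and nothing server-side can be consulted or revoked.
    payload = f"{uid}:{int(time.time()) + 3600}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"

# The opaque alternative: the cookie is a random key into a valid-auth
# store. A DB rollback deletes the session rows, so stale tokens simply
# stop working.
sessions = {}  # stand-in for the valid-auth database

def opaque_cookie(uid: int) -> str:
    token = secrets.token_urlsafe(32)
    sessions[token] = uid
    return token

def check_opaque(token: str):
    return sessions.get(token)  # None once the row is gone

tok = opaque_cookie(42)
assert check_opaque(tok) == 42
sessions.clear()               # simulate the rollback wiping the table
assert check_opaque(tok) is None
```

The cost is exactly the lookup on every request that the comment mentions; the benefit is that the DB (or its restore) stays the single source of truth for who is logged in.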

~~~
jonny_eh
I use a mix: UIDs (auto-incrementing primary keys) for internal app use (e.g.
joins), but use UUIDs for referencing records outside the app (anything sent
out over the API).

The UIDs just make for easier-to-read logs, and db queries that are easier to
inspect and hand-write. The UUIDs just seem much more secure when
communicating with client applications.

I also agree that authing with a DB is preferable, if you can afford to do
that at the scale your app needs (which is most apps out there).

~~~
dalore
Why not just use a UUID for the primary key as well and have only one column?
Having two columns seems like extra complexity for not much gain. Modern
databases usually have a native UUID column type which stores and compares
better than using a string/char type.

One more benefit of using UUIDs for primary keys is that clients can generate
models along with the primary key and know what the key will be BEFORE they
submit it to the database. That is, it works really well with distributed
systems. An auto-incrementing primary key is really state stored at the
database level, and a source of contention.

I would even argue that UUIDs make logs far easier to read. You grep for the
UUID in the log and it's easier to find than grepping for integer primary
keys. If your log contains disparate models, UUIDs will find your entries
more easily than integers.

If you ever have to roll back your database but keep some of the new data,
it's easier too. You just export the data you want to keep, do your rollback,
and then import that data, knowing that auto-incrementing primary keys will
never be an issue.

You can copy objects easily from one database to another and know the primary
key will be the same, and not clash.

For hand writing db queries, it's also better to use a UUID since then I can
use that same query cut and pasted into other databases with the same data and
know that I don't have to check primary keys. The UUID is just a copy/paste.
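The client-side key generation point can be sketched with SQLite as a stand-in (the table and column names are made up):

```python
import sqlite3, uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE widgets (id TEXT PRIMARY KEY, name TEXT)")

# The client mints the key itself and knows it before the INSERT ever
# reaches the database - no round-trip, no sequence contention.
new_id = str(uuid.uuid4())
conn.execute("INSERT INTO widgets VALUES (?, ?)", (new_id, "sprocket"))

row = conn.execute("SELECT name FROM widgets WHERE id = ?", (new_id,)).fetchone()
assert row == ("sprocket",)
```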

~~~
bearjaws
A UUID primary key has to live in every index in the table and every foreign
key that references it.

We have a table with a UUID as its primary key, and that key alone consumes
22GB; then an index references it, so that's another 22GB... That one primary
key uses over 100GB of storage. A developer recently went to add another
table that references it, and we had to decide whether we were okay taking
another hit. If we used your example of a char type (we use blob) it would be
double the size...

It doesn't sound like much, but in a database with all primary keys being
UUIDs you are going to inflate the size of your DB quickly, or have to forego
using foreign keys. I imagine if we used only UUIDs we would have to double or
triple our database infrastructure. Additionally, we would be forced to
introduce partitioning sooner.

Now we have a tech debt ticket to add an auto increment to that table so we
can reclaim disk / memory.
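For a rough feel of the numbers involved (the row count and per-key byte sizes below are illustrative assumptions, ignoring per-entry index overhead, which varies by engine):

```python
# Key bytes only: 8 for a bigint, 16 for a binary uuid, 36 for char(36).
rows = 500_000_000  # hypothetical row count
bigint, uuid_bin, uuid_char = 8, 16, 36

for name, size in [("bigint", bigint), ("uuid (binary)", uuid_bin),
                   ("uuid (char)", uuid_char)]:
    gib = rows * size / 2**30
    print(f"{name:>14}: {gib:6.1f} GiB per index that carries the key")
```

Every secondary index and every foreign key repeats the cost, which is why the multiplier, not the single column, is what hurts.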

~~~
dalore
I would argue that's a reasonable hit for the gains. I would never add a tech
debt ticket just to reclaim disk / memory; that's an anti-pattern.

Memory and disk space is cheap.

Also, UUIDs are just 128-bit integers internally. My example clearly points
out that most modern databases have a native uuid type (and if not, use a
128-bit integer).

Using a char type to store UUIDs is very wasteful and no wonder your indexes
are so huge.

The tech debt ticket you really have is to convert the type of the uuid
column from char/blob to native.

Note also that if you need to expose your data, you can either expose the
auto-incrementing primary key (which leaks data, in that other people can
work out growth rates of various models, like how England worked out how many
German tanks were being built based on auto-incrementing serial numbers), or
you create another column (as someone else suggested) which is a uuid. Which
you would need to index anyway.

If index size is an issue then you can just use integers which are randomly
generated. You can calculate your collision chance based on how large your
integer is and how often you create records. If you happen to pick 128-bit
integers, congratulations, you're using UUIDs.
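The collision-chance calculation mentioned here is the birthday approximation; a small sketch (the key counts are arbitrary examples):

```python
import math

def collision_probability(n_keys: int, bits: int) -> float:
    # Birthday approximation: p ~= 1 - exp(-n^2 / (2 * 2^bits))
    space = 2 ** bits
    return 1 - math.exp(-n_keys ** 2 / (2 * space))

# A billion random 64-bit ids: already a few percent chance of a clash...
p64 = collision_probability(10 ** 9, 64)
# ...while the 122 random bits of a v4 UUID stay effectively collision-free.
p122 = collision_probability(10 ** 9, 122)
print(p64, p122)
```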

~~~
bearjaws
>Using a char type to store UUIDs is very wasteful and no wonder your indexes
are so huge.

That was a hypothetical; it's 22GiB using BLOB storage.

>Note also that if you need to expose your data, you can either expose the
auto incrementing primary key (which leaks data, in that other people can work
out growth rates of various models, like how England worked out how many
German tanks were being built based on a auto incrementing serial number), or
you create a another column (as someone else suggested) which is a uuid. Which
you would need to index anyway.

As I stated, we use the auto-increment inside of the application; everything
exposed by the API is a UUID. While we lose 32 bits of space per row, just one
foreign key or index saves enough space to justify it.

Best of both worlds.

>Memory and disk space is cheap.

Not when you need to move from multiple r4.8XLs to r4.16XLs (think multi-AZ,
multi-region). It would work out to around double the price; we could hire a
developer for the cost of hosting that infrastructure. Even when your business
has the money, you don't want to be the guy telling them to drop another $24k
a month on hosting costs because you didn't think one year ahead.

This of course ignores all the issues that come with replication and
partitioning, multi-master etc.

------
drinchev
> The shell the tests ran in unknowingly had a DATABASE_URL environment
> variable set as our production database. It was an old terminal window in a
> tmux session that had been used for inspecting production data many days
> before. The developer returned to this window and executed the test suite
> with the DATABASE_URL still set.

I was expecting something like this. I remember configuring my terminal
windows to change their background when I'm on production systems [2], around
the time I read about the GitLab database incident [1].

1 : [https://about.gitlab.com/2017/02/01/gitlab-dot-com-
database-...](https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-
incident/)

2 : [http://www.drinchev.com/blog/ssh-and-terminal-
background/](http://www.drinchev.com/blog/ssh-and-terminal-background/)

~~~
cjbprime
Though note that wouldn't help here, at least according to a common sense
reading of this part of the post:

> we connected our development environment to a production database with write
> access

i.e. they weren't logged in to a production machine, they were logged in to a
development machine that was permitted to connect directly with write access
to the production database.

~~~
geofft
Since it's a local shell, doing something like this is a lot easier - just
change $PS1 to add some colors when certain variables are set.

I use something like this today on my personal machine to distinguish between
my personal and work email addresses in $EMAIL (for silly firewall reasons
it's easier to originate work-owned OSS on my personal machine), and on my
work machine to tell me _which_ production zone I'm talking to. I don't have
it trigger whenever any such variable is set at all, though; I probably
should do that.

------
xtreak29
I was surprised about a tmux session connected to the production DB for days.
Though it was idle, there are a lot of things that can go wrong during window
switching. My colleague also pointed out the subtle error of assuming the
value of DATABASE_URL from the environment instead of having it set
explicitly by the test script, which could have avoided this.
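That second point - having the test harness pin DATABASE_URL explicitly instead of trusting whatever the shell inherited - might look something like this (the URL and helper name are hypothetical):

```python
import os
from contextlib import contextmanager

TEST_DB_URL = "postgres://localhost:5432/app_test"  # hypothetical test DB

@contextmanager
def pinned_database_url(url=TEST_DB_URL):
    """Run a block with DATABASE_URL set explicitly, restoring it after."""
    saved = os.environ.get("DATABASE_URL")
    os.environ["DATABASE_URL"] = url
    try:
        yield
    finally:
        if saved is None:
            os.environ.pop("DATABASE_URL", None)
        else:
            os.environ["DATABASE_URL"] = saved

# Even with a production URL lurking in the shell, the tests see only
# the test database.
os.environ["DATABASE_URL"] = "postgres://prod.example.com/app"  # the lurking value
with pinned_database_url():
    assert os.environ["DATABASE_URL"] == TEST_DB_URL
assert os.environ["DATABASE_URL"].endswith("prod.example.com/app")
```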

That being said I am amazed at their transparency over the whole issue and a
thorough write up of the whole incident. It's something we can all learn from.

~~~
daveevad
I am not a tmux user, so maybe this is obvious, but how are the environment
variables transferring from the production machines to where the tests are
being run?

Or, is this stating that the developer started running tests from the
production environment?

~~~
oneweekwonder
> but how are the environment variables transferring from the production
> machines to where the tests are being run?

tmux[1] is a terminal multiplexer, a screen alternative.

Some speculation: they might have started the session on a remote machine,
set `DATABASE_URL`, used it for something, and detached from the session,
only to later reattach with the bash env still intact; the developer then
executed the test commands without being aware of it.

Another tmux "problem" is developers using it to tail logs with an infinite
back-scroll, only to fill up server memory. One solution was to kill all tmux
sessions nightly via cron. That might have helped here.

Finally, as a TUI lover: try tmux! It really keeps the frustration at bay to
know that if you ssh to a box and your session drops, your terminal workspace
is still intact.

\- [https://github.com/tmux/tmux/wiki](https://github.com/tmux/tmux/wiki)

------
kgilpin
If you are interested in a list of steps you can take to avoid this happening
to your data, here are some suggestions. I don't believe that any single
measure is sufficient. And I also believe that it's valid to balance the
strictness of your controls against the amount of protection you really
need.

1\. Vault the passwords. People and machines should fetch passwords on-demand
using identity credentials.

2\. Create a read-only database account. In all cases, use the account that
matches the need. Running reports? Use the read-only account.

3\. Restrict access to read-only and read-write database accounts. Provide
this account information to a limited set of people and tools.

4\. Provide a fairly straightforward way for people to get temporary elevated
access. If it's easy to get elevated access, then users will not be tempted to
"hold on" to elevated access longer than they should (e.g. by leaving a
terminal open for a very long time).

5\. Rotate the credentials of all the accounts regularly. This ensures that
temporary elevated access does not become long-term access. It also greatly
reduces the harm created when credentials are leaked, exposed, or forgotten
about (e.g. in an environment variable in an old window).

Note that none of the steps above require a heavy investment in automation.
You can start with basic (even fully manual) processes for key management and
access management, and evolve to automation as you grow.
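A toy model of steps 1 and 4 - credentials fetched on demand that expire on their own, so nothing long-lived sits in a terminal - could look like this (the vault class and role names are invented for illustration, not a real vault API):

```python
import secrets, time

class ToyVault:
    """In-memory stand-in for a secrets vault issuing short-lived leases."""

    def __init__(self):
        self._leases = {}

    def lease(self, role: str, ttl_seconds: int = 900) -> str:
        # Hand out a fresh token tied to a role and an expiry deadline.
        token = secrets.token_urlsafe(16)
        self._leases[token] = (role, time.monotonic() + ttl_seconds)
        return token

    def resolve(self, token: str):
        entry = self._leases.get(token)
        if entry is None or time.monotonic() > entry[1]:
            return None  # unknown or expired lease
        return entry[0]

vault = ToyVault()
t = vault.lease("db-readonly", ttl_seconds=1)
assert vault.resolve(t) == "db-readonly"
time.sleep(1.1)
assert vault.resolve(t) is None  # the elevated access evaporated on its own
```

The point of the sketch is the shape, not the mechanism: an old terminal holding a lease token is harmless once the lease has expired.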

Finally, keep in mind that this type of accident is not just a "small company"
problem. Recall this AWS ELB outage on Christmas Eve of 2012 -
[https://aws.amazon.com/message/680587/](https://aws.amazon.com/message/680587/)

    
    
        "The data was deleted by a maintenance process that was inadvertently run against the production ELB state data."

~~~
twic
> 4\. Provide a fairly straightforward way for people to get temporary
> elevated access. If it's easy to get elevated access, then users will not be
> tempted to "hold on" to elevated access longer than they should (e.g. by
> leaving a terminal open for a very long time).

Or to hack around it in some other way, e.g. by adding some sort of backdoor
to a production application.

This is such an important lesson, and one that unfortunately is lost on many
command-and-control type IT departments.

------
sudhirj
They should really consider using a CI system to run their tests.

~~~
gremlinsinc
haha that made me chuckle.

------
wiredfool
I think the root issue here is that the production database "user" has too
many privileges, and the reason for that is migrations. This is compounded by
the test user essentially needing to be a db superuser to create and destroy
test databases, as well as run the migrations for them. I've noticed this
lately with Django, but I'm guessing that it's a general problem.

When I design a DB system, ideally the production 'user' can only do those
things that we reasonably expect it to be able to do, and truncate isn't one
of them; neither are dropping tables, (in most cases) deleting entries, or
maintenance tasks. DDL modifications are right out.

Those tasks can be run as a specific user, locked down to certain types of
connections that aren't allowed from production.

~~~
twic
When I first worked with migrations, they were run manually as part of the
release process. Over time, we automated them - by making them part of the
release tool. It never occurred to us to make them part of the application
itself. For us, it was therefore completely natural that the app's DB
credentials did not have DDL permissions.

The whole idea of migrations being run by the app still seems really silly to
me. I suppose this is the obvious move for developers who (very sensibly!)
build applications, but don't build their own release tools, to whom it would
never occur to put migrations anywhere except the application.

~~~
zimpenfish
> The whole idea of migrations being run by the app still seems really silly
> to me.

I have one app that kinda runs migrations - there's a helper app which is
built at the same time and (normally) runs before the main app. But this is
based on a per-user database which means there's no "release" process that
could run the migrations.

(Similar to iOS apps, really - when you get a new version of Teleappchatbook,
it'll often have an "upgrading..." step which I assume is migrations / index
updates, etc.)
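Whoever runs them, migrations usually boil down to a version table plus an ordered list of steps; a minimal sketch with SQLite (the schema and steps are made up):

```python
import sqlite3

# Ordered, append-only list of schema changes.
MIGRATIONS = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
    "ALTER TABLE users ADD COLUMN email TEXT",
]

def migrate(conn: sqlite3.Connection) -> int:
    """Apply any migrations beyond the recorded version; return the version."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (v INTEGER)")
    current = conn.execute("SELECT MAX(v) FROM schema_version").fetchone()[0] or 0
    for v, step in enumerate(MIGRATIONS[current:], start=current + 1):
        conn.execute(step)
        conn.execute("INSERT INTO schema_version VALUES (?)", (v,))
    return len(MIGRATIONS)

conn = sqlite3.connect(":memory:")
assert migrate(conn) == 2   # first run applies both steps
assert migrate(conn) == 2   # second run is a no-op
```

The same runner can live in a release tool with DDL-capable credentials, which is exactly what keeps DDL rights out of the app's own DB user.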

------
jtmarmon
Wow, the issue with the signed token is very interesting. I found it
surprising that the authentication method specifically wasn't mentioned in
the remediation.

Food for thought: the security issue wouldn't have happened if (1) travis used
UUIDs instead of sequential IDs as a pkey, or (2) used a secret token for auth
instead of a signed (presumably) JWT.

~~~
jonny_eh
1) They don't need UUIDs as a pkey, but just as another field. I like using
auto-incrementing IDs as the primary key, and UUIDs for referencing records
over internet-facing APIs. Best of both worlds!

2) Totally, JWTs are evil.

~~~
testplzignore
I don't know much about JWTs. Why are they bad?

~~~
zrail
JWT itself is a nice container for signing a small amount of JSON and being
able to easily pass that around. I use it a lot for situations where I want to
ensure someone hasn't futzed with the data, and/or I want an auto-expiring
token of some sort.

JWT, by itself, is not an authentication and authorization system, but people
often use it as such.

------
badmadrad
I'm wondering why any developer needs update/delete/drop access to a prod
database. Or why ad hoc scripts would have this ability.

~~~
gremlinsinc
only devops should have this info and they should be guarding it fiercly and
only use when they know why/when... it would be easy enough to mirror all
production data to a 2nd db w/ full read/write that's updated daily or weekly
from source. That should give plenty of data for devs to work with.

~~~
idunno246
Isn't the idea of fiercely guarding access from developers the antithesis of
devops? That just makes 'devops' people 'ops'.

~~~
balls187
Least privileged access isn't necessarily the antithesis of devops.

You could argue that the dev-ops model includes automation for operations on
production systems, with direct access to production systems limited to a
reduced set of staff. Developers can create the change-sets to modify
infrastructure, but those change-sets are reviewed, validated, tested, and
then executed on production via CI/CD automation.

Infrastructure as Code and Immutable Infrastructure lend themselves well to
that approach.

------
plasma
A suggestion for production database access: create a read-only login (in
addition to a write one).

Log in with the read-only login the majority of the time, and only switch to
the write login when required.

------
vijaybritto
Finally I can show a solid example to my team mates who ridiculed me when I
said we needed restrictions on access to prod servers. This is a great write-
up!

~~~
x86_64Ubuntu
Um, you shouldn't need a solid example for such a crystal-clear best
practice. I mean, everyone has access restrictions on PROD; it's simply too
valuable and too costly to have to bring back up after a fuckup.

------
bogomipz
> "Using our API logs, and with information from our upstream provider about
the IP address the query originated from, we were able to identify a truncate
query run during tests using the Database Cleaner gem."

I'm assuming by "upstream provider" here they mean ISP/IaaS provider. Either
way they didn't have enough information under their control to identify the
source of the query. The reliance on a third party for accurate logging
information seems like a big blind spot.

What if the upstream provider didn't have the logs? Or the request for access
to those took an excessive amount of time? I didn't see anything in the
remediation steps to address this.

------
jmiserez
Interesting writeup. I loathe setting environment variables in long-running
terminal sessions precisely because it's not obvious once they're set.

I prefer to use a subshell for the command and set the environment variable
each time:

$ ( export FOO=bar; my_cmd )

~~~
ams6110
Or just:

env FOO=bar my_cmd

~~~
cjbprime
Or just:

FOO=bar my_cmd

~~~
mappu
Or just:

    
    
         FOO=bar my_cmd
    

That's the same but with a leading space, to prevent 'bar' from showing up in
your histfile.

~~~
karlding
On bash, this assumes that $HISTCONTROL is set to ignoreboth [0]. This is in
the default ~/.bashrc on Ubuntu, but I can't speak about other distros.

I don't use zsh, but I believe the equivalent is the HIST_IGNORE_SPACE option [1].

[0] [https://linux.die.net/man/1/bash](https://linux.die.net/man/1/bash)

[1] [https://unix.stackexchange.com/questions/6094/is-there-
any-w...](https://unix.stackexchange.com/questions/6094/is-there-any-way-to-
keep-a-command-from-being-added-to-your-history)

------
kccqzy
One thing that immediately caught my attention: the fact that it is possible
for a single query/command/request to wipe everything.

To be frank, at a place I worked there had always been something like this
too: if you were logged in as super admin, wiping all data was just one POST
request away. That was super convenient when testing things, but having the
same in production made me uneasy. Fortunately, before any incident happened
I added additional checks that required special command-line flags to enable
this API. Perhaps still not foolproof, but I felt much better.

~~~
zaidf
I wish people would talk about this more. I am of the belief that if a single
command can lead to a failure like this, you can’t simply plan on that
accident not reoccurring. You should basically assume that it will reoccur.

Ideally, I think databases should integrate checks such as these. For
example, how often does a production users table need to be truncated
intentionally? Even by superusers? Usually very, very rarely. So imagine if
the database made you jump through hoops before you could do that. People
rely on permissions for this sort of thing, but given how complicated
permissions can become as the team and the database grow, it's not hard to
screw up the permissions and access control. I believe this kind of doomsday
scenario is best caught when checks are built in deeply at a very low level.
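One way to build that kind of hoop-jumping into the client layer, as a guard rather than a permission (the hosts, URL, and guard function are hypothetical):

```python
import re
from urllib.parse import urlparse

# Only these hosts may receive destructive statements.
SAFE_HOSTS = {"localhost", "127.0.0.1", "test-db.internal"}
DESTRUCTIVE = re.compile(r"^\s*(truncate|drop|delete)\b", re.IGNORECASE)

def execute_guarded(database_url: str, sql: str) -> str:
    host = urlparse(database_url).hostname
    if DESTRUCTIVE.match(sql) and host not in SAFE_HOSTS:
        raise RuntimeError(
            f"refusing destructive statement against {host!r}: {sql[:40]}")
    return f"ran on {host}"  # stand-in for actually executing the query

execute_guarded("postgres://localhost/test", "TRUNCATE users")     # allowed
execute_guarded("postgres://db.prod.example.com/app", "SELECT 1")  # allowed
try:
    execute_guarded("postgres://db.prod.example.com/app", "TRUNCATE users")
except RuntimeError as e:
    print(e)  # the hoop: destructive SQL against an unlisted host is refused
```

A guard like this is no substitute for DB-level permissions, but it fails closed when the permissions (or an inherited environment variable) are wrong.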

~~~
greenleafjacob
What about a database that required DDL to be separate from data commands?
Instead of having a single superuser that can do both DDL and
INSERT/UPDATE/DELETE, you would have a DDL user and a data user. That would
probably prompt people to only use the data user in their application.

------
notimetorelax
Shouldn’t there be a remediation step of making it impossible to log in to
another user's session? E.g. generate a random number for every provisioned
user and add it to the token.

------
thezilch
All of my production terminals have dark-red background and my screen
hardstatus also red. This is my default in rc files, and I have to explicitly
link rc files to get my dev-only black background with lime hardstatus.

~~~
twic
How do you set the background colour?

------
joelhaasnoot
Lots of focus in the comments on the database access issue, but trusting the
user-specified (signed) token doesn't seem like a great idea. Not validating
the token against the database seems like a painful shortcut.

------
philip1209
This makes me think of the Google SRE book. They advise that, if there is a
problem this big, any SRE should have the power to turn off the production
load balancers until the problem is fixed.

I don't think that TravisCI did anything wrong. However, if they had turned
off the load balancers as soon as they realized there was a huge issue, it
might have protected customer data better. They optimized for uptime over
completely fixing the issue. Also, perhaps nobody felt that they had the
authority to turn off the production service.

------
catfood
So the session keys mapped to usernames, rather than IDs in the database?
Otherwise, when the database is restored with the old user IDs, the session
would become invalid instead of continuing to work. This is what I'm seeing:

1\. Tables truncated.

2\. In this window, someone creates an account with a username that existed
in the dropped database.

3\. They see a blank user page because a new user record was created.

4\. Database restored.

5\. It's as if you're logged into the original user's account.

~~~
cyrusaf
This is not what happened. The tokens were mapped to user IDs and when people
signed in, the db created new users which may have had the same IDs as old
deleted accounts. When they restored the DB, these tokens pointed to other
users and granted access to these other users' accounts. Quite an unfortunate
situation. May have been mostly avoidable if UUIDs were used instead of
incrementing IDs, but hindsight is 20/20.

~~~
catfood
The part I don't get is how a new user could have the same ID as an old
(truncated) user, since "our system created new records for them, with
primary keys generated from the existing sequence (PostgreSQL does not reset
id sequences on truncate)."

Do they mean that the only potentially exposed accounts are those that signed
up after the database was restored?

~~~
cyrusaf
It's possible their truncate also restarted the sequence: "TRUNCATE TABLE
users RESTART IDENTITY;"

[https://stackoverflow.com/questions/13989243/sequence-
does-n...](https://stackoverflow.com/questions/13989243/sequence-does-not-
reset-after-truncating-the-
table?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa)

Although, I am confused about how sign ins created new users. When they say
sign ins, do they mean new accounts?

~~~
catfood
Yeah, they must mean new accounts; if not, then I'm lost. I guess it could
have reset the autoincrement, but they said it didn't. The only other thing I
can think of is that the signed token that's put in localStorage is sent to
the server like "someuser|sometoken", the server inspects sometoken, says it
checks out, then takes the client at its word that it's someuser.

------
fermigier
Reminds me of this:

"Why Auto Increment Is A Terrible Idea" (2015) [https://www.clever-
cloud.com/blog/engineering/2015/05/20/why...](https://www.clever-
cloud.com/blog/engineering/2015/05/20/why-auto-increment-is-a-terrible-idea/)

(update: link fixed).

~~~
farnulfo
Working link : [https://www.clever-
cloud.com/blog/engineering/2015/05/20/why...](https://www.clever-
cloud.com/blog/engineering/2015/05/20/why-auto-increment-is-a-terrible-idea/)

------
pradeepchhetri
Amazed to see how transparently they have written the post. I think we can
all learn from such outages [0].

[0]: [https://about.gitlab.com/2017/02/10/postmortem-of-
database-o...](https://about.gitlab.com/2017/02/10/postmortem-of-database-
outage-of-january-31/)

------
SoulMan
Classic case of a developer returning to a window with a prod env set up. I
am sure it was a "blameless post-mortem", i.e. the action items contain
changes in tooling and processes rather than attempts to change human
behaviour.

~~~
jlgaddis
> _... i.e action item contains change in tooling and processes rather than
> trying to change human behaviour._

Contrary to what we might hope, in my experience the former is often much
easier to ensure and enforce than the latter.

------
Sembiance
Aren’t these the folks that spammed every github repo with a spam pull request
to integrate their system into your code? I kinda lost all respect for this
project and their developers after that incident.

~~~
rkh_popcorn
Apologies that you were affected by this. The script creating these pull
requests was created and run by a third party not affiliated with the company.
We were similarly upset by this.

~~~
bionoid
> created and run by a third party not affiliated with the company.

Are you saying you never hired the third party, they went about and did this
of their own accord without talking to you first?

~~~
rkh_popcorn
Correct, this was from an overenthusiastic user, someone we did not know and
had no direct contact with.

At the time we weren't actually making money yet, and most of the
contributions to Travis CI came from outside collaborators.

To add to the confusion, we did indeed have a bot in the early days that would
comment on pull requests, but only if the repository was using Travis CI
already (this has now been replaced by GitHub's status API). However, this was
not the same bot account that kept opening unsolicited pull requests on random
projects.

~~~
bionoid
Thanks for the clarification -- is there an official writeup about the
incident somewhere? (should have asked that initially, sorry)

~~~
rkh_popcorn
I was looking for one as well, but it seems we did not write a blog post. I
will do some digging when I find the time, as I know we at least messaged some
people that voiced their frustration directly.

------
yazr
Has anyone estimated whether SaaS in general reduces or improves uptime in
aggregate over all users?

The obvious argument is that a specialized SaaS is more reliable, but the
rare outages are horrific...

------
bananarepdev
Why does an extremely dangerous tool, such as a database cleaning
tool/library, rely on an environment variable to define the target?

~~~
tgtweak
What's the ideal way to do so? Most of the production systems I've seen
distinguish between prod and staging with only this.

------
stevekemp
Site is down for me, but this mirror isn't:

[http://archive.is/klfF5](http://archive.is/klfF5)

------
nukeop
What is a "read-only follower"? Is this a common term when handling databases?
Is it different than a slave?

~~~
johanj
Many people are using leader/follower instead of master/slave.

~~~
nukeop
First time I hear of this, who uses this and why?

~~~
jffry
"master/slave" evokes the unhappy history of slavery, which people care about
to widely varying degrees.

"leader/follower" or "primary/replica" are much more neutral terms that won't
prompt negative emotions in many people.

People who choose one of the latter two options do so either because they feel
it is more accurate, or because they wish to avoid the negative connotations
of "master/slave", or a mix of both.

~~~
nukeop
It's better to stick to tried and true terminology that everybody understands
than to needlessly introduce new and redundant designations for concepts that
have been in use for decades, just to avoid upsetting rather irrational
American sensibilities.

~~~
bpicolo
Leader/follower and primary/replica are common terminology now. It's unfair
to paint the sensibilities as irrational - this is exactly an example of the
kind of nonchalance that helps lead to underrepresentation of minorities in
tech.

You might not care, but there are a lot of people that do. Taking steps like
this improves the comfort level of others while affecting yours not at all.
Why is that not worth it?

~~~
rsynnott
Primary/replica is also _much clearer about what's actually going on_ than
master/slave.

------
d6de964
I'm currently job-seeking and I've seen many job ads asking for CI
experience. I'm not fond of using SaaS solutions and would like to fiddle
with CI in private (e.g. using a private GitLab repo).

What would be the steps to set up my own, private, open-source CI solution
for, say, a Go, PHP, or JavaScript project?

~~~
colechristensen
Jenkins has existed for a long time.

It has plugins for many things, and you can just run shell scripts when those
don't work for you.

The Job DSL plugin is good for putting everything in code and generating jobs
programmatically.

There are a few other CI tools; GitLab itself, which you mention, has been a
full-featured CI solution for quite a while, and you can host it yourself.

~~~
d6de964
Great, it also has a Docker image apparently.

------
ghoshbishakh
Amazed at their transparency!

------
wemdyjreichert
Attack of Little Bobby Tables

------
0xFFFE
Not being snarky. How hard is it to set up DB replication and do testing/QA
on that DB? Isn't that the SOP?

Why doesn't the remediation list include it?

~~~
ameliaquining
Presumably because that's already SOP. It sounds like the query was supposed
to run against a development or staging DB, but an environment variable that
the dev wasn't aware of caused it to run against prod instead.

------
100k
Great writeup.

This is the third case I'm aware of where CI deleted the production database.
Others are GitHub (back in 2010: [https://blog.github.com/2010-11-15-today-s-
outage/](https://blog.github.com/2010-11-15-today-s-outage/)) and
LivingSocial.

~~~
xtreak29
A video of Zach Holman talking about dropping the DB in production twice,
when CI ran with production credentials:
[https://www.youtube.com/watch?v=AwXhckRN6Mc#t=5m50s](https://www.youtube.com/watch?v=AwXhckRN6Mc#t=5m50s)

Great video overall about how to handle these situations.

