Hacker News new | past | comments | ask | show | jobs | submit login
Post-Mortem and Security Advisory: Data Exposure After travis-ci.com Outage (travis-ci.com)
351 points by xtreak29 on Apr 9, 2018 | hide | past | web | favorite | 169 comments

Kudos for a thorough and transparent writeup, and (by the looks of things) understanding that processes fail rather than individuals.

That said, I have to admit to having at least three eye-bulge WTF moments while reading this.

I'm also surprised that there isn't a Remediation step of "firewall the development machines away from the production database".

(And isn't the change to database_cleaner to make it throw when run against remote databases by default a serious break of API compatibility? What if someone's depending on that behavior?)


What a great way to see failure: what went wrong, how to improve without blaming it on the poor guy. I wish all companies were like this - very forward thinking!

Funny thing is... this morning on my commute I was watching https://www.oreilly.com/ideas/developing-a-chaos-architectur... - which I recommend to anyone!

I think it is now widely understood that best practice is to blame processes, not people.

However, i do have a niggling worry that letting people off the hook risks stunting their personal growth. If you can make mistakes and not bear any responsibility for it, what is driving you to get better at doing your job?

> If you can make mistakes and not bear any responsibility for it, what is driving you to get better at doing your job?

It could be any of the following:

1) ambition: desire to improve yourself

2) fear: desire not to screw up (and feel that you are making mistakes)

3) fear: job security

In all cases what's essential is a feedback system. The fact that you are not directly the one to be blamed doesn't imply you are automatically not accountable for anything you do. Everyone shares at least partially the blame.

Creating a blameless culture, having a good feedback system in place with maybe some goals, objectives and ways to improve is probably something you need, if you want to keep your employees and have them grow with you and not treat them as expandable.

Good management should be able to do this. If someone is not fit for a role, a good manager will be able to find another way for this person to be useful. The last thing you want to do is to fire someone. At least that's my opinion. Again, management.

IMO it's distinguishing mistakes from pure negligence.

A developer who's able to easily run scripts in production without any checks in place is different than a developer who knowingly sidesteps existing processes in order to perform a task faster or easier.

The punishment for one versus the other should be very different.

In any well run engineering org there is still the grapevine in operation.

You really don't want to be the butt of all the "Joe the prod database killer" jokes.

These things can be different: whether you take personal responsibility for a mistake, whether you are legally responsible for a mistake, whether your manager holds you responsible for a mistake, whether your team holds you responsible for a mistake, and whether a post-mortem report blames you for a mistake. It may be easier to take personal responsibility for a mistake if this isn't coupled with these other things.

I think when the circumstances actually show someone was grossly negligent, it's very hard to convincingly pretend that only processes are to blame. Remember that the current thinking is working against thousands of years of a cultural pattern assigning responsibility.

It's also interesting how many seem to be using this idea meant to be applied to internal processes and individuals and applying it to the companies themselves.

The most significant example was Gitlab: they had something like four backup mechanisms for their database, yet three(!) of them had not been working for months, with nobody noticing. Then, someone actually dropped the production database. The last working backup system also had some problems I can't remember right now.

Yet in the comments here, there was barely a hint of criticism of Gitlab. Instead, they were lauded for not lynching the poor guy who made the last of what must have been dozens of terrible decisions leading to the incident.

Travis' incident here does actually seem like the sort of freak accident that one can't rule out completely, at least not at a company of their size. But forgiveness might not always be warranted.

> I think when the circumstances actually show someone was grossly negligent, it's very hard to convincingly pretend that only processes are to blame.

On the contrary - the entire reason why you transition from culture to process as you grow is because when you measure by results, there's no difference between unintentional negligence (leaving a production terminal open à la Travis), intentional negligence (let me bystep this annoying check, let me procrastinate on fixing the backups a la GitLab), and malice.

If you get large enough, you will have grossly negligent people, by statistical inevitability. You can either accept this statistical inevitability and design your process for it or you can continue to believe that you (and everybody else) really actually do only ever hire the very best.

GP is pretty clearly talking about blaming the company (processes) and not individuals.

The entire post laments the fact that entities external to the company will repeatedly forgive the company if the company goes through the motion of publishing a blameless post-mortem. Even if those post-mortems repeatedly indicate the company really didn't learn anything, or change the processes that led up to the incident.

But by golly, they're forward thinking company, and therefore blameless.

Isn't the fact that there were "dozens of terrible decisions" leading to one poor guy dropping the production database (in a way that was extremely hard to recover) absolutely a failure of process rather than a failure of a person? He made the last mistake in a big series of mistakes, so unless he was also responsible for all the prior mistakes and failed/non-existent processes leading to his mistake, it seems very unfair to blame him. It is, of course, perfectly fair to blame Gitlab, and plenty of people did just that in the comment thread about it at the time.

> Yet in the comments here, there was barely a hint of criticism of Gitlab. Instead, they were lauded for not lynching the poor guy who made the last of what must have been dozens of terrible decisions leading to the incident.

The threads are still there. You can go back and read them, and notice that quite a few comments are along the lines "not an individual fault, but catastrophic on the group level", "I don't think I can trust Gitlab to run a production service", "How the hell do you have all your backups failing and not notice?!".

While in a huge company it might be hard, I feel like any small engineering team can be composed entirely of self motivated people

If you make a mistake (a real one that causes problems for the company and its users), that is enough for most people to want to avoid it happening again.

At the very least, different credentials in the database for each environment would make sense.

I prefer thorough E2E testing to mocked large-component integration testing (i.e. spinning up a db vs mocking it out) so I can totally understand needing to run tests against a database but yeah reading this was a collar-pulling "yikes" moment for me.

Their remediation steps are good though, for a start... They might want to look into spinning up completely new databases programattically for tests rather than connecting to one...

Postgres has credentials in the URL, so I am sure they already had different credentials, but the URL was set.

This seems like it fits the narrative. Still, WTF? A quick google tells me that PostgreSQL supports SSL cert-based authN. The prod environment (and possibly an Ops host) should be the only ones with the right certs to connect to the prod DB.

Creating an ssl connection to the postgres server might add extra connection time (Yes could be prevented by persistent connections but that's not always feasible, especially in a cloud based environment). So it's not the best solution.

Postgres pg_hba.conf does support access by ip address easily enough. For access to the production databases the ip addresses should be limited to those that need access. Or if not in postgres, a firewall could limit it.

So if a developer accidentally tried to wipe it, they wouldn't be allowed to connect. If they actually needed to connect, they would first have to add their ip address (or use a bastion host).

The GP was presumably talking about TLS with mutual auth. There’s no excuse for not having TLS at all enabled, but mTLS is a great extra security feature. The cost of mTLS (performance wise) is negligible, and if you can’t keep a persistent database connection up (because cloud?!) but also can’t do TLS because perf then I don’t know what to tell you. The only performance issue TLS has is that it isn’t used enough.

As someone who has helped many, many companies with Postgres I can tell you that nobody runs PG over TLS in production if everything’s talking over private links.

There’s zero reason to do this.

if you are running PG like a KV store (simple lookups/simple updates/inserts) where most of the dataset fits in memory then it quickly becomes CPU bound and changes in PG configuration/access patterns can have large effects. for example i've seen running pgbouncer on the same host drop performance by about 33%, and switching from simple protocol to extended protocol single use prepared statements drop performance by 33%.

What is interesting is the DB cleaner gem actually does check for `RAILS_ENV=production` and a `DATABASE_URL` that is remote:


It's possible they disabled the localhost check at some point.

Edit: It does now, after Travis folk fixed it. :D

And, Rails (which I assume they're using) typically loads DB credentials from a `database.yml` file for a particular `RAILS_ENV` (local dev, local automated testing, prod). By setting the `DATABASE_URL` env var, they are overriding that (this is fine/expected behavior for app deployment as far as Rails is concerned; it's comparatively less common for localdev). So that bypassed the `RAILS_ENV=prod` check that might also have caught this (e.g. `RAILS_ENV` was undefined or `development`).

So I'd really have loved it if they dug into some of the UX factors to this issue:

* Why do they allow remote access to their production database? (Is it too hard to get a local database copy? "For debugging purposes"? Why was the eng in prod the day before? Was it unintentional? Should we revisit those assumptions? Should we resolve whatever issue that caused us to shortcut by allowing remote, direct prod db access?)

̶ ̶W̶h̶y̶ ̶D̶B̶ ̶c̶l̶e̶a̶n̶e̶r̶ ̶a̶l̶l̶o̶w̶e̶d̶ ̶n̶o̶n̶-̶l̶o̶c̶a̶l̶ ̶(̶o̶l̶d̶e̶r̶ ̶v̶e̶r̶s̶i̶o̶n̶?̶ ̶m̶i̶s̶c̶o̶n̶f̶i̶g̶u̶r̶e̶d̶?̶ ̶e̶t̶c̶.̶)̶

Why, for localdev, they override database.yml and use env vars (which I've found is more cumbersome to use and not as common). Yeah, in production you should use RAILS_ENV/DATABASE_URL/etc. - so are they attempting to have prod parity? Why or why not?

* Why are folks routinely logging in (for debugging purposes) with a DB user that is, effectively, root? I bet that user can `drop table` too. Should folks have a limited access account and then jump through a hoop or two for root?

It sounds like they want to "debug" prod by running some local rails tools (rails c?) with DATABASE_URL set to the prod db. Is that the "best" way to do it? heroku etc. actually open a rails console in the environment and not locally.

Those safeguards were added by Travis CI folks as one of their remediation action items: https://github.com/DatabaseCleaner/database_cleaner/pull/521

Dang, somehow I missed that. Rawk!

So one of the ways of analyzing the root cause here is that the autoincrement index of the user table in their database is security-sensitive, and relatively normal DB operations like "Let's roll back the DB" have serious security implications involving ID reuse. What are some ways to make this less dangerous? (The rest of it was an operational failure, but it would have been less trouble if it weren't a security failure.)

I can think of the following:

- Don't use auth cookies that are signed messages consisting of a UID + expiration date and other data, use auth cookies that are opaque keys into some valid-auth database. This is significantly less efficient (every operation needs a lookup into the DB before you can do anything; if you move it into a cache you now risk the cache being out-of-date with your DB). AFAIK using signed UIDs has no security downside other than this, right?

- Identify users by usernames, not by UIDs. This makes renaming users (which GitHub allows, so Travis is forced to allow) difficult and security-risky.

- Use UIDs that are selected from a large random space so make collisions unlikely, e.g., UUIDs or preferably 256-bit random strings. This seems fine and probably preferable from a security point of view. Is this fine from the DB point of view?

Anything else? Maybe a DB restore-from-backup option that preserves autoincrement counters and nothing else - is that a standard tool?

I use a mix: UIDs (auto-incrementing primary keys) for internal app use (e.g. joins), but use UUIDs for referencing records outside the app (anything sent out over the API).

The UIDs just make for easier to read logs, and easier to inspect and hand-write db queries. The UUIDs just seem much more secure when communicating with client applications.

I also agree that authing with a DB is preferable, if you can afford to do that at the scale your app needs (which is most apps out there).

Why not just UUID for the primary key also and only have one column? Having 2 columns seems like extra complexity for not much gain. Modern databases usually have a native UUID type column which stores and compares better then using the string/char type.

One more benefit of using UUIDs for primary keys is that clients can generate models along with the primary key and know what the key will be BEFORE they submit it to the database. That is, it works really well with distributed systems. Auto incrementing primary key is really state stored at the database level and a source of contention

I would even argue that UUIDs are far better to read logs. You do a grep for that UUID in the log and it's easier to find then to grep for integer primary keys. If your contains disparate models then UUIDs will find your entries easier then integers.

If you ever have to rollback your database but keep some of the new data, it's easier also. You just export the data you want to keep, do your rollback, and them import that data knowing that auto incrementing primary keys will never be an issue.

You can copy objects easily from one database to another and know the primary key will be the same, and not clash.

For hand writing db queries, it's also better to use a UUID since then I can use that same query cut and pasted into other databases with the same data and know that I don't have to check primary keys. The UUID is just a copy/paste.

A UUID primary key has to live in every index in the table and every foreign key that references it.

We have a table with a UUID as its primary and it alone consumes 22GB, then a index reference it, so now another 22GB... That one primary key uses over 100GB of storage, a developer recently went to add another table that reference it and we had to decide whether we were okay taking another hit. If we used your example of using a char type (we use blob) it would be double the size...

It doesn't sound like much, but in a database with all primary keys being UUIDs you are going to inflate the size of your DB quickly, or have to forego using foreign keys. I imagine if we used only UUIDs we would have to double or triple our database infrastructure. Additionally, we would be forced to introduce partitioning sooner.

Now we have a tech debt ticket to add an auto increment to that table so we can reclaim disk / memory.

>I imagine if we used only UUIDs we would have to double or triple our database infrastructure

I think you're greatly exaggerating how big UUIDs are compared to ints. The upper bound is 4x, and that's if your table is keys only, with no other data. That falls down to 2x if you're using bigints. In a typical user table with a bunch of other fields, like a username + password hash, the storage used by the keys is quickly dwarfed by the storage used for "other data".

I would argue that's a reasonable hit for the gains. I would never add a tech debit ticket to change just to reclaim disk / memory, that's an anti pattern.

Memory and disk space is cheap.

Also UUIDs are just 128bit integers internally. My example clearly points out that most modern databases have a native uuid type (and if not use use a 128 bit integer).

Using a char type to store UUIDs is very wasteful and no wonder your indexes are so huge.

The tech debt ticket you really have is to convert type of uuid column from char/blob to native.

Note also that if you need to expose your data, you can either expose the auto incrementing primary key (which leaks data, in that other people can work out growth rates of various models, like how England worked out how many German tanks were being built based on a auto incrementing serial number), or you create a another column (as someone else suggested) which is a uuid. Which you would need to index anyway.

If index size is an issue then you can just integers which are randomly generated. You can calculate your collision chance based on how large your integer is and how often you create. If you happen to pick 128bit integers, congratulations, you're using UUIDs.

>Using a char type to store UUIDs is very wasteful and no wonder your indexes are so huge.

That was a hypothetical, its 22GiB using BLOB storage.

>Note also that if you need to expose your data, you can either expose the auto incrementing primary key (which leaks data, in that other people can work out growth rates of various models, like how England worked out how many German tanks were being built based on a auto incrementing serial number), or you create a another column (as someone else suggested) which is a uuid. Which you would need to index anyway.

I state that we use the auto-increment inside of the application, everything exposed by the API is a UUID. While we lose 32 bits of space per row, just one foreign key or index saves enough space to justify it.

Best of both worlds.

>Memory and disk space is cheap.

Not when you need to move from multiple r4.8XL to r4.16XLs (think multi-az, multi region). It would work out to around double the price, we can hire a developer for the cost of hosting that infrastructure. Even when your business has the money, you don't want to be the guy telling them to drop another $24k a month on hosting costs because we didn't think one year ahead.

This of course ignores all the issues that come with replication and partitioning, multi-master etc.

>"We have a table with a UUID as its primary and it alone consumes 22GB, then a index reference it, so now another 22GB... That one primary key uses over 100GB of storage,"

I'm not following your math here. Can you elaborate? I see 22 and 22 and the 100.

I assume they have more than one reference to that table...

Aren't UIDs effectively the same as integers in this case? Seems like it's conceptually similar to using the MD5 hash of the integer ID. Wouldn't it still cause exactly the same issues mentioned in the article? Anything non-random would.

UUIDs (two 'U') are random.

To be precise only v4 UUIDs are (pseudo-)random, and not the entire thing, there are some version bytes.

UUIDv4s are. The other kinds are deterministic. (And to be fair, quite rare ;))

This right here.

The ORM I wrote (and use...) automatically generates a an internal and an external identifier for every persisted object. Works pretty well in practice.

There are simpler solutions. When you recover your database, simply add a large constant to all autoincrement columns, that should prevent any overlap. Atleast for existing setups this is ideal.

The other option is to use snowflake IDs with a random suffix. They will still sort like normal IDs but won't collide if you roll back (or are unlikely to collide).

UUIDs or fully random strings are a bit meh as primary column in a database. They disconnect all meaning and potential sorting from the records (sorting by primary key is meaningless and random PKs may ruin database performance depending on setup).

Snowflakes will behave like a UUID or random string when properly engineered but are still sortable.

Identifying users by usernames also leads to all kinds of sillyness, internally you should be using the ID and not the username. Usernames are meant for human consumption, not machine consumption.

"Snowflakes" meaning like Twitter Snowflake, an autoincrement with a timestamp folded in (and maybe some ability for multiple read/write database servers to generate IDs without actively coordinating with each other)? That would work - can I easily get Postgres or something to generate these?

I usually prefer Snowflakes with a random suffix behind a timestamp. 34bit UNIX Timestamp sidesteps the 2038 issue and 30 bits of randomness are good enough to prevent collisions with a p < 0.0001 of up to 400 inserts per second (or 1500 inserts per second for p < 0.001). And you can always tweak this to match your workloads (use a bit more randomness, use a bit more timestamp, maybe even use a custom epoch instead of 1970 to reduce the number of bits necessary or use millisecond accurate timestamps)

PostgreSQL supports calling functions as the default value of a column so it should be possible to simply put in a function that returns such a snowflake (either in pure SQL or in PL/Python) and PG will use that.

I’ve tried a similar approach and ran into an issue of id generation during tests. Tests run in parallel and quick succession. Ran into collisions right away. Any suggestions on how to mitigate that?

First, you should still use a UNIQUE column, that prevents the most urgent collision issues.

Second, if your tests run a lot faster than actual loads, consider using a special test timestamp format that allows for higher throughput at the expense of longevity (ie, 10bit timestamp since start of test + 54 bit of random data) or alternatively throttle the tests to closer resemble real workloads (within one or two orders of magnitude)

You can also replace the random part with a Node ID + atomic counter. Each Node that generates IDs gets a unique identifier as ID which is inserted after the timestamp. Then the last few bits of the ID are used by an atomic counter. This ensures that nodes don't generate colliding IDs with eachother (the NodeID part) and with themselves (the Atomic part).

> First, you should still use a UNIQUE column, that prevents the most urgent collision issues.

Right, of course. I was using the UNIQUE/PK columns. When I said collision, I did mean that db threw errors on duplicates. Which is, of course, expected.

> You can also replace the random part with a Node ID

What is a "Node ID"?


A node ID would simply be a somewhat-unique ID for a node.

Sonyflake implements a 16bit machine ID: https://github.com/sony/sonyflake

It prevents multiple machines producing colliding numbers.

The solution is to stop using auto-incrementing IDs generated by the database, and UUIDs are not the only answer.

It is trivial to have an app request and reserve a pool of IDs and assign from them as needed. It guarantees no overlap, is completely distributed, requires no db roundtrip to insert, can remain as numeric data, allows every row to be uniquely identified across tables or databases, still has loose ordering, and it only requires a system with atomic increments to implement.

> It is trivial to have an app request

From where? If the counter for this pool of IDs lives in the database being rolled back, it is vulnerable in the same way as auto-incrementing IDs. If it doesn't, it isn't - but the safety you're getting comes simply from the fact that it doesn't get rolled back, not the rest of the machinery. It would be enough to make sure you don't roll back autoincrement counters when rolling back the rest of the database.

That said, i am a fan of the approach you outline. I learned it from the IdGenerator in ATG Dynamo (c. 2500 - 2200 BC):


You're right, it's an external system. Left it out since a key/value db is super simple to operate and very cheap by just using dynamo/cosmos db/cloud datastore to maintain it globally.

All the benefits help offset the extra dependency and make database backups much simpler, especially since every row has a unique id.

Is there a reason to do this other than security against DB rollbacks? It seems like a lot of complexity to add another stateful (micro?)service just to solve this problem. It's certainly trivial to implement—it's probably even interview-sized—but it doesn't sound trivial at all to operate, since it has reliability, scalability, security, and possibly NIH concerns. (Am I building this on top of another database? On top of a Paxos of some sort? What happens during an extended network partition?)

All you need is a basic key/value store. Zookeeper/consul/etc if that's already running, or use the numerous cloud services where it would costs pennies to have a globally replicated key/value db.

Increment the "ids" key by Y amount, get the current number N, and you now have N - Y as the range to work with. No paxos or network splits to worry about, and far better performance since its 1 call to get a batch of ids.

If I already have a high-availability key-value DB with persistence, sure - but probably I don't, most designs I've seen for SQL + KV on top involve using it as a pure cache with explicit permission to drop the KV store instead of restoring that to a disk snapshot whenever something goes funny. And for this use case (robustness against reverting the SQL database), making the KV store a mere cache of data in the SQL database defeats the point entirely.

And, honestly, most sites I've seen don't need the cache, they can just run on top of the SQL database itself. The fact that you're not serving hundreds of queries per second doesn't mean you don't need security.

Having run Zookeeper, "No paxos or network splits" isn't really how I would describe it. (Nor "super simple to operate.") When it works, it works well, but it's another thing that you need to operate. Adding a dependency on a cloud service just to supply autoincrement doesn't really sound like a great solution. Especially because, by the nature of the problem, I am outsourcing auth to the cloud service: a malicious service can pretty easily issue themselves token for an arbitrary user. And if I'm not already comfortable outsourcing auth, e.g., because everything is in this cloud already, it doesn't seem like this actually solves my original problem which was increasing security.

I think you're misunderstanding. Use postgres or whatever database you want, just generate the ids separate in your app and use a simple key/value store to maintain the counter. The SLAs of a something like Azure CosmosDB is several times higher than any app you can build yourself.

None of this is directly related to security. The issue for them was putting ids in the tokens, and there's no way to solve that other than not doing it. However if they didn't reuse ids, then it wouldn't have mattered, and that part can be solved by moving the id generation out of the database so that backup/restores of the database are simpler. This is in addition to all the other benefits of separate id generation I first mentioned.

> The shell the tests ran in unknowingly had a DATABASE_URL environment variable set as our production database. It was an old terminal window in a tmux session that had been used for inspecting production data many days before. The developer returned to this window and executed the test suite with the DATABASE_URL still set.

I was expecting something like this. I remember, I configured my terminal windows to change their background when I'm on production systems [2], after around I read about gitlab database incident [1].

1 : https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-...

2 : http://www.drinchev.com/blog/ssh-and-terminal-background/

Though note that wouldn't help here, at least according to a common sense reading of this part of the post:

> we connected our development environment to a production database with write access

i.e. they weren't logged in to a production machine, they were logged in to a development machine that was permitted to connect directly with write access to the production database.

Since it's a local shell doing something like this is a lot easier - just change $PS1 to add some colors when certain variables are set.

I use something like this today on my personal machine to distinguish between my personal and work email addresses in $EMAIL (for silly firewall reasons it's easier to originate work-owned OSS on my personal machine), and on my work machine to tell me which production zone I'm talking to - but I don't have it color when I have any variable set at all, I probably should do that.

I did a similar thing as a result of the gitlab incident. I use a custom iterm2 profile called "prod" that sshs into the production "shell box" and sets up a badge + bright colour.


It's the most convenient way for me to get a production shell:

  CMD+O -> "prod" -> Enter
Drops me straight where I need to be.

TIL about iterm profiles. That's going to be handy. :)

I've similarly started to rely on Google Cloud Shell as the only way to access our production database. That way, firewalls never have to allow external access, and it's very clear whether I'm on production.

Adding a Metabase instance with read-only credentials has helped this a lot - because executing one-off data queries is easy (and safe) now.

How come it was enough to set a different database URL in the development server and BAM you're connected to the production database?

Do they share the same password? Or is the password in the URL?

The database URL usually has the username and password as well.

For example: postgresql://[user[:password]@][netloc][:port][/dbname][?param1=value1&...]

For anything using libpq a .pgpase file can help separate creds in source/conf from now machine level ACLS.



No, its a URL. Connection strings have a different format, as clearly shown in the example in your link.

checked out this https://github.com/drinchev/phook. Really awesome!!

I was surprised about a tmux session connected to production DB for days. Though it was idle there are a lot of things that can go wrong during window switching. My colleague also pointed out the subtle error of assuming the value of DATABASE_URL in the system instead of being set explicitly by the test script that could have avoided this.

That being said I am amazed at their transparency over the whole issue and a thorough write up of the whole incident. It's something we can all learn from.

This is why I've grown to consider some actions to taint terminal windows, and tainted windows must be destroyed asap.

For example, administrative access to vault requires a root token in the environment of the shell, so that shell is tainted with too much access. Having an ssh-command to a productive data storage also taints a shell, because it's really dangerous to accidentally up-arrow-enter and then you end up on a productive data storage.

History has been a bit annoying with that, but since I installed histdb, that has been a non-issue.

I'm not familiar with histdb. How does that resolve the issue of of tainted environment?

I'm a heavy user of my shell history. Frequent creation and destruction of different concurrent shell sessions clobbers the history quite well until there's nothing really usable left. This resulted in some inertia to keep windows around.

zsh-histdb replaces the single-file ~/.zsh_history with an sqlite3-database - and this database is session aware. This way I can just have as many concurrent session as I want, and create / destroy shell sessions as much as I want and the shell history remains intact.

At some point i learned about the way to export environment variables to a single command:

  LD_LIBRARY_PATH=build/lib build/bin/mycommand
I try to use that whenever i can, and as such it is now very rare for me to generally export a variable in my shell.

Add an extra space in the front to prevent your shell from adding the (potentially secret) env vars to your plaintext history file.

In that situation, i'm more likely to write:

  MY_SECRET=$(read) some_command
Without the space. That way, i get the utility of command history, without the danger of the secret being in there.

  $ cat /etc/profile.d/timeout.sh
  readonly TMOUT
  export TMOUT
That file is created on (almost, see below) each and every machine I administer as part of the initial configuration/provisioning process. By itself, it's nothing spectacular, but it sounds like it would have prevented this incident (as would have several other things, of course).

For the hosts with csh as the default, it's:

  $ cat /etc/profile.d/timeout.csh
  set -r autologout=10

I am not a tmux user so maybe this is obvious, but how are the environmental variables transferring from the production machines to where the tests are being run?

Or, is this stating that the developer started running tests from the production environment?

> but how are the environmental variables transferring from the production machines to where the tests are being run?

tmux[1] is a terminal multiplexer, a screen alternative.

Some speculation, they might have stared the session on a remote machine; set `DATABASE_URL`; use it for something; detach from the session; to later reattached with the bash env still in tack; without the developer being aware he executed the tests commands.

Another tmux "problem" is developers using it to tail logs with a infinite back-scroll to only fill up server memory. One solution was to kill all tmux session nightly via cron. Might have helped here.

Finally as tui lover, try tmux! Really keeps the frustration at bay if you ssh to a box and your session drop to know your terminal workspace is still in tack.

- https://github.com/tmux/tmux/wiki

the tmux thing is a red herring, i encounter similar environment variable things plenty just with open terminal windows(though, not for database_url).

i read it as a couple days ago in their local terminal they exported the prod db as an environment variable. ran some script or console or something that needed to run in prod. didnt unset the variable. Then next time they run any command in that terminal it hits prod.

maybe it was a remote machine, but there wasnt an active db connection, just a hidden config that exploded the next time a command ran

My guess is the second one where the developer assumed that it's a testing environment tmux session but started the test script in a production environment.

If you are interested in a list of steps you can take to avoid this happening to your data, here are some suggestions. I don't believe that any single measure is sufficient. And I also believe that it's valid to balance the strictness of your controls against the the amount of protection you really need.

1. Vault the passwords. People and machines should fetch passwords on-demand using identity credentials.

2. Create a read-only database account. In all cases, use the account that matches the need. Running reports? Use the read-only account.

3. Restrict access to read-only and read-write database accounts. Provide this account information to a limited set of people and tools.

4. Provide a fairly straightforward way for people to get temporary elevated access. If it's easy to get elevated access, then users will not be tempted to "hold on" to elevated access longer than they should (e.g. by leaving a terminal open for a very long time).

5. Rotate the credentials of all the accounts regularly. This ensures that temporary elevated access will become long-term access. It also greatly reduces the harm created when credentials are leaked, exposed, or forgotten about (e.g. in an environment variable in an old window).

Note that none of the steps above require a heavy investment in automation. You can start with basic (even fully manual) processes for key management and access management, and evolve to automation as you grow.

Finally, keep in mind that this type of accident is not just a "small company" problem. Recall this AWS ELB outage on Christmas Eve of 2012 - https://aws.amazon.com/message/680587/

    "The data was deleted by a maintenance process that was inadvertently run against the production ELB state data."

> 4. Provide a fairly straightforward way for people to get temporary elevated access. If it's easy to get elevated access, then users will not be tempted to "hold on" to elevated access longer than they should (e.g. by leaving a terminal open for a very long time).

Or to hack around it in some other way, eg by adding some sort of backdoor to a production application.

This is such an important lesson, and one that unfortunately is lost on many command-and-control type IT departments.

They should really consider using a CI system to run their tests.

haha that made me chuckle.

I think the root issue here is that the production database "user" has too many privileges, and the reason for that is migrations. This is compounded by the test user essentially needing to be a db superuser to create and destroy test databases, as well as run the migrations for them. I've noticed this lately with Django, but I'm guessing that it's a general problem.

When I design a DB system, Ideally the production 'user' can only do those things that we reasonably expect them to be able to do, and truncate isn't one of them. Also drop tables, potentially delete entries, and any maintenance tasks. DDL modifications are right out.

Those tasks can be run from a specific user, and locked down to a certain types of connections that aren't allowed from production.

When i first worked with migrations, they were run manually as part of the release process. Over time, we automated them - by making them part of the release tool. It never occurred to us to make them part of the application itself. For us, it was therefore completely natural that the app's DB credentials did not have DDL permissions.

The whole idea of migrations being run by the app still seems really silly to me. I suppose this is the obvious move for developers who (very sensibly!) build applications, but don't build their own release tools, to whom it would never occur to put migrations anywhere except the application.

> The whole idea of migrations being run by the app still seems really silly to me.

I have one app that kinda runs migrations - there's a helper app which is built at the same time and (normally) runs before the main app. But this is based on a per-user database which means there's no "release" process that could run the migrations.

(Similar to iOS apps, really - when you get a new version of Teleappchatbook, it'll often have an "upgrading..." step which I assume is migrations / index updates, etc.)

Yeah I think it's interesting that companies will setup completely separate dev/prod databases but then grant all permissions to the the user. Roles are powerful and seem underutilized. Most of the time I only need read access so perhaps it's particular to my work, but why not use a dev role, that only has read access, for developing against production data?

Totally agree with you, and now that you bring it up in this way I think I want to change how we execute (Django) migrations at work.

Create a separate user with higher privileges that can run migrations. Use that connection during the deployment phase that executes migrations. Otherwise, the app gets a standard "user" without DDL access.

For what it's worth, truncate should not be a required permission for migrations/ddl tasks

Wow, the issue with the signed token is very interesting. Found it surprising the authentication method specifically wasn't mentioned in the remediation.

Food for thought: the security issue wouldn't have happened if (1) travis used UUIDs instead of sequential IDs as a pkey, or (2) used a secret token for auth instead of a signed (presumably) JWT.

1) They don't need UUIDs as a pkey, but just as another field. I like using auto-incrementing IDs as the primary key, and UUIDs for referencing records over internet-facing APIs. Best of both worlds!

2) Totally, JWTs are evil.

Back in the day uuid as pk was very inefficient due to sorted nature of pk. Time dependent uuid might be less so.

You are right, another uuid field would have solved this.

I don't know much about JWTs. Why are they bad?

JWT itself is a nice container for signing a small amount of JSON and being able to easily pass that around. I use it a lot for situations where I want to ensure someone hasn't futzed with the data, and/or I want an auto-expiring token of some sort.

JWT, by itself, is not an authentication and authorization system, but people often use it as such.

> used a secret token for auth instead of a signed

Seems that you'd still have an issue here, as the secret token would still be linked to a user ID somewhere in the system.

If you use a secret token you would compare the token from the first user 123 to what’s stored in the DB (for the second user ID) see that it doesn’t match and not log them in.

Because they used a signed user ID, there was no way to differentiate the user 123 from before the system restore and after.

Wondering why does any developer need update/delete/drop access to a prod database? Or why would ad hoc scripts have this ability?

Maybe because developers do operations for the production system?

> Maybe because developers do operations for the production system?

Which is probably why they dropped the prod db.

Yeps. Pure operations people don't make mistakes.

That’s a strawman. The argument is “would a devops person focused primarily on deployment be significantly less likely to make errors of this nature?”

When I look at our development, devops and operations people, the likely hood of making stupid mistakes seems to be correlated to skill-level rather than what title the person has. The developers who work part time with operations seems to cause less problems than pure operations-people, maybe because they only do it part time and therefore are a bit more careful.

But I'm sure you can enlighten me with some actual research-based facts?

I thought the same thing. The production database should not even be on the same LAN segment as the development stuff, you should have to VPN or tunnel into it specifically to query it.

I believe they run on Heroku, and there is no such separation possible.

only devops should have this info and they should be guarding it fiercly and only use when they know why/when... it would be easy enough to mirror all production data to a 2nd db w/ full read/write that's updated daily or weekly from source. That should give plenty of data for devs to work with.

isnt the idea of guarding access fiercely from developers the antithesis of devops? That just makes 'devops' people 'ops'

Least privileged access isn't necessarily the antithesis of devops.

You could argue that dev-ops model include automation for operations on production systems, with direct access to production systems limited to a reduced set of staff. Developers can create the change-sets to modify infrastructure, but those change sets are reviewed, validated, tested, and then executed on production via CI/CD Automation.

Infrastructure as Code, and Immutable Infrastructure lends itself well to that approach.

The idea that developers should actually be involved in ops doesn't seem to be DevOps anymore. Apparently that's called "NoOps" now, and DevOps is just ops people scripting things.

Now convince a place that isn't doing this to spend time and resources to put this into place.

'Ship fast and break things' is one of the best and worst cultures in the tech industry.

You'd think a product geared towards development and dev best practices...would maybe use some themselves?

Best practices are always for everybody else, we can skirt around them because ...

TL;DR don't rely on humans to "do the right thing", even if they're supposed to know what they're doing (i.e. ops people).

As part of a team (and an organization) that practices "devops" heavily, all of our devs do ops. We maintain the separation using an oncall/ops rotation and only touching prod from designated "ops" hosts. We also follow a "2 person" rule when touching production. We use a secrets management system to deliver credentials to hosts and alarming setup when the "wrong" hosts (e.g. dev hosts with access to prod creds) have access to certain creds.

A suggestion for production database access, create a readonly login (in addition to a write one).

Login using the readonly login the majority of the time, and only switch to the write login when required.

Finally I can show a solid example to my team mates who ridiculed me when I said we needed restrictions on access to prod servers. This is a great write-up!

Um, you shouldn't need a solid example for such a crystal clear best practice. I mean, everyone has access restrictions to PROD, it's simply to valuable and too costly to have to bring back up after a fuckup.

>" Using our API logs, and with information from our upstream provider about the IP address the query originated from, we were able to identify a truncate query run during tests using the Database Cleaner gem."

I'm assuming by "upstream provider" here they mean ISP/IaaS provider. Either way they didn't have enough information under their control to identify the source of the query. The reliance on a third party for accurate logging information seems like a big blind spot.

What if the upstream provider didn't have the logs? Or the request for access to those took an excessive amount of time? I didn't see anything in the remediation steps to address this.

Interesting writeup. I loath setting environment variables on long running terminal sessions exactly because it’s not obvious once they’re set.

I prefer to use a subshell for the command and set the environment variable each time:

$ ( export FOO=bar; my_cmd )

Or just:

env FOO=bar my_cmd

Or just:

FOO=bar my_cmd

Or just:

     FOO=bar my_cmd
That's the same but with a leading space, to prevent 'bar' from showing up in your histfile.

On bash, this assumes that $HISTCONTROL is set to ignoreboth [0]. This is in the default ~/.bashrc on Ubuntu, but I can't speak about other distros.

I don't use zsh, but I believe the equivalent is $HIST_IGNORE_SPACE [1].

[0] https://linux.die.net/man/1/bash

[1] https://unix.stackexchange.com/questions/6094/is-there-any-w...

One thing that immediately caught my attention: the fact that it is possible for a single query/command/request to wipe everything.

To be frank at a place I worked there had always been something like this too: if you were logged in as super admin, wiping all data is just one POST request away. That was super convenient when testing things, but having the same in production made me uneasy. Fortunately before any incident happened I added additional checks that required special command line flags to enable this API. Perhaps still not super foolproof but I felt much better.

I wish people would talk about this more. I am of the belief that if a single command can lead to a failure like this, you can’t simply plan on that accident not reoccurring. You should basically assume that it will reoccur.

Ideally, I think databases should integrate checks such as these. For example, how often does a production users table need to be truncated intentionally? Even by superusers? Usually very, very, rarely. So imagine if the database made you jump hoops before you could do that. People rely on permissions for this sort of thing but given how complicated permissions can become as the team and the database grows, it’s not hard to screw up the permissions and access control. I believe catching this kind of doomsday scenario is best when built in deeply at a very low level.

What about a database that would require separate DDL from data commands? Instead of having a single super-user that can do both DDL and INSERT/UPDATE/DELETE, you would instead have a DDL user, and a data user. That would probably prompt people to only use the data user in their application?

Shouldn’t there be a remediation step of making it impossible to login into another users’ session? E.g. generate a random number for every provisioned user and add it to the token.

All of my production terminals have dark-red background and my screen hardstatus also red. This is my default in rc files, and I have to explicitly link rc files to get my dev-only black background with lime hardstatus.

How do you set the background colour?

Lots of focus in the comments on the database access issue, but trusting the user specified (signed) token doesn't seem like a great idea. Not validating the token against database seems like a painful shortcut

This makes me think of the Google SRE book. They advise that, if there is a problem this big, any SRE should have the power to turn off the production load balancers until the problem is fixed.

I don't think that TravisCI did anything wrong. However, if they had turned off the load balancers as soon as they realized that there was a huge issue, it might have protected customer data more. They optimized uptime over completely fixing the issue. Also, perhaps nobody felt that they had the authority to turn off the production service.

So the session keys mapped to usernames, rather than IDs in the database? Otherwise, when the database is restored with the old user IDs, the session would become invalid instead of continuing to work. This is what I'm seeing:

1. Tables truncated. 2. In this window, someone creates an account with a username that existed in the dropped database. 3. They see a blank user page because a new user record was created. 4. Database restored. 5. It's as if you're logged into the original user's account.

This is not what happened. The tokens were mapped to user IDs and when people signed in, the db created new users which may have had the same IDs as old deleted accounts. When they restored the DB, these tokens pointed to other users and granted access to these other users' accounts. Quite an unfortunate situation. May have been mostly avoidable if UUIDs were used instead of incrementing IDs, but hindsight is 20/20.

The part of that that I don't get is how a new user could have the same ID as an old (truncated) user since "our system created new records for them, with primary keys generated from the existing sequence (PostgreSQL does not reset id sequences on truncate)."

Do they mean that the only potentially exposed accounts are those that signed up after the database was restored?

It's possible their truncate also restarted the sequence: "TRUNCATE TABLE users RESTART IDENTITY;"


Although, I am confused about how sign ins created new users. When they say sign ins, do they mean new accounts?

Yeah they must mean new accounts, if not then I'm lost. I guess it could have reset autoincrement but they said it didn't. The only other thing I can think of is that the signed token that's put in localStorage is sent to the server like "someuser|sometoken", the server inspects sometoken, says it checks out, then takes the client at its word that it's someuser.

A sign-in can result in an automatic account creation when using federated identity.

Yeah, it mentions that it's only accounts created after restore.

"hindsight is 20/20"

Absolutely, but at the same time I now feel vindicated about using UUIDs with my stuff :D

Reminds me of this:

"Why Auto Increment Is A Terrible Idea" (2015) https://www.clever-cloud.com/blog/engineering/2015/05/20/why...

(update: link fixed).

Amazed to see how transparently they have written the post. I think we all can learn from such outages[0]

[0]: https://about.gitlab.com/2017/02/10/postmortem-of-database-o...

Classic case of Developer returning to window with prod env setup. I am sure it was a "blameless post-mortem" i.e action item contains change in tooling and processes rather than trying to change human behaviour.

> ... i.e action item contains change in tooling and processes rather than trying to change human behaviour.

Contrary to what we might hope, in my experience the former is often much easier to ensure and enforce than the latter.

Aren’t these the folks that spammed every github repo with a spam pull request to integrate their system into your code? I kinda lost all respect for this project and their developers after that incident.

Apologies that you were affected by this. The script creating these pull requests was created and run by a third party not affiliated with the company. We were similarly upset by this.

> created and run by a third party not affiliated with the company.

Are you saying you never hired the third party, they went about and did this of their own accord without talking to you first?

Correct, this was from an overenthusiastic user, someone we did not know and had no direct contact with.

At the time we weren't actually making money yet, and most of the contributions to Travis CI came from outside collaborators.

To add to the confusion, we did indeed have a bot in the early days that would comment on pull requests, but only if the repository was using Travis CI already (this has now been replaced by GitHub's status API). However, this was not the same bot account that kept opening unsolicited pull requests on random projects.

Thanks for the clarification -- is there an official writeup about the incident somewhere? (should have asked that initially, sorry)

I was looking for one as well, but it seems we did not write a blog post. I will do some digging when I find the time, as I know we at least messaged some people that voiced their frustration directly.

Has anyone estimated whether SaaS in general reduce or improve the uptime in aggregate over all users ?!

The obvious arguments is that a specialized SaaS is more reliable, but the rare outages are horrific...

Why does an extremely dangerous tool, such as a database cleaning tool/library, rely on an environment variable to define the target?

What's the ideal way to do so? Most of the production systems I've seen distinguish between prod and staging with only this.

Site is down for me, but this mirror isn't:


What is a "read-only follower"? Is this a common term when handling databases? Is it different than a slave?

It may or may not be the same.

I will answer in the context of PostgreSQL which is the system I know the best. PostgreSQL allows you to create "copies" of the database using replication. A database which is replicating is always read-only. The database from which it is replicating can be read-write (a master) or read-only. In other words, replicas can replicate from replicas in a tree structure.

Any PostgreSQL server which is replicating is referred to in the documentation as a "standby" or "slave". The terms are used interchangeably.

The replication can be synchronous, in which case the "master" and "slave" have exactly the same set of committed transactions at all times. Or, it can be asynchronous, in which case the slave might possibly lag behind the master. Only a master can have a synchronous replica. It used to be that you could only have one synchronous standby, but with recent PostgreSQL you can have as many as you like.

A standby is called "warm" if it is not serving client requests. It's "hot" if it is. Of course, standbys can only serve read traffic.

To answer your question, "read-only follower" is probably a "hot standby". At least, a Heroku follower is generally there to serve read traffic.



Many people are using leader/follower instead of master/slave.

First time I hear of this, who uses this and why?

"master/slave" evokes the unhappy history of slavery, which people care about to widely varying degrees.

"leader/follower" or "primary/replica" are much more neutral terms that won't prompt negative emotions in many people.

People who choose one of the latter two options do so either because they feel it is more accurate, or because they wish to avoid the negative connotations of "master/slave", or a mix of both.

It's better to stick to tried and true terminology that everybody understands than to needlessly introduce new and redundant designations for concepts that have been in use for decades, just to avoid upsetting rather irrational American sensibilities.

Leader/follower and primary/replica are common terminology now. It's unfair to paint the sensibilities as irrational - this is exactly an example of the kind of nonchalance that helps lead to underrepresentation of minorities in tech.

You might not care, but there are a lot of people that do. Taking steps like this improves the comfort level others while affecting yours none. Why is that not worth it?

Primary/replica is also _much clearer about what's actually going on_ than master/slave.

So now we have three pairs of terms that apparently mean exactly the same thing. I'm sure that whatever "minorities in tech" are more interested in not being needlessly confused, than not seeing the word "slave". Who would even think of this in terms of human slavery, and why would it be connected to american minorities only?

I'm currently job-seeking and I've seen many jobs ads asking for CI experience. I'm not fond of using SaaS solutions and would like to fiddle with CI in private (e.g. using a private gitlab repo.

What would be the steps to setup an own, private and open source CI solution for, say, a Go, PHP, or JavaScript project?

You can run GitLab yourself, for free. Or even just supply your runner on GitLab.com: the builds run on your own equipment, but you don't need to host your own GitLab instance.

Jenkins has existed for a long time.

It has plugins for many things, and you can just run shell scripts when those don't work for you.

The Job DSL plugin is good for putting everything in code and generating jobs programatically.

There are a few other CI tools, GitLab itself you mention has been a full featured CI solution for quite a while and you can host it yourself.

Great, it also has a Docker image apparently.

Amazed at their transparency!

Attack of Little Bobby Tables

Not being snarky. How hard is it to setup DB replication and do testing/QA on that DB? Isn't it the SOP?

Why the remediation list doesn't include it?

Presumably because that's already SOP. It sounds like the query was supposed to run against a development or staging DB, but an environment variable that the dev wasn't aware of caused it to run against prod instead.

Great writeup.

This is the third case I'm aware of where CI deleted the production database. Others are GitHub (back in 2010: https://blog.github.com/2010-11-15-today-s-outage/) and LivingSocial.

A video of Zach Holman talking about dropping DB in production twice where they ran CI over production credentials : https://www.youtube.com/watch?v=AwXhckRN6Mc#t=5m50s

Great video overall about how to handle these situations

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact