That said, I have to admit to having at least three eye-bulge WTF moments while reading this.
I'm also surprised that there isn't a Remediation step of "firewall the development machines away from the production database".
(And isn't the change to database_cleaner to make it throw when run against remote databases by default a serious break of API compatibility? What if someone's depending on that behavior?)
What a great way to see failure: what went wrong and how to improve, without blaming it on the poor guy. I wish all companies were like this - very forward-thinking!
Funny thing is... this morning on my commute I was watching https://www.oreilly.com/ideas/developing-a-chaos-architectur... - which I recommend to anyone!
However, I do have a niggling worry that letting people off the hook risks stunting their personal growth. If you can make mistakes and not bear any responsibility for them, what is driving you to get better at doing your job?
It could be any of the following:
1) ambition: desire to improve yourself
2) fear: desire not to screw up (and feel that you are making mistakes)
3) fear: job security
In all cases what's essential is a feedback system. The fact that you are not the one directly blamed doesn't mean you are automatically not accountable for anything you do. Everyone shares at least part of the blame.
Creating a blameless culture and having a good feedback system in place - with maybe some goals, objectives, and ways to improve - is probably something you need if you want to keep your employees and have them grow with you, rather than treating them as expendable.
Good management should be able to do this. If someone is not fit for a role, a good manager will be able to find another way for this person to be useful. The last thing you want to do is to fire someone. At least that's my opinion. Again, management.
A developer who's able to easily run scripts in production without any checks in place is different than a developer who knowingly sidesteps existing processes in order to perform a task faster or easier.
The punishment for one versus the other should be very different.
You really don't want to be the butt of all the "Joe the prod database killer" jokes.
It's also interesting how many people seem to be taking this idea, meant to apply to internal processes and individuals, and applying it to the companies themselves.
The most significant example was Gitlab: they had something like four backup mechanisms for their database, yet three(!) of them had not been working for months, with nobody noticing. Then, someone actually dropped the production database. The last working backup system also had some problems I can't remember right now.
Yet in the comments here, there was barely a hint of criticism of Gitlab. Instead, they were lauded for not lynching the poor guy who made the last of what must have been dozens of terrible decisions leading to the incident.
Travis' incident here does actually seem like the sort of freak accident that one can't rule out completely, at least not at a company of their size. But forgiveness might not always be warranted.
On the contrary - the entire reason why you transition from culture to process as you grow is because when you measure by results, there's no difference between unintentional negligence (leaving a production terminal open à la Travis), intentional negligence (let me bypass this annoying check, let me procrastinate on fixing the backups à la GitLab), and malice.
If you get large enough, you will have grossly negligent people, by statistical inevitability. You can either accept this statistical inevitability and design your process for it or you can continue to believe that you (and everybody else) really actually do only ever hire the very best.
The entire post laments the fact that entities external to the company will repeatedly forgive the company if the company goes through the motions of publishing a blameless post-mortem - even if those post-mortems repeatedly indicate the company really didn't learn anything or change the processes that led up to the incident.
But by golly, they're a forward-thinking company, and therefore blameless.
The threads are still there. You can go back and read them, and notice that quite a few comments are along the lines of "not an individual fault, but catastrophic on the group level", "I don't think I can trust Gitlab to run a production service", "How the hell do you have all your backups failing and not notice?!".
If you make a mistake (a real one that causes problems for the company and its users), that is enough for most people to want to avoid it happening again.
I prefer thorough E2E testing to mocked large-component integration testing (i.e. spinning up a DB vs. mocking it out), so I can totally understand needing to run tests against a database, but yeah, reading this was a collar-pulling "yikes" moment for me.
Their remediation steps are good though, for a start... They might want to look into spinning up completely new databases programmatically for tests rather than connecting to an existing one...
Postgres's pg_hba.conf supports restricting access by IP address easily enough. Access to the production databases should be limited to the IP addresses that actually need it. Or, if not in Postgres, a firewall could limit it.
So if a developer accidentally tried to wipe it, they wouldn't be allowed to connect. If they actually needed to connect, they would first have to add their ip address (or use a bastion host).
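For illustration, a pg_hba.conf along these lines (addresses and names are made up) would only let the app servers and a bastion host talk to the production database:

    # TYPE  DATABASE        USER      ADDRESS        METHOD
    host    app_production  app_user  10.0.1.0/24    md5     # app servers
    host    app_production  admin     10.0.2.5/32    md5     # bastion host
    host    all             all       0.0.0.0/0      reject  # everyone else

Anyone else - including a developer's laptop with a stale DATABASE_URL - simply can't connect.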
There’s zero reason to do this.
It's possible they disabled the localhost check at some point.
Edit: It does now, after Travis folk fixed it. :D
And, Rails (which I assume they're using) typically loads DB credentials from a `database.yml` file for a particular `RAILS_ENV` (local dev, local automated testing, prod). By setting the `DATABASE_URL` env var, they are overriding that (this is fine/expected behavior for app deployment as far as Rails is concerned; it's comparatively less common for localdev). So that bypassed the `RAILS_ENV=prod` check that might also have caught this (e.g. `RAILS_ENV` was undefined or `development`).
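For anyone who hasn't seen the Rails convention being described, it looks roughly like this (values are illustrative):

    # config/database.yml - one block per RAILS_ENV
    development:
      adapter: postgresql
      database: myapp_development
      host: localhost
    test:
      adapter: postgresql
      database: myapp_test
      host: localhost

    # ...but an environment variable takes precedence over all of it:
    $ DATABASE_URL=postgres://user:pass@prod-db.internal/myapp_production bundle exec rake test

which is how a test run can quietly end up pointed at production even though RAILS_ENV says otherwise.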
So I'd really have loved it if they dug into some of the UX factors to this issue:
* Why do they allow remote access to their production database? (Is it too hard to get a local database copy? "For debugging purposes"? Why was the eng in prod the day before? Was it unintentional? Should we revisit those assumptions? Should we resolve whatever issue that caused us to shortcut by allowing remote, direct prod db access?)
* ~~Why DB cleaner allowed non-local (older version? misconfigured? etc.)~~
* Why, for localdev, they override database.yml and use env vars (which I've found to be more cumbersome and less common). Yeah, in production you should use RAILS_ENV/DATABASE_URL/etc. - so are they attempting to have prod parity? Why or why not?
* Why are folks routinely logging in (for debugging purposes) with a DB user that is, effectively, root? I bet that user can `drop table` too. Should folks have a limited access account and then jump through a hoop or two for root?
It sounds like they want to "debug" prod by running some local rails tools (rails c?) with DATABASE_URL set to the prod db. Is that the "best" way to do it? Heroku etc. actually open a rails console in the environment, not locally.
I can think of the following:
- Don't use auth cookies that are signed messages consisting of a UID + expiration date and other data; use auth cookies that are opaque keys into some valid-auth database (a minimal sketch of such a lookup table follows this list). This is significantly less efficient (every operation needs a lookup into the DB before you can do anything; if you move it into a cache you now risk the cache being out of date with your DB). AFAIK using signed UIDs has no security downside other than this, right?
- Identify users by usernames, not by UIDs. This makes renaming users (which GitHub allows, so Travis is forced to allow) difficult and security-risky.
- Use UIDs that are selected from a large random space to make collisions unlikely, e.g., UUIDs or preferably 256-bit random strings. This seems fine and probably preferable from a security point of view. Is this fine from the DB point of view?
Anything else? Maybe a DB restore-from-backup option that preserves autoincrement counters and nothing else - is that a standard tool?
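For the first option, the valid-auth table can be very small; a sketch (names invented) of what the opaque-token lookup might look like:

    -- The cookie value is just a random key into this table. Restoring the DB
    -- from backup restores (or invalidates) sessions along with everything else,
    -- instead of old signed cookies silently pointing at whoever now owns a
    -- recycled auto-increment UID.
    CREATE TABLE sessions (
        token      text PRIMARY KEY,      -- e.g. 256 random bits, base64-encoded
        user_id    bigint NOT NULL REFERENCES users (id),
        expires_at timestamptz NOT NULL
    );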
The UIDs just make for easier-to-read logs, and easier inspection and hand-writing of db queries. The UUIDs just seem much more secure when communicating with client applications.
I also agree that authing with a DB is preferable, if you can afford to do that at the scale your app needs (which is most apps out there).
One more benefit of using UUIDs for primary keys is that clients can generate models along with the primary key and know what the key will be BEFORE they submit it to the database. That is, it works really well with distributed systems. An auto-incrementing primary key is really state stored at the database level, and a source of contention.
I would even argue that UUIDs make logs far easier to read. You do a grep for that UUID in the log and it's easier to find than grepping for integer primary keys. If your log contains disparate models, then UUIDs will find your entries more easily than integers.
If you ever have to roll back your database but keep some of the new data, it's easier also. You just export the data you want to keep, do your rollback, and then import that data, knowing that auto-incrementing primary keys will never be an issue.
You can copy objects easily from one database to another and know the primary key will be the same, and not clash.
For hand writing db queries, it's also better to use a UUID since then I can use that same query cut and pasted into other databases with the same data and know that I don't have to check primary keys. The UUID is just a copy/paste.
We have a table with a UUID as its primary key and it alone consumes 22GB; then an index references it, so now another 22GB... That one primary key uses over 100GB of storage. A developer recently went to add another table that references it, and we had to decide whether we were okay taking another hit. If we used your example of a char type (we use blob) it would be double the size...
It doesn't sound like much, but in a database with all primary keys being UUIDs you are going to inflate the size of your DB quickly, or have to forego using foreign keys. I imagine if we used only UUIDs we would have to double or triple our database infrastructure. Additionally, we would be forced to introduce partitioning sooner.
Now we have a tech debt ticket to add an auto increment to that table so we can reclaim disk / memory.
I think you're greatly exaggerating how big UUIDs are compared to ints. The upper bound is 4x, and that's if your table is keys only, with no other data. That falls down to 2x if you're using bigints. In a typical user table with a bunch of other fields, like a username + password hash, the storage used by the keys is quickly dwarfed by the storage used for "other data".
Memory and disk space is cheap.
Also, UUIDs are just 128-bit integers internally. My example clearly points out that most modern databases have a native uuid type (and if not, use a 128-bit integer).
Using a char type to store UUIDs is very wasteful and no wonder your indexes are so huge.
The tech debt ticket you really have is to convert type of uuid column from char/blob to native.
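For reference, the difference being described, in Postgres terms (gen_random_uuid() comes from the pgcrypto extension on the versions in common use at the time):

    -- Text/char storage: 36+ bytes per key, repeated in every index and FK.
    CREATE TABLE users_char (id char(36) PRIMARY KEY);

    -- Native type: 16 bytes per key.
    CREATE EXTENSION IF NOT EXISTS pgcrypto;
    CREATE TABLE users_uuid (id uuid PRIMARY KEY DEFAULT gen_random_uuid());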
Note also that if you need to expose your data, you can either expose the auto-incrementing primary key (which leaks data, in that other people can work out growth rates of various models, like how England worked out how many German tanks were being built based on auto-incrementing serial numbers), or you create another column (as someone else suggested) which is a uuid. Which you would need to index anyway.
If index size is an issue then you can just use integers which are randomly generated. You can calculate your collision chance based on how large your integer is and how often you create one. If you happen to pick 128-bit integers, congratulations, you're using UUIDs.
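(Back-of-the-envelope, using the birthday approximation p ≈ n²/2N: a million randomly generated 64-bit IDs collide with probability roughly 10¹²/(2·2⁶⁴) ≈ 3×10⁻⁸, while the same million IDs drawn from a 32-bit space make a collision all but certain.)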
That was a hypothetical; it's 22GiB using BLOB storage.
>Note also that if you need to expose your data, you can either expose the auto-incrementing primary key (which leaks data, in that other people can work out growth rates of various models, like how England worked out how many German tanks were being built based on auto-incrementing serial numbers), or you create another column (as someone else suggested) which is a uuid. Which you would need to index anyway.
I stated that we use the auto-increment inside the application; everything exposed by the API is a UUID. While we lose 32 bits of space per row, just one foreign key or index saves enough space to justify it.
Best of both worlds.
>Memory and disk space is cheap.
Not when you need to move from multiple r4.8XLs to r4.16XLs (think multi-AZ, multi-region). It would work out to around double the price; we could hire a developer for the cost of hosting that infrastructure. Even when your business has the money, you don't want to be the guy telling them to drop another $24k a month on hosting costs because we didn't think one year ahead.
This of course ignores all the issues that come with replication and partitioning, multi-master etc.
I'm not following your math here. Can you elaborate? I see 22 and 22 and the 100.
The ORM I wrote (and use...) automatically generates an internal and an external identifier for every persisted object. Works pretty well in practice.
The other option is to use snowflake IDs with a random suffix. They will still sort like normal IDs but won't collide if you roll back (or are unlikely to collide).
UUIDs or fully random strings are a bit meh as a primary column in a database. They disconnect all meaning and potential sorting from the records (sorting by primary key is meaningless, and random PKs may ruin database performance depending on the setup).
Snowflakes will behave like a UUID or random string when properly engineered but are still sortable.
Identifying users by usernames also leads to all kinds of silliness; internally you should be using the ID and not the username. Usernames are meant for human consumption, not machine consumption.
PostgreSQL supports calling functions as the default value of a column so it should be possible to simply put in a function that returns such a snowflake (either in pure SQL or in PL/Python) and PG will use that.
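A rough sketch of what that could look like in plain SQL - the layout here (millisecond timestamp above 22 low random bits, custom 2018 epoch) is just one possible choice, not a standard:

    CREATE OR REPLACE FUNCTION gen_snowflake() RETURNS bigint AS $$
      SELECT ((floor(extract(epoch FROM clock_timestamp()) * 1000)::bigint
               - 1514764800000)                      -- ms since 2018-01-01
              << 22)
             | floor(random() * 4194304)::bigint;    -- 22 random bits
    $$ LANGUAGE sql;

    CREATE TABLE events (
        id      bigint PRIMARY KEY DEFAULT gen_snowflake(),
        payload text
    );

IDs generated this way still sort roughly by creation time, and a restore doesn't cause old ranges to be reused, since new IDs carry later timestamps.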
Second, if your tests run a lot faster than actual loads, consider using a special test timestamp format that allows for higher throughput at the expense of longevity (i.e., a 10-bit timestamp since the start of the test + 54 bits of random data), or alternatively throttle the tests to more closely resemble real workloads (within one or two orders of magnitude).
You can also replace the random part with a Node ID + atomic counter. Each node that generates IDs gets a unique identifier, which is inserted after the timestamp. Then the last few bits of the ID are used by an atomic counter. This ensures that nodes don't generate IDs that collide with each other (the Node ID part) or with themselves (the atomic part).
Right, of course. I was using the UNIQUE/PK columns. When I said collision, I did mean that db threw errors on duplicates. Which is, of course, expected.
> You can also replace the random part with a Node ID
What is a "Node ID"?
Sonyflake implements a 16bit machine ID: https://github.com/sony/sonyflake
It prevents multiple machines producing colliding numbers.
It is trivial to have an app request and reserve a pool of IDs and assign from them as needed. It guarantees no overlap, is completely distributed, requires no db roundtrip to insert, can remain as numeric data, allows every row to be uniquely identified across tables or databases, still has loose ordering, and it only requires a system with atomic increments to implement.
From where? If the counter for this pool of IDs lives in the database being rolled back, it is vulnerable in the same way as auto-incrementing IDs. If it doesn't, it isn't - but the safety you're getting comes simply from the fact that it doesn't get rolled back, not the rest of the machinery. It would be enough to make sure you don't roll back autoincrement counters when rolling back the rest of the database.
That said, I am a fan of the approach you outline. I learned it from the IdGenerator in ATG Dynamo (c. 2500 - 2200 BC).
All the benefits help offset the extra dependency and make database backups much simpler, especially since every row has a unique id.
Increment the "ids" key by Y amount, get the current number N, and you now have N - Y as the range to work with. No paxos or network splits to worry about, and far better performance since its 1 call to get a batch of ids.
And, honestly, most sites I've seen don't need the cache, they can just run on top of the SQL database itself. The fact that you're not serving hundreds of queries per second doesn't mean you don't need security.
Having run Zookeeper, "No paxos or network splits" isn't really how I would describe it. (Nor "super simple to operate.") When it works, it works well, but it's another thing that you need to operate. Adding a dependency on a cloud service just to supply autoincrement doesn't really sound like a great solution. Especially because, by the nature of the problem, I am outsourcing auth to the cloud service: a malicious service can pretty easily issue themselves token for an arbitrary user. And if I'm not already comfortable outsourcing auth, e.g., because everything is in this cloud already, it doesn't seem like this actually solves my original problem which was increasing security.
None of this is directly related to security. The issue for them was putting ids in the tokens, and there's no way to solve that other than not doing it. However if they didn't reuse ids, then it wouldn't have mattered, and that part can be solved by moving the id generation out of the database so that backup/restores of the database are simpler. This is in addition to all the other benefits of separate id generation I first mentioned.
I was expecting something like this. I remember configuring my terminal windows to change their background when I'm on production systems, around the time I read about the GitLab database incident.
1 : https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-...
2 : http://www.drinchev.com/blog/ssh-and-terminal-background/
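A cruder variant that doesn't depend on terminal-specific escape sequences is to just make the prompt itself unmissable on production boxes, e.g. in ~/.bashrc (the hostname pattern is made up):

    case "$(hostname)" in
      *prod*) PS1='\[\e[41;97m\][PROD]\[\e[0m\] \u@\h:\w\$ ' ;;  # red background
      *)      PS1='\u@\h:\w\$ ' ;;
    esac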
> we connected our development environment to a production database with write access
i.e. they weren't logged in to a production machine, they were logged in to a development machine that was permitted to connect directly with write access to the production database.
I use something like this today on my personal machine to distinguish between my personal and work email addresses in $EMAIL (for silly firewall reasons it's easier to originate work-owned OSS on my personal machine), and on my work machine to tell me which production zone I'm talking to - but I don't have it color when I have any variable set at all, I probably should do that.
It's the most convenient way for me to get a production shell:
CMD+O -> "prod" -> Enter
Adding a Metabase instance with read-only credentials has helped this a lot - because executing one-off data queries is easy (and safe) now.
Do they share the same password? Or is the password in the URL?
For example: postgresql://[user[:password]@][netloc][:port][/dbname][?param1=value1&...]
That being said I am amazed at their transparency over the whole issue and a thorough write up of the whole incident. It's something we can all learn from.
For example, administrative access to Vault requires a root token in the environment of the shell, so that shell is tainted with too much access. Having an ssh command to a production data store also taints a shell, because it's really dangerous to accidentally up-arrow-enter and end up on a production data store.
History has been a bit annoying with that, but since I installed histdb, that has been a non-issue.
zsh-histdb replaces the single-file ~/.zsh_history with an sqlite3 database - and this database is session-aware. This way I can have as many concurrent sessions as I want, and create / destroy shell sessions as much as I want, and the shell history remains intact.
$ cat /etc/profile.d/timeout.sh
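(The file's contents didn't survive the quoting, but the sh/bash equivalent of the csh autologout below is presumably something along the lines of:)

    # auto-logout idle shells after 10 minutes; TMOUT is in seconds
    TMOUT=600
    readonly TMOUT
    export TMOUT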
For the hosts with csh as the default, it's:
$ cat /etc/profile.d/timeout.csh
set -r autologout=10
Or, is this stating that the developer started running tests from the production environment?
tmux is a terminal multiplexer, a screen alternative.
Some speculation: they might have started the session on a remote machine, set `DATABASE_URL`, used it for something, detached from the session, and later reattached with the bash env still intact, without the developer being aware of it when he executed the test commands.
Another tmux "problem" is developers using it to tail logs with an infinite back-scroll, only to fill up server memory. One solution was to kill all tmux sessions nightly via cron. Might have helped here.
Finally, as a TUI lover: try tmux! It really keeps the frustration at bay to know that if you ssh to a box and your session drops, your terminal workspace is still intact.
I read it as: a couple days ago, in their local terminal, they exported the prod db as an environment variable, ran some script or console or something that needed to run against prod, and didn't unset the variable. Then the next time they ran any command in that terminal, it hit prod.
Maybe it was a remote machine, but there wasn't an active db connection, just a hidden config that exploded the next time a command ran.
1. Vault the passwords. People and machines should fetch passwords on-demand using identity credentials.
2. Create a read-only database account. In all cases, use the account that matches the need. Running reports? Use the read-only account (a minimal sketch of such an account follows this list).
3. Restrict access to read-only and read-write database accounts. Provide this account information to a limited set of people and tools.
4. Provide a fairly straightforward way for people to get temporary elevated access. If it's easy to get elevated access, then users will not be tempted to "hold on" to elevated access longer than they should (e.g. by leaving a terminal open for a very long time).
5. Rotate the credentials of all the accounts regularly. This ensures that temporary elevated access doesn't quietly become long-term access. It also greatly reduces the harm created when credentials are leaked, exposed, or forgotten about (e.g. in an environment variable in an old window).
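As a minimal Postgres sketch of the read-only account in step 2 (names are placeholders):

    CREATE ROLE reporting LOGIN PASSWORD '...';
    GRANT CONNECT ON DATABASE app_production TO reporting;
    GRANT USAGE ON SCHEMA public TO reporting;
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO reporting;
    -- and for tables created later:
    ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO reporting;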
Note that none of the steps above require a heavy investment in automation. You can start with basic (even fully manual) processes for key management and access management, and evolve to automation as you grow.
Finally, keep in mind that this type of accident is not just a "small company" problem. Recall this AWS ELB outage on Christmas Eve of 2012 - https://aws.amazon.com/message/680587/
"The data was deleted by a maintenance process that was inadvertently run against the production ELB state data."
Or to hack around it in some other way, eg by adding some sort of backdoor to a production application.
This is such an important lesson, and one that unfortunately is lost on many command-and-control type IT departments.
When I design a DB system, ideally the production 'user' can only do those things we reasonably expect it to need to do, and truncate isn't one of them. Neither is dropping tables, potentially deleting entries, or any maintenance tasks. DDL modifications are right out.
Those tasks can be run by a specific user, locked down to certain types of connections that aren't allowed from production.
The whole idea of migrations being run by the app still seems really silly to me. I suppose this is the obvious move for developers who (very sensibly!) build applications, but don't build their own release tools, to whom it would never occur to put migrations anywhere except the application.
I have one app that kinda runs migrations - there's a helper app which is built at the same time and (normally) runs before the main app. But this is based on a per-user database which means there's no "release" process that could run the migrations.
(Similar to iOS apps, really - when you get a new version of Teleappchatbook, it'll often have an "upgrading..." step which I assume is migrations / index updates, etc.)
Create a separate user with higher privileges that can run migrations. Use that connection during the deployment phase that executes migrations. Otherwise, the app gets a standard "user" without DDL access.
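In Postgres terms the split might look like this (role names invented; the deploy step is then pointed at the migrator credentials, e.g. via its own DATABASE_URL):

    -- Day-to-day app role: plain DML only - no DDL and, notably, no TRUNCATE.
    CREATE ROLE app_rw LOGIN PASSWORD '...';
    GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_rw;
    GRANT USAGE ON ALL SEQUENCES IN SCHEMA public TO app_rw;

    -- Migration role: owns the tables, so only it can CREATE/ALTER/DROP them.
    CREATE ROLE migrator LOGIN PASSWORD '...';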
Food for thought: the security issue wouldn't have happened if Travis had (1) used UUIDs instead of sequential IDs as a pkey, or (2) used a secret token for auth instead of a signed (presumably) JWT.
2) Totally, JWTs are evil.
You are right, another uuid field would have solved this.
JWT, by itself, is not an authentication and authorization system, but people often use it as such.
Seems that you'd still have an issue here, as the secret token would still be linked to a user ID somewhere in the system.
Because they used a signed user ID, there was no way to differentiate the user 123 from before the system restore and after.
Which is probably why they dropped the prod db.
But I'm sure you can enlighten me with some actual research-based facts?
You could argue that the dev-ops model includes automation for operations on production systems, with direct access to production systems limited to a reduced set of staff. Developers can create the change-sets to modify infrastructure, but those change-sets are reviewed, validated, tested, and then executed on production via CI/CD automation.
Infrastructure as Code, and Immutable Infrastructure lends itself well to that approach.
'Ship fast and break things' is one of the best and worst cultures in the tech industry.
As part of a team (and an organization) that practices "devops" heavily, all of our devs do ops. We maintain the separation using an oncall/ops rotation and only touching prod from designated "ops" hosts. We also follow a "2 person" rule when touching production. We use a secrets management system to deliver credentials to hosts and alarming setup when the "wrong" hosts (e.g. dev hosts with access to prod creds) have access to certain creds.
Login using the readonly login the majority of the time, and only switch to the write login when required.
I'm assuming by "upstream provider" here they mean ISP/IaaS provider. Either way they didn't have enough information under their control to identify the source of the query. The reliance on a third party for accurate logging information seems like a big blind spot.
What if the upstream provider didn't have the logs? Or the request for access to those took an excessive amount of time? I didn't see anything in the remediation steps to address this.
I prefer to use a subshell for the command and set the environment variable each time:
$ ( export FOO=bar; my_cmd )
env FOO=bar my_cmd
I don't use zsh, but I believe the equivalent is `setopt HIST_IGNORE_SPACE`.
To be frank, at a place I worked there had always been something like this too: if you were logged in as super admin, wiping all data was just one POST request away. That was super convenient when testing things, but having the same in production made me uneasy. Fortunately, before any incident happened, I added additional checks that required special command line flags to enable this API. Perhaps still not super foolproof, but I felt much better.
Ideally, I think databases should integrate checks such as these. For example, how often does a production users table need to be truncated intentionally? Even by superusers? Usually very, very, rarely. So imagine if the database made you jump hoops before you could do that. People rely on permissions for this sort of thing but given how complicated permissions can become as the team and the database grows, it’s not hard to screw up the permissions and access control. I believe catching this kind of doomsday scenario is best when built in deeply at a very low level.
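You can get part of the way there today with a statement-level trigger; a rough Postgres sketch (the setting name is made up):

    CREATE FUNCTION block_truncate() RETURNS trigger AS $$
    BEGIN
      -- refuse TRUNCATE unless this session has explicitly opted in
      IF current_setting('app.allow_truncate', true) IS DISTINCT FROM 'yes' THEN
        RAISE EXCEPTION 'TRUNCATE of % blocked; SET app.allow_truncate = ''yes'' first',
          TG_TABLE_NAME;
      END IF;
      RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER users_no_truncate
      BEFORE TRUNCATE ON users
      FOR EACH STATEMENT EXECUTE PROCEDURE block_truncate();

It's not a substitute for proper permissions, but it turns a one-line disaster into a deliberate two-step operation.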
I don't think that TravisCI did anything wrong. However, if they had turned off the load balancers as soon as they realized that there was a huge issue, it might have protected customer data more. They optimized uptime over completely fixing the issue. Also, perhaps nobody felt that they had the authority to turn off the production service.
1. Tables truncated.
2. In this window, someone creates an account with a username that existed in the dropped database.
3. They see a blank user page because a new user record was created.
4. Database restored.
5. It's as if you're logged into the original user's account.
Do they mean that the only potentially exposed accounts are those that signed up after the database was restored?
Although, I am confused about how sign ins created new users. When they say sign ins, do they mean new accounts?
Absolutely, but at the same time I now feel vindicated about using UUIDs with my stuff :D
"Why Auto Increment Is A Terrible Idea" (2015) https://www.clever-cloud.com/blog/engineering/2015/05/20/why...
(update: link fixed).
Contrary to what we might hope, in my experience the former is often much easier to ensure and enforce than the latter.
Are you saying you never hired the third party, they went about and did this of their own accord without talking to you first?
At the time we weren't actually making money yet, and most of the contributions to Travis CI came from outside collaborators.
To add to the confusion, we did indeed have a bot in the early days that would comment on pull requests, but only if the repository was using Travis CI already (this has now been replaced by GitHub's status API). However, this was not the same bot account that kept opening unsolicited pull requests on random projects.
The obvious argument is that a specialized SaaS is more reliable, but the rare outages are horrific...
I will answer in the context of PostgreSQL which is the system I know the best. PostgreSQL allows you to create "copies" of the database using replication. A database which is replicating is always read-only. The database from which it is replicating can be read-write (a master) or read-only. In other words, replicas can replicate from replicas in a tree structure.
Any PostgreSQL server which is replicating is referred to in the documentation as a "standby" or "slave". The terms are used interchangeably.
The replication can be synchronous, in which case the "master" and "slave" have exactly the same set of committed transactions at all times. Or, it can be asynchronous, in which case the slave might possibly lag behind the master. Only a master can have a synchronous replica. It used to be that you could only have one synchronous standby, but with recent PostgreSQL you can have as many as you like.
A standby is called "warm" if it is not serving client requests. It's "hot" if it is. Of course, standbys can only serve read traffic.
To answer your question, "read-only follower" is probably a "hot standby". At least, a Heroku follower is generally there to serve read traffic.
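In configuration terms, for the PostgreSQL versions of that era (where the standby-side replication settings still live in recovery.conf; values are illustrative):

    # primary, postgresql.conf - name one standby as synchronous:
    synchronous_standby_names = 'replica1'

    # standby, recovery.conf - replicate from the primary:
    standby_mode = 'on'
    primary_conninfo = 'host=primary.internal user=replicator application_name=replica1'

    # standby, postgresql.conf - also serve read-only queries, i.e. "hot":
    hot_standby = on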
"leader/follower" or "primary/replica" are much more neutral terms that won't prompt negative emotions in many people.
People who choose one of the latter two options do so either because they feel it is more accurate, or because they wish to avoid the negative connotations of "master/slave", or a mix of both.
You might not care, but there are a lot of people that do. Taking steps like this improves the comfort level of others while not affecting yours at all. Why is that not worth it?
It has plugins for many things, and you can just run shell scripts when those don't work for you.
The Job DSL plugin is good for putting everything in code and generating jobs programmatically.
There are a few other CI tools, GitLab itself you mention has been a full featured CI solution for quite a while and you can host it yourself.
Why doesn't the remediation list include it?
This is the third case I'm aware of where CI deleted the production database. Others are GitHub (back in 2010: https://blog.github.com/2010-11-15-today-s-outage/) and LivingSocial.
Great video overall about how to handle these situations