From Chris' Twitter stream (http://twitter.com/#!/defunkt):
Seriously, I blame whoever wrote our crappy continuous integration software.
Oh that's me
Thankfully for me it was a small website and no one really noticed. I can't imagine what that feeling would be like on something like github.
It was not a good day.
I knew it was a bad idea to automatically log in the recipient of the email. But it was the policy, and it's still the case as far as I know. And if you forgot your password, don't worry: it's stored in plaintext...
Or sent a test email to thousands of customers in your prod database encouraging them to use web check-in for their non-existent flight tomorrow.
Yeah, did that five years ago, talk about heart-attack-inducing. Quickly remedied by sending a second email to the same test set, thankfully, but that's the kind of mistake you never forget.
How about the DreamHost case, in which they typed the wrong year into their billing code and charged many of their users for an extra year of service, to the tune of $7.5 million:
“What’s the harm in keeping it flexible?” $7.5M in harm, that’s what! "flexibility" is rarely desired in programming!
Now that would be a bad feeling.
Sounds like the biggest impact was on non-github-users trying to read source or documentation on other people's projects, more so than on actual users.
But GitHub's main value over something like, say, gitosis, is all of the metadata. This stuff is mainly useful to github users. Projects you're watching, pull requests, bug reports, fun (if only of dubious general use) graphs. That stuff doesn't come across the wire with the individual repos. The big (in terms of bytes) part, the "events" table, is helpful for keeping up with people and projects (if sometimes noisy). I use the RSS feed of that, though, so I didn't even notice the interruption until I heard about it here.
Contrast this with something older like subversion and sourceforge. If sourceforge went down you were shit out of luck.
1. Production DB credentials are only stored on the production appservers, and copied in at deploy time (see the sketch after this list).
2. The production DB can only be accessed from the IPs of the production webfarm.
3. Staging, Testing, Development, and Everything Else live on separate networks and machines than production.
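For item 1, a minimal sketch in Capistrano 2 terms (the task name and shared-config path are my own invention): the credentials file lives only on the production boxes and gets linked into each release, so it never touches the repo.

    namespace :deploy do
      desc "Link in the database.yml that exists only on the prod appservers"
      task :link_db_config, :roles => :app do
        run "ln -nfs #{shared_path}/config/database.yml #{release_path}/config/database.yml"
      end
    end
    after "deploy:update_code", "deploy:link_db_config"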
Yes, we've all made silly mistakes. But if you're in that design meeting and somebody asks, should we do ABC in case of XYZ, try not to think about how complicated or time consuming it might be to do ABC. Think about the worst case. If not doing it could at some point bring down the whole business, perhaps you should ponder it some more.
Actually, screw a book... Does anyone else want to start some wiki pages documenting their experiences with screw-ups, the causes, and the solutions? Does this exist in a comprehensive way and I just haven't found it?
I worked, back in the dark ages, for a health insurer that had two parallel environments-- one for testing/development, and one for production. None of the developers were allowed anywhere near the production environment. There was one full time employee, a former developer, whose primary job was to move code into production-- which he would only do when he received a signed document authorizing the change. Said document included the telephone numbers where the developer responsible for the code would be for the next 24 hours, so that one of the operators could call you in the middle of the night if your code caused any problems to the system.
At the time, I used to think that this was ridiculous. After managing a staff of programmers myself, I'm not so sure.
Then again, there is no perfect solution for this, is there? If you tried installing a new version with more debugging points, you'd be deploying something untested over something already unstable, and trying to push more data out might be problematic in itself. I'm not even going into "rollback to stable didn't work" scenarios :(
Deploy management is the combination of a change management system with a deployment tool in a flexible way. So for example, with the right options, in an emergency the manager of a team of developers could deploy anything to the site immediately without requiring approval from change management. The tools are still right there to revert any change, and of course there's snapshots and daily backups for the most critical data.
Except for emergencies, all changes to production would come with an approval from a higher-up with potential code review back-and-forth first. Contact data and reversion capabilities are built in, so everybody knows who did what, how to contact them and how to revert it if they're unavailable. And of course your trending data will tell you when your peak use is and code pushes are typically frozen during that time, minimizing further potential loss.
However, besides having read-only access to production, devs should also have two kinds of testing: "development" and "staging". Development is where the bleeding-edge broken stuff lives and code is written. Staging is an identical machine to those in production. Often you'll see test or qa machines which aren't identical to production, usually because changes aren't pushed to them the way they are in production. The staging machine gets all changes pushed to it like any other production machine, except it lives in a LAN that cannot access production or anything else. A method to reproduce incoming requests and sessions from production system to this staging server will give you a pretty good idea what "real traffic" looks like on this box, if you need it.
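A rough sketch of that last idea as Rack middleware, assuming a hypothetical staging host (a serious setup would also replay sessions and non-GET traffic through a queue):

    require 'net/http'
    require 'uri'

    # Tee a copy of each GET to staging, fire-and-forget, so production
    # responses are never delayed by the mirror.
    class TrafficMirror
      def initialize(app, staging_host)
        @app, @staging_host = app, staging_host
      end

      def call(env)
        if env["REQUEST_METHOD"] == "GET"
          qs   = env["QUERY_STRING"].to_s
          path = env["PATH_INFO"] + (qs.empty? ? "" : "?#{qs}")
          Thread.new do
            Net::HTTP.get_response(URI.parse("http://#{@staging_host}#{path}")) rescue nil
          end
        end
        @app.call(env)
      end
    end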
Have redundant everything.
Provide tools to talk to production systems, databases, but make them read only and make it hard to mess with production systems unless you're reallllllly sure you know what you're doing.
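One concrete way to make "read only" stick at the database layer rather than by convention, assuming MySQL (the account name and network are invented):

    # A SELECT-only MySQL account for developer tooling: a stray script
    # connected as this user can look, but not touch.
    conn = ActiveRecord::Base.connection
    conn.execute("CREATE USER 'dev_ro'@'10.1.%' IDENTIFIED BY 'use-a-real-secret'")
    conn.execute("GRANT SELECT ON app_production.* TO 'dev_ro'@'10.1.%'")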
We have a build master that promotes a reviewed deployment package to QA and/or PROD environments, where the appropriate QA or PROD operations folks do the actual deployment.
It's a luxury to have the resources available for this, but it's a life saver, because it really is stupidly easy to make a simple mistake and totally screw things up.
The last time something similar happened to me, it happened to be at the end of a REALLY long day. And what do you know... that day was then made 24 hours longer, interspersed with the occasional cat nap while backups were being restored and verified.
Fun times. Not.
It's not that hard to keep up really. It's a question of a day or two of setup and then the hardish part of constant discipline to not take the "easy way out" and poke holes into the segregation you've set up. Mainly it takes a single team lead or CTO or whatever to be really explicit that you just don't break the steps and you'll avoid a LOT of problems. You'll still have problems, problems are inevitable, but in general you'll have mitigated them and with a proper backup and merge procedure you'll minimize downtime.
Lesson: don't let anything talk to your production servers.
Make it as difficult as possible for anyone to log in. Treat production as if it were a loaded gun. You shouldn't be touching it unless you absolutely have to, and even then, you need to be vigilantly aware of the consequences of your actions.
What hasn't been said is how refreshing it is to see an honest and quick explanation. I know this type of approach is getting more and more common (see the foursquare outage), but in the grand scale of things, it's still quite rare.
It took several levels of basic mistakes for this to happen AND for the restore to be as slow as it is.
Using MySQL with no transactions, no binary backups, no way to do a quick restore, and no separation of dev/production. DROP as opposed to DELETE cannot be rolled back and is therefore scary -- unless you aren't using transactions in the first place, in which case, WHEE!
And DROPs (or truncates) are almost always used if the goal is to remove all the data in a table, as you would when rebuilding the entire system. A single 100M-record transaction is not generally considered a good thing.
For MySQL, using the default storage engine (MyISAM), a restore can be a simple file copy, if you bothered to make a proper backup of the binary files in the first place.
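To make the DELETE-vs-DROP point concrete, a small ActiveRecord sketch (InnoDB assumed; the Event model is invented):

    # DELETE is row-level and transactional on InnoDB: this rollback undoes it.
    ActiveRecord::Base.transaction do
      Event.delete_all
      raise ActiveRecord::Rollback
    end
    # Event.count is unchanged here.

    # TRUNCATE (like DROP) is DDL in MySQL and commits implicitly, so the
    # "rollback" below rescues nothing; the data is already gone.
    ActiveRecord::Base.transaction do
      ActiveRecord::Base.connection.execute("TRUNCATE TABLE events")
      raise ActiveRecord::Rollback
    end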
If I were running some critical code, I would seriously reconsider github, or at least ask for a detailed explanation of their engineering practices and fail-safe mechanisms.
Github isn't facebook and not seeing the last week's "activities" timeline for a few days isn't really that much of a problem.
Taking a few days to restore a table indicates they only had a text dump of it. The details of how they screwed up indicate they were new to the idea of data robustness, despite it being the core of their business.
Being a competent home user and being a competent data-robustness company involve slightly different levels of expertise.
Backing up and restoring mysql is an orthogonal problem.
The fact that github can operate at all in a degraded mode very much indicates their competence. Many architectures cannot operate without the entire db.
Plus I bet they'll shore up their db operations next. Github is very much a young company that will take a lick or two on their way to greatness. All the startups here will.
Well, I'm just going by TFA, which stated:
"Newly created users and repositories are being restored, but pull request state changes and similar might be gone."
Repos were affected.
> The fact that github can operate at all in a degraded mode very much indicates their competence. Many architectures can not operate without the entire db.
This is a curious statement to make after you started off by recognizing that git handles itself. The only parts of the repositories that broke were the parts they tried to glue to their database.
It's nice of you to give them a pass. I'd say DBAs are pretty horrified. I suppose it's kind of like how the average person can watch CSI where they say, "I'll create a GUI interface in Visual Basic, see if I can track an IP address" and think it's just fine.
When you're in the business of storing other people's data, things like transactions, binary backups, QUICK RESTORES, and separating dev from production shouldn't be afterthoughts. They are core attributes.
I would understand this for a demo, but not after a couple of years and half a million users. They have paying customers.
"Reliable code hosting
We spend all day and night making sure your repositories are secure, backed up and always available."
The code repos were never in danger, and they've been killing it in the market because they are racing to add awesome extra features, not racing to the bottom.
I don't disagree that they can do more in their db operations, and that it's fine for us paying customers to demand more, but the reality of startups is that it very much is a race, and features, whether for availability or recovery or anything else, are ruthlessly prioritized; many things don't happen until something breaks.
If you don't think Github cuts the mustard, svn on Google Code probably won't have problems like this...
Disclaimer: I'm personal friends with much of the Github crew.
Let me introduce you to my friend, GeoCities.
Funny thing about free hosting.
> no matter how much armchair quarterbacking.
Transactions aren't armchair quarterbacking. Binary backups aren't armchair quarterbacking. Separating development from production isn't armchair quarterbacking. Kindly, you have no idea what you're talking about.
You can't give them credit for the repositories being mostly intact, when the ONLY parts that broke were the parts they mucked with to tie them into their database.
> but the reality of startups is that it very much is a race, and features, whether for availability or recovery
And those are exactly the places where they screwed up.
Anyway, it's a great discussion, so we can learn from other people's mistakes. I will be triple-checking my restores later today, and likely halt the project I'm working on to get cold standbys shored up asap. (But it's hard to prioritize housekeeping over customer-centric features in the race I'm running alongside the Github crew.)
Ironically the most fitting application would be to say that they're armchair quarterbacking their own database administering.
These are categorical mistakes.
"If you don't thing Github cuts the mustard, svn on
Google Code probably won't have problems like this..."
Disclaimer: I work for google on the Project Hosting product.
Yes, they missed some pretty obvious things. And what happened? A few hours of downtime because their restore was slow, plus a tiny bit of inconsequential data loss. Hardly a catastrophe.
The fact is that every site has some sort of problems. Many of them will be completely obvious like this one. And while github could have gone through and attempted to fix them all, I much prefer they spend their time doing things that have more than a few hours of impact on my life.
No doubt they will fix the issues involved today.
Beta is the stage you reach AFTER you flip on transactions, institute and test quick restores, and separate dev from production.
Also, from what I know, Github is a flat org where all the devs have the ability to work in production. My company is like this too, and while it sometimes leads to scary mistakes, it also leads to massive productivity over having to go through a release engineering team.
They even brag about it on their main page.
This will be a good reminder for me to always keep a local copy.
Since InnoDB table spaces never shrink, 80G of their truncated data was still all available in a single monolithic ibdata file. An InnoDB recovery tool named page_parser read their 80G ibdata file and spat out a maze of 16k InnoDB page files organized by an arbitrary index id.
There are two internal InnoDB meta tables called SYS_INDEXES and SYS_TABLES which can give you a mapping from table name to PK index id. Unfortunately after the mass TRUNCATE all the tables got new index mappings, so it became a bit of table hide-and-seek.
The InnoDB recovery tools lack a certain polish and maturity. You need to create a C header file for each table you want to recover from the pile of 16k page files, and you end up having to build a separate version of the constraints_parser binary for each table. There were bugs with the output of negative numbers, unicode handling, VARCHAR types of >= 127 characters, and some edge cases where InnoDB chooses to store CHAR(15) types as VARCHARs internally. Aleksander at Percona really saved the day; he was able to find and fix these bugs pretty quickly.
I remember that magic moment when I finally was able to successfully checksum a huge block of the recovered data against the too-old-to-be-useful backup.
"I love the smell of CRC32 in the morning. It
smells like... victory."
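For the curious: on a healthy server you can inspect the same table-to-PK-index mapping that hide-and-seek was about. A sketch using the INFORMATION_SCHEMA views (these arrived around MySQL 5.6; older servers expose SYS_TABLES/SYS_INDEXES only to offline tools like the Percona recovery kit):

    rows = ActiveRecord::Base.connection.select_rows(<<-SQL)
      SELECT t.NAME, i.INDEX_ID
        FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES  t
        JOIN INFORMATION_SCHEMA.INNODB_SYS_INDEXES i USING (TABLE_ID)
       WHERE i.NAME = 'PRIMARY'
    SQL
    rows.each { |table, index_id| puts "#{table} -> PK index #{index_id}" }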
(It's not possible to do this in every case, especially if you make heavy use of stored procedures and triggers, but I don't. If I need client-independent behavior or integrity checks on top of the database, I just use a small RPC server. This makes testing and scaling easier, since there are just simple components that speak over the network. Much easier than predicting everything that could possibly happen to the database.)
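A stdlib-only sketch of that small-RPC-server idea using DRb (the service and its integrity rule are invented for illustration):

    require 'drb/drb'

    # All writes funnel through this one process, so the integrity check
    # lives here instead of in triggers or stored procedures.
    class AccountWriter
      def transfer(from_id, to_id, amount)
        raise ArgumentError, "amount must be positive" unless amount > 0
        # ... perform both UPDATEs inside a single database transaction ...
      end
    end

    DRb.start_service("druby://localhost:8787", AccountWriter.new)
    DRb.thread.join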
The slowness of the restore sounds like they have to insert a text dump back into the database, rather than simply copy over the binary files. Even copying over 1 TB from a single hard drive or between EC2 instances would only take a few hours, not days.
And they weren't using binlogs.
And they didn't separate out their test from production environment, which is understandable when you're in demo mode but not after a couple years and half a million users.
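A binary backup you can restore by copying files back isn't exotic; for example, with Percona XtraBackup (paths and timestamps are hypothetical):

    # Hot binary backup of InnoDB; restoring is a file copy plus a log
    # apply, minutes rather than days.
    system("innobackupex /backups/") or abort("backup failed")
    system("innobackupex --apply-log /backups/2010-11-14_02-00-00/") or abort("prepare failed")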
I spent years in Java prior to Ruby, and across the various Java frameworks I never remember making the mistake, or working with anyone who made the mistake, of blowing away production data when running tests.
Since working in Rails, I've had at least 2-3 experiences on the team with truncated/dropped tables or mangled data in production, and it isn't just the current team.
In many other languages and frameworks, the normal process used by most developers doesn't make it as easy to do this accidentally as it seems to be in Rails 2, due to the conventions used.
With Rails, for the most part the same project is in dev, test, and prod. Developers want database.yml revisioned with the rest of the app. Many Rails developers at some point run the production environment against the production DB locally, in order to run migrations on production prior to Capistrano deploys, etc., or maybe just to run a production console to run reports or tweak data. So there is a very real chance for wild things to happen in such a risky environment. I'm not anti-Rails by any means, but I've not heard of this happening historically in other frameworks as much as it has with Rails.
That way, the app just says "connect me to /x" and the container gives the app /x, which might be the test or prod db connection; the app never knows, or cares. You get the db that the operations guys configured for -that- environment.
JNDI is horrible in many ways, and so are app containers, but there are a few niceties that ruby-land could learn from.
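The closest Rails analog might be keeping database.yml out of the repo and letting each machine's environment answer the "connect me to /x" question, e.g. (the variable names are my own):

    # config/database.yml: operations decide what the app gets, per machine.
    production:
      adapter:  mysql
      database: <%= ENV["APP_DB_NAME"] %>
      host:     <%= ENV["APP_DB_HOST"] %>
      username: <%= ENV["APP_DB_USER"] %>
      password: <%= ENV["APP_DB_PASS"] %>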
So when it's running something like "rake db:reset", which simply recreates all tables & seeds the database, it just looks at your environment and assumes you know what you're doing. Hudson (and other continuous integration systems) are pretty automated so unless it prompted you, it would run the db scripts quietly.
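One cheap mitigation, sketched (the guard task and its override variable are my own; later Rails versions grew a similar "protected environments" check):

    # Refuse destructive db tasks in production unless explicitly overridden.
    task :guard_production do
      if Rails.env.production? && ENV["I_REALLY_MEAN_IT"] != "yes"
        abort "Refusing to run a destructive task against production."
      end
    end

    %w[db:reset db:drop db:schema:load].each do |name|
      Rake::Task[name].enhance([:guard_production]) if Rake::Task.task_defined?(name)
    end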
Any reasonable quality transactional database system has a transaction log for a reason. If you were using commercial DB system like Microsoft SQL Server or Oracle (even a decade ago), this would not be an issue. No data loss. Businesses should care about their data, and I guess this is why the commercial databases are still doing fine in a landscape increasingly dominated by FOSS everywhere.
I realize licensing costs do matter, but I can't fathom why people put up with the sad excuse for an RDBMS that is MySQL. For any nontrivial task it is slow and unreliable; attempting to secure your data by taking database backups (which seemingly can't even provide you with transactional safety!!!) renders the DB unusable and locked while the backup is performed; having databases bigger than what you can store in memory makes it perform like a flat file; etc. etc. ad infinitum.
Surely there must be something better people can use which is still free? Postgres for instance?
Which makes me think: last time I had anything to do with adminning MySQL, binlogs were required for replication, so I'm guessing this means github aren't replicating that database anywhere either...
This is an oversight in github's MySQL configuration (and their post states that they want to fix it).
Commercial databases such as Oracle, SQL Server, and DB2 are built from the ground up to make it possible to recover everything up through the last committed transaction before the incident. I have always presumed that the leading open source traditional relational databases (MySQL and PostgreSQL) provide this same capability.
In the github case (MySQL) it seems that full recovery may have been possible, but it was not possible within an acceptable timeframe. This may be a flaw in the software (lack of features to make quick recovery possible), or a flaw in the strategy (not configured to make quick recovery possible).
However, any of these databases can easily be administered in a manner that allows data loss. The github event is not necessarily an indication of a problem with MySQL. It is unquestionably an indication of a problem with the administration of the database.
Can anyone confirm that full recovery from this type of incident (within a reasonable timeframe) should be possible with both MySQL and PostgreSQL if administered properly?
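For MySQL at least, yes, provided binary logging was on and a base backup exists; PostgreSQL has the analogous WAL-archiving setup. The recipe looks roughly like this (times, file names, and paths are hypothetical):

    # 1. Restore the most recent full backup.
    system("mysql app_production < /backups/2010-11-13-full.sql") or abort("restore failed")

    # 2. Replay binlog events recorded after that backup, stopping just
    #    before the destructive statement.
    cmd = 'mysqlbinlog --start-datetime="2010-11-14 00:00:00" ' \
          '--stop-datetime="2010-11-14 06:29:59" ' \
          '/var/lib/mysql/mysql-bin.000042 | mysql app_production'
    system(cmd) or abort("replay failed")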
I prefer my agility to remain on the dev-and-test side of the fence.
But if you're logging stuff, NILFS rocks:
It's been a great service, and I think as long as this kind of thing is rare, and none of my code repositories get corrupted or destroyed, I plan to stick with them.