Cause of today's Github outage (github.com)
214 points by jlangenauer on Nov 15, 2010 | 115 comments

Ouch. I think we've all done this once or twice, in some fashion or another. I'm just happy they're so open about it. Learning experience == good thing.

From Chris' Twitter stream (http://twitter.com/#!/defunkt):

Seriously, I blame whoever wrote our crappy continuous integration software.

Oh that's me

Exactly, I don't think you've really lived till you've experienced that pit of your stomach feeling when you realise you've just wiped out a product website / database.

Thankfully for me it was a small website and no one really noticed. I can't imagine what that feeling would be like on something like github.

My worst was discovering I had written a unique ID generator which (due to my typing "==" instead of "!=") was producing duplicate IDs -- and not only that, it was producing them at exponentially increasing rates -- and every duplicate ID was destroying an association in our database, making it unclear which records belonged to whom.

It was not a good day.

Mine was for a French social networking site 4 years ago. They used to send mails every day saying "hey, look at the people you might know". The links in the mail would automatically log the user in to the website. When I sent the code live, it took 2 days (and more than 50,000 mails) to find out that when I sent a mail to person Z about person Y, the link logged Z in to Y's account.

[1] I knew it was a bad idea to automatically log in the target of the mail. But it was the policy, and it's still the case as far as I know. And if you forget your password, don't worry: it's stored in plaintext...

> Exactly, I don't think you've really lived till you've experienced that pit of your stomach feeling when you realise you've just wiped out a product website / database.

Or sent a test email to thousands of customers in your prod database encouraging them to use web check-in for their non-existent flight tomorrow.

Yeah, did that five years ago, talk about heart-attack-inducing. Quickly remedied by sending a second email to the same test set, thankfully, but that's the kind of mistake you never forget.

Heck, you think that's heart attack inducing?

How about the DreamHost case, in which they typed the wrong year in their billing code and charged many of their users for an extra year of service, to the tune of $7.5 million:

http://blog.dreamhost.com/2008/01/15/um-whoops/
http://blog.dreamhost.com/2008/01/16/the-aftermath/
http://blog.dreamhost.com/2008/01/17/the-final-update/

Wow. That is a great read.

“What’s the harm in keeping it flexible?” $7.5M in harm, that’s what! "Flexibility" is rarely desired in programming!

I have a strict rule for myself: never use any curse words in any comments, variable names, dummy accounts etc.

I have the opposite strict rule: use as many curse words in comments, variable names, and dummy accounts as possible. That way you'll find out quickly when someone else notices!

I recommend against that. :) A team I was on had a demo for a client, and a shaky database schema, and I had used the username "MOTHERFUCKINYEAH" as a test account for exactly that purpose. The purging of this account caused a few 500 errors, and we almost lost the client. I didn't get fired, but we were all shuffled around afterwards to less critical projects, and one of us actually got sent out of the state.

I don't think they're useful for variable names, but I use them (very sparingly) in comments. A great way to spot what needs refactoring is to find the piece that caused the most frustration to the author. Under a tight deadline, apologies to future maintainers can also be helpful in injecting some levity.

Like when the Google engineer added "/" to the list of bad URLs, thereby marking every single website (!) as possibly dangerous.


Now that would be a bad feeling.

I suspect that this is why people tend to understand when GitHub goes down, and this allows them to be more open. All (or nearly all) of GitHub's users are tech-savvy, so we tend to understand the problems.

It also doesn't hurt that every user's repositories are "backed up" to localhost.

Sounds like the biggest impact was on non-github-users trying to read source or documentation on other people's projects, more so than on actual users.

Definitely the important part. Linus Torvalds once joked: "Only wimps use tape backup: real men just upload their important stuff on ftp, and let the rest of the world mirror it." Distributed SCMs like git are a great way to get that kind of mirroring for free -- every clone is a full copy of the history.

But GitHub's main value over something like, say, gitosis, is all of the metadata. This stuff is mainly useful to github users. Projects you're watching, pull requests, bug reports, fun (if only of dubious general use) graphs. That stuff doesn't come across the wire with the individual repos. The big (in terms of bytes) part, the "events" table, is helpful for keeping up with people and projects (if sometimes noisy). I use the RSS feed of that, though, so I didn't even notice the interruption until I heard about it here.

I did it for the main test database, and STILL felt like crap. SO many headaches.

I once messed up a table that was responsible for a custom-distributed auto-increment field. If you roll that back and don't reset some servers, you have new data overwriting old data. Very bad stuff.

I didn't even notice, maybe because it was Sunday and none of my data seems to have been lost. Well, except for the wiped events but I can live with that.

Good thing git is distributed. I've been working on my code all day and never even noticed!

So what's the point of using github if you don't notice it's offline?

Github provides a convenient place for me to publish my work to the world, allowing others to pull it as they desire (my work machine is a laptop, and can never be relied upon to be up, or at any particular address). Because git is distributed, my workflow is not affected in the slightest by "somebody else's" (github's) completely separate repo being down.

Contrast this with something older like subversion and sourceforge. If sourceforge went down you were shit out of luck.

This is why it's important to isolate production from other environments. Three rules have kept me from ever borking a production database (a rough sketch follows the list):

1. Production DB credentials are only stored on the production appservers, and copied in at deploy time.

2. The production DB can only be accessed from the IPs of the production webfarm.

3. Staging, Testing, Development, and Everything Else live on separate networks and machines than production.
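
A minimal sketch of rule 1 in Rails terms, assuming (hypothetically) that deploy drops a credentials file into a path that only exists on the production appservers; database.yml in the repo only knows dev/test, and a box without the file simply can't reach the production DB:

    # config/initializers/database_credentials.rb (hypothetical path)
    creds_path = "/srv/app/shared/production_database.yml"

    if Rails.env.production?
      raise "No production DB credentials on this host" unless File.exist?(creds_path)
      # Hash with adapter/host/username/password/database, copied in at deploy time.
      ActiveRecord::Base.establish_connection(YAML.load_file(creds_path))
    end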

While this is part of the problem, to me it seems like they didn't have proper restore procedures or at least hadn't tested them enough. There are countless ways to corrupt a database and restoring from backup is part of most remedies.

If someone ever produces a good book of best practices for sysadm/syseng, please provide examples like these of why it's important to follow these best practices.

Yes, we've all made silly mistakes. But if you're in that design meeting and somebody asks, should we do ABC in case of XYZ, try not to think about how complicated or time consuming it might be to do ABC. Think about the worst case. If not doing it could at some point bring down the whole business, perhaps you should ponder it some more.

Actually, screw a book... Does anyone else want to start some wiki pages documenting their experiences with screw-ups, the causes, and the solutions? Does this exist in a comprehensive way and I just haven't found it?

While a book of screw-ups might be amusing, I think it might be more instructive to look at the "old school" ways that screw-ups like this were avoided.

I worked, back in the dark ages, for a health insurer that had two parallel environments-- one for testing/development, and one for production. None of the developers were allowed anywhere near the production environment. There was one full time employee, a former developer, whose primary job was to move code into production-- which he would only do when he received a signed document authorizing the change. Said document included the telephone numbers where the developer responsible for the code would be for the next 24 hours, so that one of the operators could call you in the middle of the night if your code caused any problems to the system.

At the time, I used to think that this was ridiculous. After managing a staff of programmers myself, I'm not so sure.

It depends on the software a bit, but I got a taste of a "programmers not allowed on production" environment (on a very small scale). The problem I ran into is that you have no idea what's going on in the production environment. Some characteristics may be reproducible on the dev site, but actual users will always do something differently. Sometimes not being able to poke the live system in a specific way will cost you weeks of guessing in the dev environment.

Then again, there is no perfect solution for this, is there? If you tried installing a new version with more debugging points, you'd be deploying something untested on top of something already unstable, and trying to push more data out, which might be problematic in itself. I'm not even going into "rollback to stable didn't work" scenarios :(

In my experience the best solution for a large-scale dynamic site is a combination of read-only access to production and deploy management.

Deploy management is the combination of a change management system with a deployment tool in a flexible way. So for example, with the right options, in an emergency the manager of a team of developers could deploy anything to the site immediately without requiring approval from change management. The tools are still right there to revert any change, and of course there's snapshots and daily backups for the most critical data.

Except for emergencies, all changes to production would come with an approval from a higher-up with potential code review back-and-forth first. Contact data and reversion capabilities are built in, so everybody knows who did what, how to contact them and how to revert it if they're unavailable. And of course your trending data will tell you when your peak use is and code pushes are typically frozen during that time, minimizing further potential loss.

However, besides having read-only access to production, devs should also have two kinds of environments: "development" and "staging". Development is where the bleeding-edge broken stuff lives and code is written. Staging is a machine identical to those in production. Often you'll see test or QA machines which aren't identical to production, usually because changes aren't pushed to them the way they are in production. The staging machine gets all changes pushed to it like any other production machine, except it lives in a LAN that cannot access production or anything else. A method to reproduce incoming requests and sessions from the production system on this staging server will give you a pretty good idea what "real traffic" looks like on this box, if you need it.

Even in fairly conservative financial environments I've seen developers getting read-only access to production application servers and databases to troubleshoot specific problems.

Given the topic, it would be good to make mirrors of these pages before doing anything else with them.

We have a great sysadmin at work, and I've learned 3 very important lessons:

Have redundant everything. Back up everything. Provide tools to talk to production systems and databases, but make them read-only, and make it hard to mess with production systems unless you're reallllllly sure you know what you're doing.

97 Things Every Software Architect Should Know is a pretty good starting point.


Thanks, I'll give this a read.

Forthright and classy. Compare to register.com, which had a big DNS outage Friday (affecting anybots.com) and never admitted to a problem.

I believe they had a message on their homepage during the event about being DDoSed, but yeah, no after-the-fact remarks is kind of ugly in my opinion. They could at least post something and hide it from the general customer base.

1) http://seclists.org/nanog/2010/Nov/415

Lesson: don't let your CI machine talk to your production servers (firewalls are good at this).

In my environment, our devs (individuals or environments/subnets) don't have access to PROD or QA, and our CIT boxes are in DEV. Likewise, QA and PROD only have access to their own environments.

We have a build master that promotes a reviewed deployment package to QA and/or PROD environments, where the appropriate QA or PROD operations folks do the actual deployment.

It's a luxury to have the resources available for this, but it's a life saver, because it really is stupidly easy to make a simple mistake and totally screw things up.

The last time something similar happened to me, it happened to be at the end of a REALLY long day. And what do you know... that day was then made 24 hours longer, interspersed with the occasional cat nap while backups were being restored and verified.

Fun times. Not.

I'm not so sure it's a luxury. Maybe it is if your startup is servicing a group of techies who know what problems lie in the background, but most clients don't know and don't care.

It's not that hard to keep up, really. It's a question of a day or two of setup, and then the harder part of constant discipline: not taking the "easy way out" and poking holes in the segregation you've set up. Mainly it takes a single team lead or CTO or whoever being really explicit that you just don't break the steps, and you'll avoid a LOT of problems. You'll still have problems, problems are inevitable, but in general you'll have mitigated them, and with a proper backup and merge procedure you'll minimize downtime.

Agreed, but to be clear, the luxury I was referring to was specifically having the DEV and PROD staff being totally separate teams/individuals, not just the separation of environments. In other words, nobody on the DEV team did any PROD operations, except in the case of bug investigation, tuning advice, etc.

Nice, but still too specific.

Lesson: don't let anything talk to your production servers.

Make it as difficult as possible for anyone to log in. Treat production as if it were a loaded gun. You shouldn't be touching it unless you absolutely have to, and even then, you need to be vigilantly aware of the consequences of your actions.

Interesting idea. I wonder if it would be worthwhile to store server information in LDAP, add a field for environment type, and then teach your server components to act on that information. Then when the dev app server connects to the prod database, the database checks the server name, sees that it's dev, and refuses the connection.

Maybe I should go and put in a reference to this thread in yesterday's thread about how you don't really need firewalls...

I agree with what's been said so far: (1) shit happens, (2) we've all done stupid stuff, (3) testing environments shouldn't have access to production.

What hasn't been said is how refreshing it is to see an honest and quick explanation. I know this type of approach is getting more and more common (see the foursquare outage), but in the grand scale of things, it's still quite rare.

I have a slightly higher threshold for considering something refreshing.

It took several levels of basic mistakes for this to happen AND for the restore to be as slow as it is.

Using MySQL with no transactions, no binary backups, no way to do a quick restore, and no separation of dev/production. DROP as opposed to DELETE cannot be rolled back and is therefore scary -- unless you aren't using transactions in the first place, in which case, WHEE!

I don't have much MySQL experience, but I have lots of experience with other RDBMSs. Large restores take a long time no matter what. They shouldn't take days, but since we don't know how large the events table was, we have no way of knowing if a faster restore was possible. A hot restore, which sounds like what they are doing, may take longer by the simple fact that it's restoring while the table is in use.

And DROPs (or truncates) are almost always used if the goal is to remove all the data in a table, like you would want when rebuilding the entire system. Something like a 100M-record transaction is not generally considered a good thing.

They take a long time if you have to rebuild the indexes, which is what you have to do if you only have a text dump.

For MySQL, using the standard format (MyISAM), you can just do a file copy if you bothered to do a proper backup of the binary files.
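
A rough sketch of that file-copy style backup, assuming MyISAM tables and hypothetical paths; the read lock keeps the .frm/.MYD/.MYI files consistent while they're copied, and has to be held on the same connection for the duration:

    require 'mysql2'

    client = Mysql2::Client.new(host: "localhost", username: "backup_user")
    client.query("FLUSH TABLES WITH READ LOCK")   # quiesce writes
    begin
      # copy the raw table files somewhere safe
      system("cp -a /var/lib/mysql/myapp_production /backups/myapp_production-#{Time.now.strftime('%Y%m%d')}")
    ensure
      client.query("UNLOCK TABLES")
    end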

I am a software developer, so I know "shit happens", but having the same database configuration (same superuser name and password) for production as for the testing environment, with no isolation between them, is pretty criminal even for a first-time mistake IMHO, especially for a product like github that businesses, small and big, trust with a business-critical piece (their repository).

If I were running some critical code, I would seriously reconsider github, or at least ask for a detailed explanation of their engineering practices and fail-safe mechanisms.

I can only hope it shocks some sense into kids that use GitHub for distribution rather than putting up a tarball named $name-$version.tgz (or bz2 or xz or whatever). As much as I love GitHub, it has been the bane of my automated-build existence. I don't want to ever have to make a build script that guesses at a SHA1 (or punts by doing a clone at depth zero) again.

git ls-remote will help you to get the hash of HEAD. It's much cheaper than doing a zero-depth clone.
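
For example, in a build script (the repo URL here is hypothetical), something like this grabs the SHA1 of HEAD without pulling any objects:

    # `git ls-remote <url> HEAD` prints "<sha1>\tHEAD"; take the first field.
    head_sha = `git ls-remote git://github.com/someuser/someproject.git HEAD`.split.first
    abort "couldn't reach the remote" if head_sha.nil? || head_sha.empty?
    puts head_sha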

I think the fact that they have been able to recover all critical data soon and will even update the event table over the next few days clearly displays their competence.

Github isn't facebook and not seeing the last week's "activities" timeline for a few days isn't really that much of a problem.

> I think the fact that they have been able to recover all critical data soon and will even update the event table over the next few days clearly displays their competence.

Taking a few days to restore a table indicates they only had a text dump of it. The details of how they screwed up indicate they were new to the idea of data robustness, despite it being the core of their business.

Being a competent home user and being a competent data-robustness company involve slightly different levels of expertise.

Git itself handles robustness for what counts -- the source code and revision history. This outage did not affect repos as far as I have heard.

Backing up and restoring mysql is an orthogonal problem.

The fact that github can operate at all in a degraded mode very much indicates their competence. Many architectures can not operate without the entire db.

Plus I bet they'll shore up their db operations next. Github is very much a young company that will take a lick or two on its way to greatness. All the startups here will.

> This outage did not affect repos as far as I have heard.

Well, I'm just going by TFA, which stated:

"Newly created users and repositories are being restored, but pull request state changes and similar might be gone."

Repos were affected.

> The fact that github can operate at all in a degraded mode very much indicates their competence. Many architectures can not operate without the entire db.

This is a curious statement to make after you started off by recognizing that git handles itself. The only parts of the repositories that broke were the parts they tried to glue to their database.

It's nice of you to give them a pass. I'd say DBAs are pretty horrified. I suppose it's kind of like how the average person can watch CSI where they say, "I'll create a GUI interface in Visual Basic, see if I can track an IP address" and think it's just fine.

Can you share more details about how you usually work?

I think it's a measure of the goodwill in the community for Github (and perhaps, the fact that a lot of us have done something similar in the past) that they won't cop much flak at all for this.

I don't know. I use github, but my paid, private repos are elsewhere. The fact that someone, anyone, can run against the production system and nuke it raises some basic questions about password storage. I don't run a site anything like github, but my production and test databases have different passwords, and none of them are stored in a way that the test environment could get access to the live db, nor could the tests be run on the production environment. There's bad luck and there's asking for trouble.

If github spent their time so that their site raised absolutely no basic questions, then they'd still be in beta by now.

To me, that sounds like encouraging a race to the bottom.

When you're in the business of storing other people's data, things like transactions, binary backups, QUICK RESTORES, and separating dev from production shouldn't be afterthoughts. They are core attributes.

I would understand this for a demo, but not after a couple of years and half a million users. They have paying customers.

"Reliable code hosting

We spend all day and night making sure your repositories are secure, backed up and always available."

The half a million users validates their techniques, no matter how much armchair quarterbacking.

The code repos were never in danger, and they've been killing it in the market because they are racing to add awesome extra features, not racing to the bottom.

I don't disagree that they can do more in their db operations, and that it's fine for us paying customers to demand more, but the reality of startups is that it very much is a race and features, for availability or recovery or other, are viciously prioritized and many things don't happen until something breaks.

If you don't think Github cuts the mustard, svn on Google Code probably won't have problems like this...

Disclaimer: I'm personal friends with much of the Github crew.

> The half a million users validates their techniques,

Let me introduce you to my friend, GeoCities.

Funny thing about free hosting.

> no matter how much armchair quarterbacking.

Transactions aren't armchair quarterbacking. Binary backups aren't armchair quarterbacking. Separating development from production isn't armchair quarterbacking. Kindly, you have no idea what you're talking about.

You can't give them credit for the repositories being mostly intact, when the ONLY parts that broke were the parts they mucked with to tie them into their database.

> but the reality of startups is that it very much is a race and features, for availability or recovery

And those are exactly the places where they screwed up.

Are you sitting in a chair? Do you not work at Github? Then you're armchair quarterbacking. We all are, even if you are a quarterback for another team (I'm a DBA myself).

Anyway, it's a great discussion, so we can learn from other people's mistakes. I will be triple-checking my restores later today, and likely halting the project I'm working on to get cold standbys shored up asap. (But it's hard to prioritize housekeeping over customer-centric features in the race I'm running alongside the Github crew.)

"a pejorative modifier to refer to a person who experiences something vicariously rather than first-hand, or to a casual critic who lacks practical experience"

Ironically the most fitting application would be to say that they're armchair quarterbacking their own database administering.

Unless you work for github, you lack practical experience in github's internal operation.

Fortunately, there is this thing called "science" which means we can understand things about the world regardless of where we live. As Dawkins would say, there is no such thing as "Chinese Science" or "French Science", just science. Similarly, there is no such thing as "Github MySQL" or "Github separation of production and development systems" in that same sense.

These are categorical mistakes.

    "If you don't thing Github cuts the mustard, svn on
     Google Code probably won't have problems like this..."
Neither will Mercurial on Google Code ;-)

Disclaimer: I work for google on the Project Hosting product.

Quite fair! Google code is sweet

If we're going to use hindsight, we might as well look at the cause and the effect.

Yes, they missed out some pretty obvious things. And what happened? A few hours of downtime because their restore was slow, plus a tiny bit of inconsequential data loss. Hardly a catastrophe.

The fact is that every site has some sort of problems. Many of them will be completely obvious like this one. And while github could have gone through and attempted to fix them all, I much prefer they spend their time doing things that have more than a few hours of impact on my life.

No doubt they will fix the issues involved today.

Your post is a good illustration of the type of attitude we want to scrupulously avoid when appointing someone to be in charge of a database.

> If github spent their time so that their site raised absolutely no basic questions, then they'd still be in beta by now.

Beta is the stage you reach AFTER you flip on transactions, institute and test quick restores, and separate dev from production.

So basically, you don't make mistakes, only other people do. Is it cold up there on that high horse?

It is trivial to put up barriers to mistakes.

After you make them, sure. But trying to predict mistakes before they happen can be difficult to impossible.

No I make mistakes all the time. That's kinda my point.

I wonder if this is a rails thing. It's an out of the box pattern to put production, development and testing credentials into one database configuration file.

Also from what I know Github is a flat org where all the devs have ability to do work in production. My company is like this too, and while it sometimes leads to scary mistakes, it also leads to massive productivity over having to go through a release engineering team.

I moved my repos off GitHub to my EC2 server a month or two back since they're private and I was only using GitHub for keeping a copy of my code offsite. It's faster for simple push/pull and considering the sunk cost of my EC2 server, also free. I was trying to browse some repos on GitHub yesterday during the downtime and was thankful that my own were still available.

Also, if you just need something for replicating code, Amazon is offering free micro servers on new accounts for up to a year I think.

Rookie mistake; better security if they isolate their networks too...

I don't know why you first got down-voted. I have never worked on a project where a testing environment could access the production db.

I think he got downvoted for calling the GitHub team "rookies". Sure, they may not have tons of experience running 5-nines systems, but they've clearly built a great product that a lot of people love and respect.

They may not be rookies as Ruby hackers, but the evidence clearly points to them being rookies from the standpoint of data robustness. That's embarrassing enough when it's not your core business offering.

They even brag about it on their main page.

Yeah that is just insane. We typically run prod, stage, test, dev. The test and dev instances are even in another dmz.

Maybe it's foresight or maybe it's just paranoia. We've always used an entirely different username/password in prod, and the password for prod never sits in the config files. Kind of saved us a couple of times. Sometimes it doesn't require a highly sophisticated setup to prevent a catastrophe.

Minutes before the outage, my account appeared non-existent and all my repositories were gone. They really scared the hell out of me.

This will be a good reminder for me to always keep a local copy.

Since it's git-based, is it even possible to not keep a local copy? I thought everything that goes into github has to be committed to your local git repo first?

Well, there is always a local copy since you have to commit to the local repo first before you push it to github. That is, of course, as long as you don't remove it for no apparent reason.

I was walking through someone's capistrano scripts, and suddenly everything came back "Not Found". My first thought was that the guy had just started cleaning up his account. :)

Second time I've heard of this happening fairly recently. Another incident, same cause. http://www.bigdoor.com/blog/bigdoor-api-service-has-been-res...

This was pretty painful. Their backups were too old, so it was necessary to do InnoDB data recovery rather than a straight-forward restore from backup.

Since InnoDB table spaces never shrink, 80G of their truncated data was still all available in a single monolithic ibdata file. An InnoDB recovery tool named page_parser reads their 80G ibdata file and spits out a maze of 16k InnoDB page files organized by an arbitrary index id.

There are two internal InnoDB meta tables called SYS_INDEXES and SYS_TABLES which can give you a mapping from table name to PK index id. Unfortunately after the mass TRUNCATE all the tables got new index mappings, so it became a bit of table hide-and-seek.

The InnoDB recovery tools lack a certain polish and maturity. You need to create a C header file for each table you want to recover from the pile of 16k page files, and you end up having to build a separate version of the constraints_parser binary for each table. There were bugs with the output of negative numbers, unicode handling, VARCHAR types with >= 127 characters, and some edge cases where InnoDB chooses to store CHAR(15) types as VARCHARs internally. Aleksander at Percona really saved the day; he was able to find and fix these bugs pretty quickly.

I remember that magic moment when I finally was able to successfully checksum a huge block of the recovered data against the too-old-to-be-useful backup.

"I love the smell of CRC32 in the morning. It smells like... victory."

I try to make my apps work against SQLite and the production database, so I can run all my tests against an in-memory SQLite database. This makes the tests run Really Fast, and it prevents a configuration error from causing my production data to go away.

(It's not possible to do this in every case, especially if you make heavy use of stored procedures and triggers, but I don't. If I need client-independent behavior or integrity checks on top of the database, I just use a small RPC server. This makes testing and scaling easier, since there are just simple components that speak over the network. Much easier than predicting everything that could possibly happen to the database.)
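
For what it's worth, a minimal sketch of that setup with ActiveRecord (the table and model are made up for illustration); the test process only ever holds credentials for a throwaway in-memory database:

    require 'active_record'
    require 'sqlite3'

    # Tests connect to an in-memory SQLite DB: fast, and it vanishes on exit.
    ActiveRecord::Base.establish_connection(adapter: "sqlite3", database: ":memory:")

    ActiveRecord::Schema.define do
      create_table :events do |t|
        t.string :action
        t.timestamps
      end
    end

    class Event < ActiveRecord::Base; end

    Event.create!(action: "push")
    puts Event.count   # => 1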

Anyone know what database system they use?

MySQL. Not that using anything else would have prevented this or would help restore the table faster...

fwiw, Cassandra automatically takes a snapshot (which is basically instantaneous -- see http://www.riptano.com/blog/whats-new-cassandra-066 for a brief explanation why) before truncating or dropping a column family, exactly because it makes recovering from a foot/gun incident like this so much less painful.

> MySQL. Not that using anything else would have prevented this or would help restore the table faster...

The slowness of the restore sounds like they have to insert a text dump back into the database, rather than simply copy over the binary files. Even copying over 1 TB from a single hard drive or between EC2 instances would only take a few hours, not days.

And they weren't using binlogs.

And they didn't separate out their test from production environment, which is understandable when you're in demo mode but not after a couple years and half a million users.

Rails makes this way too easy to do by accident.

If your database is on a socket, then that would mean tests are running on a production machine, which just seems crazy. If it is on TCP, then that means access isn't restricted properly. At any rate, how is this Rails specific?

"how is this Rails specific?"

I spent years in Java prior to Ruby and through the various frameworks in Java, I never remember making the mistake or working with anyone who made the mistake of blowing away the production data when running tests.

Since working in Rails, I've had at least 2-3 experiences on the team with truncated/dropped tables or mangled data in production, and it isn't just the current team.

Many other languages and frameworks don't make it as easy, as part of the normal process used by most developers, to accidentally do this; it has seemed to happen more in Rails 2, due to the conventions used.

With Rails, the same project for the most part is in dev, test, and prod. Developers want database.yml revisioned with the rest of the app. Many Rails developers at some point run the production environment against the production DB locally, in order to run migrations on production prior to Capistrano deploys, etc., or maybe just to run a production console to run reports or tweak data. So there is a very real chance for wild things to happen in such a risky environment. I'm not anti-Rails by any means, but I've not heard of this happening in other frameworks as much as it historically has with Rails.

I think it could happen just as easily in Java, especially in continuous integration scripts, which are nearly fully automated. I've nearly done something similar in Hibernate before, just by running ant scripts that looked at my environment variables to see which database they should truncate. If I had forgotten to change back the environment variable and my machine had been able to access the prod servers, the same thing could have happened. The only reason it didn't happen for me is that I usually have pretty strict rules in my pg_hba.conf file that prevent foreign hosts from connecting.

Not really. Reasonably complex Java apps would use JNDI to store a single db connection name, which the app server maps internally to db credentials.

That way, the app just says "connect me to /x" and the container gives the app /x, which might be the test or prod db connection; the app never knows, or cares. You get the db that the operations guys configured for -that- environment.

JNDI is horrible in many ways, and so are app containers, but there are a few niceties that ruby-land could learn from.

Actually, in Rails 3, there are explicit guards against running the test suite against the production environment.

See https://rails.lighthouseapp.com/projects/8994/tickets/5685-t...

However, in this case it happened on their continuous integration server -- which probably does more than just "rake test". Our Hudson scripts do something like "rake db:reset" because some concurrent scripts require data to be persisted in the database and not rolled back per test.

So when it's running something like "rake db:reset", which simply recreates all tables & seeds the database, it just looks at your environment and assumes you know what you're doing. Hudson (and other continuous integration systems) are pretty automated so unless it prompted you, it would run the db scripts quietly.
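
A sketch of the kind of guard that helps there (task names and the override variable are just examples): bolt a prerequisite onto the destructive rake tasks so they refuse to run when the resolved environment is production.

    # lib/tasks/production_guard.rake (hypothetical)
    task :production_guard do
      if Rails.env.production? && ENV["I_REALLY_MEAN_IT"] != "yes"
        abort "Refusing to run a destructive task against the production environment."
      end
    end

    # Add the guard as a prerequisite of the dangerous tasks.
    %w[db:reset db:drop db:schema:load].each do |name|
      Rake::Task[name].enhance([:production_guard]) if Rake::Task.task_defined?(name)
    end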

No system could prevent it, but there are databases (Oracle and MongoDB off the top of my head, and I'm sure there are others) that implement slave delays specifically to help mitigate human error like this.

Incorrect. The article states that they lost data between backup time and whenever the incident happened.

Any reasonable quality transactional database system has a transaction log for a reason. If you were using commercial DB system like Microsoft SQL Server or Oracle (even a decade ago), this would not be an issue. No data loss. Businesses should care about their data, and I guess this is why the commercial databases are still doing fine in a landscape increasingly dominated by FOSS everywhere.

I realize licensing costs do matter, but I can't fathom why people put up with the sad excuse for an RDBMS that is MySQL. For any nontrivial task it is slow and unreliable, attempting to secure your data by taking database backups (which seemingly can't even provide you with transactional safety!!!) renders the DB unusable and locked while the backup is performed, having databases bigger than what you can store in memory makes it perform like a flat file, etc. etc. ad infinitum.

Surely there must be something better people can use which is still free? Postgres for instance?

FWIW, even MySQL lets you store "binlogs", which allow you to "rerun" all commands which changed the data since your last backup -- if you've configured it to.
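
Roughly, the replay looks like this (file names, times, and database name are all hypothetical): restore the last full dump, then pipe everything the binlog recorded after that point back through mysql.

    # Point-in-time recovery sketch, assuming log_bin was enabled in my.cnf.
    system("mysql -u root -p myapp_production < /backups/myapp_production_full.sql")
    system("mysqlbinlog --start-datetime='2010-11-14 03:00:00' " \
           "/var/log/mysql/mysql-bin.000042 | mysql -u root -p myapp_production")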

Which makes me think: the last time I had anything to do with adminning MySQL, binlogs were required for replication, so I'm guessing this means github aren't replicating that database anywhere either...

There are many reasons for using Postgres over mysql, but having a backupable bin-log isn't one of them. MySQL does this too.

This is an oversight in github's MySQL configuration (and their post states that they want to fix it)

You are right on. The cardinal rule of traditional relational databases is that data loss is unacceptable. A database administrator who allows data loss is by definition incompetent. A database that allows data loss is by definition not a database.

Commercial databases such as Oracle, SQL Server, and DB2 are built from the ground up to make it possible to recover everything up through the last committed transaction before the incident. I have always presumed that the leading open source traditional relational databases (mySQL and PostgreSQL) provide this same capability.

In the github case (mySQL) it seems that full recovery may have been possible, but it was not possible within an acceptable timeframe. This may be a flaw in the software (lack of features to make quick recovery possible), or a flaw in the strategy (not configured to make quick recovery possible).

However, any of these databases can easily be administered in a manner that allows data loss. The github event is not necessarily an indication of a problem with mySQL. It is unquestionably an indication of a problem with the administration of the database.

Can anyone confirm that full recovery from this type of incident (within a reasonable timeframe) should be possible with both mySQL and PostgreSQL if administered properly?

Well done on coming clean. However this is why the dinosaur pens have such arduous red tape -- to try and catch serious errors before they hit production. A mate of mine works in that world and he regularly stops code going into production that would hose mission-critical government data.

I prefer my agility to remain on the dev-and-test side of the fence.

It kind of makes me wish NILFS2 would become production-ready faster. Give MySQL its own partition, and just roll back to a previous checkpoint if you wipe everything. Not a substitute for backups, but a pretty speedy way to recover for a minor snafu like this.

NILFS is very nifty for write-oriented work, but I'm not sure if that's their workload.

But if you're logging stuff, NILFS rocks: http://lists.luaforge.net/pipermail/kepler-project/2009-June...

MySQL 5.6+ has delayed replication. Until then, there are always tools like:


Simply checking whether you're talking to a production instance could avert something like this: have some metadata in the db about whether the data stored there is acting as production, and at what version and deployment level, so tests can do a sanity check before destructive activities.
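
A sketch of that sanity check, assuming a hypothetical one-row deployment_info table recording what role the database plays; the test harness bails out before doing anything destructive if the row says production (or is missing):

    require 'mysql2'

    client = Mysql2::Client.new(host: ENV["DB_HOST"], username: ENV["DB_USER"],
                                password: ENV["DB_PASS"], database: ENV["DB_NAME"])
    row = client.query("SELECT role, version FROM deployment_info LIMIT 1").first

    if row.nil? || row["role"] == "production"
      abort "Refusing to run destructive tests against this database (role: #{row && row['role']})."
    end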

Why the downvote?

Presumably because the comment is obvious, unhelpful, and somewhat mean-spirited. I don't think anybody here needs a lesson in how to avoid accidentally dropping your production database.

Thanks for the explanation, I didn't mean to be evil. My suggestion might be obvious to some, but definitely not to everyone, considering this isn't a singular occurrence.

Kudos to them for having the guts to say publicly that they accidentally destroyed their production database.

It's been a great service, and I think as long as this kind of thing is rare, and none of my code repositories get corrupted or destroyed, I plan to stick with them.

I thought systems written in Erlang never go down! ;-)
