Did Pixar accidentally delete Toy Story 2 during production? (2012) (quora.com)
515 points by chenster on Jan 17, 2017 | 253 comments

I was at Pixar when this happened, but I didn't hear all of the gory details, as I was in the Tools group, not Production. My memory of a conversation I had with the main System Administrator as to why the backup was not complete was that they were using a 32-bit version of tar and some of the filesystems being backed up were larger than 2GB. The script doing the backup did not catch the error. That may seem sloppy, but this sort of thing happens in the Real World all the time. At the risk of spilling secrets, I'll tell a story about the animation system, which I worked on (in the 1996-97 time frame).
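The failure mode described above is easy to guard against once you know to look for it: treat any nonzero exit status from the archiver as a failed backup. A minimal sketch (the paths and function name are hypothetical, not from the actual script):

```python
import subprocess

def backup(src_dir, archive_path):
    """Run tar and refuse to treat a failed archive as a good backup.

    The point is checking the exit status, which the script in the
    anecdote reportedly did not do.
    """
    result = subprocess.run(
        ["tar", "-czf", archive_path, src_dir],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        # A 32-bit tar hitting the 2GB limit would surface here instead
        # of silently leaving a truncated archive that "looks" fine.
        raise RuntimeError(f"backup of {src_dir} failed: {result.stderr.strip()}")
    return archive_path
```

A wrapper like this still wouldn't catch every problem (a corrupt-but-complete archive, for instance), which is why test restores matter too.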

The Pixar animation system at the time was written in K&R C and one of my tasks was to migrate it to ANSI C. As I did that I learned that there were aspects of this code that felt like a school assignment that had escaped from the lab. While searching for a bug, I noticed that the write() call that saved the animation data for a shot wasn't checked for errors. This seemed like a bad idea, since at the time the animation workstations were SGI systems with relatively small SCSI disks that could fill up easily. When this happened, the animation system usually would crash and work would be lost. So, I added an error check, and also code that would save the animation data to an NFS volume if the write to the local disk failed. Finally, it printed a message assuring the animator that her files were safe and it emailed a support address so they could come help. The animators loved it! I had left Pixar by the time the big crunch came in 1999 to remake TS2 in just 9 months, so I didn't see that madness first hand. But I'd like to think that TS2 is just a little, tiny bit prettier thanks to my emergency backup code that kept the animators and lighting TDs from having to redo shots they barely had time to do right the first time.
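The save-with-fallback idea sketched in the story is roughly this (the original was C and the fallback was an NFS volume, with an email to support on top; the function and paths here are purely illustrative):

```python
import os

def save_animation(data: bytes, local_path: str, fallback_path: str) -> str:
    """Write animation data to the local disk; if that write fails
    (e.g. the small SCSI disk is full), fall back to a second volume.
    Returns the path that actually received the data.
    """
    for path in (local_path, fallback_path):
        try:
            with open(path, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())  # don't report success until it's on disk
            return path
        except OSError:
            continue  # this volume failed; try the next one
    raise OSError("could not save animation data to any volume")
```

The key difference from the unchecked `write()` is that failure is detected and handled rather than crashing the application and losing the shot.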

The point is that one would like to think that a place like Pixar is a model of Software Engineering Excellence, but the truth is more complex. Under the pressures of Production deadlines, sometimes you just have to get it to work and hope you can clean it up later. I see the same things at NASA, where, for the most part, only Flight Software gets the full-on Software Engineering treatment.

Right on the money with the "Real World" anecdote.

We do penetration tests for a wide range of clients across many industries. I would say that the bigger the company, the more childish the flaws we find. For sure the complexity, scale, and multiple systems do not help towards having a good security posture, but never assume that because you are auditing a SWIFT backend you will not find anything that can lead to direct compromise.

Maybe not surprisingly, most startups that we work with have a better security posture than F500 companies. They tend to use the latest frameworks, which do a good job of protecting against the standard issues, and their relatively small attack surface doesn't leave you with much to play with.

Of course there are exceptions.

Would love to have a chat about your view on security posture between smaller and bigger companies, but couldn't find your email in your HN profile. Mine is in my profile so if you have the time, please send me a message.

I actually think that would make a great discussion on HN. ;-)

Hmmm can't seem to find your email in your profile. I think you have to put it in the about section.

Ha, oops :-) updated.

One of the really interesting artifacts of the NASA flight software programs is that they put an honest, ground-truth upper bound on the level of effort required to produce "perfect" software. Everything else we do is an approximation at some level of fidelity. The only thing even reasonably close is maybe SQLite, and most people think its testing code is about 10x overkill.

It makes one start to contemplate how little we really understand about software and how nascent the field really is. We're basically stacking rocks in a modern age where other engineering disciplines are building half-km tall buildings and mile-spanning bridges.

Fast forward 2500 years and the software building techniques of the future must be as unrecognizable to us as rocket ships are to people who build mud huts.

We're stacking transistors measured in nm into worldwide communications systems, compelling simulations of reality, and systems that learn.

The scale is immense, so everything is built in multiple layers, each flawed and built upon a flawed foundation, each constantly changing, and we wouldn't achieve the heights we do if perfection, rather than satisfaction, was the goal.

Perhaps at some point the ground will stop shifting.

Sure we would, it would just take longer. A thousand years instead of 50. But just like we still use bridges and roads thousands of years old today, our distant descendants would still be using the exact foundations of what we produce now.

Sure, eventually we would get there. But we wouldn't be as far as we are at this moment.

You mean holding it all together with duct tape and chewing gum?

I mean being able to communicate with people around the world in real time, and all the rest.

> Perhaps at some point the ground will stop shifting.

Doubtful. Machines will build the ground instead, and what they build on top of it will be incomprehensible to us; at least we'll get to observe in awe.

What the comment below is saying: the scales just can't be compared. The orders of magnitude of complexity, and the number of variables in computer systems, are far greater than in any other engineering discipline.

> The script doing the backup did not catch the error. That may seem sloppy, but this sort of thing happens in the Real World all the time.

I disagree. I mean, I agree those things happen, but the system administrator's job is to anticipate those Real World risks and manage them with tools like quality assurance, redundancy, plain old focus and effort, and many others.

The fundamental rule of backups is to test restoring them, which would have caught the problem described. It's so basic that it's a well-known rookie error and a source of jokes like, 'My backup was perfect; it was the restore that failed.' What is a backup that can't be restored?

Also, in managing those Real World risks, the system administrator has to prioritize the value of data. The company's internal newsletter gets one level of care, HR and payroll another. The company's most valuable asset and work product, worth hundreds of millions of dollars? That is a personal mission where no mistakes are permitted: check and recheck, hire someone from the outside, create redundant systems, etc. It's also a failure of the CIO, who should have been absolutely sure of the data's safety even if he/she had to personally test the restore, and of the CEO too.

I don't know or recall the details well enough to be sure, but it's possible that they were, in fact, testing the backups but had never before exceeded the 2GB limit. Knowing that your test cases cover all possible circumstances, including ones that haven't actually occurred in the real world yet, is non-trivial.

Your post is valid from a technical and idealistic standpoint, however when you realize the size of the data sets turned over in the film / TV world on a daily basis, restoring, hashing and verifying files during production schedules is akin to painting the Forth Bridge - only the bridge has doubled in size by the time you get halfway through, and the river keeps rising...

There are lots of companies doing very well in this industry with targeted data management solutions to help alleviate these problems (I'm not sure that IT 'solutions' exist), however these backups aren't your typical database and document dumps. In today's UHD/HDR space you are looking at potentially petabytes of data for a single production - solely getting the data to tape for archive is a full-time job for many in the industry, let alone administration of the systems themselves, which often need overhauling and reconfiguring between projects.

Please don't take this as me trying to detract from your post in any way - I agree with you on a great number of points, and we should all strive for ideals in day to day operations as it makes all our respective industries better. As a fairly crude analogy however, the tactician's view of the battlefield is often very different to that of the man in the trenches, and I've been on both sides of the coin. The film and TV space is incredibly dynamic, both in terms of hardware and software evolution, to the point where standardization is having a very hard time keeping up. It's this dynamism which keeps me coming back to work every day, but also contributes quite significantly to my rapidly receding hairline!

> Your post is valid from a technical and idealistic standpoint

You seem to have direct experience in that particular industry, but I disagree that I'm being "idealistic" (often used as a condescending pejorative by people who want to lower standards). I'm managing the risk based on the value of the asset, the risk to it, and the cost of protecting it. In this case, given the extremely high value of the asset, the cost and difficulty of verifying the backup appears worthwhile. The internal company newsletter in my example above is not worth much cost.

> solely getting the data to tape for archive is a full time job for many in the industry, let alone administration of the systems themselves, which often need overhauling and reconfiguring between projects.

Why not hire more personnel? $100K/yr seems like cheap insurance for this asset.

> restoring, hashing and verifying files during production schedules is akin to painting the Forth Bridge - only the bridge has doubled in size by the time you get halfway through, and the river keeps rising...

> you are looking at potentially petabytes of data for a single production

I agree that not all situations allow you to perform a full restore as a test; Amazon, for example, probably can't test a complete restore of all systems. But I'm not talking about this level of safety for all systems; Amazon may test its most valuable, core asset, and regardless there are other ways to verify backups. In this case it seems like they could have restored the data, based on the little I know. If the verification is days behind live data or doesn't test every backup, that's no reason to omit it; it still verifies the system, provides feedback on bugs, and reduces the maximum data loss to a shorter period than infinity.

> I disagree that I'm being "idealistic" (often used as a condescending pejorative by people who want to lower standards)

A poor word choice on my part. It was certainly not meant to come across that way, so apologies there! Agreed that a cost vs risk analysis should be one of the first items on anyone's list, especially given the perceived value of the digital assets in this instance.

No problem; I over-reacted a bit. Welcome to HN! We need more classy, intelligent discussion like yours, so I hope you stick around.

I think GP's point is that although it's obviously sloppy, it's also sadly common.

Because sure, it's basic. To someone who knows that it's basic.

Also, this was way back in 1998, when what we would consider sloppy today was par for the course.

This particular case is one that's hard to test - you'd restore the backup, look at it, and it would look fine; all the files are there, almost all of them have perfect content, and even the broken files are "mostly" ok.

As the linked article states, they restored the backup seemingly successfully, and it took two days of normal work until someone noticed that the restored backup was actually not complete. How would you notice that in backup testing that (presumably) shouldn't take thousands of man-hours to do?

Good points. High-assurance can be very expensive in almost any area of IT. Speaking generally, when the asset is that valuable, the IT team should take responsibility for anticipating those problems - difficult, but not impossible. Sometimes you just have to roll up your sleeves and dig into the hard problems.

Speaking specifically, based on what you describe (neither of us is fully informed, of course), the solutions are easy and cheap: verify the number of bytes restored, the number of files and directories restored, and verify checksums (or something similar) for individual files.
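A minimal sketch of that kind of verification, comparing a restored tree against the original by file count, size, and checksum (function names are made up for illustration):

```python
import hashlib
import os

def tree_summary(root):
    """Walk a directory tree and collect per-file sizes and SHA-256 digests."""
    summary = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 16), b""):
                    h.update(chunk)
            summary[rel] = (os.path.getsize(path), h.hexdigest())
    return summary

def verify_restore(original_root, restored_root):
    """Return the relative paths that are missing, extra, or differ after a restore."""
    orig, rest = tree_summary(original_root), tree_summary(restored_root)
    problems = [p for p in orig if rest.get(p) != orig[p]]
    problems += [p for p in rest if p not in orig]
    return problems
```

Even if a full test restore is impractical at production scale, a checksum sweep like this would have flagged the truncated files immediately instead of two days later.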

The impression I got from the descriptions of that incident, and especially the followup, was that their main weakness was not technical but organizational. Their core business consisted of making, versioning and using a very, very large number of data assets that were very important to them, but they apparently didn't have any good process for (a) an inventory of what assets they had or should have, and (b) who was responsible for each (set of) assets. Instead, the assets "simply" were there when everything was going okay, and just as simply weren't there, without anyone noticing, when it wasn't.

If they had even the most rudimentary tracking or inventory of those assets/artifacts, the same technical problems would have meant a much simpler and faster business recovery; instead, circumstances forced them to inventory something that they (a) possibly didn't have, (b) didn't know whether it needed to exist in the first place, and (c) had to handle in a hurry, without preparation or adequate tools or people.

IT couldn't and cannot fix that - implementing a process may need some support from IT for tooling or a bit of automation, but most of the fix would be by and for the non-IT owners/keepers of that data.

Cool story :) You are bang on point when it comes to software engineering at what are thought to be "top tier" development houses. In the ideal world, sure, they will build the very best software, but the real world has [unrealistic] deadlines, and when you have deadlines it means corners get cut. Not always, but very often. This leads to the whole "does it do exactly what is required?" and if it does then you are moved on to the next thing, often with the "promise" that you will be able to go back and "fix things" at a later date. Of course we all know that promise is never kept.

On a related note:

"Backups aren't backups until they've been tested."

They really are Schrödinger's backups until a test restore takes place. This is one area where people cut corners a lot because no one cares about backups until they need them. But it's worth the effort to do them right, including occasional, scheduled manual testing. If you can't restore the data/system you're going to be the one working insane hours to get things working when a failure occurs.

And then there's the aftermath. Unless you are lucky enough to work for a blame-free organization, major data loss in a critical app due to a failure of the backup system (or lack thereof) could be a resume-generating event. If you're ordered to prioritize other things over backups, make sure you get that in writing. Backups are something everyone agrees is "critical" but no one wants to invest time in.

> As I did that I learned that there were aspects of this code that felt like a school assignment that had escaped from the lab

From a brief stint in the gfx industry: you are correct.

Pixar isn't a model of software excellence, it's a model of process and (ugh) culture excellence.

Didn't Pixar invent the alpha channel? Being the 'a' in rgba is pretty rad!

No, a Pixar co-founder invented it well before Pixar existed.

But I mean, the other co-founder was Ed Catmull, so it's not like the company is short on innovation.

I've heard about NASA's Flight Software teams being very strict about 9-5 work hours, with lots of code review and tests. I was under the impression this isn't as strict at the competition, SpaceX and Blue Origin, now that we (the USA) aren't sending people to space on our own rockets. Is my impression incorrect?

SpaceX (or rather, Elon Musk) is famous for pushing their developers hard. Elon sent his team to live on a remote island in the Pacific [1] where they were asked to stay until they could (literally!) launch.

[1] https://www.bloomberg.com/graphics/2015-elon-musk-spacex/

Fantastic read.

I don't work in the manned space part of NASA and the software I deal with isn't critical, so I can't say. Most of what I know about Flight software development comes from co-workers who've done that work. They speak of getting written permission to swap two lines of code. That sort of thing. I think it would be cool to have code I wrote running on Mars, but I don't know if I could cope with that development environment.

I have some friends on the mechanical engineering side of things at SpaceX and can definitely say that 9-5 work hours don't even seem to be a suggestion. It likely varies team to team though.

My recollection of the details is lacking, but this jibes with what I remember about a talk I attended by a Pixar sysadmin. I think there were only a couple of slides about it, since it was just one part of a "journey from there to here" presentation about how they managed and versioned large binary assets with Perforce.

There are other anecdotes online about this catastrophic data loss and backup failure, but I think it was, funnily enough, the propensity of some end users of Perforce to mirror the entire repository that saved their bacon. I say funny because this is something a Perforce administrator will generally bark at you about: your sync of this enormous monolithic repo would be accompanied by a server-side process that would run as long as your sync took to finish, and thanks to some weird quirks of the Perforce SCM, long-running processes were bad and would/(could?) fuck up everyone else's day. In fact, I think a recommendation from Pixar was to automatically kill long-running processes server side and encourage smaller workspaces. Anyway, I digress. They were able to piece it together using local copies from certain workstations that contained all or most of the repo. Bad practices ended up saving the day.

> The Pixar animation system at the time

Was that menv? I've heard stories that Pixar builds these crazy custom apps that rival things like 3D Studio and Maya but that never leave their campus!

Yes, Menv, for Modelling ENVironment, although it was a full animation package, not just a modeler. It has a new name now and has been extensively rewritten.

I think it might have turned out better if they had lost the movie. Toy Story I and III have really good plots, but the screenplay of Toy Story II isn't that stellar. Is it possible that they would have changed the story of II if it had been lost? (Mr. Potato Head might have said that they lost the movie on purpose.)

They actually did rewrite the entire story. What they recovered was completely remade (per the answer in the OP).

The original version did have Woody being stolen by a toy collector and being rescued by the other toys. I don't think many of the specifics beyond that survived the rewrite, but I don't know for sure. There are links elsewhere in this thread to claimed versions of the original story, but I can't vouch for their authenticity. I never saw any of the in-progress story reels.

On a somewhat related note I hope someday they'll take the scene assets they have from the older films, beef up the models or substitute them with newer ones from recent movies and re-render them. The stories are solid and a remaster of older Pixar films would be a hit I think.

I'd actually disagree with you - I think the story in Toy Story II as released is top notch.

Unlike most people, I found TS3 to be the weakest of the three. It seemed like a terrific idea for a short that got padded to make it feature length and the padding was pretty average. I wish they had made that short.

I think this might be due to timing. I saw TS when it first came to premium cable. Then I fell way behind on animated films. In 2012 I went on a binge to catch up, and re-watched TS, then saw TS2 for the first time a week later, then TS3 for the first time a week after that.

So for me I was watching TS3 while TS and TS2 were still fresh. Most people were seeing TS3 after a long gap so they may have forgotten details of TS2, and that gap also helped TS3 get a huge nostalgia factor. TS2 was fresh in my mind, and I had no nostalgia.

I'm not sure if I'd put TS or TS2 on top. TS had more novelty, but TS2 explored more weighty themes and had deeper emotional content.

The biggest difference, I think, was leaving the hunt for someone to blame for later, or even skipping it entirely.

Commitment would be very different if people were being asked to help while heads were rolling. You're only a real team when everybody is going in the same direction. Any call of "people, work hard to recover while we're after the moron who deleted everything" wouldn't have done it.

You only commit to something when you know that you won't be under fire if you unknowingly do something wrong.

I never understood the attitude of some companies of firing an employee immediately if they make a mistake such as accidentally deleting some files. If you keep this employee, then you can be pretty sure he'll never make that mistake again. If you fire him and hire someone else, that person might not have had the learning experience of completely screwing up a system.

I think that employees actually make fewer mistakes and are more productive if they don't have to worry about being fired for making one.

There is a great quote from Tom Watson Jr (IBM CEO):

> A young executive had made some bad decisions that cost the company several million dollars. He was summoned to Watson’s office, fully expecting to be dismissed. As he entered the office, the young executive said, “I suppose after that set of mistakes you will want to fire me.” Watson was said to have replied,

> “Not at all, young man, we have just spent a couple of million dollars educating you.” [1]

It all depends on how leadership views employee growth.

[1] http://the-happy-manager.com/articles/characteristic-of-lead...

There's a story about Bill Clinton's early years that is similar. He became governor at 32 and had ambitious plans; increasing the gas tax to fix the roads was one of them. The tax passed, and subsequently Clinton lost re-election. He was stung by his loss since he was a fairly popular governor despite the gas tax hike. A few years later he decided to run again and went all over the state to talk to voters. In one small town he came across a man and introduced himself. The man said "I know who you are, you're the sumbitch that raised the gas tax!" Clinton replied "I guess I can't count on your vote then." The man said "Oh, I'll still vote for you." Shocked, Clinton asked why. The man grinned and said, "Cause I know you'll never do that again!"

That's not really the same though. Did Clinton actually manage to fix the roads? If he did, that wasn't a mistake and voters were simply retaliating for a tax increase.

> Did Clinton actually manage to fix the roads? ... and voters were simply retaliating for a tax increase.

Not a great argument, because many people view maintaining the roads as one of the primary responsibilities of local government (in the USA). If they cannot properly budget and allocate money, then regardless of whether the tax increase worked, it was the wrong way to fix the problem. With that mindset government can fix every problem by raising taxes... Not acceptable to most people.

That argument doesn't necessarily make sense. You can't budget properly and allocate funds if you have no funds. Look at all the countries with a high level of social services. They collect a lot of tax.

If people really think that the government can maintain roads with no money, assuming they don't have that money, I don't know what to say.

> You can't budget properly and allocate funds if you have no funds. Look at all the countries with a high level of social services. They collect a lot of tax.

If their primary purpose is to take care of the roads, that should be one of the first items funded with the taxes they already collect; therein lies the problem people have. It's not like they had no money; it was improperly allocated to the point where they were in the negative against the needs required of them. We are not a country with a lot of social services; we have very few. It's a case of the government not doing its job well and taking more money to cover that fact up.

My point is that services don't come from thin air. There may be things besides the roads that need to get funded every year, with not enough surplus left to cover the roads. You may even have priorities important enough that, even if running them was inefficient, you need to fund them anyway while you try to improve efficiency. Introducing a tax so that you can finally fund a project is not, at face value, a bad idea.

I am not familiar with this particular instance. But the story about Clinton as it stands is not really relevant. Much like this sub-thread.

> If people really think that the government can maintain roads with no money

Do you really think they brought in "no" money? That's ridiculous. The government should figure out how to waste less of the existing taxes before demanding more.

More simply, setting tax rates is part of budgeting.

If I did not budget properly, would it be acceptable to ask my boss for more money? Every time? Or is it my fault for not budgeting properly? I'd probably be fired if I did this.

That seems like an unrelated question. I thought we were talking about Clinton's one-time budget to improve roads that included a tax increase to cover it. Clinton wasn't governor in the previous term, he wasn't the one that under-budgeted the roads originally.

> he wasn't the one that under-budgeted the roads originally.

I'm not sure who caused the budget deficit in the first place, but he is the one that took more money from citizens to fix a problem that should have been fixed by reallocating existing funds.

No one thinks they can fix the roads with no money. Rather, they think they can fix the roads with the amount of money they have.

That said, governments do spend money that doesn't exist as a matter of routine. That's why the Fed exists.

> I never understood the attitude of some companies to fire an employee immediately if they make a mistake such as accidentally deleting some files. If you keep this employee, then you can be pretty sure he'll never make that mistake again.

I did fire an employee who deleted the entire CVS repository.

Actually, as you say, I didn't fire him for deleting the repo. I fired him the second time he deleted the entire repo.

However, there's a silver lining: this is what led us (actually Ian Taylor, IIRC) to write the CVS remote protocol (client/server source control). Before that it was all over NFS, though the perp in question had actually logged into the machine and run rm -rf on it directly(!).

(Nowadays we have better approaches than CVS but this was the mid 90s)

What the hell. How do people just go around throwing rm -rf's around so willy-nilly?

Campfire horror story time! Back in 2009 we were outsourcing our ops to a consulting company, who managed to delete our app database... more than once.

The first time it happened, we didn't understand what, exactly, had caused it. The database directory was just gone, and it seemed to have vanished around 11pm. I (not they!) discovered this and we scrambled to recover the data. We had replication, but for some reason the guy on call wasn't able to restore from it -- he was standing in for our regular ops guy, who was away on site with another customer -- so after he'd struggled for a while, I said screw it, let's just restore the last dump, which fortunately had run an hour earlier; after some time we were able to get a new master set up, though we had lost an hour of data. Everyone went to bed around 1am and things were fine, the users were forgiving, and it seemed like a one-time accident. They promised that setting up a new replication slave would happen the next day.

Then, the next day, at exactly 11pm, the exact same thing happened. This obviously pointed to a regular maintenance job as the culprit. It turned out the script they used to rotate database backup files did an "rm -rf" of the database directory by accident! Again we scrambled to fix it. This time the dump was 4 hours old, and there was no slave we could promote to master. We restored the last dump, and I spent the night writing and running a tool that reconstructed the most important data from our logs (fortunately we logged a great deal, including the content of things users were creating). I was able to go to bed around 5am. The following afternoon, our main guy was called back to help fix things and set up replication. He had to travel back to the customer, and the last thing he told the other guy was: "Remember to disable the cron job".

Then at 10pm... well, take a guess. Kaboom, no database. Turns out they were using Puppet for configuration management, and when the on-call guy had fixed the cron job, he hadn't edited Puppet; he'd edited the crontab on the machine manually. So Puppet ran 15 mins later and put the destructive cron job back in. This time we called everyone, including the CEO. The department head cut his vacation short and worked until 4am restoring the master from the replication logs.

We then fired the company (which filed for bankruptcy not too long after), got a ton of money back (we threatened to sue for damages), and took over the ops side of things ourselves. Haven't lost a database since.

Mine is from back when I was a sysadmin at the local computer club. We had two Unix machines (a VAX 11/750 and a DECstation of some model). We had a set of terminals connected to the VAX and people were using the DECstation by connecting to it using telnet (this was before ssh).

What happened was that one morning when people were logging in to the DECstation they noticed that things didn't quite work. Pretty much everything they normally did (like running Emacs, compiling things, etc) worked, but other, slightly more rare things just didn't work. The binaries seemed to be missing. It was all very strange.

We spent some time looking into it and finally we figured out what had happened. During some maintenance, the root directory of the DECstation had been NFS-mounted on the VAX, and the mount point was under /tmp. I don't remember who did it, but it's not unlikely that it was me. During the night, the /tmp cleanup script had run on the VAX, which deleted all files that had an atime (last access time) of more than 5 days. This meant that all files the DECstation needed to run, and all the files that were used during normal operation, were still there, but anything slightly less common had been deleted.

This obviously taught me some lessons, such as never mount anything under /tmp, never NFS-mount the root directory of anything, and never NFS-mount anything with root write permissions. The most important thing about sysadmin disasters is that you learn something from them.

When disk space is limited and you are working with large files, you need to clean up after yourself. And humans make mistakes. I am not sure if this still does anything in newer rm, but it used to be a common mistake:

    $ rm -rf / home/myusername/mylargedir/
(note the extra space after the first slash, which makes / and home/myusername/mylargedir/ two separate arguments; modern GNU rm refuses to delete / without --no-preserve-root)

The real solution consists of:

    * backups (which are restored periodically to ensure they contain everything)
    * proper process which makes accidental removal harder (DVCS & co.)

Day 1 in my first job in the UK I ran an "update crucial_table set crucial_col = null" without a where clause on production. Turned out there were no backups. Luckily the previous day's staging env had come down from live, so that saved most of the data.

What most people don't realize is that very few places have a real (tested) backup system.

_goes off to check backups_

I had a coworker who would always do manual dangerous SQL like these within a transaction ... and would always mentally compare the "rows affected" with what he thought it should be before committing.

And then commit it.

It's a good habit.

My workflow for modifying production data is:

   1) Write a select statement capturing the rows you want to modify and verify them by eyeball
   2) (Optional) Modify that statement to select the unchanged rows into a temp table to be deleted in a few days
   3) Wrap the statement from step 1 in a transaction
   4) Modify the statement into the update or delete
   5) Check that rowcounts haven't changed from step 1
   6) Copy-and-paste the final statement into your ticketing or dev tracking system
   7) Run the final statement

It may be overkill, but the amount of grief it can save is immeasurable
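A minimal sketch of steps 1, 3-5 and 7 (using sqlite3 only because it's easy to demo locally; the table, column, and WHERE clause are all made up):

```shell
db=$(mktemp)

# Toy data for the demo.
sqlite3 "$db" "CREATE TABLE users(id INTEGER, last_name TEXT);
               INSERT INTO users VALUES (1,'Smith'),(2,'Jones');"

# Step 1: select with the exact WHERE clause and eyeball the count.
sqlite3 "$db" "SELECT COUNT(*) FROM users WHERE id = 1;"

# Steps 3-5, 7: same WHERE clause inside a transaction; changes()
# must match the count above before you let the COMMIT stand.
sqlite3 "$db" "BEGIN;
               UPDATE users SET last_name = 'Taylor' WHERE id = 1;
               SELECT changes();
               COMMIT;"
```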

I have never done what the GP describes but I consider myself very lucky as it's a very common mistake. I have heard enough horror stories to always keep that concern in the back of my mind.

I do what your coworker did and it's a great feeling when you get the "451789356 rows updated" message inside a transaction where you are trying to change Stacy's last name after her wedding and all you have to do is run a ROLLBACK.

Then it's time to go get a coffee and thank your deity of choice.

One of PostgreSQL's best features is transactional DDL: You can run "drop table" etc. in a transaction and roll back afterwards. This has saved me a few times. It also makes it trivial to write atomic migration scripts: Rename a column, drop a table, update all the rows, whatever you want -- it will either all be committed or not committed at all. Surprisingly few databases support this. (Oracle doesn't, last I checked.)
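For anyone who wants to see this behaviour without spinning up PostgreSQL: SQLite also supports transactional DDL, so a quick local demo works (the table name is made up):

```shell
db=$(mktemp)
sqlite3 "$db" "CREATE TABLE important(x);"

# Drop the table inside a transaction, then roll it back...
sqlite3 "$db" "BEGIN; DROP TABLE important; ROLLBACK;"

# ...and the table is still there.
sqlite3 "$db" "SELECT name FROM sqlite_master WHERE name = 'important';"
```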

MySQL's console can also warn you if you issue a potentially destructive statement without a WHERE clause: http://dev.mysql.com/doc/refman/5.7/en/mysql-tips.html#safe-...

The `--i-am-a-dummy` flag, which I wish were called `--i-am-prudent` because we all are dummies.

It works for more than databases.

- With shells, I prefix risky commands on production machines with #, especially when cutting and pasting

- Same for committing stuff into VCS, especially when I'm cherrypicking diffs to commit

- Before running find with -delete, run with -print first. Many other utilities have dry-run modes
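The find pattern looks like this (directory and glob are illustrative):

```shell
# Dry run: print what would match, and review the list.
find /var/tmp/myapp -name '*.tmp' -print

# Only after the list looks right, swap -print for -delete.
find /var/tmp/myapp -name '*.tmp' -delete
```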

I do a select first using the where clause I intend to use to get the row count.

Then open a transaction around the update with that same where clause, check the total number of rows updated matches the earlier select, then commit.

This approach definitely reduces your level of anxiety when working on a critical database.

My practise is to do:

  UPDATE ImportantTable SET
    ImportantColumn = ImportantColumn
  WHERE Condition = True

Check the rows affected, then change it to:

  UPDATE ImportantTable SET
    ImportantColumn = NewValue
  WHERE Condition = True

Not doing this is like juggling with knives. I cringe every time I see a colleague doing it.

Lots of people "do backups", not many have a "disaster recovery plan" and very few have ever practised their disaster recovery plan.

Years ago we had an intern for a time, and he set up our backup scripts on production servers. He left after a time, we deleted his user, and went on our merry way. Months later, we discover the backups had been running under his user account, so they hadn't been running at all since he left. A moment of "too busy" led to a week of very, very busy.

I've done that where crucial_col happened to be the password hash column.

We managed to restore all but about a dozen users from backup, and sent a sheepish email to the rest asking them to reset their passwords.

Yup, I did something like that command once, to a project developed over 3 months by 5 people without a backup policy (university group project). Luckily, this was in the days when most of the work was done on non-networked computers, so we cobbled everything together from partial source trees on floppies, hunkered down for a week to hammer out code and got back to where we were before. It's amazing how fast you can write a piece of code when you've already written it once before.

That was the day I started backing up everything.

I am finding more and more that the 'f' is not required. Just 'rm -r' will get you there usually, and so I'm trying to get into the habit of only doing the minimum required. Unfortunately, git repos require the -f.

Accidents like these have happened to me enough times that my .bashrc contains this in all my machines:

    alias rm='echo "This is not the command you are looking for."; false'

I install trash-cli and use that instead.

Of course this does not prevent other kinds of accidents, like calling dd to write on top of the /home partition... ok, I am a mess :)

> The real solution is comprised of

* making "--preserve-root" the default... :-)

Nowadays, with the low price of disk space and the high price of time, it's much cheaper to buy new disk drives than to pay people to delete files. And safer!

I did something similar to my personal server using rsync.

> cd /mnt/backup

> sudo rsync -a --delete user@remote:some/$dir/ $dir/

Only to see the local machine become pretty much empty when $dir was not set.

Funny to still see Apache etc still running in memory despite any related files missing.
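The shell's `${dir:?}` expansion would have caught this: it aborts instead of expanding to an empty string when the variable is unset or empty (a guard I add myself, not something rsync provides):

```shell
dir=""   # simulate the forgotten assignment

# ${dir:?message} makes the (sub)shell bail out before rsync ever runs,
# so an empty $dir can't turn the destination into the filesystem root.
( rsync -a --delete "user@remote:some/${dir:?dir is not set}/" "${dir:?}/" ) \
    || echo "refused to run"
```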

On Linux, if a process is holding those files open, the OS doesn't actually free them until the last file handle is closed. You can dig into /proc, find the process's file descriptor entries, cat the contents back out, and restore whatever is still running, as long as you don't kill the process.

For next time Apache is hosting a phantom root dir. ;) These things happen to all of us. We just have to be prepared.
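A minimal sketch of the recovery (Linux only; the fd number depends on what the process actually has open, and everything here is a toy stand-in for the real daemon):

```shell
workdir=$(mktemp -d)
echo "precious data" > "$workdir/config"

# Stand-in for the running daemon: open the file on fd 3, then linger.
( exec 3< "$workdir/config"; sleep 30 ) &
pid=$!
sleep 1                      # give the subshell time to open the fd

rm "$workdir/config"         # the "disaster"

# The inode lives on while fd 3 is open; copy it back out via /proc.
cp "/proc/$pid/fd/3" "$workdir/config.restored"
kill "$pid"
```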

Ahh, learned something new. Informative comment.

> before that it was all over NFS, though the perp in question had actually logged into the machine and done rm -rf on it directly(!).

With NFS Version 3, aka NeFS, instead of using rlogin to rm -rf on the server, the perp could have sent a PostScript program to the server that runs in the kernel, to rapidly and efficiently delete the entire CVS tree without requiring any network traffic or even any context switches! ;)


The Network Extensible File System protocol(NeFS) provides transparent remote access to shared file systems over networks. The NeFS protocol is designed to be machine, operating system, network architecture, and transport protocol independent. This document is the draft specification for the protocol. It will remain in draft form during a period of public review. Italicized comments in the document are intended to present the rationale behind elements of the design and to raise questions where there are doubts. Comments and suggestions on this draft specification are most welcome.

The Network File System The Network File System (NFS™) has become a de facto standard distributed file system. Since it was first made generally available in 1985 it has been licensed by more than 120 companies. If the NFS protocol has been so successful why does there need to be NeFS ? Because the NFS protocol has deficiencies and limitations that become more apparent and troublesome as it grows older.

1. Size limitations.

The NFS version 2 protocol limits filehandles to 32 bytes, file sizes to the magnitude of a signed 32 bit integer, timestamp accuracy to 1 second. These and other limits need to be extended to cope with current and future demands.

2. Non-idempotent procedures.

A significant number of the NFS procedures are not idempotent. In certain circumstances these procedures can fail unexpectedly if retried by the client. It is not always clear how the client should recover from such a failure.

3. Unix® bias.

The NFS protocol was designed and first implemented in a Unix environment. This bias is reflected in the protocol: there is no support for record-oriented files, file versions or non-Unix file attributes. This bias must be removed if NFS is to be truly machine and operating system independent.

4. No access procedure.

Numerous security problems and program anomalies are attributable to the fact that clients have no facility to ask a server whether they have permission to carry out certain operations.

5. No facility to support atomic filesystem operations.

For instance the POSIX O_EXCL flag makes a requirement for exclusive file creation. This cannot be guaranteed to work via the NFS protocol without the support of an auxiliary locking service. Similarly there is no way for a client to guarantee that data written to a file is appended to the current end of the file.

6. Performance.

The NFS version 2 protocol provides a fixed set of operations between client and server. While a degree of client caching can significantly reduce the amount of client-server interaction, a level of interaction is required just to maintain cache consistency and there yet remain many examples of high client-server interaction that cannot be reduced by caching. The problem becomes more acute when a client’s set of filesystem operations does not map cleanly into the set of NFS procedures.

1.2 The Network Extensible File System

NeFS addresses the problems just described. Although a draft specification for a revised version of the NFS protocol has addressed many of the deficiencies of NFS version 2, it has not made non-Unix implementations easier, nor does it provide opportunities for performance improvements. Indeed, the extra complexity introduced by modifications to the NFS protocol makes all implementations more difficult. A revised NFS protocol does not appear to be an attractive alternative to the existing protocol.

Although it has features in common with NFS, NeFS is a radical departure from NFS. The NFS protocol is built according to a Remote Procedure Call model (RPC) where filesystem operations are mapped across the network as remote procedure calls. The NeFS protocol abandons this model in favor of an interpretive model in which the filesystem operations become operators in an interpreted language. Clients send their requests to the server as programs to be interpreted. Execution of the request by the server’s interpreter results in the filesystem operations being invoked and results returned to the client. Using the interpretive model, filesystem operations can be defined more simply. Clients can build arbitrarily complex requests from these simple operations.

Surely you've heard of at least these arguments:

- Employee was error prone and this mistake was just the biggest one to make headlines. Could be from incompetence or apathy.

- Impacted clients demanded the employee at-fault be terminated.

- Deterrence: fire one guy, everyone else knows to take that issue seriously. Doesn't Google do this? If you leak something to press, you're fired, then a company email goes out "Hey we canned dude for running his mouth..."

It's better to engage the known and perhaps questionable justifications than to "never understand".

Case 1: It's fine to fire individuals for ongoing performance issues (though you must make clear to those who remain the number and types of issues the individual already had, and the steps that had been taken to help the individual rectify their performance issues).

Case 2: no competent manager would fire an employee who made a mistake to satisfy clients. They may move the employee to a role away from that client, but it would be insanity to allow the most unreasonable clients to dictate who gets fired. Any manager who does what you suggest should expect to have lost all credibility in the eyes of their team.

Case 3a: A leak to the press is a purposeful action. Firing for cause is perfectly reasonable. Making a mistake is not a purposeful action.

Case 3b: If you want to convey that a particular type of mistake is serious, don't do so by firing people. Do so with investments in education, process, and other tools that reduce the risk of the mistake occurring, and the harm when the mistake occurs. Firing somebody will backfire badly, as many of your best employees will self-select away from your most important projects, and away from your company, as they won't want to be in a situation where years of excellent performance can be erased with a single error.

Case 2: Agreed, but not everyone is lucky enough to work for a competent manager. And managers don't fit neatly within competent and incompetent buckets. Under external or higher pressure ("his job or your job") a normally decent manager might make that call.

Case 3a: Good distinction, a conscious leak is not a mistake. It's possible for a leak to be accidental though, say under alcohol, lost laptop, or just caught off guard by a clever inquisitor.

Case 3b: Firing has the effects you mention, but it also has the effect of assigning gravity to that error. I'm not claiming the benefits outweigh the drawbacks, but some managers do.

I'm not a proponent of the above, but it's good to understand the possible rationale behind these decisions.

Firing someone over making a mistake is never a good idea.

If you're going to have firing offenses, spell those out. E.g. breaking the law, violating some set of rules in the handbook, whatever., so that people can at least know there's a process or sensibility to the actions.

If people can be fired for making a mistake, and that wasn't laid out at the outset, then they're just not gonna trust the stability of your workplace.

Firing for mistakes can make sense at a small company where the cost of rectifying the mistake is large enough to significantly impact the budget. If that cost needs to be recouped, it is only fair that it be recouped from the salary saved by terminating the responsible party. We're not all megacorps.

This is going to depend on the severity, cost, budget, importance of the role filled, etc., but I think it's probably one of the only semi-plausible justifications for firing based on things that do not reflect a serious and ongoing competency or legal issue.

That's nonsense.

A mistake is made, and a material loss has been incurred. This sucks. Been there, done that, didn't get the t-shirt because we couldn't afford such a luxury. I watched my annual bonus evaporate because of somebody else's cock-up.

But there's no reason to believe that firing the mistake-maker is the best move. Maybe the right move is to find money somewhere else (cutting a bunch of discretionary, pushing some expenses into the future, reducing some investment), or maybe it's to ask a few people to take pay cuts in return for some deferred comp. Or maybe it's to lay-off somebody who didn't do anything wrong, but who provides less marginal value to the company.

But it'd be one hell of a coincidence if, after an honest to god mistake, the best next action was to fire the person who made the mistake. After all, if they were in a position to screw your company that hard, they almost certainly had a history of being talented, highly valued, and trustworthy. If they weren't good, you wouldn't have put them in a place where failure is so devastating.

>But there's no reason to believe that firing the mistake-maker is the best move.

Yeah, I'm not saying it necessarily or even probably is. I'm saying that reality sometimes makes it so that we have to make these compromises.

>Firing for mistakes can make sense in the context of a small company that has to pay enough to rectify the mistake that it significantly impacts the budget. If this cost needs to be recouped, it is only fair that it be recouped from the salary preserved by terminating the responsible party.

What was the fired person doing? Presumably they were performing required work otherwise the company wouldn't have been paying them in the first place.

That means you now need to pay to replace them, which costs more than keeping an existing employee. Or you could divide their responsibilities among the remaining employees, but if you thought you could do that you would have already laid them off without waiting for them to mess something up.

If you're going to let your clients decide when you fire someone you're having some enormous issues. Take the person off their account, sure, but how in hell does a client make your HR decisions?

> Deterrence

If I see someone getting axed for making a mistake, I'd be making a mistake if I didn't immediately start firing up the resume machine.

> Doesn't Google do this? If you leak something to press, you're fired, then a company email goes out "Hey we canned dude for running his mouth..."

I've never heard of this happening. I've heard of people fired for taking photographs (or stealing prototypes!) of confidential products and handing them to journalists.

Leaking something to the press is an entirely different class of failure than a technical screw up.

"Why would I fire you? I just paid $10M to train you to not that make mistake again!"

I had a great boss (founder of the company) who said, after I just screwed up, "There is not a mistake you can make, that I haven't already made. Just don't make the same mistake twice."

Reminds me of the boot-camp story of the nervous recruit inquiring of his sergeant:

"Sir beg my pardon for asking but why did you give Smith 50 press-ups for asking a question? You said that there were no stupid questions. Sir."

"I gave him the press-ups the SECOND time he asked. Will you need to ask again?"

That's awful. So they train people never to ask for clarification or a refresher if they misunderstand or forget, so instead they go on to make a far worse mistake acting on incorrect information.

A clarification question is not asking the same question twice. The point seems to be that you should pay attention.

I believe that is the lesson from the story, but I don't believe the lesson from the story == the lesson irl.

The IRL takeaway is that if you don't open your mouth, you don't get punished. If you do open your mouth, you might get punished.

It's incentives. The true benefit to the company comes when people can make mistakes and learn from them. But often, the forces on management are not in alignment. Imagine Manager Mark has a direct report, Ryan, commit a big and public error. And then 1.5 years later Ryan commits another public error.

"What kind of Mickey Mouse show is Manager Mark running over there?" asks colleague Claire, "Isn't that guy Ryan the same one that screwed up the TPS reports last year?"

On the other hand, if Mark fires Ryan, then Mark is a decisive manager. Even if the total number of major errors ends up higher, he avoids the risk of becoming known as a manager who lets people screw up multiple times.

Just like how a new executive is brought into a failing company, the company still fails but the executive is awarded a nice severance package.

Sounds like the life story of Yahoo.

"From the Earth to the Moon" - a great series about the space race (Tom Hanks made it after Apollo 13) - has a scene about this:


Very good point. Aviation is awesome in that sense - accident investigations are focused on understanding what happened, and preventing re-occurence. Allocating blame or punishment are not part of it, at least in enlightened jurisdictions.

Furthermore, a lot of individual errors are seen in an institutionalised "systems" framework: given that people will invariably make mistakes, how can we set up the environment/institutions/systems so that errors are not catastrophic?

Not sure how that applies to movie animation, to be honest, but not primarily looking for whom to blame was certainly a very good move.

Rotorcraft PPL here: actually assigning blame is a very important part of accident analysis because it's critical to determining whether resolution will involve modifying training curriculum or if there was a mechanical failure. CFIT, engine failure due to fuel exhaustion, flight into wires, low G mast bumping, and settling with power are all sure ways to die from pilot error. And if the pilot did make a serious error their license could be suspended or revoked.

In general we consider that people don't want to die while piloting an airplane. So, even in a major event where all lives aboard were lost, investigating the whole problem and finding opportunities for improvements will make aviation safer, simply saying "the pilot screwed it" won't get anything done.

Of course, if the problem is just borderline behavior by the pilot or co-pilot, it'll be fantastic if we can get him off the circuit before he locks the captain outside and programs the plane to crash into a mountain, or stretches fuel limits until he runs out of gas.

But... if we can also learn how to make an out-of-gas plane land and survive, and the cost is "let's not put this pilot in jail, because it's better to learn how to save more lives", I prefer this approach. Probably you'll be able to get the pilot off the circuit for some other behavior anyway.

>So, even in a major event where all lives aboard were lost, investigating the whole problem and finding opportunities for improvements will make aviation safer, simply saying "the pilot screwed it" won't get anything done.

Yes, it does. Many aviation crashes are attributed to "pilot error". There's only so much you can do with procedures and such; at some point, the pilot has to be held accountable for screwing up, and investigations do exactly that many times.

Usually, in major events, you're looking at commercial airliners with a pilot and co-pilot and in those cases, it's usually something much worse than a mistake by the pilot, and frequently several bad things happening at once. But in general aviation, where you have one pilot, frequently non-commercial, flying a small aircraft, the cause is frequently just "pilot error". A common example of this is the pilot running out of fuel because they did their calculations wrong. It happens frequently with private pilots, and in a Cessna you can't just pull over when you run out of gas.

I'd argue that even in cases of "clear cut" pilot error, the goal is to learn and prevent it.

For example, the first fatal 747 crash, the Lufthansa that came down on departure from Nairobi, happened almost certainly because the flight crew did not extend the leading edge flaps. A clear case of pilot error. If the response had stopped there, it would have happened again (in fact, it had happened at least twice before, but at lower-altitude airports where the aircraft's performance was enough for the crew to depart without accident).

Instead, it was acknowledged that the whole system could be improved, and Boeing put in a take-off configuration warning.

Similarly, AF 447 over the Atlantic - sure, you can argue that the pilot in the right seat should not have held the stick back, and that it was entirely his fault, case closed. But maybe one can, instead, improve the whole system, the HCI, etc.

Edit: typo

You're talking about big commercial planes. Yes, you can alter procedures here and attempt to prevent the same thing from happening again.

Not in general aviation. You're not going to get all the Cessna 172 owners to modify some part of their plane to make it better and avoid some incident where some yahoo private pilot did something dumb and crashed. It's hard enough just getting privately-owned aircraft to be properly serviced. Many of them are many decades old and quite primitive. You're not going to improve "the whole system", the HCI, etc. in some airplane made in 1940 or whenever.

Finally, attributing an incident to "pilot error" doesn't automatically mean that there weren't contributing factors or that things couldn't be done better.

That's true. (Just as a rendering farm producing a major motion picture needs a different approach and policy than some dude on his typewriter... :-)

You are probably right, but I am privately a bit concerned regarding the crashed Air France Flight 447, and the conclusions made regarding pilot error.

I can't shake the suspicion that the Airbus man-machine interface and programming is partly to blame - possibly only when pushed into an extreme configuration, and certainly just as a topping on other factors.

It's clear however, that it's politically and economically impossible to ground all machines made by the European union's prestige project, Airbus.

Fully agreed - shifting all the blame on the pilot absolves Airbus too easily.

Remember the AirAsia from Surabaya, where they got themselves into an upset because of a tiny crack in the rudder-travel limiter unit ("topping on other factors"), then apparently made basically the same mistake as AF 447, one pilot pulling full back all the way down.

Let's ask the authority, the NTSB:

> The NTSB does not assign fault or blame for an accident or incident; rather, as specified by NTSB regulation, “accident/incident investigations are fact-finding proceedings with no formal issues and no adverse parties ... and are not conducted for the purpose of determining the rights or liabilities of any person.” 49 C.F.R. § 831.4.

(This is on first (non-title) page of any NTSB accident report. [1])

Of course, the NTSB determines "Probable Cause", and makes safety recommendations. But the point is not to blame the pilot. Notice also that enforcement action is taken by a different agency, the FAA, not the NTSB.

Lastly, the whole ASR reporting system is set up to maximise the information gathered and minimise future accidents, while giving some dispensation to pilots that have made mistakes.

[1] see e.g. https://www.ntsb.gov/investigations/AccidentReports/Reports/...

Edit: add > to indicate quote.

Sorry for banging on about this, but the accident report I linked to above is a great example, as it happens. It's about the cargo 747 that stalled shortly after take-off in Bagram, Afghanistan, caught on a spectacular and sobering video.

Cause: cargo was not secured enough and slid back during take-off, shifting centre of gravity, stall, crash. Blame: loadmaster. Done. Or are we?

No: Loadmasters are not FAA-certificated (a gap in the system). The operator procedures were inadequate. FAA oversight over these cargo operations was deficient, one reason being that the FAA inspectors were insufficiently trained. So, suddenly "blame" rests not only with the loadmaster (who perished in the crash, btw), but with the system, the operator, FAA procedures, FAA training, etc.

This is true, but I think the point the parent comment was getting at is that an investigation tends to take a more holistic look at the incident rather than simply assigning blame directly to a single factor. Even pilot error, especially in the context of commercial aviation, is often found to be the result of training deficiencies or cultural issues on the part of the airline.

Yes, exactly, well said.

That's precisely the idea: if a pilot makes errors so grave as to endanger the aircraft, how come the airline training/monitoring in place did not pick up indications earlier?

The people that helped overcome the "pilot error, case closed"-mindset must be thanked (among others) for making aviation today as safe as it is.

Robinson R22 pilot?

Assigning "Cause" is an important part of accident analysis. There have been many cases of a pilot with proper aeronautical decision making processes, and due care and caution, still making a piloting error resulting in a mishap.

Classic example would be TACA 110 [0] Weather related factors, and a flaw in the engine design caused both engines to fail. Rushing the restart procedure resulting in a "hung start" of both engines and subsequent overheat. Thanks to the skillful flying of Capt. Carlos Dardano, and his crew, this 737 made one of the most successful dead-stick landings in history. Capt. Dardano was not to "blame" for the mishap, but he did make errors that were contributing factors.

The 737 needed an engine replaced, but was able to be flown out and fully repaired within weeks of the mishap. Southwest Airlines retired the aircraft in December 2016, with over 27 years of uneventful service.

[0] https://en.wikipedia.org/wiki/TACA_Flight_110

That's a different sort of blame than what the parent was talking about. This is a sort of technical blame - what needs to be different to not have that sort of incident happen. That's different than "blame" as in scapegoating, where now that you've got a story about how someone fucked up we don't have to feel bad about there being a plane crash anymore.

"Aviation is awesome in that sense - accident investigations are focused on understanding what happened, and preventing re-occurence. Allocating blame or punishment are not part of it, at least in enlightened jurisdictions."

Same goes for every tech company I have worked at. I have never been in a post-mortem meeting where the goal was to allocate blame. It was always emphasized that the goal of the meeting was to improve our process to make sure it never happens again, not punish the party responsible.

Didn't France find a Continental mechanic guilty of manslaughter in the Concorde crash?

Appeal court overturned the conviction.

Excerpts from Wiki:

> In March 2008, Bernard Farret, a deputy prosecutor in Pontoise, outside Paris, asked judges to bring manslaughter charges against Continental Airlines and two of its employees – John Taylor, the mechanic who replaced the wear strip on the DC-10, and his manager Stanley Ford – alleging negligence in the way the repair was carried out.

> At the same time charges were laid against Henri Perrier, head of the Concorde program at Aérospatiale, Jacques Hérubel, Concorde's chief engineer, and Claude Frantzen, head of DGAC, the French airline regulator. It was alleged that Perrier, Hérubel and Frantzen knew that the plane's fuel tanks could be susceptible to damage from foreign objects, but nonetheless allowed it to fly.

> Continental Airlines was found criminally responsible for the disaster by a Parisian court and was fined €200,000 ($271,628) and ordered to pay Air France €1 million. Taylor was given a 15-month suspended sentence, while Ford, Perrier, Hérubel and Frantzen were cleared of all charges. The court ruled that the crash resulted from a piece of metal from a Continental jet that was left on the runway; the object punctured a tyre on the Concorde and then ruptured a fuel tank. The convictions were overturned by a French appeals court in November 2012, thereby clearing Continental and Taylor of criminal responsibility.

> The Parisian court also ruled that Continental would have to pay 70% of any compensation claims. As Air France has paid out €100 million to the families of the victims, Continental could be made to pay its share of that compensation payout. The French appeals court, while overturning the criminal rulings by the Parisian court, affirmed the civil ruling and left Continental liable for the compensation claims.

I remember this from the Field Guide to Understanding Human Error. Making recovering from human error a well-understood process is important, and as you point out, that process will work best if people aren't distracted by butt-covering.

Link for those interested:


The fault here does not lie with just one person. One person ran the rm -rf command. Other people failed to check the backups. Others made the decision to give everyone full root access. Really it was a large part of the company that was to blame.

Whenever there's a bug in code I reviewed, there are at least two people responsible: The person who wrote the code and me, the person who reviewed it.

I've found that that helps morale, as there's a sense of shared responsibility, but there's no blaming people for problems where I work, so I haven't actually seen what happens when people are searching for the culprit.

The usual process is "this happened because of this, this and this all went wrong to cause us to not notice the problem, and we've added this to make it less likely that it will happen again". If you have smart, experienced people and you get problems, it's because of your processes, not the people, so the former is what you should fix.

Everywhere I have worked took the shared responsibility approach, I think that's the status quo but there are obviously exceptions.

One day at $megacorp a bug caused a production outage and a project manager sent an email to the team calling out the engineer who committed the change and asking us all to be more careful in the future. That manager was immediately reprimanded both in the email chain and in private.

I find the culture of shared responsibility to be one of the best qualities of our industry, even if it isn't universal.

Oh god. I know someone who made all of these mistakes by themselves in a certain week. And kept his job. He was pretty.

I left and found out two months later from a friend that he had managed to take down almost every single server in the place to which he had access. Even the legacy "don't touch" systems that just boot and run equipment.

Be pretty.

Ed Catmull discusses this incident thoroughly in Creativity Inc. He believed seeking retribution for this incident would've been counterproductive and would have undermined Pixar's overall ethos as a safe place to experiment and make mistakes. It is this ethos and culture of vociferous, thorough experimentation and casting everyone's performance in the light of "What can we learn from that?" rather than "What ROI did we get from the last 3 months?" that Catmull credits for Pixar's success (paraphrasing here, but I believe this is an accurate summary).

Since Catmull has an engineering background (his PhD involved the invention of the Z-buffer, and he was doing computer graphics before anyone knew anything about it), he understands that mistakes and failed projects, when combined with a forthright and collaborative feedback loop, are not problematic detours, but rather necessary mile markers on the path to real innovation. We'd be so much further ahead if we put more men like Catmull in charge of things.

The biggest problem with reading Creativity Inc. is that it will rekindle the hope that there may be a sane workplace out there somewhere, when practically speaking, we know that few of us will ever find employment in one. It gave me a number of disquieting feelings as I read that the attributes of a workplace that all good engineers crave actually can and sometimes do exist out there. I had convinced myself that these things were myths, so now I'm sad that my boss isn't Ed Catmull.

That said, I do believe some evaluation and/or discipline would've been appropriate in this case, not for the person who accidentally executed a command in the wrong directory, but for the people who were supposed to be maintaining backups and data integrity.

Assuming that your primary job duties involve data integrity and system uptime, having non-functional backups of truly critical data stretches beyond the scope of "mistakes" and into the scope of incompetence.

It is, I'm sure, very possible that no one was really assigned this task at Pixar and that it would therefore be improper to punish anyone in particular for the failure to execute it, but I do believe there is a limit between mistakes en route to innovation and negligence. My experience has been that most companies strongly take one tack or the other: they either let heads roll for minor infractions (and thus never allow good people to get established and comfortable), or they never fire anyone and let the dead weight and political fiefdoms gum up the works until the gears stop altogether. There needs to be a balance, and that's a very hard thing to find out there.

> It is, I'm sure, very possible that no one was really assigned this task at Pixar and that it would therefore be improper to punish anyone in particular for the failure to execute it, but I do believe there is a limit between mistakes en route to innovation and negligence.

If indeed there was no-one assigned this task, then it was a mistake of negligence on the part of Pixar's management at the time. I'm not saying that to be snippy — that is exactly the job of management: to build the systems and processes required for employees to achieve the firm's goals. Proper backup and restore of data is one of those processes.

Yeah, I understand that, but at the same time, backups and security, while being among the two most critical aspects of IT and computer infrastructure at a company, are often the most overlooked by everyone. That persists today and I'm sure it was even more the case back then. If management can't give it the time of day, how can an employee be expected to do so?

An executive usually requires a "Come to Jesus" moment like this one, where the entire company teeters on the precipice due to lax backup or security policies, to really have the importance impressed upon him or her. At that point, they are generally much more supportive, though sadly, this too can start to fade if the sysadmins do their job too well.

I don't want anyone to come away thinking that most companies have solved these problems. It's definitely not the case, even in large, established companies. Security and backups continue to get little attention until it's too late.

We really need to call on a celebrity in MBA circles and get that person to run a seminar meant to scare the pants off the execs.

We had a proper backup system in place at my company. Backups were replicated to an identical system at a remote site, and we periodically validated random backups. Our audit team also spot checked our backups to ensure that all servers were being backed up. Then we had a management change: the new management ditched our system for a more expensive solution and basically told our application owners that backups were a nice thing to have, but they shouldn't really count on them. The person in charge of backups was assigned other tasks and encouraged not to spend too much time on managing the backup system. With over 1000 servers, it's a matter of time until we have an issue that leads to a CTJ event. Unfortunately, the backup admin will probably be the one facing Jesus, not the management staff...

> I had convinced myself that these things were myths, so now I'm sad that my boss isn't Ed Catmull.

Wasn't Catmull involved in wage-fixing? [http://www.cartoonbrew.com/artist-rights/ed-catmull-on-wage-...]

He was at center, and admitted to it. https://www.bloomberg.com/news/articles/2014-11-19/apple-goo...

“Like somehow we’re hurting some employees? We’re not,” Catmull said. “While I have responsibility for the payroll, I have responsibility for the long term also,” Catmull said. “I don’t apologize for this. This was bad stuff.”

I can't find the other interview, but in a later interview he makes it clear his job is to worry about Disney's profitability.

I worked in visual effects for five years and loved every moment. People who choose that profession are fun, creative, passionate, artistic, super energetic, crazy, smart, pull off the nearly impossible every project, out of the ordinary in every way, and crazy (listing that one twice). I miss the people. When I changed back to non-vfx development work my take home pay literally doubled. Obviously there is more to life than just pay! But the wage suppression has had a lasting effect... at the same time there are so many people who want into "the biz" that it appears they can/should? get away with it.

There is no artificial wage suppression. While it could be argued that the common business arrangement of no-poach agreements could potentially have that effect, it's certainly not the only (and probably not even a significant) factor.

In VFX, as in game dev, there is far more supply of people desperate to get those gigs than there is demand. That's why your take-home pay doubles when you switch to something less alluring, not a competitor's agreement not to recruit.

Catmull has to look at the big picture, i.e., "Is our company going to operate well if competitors are constantly sending out emails to our employees and offering them 120% salary to switch jobs? Since it'd be equally disruptive if we did this to them and no work would ever be accomplished in this sector, let's just have a truce where we don't actively pursue one another's talent, and then we can stop destroying each others' projects with these counterproductive bidding wars."

Continuity and seniority is very important to the smooth operation of a company. VFX projects take years to complete and they're probably more sensitive to high churn than other types of projects, so the concern is even more justified in this sector than in the general sense.

If the company can't see the big picture and find a way to be productive within that climate, far worse than not getting cold calls from the recruiting department of the competitor, everyone will be out of work.

I know it's fun to hate on executives and believe me, I know they very frequently deserve every bit of it. But there is validity to the perspective that concerns for overall corporate performance must be taken into the balance.

I gotta say I agree with Catmull. I understand that specifically the VFX people are trying to blow this up into some massive offense, but I simply do not see it. As I stated in other comments, this is a very common arrangement that is in no way limited to VFX, Silicon Valley, or tech.

It's great that Catmull has the backbone to refuse to apologize when someone is trying to shame him into submission. This is a surprisingly rare attribute these days.

Whenever this comes up, I get eviscerated on HN, but I don't think Catmull was involved in the mean-spirited conspiracy that union groups are trying to gin up. I believe he saw those deals, which were absolutely conventional as you can see by the list of participants, not as limiting employees from seeking other employment, but merely discouraging counterproductive bidding wars.

One potential interpretation is that this artificially depresses worker salaries as workers are not continuously being auctioned back and forth. Another potential interpretation is that this allows the company to have the stability it needs to function, prevent toxic sentiment among peers who take a bid from one company or the other, potentially leaving others holding the bag for the project.

I believe this latter interpretation is the intent of most such agreements, and that the former is rarely considered legitimate (i.e., continuous bidding wars would be too disruptive to be feasible even if there were no formal agreements in place).

Such understandings are very common across competitors in all lines of work, whether they're written or not; at a former company, I was personally told by the CEO that we couldn't actively recruit someone who worked at a competitor because we didn't want to risk starting a bidding war over talent and potentially throwing everything off-kilter. Such arrangements are not SV-exclusive, let alone Pixar-exclusive, and they are done out of practicality, not malice. Market value for employees can be correctly surmised without feverish, aggressive overbidding.

The incident is frequently misconstrued as a complete block on any cross-hiring. My understanding is that it was simply an agreement forbidding cross-recruiting; a gentleman's agreement that they wouldn't try to start an arbitrary bidding war over the one company's talent if that company wouldn't try to start one over theirs. Employees were still free to seek and obtain employment at any of the major studios independently if they so chose.

I think that panicked cries of wage fixing and intentional repression of employment opportunities are not only not credible, but farcical. I'd ask myself why someone is interested in painting an imbalanced and unrepresentative picture such as that.

If anything, these agreements are a failure of the contemporary legal and HR departments across every major technology company involved in computer animation. I believe the intent of the executives was nothing more than maintaining a stable workforce. Their lawyers and HR people should have warned them that there was another dormant interpretation that could've been used by exploitative politicians to misrepresent the situation.

Last time this came up (that I saw), the Pixar, Apple, and Intel et al chiefs were being compared to Nazis. That is beyond the pale.

The cross-recruiting agreement is controversial; at least there's an argument that it should have been OK.

But Catmull in particular was involved in the more extreme implementations of the deal. In his version, not only did cooperating companies agree not to actively poach, nor to passively extend offers when approached by employees of competitors, but also they agreed to actively report amongst themselves whenever they were approached by employees of competitors.

This probably damaged the careers of engineers who were not completely satisfied with their current employer, because as an exec are you going to give a key project to someone who's about to go work for the competition?

Maybe even the Catmull version of the wage-fixing scheme has some defenders, but I think there are very few defenders of the Catmull form of the agreement compared to the fraction of HNers who are OK with the milder form that merely instructed recruiters not to actively initiate conversations with employees of competitors.

Rather than this very long defense of someone it sounds like you do not know, how about a simpler explanation similar to the one that started this discussion: people make mistakes, it's better not to focus on retribution.

Just as we don't go out of our way to defend the guy who did rm -rf for what he did specifically, but rather move on.

Because I don't necessarily think it was a mistake. And to be clear, you are correct that I have no personal association with Ed Catmull (or anyone else involved, including any employees who may have been affected by the no-poach compacts). I just don't like seeing the hate machine churn up over a pretty typical and reasonable business practice, especially against leaders as commendable as Ed Catmull, who has, by far, taken the brunt of the attacks on these issues (because no one knows who runs Dreamworks).

IMO, the lesson to draw from this is "get better legal advice and avoid showboating politicians". Happy to move on.

EDIT: also, shortened the parent for you. I agree it was overly long.

It is a mistake if it is illegal.

You can't make ANY agreements to refuse to recruit from certain companies.

Making any "official" effort or policy to prevent a bid war is illegal.

I'm not a lawyer, but given the volume of long articles on legal industry websites when one searches for terms like "no-poach agreement legality", it appears this is not so cut and dry. I am sure that to some extent, it will also vary based on jurisdiction.

Do we really believe that the big shot lawyers at all of these places (remember, many household brands were parties) are so bad they would allow a contract that was per se illegal to be entered into, or that the executives secretly fast-tracked these documents and bypassed counsel? To me, the situation sounds much grayer than some portray it.

This case was settled, not tried. We are left to speculate on what the outcomes and conclusions would've been had the settlement not proceeded.

It depends on the jurisdiction, sure. But in this case the jurisdiction was California, the most pro labor state in the US.

I don't believe their lawyers or the companies are incompetent. I believe that they were actively malicious.

They thought that the legal costs of getting caught and losing the lawsuit, or settling, would be less than following the law, so they actively chose to take the risk and break the law.

And it turns out, they were mostly correct about that. Doesn't mean that they shouldn't be condemned for it.

This is backed up by all the statements that they made about "don't put this stuff in writing!", etc. And the fact that they stopped engaging in these practices. (If they were doing nothing wrong, then they would continue, right?)

Is a lawyer not incompetent if he green-lights the corporation's involvement in an actively malicious scheme?

No. It is actually a smart (but evil) thing to do, if you think the courts won't fine you that much.

X billion gain > Y billion loss.

These companies won. They saved more money than they lost.

From a business standpoint, it was a good idea to break the law, and get sued.

> Such understandings are very common across competitors in all lines of work, whether they're written or not; at a former company, I was personally told by the CEO that we couldn't actively recruit someone who worked at a competitor because we didn't want to risk starting a bidding war over talent and potentially throwing everything off-kilter.

I think this is one of those occasions where it's OK for companies to individually come to this conclusion and not implement this as a practice. But the moment several companies come to a collective agreement on the same, it enters questionable and probably illegal territory.

Creativity Inc is a great book and I also wish I worked for Ed Catmull when I finished it. Didn't somebody have an illicit backup copy of most of it and they were able to get most of what they needed back from that?

I absolutely agree with this.

There was an incident where I work where an employee (a new hire) set up a cron job to delete his local code tree, re-sync a new copy, then re-build it every night. A completely reasonable thing for a coder to automate.

In his crontab he put:

    rm -rf /$MY_TREE_ROOT
and as everyone undoubtedly first discovered by accident, the crontab environment is stripped bare of all your ordinary shell environment. So $MY_TREE_ROOT expanded to "".
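One hedged way to defend against exactly this failure mode (a sketch of a general shell idiom, not what that shop actually did): POSIX shells support `${VAR:?message}`, which aborts expansion with an error when the variable is unset or empty, so the `rm` never runs under cron's bare environment.

```shell
#!/bin/sh
# Sketch: ${VAR:?} aborts with an error if the variable is unset or empty,
# instead of silently expanding to "" the way cron's bare environment did.
unset MY_TREE_ROOT
if ( : "${MY_TREE_ROOT:?refusing to rm: MY_TREE_ROOT is unset}" ) 2>/dev/null
then
    echo "would run: rm -rf /$MY_TREE_ROOT"
else
    echo "guard tripped, nothing deleted"
fi
```

With the variable unset, the guard trips and nothing runs; quoting the expansion also prevents an empty value from turning `rm -rf /$MY_TREE_ROOT` into `rm -rf /`.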

The crontab ran on Friday, IIRC, and got most of the way through deleting the entire project over the weekend before a lead came in and noticed things were disappearing. Work was more or less halted for several days while the supes worked to restore everything from backup.

Could the blunder have been prevented? Yes, probably with a higher degree of caution, but that level of subtlety in a coding mistake is made regularly by most people (especially someone right out of university); he was just unlucky that the consequences were catastrophic, and that he tripped over the weird way crontab works in the worst possible usage case. He probably even tested it in his shell. We all know to quadruple-check our rm-rfs, but we know that because we've all made (smaller) mistakes like his. It could have been anyone.

Dragging him to the guillotine would have solved nothing. In fact, the real question is "how is it possible for such an easy mistake to hose the entire project?" Some relatively small permissions changes at the directory root of the master project probably would have stopped stray `rm -rf`s from escaping local machines without severely sacrificing the frictionless environment that comes from an open-permissions shop. So if anything, the failure was the systems team's fault, for setting up a filesystem that can be hosed so easily and so completely by an honest mistake.

The correct thing to do was (and is) to skip the witch hunt, and focus on problem-solving. I am not sure, but I think the employee was eventually hired on at the end of his internship.

For me the principle is: Standards and habits are teachable. Competence and attitude, less so. Educate and train for the former, and a failure of the former should cause you to look first at your procedures, not the people. Hire and fire for the latter.

> You just commit to something when you know that you won't be under the fire if you do something wrong without knowing it.

The other side is that if you play a key role (and heads could roll after the hard work is done), you can simply leverage that fact (perhaps with others) as an advantage, such that you get a new contract and can't be fired for X amount of time.

"And then, some months later, Pixar rewrote the film from almost the ground up, and we made ToyStory2 again"

So that effort to recreate it (not to mention produce it in the first place) was pretty much all for naught? That must have been soul-destroying.

In the comments, Oren addressed exactly that particular question:

"We didn't scrap the models, but yes, we scrapped almost all the animation and almost all the layout and all the lighting. And it was worth it.

Changing the script saved the film, which in turn allowed Buzz and Woody to carry on for future generations (see ToyStory3 for how awesome that universe continues to be - well done to everyone who worked on the latest installment!) and, in some ways, set a major cornerstone in the culture of Pixar. You may have heard John or Steve or Ed mention "Quality is a good business model" over the years. Well, that moment in Pixar's history was when we tested that, and it was hard, but thankfully I think we passed the test. Toy Story 2 went on to become one of the most successful films we ever made by almost any measure.

So, suffice it to say that yes, the 2nd version (which you saw in theatres and is now on the BluRay) is about a bagillion times better than the film we were working on. The story talent at the studio came together in a pretty incredible way and reworked everything. When they came back from the winter holidays in January '99, their pitch, and Steve Jobs's rallying cry that we could in fact get it done by the deadline later that year, are a few of the most vivid and moving memories I have of my 20 years at the studio."


Out of curiosity, is there any outline of the prior script available publicly? I'm curious where they were taking the franchise before the rewrite.

I found this PDF[0] that purports to be the first draft of the script, but I can't vouch for its authenticity. It was linked from this summary on a Pixar fan wiki[1].

[0] http://web.archive.org/web/20121213100913/http://www.raindan...

[1] http://pixar.wikia.com/wiki/Toy_Story_2_(original_storyline)

How is it possible to get a remake done by deadline? How did the original have so much extra time padded into its schedule?

> Steve Jobs's rallying cry that we could in fact get it done by the deadline later that year

The interesting bit here is that Jobs didn't know if his cry was true. But he needed it to be true, so it was. Jobs was a member of the "action-based community", not the "reality-based community": https://en.wikipedia.org/wiki/Reality-based_community

They got it done by killing everyone, working multiple shifts, 24/7, no days off, etc. Some people left after it was over, due to burn out. Surviving TS2 was a test of Pixar's resilience that they passed, but at a cost.

Didn't they shut the whole company down for a period of time to give everyone a break after that push? I seem to remember reading that somewhere.

I disagree wholeheartedly, they had a rare chance to rebuild using their acquired knowledge with none of the debt or cruft.

"We have to keep this scene even though it's not quite perfect because otherwise it's a waste of money".

Maybe this is a bad example actually; the movie industry is something you launch and market and leave.

But the best architectures I've seen have been demolished, destroyed and rebuilt from the ground up for their purpose.

Same with code.

Right, but I'm thinking from the perspective of someone who's been working on something for ages, gone through the stress of nearly losing it, then miraculously recovering it ... only to have found that a lot of their work was ditched. You're right that it probably ended up better, and sister comments are right in that it wasn't ALL for naught ... but can you imagine the moment you found out it was being reworked?

No need to imagine. It's not just during disasters like the Pixar case. Creative collaborative ventures like films and animation are filled with months of effort being deleted with a few quick keystrokes.

Back when I was still in the film/video industry, it happened often, you kinda get accustomed to the ephemeral nature and you try not to get too attached to your work. Not always successfully but you try.

> But the best architectures I've seen have been demolished, destroyed and rebuilt from the ground up for their purpose.

This can also be a destructive siren call:


But I think it's not absolute. Sometimes rewrites are imperative.

In the case of Toy Story 2, I think the analogy would be that they ditched the product they were working on to create a different one, thus making Joel's point not relevant there.

It wasn't a new product, they kept a lot of the assets (notably the characters and the universe) but rewrote the script extensively and had to scrap a lot of animation.

I don't think an analogy to software rewrite works very well, the domains are so different. For example Joel makes the point that code have a lot of embedded information which you have to rediscover if you rewrite from scratch. This introduces a huge risk and uncertainty.

If you rewrite a movie script halfway through production you have to scrap a lot of work, but you don't really lose knowledge, so the risk is more manageable.

And I'm also pretty sure they reworked their backup and validation tools as well.

Same with government.

Indeed, roughly every couple of centuries in history we come up with some new form of government. It's only ever so incremental (hence why 'revolutions' are few and major in scope), and often has to do with adapting to new conditions (mostly technological and social change).

That statement really resonated with me.

So often something happens which seems like a total disaster, the end of the world, and you struggle desperately to fix it.

In hindsight it turns out it didn't matter as much as you thought it did anyway. Has happened in so many startups I've worked at.

"The only stress and pressure in life is the stress and pressure you put on yourself".

Life goes on.

A large percentage of files in a CG production are not shot specific, so no, the recovery work was definitely not wasted. There are sets, lighting setups, props, layouts, models, textures, shaders, character rigs, procedural and effects systems, etc., etc. A few of those things might have to be redone, but when those things are set up and the script changes, the main bulk of the work is cameras and character animation, and then re-rendering.

> all for naught?

I bet you they were suddenly industry experts in source control and data backups.

I made a mistake like this once, I feel most that are really dogmatic about backups have something like this.

A long time ago I had a hard drive fail that had a bitcoin wallet with about 10 bitcoins in it.

At the time it was worth a hundred USD or so. I tried to fix it myself, ended up failing and throwing the drive out.

Right after that bitcoin started its meteoric climb. Every now and then I check the prices, then I go check that my backups are running, that my restores work, that my offsite backup is setup correctly, that every single one of my devices is backed up.

It was a $9,000 life lesson (as of right now...)

Having lost quite a few Bitcoins myself to different errors, IMO it was a hundred-USD lesson. You could have replaced them but chose not to. The value therefore was whatever they would have netted you at the time. Worrying about what could have been will just drive you nuts.

I know, I'm not actually as broken up about it as I make it sound sometimes. It was a mistake, and it could "technically" cost me 10 grand, so it's a nice number to remember when I think about the time I'm spending setting up and testing my backups. (And it's a fantastic story to tell others which can often get them to start using a backup system)

In the end, it may end up being net positive in my life when I save something huge later.

Caring to take good backups and the knowledge/skill of doing so is well worth more than $9K in the long run. Losing months of work is a mental killer.

(What's worse is losing 100+ BTC to Gox. :p)

I have not lost anything really important but I cannot live without at least 3 geographically separated backups of generations of backups for something I don't want to lose.

> (What's worse is losing 100+ BTC to Gox. :p)

I lost 18 BTC to MtGox... Bought most of them when the exchange rate was about $500.

I knew there was danger in keeping it on the exchange. I wanted to show my girlfriend who bought one of those BTC how easy it was to transfer them to a local machine, to demonstrate the power of Bitcoin because I was passionate about it and thought it had a chance of changing our corrupt banking system.

She kept putting me off, she thought it was some big procedure and just wanted me to do it. I forgot about it after that for a couple of months... Then MtGox exploded.


If you're out there, MtGox thief, I worked for that BTC. Karma will get you in the end.

I believe that the original version that was scrapped was intended to be a straight-to-video release. It was completely reworked when the company decided to give the project bigscreen treatment.

Almost. TS2 was originally a direct to video film. But Disney liked the work-in-progress so much that they approved making it a feature film. And Pixar management at the time wasn't really thrilled with the idea of having an "A" team that made feature films and a "B" team that made DTV films. That could lead to morale problems in the B team. It was much later that problems in the story led to replacing the director and doing the restart.

> "And then, some months later, Pixar rewrote the film from almost the ground up, and we made ToyStory2 again"

Reminds me of Fred Brooks quote. "Plan to throw one away. You will anyhow".

Heh, I'm sure it did.

I worked at a few VFX studios, and everyone has deleted large swathes of shit by accident.

My favourite was when an rsync script went rogue and started deleting the entire /job directory in reverse alphabetical order. Mama-mia[1] was totally wiped out, as was Wanted (that one was missing some backups, so some poor fucker had to go round fishing assets out of /tmp, from around 2000 machines).

From what I recall (this was ~2008), there was some confusion as to what was doing the deleting. Because we had at the time a large (100 or 300TB[2]) Lustre file system, it didn't really give you that many clues. They had to wait till the deleting moved on to a plain old NFS box before they could figure out what was causing it.

Another time honoured classic is matte painters on OSX boxes accidentally dragging whole film directories into random places.

[1] Some films have code names, hence why this was first.

[2] That Lustre system was big, physically and in IO: it could sustain something like 2-5 gigabytes a second and had at least 12 racks of disks. Now a 4U disk shelf and one server can do ~2 gigabytes sustained.

We lost a good chunk of Tintin (I think) when someone tried to use the OSX migration assistant to upgrade a Macbook that had some production volumes NFS mounted. It was trying in vain to copy several PB of data (I am convinced that nobody at Apple has ever seen or heard of NFS), and because it was so slow the user hit cancel, and it somehow tried to undo the copy and started deleting all the files on the NFS servers.

There was another incident where there was a script that ran out of cron to cleanup old files in /tmp, and someone NFS mounted a production volume into /tmp...

Eventually we put tarpit directories at the top of each filesystem (a directory with 1000 subdirectories each with 1000 subdirectories, several layers deep) to try and catch deletes like the one you saw, then we would alert if any directories in there were deleted so we could frantically try and figure out which workstation was doing the damage.
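A tarpit like that can be built with a few nested loops. This is an illustrative sketch only — the directory name, fan-out, and depth here are all made up, and the real alerting tooling isn't shown:

```shell
#!/bin/sh
# Illustrative sketch of a tarpit decoy tree. A name like "!tarpit" sorts
# before normal directory names, so a runaway recursive delete burns time
# here (and can trip an alert) before it ever reaches real production data.
build_tarpit() {
    root="$1/!tarpit"
    for a in 0 1 2 3; do
        for b in 0 1 2 3; do
            for c in 0 1 2 3; do
                mkdir -p "$root/$a/$b/$c"
            done
        done
    done
}

demo=$(mktemp -d)
build_tarpit "$demo"
echo "created $(find "$demo/!tarpit" -mindepth 3 -type d | wc -l) leaf directories"
```

A monitor would then watch for any entry under `!tarpit` disappearing and page someone to hunt down the offending workstation, which is the scheme the comment above describes.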

I had a client with a Linux server who wanted to automount the share on their OS X workstations. I cannot believe the hoops I had to jump through to make something as simple as NFS work. Every iteration of OS X seems to make traditional *nix utilities less and less compatible and remove valuable tools for no reason other than obstinacy.

In the most recent VFX company I worked at, with some of the same guys, the backup system was fucking ace. Firstly, rm was aliased to a script that just moved stuff rather than deleting it.

Second, there were very large nearlines that took hourly snapshots. Finally, lots and lots of tape for archive.
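A wrapper like that can be tiny. This is a hypothetical sketch of the idea (the studio's actual script isn't public):

```shell
# saferm: move arguments into a dated per-user trash directory instead of
# unlinking them; an admin cron job can expire the trash later.
# This is a sketch of the concept, not the studio's actual script.
saferm() {
    trash="${HOME}/.trash/$(date +%Y%m%d)"
    mkdir -p "$trash"
    for f in "$@"; do
        case "$f" in
            -*) continue ;;   # swallow rm-style flags like -r and -f
        esac
        mv -- "$f" "$trash/" || printf 'saferm: could not move %s\n' "$f" >&2
    done
}
```

Deployed as `alias rm=saferm` in the site-wide shell profile, muscle memory keeps working but mistakes stay recoverable until the trash is expired.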

From The Next Web write-up:

> The command that had been run was most likely ‘rm -r -f *’, which—roughly speaking—commands the system to begin removing every file below the current directory. This is commonly used to clear out a subset of unwanted files. Unfortunately, someone on the system had run the command at the root level of the Toy Story 2 project and the system was recursively tracking down through the file structure and deleting its way out like a worm eating its way out from the core of an apple.

As a Linux neophyte, I once struggled to remember whether the trailing slash on a directory was important. So I typed "rm -rf", and pasted the directory name "schoolwork/project1 " (with a trailing space), but then I waffled and decided to add a trailing slash. So I changed it to "rm -rf schoolwork/project1 /".

That's my theory as to what they did.

  rm -Rf test /
  rm: it is dangerous to operate recursively on ‘/’
  rm: use --no-preserve-root to override this failsafe
GNU rm, but I don't know when it was introduced.

Quick googling around, and it seems like this behavior (in Ubuntu) changed somewhere near the end of 2008: https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/174...

tl;dr: That's when the Ubuntu coreutils package switched from "--no-preserve-root" to "--preserve-root" by default.

This is relatively recent, and caused quite a stir when it was introduced. Certainly not around during Toy Story 2.

One feature I've always wanted is to make a directory non-deletable without otherwise changing the directory's functionality. So if I do "rm -rf ~/work" I don't lose anything, but I can still do "touch ~/work/whatever". As far as I know, this is impossible.

You don't actually need write access to a file/directory to delete it. You only need write access to the parent (removing or adding a file modifies the container, not the contained).

Remove write access to the parent directory, while keeping traversal and read rights, and you'll be sorted.
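You can see this with plain mode bits. The demo below assumes a regular (non-root) user, since root bypasses permission checks:

```shell
# Unlinking modifies the parent directory, so removing write permission on
# the parent blocks deletes while the files inside stay fully usable.
parent=$(mktemp -d)
mkdir "$parent/keep" && chmod 755 "$parent/keep"
echo data > "$parent/keep/file"
chmod a-w "$parent/keep"              # parent: read + traverse, no write
rm -f "$parent/keep/file" 2>/dev/null \
    && echo "deleted (running as root?)" \
    || echo "delete refused"
cat "$parent/keep/file" 2>/dev/null   # the file itself is still readable
# cleanup still works: the tarpit dir's own parent remains writable
```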

But then I can't add files to my home directory...

So don't put the stuff you want to protect this way in your home directory, but somewhere else, and symlink the relevant directories to your home directory.

Quite clumsy, but I guess it works. I'll keep it in mind, thanks for the idea.

You said it was impossible. I simply indicated it wasn't.

The question was perhaps imprecisely phrased, but I think you can figure out what I'm trying to accomplish.

  # mkdir /tmp/foo
  # touch /tmp/foo/.immutable
  # chattr +i /tmp/foo/.immutable
  # rm -rf /tmp/foo

It's not EXACTLY what you asked for, but the .immutable file cannot be deleted until you call chattr -i on it, which protects the directory....

That doesn't protect siblings of <foo/.immutable>...

If I remember correctly, they were using Solaris at the time, and I know for sure that Solaris did not have this safety net, nor did Linux at the time. The requirement of --no-preserve-root was introduced almost a decade later.

Obligatory call for everyone to use safe-rm [1].

It's one thing to shoot yourself in the foot, but without safe-rm you (or someone else less cautious) will eventually fire the accidental head-shot. It's happened to me a couple of times, but never again since I started using safe-rm everywhere.

[1] http://serverfault.com/questions/337082/how-do-i-prevent-acc...

I'll add another obligatory, make sure you have a working backup system.

And that you already tried to restore from it

It's more likely they wanted to remove all hidden directories from their home directory, and ran `rm -rf .*`.

Under some shells (eg bash) that will expand to include `..` and `.`.
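You can verify this safely by expanding the glob with `echo` instead of `rm` (bash and most POSIX shells behave this way; zsh is an exception and excludes `.` and `..`):

```shell
# Why `rm -rf .*` is catastrophic: the glob matches the `.` and `..`
# entries, so the delete recurses into the current AND parent directories.
d=$(mktemp -d)
mkdir "$d/.config"            # stand-in for a hidden directory
cd "$d"
echo .*                       # expands to: . .. .config
echo .[!.]*                   # safer: matches dotfiles but not . or ..
cd /                          # leave before the temp dir is cleaned up
```

Note that `.[!.]*` still misses names beginning with two dots (like `..foo`), so it's a mitigation, not a complete fix.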

The last time it happened to me, it was caused by a space. Instead of `rm -rf toto*`, I had `rm -rf toto *`.

Fortunately, it was not at the root directory, and we have frequent snapshots.

I have totally never done that. If I had done that, I might have been saved by a lack of permissions. Like if I was on a mounted external drive, so not in my home directory and it didn't get too far.

Edit: What would have been much worse, had I done it, would have been

    rm -rf ~ /foo
instead of

    rm -rf ~/foo

Such a bug was in an Nvidia driver install script a few years ago:

  rm -rf /usr /lib/modules/something/something

Bumblebee specifically, https://github.com/MrMEEE/bumblebee-Old-and-abbandoned/commi... is the fix. Pretty sure Steam had the same issue early in their linux release, probably lots of other programs too.

Yes, I specifically remember a lot of hubbub about Steam deleting entire home directories if a path was input incorrectly. Glad that never bit me.

And also iTunes, back in the early days of OS X.


This is why I never use bash for scripting, and opt for Ruby instead.

You can use the `Pathname` class to treat paths as objects that only concatenate into valid paths, on top of all the other sanity-saving features.

Same. Ruby isn't my preference, but I avoid bash scripting whenever possible. I see very little need to use it when practically every machine requires Python, Ruby, or another high-level dynamic language as a prerequisite for some basic software that's included by default. I respect bash, but it's from a bygone time. Why deal with the esoteric one-character test operators, the nitpicky spacing issues, etc.? Use modern, readable, testable code. So much nicer and easier to read and work with that way.

I'd love to see someone deploy a Python or Ruby-based shell. I'm sure these are available, but they're not widely used.

Bash may be worse than Ruby or Python for scripting, but it's far better as an interactive shell.

And I'm quite happy not having shell use hold back evolution of Ruby or Python, and not having drama like Ruby 1.8 -> 1.9 or Python 2.x -> 3.x affecting shell use.

That's mortifying!

Or, maybe you forget which directory you are in, and delete the wrong one that way: `rm -rf ../*` when you think you're in `/mnt/work/grkvlt/tmp` but are actually in `/mnt/work/grkvlt` at the time.

or perhaps it was the dreaded "rm -rf ${DIR}/*" when $DIR is not defined.

Safe shell coding tip #10451: Variables containing directories always have to end in a slash, and paths may never be built using variables and slashes.
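Another concrete safeguard for that class of bug is the `:?` parameter expansion, which makes the shell abort instead of silently expanding an unset variable to nothing (demonstrated here with echo for safety):

```shell
# ${VAR:?message} aborts the command if VAR is unset or empty, turning a
# would-be `rm -rf /*` into a loud error instead.
unset DIR
( echo rm -rf "${DIR:?refusing to run with DIR unset}"/* ) 2>/dev/null \
    || echo "guard tripped, nothing deleted"
DIR=$(mktemp -d)            # with DIR set, the same line is allowed through
echo rm -rf "${DIR:?}"/*
```

This only guards against unset or empty variables; it won't save you from a variable that contains the wrong (but valid) path.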

Why is that more likely?

rm just unlinks the files at the inode level, so it seems like a disk forensics utility like the imagerec suite could have restored a lot of the 'lost' data. In fact, I've done it on source code after learning that the default behavior of untar was to overwrite all of your current directory structure. Since it was text I didn't need anything fancy like imagerec; instead I just piped the raw disk to grep, looked for parts I knew were in files I needed, then output them and the surrounding data to an external hard drive.
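That raw-device grep trick can be sketched against an ordinary file standing in for the block device (on a real system you'd point it at something like /dev/sdXN and need root; the planted "source file" here is made up):

```shell
# Recovering deleted text by grepping the raw device: grep -a treats
# binary data as text, -b reports byte offsets, and dd carves out the
# surrounding region. A temp file stands in for the block device here.
img=$(mktemp)
{
    head -c 1000 /dev/zero                      # unrelated blocks
    printf 'int main(void) { return 0; }\n'     # the "deleted" source file
    head -c 1000 /dev/zero
} > "$img"
offset=$(grep -aob 'int main' "$img" | head -n1 | cut -d: -f1)
echo "found match at byte $offset"
# carve out the bytes around the match (here just the line we planted)
dd if="$img" bs=1 skip="$offset" count=28 2>/dev/null
```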

Back then yes. These days with SSDs, the OS will issue trim commands to the disk, zeroing the blocks from the OS point of view, and on SSDs with "secure delete", from a forensics point of view as well.

Perhaps I'm misunderstanding you, but trim doesn't secure delete anything. It merely indicates to the SSD that the sector is unused so that it can reap the underlying NAND location the next time it decides to garbage collect.

In other words, the data is still there in the flash, but only the SSD firmware (and physical access) has access to it.

I think the reality is that some blocks will be lazily erased. I believe modern flash firmware periodically GCs in the background to prevent write amplification. If this is triggered you may overwrite data.

NAND requires an erase before write so I wouldn't be surprised if some controllers are lazily erasing blocks to get better long term write speeds and prevent GC hiccups.

At least Intel (and probably others) now provide full-disk encryption by default. So the flash is unreadable by any other controller: http://www.intel.com/content/dam/www/public/us/en/documents/...

The reason for marking it as erased is so the firmware can physically erase the flash sector. It could happen immediately, a minute later or next week. But a well-written SSD firmware will try to erase blocks any time it's not busy doing something else, as erasing blocks takes way longer than writing them.

You might be able to recover something from the physical flash, but there's definitely no guarantees.

> erasing blocks takes way longer than writing them

Honest question, could you elaborate on that? Intuitively I would've thought writing and erasing are _the same_ from a physical standpoint, insofar as "erasing" means writing zeros.

For flash, erase resets an erase unit to its default state, which can be all 0 or all 1 depending on the technology. Writes change bits from the default to the opposite only. Depending on the flash chip's interface, you may be able to do this at the bit, byte, or block level, but changing in the opposite direction is expensive and time consuming.

In theory we could make flash with tiny erase units (down to the bit level), but in practice we don't because the extra circuitry would drive the price through the roof.

That's interesting. If nanotech (the part that's about assembling 'things' at a molecular level, hyper-grained 3D printing so to speak) really delivers the economic breakthrough we've been expecting since the early 80s, I see one huge improvement for flash right here.

There are two operations on flash memory: "erase" (reset every bit in a block to the erased state) and "write" (flip individual bits the other way). The former is expensive, the latter cheap. Note that "write" can't flip bits back to the erased state. Hence the benefit of keeping pre-erased blocks around.

That makes perfect sense, thank you.

I found this article about recovering data from SSDs very interesting.


That is very interesting, I didn't know there were SSDs with "secure delete."

I remember that Apple removed the "Secure Empty Trash" feature in OS X 10.11 because they didn't feel like they could guarantee secure deletion with their new fast SSDs present in most of their computers.

"secure delete" (actually "secure erase") is a term from the ATA standard, it's supported by practically all ATA and later SATA devices for a very long time. The idea was you'd tell the disk to erase everything, it would do the actual deletion in the background but would not allow reading the old data. This was also the way to reset passworded harddrives, you could send a secure erase command without authentication, the drive would wipe itself and once done, the password is gone.

Once SSDs arrived on the scene, they were limited by the interface, as ATA didn't specify a way to mark sectors as unused. People found that their performance would degrade with use as all the sectors became utilized. But a secure erase command would mark all sectors as erased, so the drive would work like new. Later on, ATA got the TRIM command (and, later, queued versions of TRIM) so the OS can mark specific sectors as erased. But the result is that a lot of people confuse flash sector erasing with secure erase.

Yup, disk arrays can be quite a problem when it comes to forensics recovery. This has been a bit of a nemesis of mine over the years. Friends or family will decide to buy a single RAID solution for backup and configure it to write files across the disks for performance because they don't know any better. Four years later they'll come to me because something happened to the array like a failed or corrupted controller NVRAM and they want to recover the files. For backup I recommend mirrored single individual spinning disks, preferably in multiple locations.

Thanks! Oh, those destitute days when swap was an actual resource, instead of just a sign that something must be broken because I filled up 32 gigs of main memory.

That's really only possible on a single-user system. VFX studios usually have large NFS servers (filers), typically running proprietary filesystems, and with hundreds or thousands of clients writing files simultaneously, the data doesn't all get laid down on the attached arrays in nice neat chunks representing entire files. Typically they would even distribute writes for a single file across multiple RAID groups/arrays. Also consider the size: you were able to copy your recovered data onto an external drive, but studios don't have a spare, empty filer sitting around to dump the random contents of a few racks' worth of disks onto.

The video link is broken. Here is (what seems to be) the same video: https://www.youtube.com/watch?v=8dhp_20j0Ys

John Cleese's talk on Creativity recently made it to the front page of HN again https://www.youtube.com/watch?v=9EMj_CFPHYc and if you haven't watched it I highly recommend it.

I believe it was in this talk that he says the best work he ever did came when he scrapped everything and started over. From practice, I think we can all admit that while it's the hardest thing to do, it is always for the best.

> ... it is always for the best.

Not necessarily. People often underestimate (in engineering fields) how much work it will take to rebuild something. In software there is a high degree of creativity which can have large downstream effects. You need to architect your system in such a way to make it possible to replace components when needed, this is where strong separation of concerns is important.

One thing that I've seen happen time and again is an organization bifurcating itself, so that there is one team working on the new cool replacement, and the other working on the old dead thing that everyone hates. Needless to say, this creates animosity and severely limits an organization's ability to respond to customer demands.

Starting over should be taken very seriously.

This event is explained more in depth in the book Creativity Inc by Ed Catmull. It's a pretty good story.

love that book

Oh man, this is the best punch line:

> And then, some months later, Pixar rewrote the film from almost the ground up, and we made ToyStory2 again. That rewritten film was the one you saw in theatres and that you can watch now on BluRay.

At first I was imagining how it would feel to lose all that work, so frustrating! But then, even if you hadn't, it turns out management was going to throw it all away anyway!

Funny that this comes up a few weeks after I finished Ed Catmull's "Creativity Inc." If you want a little more detail about this (and other Pixar related things and Steve Jobs) read the book. It is a really good one.

A good way to fuck up on Windows/C#: I was iterating through network folders to delete (which all start with "\\servername"), except that I had a bug and was instead iterating through the characters of the first network folder path. And that's how I discovered that on Windows, "\" means the root of the current active drive. And that's also why I value my automatic backup to a NAS twice a day.

This brings up a great practical question. What's the state of the art of this sort of thing for more modest but still modern data storage requirements?

Context: For the last five years, my backup system has been to have Time Machine do hourly backups on my MBP (main development machine, just shy of 1TB data), with key spots on my Linux server (3TB data at the moment) backed up daily to my in-laws' house using cron and rsync, and spot directories on the MBP backed up there as well.

But the hard drive on the Time Capsule I've used seems to have gotten unreliable, and the external USB drive I bought to replace it has not worked reliably for more than a day or two at a time. And even when it was all working properly, I was never really verifying my backups.

Do people have suggestions for secure, reliable, verifiable, easy backup systems capable of handling 4+ TB of data? I don't mind if it takes work or money to set it up; the important thing is once it's working I can mostly forget about it.

For an offsite backup, Backblaze is fantastic. Unlimited storage for $5/month and the client works perfectly. It's not highly-redundant or anything, so use it in addition to a local backup.

CrashPlan is the next-best option if you need Linux support, but the client isn't as good.

Seconding Backblaze for set-and-forget. It's a huge confidence boost for data that is a little bit too large and low-value to handle with more care.

I'm trying Backblaze now, hoping my first backup will be done in a week or so...

I replaced my Time Capsule with a Synology NAS. Backups are still handled by Time Machine on the Mac (so they're still mostly unobtrusive and idiot-proof), but now they're stored on a mirrored RAID of cheap 3.5" disks.

The Synology box is basically just an ARM Linux machine, SSH/root is not locked out if you want it, so if you want to get fancy with off-site backups, you can set up rsync or whatever you want on it. They even ship with some GUIs for mirroring to Dropbox, S3, rsync, another remote Synology NAS, etc.

How long have you been using this? When I search online for solutions to the Time Capsule problems I have been having, they are full of "Time Machine does not work backing up to non-Apple products" warnings.

A couple months now. And yeah, I've tested restores of files as well.

Synology has specific support for Time Machine, in fact in their recent software update they added support for Time Machine over SMB since Apple is deprecating AFP.
