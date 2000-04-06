The Pixar animation system at the time was written in K&R C and one of my tasks was to migrate it to ANSI C. As I did that I learned that there were aspects of this code that felt like a school assignment that had escaped from the lab. While searching for a bug, I noticed that the write() call that saved the animation data for a shot wasn't checked for errors. This seemed like a bad idea, since at the time the animation workstations were SGI systems with relatively small SCSI disks that could fill up easily. When this happened, the animation system usually would crash and work would be lost. So, I added an error check, and also code that would save the animation data to an NFS volume if the write to the local disk failed. Finally, it printed a message assuring the animator that her files were safe and it emailed a support address so they could come help. The animators loved it! I had left Pixar by the time the big crunch came in 1999 to remake TS2 in just 9 months, so I didn't see that madness first hand. But I'd like to think that TS2 is just a little, tiny bit prettier thanks to my emergency backup code that kept the animators and lighting TDs from having to redo shots they barely had time to do right the first time.
The point is that one would like to think that a place like Pixar is a model of Software Engineering Excellence, but the truth is more complex. Under the pressures of Production deadlines, sometimes you just have to get it to work and hope you can clean it up later. I see the same things at NASA, where, for the most part, only Flight Software gets the full-on Software Engineering treatment.
We do penetration tests for a wide range of clients across many industries. I would say that the bigger the company, the more childish flaws we find. For sure the complexity, scale, and multiple systems do not help towards having a good security posture, but never assume that because you are auditing a SWIFT backend you will not find anything that can lead to direct compromise.
Maybe not surprisingly, most startups that we work with have a better security posture than F500 companies. They tend to use the latest frameworks that do a good job of protecting against the standard issues, and their relatively small attack landscape doesn't leave you with much to play with.
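The fix described above (check the write, fall back to a second location, tell the user) could be sketched roughly like this. It is a Python illustration, not the original C, and `save_with_fallback`, the paths, and the message text are hypothetical names made up for the example.

```python
import os
from pathlib import Path


def save_with_fallback(data: bytes, primary: Path, fallback: Path) -> Path:
    """Write data to primary; if that fails, try fallback. Return the path used."""
    for target in (Path(primary), Path(fallback)):
        tmp = target.with_name(target.name + ".tmp")
        try:
            with open(tmp, "wb") as handle:
                handle.write(data)
                handle.flush()
                os.fsync(handle.fileno())  # surface disk-full errors now, not later
            os.replace(tmp, target)  # rename into place only after a complete write
            return target
        except OSError as exc:
            # Disk full, missing directory, permission problems and so on land here.
            print(f"could not save to {target}: {exc}")
            try:
                tmp.unlink()  # do not leave a half-written temp file behind
            except OSError:
                pass
    raise OSError("save failed on both the primary and the fallback location")
```

Writing to a temporary name and renaming afterwards is a design choice of this sketch, so a failed save never clobbers the previous good file. The email to support mentioned in the story is left out.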
Of course there are exceptions.
from a brief stint in the gfx industry, you are correct.
Pixar isn't a model of software excellence, it's a model of process and (ugh) culture excellence.
Commitment would be very different if people were being asked to help while some heads were rolling. Because you're a real team when everybody is going in the same direction. Any call of "people, work hard to recover while we're after the moron who deleted everything" wouldn't have done it.
You only commit to something when you know that you won't come under fire if you do something wrong without knowing it.
I think that employees actually make fewer mistakes and are more productive if they don't have to be worried about being fired for making a mistake.
> A young executive had made some bad decisions that cost the company several million dollars. He was summoned to Watson’s office, fully expecting to be dismissed. As he entered the office, the young executive said, “I suppose after that set of mistakes you will want to fire me.” Watson was said to have replied,
> “Not at all, young man, we have just spent a couple of million dollars educating you.” [1]
All depends on how leadership views employee growth
[1] http://the-happy-manager.com/articles/characteristic-of-lead...
I did fire an employee who deleted the entire CVS repository.
Actually, as you say, I didn't fire him for deleting the repo. I fired him the second time he deleted the entire repo.
However there's a silver lining: this is what led us (actually Ian Taylor IIRC) to write the CVS remote protocol (client / server source control). Before that it was all over NFS, though the perp in question had actually logged into the machine and done rm -rf on it directly(!).
(Nowadays we have better approaches than CVS but this was the mid 90s)
$ rm -rf / home/myusername/mylargedir/
The real solution is comprised of:
* backups (which are restored periodically to ensure they contain everything)
* proper process which makes accidental removal harder (DVCS & co.)
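A restore test of the kind the first bullet asks for can be as simple as restoring into a scratch directory and comparing checksums against the live tree. A minimal sketch, assuming both trees are reachable as local directories; `tree_digests` and `verify_restore` are hypothetical helper names, and the restore step itself is whatever your backup tool provides.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def tree_digests(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in root.rglob("*")
        if path.is_file()
    }


def verify_restore(source: Path, restored: Path) -> List[str]:
    """Return a list of problems; an empty list means the restore matches."""
    want = tree_digests(source)
    got = tree_digests(restored)
    problems = [f"missing: {name}" for name in sorted(want.keys() - got.keys())]
    problems += [
        f"differs: {name}"
        for name in sorted(want.keys() & got.keys())
        if want[name] != got[name]
    ]
    return problems
```

This reads every file fully into memory, which is fine for a spot check but would need chunked hashing for multi-gigabyte assets. Files that changed since the backup ran will show up as "differs", so it is best run against a snapshot or a quiet directory.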
That was the day I started backing up everything.
What most people don't realize is that very few places have a real (tested) backup system.
_goes off to check backups_
And then commit it.
It's a good habit.
- Employee was error prone and this mistake was just the biggest one to make headlines. Could be from incompetence or apathy.
- Impacted clients demanded the employee at-fault be terminated.
- Deterrence: fire one guy, everyone else knows to take that issue seriously. Doesn't Google do this? If you leak something to press, you're fired, then a company email goes out "Hey we canned dude for running his mouth..."
It's better to engage the known and perhaps questionable justifications than to "never understand".
Case 2: no competent manager would fire an employee who made a mistake to satisfy clients. They may move the employee to a role away from that client, but it would be insanity to allow the most unreasonable clients to dictate who gets fired. Any manager who does what you suggest should expect to have lost all credibility in the eyes of their team.
Case 3a: A leak to the press is a purposeful action. Firing for cause is perfectly reasonable. Making a mistake is not a purposeful action.
Case 3b: If you want to convey that a particular type of mistake is serious, don't do so by firing people. Do so with investments in education, process, and other tools that reduce the risk of the mistake occurring, and the harm when the mistake occurs. Firing somebody will backfire badly, as many of your best employees will self-select away from your most important projects, and away from your company, as they won't want to be in a situation where years of excellent performance can be erased with a single error.
Case 3a: Good distinction, a conscious leak is not a mistake. It's possible for a leak to be accidental though, say under alcohol, lost laptop, or just caught off guard by a clever inquisitor.
Case 3b: Firing has the effects you mention, but it also has the effect of assigning gravity to that error. I'm not claiming the benefits outweigh the drawbacks, but some managers do.
I'm not a proponent of the above, but it's good to understand the possible rationale behind these decisions.
If I see someone getting axed for making a mistake, I'd be making a mistake if I didn't immediately start firing up the resume machine.
I've never heard of this happening. I've heard of people fired for taking photographs (or stealing prototypes!) of confidential products and handing them to journalists.
"Sir beg my pardon for asking but why did you give Smith 50 press-ups for asking a question? You said that there were no stupid questions. Sir."
"I gave him the press-ups the SECOND time he asked. Will you need to ask again?"
I've found that that helps morale, as there's a sense of shared responsibility, but there's no blaming people for problems where I work, so I haven't actually seen what happens when people are searching for the culprit.
The usual process is "this happened because of this, this and this all went wrong to cause us to not notice the problem, and we've added this to make it less likely that it will happen again". If you have smart, experienced people and you get problems, it's because of your processes, not the people, so the former is what you should fix.
I left and found out two months later from a friend he had managed to take down almost every single server in the place for which he had access. Even the legacy don't touch systems that just boot and run equipment.
Be pretty.
Furthermore, a lot of individual errors are seen in an institutionalised "systems" framework - given that people invariably will make mistakes, how can we set up the environment/institutions/systems so that errors are not catastrophic?
Not sure how that applies to movie animation, to be honest, but not primarily looking for whom to blame was certainly a very good move.
Of course, if the problem is just borderline behavior by the pilot or co-pilot, it would be fantastic if we could get him off the circuit before he locks the captain out and programs the plane to crash into a mountain, or before he stretches the fuel limits until he runs out of gas.
But... if we can also learn how to make an out-of-gas plane land and survive, and the cost is "let's not put this pilot into jail, because it's better to learn how to save more lives", I prefer this approach. Probably you'll be able to get the pilot for some other behavior.
Yes, it does. Many aviation crashes are attributed to "pilot error". There's only so much you can do with procedures and such; at some point, the pilot has to be held accountable for screwing up, and investigations do exactly that many times.
Usually, in major events, you're looking at commercial airliners with a pilot and co-pilot and in those cases, it's usually something much worse than a mistake by the pilot, and frequently several bad things happening at once. But in general aviation, where you have one pilot, frequently non-commercial, flying a small aircraft, the cause is frequently just "pilot error". A common example of this is the pilot running out of fuel because they did their calculations wrong. It happens frequently with private pilots, and in a Cessna you can't just pull over when you run out of gas.
Excerpts from Wiki:
> In March 2008, Bernard Farret, a deputy prosecutor in Pontoise, outside Paris, asked judges to bring manslaughter charges against Continental Airlines and two of its employees – John Taylor, the mechanic who replaced the wear strip on the DC-10, and his manager Stanley Ford – alleging negligence in the way the repair was carried out.
> At the same time charges were laid against Henri Perrier, head of the Concorde program at Aérospatiale, Jacques Hérubel, Concorde's chief engineer, and Claude Frantzen, head of DGAC, the French airline regulator. It was alleged that Perrier, Hérubel and Frantzen knew that the plane's fuel tanks could be susceptible to damage from foreign objects, but nonetheless allowed it to fly.
> Continental Airlines was found criminally responsible for the disaster by a Parisian court and was fined €200,000 ($271,628) and ordered to pay Air France €1 million. Taylor was given a 15-month suspended sentence, while Ford, Perrier, Hérubel and Frantzen were cleared of all charges. The court ruled that the crash resulted from a piece of metal from a Continental jet that was left on the runway; the object punctured a tyre on the Concorde and then ruptured a fuel tank. The convictions were overturned by a French appeals court in November 2012, thereby clearing Continental and Taylor of criminal responsibility.
> The Parisian court also ruled that Continental would have to pay 70% of any compensation claims. As Air France has paid out €100 million to the families of the victims, Continental could be made to pay its share of that compensation payout. The French appeals court, while overturning the criminal rulings by the Parisian court, affirmed the civil ruling and left Continental liable for the compensation claims.
https://www.amazon.com/dp/1472439058
The other side is, if you play a key role (and heads could roll after the hard work is done), to simply leverage that fact (perhaps with others) as an advantage, such that you get a new contract and can't be fired for X amount of time.
Since Catmull has an engineering background (his PhD involved the invention of the Z-buffer, and he was doing computer graphics before anyone knew anything about it), he understands that mistakes and failed projects, when combined with a forthright and collaborative feedback loop, are not problematic detours, but rather necessary mile markers on the path to real innovation. We'd be so much further ahead if we put more men like Catmull in charge of things.
The biggest problem with reading Creativity Inc. is that it will rekindle the hope that there may be a sane workplace out there somewhere, when practically speaking, we know that few of us will ever find employment in one. It gave me a number of disquieting feelings as I read that the attributes of a workplace that all good engineers crave actually can and sometimes do exist out there. I had convinced myself that these things were myths, so now I'm sad that my boss isn't Ed Catmull.
That said, I do believe some evaluation and/or discipline would've been appropriate in this case, not for the person who accidentally executed a command in the wrong directory, but for the people who were supposed to be maintaining backups and data integrity.
Assuming that your primary job duties involve data integrity and system uptime, having non-functional backups of truly critical data stretches beyond the scope of "mistakes" and into the scope of incompetence.
It is, I'm sure, very possible that no one was really assigned this task at Pixar and that it would therefore be improper to punish anyone in particular for the failure to execute it, but I do believe there is a limit between mistakes en route to innovation and negligence. My experience has been that most companies strongly take one tack or the other: they either let heads roll for minor infractions (and thus never allow good people to get established and comfortable), or they never fire anyone and let the dead weight and political fiefdoms gum up the works until the gears stop altogether. There needs to be a balance, and that's a very hard thing to find out there.
Wasn't Catmull involved in wage-fixing? [http://www.cartoonbrew.com/artist-rights/ed-catmull-on-wage-...]
One potential interpretation is that this artificially depresses worker salaries as workers are not continuously being auctioned back and forth. Another potential interpretation is that this allows the company to have the stability it needs to function, prevent toxic sentiment among peers who take a bid from one company or the other, potentially leaving others holding the bag for the project.
I believe this latter interpretation is the intent of most such agreements, and that the former is rarely considered legitimate (i.e., continuous bidding wars would be too disruptive to be feasible even if there were no formal agreements in place).
Such understandings are very common across competitors in all lines of work, whether they're written or not; at a former company, I was personally told by the CEO that we couldn't actively recruit someone who worked at a competitor because we didn't want to risk starting a bidding war over talent and potentially throwing everything off-kilter. Such arrangements are not SV-exclusive, let alone Pixar-exclusive, and they are done out of practicality, not malice. Market value for employees can be correctly surmised without feverish, aggressive overbidding.
The incident is frequently misconstrued as a complete block on any cross-hiring. My understanding is that it was simply an agreement forbidding cross-recruiting; a gentleman's agreement that they wouldn't try to start an arbitrary bidding war over the one company's talent if that company wouldn't try to start one over theirs. Employees were still free to seek and obtain employment at any of the major studios independently if they so chose.
I think that panicked cries of wage fixing and intentional repression of employment opportunities are not only not credible, but farcical. I'd ask myself why someone is interested in painting an imbalanced and unrepresentative picture such as that.
If anything, these agreements are a failure of the contemporary legal and HR departments across every major technology company involved in computer animation. I believe the intent of the executives was nothing more than maintaining a stable workforce. Their lawyers and HR people should have warned them that there was another dormant interpretation that could've been used by exploitative politicians to misrepresent the situation.
Last time this came up (that I saw), the Pixar, Apple, and Intel et al chiefs were being compared to Nazis. That is beyond the pale.
Just as we don't go out of our way to defend the guy who did rm -rf for what he did specifically, but rather move on.
IMO, the lesson to draw from this is "get better legal advice and avoid showboating politicians". Happy to move on.
EDIT: also, shortened the parent for you. I agree it was overly long.
You can't make ANY agreements to refuse to recruit from certain companies.
Making any "official" effort or policy to prevent a bid war is illegal.
If indeed there was no-one assigned this task, then it was a mistake of negligence on the part of Pixar's management at the time. I'm not saying that to be snippy — that is exactly the job of management: to build the systems and processes required for employees to achieve the firm's goals. Proper backup and restore of data is one of those processes.
An executive usually requires a "Come to Jesus" moment like this one, where the entire company teeters on the precipice due to lax backup or security policies, to really have the importance impressed upon him or her. At that point, they are generally much more supportive, though sadly, this too can start to fade if the sysadmins do their job too well.
I don't want anyone to come away thinking that most companies have solved these problems. It's definitely not the case, even in large, established companies. Security and backups continue to get little attention until it's too late.
We really need to call a celebrity in MBA circles and get that person to run a seminar meant to scare the pants off the execs.
So that effort to recreate it (not to mention produce it in the first place) was pretty much all for naught? That must have been soul destroying
"We didn't scrap the models, but yes, we scrapped almost all the animation and almost all the layout and all the lighting. And it was worth it.
Changing the script saved the film, which in turn allowed Buzz and Woody to carry on for future generations (see ToyStory3 for how awesome that universe continues to be - well done to everyone who worked on the latest installment!) and, in some ways, set a major cornerstone in the culture of Pixar. You may have heard John or Steve or Ed mention "Quality is a good business model" over the years. Well, that moment in Pixar's history was when we tested that, and it was hard, but thankfully I think we passed the test. Toy Story 2 went on to become one of the most successful films we ever made by almost any measure.
So, suffice it to say that yes, the 2nd version (which you saw in theatres and is now on the BluRay) is about a bagillion times better than the film we were working on. The story talent at the studio came together in a pretty incredible way and reworked everything. When they came back from the winter holidays in January '99, their pitch, and Steve Jobs's rallying cry that we could in fact get it done by the deadline later that year, are a few of the most vivid and moving memories I have of my 20 years at the studio."
https://www.quora.com/Did-Pixar-accidentally-delete-Toy-Stor...
[0] http://web.archive.org/web/20121213100913/http://www.raindan...
[1] http://pixar.wikia.com/wiki/Toy_Story_2_(original_storyline)
> Steve Jobs's rallying cry that we could in fact get it done by the deadline later that year
The interesting bit here is that Jobs didn't know if his cry was true. But he needed it to be true, so it was. Jobs was a member of the "action-based community", not the "reality-based community" https://en.wikipedia.org/wiki/Reality-based_community
"We have to keep this scene even though it's not quite perfect because otherwise it's a waste of money".
Maybe this is a bad example actually; in the movie industry a film is something you launch and market and leave.
But the best architectures I've seen have been demolished, destroyed and rebuilt from the ground up for their purpose.
Same with code.
Back when I was still in the film/video industry, it happened often, you kinda get accustomed to the ephemeral nature and you try not to get too attached to your work. Not always successfully but you try.
This can also be a destructive siren call:
https://www.joelonsoftware.com/2000/04/06/things-you-should-...
But I think it's not absolute. Sometimes rewrites are imperative.
I don't think an analogy to a software rewrite works very well, the domains are so different. For example, Joel makes the point that code has a lot of embedded information which you have to rediscover if you rewrite from scratch. This introduces a huge risk and uncertainty.
If you rewrite a movie script halfway through production you have to scrap a lot of work but you don't really lose knowledge, so the risk is more manageable.
So often something happens which seems like a total disaster, the end of the world, and you struggle desperately to fix it.
In hindsight it turns out it didn't matter as much as you thought it did anyway. Has happened in so many startups I've worked at.
I bet you they were suddenly industry experts in source control and data backups.
A long time ago I had a hard drive fail that had a bitcoin wallet with about 10 bitcoins in it.
At the time it was worth a hundred USD or so. I tried to fix it myself, ended up failing and throwing the drive out.
Right after that bitcoin started its meteoric climb. Every now and then I check the prices, then I go check that my backups are running, that my restores work, that my offsite backup is setup correctly, that every single one of my devices is backed up.
It was a $9,000 life lesson (as of right now...)
In the end, it may end up being net positive in my life when I save something huge later.
Reminds me of Fred Brooks quote. "Plan to throw one away. You will anyhow".
> And then, some months later, Pixar rewrote the film from almost the ground up, and we made ToyStory2 again. That rewritten film was the one you saw in theatres and that you can watch now on BluRay.
At first I was feeling how it would feel to lose all that work, so frustrating! But then even if you hadn't, it turns out management was gonna throw it all away anyway!
I believe it was in this talk that he says the best work he ever did was when he scrapped and started over. Which from practice I think we can all admit that while it's the hardest to do, it is always for the best.
Not necessarily. People often underestimate (in engineering fields) how much work it will take to rebuild something. In software there is a high degree of creativity which can have large downstream effects. You need to architect your system in such a way to make it possible to replace components when needed, this is where strong separation of concerns is important.
One thing that I've seen happen time and again is an organization bifurcating itself, so that there is one team working on the new cool replacement, and the other working on the old dead thing that everyone hates. Needless to say this creates animosity and severely limits an organization's ability to respond to customer demands.
Starting over should be taken very seriously.
> The command that had been run was most likely ‘rm -r -f *’, which—roughly speaking—commands the system to begin removing every file below the current directory. This is commonly used to clear out a subset of unwanted files. Unfortunately, someone on the system had run the command at the root level of the Toy Story 2 project and the system was recursively tracking down through the file structure and deleting its way out like a worm eating its way out from the core of an apple.
That's my theory as to what they did.
rm -Rf test /
rm: it is dangerous to operate recursively on ‘/’
rm: use --no-preserve-root to override this failsafe
tl;dr That's when coreutils Ubuntu package switched from "--no-preserve-root" to "--preserve-root" by default.
edit: More info:
https://lwn.net/Articles/327141/
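The same kind of failsafe is easy to put in front of your own cleanup scripts. A minimal sketch in Python, assuming a wrapper around `shutil.rmtree`; the name `guarded_rmtree` and the `min_depth` threshold are hypothetical choices for this example, not anything rm itself does.

```python
import shutil
from pathlib import Path


def guarded_rmtree(target, min_depth=2):
    """Recursively delete target unless it looks dangerously broad."""
    path = Path(target).expanduser().resolve()
    home = Path.home().resolve()
    depth = len(path.parts) - 1  # number of components below the filesystem root
    if path == home or path in home.parents:
        raise ValueError(f"refusing to remove {path}: it is or contains the home directory")
    if depth < min_depth:
        raise ValueError(f"refusing to remove {path}: too close to the filesystem root")
    shutil.rmtree(path)
    return path
```

With the default threshold this refuses `/`, `/usr`, `~`, and anything that contains the home directory, while still allowing ordinary project subdirectories. It would not have saved a project tree that lives several levels deep, so it complements backups rather than replacing them.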
Remove write access to the parent directory, while keeping traversal and read rights, and you'll be sorted.
it's one thing to shoot yourself in the foot, but without safe-rm you (or someone else less cautious) will eventually fire the accidental head-shot. it's happened to me a couple times; but never again since I started using safe-rm everywhere.
[1] http://serverfault.com/questions/337082/how-do-i-prevent-acc...
Under some shells (eg bash) that will expand to include `..` and `.`.
Hopefully, it was not at the root directory and we have frequent snapshots.
Edit: What would have been much worse, if I had done it, would have been
rm -rf ~ /foo
rm -rf ~/foo
rm -rf /usr /lib/modules/something/something
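To see why one stray space is so destructive, it helps to look at how the command line is split into arguments before rm ever runs. A small Python sketch using the standard `shlex` module, which follows POSIX shell word-splitting rules; `rm_operands` is a hypothetical helper that only inspects the command and deletes nothing.

```python
import shlex


def rm_operands(command_line):
    """Return the non-option arguments that rm would be handed."""
    argv = shlex.split(command_line)
    return [arg for arg in argv[1:] if not arg.startswith("-")]


print(rm_operands("rm -rf ~/foo"))   # one operand
print(rm_operands("rm -rf ~ /foo"))  # two operands: the home directory and /foo
```

In the second case rm receives `~` and `/foo` as separate targets, so the whole home directory goes. The `rm -rf / home/myusername/mylargedir/` example earlier in the thread splits the same way, with `/` as its first target.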
https://www.wired.com/2001/11/glitch-in-itunes-deletes-drive...
You can use the `Pathname` class to treat paths as objects, which you can concatenate only valid paths with. Aside from obviously all the other sanity-saving features.
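`Pathname` is Ruby; the same idea can be sketched with Python's standard `pathlib` instead. The helper below is hypothetical (`child_of` is a name made up for this example) and shows one way to refuse a delete target when a variable turns out to be empty or points outside the directory you meant to work in.

```python
from pathlib import Path


def child_of(base, relative):
    """Join relative onto base and insist the result is strictly inside base."""
    base_path = Path(base).resolve()
    candidate = (base_path / relative).resolve()
    if base_path not in candidate.parents:
        raise ValueError(f"{relative!r} does not name a path inside {base_path}")
    return candidate
```

An empty string, `.`, `..`, or an absolute path like `/etc` all raise here, whereas naive string concatenation such as `base + "/" + name` would happily hand back the base directory itself or something outside it.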
Context: For the last five years, my backup system has been to have Time Machine do hourly backups on my MBP (main development machine, just shy of 1TB data), with key spots on my Linux server (3TB data at the moment) backed up daily to my in-laws' house using cron and rsync, and spot directories on the MBP backed up there as well.
But the hard drive on the Time Capsule I've used seems to have gotten unreliable, and the external USB drive I bought to replace it has not worked reliably for more than a day or two at a time. And even when it was all working properly, I was never really verifying my backups.
Do people have suggestions for secure, reliable, verifiable, easy backup systems capable of handling 4+ TB of data? I don't mind if it takes work or money to set it up; the important thing is once it's working I can mostly forget about it.
CrashPlan is the next-best option if you need Linux support, but the client isn't as good.
I worked at a few VFX studios, and everyone has deleted large swathes of shit by accident.
My favourite was when an rsync script went rogue and started deleting the entire /job directory in reverse alphabetical order. Mama-mia[1] was totally wiped out, as was Wanted (that was missing some backups, so some poor fucker had to go round fishing assets out of /tmp, from around 2000 machines.)
From what I recall (this was ~2008), there was some confusion as to what was doing the deleting. Because we had at the time a large (100 or 300tb[2]) lustre file system, it didn't really give you that many clues. They had to wait till it went on a plain old NFS box before they could figure out what was causing it.
Another time honoured classic is matte painters on OSX boxes accidentally dragging whole film directories into random places.
[1]some films have code names, hence why this was first
[2]That lustre was big, physically and in IO; it could sustain something like 2-5 gigabytes a second, and it had at least 12 racks of disks. Now a 4u disk shelf and one server can do ~2 gigabytes sustained
Second, there were very large nearlines that took hourly snapshots. Finally, lots and lots of tape for archive.
In other words, the data is still there in the flash, but only the SSD firmware (and physical access) has access to it.
NAND requires an erase before write so I wouldn't be surprised if some controllers are lazily erasing blocks to get better long term write speeds and prevent GC hiccups.
You might be able to recover something from the physical flash, but there are definitely no guarantees.
Honest question, could you elaborate on that? Intuitively I would've thought writing and erasing are _the same_ from a physical standpoint, insofar as "erasing" means writing zeros.
I remember that Apple removed the "Secure Empty Trash" feature in OS X 10.11 because they didn't feel like they could guarantee secure deletion with their new fast SSDs present in most of their computers.
Once SSDs arrived on the scene, they were limited by the interface as ATA didn't specify a way to mark sectors as unused. People found that their performance would degrade with use as all the sectors became utilized. But a secure erase command would mark all sectors as erased, so the drive would work like new. Later on, ATA got the TRIM (and later queued versions of TRIM) so the OS can mark specific sectors as erased. But the result is that a lot of people confuse flash sector erasing with secure erase.
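On the earlier question of why writing and erasing are not the same thing on flash: programming a NAND page can only flip bits from 1 to 0, while erasing resets a whole block back to all 1s. A toy Python model of that asymmetry follows. The class and its sizes are hypothetical, and real controllers add remapping, wear levelling, and garbage collection on top.

```python
class FlashBlock:
    """Toy model of one NAND erase block made of several pages."""

    def __init__(self, pages=4, page_size=2):
        self.page_size = page_size
        self.pages = [bytes([0xFF] * page_size) for _ in range(pages)]

    def program(self, page, data):
        """Programming can only clear bits (1 -> 0), so the result is old AND new."""
        if len(data) != self.page_size:
            raise ValueError("data must fill exactly one page")
        self.pages[page] = bytes(old & new for old, new in zip(self.pages[page], data))

    def erase(self):
        """Erasing resets every page in the block to all ones; there is no per-page erase."""
        self.pages = [bytes([0xFF] * self.page_size) for _ in self.pages]


block = FlashBlock()
block.program(0, b"\x0f\x0f")
block.program(0, b"\xf0\xf0")   # overwrite without erasing first
print(block.pages[0].hex())     # 0000, not f0f0
block.erase()
block.program(0, b"\xf0\xf0")
print(block.pages[0].hex())     # f0f0
```

Because an in-place overwrite gives the wrong result, drives write new data to already-erased pages elsewhere and leave the old page untouched until its block is reclaimed, which is why "deleted" data can linger in the flash.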
