Hacker News

I was at Pixar when this happened, but I didn't hear all of the gory details, as I was in the Tools group, not Production. My memory of a conversation I had with the main System Administrator as to why the backup was not complete was that they were using a 32-bit version of tar and some of the filesystems being backed up were larger than 2GB. The script doing the backup did not catch the error. That may seem sloppy, but this sort of thing happens in the Real World all the time. At the risk of spilling secrets, I'll tell a story about the animation system, which I worked on (in the 1996-97 time frame).
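The failure mode described, an archiver silently truncating while the wrapper script ignores its exit status, is cheap to defend against. This is a hypothetical sketch (not Pixar's actual script, which would have been shell on SGI hardware): it checks tar's return code and adds a crude size sanity check that would have flagged a 2GB truncation.

```python
import os
import subprocess

def backup(source_dir, archive_path):
    """Archive source_dir with tar, failing loudly if tar reports an
    error or the archive looks implausibly small (hypothetical check)."""
    result = subprocess.run(
        ["tar", "-cf", archive_path, source_dir],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"tar failed: {result.stderr.strip()}")

    # Sanity check: the archive should be at least as large as the
    # total size of the files it claims to contain (tar adds headers
    # and padding, so a smaller archive suggests truncation).
    src_bytes = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(source_dir)
        for name in files
    )
    if os.path.getsize(archive_path) < src_bytes:
        raise RuntimeError("archive smaller than source data; possible truncation")
```

The point isn't this particular check; it's that any check at all turns a silent partial backup into a noisy failure someone can act on.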

The Pixar animation system at the time was written in K&R C and one of my tasks was to migrate it to ANSI C. As I did that I learned that there were aspects of this code that felt like a school assignment that had escaped from the lab. While searching for a bug, I noticed that the write() call that saved the animation data for a shot wasn't checked for errors. This seemed like a bad idea, since at the time the animation workstations were SGI systems with relatively small SCSI disks that could fill up easily. When this happened, the animation system usually would crash and work would be lost. So, I added an error check, and also code that would save the animation data to an NFS volume if the write to the local disk failed. Finally, it printed a message assuring the animator that her files were safe and it emailed a support address so they could come help. The animators loved it! I had left Pixar by the time the big crunch came in 1999 to remake TS2 in just 9 months, so I didn't see that madness first hand. But I'd like to think that TS2 is just a little, tiny bit prettier thanks to my emergency backup code that kept the animators and lighting TDs from having to redo shots they barely had time to do right the first time.

The point is that one would like to think that a place like Pixar is a model of Software Engineering Excellence, but the truth is more complex. Under the pressures of Production deadlines, sometimes you just have to get it to work and hope you can clean it up later. I see the same things at NASA, where, for the most part, only Flight Software gets the full-on Software Engineering treatment.

Right on the money with the "Real World" anecdote.

We do penetration tests for a wide range of clients across many industries. I would say that the bigger the company, the more childish flaws we find. For sure the complexity, scale, and multiple systems do not help towards having a good security posture, but never assume that because you are auditing a SWIFT backend you will not find anything that can lead to direct compromise.

Maybe not surprisingly, most startups that we work with have a better security posture than F500 companies. They tend to use the latest frameworks that do a good job of protecting against the standard issues, and their relatively small attack surface doesn't leave you with much to play with.

Of course there are exceptions.

Would love to have a chat about your view on security posture between smaller and bigger companies, but couldn't find your email in your HN profile. Mine is in my profile so if you have the time, please send me a message.

I actually think that would make a great discussion on HN. ;-)

Hmmm can't seem to find your email in your profile. I think you have to put it in the about section.

Ha, oops :-) updated.

One of the really interesting artifacts of the NASA flight software programs is that they put an honest-to-god, ground-truth upper bound on the effort required to produce "perfect" software. Everything else we do is an approximation at some level of fidelity. The only thing even reasonably close is maybe SQLite, and most people think its testing code is about 10x overkill.

It makes one start to contemplate how little we really understand about software and how nascent the field really is. We're basically stacking rocks in a modern age where other engineering disciplines are building half-km tall buildings and mile-spanning bridges.

Fast forward 2500 years and the software building techniques of the future must be as unrecognizable to us as rocket ships are to people who build mud huts.

We're stacking transistors measured in nm into worldwide communications systems, compelling simulations of reality, and systems that learn.

The scale is immense, so everything is built in multiple layers, each flawed and built upon a flawed foundation, each constantly changing, and we wouldn't achieve the heights we do if perfection, rather than satisfaction, was the goal.

Perhaps at some point the ground will stop shifting.

Sure we would, it would just take longer. A thousand years instead of 50. But just like we still use bridges and roads thousands of years old today, our distant descendants would still be using the exact foundations of what we produce now.

Sure, eventually we would get there. But we wouldn't be as far as we are at this moment.

You mean holding it all together with duct tape and chewing gum?

I mean being able to communicate with people around the world in real time, and all the rest.

> Perhaps at some point the ground will stop shifting.

Doubtful. Machines will build the ground instead, and what they build on top of it will be incomprehensible to us; at least we'll get to observe in awe.

That's what the comment below is saying: the scale just can't be compared. The orders of magnitude of complexity, and the number of variables in computer systems, are far bigger than in any other engineering discipline.

> The script doing the backup did not catch the error. That may seem sloppy, but this sort of thing happens in the Real World all the time.

I disagree. I mean, I agree those things happen, but the system administrator's job is to anticipate those Real World risks and manage them with tools like quality assurance, redundancy, plain old focus and effort, and many others.

The fundamental rule of backups is to test restoring them, which would have caught the problem described. It's so basic that it's a well-known rookie error and a source of jokes like, 'My backup was perfect; it was the restore that failed.' What is a backup that can't be restored?

Also, in managing those Real World risks, the system administrator has to prioritize the value of data. The company's internal newsletter gets one level of care, HR and payroll another. The company's most valuable asset and work product, worth hundreds of millions of dollars? That gets treated as a personal mission where no mistakes are permitted: check and recheck, hire someone from the outside, create redundant systems, etc. It's also a failure of the CIO, who should have been absolutely sure of the data's safety even if he/she had to personally test the restore, and the CEO too.

I don't know or recall the details well enough to be sure, but it's possible that they were, in fact, testing the backups but had never before exceeded the 2GB limit. Knowing that your test cases cover all possible circumstances, including ones that haven't actually occurred in the real world yet, is non-trivial.

Your post is valid from a technical and idealistic standpoint, however when you realize the size of the data sets turned over in the film / TV world on a daily basis, restoring, hashing and verifying files during production schedules is akin to painting the Forth Bridge - only the bridge has doubled in size by the time you get halfway through, and the river keeps rising...

There are lots of companies doing very well in this industry with targeted data management solutions to help alleviate these problems (I'm not sure that IT 'solutions' exist), however these backups aren't your typical database and document dumps. In today's UHD/HDR space you are looking at potentially petabytes of data for a single production - solely getting the data to tape for archive is a full time job for many in the industry, let alone administration of the systems themselves, which often need overhauling and reconfiguring between projects.

Please don't take this as me trying to detract from your post in any way - I agree with you on a great number of points, and we should all strive for ideals in day to day operations as it makes all our respective industries better. As a fairly crude analogy however, the tactician's view of the battlefield is often very different to that of the man in the trenches, and I've been on both sides of the coin. The film and TV space is incredibly dynamic, both in terms of hardware and software evolution, to the point where standardization is having a very hard time keeping up. It's this dynamism which keeps me coming back to work every day, but also contributes quite significantly to my rapidly receding hairline!

> Your post is valid from a technical and idealistic standpoint

You seem to have direct experience in that particular industry, but I disagree that I'm being "idealistic" (often used as a condescending pejorative by people who want to lower standards). I'm managing the risk based on the value of the asset, the risk to it, and the cost of protecting it. In this case, given the extremely high value of the asset, the cost and difficulty of verifying the backup appears worthwhile. The internal company newsletter in my example above is not worth much cost.

> solely getting the data to tape for archive is a full time job for many in the industry, let alone administration of the systems themselves, which often need overhauling and reconfiguring between projects.

Why not hire more personnel? $100K/yr seems like cheap insurance for this asset.

> restoring, hashing and verifying files during production schedules is akin to painting the forth bridge - only the bridge has doubled in size by the time you get half way through, and the river keeps rising...

> you are looking at potentially petabytes of data for a single production

I agree that not all situations allow you to perform a full restore as a test; Amazon, for example, probably can't test a complete restore of all systems. But I'm not talking about this level of safety for all systems; Amazon may test its most valuable, core asset, and regardless there are other ways to verify backups. In this case it seems like they could restore the data, based on the little I know. If the verification is days behind live data or doesn't test every backup, that's no reason to omit it; it still verifies the system, provides feedback on bugs, and reduces the maximum data loss to a shorter period than infinity.

> I disagree that I'm being "idealistic" (often used as a condescending pejorative by people who want to lower standards)

A poor word choice on my part. It was certainly not meant to come across that way, so apologies there! Agreed that a cost vs risk analysis should be one of the first items on anyone's list, especially given the perceived value of the digital assets in this instance.

No problem; I over-reacted a bit. Welcome to HN! We need more classy, intelligent discussion like yours, so I hope you stick around.

I think GP's point is that although it's obviously sloppy, it's also sadly common.

Because sure, it's basic. To someone who knows that it's basic.

Also, this was way back in 1998, when what we would consider sloppy today was par for the course.

This particular case is one that's hard to test - you'd restore the backup, look at it, and it would look fine; all the files are there, almost all of them have perfect content, and even the broken files are "mostly" ok.

As the linked article states, they restored the backup seemingly successfully, and it took two days of normal work until someone noticed that the restored backup was actually not complete. How would you notice that in backup testing that (presumably) shouldn't take thousands of man-hours?

Good points. High-assurance can be very expensive in almost any area of IT. Speaking generally, when the asset is that valuable, the IT team should take responsibility for anticipating those problems - difficult, but not impossible. Sometimes you just have to roll up your sleeves and dig into the hard problems.

Speaking specifically, based on what you describe (neither of us is fully informed, of course), the solutions are easy and cheap: verify the number of bytes restored, the number of files and directories restored, and checksums (or something similar) for individual files.
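Those checks, file counts, byte counts, and per-file checksums, fit in a few lines. A minimal sketch (hypothetical helper names, not any particular backup product's API):

```python
import hashlib
import os

def fingerprint(root):
    """Walk a tree and return {relative_path: (size, sha256 hex)} —
    enough to compare a restore against the original."""
    out = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            h = hashlib.sha256()
            with open(full, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            out[rel] = (os.path.getsize(full), h.hexdigest())
    return out

def verify_restore(original_root, restored_root):
    """Return (missing, corrupt): paths absent from the restore,
    and paths whose size or checksum differs."""
    orig = fingerprint(original_root)
    rest = fingerprint(restored_root)
    missing = set(orig) - set(rest)
    corrupt = {p for p in set(orig) & set(rest) if orig[p] != rest[p]}
    return missing, corrupt
```

At petabyte scale you would sample or run this incrementally rather than over everything, but even a partial comparison would have surfaced the truncated files long before two days of production work went by.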

The impression I got from the descriptions of that incident and especially the followup was that their main weakness was not technical, but organizational - their core business consisted of making, versioning and using a very, very large number of data assets that were very important to them, but they apparently didn't have any good process of (a) inventory of what assets they have or should have, and (b) who is responsible for each (set of) assets. Instead, the assets "simply" were there when everything was going okay, and just as simply weren't there without anyone noticing when it wasn't.

If they had even the most rudimentary tracking or inventory of those assets/artifacts, the same technical problems would have allowed a much simpler and faster business recovery; instead, circumstances forced them to inventory something that they (a) possibly didn't have and (b) didn't know needed to exist in the first place, and (c) in a hurry, without preparation or adequate tools or people for that.

IT couldn't and cannot fix that - implementing a process may need some support from IT for tooling or a bit of automation, but most of the fix would be by and for the non-IT owners/keepers of that data.

Cool story :) You are bang on point when it comes to software engineering at what are thought to be "top tier" development houses. In the ideal world sure they will build the very best software but the real world has [unrealistic] deadlines and when you have deadlines it means corners get cut. Not always but very often. This leads to the whole "does it do exactly what is required?" and if it does then you are moved onto the next thing often with the "promise" that you will be able to go back and "fix things" at a later date. Of course we all know that promise is never kept.

On a related note:

"Backups aren't backups until they've been tested."

They really are Schrödinger's backups until a test restore takes place. This is one area where people cut corners a lot because no one cares about backups until they need them. But it's worth the effort to do them right, including occasional, scheduled manual testing. If you can't restore the data/system you're going to be the one working insane hours to get things working when a failure occurs.

And then there's the aftermath. Unless you are lucky enough to work for a blame free organization major data loss in a critical app due to a failure of the backup system (or lack thereof) could be a resume generating event. If you're ordered to prioritize other things over backups make sure you get that in writing. Backups are something everyone agrees is "critical" but no one wants to invest time in.

> As I did that I learned that there were aspects of this code that felt like a school assignment that had escaped from the lab

from a brief stint in the gfx industry, you are correct.

Pixar isn't a model of software excellence, it's a model of process and (ugh) culture excellence.

Didn't Pixar invent the alpha channel? Being the 'a' in rgba is pretty rad!

No, a Pixar co-founder invented it well before Pixar existed.

But I mean, the other co-founder was Ed Catmull, so it's not like the company is short on innovation.

I've heard about NASA's Flight Software teams being very strict on 9-5 work hours, with lots of code review and tests. I was under the impression this wasn't as strict with the competition from SpaceX and Blue Origin now that we aren't sending people to space on our (USA) own rockets. Is my impression incorrect?

SpaceX (or rather, Elon Musk) is famous for pushing their developers hard. Elon sent his team to live on a remote island in the Pacific [1] where they were asked to stay until they could (literally!) launch.

[1] https://www.bloomberg.com/graphics/2015-elon-musk-spacex/

Fantastic read.

I don't work in the manned space part of NASA and the software I deal with isn't critical, so I can't say. Most of what I know about Flight software development comes from co-workers who've done that work. They speak of getting written permission to swap two lines of code. That sort of thing. I think it would be cool to have code I wrote running on Mars, but I don't know if I could cope with that development environment.

I have some friends on the mechanical engineering side of things at SpaceX and can definitely say that 9-5 work hours don't even seem to be a suggestion. It likely varies team to team though.

My recollection of the details is lacking but this jibes with what I remember about a talk I attended by a Pixar sysadmin. I think there were only a couple of slides about it since it was just one part of a "journey from there to here" presentation about how they managed and versioned large binary assets with Perforce.

There are other anecdotes online about this catastrophic data loss and backup failure, but I think it was, funnily enough, the propensity of some Perforce end users to mirror the entire repository that saved their bacon. I say funny because this is something a Perforce administrator will generally bark at you about: syncing that enormous monolithic repo spawned an associated server-side process that ran as long as your sync took to finish, and thanks to some weird quirks of the Perforce SCM, long-running processes were bad and would/(could?) fuck up everyone else's day. In fact, I think a recommendation from Pixar was to automatically kill long-running processes server side and encourage smaller workspaces. Anyway, I digress. They were able to piece it together using local copies from certain workstations that contained all or most of the repo. Bad practices ended up saving the day.

> The Pixar animation system at the time

Was that menv? I've heard stories that Pixar builds these crazy custom apps that rival things like 3D Studio and Maya but that never leave their campus!

Yes, Menv, for Modelling ENVironment, although it was a full animation package, not just a modeler. It has a new name now and has been extensively rewritten.

I think it might have turned out better if they had lost the movie - Toy Story I and III have really good plots, but the screenplay of Toy Story II isn't that stellar. Is it possible that they would have changed the story of II if it had been lost? (Mr. Potato Head might have said that they lost the movie on purpose.)

They actually did rewrite the entire story. What they recovered was completely remade (per the answer in the OP).

The original version did have Woody being stolen by a toy collector and being rescued by the other toys. I don't think many of the specifics beyond that survived the rewrite, but I don't know for sure. There are links elsewhere in this thread to claimed versions of the original story, but I can't vouch for their authenticity. I never saw any of the in-progress story reels.

On a somewhat related note I hope someday they'll take the scene assets they have from the older films, beef up the models or substitute them with newer ones from recent movies and re-render them. The stories are solid and a remaster of older Pixar films would be a hit I think.

I'd actually disagree with you - I think the story in Toy Story II as released is top notch.

Unlike most people, I found TS3 to be the weakest of the three. It seemed like a terrific idea for a short that got padded to make it feature length and the padding was pretty average. I wish they had made that short.

I think this might be due to timing. I saw TS when it first came to premium cable. Then I fell way behind on animated films. In 2012 I went on a binge to catch up, and re-watched TS, then saw TS2 for the first time a week later, then TS3 for the first time a week after that.

So for me I was watching TS3 while TS and TS2 were still fresh. Most people were seeing TS3 after a long gap so they may have forgotten details of TS2, and that gap also helped TS3 get a huge nostalgia factor. TS2 was fresh in my mind, and I had no nostalgia.

I'm not sure if I'd put TS or TS2 on top. TS had more novelty, but TS2 explored more weighty themes and had deeper emotional content.
