This story is oddly similar to the Maersk cyber-attack in 2017[1]:
> The virus, once activated, propagated within just seven minutes. And for Maersk, as with other organizations, the level of destruction was enormous.
> As well as affecting print and file-sharing capabilities, laptops, apps and servers, NotPetya severely damaged both Maersk’s implementations of DHCP (Dynamic Host Configuration Protocol) and Active Directory, which, respectively, allow devices to participate in networks by allocating IP addresses and provide a directory lookup for valid addresses. Along with those, the technology controlling its access to cloud services was also damaged and became unstable while its service bus was completely lost.
> Banks is candid about the breadth of the impact: “There was 100% destruction of anything based on Microsoft that was attached to the network.”
> In a stroke of luck, they were able to retrieve an undamaged copy of its Active Directory from the Maersk office in Nigeria. It had been unaffected by NotPetya purely because of a power outage in Lagos that had taken the service offline while the virus was spreading. A rather nervous local Maersk employee was tasked with transporting the crucial data to Denmark.
What this says to me is that multiple offline backups (or failing that, copies) will save your bacon some day.
Interestingly, that's one of the ideas behind the use of tape storage today. It's pretty hard (although not impossible) for hackers to get at tapes sitting on a shelf.
There's no perfect scheme, but the best mechanisms I've seen for this in practice involve HSMs so that nobody, bad guy or good guy, can change the keys used for encryption without physically pulling the module, and it can be disconnected and stored at a safe location once the backup is complete. Each redundant set of a backup has an HSM, and the HSMs are cold stored in different physical locations.
But yes, this is hard to get right as I hinted by the "but not impossible" quip.
Many years ago I was working for a defence contractor and an idiot hit the TOTAL ERASE button on an HSM. The whole scheme went to hell when no one could work out how to load the keys back into it for nearly two weeks...
Practically you don't want to do that. You want them to be useless to someone who steals a tape since it could be a long time to never before you realize it's gone. All the tapes on the shelf look alike.
Since the tapes are physical, can't you protect them using different strategies? Like how banks protect stuff contained in safe deposit boxes? I imagine the stuff stored in safe deposit boxes can be quite sensitive too.
That's Iron Mountain's main business model, but it's considered good form not to leave it to them. Shit happens, and tape drives make it super easy to encrypt anyway.
That and banks don't seem to really try very hard nowadays, at least for retail customers. Maybe rich people have access to a better class of safe deposit boxes.
No, you typically encrypt at the tape drives themselves. Part of how they work is that they also compress at the drive, but if you encrypt first the data has no exploitable redundancy left and the compression works against you.
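If it helps, here's a minimal Python sketch of why encrypt-then-compress fails (not tape-drive specific, just zlib on a buffer, with random bytes standing in for ciphertext since that's what good ciphertext looks like to a compressor):

```python
import os
import zlib

# Roughly 1 MiB of repetitive "backup-like" plaintext vs. the same amount of
# random bytes, which is statistically what ciphertext looks like.
plaintext = b"customer_id=12345;status=ok;balance=100.00\n" * 25000
ciphertext_like = os.urandom(len(plaintext))

print(len(plaintext), len(zlib.compress(plaintext)))              # shrinks dramatically
print(len(ciphertext_like), len(zlib.compress(ciphertext_like)))  # slightly *grows*
```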
And picking and choosing is a recipe for disaster when something inevitably slips through and is leaked. Encrypt it all from orbit and let god sort it out.
If you’ve been physically penned aren’t you just rolling dice from then on anyway?
A financial org I worked for back in the day planned on slagging hardware in the event of a coloc breach of any sort including fire response. How else could you be sure?
Well, as a backup, USB drives don't have that great of a lifespan compared to tapes. They're also not nearly as space dense or competitive for $/GB once you've amortized the drives.
And for an org, there are still attacks in both cases. Do your backups get requested through an automated system? Well, you've just made a very slow tape library with people rather than robotics. Are the tapes/USB sticks encrypted and the encryption keys stored in an online system? Then the hackers have a mechanism to deny you access to the contents of the tapes, which is probably their goal anyway.
Yes a USB drive with a text file is pretty safe from such attacks, but it also doesn't scale to the backup concerns of even relatively small orgs.
Unless you're practicing continuous restoration to ensure that your backups are actually backing up what you need, the odds that you got everything critical are not good.
The idea that throwing a USB in a drawer does better is absolutely ridiculous.
I think this illustrates where folks can go wrong.
Many folks and businesses TODAY are backing up to NAS / RAID arrays etc. Many of these are easily corruptible by ransomware. We see this REPEATEDLY with many infections of large orgs.
The basic story here is that something not editable has great value even if not perfectly current. If you do a diff every 4 hours you have covered a LOT of ground already.
I've sat through continuous DR architecture discussions - the massive cost / complexity / scale can just be out of control for many users.
In many use cases Active Directory for example behind by 4 hours is not game ending. With notice to users that recent edits may not be captured you may be able to get back up and running pretty quickly.
I'm not singing the praises of "a USB in a drawer" — the medium doesn't matter, the number of copies doesn't matter. The only criteria I am considering is whether the backups are tested to prove that they actually work as intended.
There are shades of gray, here: continuous restoration is a best practice, but backups which are tested even once, and perhaps incompletely, are better than backups which are never tested at all.
Many continuous restoration / HA systems are surprisingly vulnerable to hacks because the level of management / coordination is pretty high. They work well if datacenter 1 goes down and you want to get to datacenter 2. They work less well (in my view) if someone gets root. A fair number of public victims you read about actually DID have continuous backups / DR setups.
OK, but how about at least testing that it's actually feasible to restore from that S3 bucket of yours?
Verifying that restoration is possible once is better than not testing restoration at all. As for verifying restoration on an ongoing basis, if the attacker has root I suppose we can argue about the cost/benefit ratio but continuous backup validation doesn't hurt.
Where did I say that I don't test restores? I think you are misunderstanding the S3 object lock feature. With S3 object lock, you write your backup, then can read to verify or do a test recovery, but you cannot modify or delete for the retention period.
In the package I use it's called Veeam Ready-Object with Immutability and there are now a fair number of solutions that meet that target.
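For anyone who hasn't used the feature, here's a hedged boto3 sketch of what that write path looks like. The bucket name, key and 30-day retention are made up, and the bucket has to have been created with Object Lock enabled; this illustrates the S3 feature itself, not Veeam's implementation of it.

```python
"""Sketch: write a backup object that can be read back and test-restored,
but not deleted or overwritten until the retention date passes."""
import datetime
import boto3

s3 = boto3.client("s3")
retain_until = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(days=30)

with open("full.tar.zst", "rb") as body:
    s3.put_object(
        Bucket="example-backup-bucket",            # hypothetical bucket with Object Lock enabled
        Key="backups/2024-01-01/full.tar.zst",
        Body=body,
        ChecksumAlgorithm="SHA256",                # object-lock uploads require an integrity checksum
        ObjectLockMode="COMPLIANCE",               # nobody, not even the root account, can shorten this
        ObjectLockRetainUntilDate=retain_until,
    )

# Reads (and therefore test restores) still work as normal:
obj = s3.get_object(Bucket="example-backup-bucket", Key="backups/2024-01-01/full.tar.zst")

# Attempts to delete or overwrite that object *version* before retain_until are rejected.
```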
This has been a pretty unsatisfying exchange. :( It's a good thing we don't work together, we seemingly cannot get on the same page.
From the start, I've been trying to add the qualifier that regardless of whether you throw your backup into a USB drive or an S3 bucket, if you don't test restoration, your backups are probably incomplete.
"Unless you're practicing continuous restoration to ensure that your backups are actually backing up what you need, the odds that you got everything critical are not good."
"OK, but how about at least testing that that it's actually feasible to restore from that S3 bucket of yours?"
These comments have been both off topic and unhelpful. I gave an example of S3 with object lock (or Veeam equivalents) that I liked - to be helpful to others exploring this question.
Continuous-type services are what this story is about - how things that are disconnected / not continuous may fare better in a hack / ransomware situation. Lecturing me that I need "continuous restoration" and/or making random claims that I don't test restoration (huh?) is weird.
If I was to give some feedback -> focus on one thing. If you don't like S3 with object lock, say that. If you want to add some comment about restorations from S3 with object lock being tricky for some reason, say it and explain it. If you have something about "continuous restoration" helping in malware hacks explain why. In many cases this can pass corrupted data over to your backup site if you are doing HA/DR work where continuous stuff is most common.
The "one thing" I have been focusing on is testing backups to prove that they actually work as intended.
The state of industry with regards to backups is wretched. If backups are done at all, they are often not tested — people just copy a bunch of files onto a RAID, or a USB drive, or an S3 bucket, or whatever, but never test restoration because that takes extra effort and may be inconvenient unless the system has been architected from day one to enable seamless restoration of live systems without obvious interruptions in service.
Specifically with regards to ransomware hacks...
The first thing that an attacker who gains root must do is corrupt production data. If there are no backups being created, or if the backups don't actually contain all critical data, then the hack is complete.
If there are backups being performed, then the attacker must also hack the backup system sufficiently that restoration from available backup data is infeasible (or at least more costly than just paying the ransom). This adds an additional layer of complexity to the attack.
If restoration is being tested on an ongoing basis, then the attacker, in addition to corrupting the production data, must also spoof monitoring so that the corruption of live and backup data doesn't get caught and interrupted before it finishes. This adds yet another layer of complexity to the attack.
A sufficiently competent and thorough attacker may overcome all these difficulties, but criminals are not always perfectly competent — they don't need to be because there are so many target organizations that are not competent either.
It seems that your specific organization does thoroughly test backups to prove that restoration actually works as intended. Kudos! However, I wasn't evaluating your specific case — that would have been weird and inappropriate as you were just giving a (helpful!) example, not laying out a comprehensive system design and asking for review. My point was to add the condition that, in principle, backed-up data must be not only 1) current and 2) not corrupted, but also 3) complete. And the way to prove completeness of backups is to test them, ideally on an ongoing basis.
Continuously restoring from a copy of the data you sent to cold storage provides an additional layer of protection.
If an attacker gets root they can theoretically disable the continuous restoration, but the more moving parts they have to tweak the more likely they'll be noticed while the org still has access to decent backups.
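As a concrete illustration of that "ongoing testing" idea, here's a rough Python sketch of a scheduled restore-and-verify job. The restore command, paths and manifest format are all hypothetical; the point is just that a machine regularly proves the backup can be read back and matches what was written.

```python
"""Hypothetical scheduled job: restore the latest backup into scratch space and
verify file hashes against the manifest recorded at backup time."""
import hashlib
import json
import pathlib
import subprocess
import sys

SCRATCH = pathlib.Path("/restore-test/scratch")          # hypothetical scratch area
MANIFEST = pathlib.Path("/restore-test/manifest.json")   # {relative_path: sha256}, written at backup time

def sha256(path: pathlib.Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def main() -> int:
    # Placeholder for whatever your backup tool's restore command actually is.
    subprocess.run(["restore-latest-backup", "--to", str(SCRATCH)], check=True)

    expected = json.loads(MANIFEST.read_text())
    failures = []
    for rel, digest in expected.items():
        restored = SCRATCH / rel
        if not restored.exists() or sha256(restored) != digest:
            failures.append(rel)

    if failures:
        print(f"RESTORE TEST FAILED: {len(failures)} missing or mismatched files", file=sys.stderr)
        return 1
    print(f"restore test ok: {len(expected)} files verified")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```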
This sounds like there's a whole other story in the aftermath; it would be interesting to hear how they dealt with it (deciding between dropping scenes vs. reproducing them):
> Jacob admits that about 10% of the film’s files were never retrieved and are gone forever, but they were able to make the film work without the scenes.
> Pixar, at the time, did not continuously test their backups. This is where the trouble started, because the backups were stored on a tape drive and as the files hit 4 gigabytes in size, the maximum size of the file was met. The error log, which would have told the system administrators about the full drive, was also located on the full volume, and was zero bytes in size.
At AWS, we use this story in some of our presentations to quickly illustrate the importance of backups and testing them. It also highlights the fact that DR scenarios can often arise from simple mistakes such as "rm -rf", and not the "hurricane" or other global event example that often gets overused. A good, cautionary tale.
The linked article is much better. This event happened right about the time I left Pixar and I heard about it long ago. In the end, much of what was recovered from Galyn's workstation wasn't used. Some time after this event, it was decided that the film was not working and a huge rewrite was done. John Lasseter stepped in as director and, through Herculean efforts, the film was completed on time.
Modern development practices — idioms, frameworks, libraries, traditions — make it difficult to test your backups. The priority on responsiveness means that you have to work extra hard when architecting the system to ensure that you can restore from backups without losing recent changes — which makes it impractical to confirm that your backups actually work.
Offering Continuous Restoration as a feature should theoretically be a way of differentiating a development tool in the software marketplace. But short-termers hold power in so many organizations that it's not clear to me whether it would actually win. Even if ransomware kills the company, executives still keep their money.
That link provides more info about the story; I remembered reading about it years ago. Also, after they restored most of the deleted files, most of the movie was scraped and redone.
scrapped - after we got this version of the film back, we decided to rewrite the story. that meant that we substantially rebuilt the film in the next year, even after we had recovered the version from Galyn's machine.
Many years ago I was responsible for backups in a mixed Unix/Windows system. I asked for funds and time to do a disaster recovery exercise in case a server system disk failed but was turned down flat. Luckily we never had to restore anything more than a few individual files.
It's not enough to confirm that the backup has actually been created; you also need to have the procedures necessary to do the restore, and the personnel and hardware to do it.
John Catmull (correction: sorry, his name is Ed Catmull, as tbab corrected me), Pixar co-founder and president wrote a book ("Creativity Inc") where he tells that story in detail.
He wrote that everybody on the team had forgotten about Sussman's copy, and she is the one who mentioned in a meeting that the copy existed, to the astonishment of everyone else. The van that went to pick up the computer at Sussman's house was equipped with pillows because they were worried even about the road's vibrations.
Roughly speaking the book validates the whole idea of "to build awesome stuff, hire the most creative and talented people you can find and create an environment for their imagination to run free". If you're interested in this perspective, I do very much recommend.
But, obviously, he doesn't address issues like John Lasseter running free with sexual harassment and other uncomfortable issues.
If you are even slightly curious about the history of Pixar and the decision-making process behind their actions, then it is 100% worth it. It was written extremely well.
If you are trying to read it purely for business advice or anything of that nature, you might find it a bit eh.
It is very heavy on telling an interesting story, which should explain rather well why it might not be the best for the purpose of being an educational material. But if you want an interesting delivery of the story of Pixar, along with a look at behind the scenes in the industry (e.g., Steve Jobs was featured in there as one of the important personalities, given he was heavily involved with Pixar) and behind some of the decision-making/philosophy of Pixar as a company as it was growing, it is a great and entertaining read.
Why not just copy all the data while at her house? Seems like moving the machine was an unnecessary risk. Bring over a box with some large drives and just let it run overnight?
The way I heard this story is that the movie being saved wasn't actually the theatrical Toy Story 2. Instead, it was Disney's version that was mostly scrapped when they resolved their dispute with Pixar.
And then, some months later, Pixar rewrote the film from almost the ground up, and we made Toy Story 2 again. That rewritten film was the one you saw in theatres and that you can watch now on Blu-Ray.
Yeah I think I'm mixing up my stories here. There was a "Disney Version" of Toy Story 3. And there was a very different Pixar version of Toy Story 2 that this story applies to.
This story has been told many times and this account is not the most complete.
Pixar did recover the data from a home office setup and complete the movie, but Pixar is so perfectionist about story-telling that the creative heads decided the end result wasn't good enough.
The story was rewritten and the movie changed so much that almost none of the recovered data was in the version that was finally released.
Wow, this makes me think: do animation teams really not use a version control system to have multiple people contributing to the same movie at the same time? Does anybody know how that works?
Your standard "shot" will contain 10s or hundreds of versions of animation, lighting, texturing, modelling & lighting. Not to mentions compositing. They are often interdependent too.
A movie like Shrek will have literally billions of versions of $things.
> Does anybody know how that works?
This changes by company, but everyone tends to have an asset database. This is a thing that (should) give a UUID to each thing an artist is working on, and a unique location where it lives (that was/is an NFS file share, although COVID might have changed that with WFH).
Where it differs from git et al is that normally the folder tree is navigable by humans as well, so the path for your "asset" will look something like this:
/share/showname/shot_number/lighting/v33/
There are caveats, something like a model of a car, or anything else that gets re-used is put somewhere else in the "showname" folder.
Now, this is all very well, but how do artists (yes, real, non-Linux-savvy artists) navigate and use this?
That's where the "Pipeline" comes in. They will make use of the Python API of the main programs that are used on the show. So when the coordinator assigns a shot to an artist, the artist presses a button saying "load shot" and the API will pull the correct paths, notify the coordination system (something like ftrack or Shotgun) and open up the software they normally use (Maya, ZBrush, Mari, Nuke, etc.) with the shot loaded.
Once the artist is happy, they'll hit publish.
The underlying system does the work of creating the new directory, copying the data and letting the rest of the artists know that there are new assets to be pulled into their scene as well.
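To make that concrete, here's a rough Python sketch of what a "publish" step can look like under the directory convention above. The share layout, version padding and the tracker call are hypothetical; real pipelines (ftrack/ShotGrid integrations, USD resolvers, etc.) are far more involved.

```python
"""Hypothetical 'publish' step: copy an artist's working files into the next
immutable version directory under /share/<show>/<shot>/<task>/v##/."""
import pathlib
import shutil

SHARE = pathlib.Path("/share")

def publish(show: str, shot: str, task: str, workdir: pathlib.Path) -> pathlib.Path:
    base = SHARE / show / shot / task
    base.mkdir(parents=True, exist_ok=True)

    # Find the highest existing version and bump it; published versions are never edited in place.
    versions = [int(p.name[1:]) for p in base.glob("v*") if p.name[1:].isdigit()]
    new_version = base / f"v{max(versions, default=0) + 1:02d}"

    shutil.copytree(workdir, new_version)
    # notify_tracker(show, shot, task, new_version)  # e.g. an ftrack/ShotGrid update, omitted here
    return new_version

# Usage (hypothetical): publish("showname", "shot_010", "lighting", pathlib.Path("/home/artist/work/shot_010"))
```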
Then there are backups. Again this is company dependent. Some places rely on hardware to figure it out. As in, they have a huge single isilon cluster, and they hook up the nearline and tape system to it and say: every hour sync changes to the nearline. every night, stream those to tape.
Others have wrapped /bin/rm to make sure that it just moves the directory, rather than actually deletes things.
Some companies have a massive nearline system that does a moving-window type sync, so you have 12 hourlies, 7 dailies and 1 monthly online at once. The rest is on tape. The bigger the company, the more often the fuckup, the better the backups are tested.
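The wrapped /bin/rm approach mentioned above can be as simple as something like this hedged sketch (the trash location and retention policy are made up): files get parked in a dated trash directory on the same share, and a separate sweep job purges old entries later.

```python
#!/usr/bin/env python3
"""Hypothetical rm wrapper: instead of deleting, move targets into a dated
trash directory; a separate sweep job purges entries older than N days."""
import datetime
import pathlib
import shutil
import sys

TRASH = pathlib.Path("/share/.trash")  # assumed to live on the same filesystem as the data

def main(targets: list[str]) -> None:
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    for target in targets:
        src = pathlib.Path(target)
        dest = TRASH / stamp / src.name
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dest))  # a rename on the same filesystem, so it's fast

if __name__ == "__main__":
    main(sys.argv[1:])
```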
Not really. With multiple animators working on files simultaneously, it's kind of hard.
Sadly my animation knowledge has left me since I left the studio gig three years ago. But we did create content for Netflix, and many animators came up sheepishly asking for a deleted folder to be restored. It's not as uncommon as you think.
FWIW, the live archive/backup server was called Dumbo. It was 3x 4U Supermacho chassis with over 1.2PB in drives, served over 10Gbit to each workstation connected at 1Gbit, running CentOS 5. I dropped a new chassis while racking it once, and that is partly the reason why I lost my job :/
Can't edit my post, but for clarity: I'm wrong on the line "Not really. With multiple animators working on files simultaneously, it's kind of hard." See comments above.
Perforce is still the de facto standard in the video game industry, where it's not uncommon for a game's source assets to run into the tens of terabytes.
That said, Toy Story 2 was developed in the late 90s, and while Perforce existed then I don't know how popular it was.
100% of the movie they were working on ended up lost. The team working on the film was not Pixar, but another studio assigned the job while Pixar was holding out during negotiations.
Once those negotiations were over, Pixar started work on a sequel, and this version was never released.
That is not true. I was there. The film that was nearly lost was being made at Pixar. There was never any version of TS2 made by anyone else. I think this idea may come from the fact that Disney owned the Toy Story characters and if Pixar had later gone with another partner studio, it would be Disney making Toy Story movies, not Pixar. This was a major motivator for Pixar to stay with Disney.
I think the other post above is referring to the version of a Toy Story sequel that was in production down in Burbank at Disney in the early 2000's (after TS2 came out). That production was shut down when Disney and Pixar merged in the mid 2000's.
The command that had been run was most likely ‘rm -r -f *’, which—roughly speaking—commands the system to begin removing every file below the current directory. This is commonly used to clear out a subset of unwanted files.
I've read that story a few times and I'm always struck by how aware management was when that did happen. I suspect that, at least half of the places I've worked, if I (or anybody else) happened to be that one employee who had a backup on their home computer, they'd find it impossible to communicate that to upper management in any way. The whole movie would be shut down while somebody was saying "I have a copy over here! I have a copy over here!"
Were you actually in the situation? Because at practically all the places that I worked for (some of them smaller, some of them very big), I don't think this would be a problem if so much was at stake...
It was 1998, and it wasn't that common, but in tech it wasn't impossible. For Pixar it was hard because of the size of the files, but otherwise wasn't all that hard.
I occasionally worked from home in 1998. The biggest impediment was network speed -- at work we had 10 Mbps and sometimes even 100 Mbps, but at home all we had was dial-up or, if you were really lucky, DSL, which was 0.384 Mbps, so 30 times slower, and latency was also a lot higher.
So you were very limited in what you could do. Mostly it was just email and remote access ssh with tmux, where you sometimes had to deal with really high latency.
I worked remote in 98. Depending on your location it wasn't quite that bad. I had SDSL at 1.54 mbit in those years and it was workable. You could get ADSL here then too for a bit more download at the cost of upload and overall connection stability. Granted, I was doing gamedev which had smaller assets than a film, but you could manage a bit more than just a terminal a lot of places.
My employer paid for ISDN in 1999 and I thought I’d reached nirvana. It wasn’t dramatically faster than modem, but an always-on connection directly to the corporate network was quite handy when getting paged at 3am.
In this case, if you were an animator, you could still create models/animations, but you wouldn't be able to sync up with the rest of team on a daily basis. Perhaps a little frustrating, but with decent communication it would have been manageable.
Unless you were an independent entity that could do most of your work over the phone with only occasional in-person meetings (e.g. certain kinds of salespeople and some artists), it was very rare.
It only happened in this instance because the person in question had a newborn baby, and I guess maternity leave wasn't a thing at Pixar back then.
I have done forensic reconstruction of UNIX files. The directory node list is erased/lost, but the raw data remains on disk. This assumes the data has a sequential structure. The article didn't say whether this was the model data or the rendered data.
That was disappointing. I was expecting something like a baby crawled over a cable effectively airgapping a system before a disaster happened. The whole story is just "they deleted everything but luckily had a backup".
This was cringe. I don't know how it even got traction. Someone was on maternity leave with access to offline backups, which doesn't mean a "baby saved" the movie. Misleading headline.
If you’re talking about Git, then as a rule of thumb you’ve got at least 30 days before anything will actually be deleted: you can retrieve what was lost from the reflog. You could start with something like `git checkout "main@{4 hours ago}"`, then delve further into the reflog itself if that doesn’t produce exactly what you want.
Having multiple backups, on unconnected systems, is the general rule.
Whether this is done in a chain (backing up the backups) or as multiple transfers from the sources is an implementation decision: ideally the latter, as otherwise a break in the first backup system can stop the second from updating. But the latter may require notably more resources at the source end and so compete with production systems for those resources, so the former isn't always a worse plan.
More importantly: test your backups. Preferably in an automated manner to increase the chance the tests are not skipped, but have humans responsible for checking the results. Ideally by doing a full restore onto alternate kit, but this is often not really practical. A backup of a backup is no extra help if the first backup is broken in a way you have not yet detected.
Home computers were pretty dinky and broadband slow at the time of Toy Story 2. I wonder how they'd get 90% of the movie transferred home and saved to disk/tape?
Yeah, there was nothing else for workstations in VFX and 3D back then in film. The occasional Sun box for rendering (Pixar might've had more than a few vs. origins at ILM). Amigas and 3dsmax for TV on PC, and Intergraph PC boxes.. Maya was very new after 9 versions of PA and TDI Explore, Softimage had just peeked into Windows.. you guys did your own thing with Marionette and whatnot. Exciting times.
This seems to be less about how a baby saved the day, and more about how an employee exfiltrated data from her employer. As a cybersecurity guy at a large financial institution, I wonder what Pixar’s policy was about the matter then, and what impact it had on the same policy today.
From what I recall from another telling of the story Pixar was the one who set her up with a home computer & storage system for that purpose - since the storage and computing power needed was higher than what most people would have at home - but I could be misremembering
You're on the right track. As told in The Next Web story, the workstation in question was a Silicon Graphics Octane or thereabouts. These were incredibly expensive and were provided by the company. In addition, according to the story, they were the ones who set up the file sync for her to be able to work from home.
Why this got downvoted a bit is beyond me; I'll never understand HN.
Oren, who worked at Pixar, recounts the story: "She has an SGI machine (Indigo? Indigo 2?) Those were the same machines we had at all our desktops to run the animation system and work on the film, which is what she was doing. Yes, it was against the rules, but we did it anyway, and it saved the movie in the end."
[1] https://www.i-cio.com/management/insight/item/maersk-springi...