Computer Virus Cripples Several Taiwan Semiconductor Plants (bloomberg.com)
123 points by dosy on Aug 4, 2018 | 96 comments

For those of you thinking that TSMC only gets set back by the time it took to recover the equipment from the virus (1-2 days), realize that there are some critical steps in the manufacturing process that require completion within X time or the wafer is trash. Also realize that wafers take several weeks of fab processing, and this may be a substantial hit to their output for a while. They might even need to rob wafer capacity from smaller companies to placate the tier 1 customers like Apple.

What trashes the wafers? Is it some intra-wafer chemical reaction? Atmospheric interactions?

So first, to put a term to what the grandparent is describing: these are generally referred to as 'queue times', referring to the wafers waiting in a queue.

Effectively it is an atmospheric interaction, most commonly with oxygen but also light, hydrogen sulfide (from agricultural production - land is cheaper in the countryside), and other sources of contamination.

Most wafer processing - especially deposition of materials like metals - is done in a vacuum. If you deposit a metal (let's assume aluminum) in a vacuum you effectively get a pure aluminum layer. Pure aluminum is highly reactive with oxygen.

As soon as that layer is exposed to air it oxidizes - same thing as rust. You know this as the dull surface finish that raw/bare aluminum gets if left out or when you purchase it. It's what happens very quickly after you machine aluminum - the shiny surface goes away. That is the aluminum oxidizing.

Aluminum oxide has different electrical properties than pure aluminum, different enough to fundamentally affect the function of the transistor devices being constructed. If I have a 2" thick piece of aluminum, that oxide layer, maybe only a few microns (assume 5) thick after 24 hours, is basically inconsequential. It represents 0.01% of the thickness of the piece. It introduces some error in the electrical properties of the piece, but not much more. However, if the semiconductor layer is much thinner... maybe 10 microns thick... the effect of the oxidation on the electrical properties is much, much higher and pushes it outside the tolerances the device can have and still function.
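To put rough numbers on the thick-vs-thin comparison above, here's a quick Python sketch. It only uses the round figures from the comment (a 5 micron oxide, a 2 inch part, a 10 micron layer), which are illustrative assumptions and not measured values; the function name is made up for this example.

```python
# Fraction of a part's thickness taken up by a surface oxide layer.
# All thicknesses are the comment's illustrative round numbers.
MICRONS_PER_INCH = 25_400.0

def oxide_fraction(oxide_um: float, total_um: float) -> float:
    """Return oxide thickness as a fraction of total thickness."""
    return oxide_um / total_um

bulk_fraction = oxide_fraction(5.0, 2 * MICRONS_PER_INCH)  # 2 inch thick piece
thin_fraction = oxide_fraction(5.0, 10.0)                  # 10 micron layer

print(f"2 inch piece:    {bulk_fraction:.4%}")
print(f"10 micron layer: {thin_fraction:.0%}")
```

With these inputs the bulk piece comes out at roughly a hundredth of a percent, while the thin layer is half oxide, which is the whole point of the comparison.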

And because the deposition layers aren't just surfaces, but penetrate to fill trenches in the devices, they are basically impossible to remove at most steps.

So why don't they hold the queue under vacuum? Or under some inert gas?

Expense of retrofitting. There is a standardized 'FOUP' that wafers are stored in, and making a vacuum-capable FOUP and redesigning the systems to support it for just a couple of areas of the FAB would be hard and expensive.

Edit to add a few more details:

A vacuum-capable FOUP would be much MUCH heavier, requiring retrofitting of the overhead vehicles that transport wafers around the FAB. Further, each FOUP carries 25 wafers. A FAB can easily put out 30k wafers a week... it's a lot of FOUPs. Even if you just used them at critical steps it would get heavy and expensive. I believe there was some talk about inert gas purges with the move to 450mm wafers but, like 450mm wafers, that never happened.
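A back-of-envelope sketch of "it's a lot of FOUPs", in Python. The 25 wafers per FOUP and 30k wafers a week come from the comment above; the 6-week cycle time is purely an assumption for illustration (the thread only says "several weeks"), and the work-in-progress estimate uses Little's law (WIP = throughput x cycle time).

```python
import math

WAFERS_PER_FOUP = 25        # from the comment above
WAFERS_PER_WEEK = 30_000    # from the comment above
ASSUMED_CYCLE_WEEKS = 6     # assumption: stand-in for "several weeks" in the fab

def foups_needed(wafers: int, per_foup: int = WAFERS_PER_FOUP) -> int:
    """Number of full-or-partial FOUPs needed to hold `wafers`."""
    return math.ceil(wafers / per_foup)

# FOUP-loads finishing per week.
weekly_foups = foups_needed(WAFERS_PER_WEEK)

# Little's law: wafers in progress = throughput * cycle time.
wip_wafers = WAFERS_PER_WEEK * ASSUMED_CYCLE_WEEKS
wip_foups = foups_needed(wip_wafers)

print(weekly_foups, wip_foups)
```

Under that assumed cycle time, every one of the in-progress FOUPs would need replacing for a fab-wide change, which is where the retrofit cost balloons.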

OK. But maybe plastic bags and tanks of helium?

I spent some time working in a fab. Nothing is cheap, competition is tough, yield is everything. 99.8% yield on each step of a 500 step process is good for 38% overall recovery.

These events have been shown to be very rare, and the costs would greatly outweigh the benefits (for the time being).

The cheapest piece of equipment was the stand the FOUPS come down and rest on. From memory, that was easily 5 figures, perhaps low 6. Every other piece of processing equipment was insanely more expensive.

Queuing theory is an active field of research with great applications in wafer processing. The total process has to be looked at. Wafer priority, wafer value, planning preventative maintenance on the machines, etc. could all be taken into consideration when scheduling when wafers will run.

Retrofitting every "robot" that grabs FOUPs for a much heavier load, upgrading the tracks if needed, replacing every single FOUP, adjusting factory flow and inventory holding areas, doing qualifying trials on the FOUPs (to make sure they're functioning correctly and not leaking), etc. are all considerable expenses.

Prioritization of wafers and timing the preceding step so the required machine is certain to be ready in the window would yield better results from a strict business ROI perspective.

> These events have been shown to be very rare, and the costs would greatly outweigh the benefits (for the time being).

Yes. And I was responding to georgeburdell's comment about long-term losses from this virus infection, in that maybe some crude hack could ameliorate them.

> For those of you thinking that TSMC only gets set back by the time it took to recover the equipment from the virus (1-2 days), realize that there are some critical steps in the manufacturing process that require completion within X time or the wafer is trash. Also realize that wafers take several weeks of fab processing, and this may be a substantial hit to their output for a while.

It's not a bad argument, just one lacking some information on how this all works. And sincerely that isn't meant as an insult... it's just reality. Ask me about surgery and I'll shrug and say 'i got nothing yo'

Everything in semiconductors is basically incomprehensible in scale. A scanner (i.e., the photolithography projection tool) takes about 2 seconds to fully expose a wafer. That wafer probably has 100 individual chips on it, each of which has around 2 billion transistors. That means that tool is helping create 100 billion transistors a second.
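The arithmetic behind that last figure, as a tiny Python sketch. All three inputs are the comment's own ballpark numbers, not specs for any particular scanner or chip.

```python
# Ballpark inputs taken from the comment above.
chips_per_wafer = 100
transistors_per_chip = 2_000_000_000
seconds_per_wafer = 2.0

transistors_per_wafer = chips_per_wafer * transistors_per_chip
transistors_per_second = transistors_per_wafer / seconds_per_wafer

print(f"{transistors_per_second:.0e} transistors exposed per second")
```

So 200 billion transistors per wafer at 2 seconds per wafer gives the 100 billion per second quoted above.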

The great-grandparent comment is right... but the loss from that is probably in the range of one to two weeks of production. Retrofitting the fab like you suggest, and to be clear it isn't a fundamentally bad idea, would cost billions and take months if not years. What you propose is technically correct but irrelevant from a business standpoint.

Long term here means a week or so...it's expensive but rare. They don't play with 'crude hacks' because you don't know the implications of that hack 10 years from now when the chip is running part of the stock market.

It's just an industry that is hard to process, whether you work in it or not, because of the scale and the precision.

Sorry if that came off abrasively. avs733 mentioned everything is on a different scale in semiconductors and I forgot how long it took me to grasp that.

In school (industrial engineering) we talked about length and thickness tolerances being quite tight at 1/10,000 inch. The units used in semiconductors are nanometers, angstroms, etc. You might do a process and lay down a layer of material just a couple of atoms thick... I had no idea that was physically possible.

I do enjoy when we are all nice to each other.

Semiconductors still blow my mind... eventually we will run out of base silicon to use and are possibly really fucked but it's been a fun ride so far.

The thing that shocked me the most is that it's economically viable. The tolerances are so tight, the machines so expensive, the volumes so big, that the math works. If I had no idea what semiconductors were and you just sat me down and started explaining things, it'd seem like magic that it would even be physically possible, much less economically...but somehow it is.

It is the most bullshit sounding industry ever to an outsider. You make things smaller than the wavelength of light, out of sand, to turn on and off trillions of times a second, to do math, to make all the world's information instantly available on a metal and plastic block I keep in my pocket?

No, that's not real, that's a silly science fiction novel.

>99.8% yield on each step of a 500 step process is good for 38% overall recovery.

What does that mean? Is 'yield' not the end usable product from the batch?

Well, Calc tells me that 0.998^500 is 0.3675 :)

It's a familiar concept from organic chemistry.

Edit: If yield per step averages 99.8%, and there are 500 steps, overall yield is 36.8%.
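For anyone who wants to reproduce the calculation, here's a minimal Python sketch of compounding yield. It assumes every step has the same yield and that failures at different steps are independent, which is the simplification the example numbers imply; the function name is just for illustration.

```python
def overall_yield(per_step_yield: float, steps: int) -> float:
    """Overall yield when each of `steps` operations independently
    passes a fraction `per_step_yield` of what it receives."""
    return per_step_yield ** steps

result = overall_yield(0.998, 500)
print(f"{result:.4f}")
```

Small per-step losses compound multiplicatively, so the overall number is far lower than the 99.8% figure might suggest at first glance.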

'yield' is a really impossible measure for this kind of process, because it presupposes full end to end knowledge of fabrication processes.

Not sure I follow. The 99.8% is just an example number. If one machine messed up really bad and had just 50% yield, all other machines would need to do far better in terms of overall quality to achieve 36% yield. Of course every single machine isn't at the same quality level.
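The "one bad machine" case can be sketched the same way in Python. This solves for the per-step yield the other 499 steps would need in order to hit the same overall number when a single step runs at 50%. It keeps the simplification of identical, independent steps, and the function name is hypothetical.

```python
def required_step_yield(target: float, steps: int, bad_step_yield: float) -> float:
    """Per-step yield y that the remaining (steps - 1) operations need so that
    bad_step_yield * y ** (steps - 1) equals `target`."""
    return (target / bad_step_yield) ** (1.0 / (steps - 1))

baseline = 0.998 ** 500                      # every step at 99.8%
needed = required_step_yield(baseline, 500, bad_step_yield=0.5)

# With these inputs the other steps need about 99.94% each.
print(f"other 499 steps need about {needed:.4%} each")
```

Going from 99.8% to roughly 99.94% per step means cutting the per-step loss to less than a third, across every other operation, to make up for one bad tool.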

At the same time, when wafers get scrapped, they do justify why they were scrapped. There are metrology processes and tests performed as you go to ensure you don't run ruined wafers through additional manufacturing steps.

Yes it is. The 99.8% hypothetical number is on a per operation basis. Depending on the layers and complexity of the chip, the ballpark number of operations might be 300-1500 (lots of hand waving here).

The gist of that part is meant to show how important and impressive the entire process is. In "normal" manufacturing, 99.8% good parts is a pretty darn good process. Many of the easy wins are already implemented. Most normal things are manufactured in far fewer steps, so even if you're only making 95% good parts, it doesn't absolutely kill your total recovery (start to finish).

Because it costs money and doesn't solve a problem in a functional production line.

Presumably if these kinds of multi-day offline issues become the norm, we actually will see these investments being made?

One of the process types etches material from the wafer surface (we are talking about a material layer thickness measured in atoms). If the etching is not stopped or quenched in time due to a control system failure, the wafer is 'over etched' and unrecoverable.

One former employer, a semi equipment company, had a weekly Monday Morning 'Crash Report'. Our tools performed a wet chemistry etch step on 25 wafers at a time. If the control system failed, the entire lot was in peril. So Monday AM, we'd hear from the CEO which fab managers had called him over the weekend, screaming about the value of the wafers our tools had destroyed. Usually in the $50,000 - $250,000 range. Fun times.

It's surprising to many, but these tools run primarily on Windows, and, without anti-virus protection because it can interfere with critical timing actions that the tools need to make. The IT systems are locked down like a fortress, but tool technicians can still bring viruses in and transmit them when they connect their laptops for diagnostics and maintenance. I can easily see a virus running amok.

Seems likely. The monitor shown at 1:09, 1:58, and 7:25[1] on the "TSMC Fab Tour" video on this page[2] looks suspiciously like Windows 95, but could be Windows Embedded I guess.

[1] https://imgur.com/a/J4EhzPC

[2] http://www.tsmc.com/english/newsEvents/dc_video.htm#

On a factory floor? It could be Win 3.1. Because something is high tech doesn't mean the machine building it hasn't been repurposed a dozen times. DOS is still out there.

In our factory, there is still at least one OS/2 PC running. I'm not proud of this.

It's not necessarily something to not be proud of, though. NASA's space shuttles ran on something like a 386 CPU. That's what was available when the system was designed. Probably one of the best "if it ain't broke, don't fix it" examples. Swapping out to newer components would have meant replacing so much of the system, that it wasn't worth it. Especially if it was doing everything that needed to be done.

Same thing with your OS/2 system. As long as it's working, there's really no need to change it. The fun begins when it no longer works and needs to be replaced. How much of the factory will then need to be replaced/updated just because the controlling computer can no longer operate the older equipment? Those are the types of scenarios that give managers nightmares.

Well, I think the problem arises out of the fact that most systems that size are "broke", and some other systems (a small QNX or OpenBSD [or maybe these days, some form of L4]) might have fewer flaws that can cause issues, the likes of which TSMC has apparently suffered.

Last time I was in a 300mm fab I definitely saw OS/2, command lines being used by technicians, and a LOT of Windows Embedded 3.1.

The main defense is a computer at the entrance where technicians are supposed to scan their USB drives. It's a devil's choice really... do you put the tools on an externally accessible network? If not, how do engineers get data off of tools? How does the factory talk to itself and schedule things?

I don't envy FAB IT. Utterly thankless.

OS/2 was very stable, and certainly immune to viruses designed for Windows. The only problem back in the day could arise from its infamous single input queue, which could freeze the entire system if a single program didn't respond in time to GUI events. The problem however was corrected in the later versions.

Looks like XP Embedded.

We've seen this before many times. It's an inherent weakness of proprietary software solutions: you're at the mercy of the vendor as far as porting it to a new platform. The weakness becomes more urgent as the platform grows old/unsupported/insecure/etc. It may have been Windows this time, but it could have been DEC or Solaris or what have you.

Open Source Software may or may not be easy or economically realistic to port, but at least the users have a say.

This isn't caused by lack of open source software, this is caused by lack of management, partitioning, and containment. I assume the issue was caused by a 100% flat network, with office workers' laptops being on the same subnet as million dollar fab machine computers.

Even an airgapped system can be breached, as we saw with Stuxnet. There's no perfect solution for protecting vulnerable systems.

There is benefit in making a negative outcome less probable, like there's benefit in wearing a bullet-proof vest though you can still be shot in the face.

Remove all data ports?

Upload instructions via punch cards

"Hey guys I found these punch cards in the parking lot"

Certainly those are also factors, but the central security principle of defense in depth argues for minimizing all possible factors. Having nodes that are highly vulnerable to viruses is a weakness and a contributing factor towards a breach.

You are kidding right?

The machines at TSMC would be at 100% utilization on million dollar jobs, management would never accept downtime for maintenance for software updates that could introduce bugs.

It's one thing for consumer devices to receive updates, it's a completely different ballgame with industrial. Downtime due to anything can be millions of dollars per hour in losses. So it comes down to what could cause more monetary loss, downtime due to hacks or downtime due to bugs introduced.

"If it works why break it".

It's the same reason why there are production lines where physically, hardware could be replaced and improved to reduce wear and power usage, improve reliability, etc (even things like motors) but it isn't, for decades. Only when things start costing money outside the original design do engineers perform upgrades.

Exactly. You can say "oh, just take the machine out of production for a bit to do software patching". However that reduces the uptime for your tool, which will require buying more of them to get the same throughput. More tools is not only capex, but also space on the factory floor.

This is extra true for my company. We sell sorters, which don't actually do anything useful in terms of processing the wafers. The tools just shuffle the wafers between FOUPs. While these tools help customers optimize processes by changing batch sizes and such, they really want to have as few non-process tools as possible. So uptime is really important.

Software doesn't exist in a vacuum, it needs support and maintenance, which somebody has to pay for.

Can we stop repeating this as a strength of OSS? It definitely is not. "Users" have a problem they want solved; they don't want the additional problem of learning how to patch software.

With OSS, it is at least possible for a new vendor to step in with competing support and maintenance if the old vendor is failing. It may also be possible for customers to participate in supporting and maintaining the software to varying degrees, for example by hiring the core maintainers; whether that is realistic depends on the size and complexity of the software, the tech culture of the customer, etc.

If the "user" is the world's largest independent foundry, I'm sure they could find someone to patch the software they're using.

> these tools run primarily on Windows, and, without anti-virus protection

Back when I made rocket things, we used an RTOS. Are real-time operating systems (like QNX) not widely used?

They should be (and are in some industries, apparently not in this one).

But QNX's real-time capabilities are just one nice part of the mix; the userland drivers and impossible-to-crash kernel are equal partners in giving it the stability that it has. Process isolation for drivers should be mandatory.

So it is possible to get malicious code execution on the fabs that build the next generation of processors for our devices. What are the chances that a sophisticated Stuxnet-like attack inserting backdoors in the CPU design will follow?

It would be much easier to do this at an earlier stage - before the design is handed off to the foundry. Modifying masks for even trivial changes is very difficult and that's assuming knowledge of the circuit (which your malicious code probably won't have). Plus I imagine the verification tools would spot the difference, flagging it as an imperfection which would then attract scrutiny. I think this is extremely unlikely.

Some backdoors can be almost impossible to detect with any kind of traditional wafer inspection tools. E.g., it might be possible to transparently cripple something like Intel's RdRand instruction by changing only the doping on some of the gates inside the hardware RNG circuit. It would look completely identical under the best microscope and looking at the output wouldn't give you a good indication of whether or not you had a broken RNG or a good one. The design of Intel's RNG involved passing the output of the RNG through some whitening and IIRC AES so it might look completely random and pass every statistical test but the hardware entropy source might only realistically have 10 bits per second of real entropy.

I still think it's unlikely that someone would go through the effort to insert a super secret backdoor by surreptitiously modifying the masks but there are some interesting techniques that can make finding a backdoor almost impossible if you don't already know about it.

Sounds like the really cool "Stealthy Dopant-Level Hardware Trojans" paper from a few years ago. I'd forgotten about that, well worth a read for those interested.

However, you must remember that as well as precise knowledge of the circuit, anyone other than an insider has close to zero chance of getting a doping change correct. Even an insider's chances are very low.

And then your high-level testing is in place too. It's not common to have circuits where BIST can't give you full confidence but it happens; for things like the RNG here, also due to test tool limitations etc. So you must be doing both general and highly targeted post-manufacturing testing also. The extent of that testing may depend on your market.

Well, Stuxnet was written with knowledge of a very specific SCADA setup. Not so farfetched to think a targeted trojan would be written with knowledge of the circuit.

Editing circuits is hard enough with complete information. If the intent is to add an exciting extra feature, that seems rather difficult to do via malware. If you're trying to just sabotage a bit, that's more likely -- maybe expose for slightly less time and things are less reliable, etc. Like others said though, it's probably easier to mess with the designs before they go to the Fab.

What if they do have complete information? If you had spies working at all levels, you could create your own parallel version of the end result all the way down to the silicon level. Then, you would actually do the modification at the silicon level, to make the backdoor hardest to find.

At that point you just insert a spook into the foundry as an employee and have them swap the files. No need to waste time or uncertainty.

Sabotage? Probably highly likely. Even a power blip can scrap millions of dollars in product. You don't even have to hack a FAB: hack the local substation... hell, hack the local airport or port and you can put them in discomfort.

Actual backdoor? Almost impossible.

A backdoor would have to change the interconnections within the chips. That isn't a function performed by the FAB software. The chip circuitry is designed much earlier. Once validated and released it is then laser / ion beam etched into a series of 'masks.' The masks represent the different layers of the chip that make up the interconnections. They are used in photolithography and are typically made elsewhere and then used thousands of times. Like a photo.

To put a backdoor in the chip you would have to, undetected, hack the design side of a chip company. And do it early enough that, without being noticed, the backdoor is written into the masks, which then get built into all of the chips and still remain unnoticed.

It would be much easier to bribe a couple employees.

Edit: a simple analogy - let's say you hack my oven. You can burn my cake or undercook it and give me food poisoning. But you can't swap the salt for sugar in the pan unless you actually had physical access.

I would say it’s more likely that back doors have been injected via good old traditional espionage, which could be aided by sophisticated malware (say, to gain access to source code repositories).

It's apparently not hard; the US did it to the USSR in the 70s, resulting in one of the largest non-nuclear explosions in the history of mankind.

Are you referring to the trans-siberian pipeline explosion in 1982? The claim is that it was caused by the CIA sabotaging the software used for controlling pipeline pressure. If this is true (seems to be partly rumour/leaks) then that's the work of a nation state, so I'm not convinced it qualifies as "not hard".


From the technical side it wasn't exactly a difficult feat. Sure getting it done was mostly logistical.

Wait what? Do you have a link with more info?

What is the motivation of the crime?

Possible bad actors and motivations:

- Competing fabs
- Competitors to Apple
- China (nationalist sentiment or something to do with the recent IP theft court cases)

Well, rip prices for whatever they're manufacturing.

Am I the only one bothered that the largest pure-play fab in the world is degraded to an Apple supplier? This has consequences far beyond Apple.

I think all headlines should be written like this. It adds a nice dystopian undertone:

Water feature near Apple’s new Cupertino headquarters set to rise tens of feet over the next century, displacing engineers in low-lying areas.

Political instability set to double the prices of raw materials used in domestically produced Mac Pros; iPhones largely unaffected.

Suspected cancer link discovered with cafeteria subcontractor Monsanto’s Roundup. Rumors suggest Apple will move to in house agricultural production. Monsanto down 5 points.

Orchards and markets around the world continue to blatantly dilute Apple's trademark.

"Hey, kid, don't bite that Apple®!"

Who knew paperclip maximization could be dystopian...

Yep, this is a pretty odd way to describe TSMC especially considering that Apple doesn’t even consume a large percentage of capacity. 3rd largest semiconductor company in the world, largest non-affiliated fab, first to roll out a 7nm process node and degraded to an “Apple supplier”, but I guess Bloomberg needs a hook for readers to care.

I think an even better headline would be "World's biggest contract fab shut down by virus" but I can't change it now.

What's a better, concise way to describe TSMC? I think "World's biggest contract fab" can probably be improved on.

Chip makers who fabricate fabless companies' chips are called 'Foundries'.

Yes, the connection is weird.

Prompted me to look up the numbers though. Apple could buy TSMC for cash, four times over... pretty incredible.

Apple can buy anyone but it doesn’t matter, TSMC produces nearly 20 million wafers a year, Apple is a drop in the ocean when it comes to their production capacity.

Holy... this _is_ incredible.

No, they can't. Nobody will sell them TSMC, and even if they would, Chinese govt will shut that down.

Just to check, are you saying the government of the People's Republic of China will shut that down, or the government of the Republic of China?

I meant another China, but given how NXP case went, nothing is impossible these days.

Taiwan is the latter. Taiwanese people consider themselves Chinese (both ethnically and nationally).

I call BS on that

Regarding the 2019 Taiwan Independence referendum, the Foundation’s data shows that 20.7% of the population are incredibly in favor of the vote and 28.8% generally approve, indicating that over half of the population support the referendum.


Most Taiwanese consider Taiwan, China separate countries, poll suggests


Nearly 70 percent of Taiwanese are willing to go to war if China were to attempt to annex Taiwan by force


How do you call BS? Yes of course they consider the two separate countries. The issue isn’t that. The issue is which country is the historical continuation of “China” after WWII, and that’s not even what I’m talking about. I’m talking about ethnic and national identity; the argument over which government of China is legitimate was not part of the conversation.

Taiwanese people are Chinese people, as is much of Singapore and Hong Kong.

I’m not sure what your referendum links have to do with anything.

Except of course for the aboriginal Taiwanese, who viewed Chiang Kai-Shek and company to be occupiers.

I would also suggest that saying the people of Taiwan consider themselves to be Chinese, ethnically and nationally, is true but not so helpful, given that the other China is similarly full of people who consider themselves to be Chinese, ethnically and nationally. What you said is true, but missing an awful lot of important distinction.

Well, better occupy than exterminate like in the US. Ethnic Chinese had also migrated to the island long before CKS.

I’ve never heard anyone from Taiwan refer to their government as the “Chinese government”

Well their passport has “Republic of China” printed on it. They officially have a claim over the physical land of Mainland China but deny mainland Chinese citizenship. For practical purposes most people would refer to it as Taiwan so as not to confuse laymen.

I was just going to say that. Seriously, wtf - this reads as if the only fallout were Apple hipsters having their summer ruined, and not the entire CPU & GPU supply for pretty much everything being jeopardized ...

Intel have their own fabs. AMD are primarily a GlobalFoundries customer (GF was spun off from AMD in 2009) but do source some dies from TSMC. Nvidia are primarily a TSMC customer, but Reuters have reported that they have contracted Samsung to produce 14nm dies.

Lost production at TSMC is a serious issue, but it's not cataclysmic for the supply chain as a whole.

The new 7nm Rome is expected to exclusively use TSMC, but this won't affect that production. Samsung and Nvidia would be the most affected, but apparently most of TSMC's facilities are already back up with the remaining to be up by tomorrow. So less than 2 days lost production probably will not significantly affect anyone. It's a good thing this wasn't a targeted attack with a payload to damage hardware and put them out for weeks or more. That would have serious repercussions.

How can we assume that the other fabs have spare capacity to soak up new ex-TSMC work?

Did Samsung boot other customers or products so NVidia could stay afloat? I’m sure NVidia’s margins are high enough to get others to slow down their output.

This kind of issue happens in pharmaceuticals all the time: one factory goes down, the others can’t soak up the extra orders, and months down the line we start running out of finished product after the warehouse gets emptied.

The TSMC factories are mostly back online though. At worst there's just a one week delay on lead times now.

AFAIK Qualcomm is going to TSMC for its next gen, which is far more chips than even Apple gets.

We've taken that bit out of the title.

Tech is Apple and Apple is tech...

I have long noticed that Apple "related" news make it to front pages that usually do not carry any tech news at all.

My understanding for why this happens is that media outlets are old school Apple strongholds. In large part because the current generation of media people have been educated by a generation that learned their trade by rote on gray scale Macs.

Mental inbreeding, basically...

Bets on whether Intel is behind this?

In other news: "Computer Virus Cripples Nvidia Chipmaker TSMC Plants"

Now if the headline had said "AMD's chipmaker crippled by virus", I would have suspected Intel of industrial sabotage. Of course fewer people would read that title, I'm guessing.

There is honestly no way Intel would be stupid enough to do that. I hope.

Definitely not Intel, but remember those CTS Labs clowns engineering a media campaign touting fake CPU flaws? https://wccftech.com/report-alleges-amd-ryzen-epyc-cpus-suff...

Wonder who shorted AMD/Nvidia recently

1. tsmc does a ton of fabbing, including, but not limited to, amd

2. intel aren't that stupid
