
Intel’s Atom C2000 chips are bricking products, and it’s not just Cisco hit - Dotnaught
http://www.theregister.co.uk/2017/02/06/cisco_intel_decline_to_link_product_warning_to_faulty_chip/
======
jepler
Reminds me of the Sandy Bridge SATA flaw.

"The problem in the chipset was traced back to a transistor in the 3Gbps PLL
clocking tree. The aforementioned transistor has a very thin gate oxide, which
allows you to turn it on with a very low voltage. Unfortunately in this case
Intel biased the transistor with too high of a voltage, resulting in higher
than expected leakage current. Depending on the physical characteristics of
the transistor the leakage current here can increase over time which can
ultimately result in this failure on the 3Gbps ports."

[http://www.anandtech.com/show/4143/the-source-of-intels-cougar-point-sata-bug](http://www.anandtech.com/show/4143/the-source-of-intels-cougar-point-sata-bug)

~~~
yuhong
I wonder how many would even bother to get it replaced if it was discovered
say only a year after launch.

~~~
throwaway7767
I had one of those. The shop I bought it from refused to replace it from their
inventory; all they would do was take the motherboard, send it back, and then
give me the replacement some weeks later when the RMA process was completed.

Since I needed that machine functioning, I never replaced it (the mobo had
some extra SATA ports handled by a different controller, so they kept working
and I switched to using them). I suspect a lot of people are in the same boat.
I'll never do business with that store again.

------
leonroy
_sigh_ The perils of maintaining my own data center in the basement for 'fun'
are coming to haunt me.

I have 2x C2758 Supermicro boxes running core routing services and a Synology
RS2416+ for storage - all on the affected CPU list - guess I better double
check my backups are working and allocate some funds for replacement kit in
case things go belly up!

------
adrr
If this is related to the Cisco clock signal component issue, Cisco is
handling it in a really poor way. No replacements unless it's under warranty,
even though it's a known issue.

[http://www.cisco.com/c/en/us/support/web/clock-signal.html#~faqs](http://www.cisco.com/c/en/us/support/web/clock-signal.html#~faqs)

~~~
freehunter
Under warranty or anyone who has a TAC subscription. Cisco licensed their
products with a ToS that says you can't resell it, and warranties are only
valid for people who bought directly from Cisco. You also can't (effectively)
get TAC support for a resold device (what Cisco folks call "grey market").
It's not illegal to buy secondhand Cisco product, Cisco just won't support
them or let you get software upgrades without paying them a ton of money.

100%, this wording is to make sure that grey market buyers aren't covered
under the replacement. Basically anyone who bought from Cisco will be able to
get a replacement.

~~~
kuschku
How do they handle this in the EU, where a 2-year warranty, even if resold, is
mandatory?

~~~
msh
That doesn't cover business buyers, who I guess are most of Cisco's buyers; it
only covers consumers.

~~~
lostlogin
Wonder how that is applied in New Zealand, where the Consumer Guarantees Act
basically requires sellers to sort problems out within "a reasonable time
frame" of sale. It's a fantastic piece of legislation.

~~~
antod
The CGA only applies to consumers and not business customers, and only applies
between the end reseller and the customer. Cisco doesn't really sell directly
to consumers.

------
yuhong
Why don't they name the customer/supplier when it is obvious once the product
is taken apart? Even with the Cisco DDR SDRAM fiasco, it wasn't that hard to
figure out that it was Micron DDR SDRAM that was at fault.

~~~
wmf
No one can afford to name and shame Intel due to potential retaliation.

~~~
ticviking
So why keep using Intel?

~~~
lostlogin
I'm likely revealing my ignorance, but what's the alternative?

~~~
AstralStorm
ARM, MIPS. (multiple routers use these) Custom FPGA softcore even. Maybe extra
ASICs. Cisco is big enough for that. However I suspect Intel might be cheapest
to get for performance.

------
tyingq
Not great for Intel. These Avoton Atoms were the first Atom chips with
respectable performance, so there was a chance to unsully the Atom name.

Also, I'm reasonably sure many of these were sold already permanently affixed
to the motherboard, so the fix may be worse than swapping out just a CPU.

~~~
wtallis
I don't think the processor cores in the Avoton chips were anything
impressive, but these were the first Atom chips with a lot of I/O bandwidth.

From what I can tell, all of the Avoton chips were sold in a BGA package that
required them to be soldered to the motherboard. There isn't a socketed
version of Avoton.

------
chiph
Atom models affected:

C2308, C2338, C2350, C2358, C2508, C2518, C2530, C2538, C2550, C2558, C2718,
C2730, C2738, C2750, and C2758.

How to tell what CPU your Synology NAS has:

[https://www.synology.com/en-us/knowledgebase/DSM/tutorial/General/What_kind_of_CPU_does_my_NAS_have](https://www.synology.com/en-us/knowledgebase/DSM/tutorial/General/What_kind_of_CPU_does_my_NAS_have)

My 3-month old 1815+ is on the list...
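
The model list above can be checked mechanically. Here is a minimal Python sketch, assuming a Linux-like system where `/proc/cpuinfo` exposes a `model name` line; the function names and the sample model string are illustrative assumptions, not something from Intel or Synology:

```python
import re

# Affected Atom C2000 models, copied from the list in this comment.
AFFECTED_MODELS = {
    "C2308", "C2338", "C2350", "C2358", "C2508", "C2518", "C2530", "C2538",
    "C2550", "C2558", "C2718", "C2730", "C2738", "C2750", "C2758",
}


def find_affected_model(model_name):
    """Return the affected model token found in a CPU model string, or None."""
    for token in re.findall(r"\bC2\d{3}\b", model_name.upper()):
        if token in AFFECTED_MODELS:
            return token
    return None


def cpu_model_from_proc(path="/proc/cpuinfo"):
    """Read the first 'model name' line; return None if it cannot be read."""
    try:
        with open(path) as f:
            for line in f:
                if line.lower().startswith("model name"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        return None
    return None


if __name__ == "__main__":
    name = cpu_model_from_proc()
    if name is None:
        print("Could not read a CPU model name on this system.")
    else:
        hit = find_affected_model(name)
        print(f"{name}: {'on the affected list (' + hit + ')' if hit else 'not on the list'}")
```

This only matches the model string against the list; it says nothing about whether a given unit has a board-level workaround or a fixed stepping.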

------
fulafel
"slightly higher expected failure rates under certain use and time
constraints" sounds like it shouldn't be observable in the field. Do people
suspect Intel are lying or is this a storm in a teacup?

~~~
AstralStorm
Milquetoast words to stem panic. Ineffective.

------
aeturnum
Well, I guess it's time to replace my otherwise perfectly-good Synology NAS on
the double.

~~~
tyingq
This looks interesting:
[https://forum.synology.com/enu/viewtopic.php?f=7&t=119727&start=60](https://forum.synology.com/enu/viewtopic.php?f=7&t=119727&start=60)

See the last couple of posts in the thread as well.

Sounds like you can get an RMA, but it's a slow process.

~~~
aeturnum
Realistically, I'm not interested in an RMA for another DS1815 that will fail,
followed by (possibly) another RMA once the problem is fixed in silicon. I'm
also very uncertain about slotting the drives into a new unit and successfully
recovering the RAID.

Instead, I'll shut down the NAS and buy a replacement from another company
(QNAP probably) and transfer the data. The other options feel too risky.

~~~
digler999
I'm guessing the CPU must be soldered to the board on these? I have a 1815+
that is currently working, and now I'm afraid to shut it off. I wonder if the
DS2015 (not sure of #, the 10GbE model) uses this defective part?

~~~
aeturnum
I would imagine - I haven't taken the unit apart.

~~~
digler999
Probably to save money. That sucks, because even if you swapped it with
another defective one, it would be worth it if you just had to replace the CPU
every 18 months.

~~~
tyingq
Learned in another area of this post that yes, they are soldered on, but
that's Intel's choice. They only offer the CPU in a BGA (ball grid array) form
factor. There's no such thing as a BGA socket, other than some specialty test
unit things that aren't suitable for real world use.

------
kev009
I heard a rumor that something very similar was detected on upcoming Xeon
SKUs, but they will be implementing the board level workaround.

------
gens
Somewhat off-topic:

Are Cisco products worth the money?

Personally I haven't had that much experience with their stuff, but I remember
seeing a brand new router running hot with two fans blowing in it (other
routers at the time were 2x smaller without fans). I understand that Cisco
_should_ be the de facto networking standard, but is it really worth the name?

~~~
tracker1
Probably.. but that comes down to a lot of factors though. It also depends on
what kind of gear you're looking to buy. Everyone will try to be compatible
with Cisco's interpretation of a given standard, if you go elsewhere, it may
or may not be 100% compatible with your other equipment. Also, more IT
networking guys will be more familiar with Cisco.

That said, they are definitely more expensive than their peers. But then
again, an Escalade is more expensive than a Tahoe.

------
leonroy
ServeTheHome surveyed a bunch of affected vendors to get some more information
on the issue: [https://www.servethehome.com/intel-atom-c2000-series-bug-quiet/](https://www.servethehome.com/intel-atom-c2000-series-bug-quiet/)

And some technical specifics behind the problem:
[https://slashdot.org/comments.pl?sid=10214953&cid=53819967](https://slashdot.org/comments.pl?sid=10214953&cid=53819967)

 _Can't post to The Register, since they don't have ACs.

Anyway, the issue is damage to the LPC (low-pin-count) bus clock line. This is
a secondary bus where you hang old ISA-style devices, like the system FLASH.
If the FLASH is the only thing in there, it will mostly render the system
unbootable (so, stuff that never gets power-cycled would just keep going). But
LPC can generate interrupts, and one often hangs other crap to that bus, such
as i2c controllers for hot-swap bays, motherboard management controllers, and
other sensors. In that case, you can expect severe runtime misbehavior.

The issue is caused by "continuous degradation due to use", so repairing it is
easy, if costly: replace the motherboard with a new one under warranty (and
even if out of warranty period wherever this kind of "stealth" manufacturing
defect is not subject to warranty time period limitations, such as in Brazil).
It will "reset" the counter. This is your zero-day solution to the issue.

Depending on time-to-market for the new stepping (hardware revision) B1/C0 of
the Atom C2000, you might need an interim solution, which is the "platform-
level change", i.e. redesigned board with extra components that work around
Intel's hardware design error. As soon as you have these, you start using
these to replace any boards returned due to the defect, or start a "recall" to
preemptively replace boards.

Depending on the total cost of the board plus other components, you keep the
old boards you replaced around, and when revision B1/C0 of the Atom C2000 is
out, you BGA-replace them in a factory (about US$ 25 per board in large
volumes, if that much), maybe replace any liquid electrolytic capacitors and
other crap that ages badly, and use the boards either as new or as
refurbished, depending on your corporate/regulatory ethics. This kind of
repair almost always really resets the board's MTBF. If Intel supplies the
replacement Atoms at no charge, the cost of repair might well be far less than
the cost of the production run for boards you'd want to keep around for
warranty services, anyway.

Mind you, at 1.5 years per failure, legislation or a contract that forces more
than one replacement will be rare... so, let's hope they don't replace a
faulty board with a brand-new, virgin, but-still-timebombed board. You'd have
trouble replacing it a second time if it fails after the warranty period._
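
The warranty arithmetic in that last quoted paragraph can be sketched in Python. This is a rough illustration only, assuming every un-fixed board fails at exactly the quoted 1.5 years and that a replacement does not restart the warranty clock (both simplifications; the function name is hypothetical):

```python
def failures_within_warranty(warranty_years, years_to_failure=1.5):
    """Count failures of successive un-fixed boards inside the original warranty.

    Assumes each board fails exactly `years_to_failure` after it is installed
    and that the warranty window is not extended by a replacement.
    """
    if years_to_failure <= 0:
        raise ValueError("years_to_failure must be positive")
    count = 0
    elapsed = years_to_failure  # time of the first failure
    while elapsed <= warranty_years:
        count += 1
        elapsed += years_to_failure  # the replacement board starts its own clock
    return count


if __name__ == "__main__":
    for warranty in (1, 2, 3):
        covered = failures_within_warranty(warranty)
        print(f"{warranty}-year warranty: {covered} failure(s) covered")
```

Under these assumptions a 2-year warranty covers only the first failure: a still-timebombed replacement installed at 1.5 years would fail around the 3-year mark, after coverage has ended.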

------
myrandomcomment
Arista uses AMD and Intel. Their 1st switch (7124S & 7148SX) was a dual core
AMD.

------
Jaecen
This title seems incorrect. The article doesn't specify any other vendors or
products that have been directly affected by this issue.

~~~
kyrra
The title is technically correct, just annoyingly written. As someone who's
built a pfSense box using a Supermicro board with one of the affected chips,
I'm definitely sad that I'll have to rip it apart to replace the parts.

~~~
ovidiup
I have the same problem: I'm using various C2000-based Supermicro boxes
running pfSense. The most cost-effective DIY, rack mountable solution for a
pfSense box was until now SYS-5018A-FTN4. Do you know if Supermicro issued a
technical bulletin about this box?

~~~
dhess
Last Friday, my OpenBSD firewall, which runs on a SYS-5018A-FTN4, mysteriously
crashed. I chalked it up to an alpha particle or something and rebooted. About
12 hours later, it failed again. This time I did some more digging. On the
console was the following message:

    
    
      NMI ... going to debugger
      Stopped at    acpicpu_idle+0x22d:     nop
      ddb{0}>
    

I googled it and found one similar report on the OpenBSD misc mailing list
from September 2016 [1]. Interestingly, the person who reported the bug was
running the same Supermicro board as I was. The report didn't get anywhere
other than a vague suggestion that it might be heat related. These boxes run
very cool and I didn't think that was likely. I thought it might be a RAM
issue and that it was probably just a coincidence that the other person had
the same hardware as I, but now I'm inclined to think that both of us have
experienced the issue described in TFA.

Seems like I'll be looking for new firewall hardware.

[1] [https://www.mail-archive.com/misc@openbsd.org/msg149348.html](https://www.mail-archive.com/misc@openbsd.org/msg149348.html)

~~~
kingosticks
If you were able to reboot the box then you did not hit this issue. When you
hit this issue your chip is dead.

------
frik
That explains the issues with the C2000 family. Various Linux distros crash
randomly, and not just crash: sometimes they simply stop opening applications
or stop processing, e.g., apt-get.

The BIOS is a piece of shit. It's buggy, the legacy-BIOS support is unstable,
the Win7-EFI and Win8-EFI modes are not good either. I patched a Win7 DVD with
Win8 files, so that I could install Win7. Now Win7 runs great and stable - but
only after I installed various Intel drivers that fixed the hardware flaws.

I am seriously looking forward to the upcoming new AMD CPU. Intel did barely
anything in the last five years; a 2011 high-end CPU is almost as fast as
Intel's 2017 flagship, cost a lot less back then, and had less DRM or other
shit that is broken. Intel needs a proper competitor, so a comeback of AMD on
the one side, and Apple notebooks with ARM CPUs on the other, are very welcome
to stop Intel from sitting on their quasi-monopoly chair.

~~~
wang_li
>That explains the issues with the C2000 family. Various Linux distros crash
randomly, and not just crash: sometimes they simply stop opening applications
or stop processing, e.g., apt-get.

No it doesn't. Did you read the errata? It completely stops. There's no
weirdness. It's just dead.

~~~
AstralStorm
Random crashes are often a sign of memory corruption, or sometimes of a broken
power supply or major interference. Not of CPU problems like this one.

~~~
nowaynohow
Meh, for C2000, it can also be a sign of outdated firmware. We don't get
microcode updates for SoCs in the general distribution: either your system
vendor does a good job of keeping up with firmware updates, or you are
screwed.

