
The push for crypto without ECC RAM is a nonstop horror show. Software under normal circumstances is remarkably resilient to having its memory corrupted. However, crypto algorithms are designed so that a single bit flip effectively changes all the bits in a block. If you chain blocks, then a single bit flip in one block destroys all the blocks. I've seen companies like MSPs go out of business because they were doing crypto with consumer hardware. No one thinks it'll happen to them, and once it happens they're usually too dim to even know what happened. Bit flips aren't an act of god; you simply need a better computer.
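To make the avalanche and chaining point concrete, here's a toy sketch (SHA-256 and a simple linear chain are just illustrative stand-ins, not the actual construction any particular victim was running):

  import hashlib

  # Toy demo: flipping one bit changes the whole digest (avalanche), and in a
  # chained construction it taints every block from that point onward.
  def chain_heads(blocks):
      head = b"\x00" * 32
      heads = []
      for block in blocks:
          head = hashlib.sha256(head + block).digest()
          heads.append(head)
      return heads

  blocks = [b"block-0", b"block-1", b"block-2", b"block-3"]
  corrupted = list(blocks)
  corrupted[1] = bytes([corrupted[1][0] ^ 0x01]) + corrupted[1][1:]  # flip one bit

  for i, (good, bad) in enumerate(zip(chain_heads(blocks), chain_heads(corrupted))):
      print(i, "ok" if good == bad else "DIVERGED")  # 0 ok, 1..3 DIVERGED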


> The push for crypto without ECC RAM is a nonstop horror show

That's a bit hyperbolic.

First, ECC doesn't protect the full data chain: you can have a bitflip in a hardware flip-flop (or latch open a gate that drains a line, etc.) before the value ever reaches the memory. Logic is known to glitch too.

Second: ECC is mostly designed to protect long-term storage in DRAM. Recognize that a cert like this is a very short-term value: it's computed and then transmitted. The failure happened fast, before copies of the correct value were made. That again argues for a failure location other than a DRAM cell.

But mostly... this isn't the end of the world. This is a failed cert, which is a failure that can be reasonably easily handled by manual intervention. There have been many other mistaken certs distributed that had to be dealt with manually: they can be the result of software bugs, they can be generated with the wrong keys, the dates can be set wrong, they can be maliciously issued, etc... The system is designed to include manual validation in the loop and it works.

So is ECC a good idea? Of course. Does it magically fix problems like this? No. Is this really a "nonstop horror show"? Not really, we're doing OK.


This is just learned helplessness because Intel were stingy as shit for over a decade and wanted to segregate their product lines. Error correction is literally prevalent in every single part of every PHY layer in a modern stack, it is an absolute must, and the lack of error correction in RAM is, without question, a ridiculous gap that should never have been allowed in the first place in any modern machine, especially given that density and bandwidth keep increasing and will continue to do so.

When you are designing these systems, you have two options: either you use error-correcting codes and increase channel bandwidth to compensate for the unavoidable injected noise, or you lower the transfer rate so far as to be infeasible to use while avoiding as much noise as you can. Guess what's happening to RAM? It isn't getting slower or less dense. The error rate is only going to increase. The people designing this stuff aren't idiots; that's why literally every other layer of your system builds in error correction. Software people don't understand this because they prefer to believe in magic, I guess, since all of these abstractions mostly result in stable systems.

All of the talk of hardware flip flops and all that shit is an irrelevant deflection. Doesn't matter. It's just water carrying and post-hoc justification because, again, Intel decided that consumers didn't actually need it a decade ago, and everyone followed suit. They've been proven wrong repeatedly.

Complex systems are built to resist failure. They wouldn't work otherwise. By definition, if a failure occurs, it's because it passed multiple safeguards that were already in place. Basic systems theory. Let's actually try building more safeguards instead of rationalizing their absence.


> By definition, if a failure occurs, it's because it passed multiple safeguards that were already in place.

Having worked on a good bunch of critical systems, I can say there aren't multiple safeguards in most hardware.

E.g. a multiplication error in a core will not be detected by an external device. Or a bit flip when reading cache, or from a storage device.

Very often the only real safeguard is to do the whole computation twice on two different hosts. I would rather have many low-reliability hosts and do the computation twice than a few high-reliability and very expensive hosts.

Unfortunately the software side is really lagging behind when it comes to reproducible computing. Reproducible builds are a good step in that direction and it took many decades to get there.


Safeguards do not just exist at the level of technology but also politics, social structures, policies, design decisions, human interactions, and so on and so forth. "Criticality" in particular is something defined by humans, not a vacuum, and humans are components of all complex systems, as much as any multiplier or hardware unit is. The fact a multiplier can return an error is exactly in line with this: it can only happen after an array of other things allow it to, some of them not computational or computerized at all. And not every failure will also result in catastrophe as it did here. More generally such failures cannot be eliminated, because latent failures exist everywhere even when you have TMR or whatever it is people do these days. Thinking there is any "only real safeguard" like quorums or TMR is exactly part of the problem with this line of thought.

The quote I made is actually very specifically in reference to this paper, in particular point 2, which is mandatory reading for any systems engineer, IMO, though perhaps the word "safeguard" is too strong for the taste of some here. But focusing on definitions of words is beside the point and falls into the same traps this paper mentions: https://how.complexsystems.fail/

Back to the original point: is ECC the single solution to this catastrophe? No, probably not. Systems are constantly changing and failure is impossible to eliminate. Design decisions and a number of other decisions could have mitigated it and caused this failure to not be catastrophic. Another thing might also cause it to topple. But let's not pretend like we don't know what we're dealing with, either, when we've already built these tools and know they work. We've studied ECC plenty! You don't need to carry water for a corporation trying to keep its purse filled to the brim (by cutting costs) to proclaim that failure is inevitable and most things chug on, regardless. We already know that much.


Failing to account for bit flips and HW failure is too common in web/service coding. Look up how Google dropped a massive Bigtable instance in prod and traced it back to a cosmic-ray bit flip that turned a WRITE instruction into a DROP TABLE instruction.

I laugh when I compare my day to day coding to that of an avionics programmer in the aero industry.


The "web" coding words falls for things like leftpad, imagine talking about bit flips.

It's sad how immature the software industry can be. It's been around for "only" 60/70 years after all.


I couldn't find it. Do you have a reference?


There are some examples here [1]. Dig as I may, I cannot locate the original Google source I read about this ~3y ago.

1. https://www.usenix.org/system/files/conference/atc12/atc12-f...


> Very often the only real safeguard is to do the whole computation twice on two different hosts.

Three different hosts, for quorum, right?


Depends on the system. In this case it seems like retries are possible after a failure, so two is sufficient to detect bad data. You need three in real time situations where you don't have the capability to go back and figure it out.


Spot on. Very often doing an occasional (very rare) retry is acceptable.

Sometimes doing the same processing twice can be also a way to implement safe(r) rolling updates.


Two hosts is efficient. Do it twice on two different hosts and then compare the results. If there is a mismatch, throw it away and redo it on two hosts; a total of 4 computations are needed, but only if the difference really was due to a bit flip, the chance of which is exceedingly rare. In all the rest of the cases, you get away with two instead of three computations.
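A rough sketch of that compare-and-retry scheme (run_on_host and the job signature are invented for illustration; real dispatch would go over the network to separate machines):

  # Run the job on two hosts, accept on agreement, and only on a mismatch
  # throw both results away and redo the work on a fresh pair of hosts.
  def run_on_host(host, job, data):
      return job(data)  # placeholder: pretend this executes remotely on `host`

  def redundant_compute(job, data, host_pairs):
      for a, b in host_pairs:
          r1 = run_on_host(a, job, data)
          r2 = run_on_host(b, job, data)
          if r1 == r2:
              return r1  # the common case: two computations, no quorum needed
          # mismatch (e.g. a bit flip on one host): discard both results
      raise RuntimeError("results never agreed; refusing to publish")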


People who tout this don't understand the probability of bit flips. It's measured in failures per _billion_ hours of operation. This matters a ton in an environment with thousands of memory modules (data centers and supercomputers), but you're unlikely to experience a RAM bit flip more than once or twice in your entire life.

Edit: there's some new (to me) information from real world results, interesting read. https://www.zdnet.com/article/dram-error-rates-nightmare-on-...

Looks like things are worse than I thought (but still better than most people seem to think). Interesting to note that the motherboard used affects the error rate, and it seems part of it is a luck-of-the-draw situation where some DIMMs have more errors than others despite coming from the same manufacturer.
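For scale, "failures per billion hours" is the FIT unit, and the arithmetic is simple. The per-DIMM rate below is an assumed round number purely for illustration (published estimates vary enormously):

  # FIT = failures per 10^9 device-hours. ASSUMED_FIT_PER_DIMM is a made-up
  # ballpark for illustration only; substitute whatever rate you believe.
  ASSUMED_FIT_PER_DIMM = 5000
  HOURS_PER_YEAR = 24 * 365

  def expected_errors_per_year(n_dimms, fit=ASSUMED_FIT_PER_DIMM):
      return n_dimms * fit * HOURS_PER_YEAR / 1e9

  print(expected_errors_per_year(2))        # one desktop: ~0.09 per year
  print(expected_errors_per_year(100_000))  # a large fleet: ~4,400 per year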


Bit flips are guaranteed to happen in digital systems. No matter how low the probability is, it will never be zero. You can't go around thinking you're going to dodge a bullet because it's unlikely. If it weren't for the pervasive use of error detection in common I/O protocols, you would be subjected to these errors much more frequently.


We're talking specifically about ECC ram, which solves the specific problem caused by cosmic rays (and apparently bad motherboard design). IO protocol error correction is a totally different problem.


I was going to comment, but you edited your post. Yes, it is worse than we usually think on the software side.


I still think it made sense up until now not to bother with it on consumer hardware, and maybe even at this point. The probability of your phone having a software glitch needing a reboot is way higher. Now that it's practically a free upgrade? It should be included by default. But I still don't think the fact that it has been this way for so long is nearly as nefarious as people make it out to be.


I don’t think they are an issue in a phone, but in a system like a blockchain that goes through so much effort to achieve consistency, the severity of the error is magnified hence the lesser tolerance for the error rate.


'stingy as shit' or maximising short-term shareholder value -- the hardware is possibly not the only broken model here


You're probably right that I'm giving them a bit too much credit on that note. Ceterum censeo, and so on.


> This is a failed cert, which is a failure that can be reasonably easily handled by manual intervention.

This isn't a misissued cert that can be revoked, it permanently breaks the CT log in question since the error propagates down the chain.


Yes, this kills Yeti 2022. There's a bug referenced which refers to an earlier incident where a bitflip happened in the logged certificate data. That was just fixed. Overwrite with the bit flipped back, and everything checks out from then onwards.

But in this case it's the hash record which was flipped, which unavoidably taints the log from that point on. Verifiers will forever say that Yeti 2022 is broken, and so it had to be locked read-only and taken out of service.

Fortunately, since modern logs are anyway sharded by year of expiry, Yeti 2023 already existed and is unaffected. DigiCert, as log operator, could decide to just change criteria for Yeti 2023 to be "also 2022 is fine" and I believe they may already have done so in fact.

Alternatively they could spin up a new mythical creature series. They have Yeti (a creature believed to live in the high mountains and maybe forests) and Nessie (a creature believed to live in a lake in Scotland) but there are plenty more I'm sure.


It doesn't break anything that I can see (though I'm no expert on the particular protocol). Our ability to detect bad certs isn't compromised, precisely because this was noticed by human beings who can adjust the process going forward to work around this.

Really the bigger news here seems to be a software bug: the CT protocol wasn't tolerant of bad input data and was trusting actors that clearly can't be trusted fully. Here the "black hat" was a hardware glitch, but it's not hard to imagine a more nefarious trick.


Your statement is, to be frank, nonsensical. The protocol itself isn't broken, at least for previous Yeti instances; certificate data are correctly parsed and rejected.* In this instance, it seems that the data was verified pre-signing BUT was flipped mid-signing. This isn't the fault of how CT was designed but rather a hardware failure that requires correction there. (Or at least that's the likely explanation; it could be a software bug°, but it would be very consistent and obvious behaviour if it were indeed a software bug.)

On the issue of subsequent invalidation of all submitted certificates, this is prevented by submitting to at least 3 different entities (as of now, there's a discussion about whether this should be increased), so if a log is subsequently found to be corrupted, the operator can send an "operator error" signal to the browser, and any tampered logs are blacklisted from browsers. (Note that all operators of CT lists are members of the CA/B forum, at least as of 2020. In the standardisation phase, some individuals operated their own servers, but this is no longer true.)

* Note that if the cert details are nonsensical but technically valid, it is still accepted by design, because all pre-certificates are countersigned by the intermediate signer (which the CT log operator checks from known roots). If the intermediate is compromised, then the correct response is obviously a revocation and possibly distrust.

° At least the human-induced variety, you could say that this incident is technically a software bug that occurred due to a hardware fault.


Presumably it's possible to code defensively against this sort of thing, by e.g. running the entire operation twice and checking the result is the same before committing it to the published log?
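Something along those lines, as a hedged sketch; the log object and its methods are invented for illustration (and the head is modelled as a simple running hash, not the real CT Merkle structure):

  import hashlib

  # Defensive wrapper: compute the new log head twice and refuse to publish
  # unless both runs agree. Ideally the second run happens on different
  # hardware, since a faulty CPU can repeat the same mistake deterministically.
  def new_head(prev_head: bytes, entry: bytes) -> bytes:
      return hashlib.sha256(prev_head + entry).digest()

  def append_checked(log, entry: bytes):
      h1 = new_head(log.head, entry)
      h2 = new_head(log.head, entry)
      if h1 != h2:
          raise RuntimeError("head mismatch, possible bit flip; not publishing")
      log.publish(entry, h1)  # hypothetical API on the hypothetical log object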


Big tech companies like Google and Facebook have encountered problems where running the same crypto operation twice on the same processor deterministically or semi-deterministically gives the same incorrect result... so the check needs to be on done on separate hardware as well.


I don't think that matters in this case, because the entire point of the log machine is to run crypto operations. If it has such a faulty processor it is basically unusable for the task anyway.


So I'm learning about Yeti for the first time, but I don't buy that argument. Corrupt transmitted data has been a known failure mode for all digital systems since they were invented. If your file download in 1982 produced a corrupt binary that wiped your floppy drive, the response would have been "Why didn't you use a checksumming protocol?" and not "The hardware should have handled it".

If Yeti can't handle corrupt data and falls down like this, Yeti seems pretty broken to me.


Not handling corrupted data is kind of the point of cryptographic authentication systems. Informally and generally, the first test of a MAC or a signature of any sort is to see if it fails on arbitrary random single bit flips and shifts.

The protocol here seems to have done what it was designed to do. The corrupted shard has simply been removed from service, and would be replaced if there was any need. The ecosystem of CT logs foresaw this and designed for it.


So... Yeti isn't broken then? Seems like the protocol does handle it? Seems like there's some confusion on this point in this thread.


The Yeti2022 log is corrupted due to the random event. This has been correctly detected, and is by design and policy not fixable, since logs are not allowed to rewrite their history (ensuring that they don't is very much the point of CT). That the log broke is annoying but not critical, and the consequences are very much CT working as intended.

You can argue if the software running the log should have verified that it calculated the correct thing before publishing it, but that's not a protocol concern.


I think Yeti2022 is just the name for one instance of the global CT log? Nick Lamb could probably say more about this; I understand CT mostly in the abstract, and as the total feed of CT records you surveil for stuff.


Yeti is one of the CT logs (and Yeti2022 the 2022 shard of it, containing certs that expire in 2022). CT logs are independent of each other, there is not really a "global log", although the monitoring/search sites aggregate the data from all logs. Each certificate is added to multiple logs, so the loss of one doesn't cause a problem for the certs in it. (Maybe it's also possible to still trust Yeti2022 for the parts of the log that are well-formed, which would decrease the number of impacted certs even more; I'm not familiar enough with the implementations to say.)


Using ECC is a no-brainer. Even the Raspberry Pi 4 has ECC RAM. It's not particularly expensive and only an artificial limitation Intel has introduced for consumer products.



(See my reply to parent.)


The Pi 4 uses DRAM with on-die ECC, which AFAIK does not provide any means of reporting errors (corrected or uncorrected) to the SoC's memory controller. It is effectively a cost-saving measure to improve DRAM yields. As such, it does little to guarantee that there are no memory errors.


That's a bit hyperbolic.

Would ECC have avoided the issue in this case? If so, then it's hard not to agree that it should be considered a minimum standard. It looks like Yeti 2022 isn't going to survive this intact, and while they can resolve the issues in other ways, not everyone will always be so fortunate, and ECC is a relatively small step to avoid a larger problem.


> That's a bit hyperbolic.

I agree. That said,

> First, ECC doesn't protect the full data chain, you can have a bitflip in a hardware flip flop (or latch open a gate that drains a line, etc...) before the value reaches the memory. Logic is known to glitch too.

Of course. DRAM ECC protects against errors in the DRAM cells. That doesn't mean other components don't have other strategies for reducing errors which can form a complete chain.

Latches and arrays and register files often have parity (where data can be reconstructed) or ECC bits, or use low level circuits that themselves are hard or redundant enough to achieve a particular target UBER for the full system.

> Second: ECC is mostly designed to protect long term storage in DRAM. Recognize that a cert like this is a very short-term value, it's computed and then transmitted. The failure happened fast, before copies of the correct value were made. That again argues to a failure location other than a DRAM cell.

Not necessarily. Cells that are sitting idle apart from refresh have certain error profiles, but so do ones under constant access. In particular, "idle" cells that are in fact being disturbed by adjacent accesses certainly have a non-zero error profile and need ECC too.

My completely anecdotal guess would be this error is at least an order of magnitude more likely to have occurred in non-ECC memory (if that's what was being used) rather than any other path to the CPU or cache or logic on the CPU itself.


If a system is critical it should run on multiple machines in multiple locations and "sync with checks" kinda like the oh so hated and totally useless blockchains.

Then if such a bit flip occurred, it would never occur on all machines at the same time in the same data. And on top of that, you could easily make the system fix itself if something like that happens (simply assume the majority of nodes didn't have the bit flip), or in the worst-case scenario it could at least stop rather than making "wrong progress".

I have no clue what this particular log needs for "throughput specs", but I assume it would be easily achievable with current DLT.


Safety critical systems are not a good fit for a blockchain-based resolution to the Byzantine General problem. Safety critical systems need extremely low latency to resolve the conflict fast. So blockchain is not going to be an appropriate choice for all critical applications when there are multiple low-latency solutions for BFT at a very low ms latency and IIRC, microseconds for avionics systems.


Not sure what you mean by "safety critical"; I made no such assumption. Also, since the topic is a (write-only) log, it probably doesn't need such low latency. What it more likely needs is finality: once an entry is made and accepted, it must be final and of course correct. DLTs can do this in a distributed, self-fixing way, i.e. a node that tries to add faulty data is overruled and can never get a confirmation for a final state that would later turn out to be invalid.

Getting the whole decentralised system to "agree" will never be fast unless we have quantum tech. There is simply no way servers around the globe could communicate in microseconds, even if all communication happened at the speed of light and processing were instant; it would still take time. In reality such systems need seconds, which is often totally fine, as long as everyone only relies on data that has been declared final.


You said "if a system is critical".

I thought you were making a more general statement about all critical systems, that's all. And since many critical systems have a safety factor in play, I wanted to distinguish them as not always being a good target for a Blockchain solution to the problems of consensus.

Blockchain is a very interesting solution to the problem of obtaining consensus in the face of imperfect inputs, but there are other options, so, like anything else, you choose the right tool for the job. My own view is that, given other established protocols, blockchain is going to be overkill for dealing with some types of fault tolerance. It is a very good fit for applications where you want to minimize reliance on the trust of humans. (And other areas too, but right now I'm just speaking of the narrow context of consensus amid inconsistent inputs.)


Critical that the system operates/keeps operating/does not reach an invalid state. It could be for safety, but in general it's more to avoid financial damage. Downtime of any kind usually results in huge financial losses and people working extra shifts. This was my main point.

>...blockchain is going to be overkill for dealing with some types of fault tolerance

But in this case it likely isn't. The current system already works with a chain of blocks; it just lacks the distributed checking and all that stuff. But "blockchains" aren't some secret sauce here; in this case it's just a way to implement a distributed write-only database with field-proven tech. It can be as lightweight as any other solution. The consensus part is completely irrelevant anyway because all nodes are operated by one entity. But due to the use case (money/value) of modern DLTs ("blockchains"), they are incredibly reliable by design. The oldest DLTs that use FBA (instead of PoW/PoS) have been running for 9+ years without any error or downtime. Recreating a similarly reliable system would be months and months of work followed by months of testing.


Yep, pretty much agreed. Whatever anyone may think of cryptocurrencies, blockchains are essentially a different technology where coins are just one application. I'm kind of "meh" on cryptocurrencies (not anti, I just think they need a while more to mature), but trustless consensus is a significant innovation in its own right.


I haven't thought about whether an actual blockchain is really the best solution, but the redundancy argument is legitimate. We've been doing it for decades in other systems where an unnoticed bit flip results in complete mission failure, such as an Apollo mission crash.

I'm not really sure what Yeti 2022 is exactly, so take this with heaps of salt, but it seems like this is a "mission failure" event -- it can no longer continue, except as read only. Crypto systems, even more than physical systems like rockets, suffer from such failures after just "one false step". Is the cost of this complete failure so low that it doesn't merit ECC? Extremely doubtful. Is it so low that it doesn't merit redundancy? More open for debate, but plausibly not.

I know rockets experience more cosmic rays and their failure can result in loss of life and (less importantly) the loss of a lot more money, and everything is a tradeoff -- so I'm not saying the case for redundancy is watertight. But it's legitimate to point out there is an inherent and, it seems, under-acknowledged fragility in non-redundant crypto systems.


>I haven't thought about whether an actual blockchain is really the best solution...

Most likely not. But the tech behind FBA (Federated Byzantine Agreement) distributed ledgers would make an extremely reliable system that can handle malfunctioning hardware and large outages of nodes. And since this is a write-only log and only some entities can write to it, it could be implemented as a permissioned system, so it wouldn't have to deal with the attacks a public blockchain would face.


Technically everyone can write to it. However you can only write certain specific things.

In the case of Yeti 2022 you were only able to log (pre-)certificates signed by particular CAs trusted in the Web PKI, which were due to expire in the year 2022.

In practice the vast majority of such logging is done by issuing CAs, as part of their normal operations. But it is possible (and is done purposefully, at least sometimes) to obtain certificates which have not been logged. These certificates, of course, won't work in Chrome or Safari because there's no proof they were logged. But you can log them yourself, and get SCTs and show those Just In Time™.

This is only an interesting technical option if you have both the need to do it for some reason and the capability to send SCTs that aren't baked inside your certificates. The vast majority of punters just have the CA do all this for them and the certificate they get has SCTs baked right inside it so there's no technical changes for them at all, they needn't know CT exists.

Because the CA's critical business processes depend on writing to logs, they need a formal service level agreement assuring them that some logs they use will accept their writes in a timely fashion and meet whatever criteria, but you as an individual don't need this, you don't care if the log you wanted to use says it's unavailable for 4 hours due to maintenance.


That's pretty much the default state of any blockchain-like system. You need a private key to write to it. It's just that in most public blockchains an infinite number of new private keys can be generated, with some kind of token attached to them. For a log none of that would be needed. A central operator could hand out keys to anyone who should be able to write to it, and for everyone else it's read-only. And of course the key alone would still not allow someone to write invalid data.


> I'm not really sure what Yeti 2022 is exactly

Sometimes there are problems with certificates in the Web PKI (approximately the certificates your web browser trusts to determine that this is really news.ycombinator.com, for example). It's a lot easier to discover such problems early, and detect if they've really stopped happening after someone says "We fixed it" if you have a complete list of every certificate.

The issuing CAs could just have like a huge tarball you download, and promise to keep it up-to-date. But you just know that the same errors of omission that can cause the problem you were looking for, can cause that tarball to be lacking the certificates you'd need to see.

So, some people at Google conceived of a way to build an append-only log system, which issues signed receipts for the items logged. They built test logs by having the Google crawler, which already gets sent certificates by any HTTPS site it visits as part of the TLS protocol, log every new certificate it saw.

Having convinced themselves that this idea is at least basically viable, Google imposed a requirement that in order to be trusted in their Chrome browser, all certificates must be logged from a certain date. There are some fun chicken-and-egg problems (which have also been solved, which is why you didn't really need to do anything even if you maintain an HTTPS web server) but in practice today this means if it works in Chrome it was logged. This is not a policy requirement: not logging certificates doesn't mean your CA gets distrusted - it just means those certificates won't work in Chrome until they're logged and the site presents the receipts to Chrome.

The append-only logs are operated by about half a dozen outfits, some you've heard of (e.g. Let's Encrypt, Google itself) and some maybe not (Sectigo, Trust Asia). Google decided the rule for Chrome is, it must see at least one log receipt (these are called SCTs) from Google, and one from any "qualified log" that's not Google.

After a few years operating these logs, Google were doing fine, but some other outfits realised hey, these logs just grow, and grow, and grow without end, they're append-only, that's the whole point, but it means we can't trim 5 year old certificates that nobody cares about. So, they began "sharding" the logs. Instead of creating Kitten log, with a bunch of servers and a single URL, make Kitten 2018, and Kitten 2019, and Kitten 2020 and so on. When people want to log a certificate, if it expires in 2018, that goes in Kitten 2018, and so on. This way, by the end of 2018 you can switch Kitten 2018 to read-only, since there can't be new certificates which have already expired, that's nonsense. And eventually you can just switch it off. Researchers would be annoyed if you did it in January 2019, but by 2021 who cares?

So, Yeti 2022 is the shard of DigiCert's Yeti log which only holds certificates that expire in 2022. DigiCert sells lots of "One year" certificates, so those would be candidates for Yeti 2022. DigiCert also operate Yeti 2021 and 2023 for example. They also have a "Nessie" family with Nessie 2022 still working normally.

Third parties run "verifiers" which talk to a log and want to see that is in fact a consistent append-only log. They ask it for a type of cryptographic hash of all previous state, which will inherit from an older hash of the same sort, and so on back to when the log was empty. They also ask to see all the certificates which were logged, if the log operates correctly, they can calculate the forward state and determine that the log is indeed a record of a list of certificates, in order, and the hashes match. They remember what the log said, and if it were to subsequently contradict itself, that's a fatal error. For example if it suddenly changed its mind about which certificate was logged in January, that's a fatal error, or if there was a "fork" in the list of hashes, that's a fatal error too. This ensures the append-only nature of the log.

Yeti 2022 failed those verification tests, beginning at the end of June, because in fact it had somehow logged one certificate for *.molabtausnipnimo.ml but had mistakenly calculated a SHA-256 hash which was one bit different, and all subsequent work assumed the (bad) hash was correct. There's no way to rewind and fix that.

In principle if you knew a way to make a bogus certificate which matched that bad hash you could overwrite the real certificate with that one. But we haven't the faintest idea how to begin going about that so it's not an option.

So yes, this was mission failure for Yeti 2022. This log shard will be set read-only and eventually decommissioned. New builds of Chrome (and presumably Safari) will say Yeti 2022 can't be trusted past this failure. But the overall Certificate Transparency system is fine, it was designed to be resilient against failure of just one log.


Conceivably you could also fix this by having all verifiers special-case this one certificate in their verification software to substitute the correct hash?

Obviously that's a huge pain but in theory it would work?


You really want to make everyone special case this because 1 CT log server had a hardware failure?

This is not the first time a log server had to be removed due to a failure, nor will it be the last. The whole protocol is designed to be resilient to this.

What would be the point of doing something besides following the normal procedures around log failures?


I was just asking to check if I understood the problem correctly.


Yes, in principle you could special case all the verifiers, although that's an open set, the logs are public.


Thank you, that's an excellent description. The CT system as a whole does appear to have ample redundancy, with automated tools informing manual intervention that resolves this individual failure.


In my younger days I worked in customer support for a storage company. A customer had one of our products crash (kernel panic) at a sensitive time. We had a couple of crack engineers I used to hover around to try to pick up gdb tricks so I followed this case with interest—it was a new and unexpected panic.

Turns out the crash was caused by the processor taking the wrong branch. Something like this (it wasn’t Intel but you get the picture):

  test edx, edx
  jnz $something_that_expects_nonzero_edx
Well, edx was 0, but the CPU jumped anyway.

So yeah, sometimes ECC isn’t enough. If you’re really paranoid (crypto, milsec) you should execute multiple times and sanity-test.


I saw a kernel panic because of this _yesterday_. (Crash where the failing instruction in memory was 1 bit different from on disk.)


Sounds like a regular RAM bitflip, doesn't it?


Sure, but even with those you'd usually expect it to happen to data, which is a lot larger.


>If you chain blocks, then a single bit flip in one block destroys all the blocks. I've seen companies like MSPs go out of business because they were doing crypto with consumer hardware.

If it is caused by a single bitflip, you know the block in which that bitflip occurred and can try each bit until you find the right one. This is an embarrassingly parallel problem. Let's say you need to search a 1 GB space for a single bit flip. That only requires that you test 8 billion bit flips. Given the merklized nature of most crypto, you will probably be searching a space far smaller than 1 GB.
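A sketch of that search, assuming you have the corrupted block and a known-good digest to test candidates against (both names are illustrative):

  import hashlib

  # Try flipping each bit of the corrupted block and test against a trusted
  # digest. Embarrassingly parallel: the bit range could be split across cores.
  def recover_single_flip(corrupted: bytes, expected_digest: bytes):
      data = bytearray(corrupted)
      for i in range(len(data) * 8):
          data[i // 8] ^= 1 << (i % 8)  # flip bit i
          if hashlib.sha256(data).digest() == expected_digest:
              return bytes(data)        # found the original block
          data[i // 8] ^= 1 << (i % 8)  # flip it back
      return None  # not a single-bit error after all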

>Bit flips aren't an act of god you simply need a better computer.

Rather than using hardware ECC, you could implement ECC in software. I think hardware ECC is a good idea, but you aren't screwed if you don't use it.
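One crude way to do it (a sketch of my own, not a proper Hamming-style code): keep redundant copies of critical state and majority-vote on read, trading memory for the ability to correct any single corrupted copy.

  # Software "ECC" via triple redundancy: three copies, byte-wise majority
  # vote on read. Real ECC (e.g. SECDED Hamming codes) gets single-error
  # correction with far less overhead; this is only a simple sketch.
  def write_protected(value: bytes):
      return [bytearray(value), bytearray(value), bytearray(value)]

  def read_protected(copies):
      a, b, c = copies
      out = bytearray(len(a))
      for i in range(len(a)):
          # any byte corrupted in one copy is out-voted by the other two
          out[i] = a[i] if a[i] in (b[i], c[i]) else b[i]
      return bytes(out)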

The big threat here is not the occasional random bit flips, but adversary caused targeted bit flips since adversaries can bit flip software state that won't cause detectable failures but will cause hard to detect security failures.


Distributed, byzantine fault tolerant state machines solve that. At worst, a single node will go out of sync.


This is a great way to say "blockchain" without getting guaranteed down votes ;)


Yes, and also contrary to Wikipedia it provides a solid use case for private blockchain https://en.wikipedia.org/wiki/Blockchain#Disadvantages_of_pr...

I think it is instructive to ask "why doesn't this mention guarantee downvotes?" because I don't think it's just cargo-culting. I doubt that many of those objecting to blockchain are objecting to byzantine fault tolerance, DHT, etc. Very high resource usage in the cost function, the ledger being public and permanent (long term privacy risk), negative externalities related to its use in a currency... These are commonly the objections I have and hear. And they are inapplicable.

Extending what the Wikipedia article says, it's basically glorified database replication. But it also replicates and verifies the calculation to get to that data so it provides far greater fault tolerance. But since it is private you get to throw out the adversarial model (mostly the cost function) and assume failures are accidental, not malicious. It makes the problem simpler and lowers the stakes versus using blockchain for a global trustless digital currency so I don't think we should be surprised that it engenders less controversy.


You made me smile, however:

I'm one of those who can easily downvote blockchain stuff mercilessly. It's not reflexive, though: I reserve it for dumb ideas; it just so happens that most blockchain ideas I see come off as dumb and/or as an attempted rip-off.


The poster isn’t wrong. An entire chain shouldn’t die because of a memory error in one node.


No, 99% of the time you just need to do a computation twice on different hosts. You don't need quorum or other forms of distributed consensus.


> Software under normal circumstances is remarkably resilient to having its memory corrupted

Not really? What you are saying applies to anything that uses hashing of some sort, where the goal by design is to have completely different outputs even with a single bit flip.

And "resilience" is not a precise enough term? Is just recovering after errors due to flips? Or is it guaranteeing that the operation will yield correct output (which implies not crashing)? The latter is far harder.


I used to sell workstations and servers years ago, and trying to convince people that they needed ECC RAM, and that it was just insurance costing (often) just the price of the extra chips on the DIMMs, was a nightmare.

The amount of uninformed and inexperienced counter arguments online suggesting it was purely Intel seeking extra money (even though they didn't sell RAM) was ridiculous.

I never understood why there was so much pushback from the consumer world commenting on something they had no idea about. Similar to the arguments about why anyone would ever need xxGB of RAM, while also condemning the incorrect 640KB RAM Bill Gates comment.


How did the company go out of business?


>companies like MSPs go out of business because they were doing crypto with consumer hardware

Any story on this? Can you elaborate or add something to read about it?



