
Many types of bit error are not recoverable without a full system reset. It isn't a matter of a simple "this bit in RAM got corrupted", but more "this floating point unit has got into a state where it will not produce a result, and will therefore hang the entire processor".

Therefore boot time becomes critical - if you end up rebooting due to bit errors multiple times per second, you can't afford to wait for Linux to start up each time...
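To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a deliberately simple model that is not from any real mission: errors only strike while the system is running, they arrive at a steady average rate, and every one forces a full reboot. The function name and the example figures are made up for illustration.

```python
def useful_fraction(resets_per_second, boot_seconds):
    """Fraction of wall-clock time spent doing useful work.

    Assumed model: the system runs for 1/resets_per_second seconds on
    average, then spends boot_seconds rebooting, and repeats.
    """
    mean_uptime = 1.0 / resets_per_second
    return mean_uptime / (mean_uptime + boot_seconds)


# Hypothetical figures: 2 resets per second, 30 s boot vs 10 ms boot.
slow_boot = useful_fraction(2, 30.0)
fast_boot = useful_fraction(2, 0.010)
```

Under those assumptions a 30 second boot leaves the system useful for under 2% of the time, while a 10 ms boot keeps it around 98%.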




Run 9 systems in parallel and reset the ones that give less common results or no results at all.

Even with nine copies, you still end up with about 10% of the surface area, power usage and weight of the radiation-hardened parts, and 10 times the speed.
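A minimal sketch of that voting scheme, assuming each replica hands back a hashable result, or None if it hung and produced nothing. The `vote` function is hypothetical, and it uses a simple plurality with ties going to whichever result was seen first; a real system would probably want a strict majority.

```python
from collections import Counter


def vote(results):
    """Return (winner, replicas_to_reset) for a list of per-replica results.

    None means the replica gave no result at all (for example, it hung).
    """
    answered = [r for r in results if r is not None]
    if not answered:
        # Nothing to vote on: every replica needs a reset.
        return None, list(range(len(results)))
    # Most common answer wins; ties go to the answer seen first.
    winner, _count = Counter(answered).most_common(1)[0]
    # Reset every replica that disagreed or stayed silent.
    to_reset = [i for i, r in enumerate(results) if r != winner]
    return winner, to_reset


# Nine replicas: two give uncommon results, one gives nothing.
winner, to_reset = vote([42, 42, 41, 42, None, 42, 42, 7, 42])
```

Here `winner` is 42 and `to_reset` holds the indices of the three replicas that disagreed or went silent.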


And that’s why it’s wise to have multiple systems running at the same time: if one fails, you hopefully still have another online. There’s a reason airplanes, and now cars, are designed this way. I’m sure they’re working towards this too.


Well, I suppose they do not have to load all the kernel modules and drivers that Linux provides today.

I wonder how one could use microkernels to further improve startup time, with a mini distributed OS/kernel for each component.


This is a problem with floating point operations happening at a lower level than the error correction you're imagining. In principle, that's not at all necessary. Are you arguing that it's infeasibly expensive to design a chip with operations that are error correctable?


It's possible - but you'll end up reinventing nearly every step of the IC design process, which will cost a lot.



