Now I'm going to have to set up the system again, and I don't know whether this is going to happen again. The SD card that got corrupted was a Class 4 Kingston.
Maybe I'll look into a SanDisk (possibly Class 10?) next time. But I am worried that it's not the SD card's fault, but rather a combination of a journaling filesystem, an SD card and a sudden power outage.
Edited: Apologies, I realize now that the red button cuts power to the network switch, not to each individual Pi. But my concerns about the Pi and power cuts still remain.
It's super easy to revert back to and an easy way to support less technical people.
However, newer, higher-capacity flash (built on smaller process sizes) is far less reliable: endurance and retention are orders of magnitude lower, while raw bit error rates are correspondingly higher. MLC makes this even worse, but manufacturers have been masking the problems by using stronger error correction. This strategy mostly works, but combined with another characteristic of dense NAND flash, read disturb, it makes for memory devices that are far more fragile and sensitive to power interruptions than before. Read disturb means that repeatedly reading the same blocks in flash has a write-like effect on the bits adjacent to the ones being read, so even read operations are somewhat destructive. What this means is that the block management controller may have to perform a copy (i.e. a write) and erase after a certain number of reads. Furthermore, blocks which have been idle for a long time also need to have their contents periodically refreshed, since the data slowly fades away as the electrons leak out.
All these characteristics together mean that at any one time, even if the SD card is "idle" or only being read, block erase/program operations may be occurring internally. If a power interruption happens, then depending on what was being written at the time, anything from silent data corruption (if the block was storing user data) to a complete failure of the card (if the block was part of the BMT or other management data) can occur.
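To make the "reads cause writes" point concrete, here is a toy model in Python. It is only an illustration: the class name, the per-block counter and the `read_disturb_limit` value are all made up for this sketch, and real controllers are proprietary and far more complicated.

```python
class ToyFlashController:
    """Toy model: enough reads of one block force an internal rewrite.

    Everything here is hypothetical and only illustrates the idea; it does
    not describe any real SD card controller.
    """

    def __init__(self, num_blocks, read_disturb_limit):
        self.read_disturb_limit = read_disturb_limit
        self.reads_since_refresh = [0] * num_blocks
        self.internal_rewrites = 0

    def read(self, block):
        """Serve a host read; return True if it triggered an internal rewrite."""
        self.reads_since_refresh[block] += 1
        if self.reads_since_refresh[block] < self.read_disturb_limit:
            return False
        # Too many reads since the last refresh: the controller copies the
        # data elsewhere and erases the block. A power cut during this
        # window is the dangerous case, even though the host only "read".
        self.internal_rewrites += 1
        self.reads_since_refresh[block] = 0
        return True
```

In this model a host that does nothing but read block 0 over and over still causes an internal rewrite every `read_disturb_limit` reads.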
Most applications of SD cards don't often cut power abruptly, which is why this problem doesn't occur. The RPi is an exception. If you want to reduce this problem as much as possible, my recommendation is to use older, low-capacity SD cards, which may contain large-geometry SLC flash. This is not going to be cheap (per capacity), but will be cheaper than new "industrial grade" cards (which may actually be worse). I've had good luck with cards from relatively unknown Chinese/Taiwanese OEMs - many of them explicitly specify "100K program/erase cycles", something that the "consumer" brands don't even mention.
That's like "Writing to a Consumer File System 101". Yikes.
The easiest change, if you're not really worried about reading the logs in case of power failure, is to move /var/log (plus a few other directories that are normally written to, like /var/tmp, etc.) to memory instead of on the SD card. Also disable swap.
That way it's less likely there is a write going on when the power is pulled.
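If you want to check whether a Pi is actually set up this way, a small Python sketch like the one below could do it. It assumes the usual Linux `/proc/mounts` and `/proc/swaps` text formats; the function names are just ones I picked for the example.

```python
def tmpfs_status(mounts_text, directories):
    """Return {directory: bool} telling whether each directory lives on tmpfs.

    mounts_text is the content of /proc/mounts: one mount per line with
    whitespace-separated fields (device, mount point, filesystem type, ...).
    """
    # Map each mount point to its filesystem type. Later lines override
    # earlier ones, the same way a newer mount shadows an older one.
    fs_by_mount_point = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            fs_by_mount_point[fields[1]] = fields[2]

    status = {}
    for directory in directories:
        # Walk up from the directory towards "/" to find the closest mount.
        path = directory.rstrip("/") or "/"
        while path not in fs_by_mount_point and path != "/":
            path = path.rsplit("/", 1)[0] or "/"
        status[directory] = fs_by_mount_point.get(path) == "tmpfs"
    return status


def swap_in_use(swaps_text):
    """Return True if /proc/swaps lists at least one active swap area."""
    # The first line of /proc/swaps is a header; any other non-empty line
    # describes an active swap file or partition.
    return any(line.strip() for line in swaps_text.splitlines()[1:])
```

On the Pi itself you would pass in `open("/proc/mounts").read()` and `open("/proc/swaps").read()`, with a directory list such as `["/var/log", "/var/tmp"]`.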
Another thing to look at is making the entire card read-only, and setting up a temporary directory in memory that's periodically backed up somewhere remote.
The real trouble area is the SD card reader in my experience. The pins are very easy to break off accidentally. The new B+ model uses a microSD card so it's not nearly as troublesome.
This definitely worked for me anyway. Plus the case keeps the SD card from moving almost at all while in position (which means you cannot suffer corruption from shaking it out of the reader, only from power loss or similar).
An alternative is to boot the RPi diskless. This works, but since everything is going through the USB bus it gets even slower than it normally is, which can make it unsuitable for an application.
That's exactly what I've done.
I periodically back up a SQLite db file to S3, and I wrote a script that will retrieve the latest backup on boot. Just plug in the new card and everything is back to the way it was, minus at most 3 hours of data.
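For anyone wanting to try something similar, here is a rough Python sketch of the snapshot and restore-on-boot halves. This is not the script described above: the upload to S3 is left out, a local directory stands in for the remote store, and the function names and the `backup-*.sqlite` naming scheme are assumptions made for the example.

```python
import sqlite3
import time
from pathlib import Path


def _copy_db(source_path, dest_path):
    """Copy one SQLite database to another file using the online backup API."""
    source = sqlite3.connect(str(source_path))
    dest = sqlite3.connect(str(dest_path))
    try:
        # Connection.backup gives a consistent copy even if another
        # connection is writing to the source at the same time.
        source.backup(dest)
    finally:
        dest.close()
        source.close()


def snapshot_db(db_path, backup_dir):
    """Write a timestamped snapshot of db_path into backup_dir."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    # Zero-padded nanosecond timestamp, so names sort chronologically.
    target = backup_dir / f"backup-{time.time_ns():020d}.sqlite"
    _copy_db(db_path, target)
    return target


def latest_backup(backup_dir):
    """Return the newest snapshot in backup_dir, or None if there is none."""
    candidates = sorted(Path(backup_dir).glob("backup-*.sqlite"))
    return candidates[-1] if candidates else None


def restore_latest(backup_dir, db_path):
    """Copy the newest snapshot over db_path; return True if one was found."""
    newest = latest_backup(backup_dir)
    if newest is None:
        return False
    _copy_db(newest, db_path)
    return True
```

`snapshot_db` would run from cron or a timer (every 3 hours in the setup above), followed by whatever upload step you use, and `restore_latest` would run once at boot after fetching the backups.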
Does the Pi already monitor its supply voltages? If not, something could be hung on the GPIO to monitor +5 V. More elaborate circuitry can be used to extend the duration of the grace period if needed (diodes and capacitors, nothing exotic).
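The software side of that could be quite small. Below is a sketch of a debounced polling loop in Python; the hardware access is passed in as a callable so nothing here depends on a particular GPIO library. All names and default values are assumptions for the example, and since the Pi's GPIO pins are 3.3 V logic, the +5 V rail would have to be divided down before it reaches a pin.

```python
import time


def wait_for_power_loss(read_power_ok, on_power_loss, required_bad_samples=3,
                        poll_interval_s=0.05, max_polls=None, sleep=time.sleep):
    """Poll read_power_ok() and call on_power_loss() once power looks gone.

    read_power_ok: callable returning True while the supply is present.
    on_power_loss: callable run once after required_bad_samples consecutive
                   bad readings (a simple debounce against glitches).
    Returns True if on_power_loss was called, False if max_polls ran out.
    """
    bad_in_a_row = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        if read_power_ok():
            bad_in_a_row = 0  # a good sample resets the debounce counter
        else:
            bad_in_a_row += 1
            if bad_in_a_row >= required_bad_samples:
                on_power_loss()
                return True
        sleep(poll_interval_s)
    return False
```

On real hardware `read_power_ok` might wrap a GPIO input read and `on_power_loss` might sync and halt the system; how long the grace period has to be depends on the capacitors and on how fast the Pi can shut down.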
It would be nice if Raspberry Pi distros came with an easy way to do that.
Lets you use a battery pack with a handful of AA batteries as a UPS. You can even detect when one power source disappears and then safely shut down. Amongst various other useful power-related things.
Some integer code might run at comparable speed.
What we do need are boards with much better manufacturing QA/QC than Raspberry Pi. After the nth time the USB 5V falls out, or the SD reader loses contact, you rapidly realize they're not targeted towards a production environment. As inexpensive and powerful as possible is a great goal, but you invariably lose some reliability ("pick two").
Or even just back up the entire card to an image somewhere.
I bought these also to run dashboards in Kiosk mode - but they were too underpowered to drive my StackDriver graphs. :(
However, my RPi that sits at home is attached to a UPS and doesn't seem to have any issues related to the SD card or stability.
Also, any reason for not making the big red button randomly select a "datacenter" to take offline?
Idea: transition this into a 3 or 4 datacenter cluster.
If you hit the button, I believe one of the rings would turn red but the cluster would still be able to function.