A "hardware failure in our primary encryption hardware device"? "corrupted encryption keys"? Not exactly reassuring!
Did a hard drive fail? If so, why aren't they using RAID? And if changes automatically cascade to backups, then they're not backups. If changes cascade, it's called redundancy.
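To make the backup vs. redundancy distinction concrete, here's a toy Python sketch. The class names and the "card_key" entry are made up for illustration and say nothing about how Recurly's systems actually work; the only point is that a mirror faithfully copies corruption, while a point-in-time snapshot does not.

```python
import copy


class MirroredStore:
    """Redundancy: every write is applied to both copies immediately."""

    def __init__(self):
        self.primary = {}
        self.mirror = {}

    def write(self, key, value):
        self.primary[key] = value
        self.mirror[key] = value  # a corrupt write lands here too


class SnapshotStore:
    """Backup: point-in-time copies that later writes cannot touch."""

    def __init__(self):
        self.primary = {}
        self.snapshots = []

    def write(self, key, value):
        self.primary[key] = value

    def snapshot(self):
        # Deep copy so later changes to primary don't leak into the snapshot.
        self.snapshots.append(copy.deepcopy(self.primary))

    def restore(self, index=-1):
        self.primary = copy.deepcopy(self.snapshots[index])


# Hypothetical scenario: a good key is later overwritten by a corrupt one.
mirrored = MirroredStore()
mirrored.write("card_key", "good")
mirrored.write("card_key", "CORRUPT")

backed_up = SnapshotStore()
backed_up.write("card_key", "good")
backed_up.snapshot()
backed_up.write("card_key", "CORRUPT")
backed_up.restore()
```

After this runs, the mirror holds the corrupt value just like the primary does, while the snapshot store is back to the good one. The mirror protects against a dead disk, not against bad data.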
From the rest of the status update it looks like they're now restoring CC data from an older set of backups. But are they? No way to tell.
It's surprising how many technical terms the status update uses and how little it actually tells us about what's going on.
I'm sure that they'll come up with a more complete story of what happened in a couple of days, after they've got everything under control. Given how critical Recurly is to their customers, I think they should have worded everything a little more carefully. Explain what happened in normal language. Explain what percentage of accounts is affected. Explain what they're doing now. Explain what measures they're going to put into place to prevent similar things from happening in the future. Empathize with the customers (who completely depend on Recurly for their business). This shouldn't be rocket science.
Quote from Wikipedia:
HSMs can typically be clustered for high availability.
Some HSMs feature dual power supplies to enable business continuity.
They should have been using Chaos Monkey: http://techblog.netflix.com/2012/07/chaos-monkey-released-in...
The name just got cutesy with people's money, when they should have been getting cutesy with good design. Stuff fails, but systems that handle money shouldn't. There _will_ be breakage of customer accounts; there will be no getting around that. Look on the bright side: people finally have a chance to not re-up services they don't need.
They may have a lot of encrypted data backed up that they can't decrypt because both their primary and backup decryption devices broke. It may seem stupid not to have offsite backups of this, but it may be considered safer than allowing keys to get into the wild.