It's a shame they publish such an obfuscated status update.
A "hardware failure in our primary encryption hardware device"? "corrupted encryption keys"? Not exactly reassuring!
Did a hard drive fail? If so, why aren't they using RAID? And if changes automatically cascade to backups, then they're not backups. If changes cascade, it's called redundancy.
From the rest of the status update it looks like they're now restoring CC data from an older set of backups. But are they? No way to tell.
It's surprising how many technical terms the status update uses and how little it actually tells us about what's going on.
I'm sure that they'll come up with a more complete story of what happened in a couple of days, after they've got everything under control. Given how critical Recurly is to their customers, I think they should have worded everything a little more carefully. Explain what happened in normal language. Explain what percentage of accounts is affected. Explain what they're doing now. Explain what measures they're going to put into place to prevent similar things from happening in the future. Empathize with the customers (who completely depend on Recurly for their business). This shouldn't be rocket science.
When they write "primary encryption hardware device", I hear "HSM". I do not read a lot of jargon in this post. When a billing company writes to its customers, "some of our customers will need to contact some or all of their customers to have billing information re-entered", I do not sense evasion.
I don't see anything particularly unclear in that post. They're using hardware encryption devices created by a third party. These devices failed, and the entire point of these devices is that it's impossible to read the data without using them. The failure of the 'backups' may have simply been due to the same problem existing in them that caused the primary devices to fail, rather than cascading.
This is probably not a lot of data (billing information, 20M users * 2 KB ~ 40 GB) and should have been stored in immutable snapshots taken every 1 to 5 minutes, with those encrypted snapshots being checksummed and replicated to different systems. Each new entry should go into a distributed replicated log. Double-entry bookkeeping (an immutable data structure with checksums) has been around since what, the 1300s?
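To make the "immutable log with checksums" idea concrete, here is a minimal sketch in Python. It says nothing about how Recurly actually stores data; the names `GENESIS`, `entry_hash`, `append`, and `verify` are made up for illustration, and the payload strings stand in for already-encrypted billing records. Each entry carries the hash of the previous one, so a corrupted or rewritten entry is detectable on any replica.

```python
import hashlib
import json

# Hypothetical starting value for the hash chain (an assumption, not a standard).
GENESIS = "0" * 64


def entry_hash(prev_hash, payload):
    """Checksum an entry together with the hash of the entry before it."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(log, payload):
    """Return a new log with one more entry; the existing log is never mutated."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    record = {
        "prev": prev_hash,
        "payload": payload,
        "hash": entry_hash(prev_hash, payload),
    }
    return log + [record]


def verify(log):
    """Walk the chain and recompute every checksum; False on any mismatch."""
    prev_hash = GENESIS
    for record in log:
        if record["prev"] != prev_hash:
            return False
        if record["hash"] != entry_hash(prev_hash, record["payload"]):
            return False
        prev_hash = record["hash"]
    return True
```

A replica would run `verify` before accepting a copy of the log, so corruption on the primary is caught instead of being cascaded. Note that this only detects damage to the stored records; it does nothing for lost encryption keys, which is a separate problem.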
They got cutesy with the name, and with people's money, when they should have been getting cutesy with good design. Stuff fails, but systems that handle money shouldn't. There _will_ be breakage of customer accounts; there is no getting around that. Look on the bright side: people finally have a chance to not re-up services they don't need.
At this point, it remains unclear how much of this data will be retrievable.
They may have a lot of encrypted data backed up that they can't decrypt because both their primary and backup decryption devices broke. It may seem stupid not to have offsite backups of the keys, but that may be considered safer than allowing keys to get into the wild.