It's a shame they publish such an obfuscated status update.
A "hardware failure in our primary encryption hardware device"? "corrupted encryption keys"? Not exactly reassuring!
Did a hard drive fail? If so, why aren't they using RAID? And if changes automatically cascade to backups, then they're not backups. If changes cascade, it's called redundancy.
From the rest of the status update it looks like they're now restoring CC data from an older set of backups. But are they? No way to tell.
It's surprising how many technical terms the status update uses and how little it actually tells us about what's going on.
I'm sure that they'll come up with a more complete story of what happened in a couple of days, after they've got everything under control. Given how critical Recurly is to their customers, I think they should have worded everything a little more carefully. Explain what happened in normal language. Explain what percentage of accounts is affected. Explain what they're doing now. Explain what measures they're going to put into place to prevent similar things from happening in the future. Empathize with the customers (who completely depend on Recurly for their business). This shouldn't be rocket science.
When they write "primary encryption hardware device", I hear "HSM". I do not read a lot of jargon in this post. When a billing company writes to its customers, "some of our customers will need to contact some or all of their customers to have billing information re-entered", I do not sense evasion.
I don't see anything particularly unclear in that post. They're using hardware encryption devices created by a third party. These devices failed, and the entire point of these devices is that it's impossible to read the data without using them. The failure of the 'backups' may have simply been due to the same problem existing in them that caused the primary devices to fail, rather than cascading.
This is probably not a lot of data (billing information, 20M users * 2KB ≈ 40GB). It should have been stored in immutable snapshots down at the 5- or 1-minute mark, with those encrypted snapshots checksummed and replicated to different systems. Each new entry should go into a distributed, replicated log. Double-entry bookkeeping (an immutable data structure with checksums) has been around since, what, the 1300s?
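A minimal sketch of that kind of append-only, checksummed log (field names and the hashing scheme are my own, not anything Recurly is known to use): each entry carries the hash of the previous one, so silent corruption or rewritten history is caught before a snapshot gets replicated.

    # Toy append-only, hash-chained billing log; verify() before replicating.
    import hashlib, json, time

    def entry_hash(body):
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def append(log, record):
        prev = log[-1]["hash"] if log else "0" * 64
        body = {"ts": time.time(), "record": record, "prev": prev}
        log.append({**body, "hash": entry_hash(body)})

    def verify(log):
        prev = "0" * 64
        for e in log:
            body = {k: e[k] for k in ("ts", "record", "prev")}
            if e["prev"] != prev or e["hash"] != entry_hash(body):
                return False
            prev = e["hash"]
        return True

    # Double-entry style: every charge appears as a matching debit and credit.
    log = []
    append(log, {"account": "acct_1", "debit": 999, "credit": 0})
    append(log, {"account": "revenue", "debit": 0, "credit": 999})
    assert verify(log)  # only a verified checkpoint gets snapshotted/replicated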
They got cutesy with the name when they should have been getting serious about good design. Stuff fails, but systems that handle money shouldn't fail like this. There _will_ be breakage of customer accounts; there's no getting around that. Look on the bright side: people finally have a chance to not re-up services they don't need.
At this point, it remains unclear how much of this data will be retrievable.
They may have a lot of encrypted data backed up that they can't decrypt because both their primary and backup decryption devices broke. It may seem stupid not to have offsite backups of this, but it may be considered safer than allowing keys to get into the wild.
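To make that trade-off concrete, here's a rough envelope-encryption sketch (the FakeHsm class and field names are invented; it leans on the third-party `cryptography` package): the backups hold ciphertext plus data keys wrapped by a master key that never leaves the HSM, so losing both devices makes the offsite copies unreadable even though they still exist.

    # Offsite backups store ciphertext plus a wrapped data key; the wrapping
    # (master) key never leaves the HSM. FakeHsm is purely illustrative.
    from cryptography.fernet import Fernet

    class FakeHsm:
        def __init__(self):
            self._master = Fernet(Fernet.generate_key())  # key material stays inside
            self.alive = True

        def wrap(self, data_key: bytes) -> bytes:
            return self._master.encrypt(data_key)

        def unwrap(self, wrapped: bytes) -> bytes:
            if not self.alive:
                raise RuntimeError("HSM failed: wrapped keys are unrecoverable")
            return self._master.decrypt(wrapped)

    hsm = FakeHsm()
    data_key = Fernet.generate_key()
    offsite_backup = {
        "ciphertext": Fernet(data_key).encrypt(b"4111111111111111"),
        "wrapped_key": hsm.wrap(data_key),  # all the backup ever sees
    }

    hsm.alive = False  # primary and backup devices both gone
    # Fernet(hsm.unwrap(offsite_backup["wrapped_key"])) now raises RuntimeError,
    # so the ciphertext in the backup can never be opened.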
Our current communication is focused on getting the hard facts out to our merchants - we want to be transparent and clear about what is happening and what expectations our merchants should have around this service outage. It's slow going, careful work with many moving pieces, so our intention is not to be uncommunicative, but to provide details as we have them.
I invite you to contact support@recurly.com if you have any questions - I will handle all inquiries personally. We are committed to making this right and helping our merchants in any way possible.
Good luck! This has to be tough (I've seen a fair number of HSM and key management related failures; sucks how security and reliability are sometimes at odds...)
Please put a full postmortem on the Internet (and ideally describe what hardware went wrong AND how you'll prevent it in the future) once you've finished immediate recovery. I'd suggest it come from VP Eng, CEO, or Founder level people.
If you can make a credible case for why this won't recur, it shouldn't have a long-term negative effect on the business. (A vendor who had a problem and learned from it is often safer than a vendor who had no problems and has just been lucky...)
Thank you for your support! We will absolutely be posting details as they become available - keep an eye on blog.recurly.com, and our existing merchants will be notified via email.
I use Recurly for one of my SaaS apps. I've had it on my todo list to move off of them and just use Stripe directly, but changing entire payment backends isn't one of those "let's hack for a few hours" things. Just using Stripe would be cheaper than Recurly plus my existing merchant account/gateway.
I don't know if any of my customers are affected yet; however, they have now failed to bill a customer when they should have, double-billed a customer when they shouldn't have, and potentially lost data.
I'm still on their original pricing plan and never moved to the new plan. When they changed their pricing, it just didn't seem attractive. My service is too small, and my monthly charge was going to double for a service I was already meh on, just for some new reports.
I believe at one point I'd heard you can request customer data from them... I don't know exactly how that works, as exporting customer credit card information would be incredibly sensitive. If they've lost any credit card info, I'll likely just cut my losses with them, export nothing, and have all my customers resubscribe. They're probably not going to like that, but so be it.
> On Monday at 3:30am PDT, we experienced a hardware failure in our primary encryption hardware device. The failure cascaded to the backup slave device as well.
How does a hardware failure cascade to a backup slave device?
In the same way a slave database can be corrupted by the master when it has a failure. If you're encrypting data on the master and replicating to the slave, improperly encrypted data may be propagated before a fail-over. This is a case where having the master fail completely and go off-line is preferable to a "working failure".
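A contrived sketch of that "working failure" mode (all names invented, XOR standing in for real crypto): the primary keeps answering after a fault, but its output is junk, and a naive replicator copies the junk straight over the standby's good data.

    # The primary "works" after a fault but emits garbage; replication then
    # propagates the garbage to the standby before anyone fails over.
    import os

    class FlakyEncryptor:
        def __init__(self):
            self.faulted = False

        def encrypt(self, plaintext: bytes) -> bytes:
            if self.faulted:
                return os.urandom(len(plaintext))  # still answers, output is junk
            return bytes(b ^ 0x5A for b in plaintext)  # stand-in for real crypto

    primary, standby = {}, {}
    device = FlakyEncryptor()

    def store_card(card_id, number):
        primary[card_id] = device.encrypt(number.encode())
        standby[card_id] = primary[card_id]  # replication copies bytes blindly

    store_card("cust_1", "4111111111111111")  # fine on both sides
    device.faulted = True                     # partial hardware failure
    store_card("cust_2", "5500000000000004")  # junk now on primary AND standby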
But doesn't a 'hardware failure' imply a failure of a specific piece of equipment/hardware? A cascading issue with actual data isn't a hardware failure, it's a software failure.
Ironically, something not far off from that happened to a contract outfit I was working for. They had a telecoms rack rather than a proper one, and they couldn't bolt 9x maxed-out DL380 G2 servers onto it as it started to bend. They decided the best approach was to just pile them up on the rack bed.
That worked until the one at the bottom needed a RAM upgrade. Cue six people carefully trying to lift up 8x DL380s in one go and sliding various O'Reilly books underneath. The inevitable happened: a hand gave way, resulting in a pile of DL380s spread all over the machine room floor and a broken toe for someone from the hosting facility.
As a current customer of Recurly, I am very disappointed in the way they communicate. I know it's all hands on deck right now to fix this problem, but putting communication aside is not the way to go.
For us this is also a very stressful situation, because if the worst case scenario becomes a reality...
"Some customers will be required to reach out to (some or all) of their customers to have them re-enter billing information."
... we could spend days contacting clients to get their credit card details (which in some cases they'd have to go to their boss for) and going through the billing process again, hoping to get the list as near to 100% recovered as possible.
This is nothing new. In higher risk industries, spreading risk over multiple billing providers is a fact of life. Like any system, if you rely on a single point of failure, then you are electing to take that risk. It's part of the price you pay for not having to deal with all the various requirements of PCI Compliance, as well as actually managing all the billing. The freedom to move from one biller to another biller seamlessly comes at a cost.
It's not an easy problem to solve, regardless. Not from a technical standpoint, mind you.
How do you spread billing across multiple providers if you don't yourself have PCI compliance to retain billing information? I guess you could seed it to multiple systems when the customer first provides it, but that's tricky without momentarily holding the billing information yourself, too. (I mean, you can cheat...) You can't really do paypal + google checkout + a real payment option all transparently to the user, though; you have to give them a way to pick and they may need to re-enter details.
The only way I've seen this done was segmenting by cohort or product -- i.e. recurring billing on one platform and one-off billing on another.
I have seen setups with multiple payment providers where you capture billing information each time, or where you are PCI compliant and keep the billing information yourself.
> How do you spread billing across multiple providers if you don't yourself have PCI compliance to retain billing information?
You become PCI compliant! That's the price you pay. Or you ignore PCI compliance and risk it. You probably wouldn't be surprised to learn that this is far more common than people will admit (and I'm not even talking about people in high-risk industries).
Anyways, there are a few ways you can do this without having to deal with PCI compliance, though it doesn't solve the problem as well.
First, you set up multiple merchant accounts. That way, for a normal transaction, you might send person A to provider A, and then person B to provider B, and then person C to provider A, so on and so forth. The goal here is to spread the threat over more than one provider. You don't just allow PayPal, and if PayPal starts receiving too many transactions, you remove it as an option for a while.
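A toy sketch of that routing idea (provider names and the round-robin policy are made up; a real system would also need to track which provider holds a given customer's card):

    # Round-robin across merchant accounts, with the ability to pull one out
    # of rotation if it starts absorbing too much volume or has an outage.
    import itertools

    class ProviderRouter:
        def __init__(self, providers):
            self.providers = list(providers)
            self.disabled = set()
            self._cycle = itertools.cycle(self.providers)

        def disable(self, name):
            self.disabled.add(name)

        def enable(self, name):
            self.disabled.discard(name)

        def pick(self):
            for _ in range(len(self.providers)):
                candidate = next(self._cycle)
                if candidate not in self.disabled:
                    return candidate
            raise RuntimeError("no payment provider available")

    router = ProviderRouter(["provider_a", "provider_b"])
    router.pick()                # provider_a
    router.pick()                # provider_b
    router.disable("provider_b")
    router.pick()                # provider_a again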
If you are limited, as you mention, to PayPal, Google, and a real payment system, the best you can do is offer encouragement to use whichever system you want to push.
You can also find a PCI compliant provider who you can then attach merchant accounts to. They handle the PCI compliance, you provide the merchant accounts.
Of course, none of these solutions are really as easy as just using PayPal. But then you start to see why PayPal is so popular. It's downright easy.
Kristi answers you below (no, it's not a problem; at least, not one I ever experienced with any bank I dealt with). I will, however, say this: each bank is different. Trust contracts to define the relationship. Beyond that, get second opinions on everything, and then get contracts to back them up. You are dealing with money, probably a lot of money. Spend the time to understand exactly what you are told. What you assume Braintree said may not be what they meant. And always get a contract.
Kristi from Braintree here. From a technical perspective, having multiple simultaneous merchant accounts shouldn't be a problem. If you'd like, shoot us some details, and we can see how we can help - support@braintreepayments.com
I hope they are able to sort this out with a reasonable outcome and protect against these failures in the future, but I can't be the only one who thinks this post should have been a little more personal. When a company makes a mistake like this, they need to accept responsibility, say how they will prevent it from happening again, and apologize. It is too early for them to say how they will prevent it, but simply adding something like "Hey guys, we messed up, we are sorry, and we are working around the clock to get all of you back to normal" to the end of the post would go a long way.
It's really not that hard to switch from Recurly, since they will send your stored credit cards (if they have recovered them) to another provider.
About a year ago we had a problem with Recurly and were dissatisfied with their attitude toward the problem (they seemed to have a "crap happens and we fix things fast" attitude instead of a "that should never have happened with good engineering" attitude). We switched to Braintree over the course of about 5 weeks, and Recurly was very helpful in getting the credit card numbers over (which took about 10 days total, as there was some kind of communications issue at the beginning, but we got everything straight).
I would not recommend Recurly, but it isn't true that they "have you by the shorts"--they'll help you leave their service.
It's almost certainly either a Thales nCipher or a SafeNet. There's a very slight chance it could be IBM or a couple of other manufacturers. All of them can fail in weird ways.
This is more a failure of implementation than a failure of the device. You need some way to shield backups of keys from a failure of the HSM; if the HSMs are paired online for HA, there should be a secret-shared backup of the keys outside HSM storage. (Usually people deploy two HSMs in the same datacenter for HA, so you need an outside backup for DR in case of a fire or something.)
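For illustration, a minimal Shamir-style secret-sharing sketch of that kind of offline key backup (the field size and helper names are my own; real deployments would use the HSM vendor's key-ceremony tooling): split the master key into n shares so that any k of them recover it, and keep the shares offline with separate custodians.

    # Shamir-style split of a master key: any `threshold` shares recover it,
    # fewer reveal nothing about the key.
    import secrets

    PRIME = 2**127 - 1  # Mersenne prime; big enough for a ~126-bit secret

    def split_secret(secret, n_shares, threshold):
        coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
        shares = []
        for x in range(1, n_shares + 1):
            y = 0
            for c in reversed(coeffs):       # Horner evaluation mod PRIME
                y = (y * x + c) % PRIME
            shares.append((x, y))
        return shares

    def recover_secret(shares):
        secret = 0                           # Lagrange interpolation at x = 0
        for i, (xi, yi) in enumerate(shares):
            num = den = 1
            for j, (xj, _) in enumerate(shares):
                if i != j:
                    num = (num * -xj) % PRIME
                    den = (den * (xi - xj)) % PRIME
            secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
        return secret

    master_key = secrets.randbits(126)       # stand-in for the HSM master key
    shares = split_secret(master_key, n_shares=5, threshold=3)
    assert recover_secret(shares[:3]) == master_key   # any 3 custodians suffice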
A "hardware failure in our primary encryption hardware device"? "corrupted encryption keys"? Not exactly reassuring!
Did a hard drive fail? If so, why aren't they using RAID? And if changes automatically cascade to backups, then they're not backups. If changes cascade it's called redundancy.
From the rest of the status update it looks like they're now restoring CC data from an older set of backups. But are they? No way to tell.
It's surprising how many technical terms the status update uses and how little it actually tells about what's going on.
I'm sure that they'll come up with a more complete story of what happened in a couple of days, after they've got everything under control. Given how critical Recurly is to their customers, I think they should worded everything a little more carefully. Explain what happened in normal language. Explain what percentage of accounts is affected. Explain what they're doing now. Explain what measures they're going to put into place to prevent similar things from happening in the future. Emphasize with the customers (who completely depend on Recurly for their business). This shouldn't be rocket science.