It's a shame they publish such an obfuscated status update.
A "hardware failure in our primary encryption hardware device"? "corrupted encryption keys"? Not exactly reassuring!
Did a hard drive fail? If so, why aren't they using RAID? And if changes automatically cascade to backups, then they're not backups. If changes cascade, it's called redundancy.
From the rest of the status update it looks like they're now restoring CC data from an older set of backups. But are they? No way to tell.
It's surprising how many technical terms the status update uses and how little it actually tells us about what's going on.
I'm sure that they'll come up with a more complete story of what happened in a couple of days, after they've got everything under control. Given how critical Recurly is to their customers, I think they should have worded everything a little more carefully. Explain what happened in normal language. Explain what percentage of accounts is affected. Explain what they're doing now. Explain what measures they're going to put into place to prevent similar things from happening in the future. Empathize with the customers (who completely depend on Recurly for their business). This shouldn't be rocket science.
When they write "primary encryption hardware device", I hear "HSM". I do not read a lot of jargon in this post. When a billing company writes to its customers, "some of our customers will need to contact some or all of their customers to have billing information re-entered", I do not sense evasion.
I don't see anything particularly unclear in that post. They're using hardware encryption devices created by a third party. These devices failed, and the entire point of these devices is that it's impossible to read the data without using them. The failure of the 'backups' may have simply been due to the same problem existing in them that caused the primary devices to fail, rather than cascading.
This is probably not a lot of data (billing information, 20M users * 2KB ≈ 40GB). It should have been stored in immutable snapshots down at the 5- or 1-minute mark, with those encrypted snapshots checksummed and replicated to different systems. Each new entry should go into a distributed, replicated log. Double-entry bookkeeping (an immutable data structure with checksums) has been around since, what, the 1300s?
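A minimal sketch of that kind of append-only, checksummed log (field names and the hashing scheme are my own, not anything Recurly is known to use): each entry carries the hash of the previous one, so silent corruption or rewritten history is caught before a snapshot gets replicated.

    # Toy append-only, hash-chained billing log; verify() before replicating.
    import hashlib, json, time

    def entry_hash(body):
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def append(log, record):
        prev = log[-1]["hash"] if log else "0" * 64
        body = {"ts": time.time(), "record": record, "prev": prev}
        log.append({**body, "hash": entry_hash(body)})

    def verify(log):
        prev = "0" * 64
        for e in log:
            body = {k: e[k] for k in ("ts", "record", "prev")}
            if e["prev"] != prev or e["hash"] != entry_hash(body):
                return False
            prev = e["hash"]
        return True

    # Double-entry style: every charge appears as a matching debit and credit.
    log = []
    append(log, {"account": "acct_1", "debit": 999, "credit": 0})
    append(log, {"account": "revenue", "debit": 0, "credit": 999})
    assert verify(log)  # only a verified checkpoint gets snapshotted/replicated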
They got cutesy with the name when they should have been getting serious about good design. Stuff fails, but systems that handle money shouldn't fail like this. There _will_ be breakage of customer accounts; there's no getting around that. Look on the bright side: people finally have a chance to not re-up services they don't need.
At this point, it remains unclear how much of this data will be retrievable.
They may have a lot of encrypted data backed up that they can't decrypt because both their primary and backup decryption devices broke. It may seem stupid not to have offsite backups of this, but it may be considered safer than allowing keys to get into the wild.
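To make that trade-off concrete, here's a rough envelope-encryption sketch (the FakeHsm class and field names are invented; it leans on the third-party `cryptography` package): the backups hold ciphertext plus data keys wrapped by a master key that never leaves the HSM, so losing both devices makes the offsite copies unreadable even though they still exist.

    # Offsite backups store ciphertext plus a wrapped data key; the wrapping
    # (master) key never leaves the HSM. FakeHsm is purely illustrative.
    from cryptography.fernet import Fernet

    class FakeHsm:
        def __init__(self):
            self._master = Fernet(Fernet.generate_key())  # key material stays inside
            self.alive = True

        def wrap(self, data_key: bytes) -> bytes:
            return self._master.encrypt(data_key)

        def unwrap(self, wrapped: bytes) -> bytes:
            if not self.alive:
                raise RuntimeError("HSM failed: wrapped keys are unrecoverable")
            return self._master.decrypt(wrapped)

    hsm = FakeHsm()
    data_key = Fernet.generate_key()
    offsite_backup = {
        "ciphertext": Fernet(data_key).encrypt(b"4111111111111111"),
        "wrapped_key": hsm.wrap(data_key),  # all the backup ever sees
    }

    hsm.alive = False  # primary and backup devices both gone
    # Fernet(hsm.unwrap(offsite_backup["wrapped_key"])) now raises RuntimeError,
    # so the ciphertext in the backup can never be opened.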
Our current communication is focused on getting the hard facts out to our merchants - we want to be transparent and clear about what is happening and what expectations our merchants should have around this service outage. It's slow going, careful work with many moving pieces, so our intention is not to be uncommunicative, but to provide details as we have them.
I invite you to contact support@recurly.com if you have any questions - I will handle all inquiries personally. We are committed to making this right and helping our merchants in any way possible.
Good luck! This has to be tough (I've seen a fair number of HSM and key management related failures; sucks how security and reliability are sometimes at odds...)
Please put a full postmortem on the Internet (and ideally describe what hardware went wrong AND how you'll prevent it in the future) once you've finished immediate recovery. I'd suggest it come from VP Eng, CEO, or Founder level people.
If you can make a credible case for why this won't recur, it shouldn't have a long-term negative effect on the business. (A vendor who had a problem and learned from it is often safer than a vendor who had no problems and has just been lucky...)
Thank you for your support! We will absolutely be posting details as they become available - keep an eye on blog.recurly.com, and our existing merchants will be notified via email.
I use Recurly for one of my SaaS apps. I've had it on my todo list to move off of them and just use Stripe directly, but changing entire payment backends isn't one of those "let's hack for a few hours" things. Just using Stripe would be cheaper than Recurly plus my existing merchant account/gateway.
I don't know if any of my customers are affected yet; however, they have now failed to bill a customer when they should have, double-billed a customer when they shouldn't have, and potentially lost data.
I'm still on their original pricing plan and never moved to the new plan. When they changed their pricing, it just didn't seem attractive. My service is too small, and my monthly charge was going to double for a service I was already meh on, just for some new reports.
I believe at one point I'd heard you can request customer data from them... I don't know exactly how that works, as exporting customer credit card information would be incredibly sensitive. If they've lost any credit card info, I'll likely just cut my losses with them, export nothing, and have all my customers resubscribe. They're probably not going to like that, but so be it.
> On Monday at 3:30am PDT, we experienced a hardware failure in our primary encryption hardware device. The failure cascaded to the backup slave device as well.
How does a hardware failure cascade to a backup slave device?
In the same way a slave database can be corrupted by the master when it has a failure. If you're encrypting data on the master and replicating to the slave, improperly encrypted data may be propagated before a fail-over. This is a case where having the master fail completely and go off-line is preferable to a "working failure".
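A contrived sketch of that "working failure" mode (all names invented, XOR standing in for real crypto): the primary keeps answering after a fault, but its output is junk, and a naive replicator copies the junk straight over the standby's good data.

    # The primary "works" after a fault but emits garbage; replication then
    # propagates the garbage to the standby before anyone fails over.
    import os

    class FlakyEncryptor:
        def __init__(self):
            self.faulted = False

        def encrypt(self, plaintext: bytes) -> bytes:
            if self.faulted:
                return os.urandom(len(plaintext))  # still answers, output is junk
            return bytes(b ^ 0x5A for b in plaintext)  # stand-in for real crypto

    primary, standby = {}, {}
    device = FlakyEncryptor()

    def store_card(card_id, number):
        primary[card_id] = device.encrypt(number.encode())
        standby[card_id] = primary[card_id]  # replication copies bytes blindly

    store_card("cust_1", "4111111111111111")  # fine on both sides
    device.faulted = True                     # partial hardware failure
    store_card("cust_2", "5500000000000004")  # junk now on primary AND standby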
But doesn't a 'hardware failure' imply a failure of a specific piece of equipment/hardware? A cascading issue with actual data isn't a hardware failure, it's a software failure.
Ironically, something not far off from that happened to a contract outfit I was working for. They had a telecoms rack rather than a proper one, and they couldn't bolt 9x maxed-out DL380 G2 servers onto it as it started to bend. They decided the best approach was to just pile them up on the rack bed.
That worked until the one at the bottom needed a RAM upgrade. Cue six people carefully trying to lift up 8x DL380s in one go and sliding various O'Reilly books underneath. The inevitable happened: a hand gave way, resulting in a pile of DL380s spread all over the machine room floor and a broken toe for someone from the hosting facility.
As a current customer of Recurly, I am very disappointed in the way they communicate. I know it's all hands on deck right now to fix this problem, but putting communication aside is not the way to go.
For us this is also a very stressful situation, because if the worst case scenario becomes a reality...
"Some customers will be required to reach out to (some or all) of their customers to have them re-enter billing information."
... we could spend days contacting clients to get their credit card details (which in some cases they'd have to go to their boss for) and going through the billing process again, hoping to get the list as near to 100% recovered as possible.
This is nothing new. In higher risk industries, spreading risk over multiple billing providers is a fact of life. Like any system, if you rely on a single point of failure, then you are electing to take that risk. It's part of the price you pay for not having to deal with all the various requirements of PCI Compliance, as well as actually managing all the billing. The freedom to move from one biller to another biller seamlessly comes at a cost.
It's not an easy problem to solve, regardless. Not from a technical standpoint, mind you.
How do you spread billing across multiple providers if you don't yourself have PCI compliance to retain billing information? I guess you could seed it to multiple systems when the customer first provides it, but that's tricky without momentarily holding the billing information yourself, too. (I mean, you can cheat...) You can't really do paypal + google checkout + a real payment option all transparently to the user, though; you have to give them a way to pick and they may need to re-enter details.
The only way I've seen this done was segmenting by cohort or product -- i.e. recurring billing on one platform and one-off billing on another.
I have seen setups with multiple payment providers where you capture billing information each time, or where you are PCI compliant and keep the billing information yourself.
> How do you spread billing across multiple providers if you don't yourself have PCI compliance to retain billing information?
You become PCI compliant! That's the price you pay. Or you ignore PCI compliance and risk it. You probably wouldn't be surprised to learn that this is far more common than people will admit (and I'm not even talking about people in high-risk industries).
Anyways, there are a few ways you can do this without having to deal with PCI compliance, though it doesn't solve the problem as well.
First, you set up multiple merchant accounts. That way, for a normal transaction, you might send person A to provider A, and then person B to provider B, and then person C to provider A, so on and so forth. The goal here is to spread the threat over more than one provider. You don't just allow PayPal, and if PayPal starts receiving too many transactions, you remove it as an option for a while.
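A toy sketch of that routing idea (provider names and the round-robin policy are made up; a real system would also need to track which provider holds a given customer's card):

    # Round-robin across merchant accounts, with the ability to pull one out
    # of rotation if it starts absorbing too much volume or has an outage.
    import itertools

    class ProviderRouter:
        def __init__(self, providers):
            self.providers = list(providers)
            self.disabled = set()
            self._cycle = itertools.cycle(self.providers)

        def disable(self, name):
            self.disabled.add(name)

        def enable(self, name):
            self.disabled.discard(name)

        def pick(self):
            for _ in range(len(self.providers)):
                candidate = next(self._cycle)
                if candidate not in self.disabled:
                    return candidate
            raise RuntimeError("no payment provider available")

    router = ProviderRouter(["provider_a", "provider_b"])
    router.pick()                # provider_a
    router.pick()                # provider_b
    router.disable("provider_b")
    router.pick()                # provider_a again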
If you are limited, as you mention, to PayPal, Google, and a real payment system, the best you can do is offer encouragement to use whichever system you want to push.
You can also find a PCI compliant provider who you can then attach merchant accounts to. They handle the PCI compliance, you provide the merchant accounts.
Of course, none of these solutions are really as easy as just using PayPal. But then you start to see why PayPal is so popular. It's downright easy.
Kristi answers you below (no, it's not a problem; at least, not one I ever experienced with any bank I dealt with). I will, however, say this: each bank is different. Trust contracts to define the relationship. Beyond that, get second opinions on everything, and then get contracts to back them up. You are dealing with money, probably a lot of money. Spend the time to understand exactly what you are told. What you assume Braintree said may not be what they meant. And always get a contract.
Kristi from Braintree here. From a technical perspective, having multiple simultaneous merchant accounts shouldn't be a problem. If you'd like, shoot us some details, and we can see how we can help - support@braintreepayments.com
I hope they are able to sort this out with a reasonable outcome and protect against these failures in the future, but I can't be the only one who thinks this post should have been a little more personal. When a company makes a mistake like this, they need to accept responsibility, say how they will prevent it from happening again, and apologize. It is too early for them to say how they will prevent it, but simply adding something like "Hey guys, we messed up, we are sorry, and we are working around the clock to get all of you back to normal" to the end of the post would go a long way.
It's really not that hard to switch from Recurly, since they will send your stored credit cards (if they have recovered them) to another provider.
About a year ago we had a problem with Recurly and were dissatisfied with their attitude toward the problem (they seemed to have a "crap happens and we fix things fast" attitude instead of a "that should never have happened with good engineering" attitude). We switched to Braintree over the course of about 5 weeks, and Recurly was very helpful in getting the credit card numbers over (which took about 10 days total, as there was some kind of communications issue at the beginning, but we got everything straight).
I would not recommend Recurly, but it isn't true that they "have you by the shorts"--they'll help you leave their service.
It's almost certainly either a Thales nCipher or a SafeNet. There's a very slight chance it could be IBM or a couple of other manufacturers. All of them can fail in weird ways.
This is more a failure of implementation than a failure of the device. You need some way to shield backups of keys from a failure of the HSM; if the HSMs are paired online for HA, there should be a secret-shared backup of the keys outside HSM storage. (Usually people deploy two HSMs in the same datacenter for HA, so you need an outside backup for DR in case of a fire or something.)
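For illustration, a minimal Shamir-style secret-sharing sketch of that kind of offline key backup (the field size and helper names are my own; real deployments would use the HSM vendor's key-ceremony tooling): split the master key into n shares so that any k of them recover it, and keep the shares offline with separate custodians.

    # Shamir-style split of a master key: any `threshold` shares recover it,
    # fewer reveal nothing about the key.
    import secrets

    PRIME = 2**127 - 1  # Mersenne prime; big enough for a ~126-bit secret

    def split_secret(secret, n_shares, threshold):
        coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
        shares = []
        for x in range(1, n_shares + 1):
            y = 0
            for c in reversed(coeffs):       # Horner evaluation mod PRIME
                y = (y * x + c) % PRIME
            shares.append((x, y))
        return shares

    def recover_secret(shares):
        secret = 0                           # Lagrange interpolation at x = 0
        for i, (xi, yi) in enumerate(shares):
            num = den = 1
            for j, (xj, _) in enumerate(shares):
                if i != j:
                    num = (num * -xj) % PRIME
                    den = (den * (xi - xj)) % PRIME
            secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
        return secret

    master_key = secrets.randbits(126)       # stand-in for the HSM master key
    shares = split_secret(master_key, n_shares=5, threshold=3)
    assert recover_secret(shares[:3]) == master_key   # any 3 custodians suffice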
A "hardware failure in our primary encryption hardware device"? "corrupted encryption keys"? Not exactly reassuring!
Did a hard drive fail? If so, why aren't they using RAID? And if changes automatically cascade to backups, then they're not backups. If changes cascade it's called redundancy.
From the rest of the status update it looks like they're now restoring CC data from an older set of backups. But are they? No way to tell.
It's surprising how many technical terms the status update uses and how little it actually tells about what's going on.
I'm sure that they'll come up with a more complete story of what happened in a couple of days, after they've got everything under control. Given how critical Recurly is to their customers, I think they should worded everything a little more carefully. Explain what happened in normal language. Explain what percentage of accounts is affected. Explain what they're doing now. Explain what measures they're going to put into place to prevent similar things from happening in the future. Emphasize with the customers (who completely depend on Recurly for their business). This shouldn't be rocket science.