
Recurly billing down -- some customer data lost - rexreed
http://blog.recurly.com/2012/09/status-update-ii-hardware-outage/
======
jd
It's a shame they publish such an obfuscated status update.

A "hardware failure in our primary encryption hardware device"? "corrupted
encryption keys"? Not exactly reassuring!

Did a hard drive fail? If so, why aren't they using RAID? And if changes
automatically cascade to backups, then they're _not backups_. If changes
cascade it's called _redundancy_.

From the rest of the status update it looks like they're now restoring CC data
from an older set of backups. But are they? No way to tell.

It's surprising how many technical terms the status update uses and how little
it actually tells about what's going on.

I'm sure that they'll come up with a more complete story of what happened in a
couple of days, after they've got everything under control. Given how critical
Recurly is to their customers, I think they should have worded everything a little
more carefully. Explain what happened in normal language. Explain what
percentage of accounts is affected. Explain what they're doing now. Explain
what measures they're going to put into place to prevent similar things from
happening in the future. Empathize with the customers (who completely depend
on Recurly for their business). This shouldn't be rocket science.

~~~
sitkack
This is probably not a lot of data (billing information, 20M users * 2k ~
40GB) and should have been stored in immutable snapshots down @ the 5 or 1
minute mark with those encrypted snapshots being checksummed and replicated to
different systems. Each new entry to go into a distributed replicated log.
Double entry book keeping (immutable data structure with checksums) has been
around from what, the 1300s?
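
A minimal sketch of the "immutable data structure with checksums" idea, assuming a simple hash chain in which each entry's SHA-256 digest covers the previous digest plus the new payload. The names `AppendOnlyLog` and `entry_digest` are hypothetical, and this says nothing about how Recurly actually stores data:

```python
import hashlib
import json


def entry_digest(prev_digest: str, payload: dict) -> str:
    """Hash the previous digest together with a canonical encoding of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_digest + body).encode("utf-8")).hexdigest()


class AppendOnlyLog:
    """Hypothetical in-memory log; entries are only ever appended, never edited."""

    GENESIS = "0" * 64  # placeholder digest that precedes the first entry

    def __init__(self):
        self._entries = []  # list of (payload, digest) pairs

    def append(self, payload: dict) -> str:
        prev = self._entries[-1][1] if self._entries else self.GENESIS
        digest = entry_digest(prev, payload)
        self._entries.append((dict(payload), digest))
        return digest

    def verify(self) -> bool:
        """Recompute the whole chain; any altered entry breaks every later digest."""
        prev = self.GENESIS
        for payload, digest in self._entries:
            if entry_digest(prev, payload) != digest:
                return False
            prev = digest
        return True
```

A replica that stores the same entries can run `verify()` before trusting a snapshot, so silent corruption is detected rather than copied forward.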

They should have been using chaos monkey,
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html

The name just got cutesy with people's money, when they shoulda been getting
cutesy with good design, stuff fails. But systems shouldn't fail that handle
money. There _will_ be breakage of customer accounts, there will be no getting
around that. Look on the bright side, people finally have a chance to not-reup
services they don't need.

~~~
jarcoal
You seem to know more about this than me, but from the post it appears like
their data is just fine but the keys to decrypt it are gone/corrupt.

~~~
anovikov
If there is no major flaw in their encryption system, that is the same as if
the data were lost...

------
raerae7133
Rachel from Recurly Customer Support here

Our current communication is focused on getting the hard facts out to our
merchants - we want to be transparent and clear about what is happening and
what expectations our merchants should have around this service outage. It's
slow going, careful work with many moving pieces, so our intention is not to
be uncommunicative, but to provide details as we have them.

I invite you to contact support@recurly.com if you have any questions - I will
handle all inquiries personally. We are committed to making this right and
helping our merchants in any way possible.

~~~
rdl
Good luck! This has to be tough (I've seen a fair number of HSM and key
management related failures; sucks how security and reliability are sometimes
at odds...)

Please put a full postmortem on the Internet (and ideally describe what
hardware went wrong AND how you'll prevent it in the future) once you've
finished immediate recovery. I'd suggest it come from VP Eng, CEO, or Founder
level people.

If you can make a credible case for why this won't recur, it shouldn't have a
long-term negative effect on the business. (A vendor who had a problem and
learned from it is often safer than a vendor who had no problems and has just
been lucky...)

~~~
raerae7133
Thank you for your support! We will absolutely be posting details as they
become available - keep an eye on blog.recurly.com, and our existing merchants
will be notified via email.

Rachel Recurly Support

------
krobertson
(probably soon to be former Recurly customer)

I use Recurly for one of my SaaS apps. I've had it on my todo list to move off
of them and just use Stripe directly, but changing entire payment backends
isn't one of those "let's hack for a few hours" things. Just using Stripe would
be cheaper than Recurly and my existing merchant account/gateway.

I don't know if any of my customers are affected yet, however now they have
failed to bill a customer when they should have, double billed a customer when
they shouldn't have, and now potentially lost data.

I'm still on their original pricing plan and never moved to the new plan. When
they changed their pricing, it just didn't seem attractive. My service is too
small and my monthly charge was going to double for a service I was already meh
on and just adding some new reports.

I believe at one point I'd heard you can request customer data from them...
don't know exactly how that works as exporting customer credit card
information would be incredibly sensitive. If they've lost any credit card
info, I'll likely just cut my losses with them, export nothing, and have all my
customers resub. Probably not going to like that, but so be it.

------
sync

       On Monday at 3:30am PDT, we experienced
       a hardware failure in our primary encryption hardware device.
       The failure cascaded to the backup slave device as well.
    

How does a hardware failure cascade to a backup slave device?

~~~
smoyer
In the same way a slave database can be corrupted by the master when it has a
failure. If you're encrypting data on the master and replicating to the slave,
improperly encrypted data may be propagated before a fail-over. This is a case
where having the master fail completely and go off-line is preferable to a
"working failure".
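
A rough sketch of one guard against a "working failure", under the assumption that the slave holds its own copy of the key and refuses any write it cannot decrypt back to the expected plaintext. `xor_bytes` is a toy stand-in for a cipher (not real encryption), and `Replica` and `primary_write` are hypothetical names:

```python
import hashlib


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Toy stand-in for a cipher (NOT real encryption); XOR is its own inverse."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


class Replica:
    """Hypothetical slave that validates each write against its own key copy."""

    def __init__(self, key: bytes):
        self.key = key
        self.store = []

    def accept(self, ciphertext: bytes, plaintext_digest: bytes) -> bool:
        # Decrypt with the replica's independent key copy and compare digests.
        recovered = xor_bytes(ciphertext, self.key)
        if hashlib.sha256(recovered).digest() != plaintext_digest:
            return False  # reject: the primary produced undecryptable data
        self.store.append(ciphertext)
        return True


def primary_write(plaintext: bytes, primary_key: bytes, replica: Replica) -> bool:
    """Encrypt on the primary, then replicate only if the replica can verify it."""
    ciphertext = xor_bytes(plaintext, primary_key)
    digest = hashlib.sha256(plaintext).digest()
    return replica.accept(ciphertext, digest)
```

If the primary's key is silently corrupted, `primary_write` returns `False` and the replica's store is left untouched. A real system would use a keyed MAC instead of a bare hash of card data, since a plain digest of a card number can be brute-forced.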

~~~
awicklander
But doesn't a 'hardware failure' imply failure to a specific piece of
equipment/hardware? A cascading issue with actual data isn't a hardware
failure, it's a software failure.

------
hoopism
Hard not to feel bad for all involved. I did chuckle when I read that new
customer signups were not impacted... thank goodness for that.

Sony won't ever have this problem... all their customer CC data is public.

------
ndemoor
As a current customer of Recurly, I am very disappointed in the way they
communicate. I know it's all _hands_ on deck now to fix this problem, but
putting communication aside is not the way to go.

For us this is also a very stressful situation, because if the worst case
scenario becomes a reality...

"Some customers will be required to reach out to (some or all) of their
customers to have them re-enter billing information."

... we can spend days contacting clients to get the payment credit card (which
in some cases they should go to their boss for), and go through the billing
process again, only to hope to get the list as near to a 100% recovered as
possible.

Time for Billing Provider Redundancy?

~~~
jasonlotito
> Time for Billing Provider Redundancy

This is nothing new. In higher risk industries, spreading risk over multiple
billing providers is a fact of life. Like any system, if you rely on a single
point of failure, then you are electing to take that risk. It's part of the
price you pay for not having to deal with all the various requirements of PCI
Compliance, as well as actually managing all the billing. The freedom to move
from one biller to another biller seamlessly comes at a cost.

It's not an easy problem to solve, regardless. Not from a technical
standpoint, mind you.

~~~
rdl
How do you spread billing across multiple providers if you don't yourself have
PCI compliance to retain billing information? I guess you could seed it to
multiple systems when the customer first provides it, but that's tricky
without momentarily holding the billing information yourself, too. (I mean,
you can cheat...) You can't really do paypal + google checkout + a real
payment option all transparently to the user, though; you have to give them a
way to pick and they may need to re-enter details.

The only way I've seen this done was segmenting by cohort or product -- i.e.
recurring billing on one platform and one off billing on another.

I have seen multiple payment providers where you capture billing information
each time, or where you are PCI compliant and keep the billing information
yourself.

~~~
jasonlotito
> How do you spread billing across multiple providers if you don't yourself
> have PCI compliance to retain billing information?

You become PCI compliant! That's the price you pay. Or you ignore PCI
compliance and risk it. You probably wouldn't be surprised to learn that this
is far more common than people will admit (and I'm not even talking about
people in high-risk industries).

Anyways, there are a few ways you can do this without having to deal with PCI
compliance, though it doesn't solve the problem as well.

First, you set up multiple merchant accounts. That way, for a normal
transaction, you might send person A to provider A, and then person B to
provider B, and then person C to provider A, so on and so forth. The goal here
is to spread the threat over more than one provider. You don't just allow
PayPal, and if PayPal starts receiving too many transactions, you remove it as
an option for a while.
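
The alternating assignment described above could look roughly like this. It is only a sketch: `ProviderRouter` and the provider names are hypothetical, and the sticky per-customer mapping is an assumption (a recurring customer would normally stay with the provider that holds their card):

```python
from itertools import cycle


class ProviderRouter:
    """Hypothetical round-robin router that pins each customer to one provider."""

    def __init__(self, providers):
        if not providers:
            raise ValueError("at least one provider is required")
        self._next = cycle(providers)
        self._assigned = {}  # customer_id -> provider name

    def provider_for(self, customer_id: str) -> str:
        # New customers get the next provider in rotation; known ones keep theirs.
        if customer_id not in self._assigned:
            self._assigned[customer_id] = next(self._next)
        return self._assigned[customer_id]
```

With two providers, customers A, B, and C land on the first, second, and first provider respectively, which matches the example in the paragraph above.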

If you are limited as you mention to PayPal, Google, and a real payment
system, the best way there is to offer encouragement to use one system over
another. Whichever system you want to encourage use of.

You can also find a PCI compliant provider who you can then attach merchant
accounts to. They handle the PCI compliance, you provide the merchant
accounts.

Of course, none of these solutions are really as easy as just using PayPal.
But then you start to see why PayPal is so popular. It's downright easy.

~~~
rexreed
We were told (by Braintree) that you are not allowed to have multiple
simultaneous Merchant Accounts (not counting Paypal). Is this not true?

~~~
jasonlotito
Kristi answers you below (No, it's not a problem, at least, one I never
experienced with any banks I ever dealt with). I will, however, say this: each
bank is different. Trust the contract to define the terms. Beyond that, get second opinions
on everything. And then get contracts to back it up. You are dealing with
money. Probably a lot of money. Spend the time to understand exactly what you
are told. What you assume Braintree said may not be what they mean. And always
get a contract.

------
porlw
I'm glad. When it comes to problems with Credit Card storage, it's much more
secure to lose the data than to have backups lying about the place.

------
andyakb
I hope they are able to sort this out with a reasonable outcome and protect
against these failures in the future, but I can't be the only one who thinks
this post should have been a little more personal. When a company makes a
mistake like this, they need to accept responsibility, say how they will
prevent it from happening again, and apologize. It is too early for them to
say how they will prevent it, but simply adding something like "Hey guys, we
messed up, we are sorry and we are working around the clock to get all of you
back to normal" to the end of the post would go a long way.

------
mrbig
This means they lost all the credit card numbers.

I am a customer and would LOVE to cancel but they have us by the shorts.

Switching billing systems is a nightmare and they know it.

I wish recurly would go under because of this (only after I switch).

~~~
joeemison
It's really not that hard to switch from Recurly, since they will send your
stored credit cards (if they have recovered them) to another provider.

About a year ago we had a problem with Recurly and were dissatisfied with
their attitude toward the problem (they seemed to have a "crap happens and we
fix things fast" attitude instead of a "that should never have happened with
good engineering" attitude). We switched to Braintree over the course of about
5 weeks, and Recurly was very helpful in getting the credit card numbers over
(which took about 10 days total, as there was some kind of communications
issue at the beginning, but we got everything straight).

I would not recommend Recurly, but it isn't true that they "have you by the
shorts"--they'll help you leave their service.

------
loceng
Good luck to them. It sounds like a difficult situation to be in.

------
moe
_This failure corrupted encryption keys used to access stored credit cards to
process recurring transactions._

That sounds like a hilarious failure mode for a dedicated encryption
appliance.

Hopefully recurly will call names so we can avoid this vendor.

~~~
rdl
It's almost certainly either a Thales nCipher or a SafeNet. Very slight chance
it could be IBM or a couple of other manufacturers. All can fail in weird ways.

This is more a failure of implementation than a failure of the device. You
need some way to shield backups of keys from a failure of HSM; if they're
paired online for HA there should be a secret-shared backup outside HSM
storage. (usually people deploy two HSMs in the same datacenter for HA, so you
need an outside backup for DR if a fire happens or something)
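
To illustrate what a secret-shared backup outside the HSM might look like, here is a minimal sketch using XOR splitting, where every share is required to rebuild the key (n-of-n). This is an assumption for illustration, not what any HSM vendor ships; production setups more commonly use a threshold scheme such as Shamir's secret sharing so that a lost share is survivable. `split_key` and `combine_shares` are hypothetical names:

```python
import secrets


def split_key(key: bytes, shares: int) -> list[bytes]:
    """Split key into `shares` pieces; all pieces are needed to recover it."""
    if shares < 2:
        raise ValueError("need at least two shares")
    # All but the last share are uniformly random bytes.
    parts = [secrets.token_bytes(len(key)) for _ in range(shares - 1)]
    # The last share is the key XORed with every random share.
    last = key
    for part in parts:
        last = bytes(a ^ b for a, b in zip(last, part))
    parts.append(last)
    return parts


def combine_shares(parts: list[bytes]) -> bytes:
    """XOR all shares together to recover the original key."""
    key = bytes(len(parts[0]))  # all-zero starting value
    for part in parts:
        key = bytes(a ^ b for a, b in zip(key, part))
    return key
```

Each share would go to a different custodian or site, so that a failure of both paired HSMs does not take the only copy of the key with it.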

------
newobj
Brutal, especially when it seems like a real act-of-god type failure and not
just gross incompetence. Good luck to them.

------
markmm
A catastrophe like this could destroy a company. With all the cloud providers
etc., there is no excuse for losing a customer's data.

