We're very sorry about this. We work hard to maintain extreme reliability in our infrastructure, with a lot of redundancy at different levels. This morning, our API was heavily degraded (though not totally down) for 24 minutes.
We'll be conducting a thorough investigation and root-cause analysis.
This. Reading well-written post-mortems of outages for big and complex services like Stripe is just pure joy to me and feels very educational too. I remember reading Gitlab post-mortems earlier this year, and it felt really fresh, given how honest and open they were in those.
They're really helpful in preventing our own outages. Although it's ironic how many of them boil down to having things so well automated that one mistyped command can take down an entire environment with extreme efficiency.
This is causing a big problem for my business right now, but I am not mad at Stripe because you earned that level of credibility and respect in my opinion. I understand these things happen and am glad to know a team as excellent as Stripe's is on the job.
The Stripe outage that occurred because of a deleted DB index was particularly interesting.
Someone deleted an index before the replacement was live, due to a process error. This had cascading effects across the system and caused a large percentage of API requests to time out. It's such a pedestrian problem, but it had an enormous impact.
From 16:36 to 17:02, Stripe's systems saw elevated error rates and response times with the API. They have now recovered and are continuing to monitor, per their tweets on Twitter.
Between Cloudflare, Google, and now Stripe, I feel like there's been a huge cluster of services that never go down, going down. Curious to see Stripe's post-mortem here
I would love to see an industry analysis on this. What's the reason this is happening? High attrition from long time engineers? Large influx of green/new grad/code camp engineers? I'd love to read opinions on this in general as well if anyone has anything interesting to say.
(I work at AWS, but I'm commenting very generally)
Looking at many outages, the root cause is usually novel and the result of a combination of known and unknown changes to a system and its context. This includes your typical "operator did something too fast/too big/without code review", because there's usually something very interesting in how someone was able to do that in the first place. We should learn from them and mitigate them to our best ability, but IMO I don't think you can drive these novel events to zero.
What's more interesting (to me) is the blast radius of any given outage, along both externally visible and internally visible seam lines. For example, the EBS outage of 2011 should have been isolated to a single AZ, but caused impact in other AZs for customers because of regional coordination (and work was done to push more functionality into each AZ to improve isolation). The better we partition and isolate workloads within our services, the smaller the magnitude of any particular incident, and the easier it is for downstream users to move around it.
In my experience on services with billions of users - no one knows the whole thing. There are potentially thousands of hops in a round trip of a given system from the user to some source of truth and back. The larger companies grow, the more complex these systems get, the higher the load, and the more likely we are to see a break. Systems break constantly, recover constantly, and very rarely does the user see it. So perhaps another way to reframe the question is: why are users seeing it now?
I like this opinion. It underscores how much power we, as software engineers, have over the world. This is our new democracy. How do we convince people that we can move the world in the right direction re: pollution, human trafficking, equal rights, etc. if we join up collectively?
First step in convincing others would be to eliminate the elitist sentiment here. Implying that “we”, a tiny group of under-represented software engineers, are the new democracy?? Gimme a break.
Loved your first curious question about industry analysis.
With respect, I'm less of a fan of your second comment. Please know that I'm saying this with respect :)
> This is our new democracy.
I have a slight reaction to people talking about tech and "democracy". 4 people in SF can change some lines of code after a team meeting and tank a family business in Mumbai. That feels so profoundly undemocratic, on such a massive scale, that it hurts. (h/t the recent Upstream podcast with the "People's History of Silicon Valley" author, for the scenario)
Yes, we technologists sometimes feel our workplaces are more democratic compared to employers elsewhere, but outside that company, the spaces we live in are becoming less and less possible to scrutinize and speak up about.
I feel that maybe if we were running more worker co-ops in the tech industry, more ecologies in solidarity, more platform cooperatives -- then I might be able to bear using the word "democracy" to describe the things we're participating in...
> How do we convince people that we can move the world in the right direction re: pollution, human trafficking, equal rights, etc if we join up collectively?
Haven't we already shown them that we as technologists _can't_ lead that? We had a sandbox to prove something. It's San Francisco, and it's a dystopia for everyone but us. People rightfully are (and should be) very wary of trusting mainstream technologists and their worldview to solve much.
I would love to see us speak less of the power we have through occupying "structural holes" (the positional power of gatekeeping a resource, skill, or knowledge) and more about the power we have by being _support_ and by strengthening the relationships around us. This feels important. But it also dissolves the power we know. (Lots of research suggests masculine minds have tactics that are more likely to seek out and occupy structural holes in social networks, whereas feminine minds wire up the network around them; you might say they "repair" the hole.)
Anyhow, I say all this with love. I appreciate you. I just get frustrated, because I largely see the landscape of technology as breaking things and weakening the important features of our network -- features that the current cohort of technologists (through self-selection biases toward abstract thinking) perhaps can't see and don't know how to value.
It could just be random (or at least as random as this world can be). A situation where Cloudflare, Google, and Stripe go down is just as likely as any other situation. Just appears like a big deal because humans latch on to pattern matching.
If you haven't broken a critical system at least once, you haven't written enough production code. Everyone appreciates the other 99.993207% of the time where the system functions flawlessly. I look forward to reading the postmortem.
What a respectable comment. It's so easy to just gripe about downtime. Stripe is one of those companies that does take uptime seriously, but alas, as long as humans are at the helm there's always room for mistakes. As long as we learn from them.
In fairness, I know lots of people who have broken critical systems without having written a line of code. The screwdriver my friend dropped onto a server motherboard (point side down) is my favourite personal example, but there are plenty of others.
This is painful. I get a text notification every time a transaction fails... they're really flying in right now. Losing a ton of revenue and it is completely out of my hands.
While on paper it seems simple, it's worth investigating in detail how changing where/how payment details are transmitted and stored could change regulatory compliance requirements and liabilities of your business. It could be more time consuming and expensive than anticipated.
Create your Stripe token client side, send it to your API, indicate to the user that the payment is processing.
Your backend stores the PCI-compliant Stripe token in a queue, which a worker processes as and when it can - thereby allowing you to mitigate Stripe downtime.
The issue then becomes one of UX if the payment fails.
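For what it's worth, a minimal sketch of that flow in Python, using the Stripe library and an in-memory queue (a real system would use something durable; the function names, amounts, and retry interval here are just illustrative):

```python
import queue
import time

import stripe

stripe.api_key = "sk_test_..."  # your secret key

pending_charges = queue.Queue()  # stand-in for a durable queue (Redis, SQS, a DB table, ...)

def accept_payment(token_id, amount_cents, order_id):
    """Web tier: store the client-side token and return to the user immediately."""
    pending_charges.put({"token": token_id, "amount": amount_cents, "order": order_id})
    return {"order": order_id, "status": "processing"}

def charge_worker():
    """Background worker: drain the queue, retrying while Stripe is unreachable."""
    while True:
        job = pending_charges.get()
        try:
            stripe.Charge.create(
                amount=job["amount"],
                currency="usd",
                source=job["token"],           # tok_... created in the browser
                idempotency_key=job["order"],  # safe to retry the same order
            )
        except stripe.error.APIConnectionError:
            pending_charges.put(job)  # Stripe down/degraded: requeue and back off
            time.sleep(30)
        except stripe.error.CardError as e:
            # the UX problem mentioned above: real code would flag the order and notify the customer
            print(f"order {job['order']} declined: {e}")
```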
If Stripe is down, you can't create a Stripe token. Iirc tokens also expire fairly quickly (at least - in my testing, that appears to be the case. Perhaps it's different for different types of tokens.)
Are Stripe's systems isolated enough that their token system is disjoint from the charge system? Do we know what uptime for their token system looks like?
In my experience (processing several thousand payments with Stripe daily) when there are blips they do seem to be isolated to specific endpoints/entities.
This is a good idea. I've been working out a plan to move transaction processing to background processes to help with web throughput. I'd imagine I could solve for this problem at the same time.
I've thought about this, but I wonder what the UX is like.
You always show success? What sort of confirmation does the user get? If the card is declined, how do you notify them later? Would that notification confuse them?
"Thank you for placing order number x. Check your email for confirmation."
Email is somewhat immediate if the gateway was up, somewhat delayed if it was down. Regardless, it then offers order confirmation and shipping info, or it offers a card-declined-try-again flow.
Care to share them? So far I've found that only Stripe provides such a high quality of service for the pricing, but I'm genuinely interested in alternatives!
> Losing a ton of revenue and it is completely out of my hands.
That may be a bit exaggerated. While Stripe may be down and affecting your current setup, you could have planned for redundancy or resiliency against your payment capturing solution going down. No technology never breaks.
Yeah, depends on your business, but for us Stripe is only necessary for new customers or for folks to update their billing information once in a blue moon. I definitely envy anyone getting multiple new customers per minute.
Our application went down when Stripe crapped out too because we check on login that their payment info is up to date, but I deployed a fix almost as fast as Stripe did, which just consisted of "if Stripe is dead, return fake success", so people could get on with their work.
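Roughly this shape (a simplified sketch, not our actual code; the delinquency check is just one way of phrasing "payment info is up to date"):

```python
import stripe

def billing_is_current(customer_id):
    try:
        customer = stripe.Customer.retrieve(customer_id)
    except stripe.error.APIConnectionError:
        # Stripe is unreachable: return fake success so the outage doesn't lock
        # people out of their work; the next successful check catches real issues
        return True
    return not customer.get("delinquent", False)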
Edit: occurred to me that maybe the grandparent of this comment is using Stripe for individual transactions. If so, may I suggest you use a payment processor that won't take 2.9% + 30 cents per transaction? Those are relatively high rates. Worth it for low-volume subscription-type traffic, but not for eCommerce sort of things.
Edit 2: regarding the previous edit, it's complex, and it depends. You do you.
You can negotiate with Stripe if you're at high enough volumes. It's likely that the "best" choice of payment processor is heavily dependent on the specific business in question. If you need agility and developer friendliness, Stripe is hard to beat. If you're trying to grind out every last percent of margin, you'll have to shop around and see what you can negotiate (and the offers you get will likely depend on the nature of your business, chargebacks/fraud, etc).
I have to admit I was thinking primarily of my company's use case, which is serving brick-and-mortar. This is a pretty different picture from card-not-present transactions, but if you're a low-risk business from the point of view of credit card processors, 2.9% is still at the high end. If you're brick-and-mortar, you can get rates as low as 0.25% sometimes.
Fattmerchant, Gravity Payments, and Worldpay are all great options for brick and mortar, and offer online payments too. Paypal is also cheaper than Stripe for US businesses.
As always, it depends, and it's complex. I probably was too confident in my above answer.
Stripe is an aggregator, which means they collect all payments and distribute to their clientele. This is why merchant processors like Square and Stripe can often get their customers up and running more quickly. Lower underwriting requirements = less regulation on the merchant. The level of risk is higher, so they have to charge higher rates to cover their losses from fraud.
Gravity Payments is an Independent Sales Organization (ISO) which means they underwrite each merchant and "approve" each merchant account with their backend processor. This equals less fraud and more flexible pricing.
We do offer integrations and also have an online product that can process ecomm transactions for developer usage.
Easier said than done. Outside of the costs and overhead required to implement a secondary payment solution in the rare case your primary solution goes down, payment providers oftentimes require exclusivity agreements which prohibit this.
When P(payment provider outage) x (expected lost revenue in case of outage) > (costs of implementing and maintaining an independent alternative payment processor and automatic failover), then hoping for the best and writing a comment about pain in case of outage is still the best strategy.
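To make that concrete with entirely made-up numbers:

```python
outage_hours_per_year = 0.5        # assumption: ~30 minutes of provider downtime a year
revenue_per_hour = 500.0           # assumption: $500/hour flows through the gateway
expected_annual_loss = outage_hours_per_year * revenue_per_hour  # $250

failover_cost_per_year = 20_000.0  # assumption: build + maintenance + compliance overhead

if expected_annual_loss < failover_cost_per_year:
    print("ride it out and vent on HN")
else:
    print("a second processor starts to pay for itself")
```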
Most likely it's worth taking the revenue loss rather than building a redundant payment provider. That doesn't mean taking the revenue loss doesn't hurt.
Exactly. If it starts happening frequently, maybe it's worth looking into. But I think there's a middle ground between 'worthy of complaining on hacker news' and 'need to implement and support a redundant backup payment system.'
KATG listener since 2006. Awesome to see you on here.
You're right, you would never believe the amount of effort required to make an HN post. I've often not only implemented backup payment provider solutions instead, but tertiary ones too. In fact, instead of writing this comment I was 90% of the way to starting my own payment network.
I get that point, but I run a platform powered by Stripe Connect. Redundancy at that level would require the customers who sell their products through my platform to set up an additional account, go through additional KYC, etc - which is unrealistic. Alternatively, I could register my business as a PayFac, which costs a ton of money and depending on your network, also faces outages from time to time.
what's to exploit? If they're logging in, they have the credentials already.
I get that someone could maybe somehow avoid updating the stripe info, but that will fail the next time a charge goes through, so it's not as if there's a lot of fallout from it. Without even questioning why someone would go through the trouble in the first place.
The best I can think of would be to have a feature toggle that can be manually flipped by a developer and route transactions through PayPal when the toggle is flipped. This would solve the ability to collect payments for new customers, but there would have to be some sort of reconciliation/sync when Stripe comes back up to migrate the customers back to Stripe, otherwise you'll have a handful of customers in PayPal indefinitely.
Alternatively, it may be better to cache the orders until Stripe comes back online and run them then, but then you're storing CC details on your servers...
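A rough sketch of the toggle idea (the gateway functions are stand-ins for your own Stripe and PayPal integrations, and PAYMENTS_BACKUP_MODE is an imagined flag a developer flips during an outage):

```python
import os

def charge_with_stripe(order):           # placeholder for the normal Stripe path
    raise NotImplementedError

def charge_with_backup_gateway(order):   # placeholder for, e.g., a PayPal integration
    raise NotImplementedError

def record_for_reconciliation(order, gateway):
    # remember which customers went through the backup gateway so they can be
    # reconciled/migrated back to Stripe once it recovers
    pass

def take_payment(order):
    if os.environ.get("PAYMENTS_BACKUP_MODE") == "1":  # the manual feature toggle
        result = charge_with_backup_gateway(order)
        record_for_reconciliation(order, gateway="backup")
        return result
    return charge_with_stripe(order)
```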
You could find a way to do a soft failure. In the case of a payment gateway failure, take the order and follow it up manually once the gateway is back up. This may not work for every business, but an eComm business could just hold the fulfilment until payment is captured, and a subscription service could let the subscription run on a short grace period and follow it up.
That could all be done programmatically, or you could give them a call or send an email.
Not really. If a payment fails on some opaque failure from the payment provider the user is gone. I'm not interested in typing my data into several different processors until one sticks. I'm looking for your product somewhere else. Payments must work.
>That may be a bit exaggerated.
It's really not, though. As of the time of writing, the customers have failed to sign up, and there's nothing to do about that on the fly, right now, today. Saying you "could have done X" doesn't mean that the problem isn't happening.
"Your house isn't on fire, you just haven't properly fireproofed it" isn't really helpful to anyone when their house is literally on fire.
If someone cuts the phone lines into a building, whether with malice aforethought or a badly aimed backhoe, the phones in that building will not serve their intended purpose.
I agree that degrees of resilience are a thing, and that different kinds of systems have different failure modes (each of which may deal with a different aspect of resiliency), but I am firmly convinced that no technology never breaks.
Well, Wikipedia's uptime in 2017 was 99.97%, according to [1]. That's over 2h30m, so if this is Stripe's first downtime of the year, they're still in the lead.
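For anyone checking the math (assuming the 99.97% figure covers a full calendar year):

```python
downtime_hours = (1 - 0.9997) * 365.25 * 24
print(round(downtime_hours, 2))  # ~2.63 hours, i.e. a bit over 2h30m
```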
> What was downtime of your telephone line over last 5 years
Assuming they even have a 'telephone' landline, are they even measuring?
Plus, telephone is old tech. This is apples to oranges. You know what else hardly ever fails? The water supply to my house. But people have been building aqueducts since Babylon.
This is where solutions like Spreedly and TokenX make a lot of sense. Once the payment method is stored in their vault, you can try (and retry!) payments on multiple gateways.
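This is not the actual Spreedly/TokenX API, just a sketch of the idea: once the card lives in a gateway-agnostic vault, the same charge can be retried against an ordered list of gateways until one of them accepts it.

```python
class AllGatewaysFailed(Exception):
    pass

def charge_via_vault(vault_token, amount_cents, gateway_clients):
    """gateway_clients: ordered callables, each wrapping one processor's charge API."""
    last_error = None
    for charge in gateway_clients:
        try:
            return charge(vault_token, amount_cents)
        except ConnectionError as e:  # this gateway is down or degraded; try the next
            last_error = e
    raise AllGatewaysFailed(f"no gateway accepted the {amount_cents}-cent charge") from last_error
```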
I wonder what the global cost to the economy of a 24-hour Stripe outage would be. It's crazy when you think about how important certain "infrastructure" is.
Oh man, I'd love to see the aggregate data you've collected over time on some of the services you support! Not to name and shame, but it'd be interesting to see how services rank on reliability.
Don’t know how many up/down votes you’re getting, but a more polite wording is something like:
“In case others find this useful, this is why I built statusnotify.com. I got a notification about this 14 minutes ago.”
Since the reply is directly in context to an outage and is obviously helpful, I don’t think you need to apologize for plugging your thing, as long as you make it clear it’s your thing.
Service looks neat by the way, thanks for sharing. :)
Er, sorry, ironically bad wording on my part. There was a sibling comment calling out the OP for self-promotion and I was trying to suggest an alternative wording that might have avoided that. Not really "politeness" more like... "cordial self-promotion?"
How was it disingenuous? It is not a 'plug' it is a very relevant link. Being disingenuous would be to say "Hey guys, don't mean to plug my website, but here's my basket weaving blog."
Who. Cares. It's a colloquial usage of the phrase "not to". The clear meaning is that the writer acknowledges a potential negative perception, as a show of good faith, but still believes there's a valid reason to continue.
It's a definition thing. In my opinion it's still a plug even if it's relevant (and that's how some other people seem to read the word too here). So just come right out with it instead of doing the "hey not to plug my website but let me just plug my website" thing
I don't think it was disingenuous, I read it more like... "normally I wouldn't randomly plug my website, but given the context, I think this thing I built might be super useful to someone".
As someone downstream of providers like Stripe who is on call for issues like this, that term is actually quite helpful to me. It tells me that I should be expecting delays and timeouts, and that some percentage of operations are likely to complete, whereas a complete outage likely means requests are failing immediately or failing to connect. This is important information when reviewing our options. During a full outage, aside from failover (when possible and not automated), we usually don’t need to take any action. When dealing with greatly increased error rates, it may be beneficial for us to disable the API on our end in order to avoid a lot of hung open connections and delayed responses for our users. We’d rather that operations fail immediately and completely instead of forcing users to wait around for operations that are unlikely to complete anyway.
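A bare-bones circuit breaker along those lines (the class name and thresholds are arbitrary): after a few consecutive failures, stop calling the degraded provider for a while and fail fast instead of leaving users waiting on hung connections.

```python
import time

class CircuitOpen(Exception):
    pass

class CircuitBreaker:
    """Fail fast once a provider looks degraded, instead of piling up hung connections."""

    def __init__(self, max_failures=5, reset_after_seconds=60):
        self.max_failures = max_failures
        self.reset_after = reset_after_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        now = time.monotonic()
        if self.opened_at is not None and now - self.opened_at < self.reset_after:
            raise CircuitOpen("provider degraded; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures or self.opened_at is not None:
                self.opened_at = now  # (re)open: threshold hit, or the probe request failed
            raise
        self.failures = 0
        self.opened_at = None
        return result

# usage: breaker = CircuitBreaker(); breaker.call(some_provider_call, arg1, arg2)
```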
I'd agree if that were actually true, but it's not.
With large enough services there is always some acceptable level of errors due to 0.001% probability events. When there's an outage, it's not usually everything down, but even 0.1% of jobs failing ends up affecting a lot of users.
Even 10% of jobs failing still isn't "down", it's "partly down", even if you have to issue credits for SLA violations and publish a public postmortem later.
You just announce it as maintenance/degraded service and handle it like a grown-up company.
If you lie and it gets out, you trash your credibility, and for a company like Stripe, which handles money and is taking on some ancient and major systems, credibility is pretty important.
I have built APIs in the finance realm with 100% uptime. I have also used Stripe in the past, and I wonder why you can't achieve 100% uptime for your users. Are there regulatory constraints that prevent you from designing such a system?
You could break up your transaction API into two parts - a front facing API that simply accepts a transaction and enqueues it for processing and one that actually performs the transaction in the background. The front facing API should have low complexity and rarely change. It can persist transactions in a KV store like Cassandra to maximize availability.
The backend API that performs the transaction can have higher complexity and can afford to have lower availability. From the client's perspective, you could either respond immediately (HTTP 200) or with accepted (HTTP 202). In either case the client will be happier than the transaction failing outright.
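Roughly what I have in mind, as a Flask sketch (the route names and storage layer are illustrative; a real deployment would persist to a replicated store rather than an in-memory dict):

```python
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)
accepted = {}  # stand-in for a replicated, highly available KV store (e.g. Cassandra)

@app.route("/transactions", methods=["POST"])
def accept_transaction():
    payload = request.get_json()
    txn_id = str(uuid.uuid4())
    accepted[txn_id] = {"status": "queued", "payload": payload}
    # the failure-prone work (card networks, fraud checks, ledger writes) happens
    # in a separate background processor that reads from the store, not here
    return jsonify({"id": txn_id, "status": "queued"}), 202
```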
I am sure your engineers have put a lot of thought into designing this system, but 24 minutes of downtime is unacceptable in the finance domain unless you expect your users to retry failed transactions, which defeats the point of using Stripe.
Edit: Can someone explain why I am being downvoted? Rather than downvoting, can you provide arguments that make sense?
I used to work at Stripe, though not for quite a while. My job was focused on both increasing capacity and minimizing downtime. I have no information whatsoever on the outage, but I think I know why you are getting downvoted.
I suspect the reason is that you are bringing less to the conversation than you think. First, you are bragging and asking for something unreasonable (100% uptime over the internet). Every system like this faces some downtime. Maybe it's as high as 7, 8, or even 9 nines, but some degradation is unavoidable.
Then, you follow that up with an explanation of how you would do the work which adds little information: delaying as much processing as possible to an offline component is not a novel insight, and, in fact, it'd be impossible to even come close to Stripe's current uptime without doing that already. I don't think there's been a Stripe outage close to this magnitude since winter 2015, when multiple coincidental failures led to a failing persistence layer (not unlike the Cassandra you mention in your sample architecture) that stopped accepting writes. Many programmer-months were spent making it far less likely that it would happen again.
Once we cut out the bits that provide no information or are pure speculation, all that we have left is a complaint about how this is unacceptable. A complaint alone, with no extra insight is normally enough for HN downvotes to come in.
You're being downvoted because every system—no matter how perfect it seems—is vulnerable to downtime. Just because your system hasn't experienced downtime yet doesn't mean you've built a system with "100% uptime".
My laptop's hard drive has 100% reliability to date. Doesn't mean I'm not making backups.
I disagree. The system has had 100% reliability for several years. I know it is unbelievable, but it's true. That doesn't mean it doesn't suffer from failures in one or multiple AZs, or that it is perfect.
100% reliability needs to also be measured by usage. It's easy to get that if you have 1 or 10 customers vs a hundred customers. How many unique customers & transactions were you seeing?
"Several years" of uptime, sample size of one. Most people here probably have systems like that, but would never have the chutzpah to believe that they'd totally engineered away the potential for downtime.
It's sad that you're being downvoted, but how do you deal with shipping the product without a confirmation? If someone buys a product and the payments API merely accepts my request to submit their credit card, I might want to know whether it's accepted or declined before acting. Some businesses can afford to work that way and can undo changes. But if the customer expects their product/service immediately and I hand it over, only to find out 20 minutes later that their card was declined, then the merchant is out of luck. In which case they will go after Stripe to cover their losses. Perhaps Stripe should have two APIs: one that fails immediately and one that queues requests, for merchants who can act slowly. Shipping can wait for the transaction before going out; downloading an ebook can't wait. The merchant will then have to decide which way to go based on their business.
Simply put, for the simple "front API" can you set up a Cassandra cluster with 100% uptime? A virtual machine with 100% uptime? A rack with 100% uptime, or even electricity to that rack with 100% uptime, etc.?
Your uptime is only as good as your downstream components, and no downstream component will give you 100% guarantee. You can have redundancy on top of redundancy (like space systems), but that will just stretch out your nines at best.
For the same reason your downstream components cannot guarantee you 100% uptime, you also cannot guarantee 100% uptime for a new system in isolation, for reasons the sibling comments go into.
If you have Cassandra (or a Cassandra-like DB) running across multiple DCs, you can definitely mitigate node, rack, and even DC failures for 100% uptime.
Just because a node or DC fails doesn't mean there is a user visible impact.
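To sketch what that looks like with the DataStax Python driver (the keyspace, DC names, and addresses are made up):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# contact points in two different datacenters (example addresses)
cluster = Cluster(["10.0.0.1", "10.1.0.1"])
session = cluster.connect()

# replicate every row three times in each DC
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS payments
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_us': 3, 'dc_eu': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS payments.pending (id text PRIMARY KEY, payload text)
""")

# LOCAL_QUORUM only needs a majority of replicas in the local DC, so this write
# succeeds even if the other DC (or a local node/rack) is unreachable
insert = SimpleStatement(
    "INSERT INTO payments.pending (id, payload) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
session.execute(insert, ("txn-123", '{"amount": 100}'))
```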
I mean, I get it, but you are holding companies to a standard that isn't the norm at all. This doesn't excuse Stripe's outage, but your complaint and armchair advice, without even knowing the cause of the issue or that company's internal setup, is obviously going to attract downvotes.
Even perfectly designed systems can have flaws. In your design you assume the frontend part will have 100% uptime - practice shows that your hosting just can't provide it. AWS, GCP, Azure, you name it - all of them have failures.
Your statement holds true if you use a single cloud provider. FTR we ran our own DCs with several AZs spread globally. We did suffer failures in individual DCs occasionally but there was zero user-visible impact which is the whole point.
Sometimes a mistake in router/balancer rules can make your servers inaccessible to users. Really huge systems sometimes can't be federated. I agree we need to design systems to be fault-tolerant and highly available, but I also know there is no recipe suitable for every system.
Agreed. What I said does not protect against BGP blackholing or ISP screw-ups rendering services inaccessible. However, all I was pointing out was that services could be built for higher uptime than what is currently being offered.