We're very sorry about this. We work hard to maintain extreme reliability in our infrastructure, with a lot of redundancy at different levels. This morning, our API was heavily degraded (though not totally down) for 24 minutes.
We'll be conducting a thorough investigation and root-cause analysis.
This. Reading well-written post-mortems of outages for big and complex services like Stripe is just pure joy to me and feels very educational too. I remember reading Gitlab post-mortems earlier this year, and it felt really fresh, given how honest and open they were in those.
They're really helpful in preventing our own outages. Although it's ironic how many of them boil down to having things so well automated that one mistyped command can take down an entire environment with extreme efficiency.
This is causing a big problem for my business right now, but I am not mad at Stripe because you earned that level of credibility and respect in my opinion. I understand these things happen and am glad to know a team as excellent as Stripe's is on the job.
The Stripe outage that occurred because of a deleted DB index was particularly interesting.
Someone deleted an index before the replacement was live, due to a process error. This had cascading effects across the system and caused a large percentage of API requests to time out. It's such a pedestrian problem, but it had an enormous impact.
From 16:36 to 17:02, Stripe's systems saw elevated error rates and response times with the API. They have now recovered and are continuing to monitor, per their tweets on Twitter.
Between Cloudflare, Google, and now Stripe, I feel like there's been a huge cluster of services that never go down, going down. Curious to see Stripe's post-mortem here
I would love to see an industry analysis on this. What's the reason this is happening? High attrition from long time engineers? Large influx of green/new grad/code camp engineers? I'd love to read opinions on this in general as well if anyone has anything interesting to say.
(I work at AWS, but I'm commenting very generally)
Looking at many outages, the root cause is usually novel and the result of a combination of known and unknown changes to a system and its context. This includes your typical "operator did something too fast/too big/without code review", because there's usually something very interesting in how someone was able to do that in the first place. We should learn from them and mitigate them to our best ability, but IMO I don't think you can drive these novel events to zero.
What's more interesting (to me) is the blast radius of any given outage, along both externally visible and internally visible seam lines. For example, the EBS outage of 2011 should have been isolated to a single AZ, but caused impact in other AZs for customers because of regional coordination (and work was done to push more functionality into each AZ to improve isolation). The better we partition and isolate workloads within our services, the smaller the magnitude of any particular incident, and the easier it is for downstream users to move around it.
In my experience on services with billions of users - no one knows the whole thing. There are potentially thousands of hops in a round trip of a given system from the user to some source of truth and back. The larger companies grow, the more complex these systems get, the higher the load, and the more likely we are to see a break. Systems break constantly, recover constantly, and very rarely does the user see it. So perhaps another way to reframe the question is: why are users seeing it now?
I like this opinion. It underscores how much power we, as software engineers, have over the world. This is our new democracy. How do we convince people that we can move the world in the right direction re: pollution, human trafficking, equal rights, etc. if we join up collectively?
First step in convincing others would be to eliminate the elitist sentiment here. Implying that “we”, a tiny group of under-represented software engineers, are the new democracy?? Gimme a break.
Loved your first curious question about industry analysis.
With respect, I'm less of a fan of your second comment. Please know that I'm saying this with respect :)
> This is our new democracy.
I have a slight reaction to people talking about tech and "democracy". 4 people in SF can change some lines of code after a team meeting and tank a family business in Mumbai. That feels so profoundly undemocratic, on such a massive scale, that it hurts. (h/t the recent Upstream podcast with the "People's History of Silicon Valley" author, for the scenario)
Yes, we technologists sometimes feel our workplaces are more democratic compared to employers elsewhere, but outside that company, the spaces we live in are becoming less and less possible to scrutinize and speak up about.
I feel that maybe if we were running more worker co-ops in the tech industry, more ecologies in solidarity, more platform cooperatives -- then I might be able to bear using the word "democracy" to describe the things we're participating in...
> How do we convince people that we can move the world in the right direction re: pollution, human trafficking, equal rights, etc if we join up collectively?
Haven't we already shown them that we as technologists _can't_ lead that? We had a sandbox to prove something. It's San Francisco, and it's a dystopia for everyone but us. People rightfully are (and should be) very wary of trusting mainstream technologists and their worldview to solve much.
I would love to see us speak less of the power we have through occupying "structural holes" (the positional power of gatekeeping a resource, skill, or knowledge) and more about the power we have by being _support_ and by strengthening the relationships around us. This feels important. But it also dissolves the power we know. (Lots of research suggests masculine minds have tactics that are more likely to seek out and occupy structural holes in social networks, whereas feminine minds wire up the network around them; you might say they "repair" the hole.)
Anyhow, I say all this with love. I appreciate you. I just get frustrated, because I largely see the landscape of technology as breaking things and weakening the important features of our network -- features that the current cohort of technologists (through self-selection biases toward abstract thinking) perhaps can't see and don't know how to value.
It could just be random (or at least as random as this world can be). A situation where Cloudflare, Google, and Stripe go down is just as likely as any other situation. Just appears like a big deal because humans latch on to pattern matching.
If you haven't broken a critical system at least once, you haven't written enough production code. Everyone appreciates the other 99.993207% of the time where the system functions flawlessly. I look forward to reading the postmortem.
What a respectable comment. It's so easy to just gripe about downtime. Stripe is one of those companies that does take uptime seriously, but alas, as long as humans are at the helm there's always room for mistakes. As long as we learn from them.
In fairness, I know lots of people who have broken critical systems without having written a line of code. The screwdriver my friend dropped onto a server motherboard (point side down) is my favourite personal example, but there are plenty of others.
This is painful. I get a text notification every time a transaction fails... they're really flying in right now. Losing a ton of revenue and it is completely out of my hands.
While on paper it seems simple, it's worth investigating in detail how changing where/how payment details are transmitted and stored could change regulatory compliance requirements and liabilities of your business. It could be more time consuming and expensive than anticipated.
Create your Stripe token client side, send it to your API, indicate to the user that the payment is processing.
Your backend stores the PCI-compliant Stripe token in a queue, which a worker processes as and when it can - thereby allowing you to mitigate Stripe downtime.
The issue then becomes one of UX if the payment fails.
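For what it's worth, a minimal sketch of that flow in Python, using the Stripe library and an in-memory queue (a real system would use something durable; the function names, amounts, and retry interval here are just illustrative):

```python
import queue
import time

import stripe

stripe.api_key = "sk_test_..."  # your secret key

pending_charges = queue.Queue()  # stand-in for a durable queue (Redis, SQS, a DB table, ...)

def accept_payment(token_id, amount_cents, order_id):
    """Web tier: store the client-side token and return to the user immediately."""
    pending_charges.put({"token": token_id, "amount": amount_cents, "order": order_id})
    return {"order": order_id, "status": "processing"}

def charge_worker():
    """Background worker: drain the queue, retrying while Stripe is unreachable."""
    while True:
        job = pending_charges.get()
        try:
            stripe.Charge.create(
                amount=job["amount"],
                currency="usd",
                source=job["token"],           # tok_... created in the browser
                idempotency_key=job["order"],  # safe to retry the same order
            )
        except stripe.error.APIConnectionError:
            pending_charges.put(job)  # Stripe down/degraded: requeue and back off
            time.sleep(30)
        except stripe.error.CardError as e:
            # the UX problem mentioned above: real code would flag the order and notify the customer
            print(f"order {job['order']} declined: {e}")
```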
If Stripe is down, you can't create a Stripe token. Iirc tokens also expire fairly quickly (at least - in my testing, that appears to be the case. Perhaps it's different for different types of tokens.)
Are Stripe's systems isolated enough that their token system is disjoint from the charge system? Do we know what uptime for their token system looks like?
In my experience (processing several thousand payments with Stripe daily) when there are blips they do seem to be isolated to specific endpoints/entities.
This is a good idea. I've been working out a plan to move transaction processing to background processes to help with web throughput. I'd imagine I could solve for this problem at the same time.
I've thought about this, but I wonder what the UX is like.
You always show success? What sort of confirmation does the user get? If the card is declined, how do you notify them later? Would that notification confuse them?
"Thank you for placing order number x. Check your email for confirmation."
Email is somewhat immediate if the gateway was up, somewhat delayed if it was down. Regardless, it then offers order confirmation and shipping info, or it offers a card-declined-try-again flow.
Care to share them? So far I've found that only Stripe provides such a high quality of service for the pricing, but I'm genuinely interested in alternatives!
> Losing a ton of revenue and it is completely out of my hands.
That may be a bit exaggerated. While Stripe may be down and affecting your current setup, you could have planned for redundancy or resiliency against your payment capturing solution going down. No technology never breaks.
Yeah, depends on your business, but for us Stripe is only necessary for new customers or for folks to update their billing information once in a blue moon. I definitely envy anyone getting multiple new customers per minute.
Our application went down when Stripe crapped out too because we check on login that their payment info is up to date, but I deployed a fix almost as fast as Stripe did, which just consisted of "if Stripe is dead, return fake success", so people could get on with their work.
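Roughly this shape (a simplified sketch, not our actual code; the delinquency check is just one way of phrasing "payment info is up to date"):

```python
import stripe

def billing_is_current(customer_id):
    try:
        customer = stripe.Customer.retrieve(customer_id)
    except stripe.error.APIConnectionError:
        # Stripe is unreachable: return fake success so the outage doesn't lock
        # people out of their work; the next successful check catches real issues
        return True
    return not customer.get("delinquent", False)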
Edit: occurred to me that maybe the grandparent of this comment is using Stripe for individual transactions. If so, may I suggest you use a payment processor that won't take 2.9% + 30 cents per transaction? Those are relatively high rates. Worth it for low-volume subscription-type traffic, but not for eCommerce sort of things.
Edit 2: regarding the previous edit, it's complex, and it depends. You do you.
You can negotiate with Stripe if you're at high enough volumes. It's likely that the "best" choice of payment processor is heavily dependent on the specific business in question. If you need agility and developer friendliness, Stripe is hard to beat. If you're trying to grind out every last percent of margin, you'll have to shop around and see what you can negotiate (and the offers you get will likely depend on the nature of your business, chargebacks/fraud, etc).
I have to admit I was thinking primarily of my company's use case, which is serving brick-and-mortar. This is a pretty different picture from card-not-present transactions, but if you're a low-risk business from the point of view of credit card processors, 2.9% is still at the high end. If you're brick-and-mortar, you can get rates as low as 0.25% sometimes.
Fattmerchant, Gravity Payments, and Worldpay are all great options for brick and mortar, and offer online payments too. Paypal is also cheaper than Stripe for US businesses.
As always, it depends, and it's complex. I probably was too confident in my above answer.
Stripe is an aggregator, which means they collect all payments and distribute to their clientele. This is why merchant processors like Square and Stripe can often get their customers up and running more quickly. Lower underwriting requirements = less regulation on the merchant. The level of risk is higher, so they have to charge higher rates to cover their losses from fraud.
Gravity Payments is an Independent Sales Organization (ISO) which means they underwrite each merchant and "approve" each merchant account with their backend processor. This equals less fraud and more flexible pricing.
We do offer integrations and also have an online product that can process ecomm transactions for developer usage.
Easier said than done. Outside of the costs and overhead required to implement a secondary payment solution in the rare case your primary solution goes down, payment providers oftentimes require exclusivity agreements which prohibit this.
When P(payment provider outage) x (expected lost revenue in case of outage) > (costs of implementing and maintaining an independent alternative payment processor and automatic failover), then hoping for the best and writing a comment about pain in case of outage is still the best strategy.
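To make that concrete with entirely made-up numbers:

```python
outage_hours_per_year = 0.5        # assumption: ~30 minutes of provider downtime a year
revenue_per_hour = 500.0           # assumption: $500/hour flows through the gateway
expected_annual_loss = outage_hours_per_year * revenue_per_hour  # $250

failover_cost_per_year = 20_000.0  # assumption: build + maintenance + compliance overhead

if expected_annual_loss < failover_cost_per_year:
    print("ride it out and vent on HN")
else:
    print("a second processor starts to pay for itself")
```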
Most likely it's worth taking the revenue loss rather than building a redundant payment provider. That doesn't mean taking the revenue loss doesn't hurt.
Exactly. If it starts happening frequently, maybe it's worth looking into. But I think there's a middle ground between 'worthy of complaining on hacker news' and 'need to implement and support a redundant backup payment system.'
KATG listener since 2006. Awesome to see you on here.
You're right, you would never believe the amount of effort required to make an HN post. I've often not only implemented backup payment provider solutions instead, but tertiary ones too. In fact, instead of writing this comment I was 90% of the way to starting my own payment network.
I get that point, but I run a platform powered by Stripe Connect. Redundancy at that level would require the customers who sell their products through my platform to set up an additional account, go through additional KYC, etc - which is unrealistic. Alternatively, I could register my business as a PayFac, which costs a ton of money and depending on your network, also faces outages from time to time.
what's to exploit? If they're logging in, they have the credentials already.
I get that someone could maybe somehow avoid updating the stripe info, but that will fail the next time a charge goes through, so it's not as if there's a lot of fallout from it. Without even questioning why someone would go through the trouble in the first place.
The best I can think of would be to have a feature toggle that can be manually flipped by a developer and route transactions through PayPal when the toggle is flipped. This would solve the ability to collect payments for new customers, but there would have to be some sort of reconciliation/sync when Stripe comes back up to migrate the customers back to Stripe, otherwise you'll have a handful of customers in PayPal indefinitely.
Alternatively, it may be better to cache the orders until Stripe comes back online and run them then, but then you're storing CC details on your servers...
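A rough sketch of the toggle idea (the gateway functions are stand-ins for your own Stripe and PayPal integrations, and PAYMENTS_BACKUP_MODE is an imagined flag a developer flips during an outage):

```python
import os

def charge_with_stripe(order):           # placeholder for the normal Stripe path
    raise NotImplementedError

def charge_with_backup_gateway(order):   # placeholder for, e.g., a PayPal integration
    raise NotImplementedError

def record_for_reconciliation(order, gateway):
    # remember which customers went through the backup gateway so they can be
    # reconciled/migrated back to Stripe once it recovers
    pass

def take_payment(order):
    if os.environ.get("PAYMENTS_BACKUP_MODE") == "1":  # the manual feature toggle
        result = charge_with_backup_gateway(order)
        record_for_reconciliation(order, gateway="backup")
        return result
    return charge_with_stripe(order)
```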
You could find a way to do a soft failure. In the case of a payment gateway failure, take the order and follow it up manually once the gateway is back up. This may not work for every business, but an eComm business could just hold the fulfilment until payment is captured, and a subscription service could let the subscription run on a short grace period and follow it up.
That could all be done programmatically, or you could give them a call or send an email.
Not really. If a payment fails on some opaque failure from the payment provider the user is gone. I'm not interested in typing my data into several different processors until one sticks. I'm looking for your product somewhere else. Payments must work.
>That may be a bit exaggerated.
It's really not, though. As of the time of writing, the customers have failed to sign up, and there's nothing to do about that on the fly, right now, today. Saying you "could have done X" doesn't mean that the problem isn't happening.
"Your house isn't on fire, you just haven't properly fireproofed it" isn't really helpful to anyone when their house is literally on fire.
If someone cuts the phone lines into a building, whether with malice aforethought or a badly aimed backhoe, the phones in that building will not serve their intended purpose.
I agree that degrees of resilience are a thing, and that different kinds of systems have different failure modes (each of which may deal with a different aspect of resiliency), but I am firmly convinced that no technology never breaks.
Well, Wikipedia's uptime in 2017 was 99.97%, according to [1]. That's over 2h30m, so if this is Stripe's first downtime of the year, they're still in the lead.
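For anyone checking the math (assuming the 99.97% figure covers a full calendar year):

```python
downtime_hours = (1 - 0.9997) * 365.25 * 24
print(round(downtime_hours, 2))  # ~2.63 hours, i.e. a bit over 2h30m
```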
> What was downtime of your telephone line over last 5 years
Assuming they even have a 'telephone' landline, are they even measuring?
Plus, telephone is old tech. This is apples to oranges. You know what else hardly ever fails? The water supply to my house. But people have been building aqueducts since Babylon.
This is where solutions like Spreedly and TokenX make a lot of sense. Once the payment method is stored in their vault, you can try (and retry!) payments on multiple gateways.
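This is not the actual Spreedly/TokenX API, just a sketch of the idea: once the card lives in a gateway-agnostic vault, the same charge can be retried against an ordered list of gateways until one of them accepts it.

```python
class AllGatewaysFailed(Exception):
    pass

def charge_via_vault(vault_token, amount_cents, gateway_clients):
    """gateway_clients: ordered callables, each wrapping one processor's charge API."""
    last_error = None
    for charge in gateway_clients:
        try:
            return charge(vault_token, amount_cents)
        except ConnectionError as e:  # this gateway is down or degraded; try the next
            last_error = e
    raise AllGatewaysFailed(f"no gateway accepted the {amount_cents}-cent charge") from last_error
```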
I wonder what the global cost to the economy of a 24-hour Stripe outage would be. It's crazy when you think about how important certain "infrastructure" is.
Oh man, I'd love to see the aggregate data you've collected over time on some of the services you support! Not to name and shame, but it'd be interesting to see how services rank on reliability.
Don’t know how many up/down votes you’re getting, but a more polite wording is something like:
“In case others find this useful, this is why I built statusnotify.com. I got a notification about this 14 minutes ago.”
Since the reply is directly in context to an outage and is obviously helpful, I don’t think you need to apologize for plugging your thing, as long as you make it clear it’s your thing.
Service looks neat by the way, thanks for sharing. :)
Er, sorry, ironically bad wording on my part. There was a sibling comment calling out the OP for self-promotion and I was trying to suggest an alternative wording that might have avoided that. Not really "politeness" more like... "cordial self-promotion?"
How was it disingenuous? It is not a 'plug' it is a very relevant link. Being disingenuous would be to say "Hey guys, don't mean to plug my website, but here's my basket weaving blog."
Who. Cares. It's a colloquial usage of the phrase "not to". The clear meaning is that the writer acknowledges a potential negative perception, as a show of good faith, but still believes there's a valid reason to continue.
It's a definition thing. In my opinion it's still a plug even if it's relevant (and that's how some other people seem to read the word too here). So just come right out with it instead of doing the "hey not to plug my website but let me just plug my website" thing
I don't think it was disingenuous, I read it more like... "normally I wouldn't randomly plug my website, but given the context, I think this thing I built might be super useful to someone".
As someone downstream of providers like Stripe who is on call for issues like this, that term is actually quite helpful to me. It tells me that I should be expecting delays and timeouts, and that some percentage of operations are likely to complete, whereas a complete outage likely means requests are failing immediately or failing to connect. This is important information when reviewing our options. During a full outage, aside from failover (when possible and not automated), we usually don’t need to take any action. When dealing with greatly increased error rates, it may be beneficial for us to disable the API on our end in order to avoid a lot of hung open connections and delayed responses for our users. We’d rather that operations fail immediately and completely instead of forcing users to wait around for operations that are unlikely to complete anyway.
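A bare-bones circuit breaker along those lines (the class name and thresholds are arbitrary): after a few consecutive failures, stop calling the degraded provider for a while and fail fast instead of leaving users waiting on hung connections.

```python
import time

class CircuitOpen(Exception):
    pass

class CircuitBreaker:
    """Fail fast once a provider looks degraded, instead of piling up hung connections."""

    def __init__(self, max_failures=5, reset_after_seconds=60):
        self.max_failures = max_failures
        self.reset_after = reset_after_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        now = time.monotonic()
        if self.opened_at is not None and now - self.opened_at < self.reset_after:
            raise CircuitOpen("provider degraded; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures or self.opened_at is not None:
                self.opened_at = now  # (re)open: threshold hit, or the probe request failed
            raise
        self.failures = 0
        self.opened_at = None
        return result

# usage: breaker = CircuitBreaker(); breaker.call(some_provider_call, arg1, arg2)
```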
I'd agree if that were actually true, but it's not.
With large enough services there is always some acceptable level of errors due to 0.001% probability events. When there's an outage, it's not usually everything down, but even 0.1% of jobs failing ends up affecting a lot of users.
Even 10% of jobs failing still isn't "down", it's "partly down", even if you have to issue credits for SLA violations and publish a public postmortem later.
You just announce it as maintenance/degraded service and handle it like a grown-up company.
If you lie and it gets out, you trash your credibility, and for a company like Stripe, which handles money and is taking on some ancient and major systems, credibility is pretty important.
I have built APIs in the finance realm with 100% uptime. I have also used Stripe in the past, and I wonder why you can't achieve 100% uptime for your users. Are there regulatory constraints that prevent you from designing such a system?
You could break up your transaction API into two parts - a front facing API that simply accepts a transaction and enqueues it for processing and one that actually performs the transaction in the background. The front facing API should have low complexity and rarely change. It can persist transactions in a KV store like Cassandra to maximize availability.
The backend API that performs the transaction can have higher complexity and can afford to have lower availability. From the client's perspective, you could either respond immediately (HTTP 200) or with accepted (HTTP 202). In either case the client will be happier than the transaction failing outright.
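Roughly what I have in mind, as a Flask sketch (the route names and storage layer are illustrative; a real deployment would persist to a replicated store rather than an in-memory dict):

```python
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)
accepted = {}  # stand-in for a replicated, highly available KV store (e.g. Cassandra)

@app.route("/transactions", methods=["POST"])
def accept_transaction():
    payload = request.get_json()
    txn_id = str(uuid.uuid4())
    accepted[txn_id] = {"status": "queued", "payload": payload}
    # the failure-prone work (card networks, fraud checks, ledger writes) happens
    # in a separate background processor that reads from the store, not here
    return jsonify({"id": txn_id, "status": "queued"}), 202
```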
I am sure your engineers have put a lot of thought into designing this system, but 24 minutes of downtime is unacceptable in the finance domain unless you expect your users to retry failed transactions, which defeats the point of using Stripe.
Edit: Can someone explain why I am being downvoted? Rather than downvoting, can you provide arguments that make sense?
I used to work at Stripe, though not for quite a while. My job was focused on both increasing capacity and minimizing downtime. I have no information whatsoever on the outage, but I think I know why you are getting downvoted.
I suspect the reason is that you are bringing less to the conversation than you think. First, you are bragging and asking for something unreasonable (100% uptime over the internet). Every system like this faces some downtime. Maybe it's as high as 7, 8, or even 9 nines, but some degradation is unavoidable.
Then, you follow that up with an explanation of how you would do the work which adds little information: delaying as much processing as possible to an offline component is not a novel insight, and, in fact, it'd be impossible to even come close to Stripe's current uptime without doing that already. I don't think there's been a Stripe outage close to this magnitude since winter 2015, when multiple coincidental failures led to a failing persistence layer (not unlike the Cassandra you mention in your sample architecture) that stopped accepting writes. Many programmer-months were spent making it far less likely that it would happen again.
Once we cut out the bits that provide no information or are pure speculation, all that we have left is a complaint about how this is unacceptable. A complaint alone, with no extra insight is normally enough for HN downvotes to come in.
You're being downvoted because every system—no matter how perfect it seems—is vulnerable to downtime. Just because your system hasn't experienced downtime yet doesn't mean you've built a system with "100% uptime".
My laptop's hard drive has 100% reliability to date. Doesn't mean I'm not making backups.
I disagree. The system has had 100% reliability for several years. I know it is unbelievable, but it's true. That doesn't mean it doesn't suffer from failures in one or multiple AZs, or that it is perfect.
100% reliability needs to also be measured by usage. It's easy to get that if you have 1 or 10 customers vs a hundred customers. How many unique customers & transactions were you seeing?
"Several years" of uptime, sample size of one. Most people here probably have systems like that, but would never have the chutzpah to believe that they'd totally engineered away the potential for downtime.
It's sad that you're being downvoted, but how do you deal with shipping the product without a confirmation? If someone buys a product and the payments API merely accepts my request to submit their credit card, I might want to know whether it's accepted or declined before acting. Some businesses can afford to work that way and can undo changes. But if the customer expects their product/service immediately and I hand it over, only to find out 20 minutes later that their card was declined, then the merchant is out of luck. In which case they will go after Stripe to cover their losses. Perhaps Stripe should have two APIs: one that fails immediately and one that queues requests, for merchants who can act slowly. Shipping can wait for the transaction before going out; downloading an ebook can't wait. The merchant will then have to decide which way to go based on their business.
Simply put, for the simple "front API" can you set up a Cassandra cluster with 100% uptime? A virtual machine with 100% uptime? A rack with 100% uptime, or even electricity to that rack with 100% uptime, etc.?
Your uptime is only as good as your downstream components, and no downstream component will give you 100% guarantee. You can have redundancy on top of redundancy (like space systems), but that will just stretch out your nines at best.
For the same reason your downstream components cannot guarantee you 100% uptime, you also cannot guarantee 100% uptime for a new system in isolation, for reasons the sibling comments go into.
If you have Cassandra (or a Cassandra-like DB) running across multiple DCs, you can definitely mitigate node, rack, and even DC failures for 100% uptime.
Just because a node or DC fails doesn't mean there is a user visible impact.
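To sketch what that looks like with the DataStax Python driver (the keyspace, DC names, and addresses are made up):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# contact points in two different datacenters (example addresses)
cluster = Cluster(["10.0.0.1", "10.1.0.1"])
session = cluster.connect()

# replicate every row three times in each DC
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS payments
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_us': 3, 'dc_eu': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS payments.pending (id text PRIMARY KEY, payload text)
""")

# LOCAL_QUORUM only needs a majority of replicas in the local DC, so this write
# succeeds even if the other DC (or a local node/rack) is unreachable
insert = SimpleStatement(
    "INSERT INTO payments.pending (id, payload) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
session.execute(insert, ("txn-123", '{"amount": 100}'))
```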
I mean, I get it, but you are holding companies to a standard that isn't the norm at all. This doesn't excuse Stripe's outage, but your complaint and armchair advice, without even knowing the cause of the issue or that company's internal setup, is obviously going to attract downvotes.
Even perfectly designed systems can have flaws. In your design you assume the frontend part will have 100% uptime - practice shows that your hosting just can't provide it. AWS, GCP, Azure, you name it - all of them have failures.
Your statement holds true if you use a single cloud provider. FTR we ran our own DCs with several AZs spread globally. We did suffer failures in individual DCs occasionally but there was zero user-visible impact which is the whole point.
Sometimes a mistake in router/balancer rules can make your servers inaccessible to users. Really huge systems sometimes can't be federated. I agree we need to design systems to be fault-tolerant and highly available, but I also know there is no recipe suitable for every system.
Agreed. What I said does not protect against BGP blackholing or ISP screw-ups rendering services inaccessible. However, all I was pointing out was that services could be built for higher uptime than what is currently being offered.