
Handling System Failures During Payment Communication - cryo
https://blogs.dropbox.com/tech/2017/09/handling-system-failures-during-payment-communication/
======
zAy0LfpBZLC8mAC
Well, I guess to some degree this is a result of payment processors being
incompetent, but really, the idea of "detecting an abnormal state" is idiotic.
You shouldn't build a distributed system for the unproblematic case and then
bolt on corrective measures to make it work through failures--you should build
it exclusively for the failure case, and the unproblematic case will just
work, with the benefit of fewer special-case code paths and generally lower
complexity.

You don't build a payment transaction mechanism that assumes that TCP
connections cannot break and then bolt on recovery mechanisms that try to fix
things up if they do anyway--you build a transaction mechanism that simply
assumes that the TCP connection will break every time. If it doesn't, great,
you get better performance, but that's it.

The idea that the uncertainty about whether your request was received and
processed by the other party only arises some amount of time after you have
sent the request is just confused. The uncertainty starts the very moment
the request leaves your machine, and it lasts until you receive an
acknowledgement. As long as the uncertainty lasts, you keep retransmitting, in
order to guarantee that the uncertainty gets resolved at some point. To keep
retransmissions from causing repeated side effects, you assign a unique
ID to the transaction so that the recipient can avoid applying duplicates.
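
A minimal sketch of the retransmit-until-acknowledged idea, in Python. Everything here is hypothetical and for illustration only: `Recipient`, `LossyChannel` and `pay` are made-up names, the network is simulated with a seeded random generator, and a real client would add backoff between attempts and persist the transaction ID before the first send.

```python
import random
import uuid


class Recipient:
    """Hypothetical recipient that applies each transaction ID at most once."""

    def __init__(self):
        self.applied = {}   # transaction ID -> stored result
        self.balance = 0

    def handle(self, txn_id, amount):
        # A duplicate ID returns the stored result without repeating the side effect.
        if txn_id not in self.applied:
            self.balance += amount
            self.applied[txn_id] = "ok"
        return self.applied[txn_id]


class LossyChannel:
    """Simulated unreliable link: drops the request or the acknowledgement at random."""

    def __init__(self, recipient, rng, loss=0.5):
        self.recipient = recipient
        self.rng = rng
        self.loss = loss

    def send(self, txn_id, amount):
        if self.rng.random() < self.loss:
            return None                      # request lost in transit
        result = self.recipient.handle(txn_id, amount)
        if self.rng.random() < self.loss:
            return None                      # processed, but acknowledgement lost
        return result


def pay(channel, amount, max_attempts=1000):
    txn_id = str(uuid.uuid4())               # the same ID is reused for every retransmission
    for attempt in range(1, max_attempts + 1):
        ack = channel.send(txn_id, amount)
        if ack is not None:
            return txn_id, attempt
    raise RuntimeError("no acknowledgement; outcome still unknown")


recipient = Recipient()
channel = LossyChannel(recipient, random.Random(7))
txn_id, attempts = pay(channel, 25)
```

However many retransmissions it takes, the recipient's balance changes by the amount exactly once, because every attempt carries the same ID.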

~~~
teh_klev
> As long as the uncertainty lasts, you keep retransmitting,

That works fine if your payment gateway processor supports idempotent
transactions[0][1][2]. Not all processors do. So your only recourse as a
merchant is, in the event of uncertainty, to query the processor via whatever
API they provide to see if your transaction was successfully transmitted to
and persisted in their system, and discover the outcome.

[0]: https://dev.mca.sh/retrying.html

[1]: https://stripe.com/docs/api#idempotent_requests

[2]: https://en.wikipedia.org/wiki/Idempotence
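
The query-based fallback for processors without idempotent transactions could look roughly like this. It is only a sketch: `FakeProcessor`, its `charge` and `lookup` methods and the merchant `reference` are invented for illustration and do not correspond to any real processor's API.

```python
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    UNKNOWN = "unknown"


class FakeProcessor:
    """Stand-in for a processor without idempotency support (hypothetical API)."""

    def __init__(self):
        self.records = {}          # merchant reference -> status

    def charge(self, reference, amount, drop_response=False):
        self.records[reference] = "succeeded"
        if drop_response:
            raise TimeoutError("response lost")   # charge persisted, reply never arrives
        return "succeeded"

    def lookup(self, reference):
        return self.records.get(reference)        # None if nothing was persisted


def charge_with_reconcile(processor, reference, amount, **kwargs):
    try:
        status = processor.charge(reference, amount, **kwargs)
    except TimeoutError:
        # Uncertain: ask the processor what it recorded instead of charging again.
        status = processor.lookup(reference)
        if status is None:
            # Not visible (yet); this is not proof of failure.
            return Outcome.UNKNOWN
    return Outcome.SUCCEEDED if status == "succeeded" else Outcome.FAILED
```

The important design choice is that a missing record maps to `UNKNOWN` rather than to a failure, since the merchant cannot tell a lost request from one that is merely delayed.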

~~~
zAy0LfpBZLC8mAC
1\. Yeah, hence "Well, I guess to some degree this is a result of payment
processors being incompetent [...]".

2\. That is actually not a solution. As your communication channel is
asynchronous and unordered, there are no ordering guarantees on the delivery
of the transaction request and any subsequently sent status queries, thus this
approach has a race condition. Also, chances are that such broken systems
don't even give you any useful visibility guarantees on "received"
transactions, so even if you pretend that IP becomes synchronous if you wait
long enough, chances are you'll still have race conditions.
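
The race can be made concrete with a small simulation. This is an illustrative sketch only: `NonIdempotentProcessor` is a made-up stand-in, and message reordering is modelled simply by the order in which the calls are made.

```python
class NonIdempotentProcessor:
    """Hypothetical processor that applies every charge request it receives."""

    def __init__(self):
        self.charges = []

    def charge(self, reference, amount):
        self.charges.append((reference, amount))

    def exists(self, reference):
        return any(ref == reference for ref, _ in self.charges)


processor = NonIdempotentProcessor()

# The merchant sends a charge, sees no reply, then sends a status query.
# On an unordered network the query can be delivered before the charge.
in_flight_charge = ("order-42", 100)        # still delayed in the network
seen = processor.exists("order-42")         # query overtakes the charge

# The merchant concludes the charge was lost and sends a new one.
if not seen:
    processor.charge("order-42", 100)

# The delayed original request finally arrives and is applied as well.
processor.charge(*in_flight_charge)

double_charged = len(processor.charges) == 2
```

The status query answered truthfully at the time it was processed, yet the customer still ends up charged twice.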

------
jasonjei
I think this is one of the value-added features of using services like Stripe,
right? Just as we have moved from dedicated box hosting to the cloud, we trust
payment gateways like Stripe to properly authorize and capture payment.

I think the authorize and capture model adequately solves most of the issues
(wait for authorization code, then capture payment). However, how do you
guarantee payment capture even after having an auth? Does Stripe guarantee the
card processor receives the capture request before returning a success code to
the API client? If, for example, your initial call to the Stripe API for
capture fails, can you safely run the capture call again with the original
authorization to avoid a double charge? If I understand correctly, you cannot
capture more than authorized?

You want to make sure you charge the customer exactly once, not zero times
and not more than once. I think that's where the difficulty in writing credit
card payment code lies.

------
edwhitesell
I don't see any of this not being easily solved by the traditional model of
using authorize and capture with external payment processors.

It's been a while since I've worked on such a system, so I suppose it's
possible newer processors implement things differently. However, that model
was solid because you always knew the state of the charge in your system and
finding the state in the processor's system was pretty easy too (in the case
of communication errors).

Also, authorize can be run multiple times. Sure, you may place a hold on some
extra amount in the customer's account, but it falls off after a few days (and
I believe you could release it too, if a different authorize transaction was
captured).
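
A rough sketch of the authorize-and-capture model, under stated assumptions: `FakeGateway` is hypothetical, and the behaviour shown (an authorization can be captured only once, and never for more than the authorized amount) is assumed for illustration rather than taken from any particular processor's documentation.

```python
class Authorization:
    def __init__(self, auth_id, amount):
        self.auth_id = auth_id
        self.amount = amount
        self.captured = None       # amount captured so far, or None


class FakeGateway:
    """Hypothetical gateway; not any real processor's API."""

    def __init__(self):
        self.auths = {}
        self.next_id = 1

    def authorize(self, amount):
        # Each call places a separate hold; running it twice creates two holds.
        auth = Authorization(f"auth-{self.next_id}", amount)
        self.next_id += 1
        self.auths[auth.auth_id] = auth
        return auth.auth_id

    def capture(self, auth_id, amount):
        auth = self.auths[auth_id]
        if amount > auth.amount:
            raise ValueError("cannot capture more than authorized")
        if auth.captured is None:          # only the first capture moves money
            auth.captured = amount
        return auth.captured               # repeats report the original capture

    def total_captured(self):
        return sum(a.captured or 0 for a in self.auths.values())


gateway = FakeGateway()
auth_id = gateway.authorize(50)
gateway.capture(auth_id, 50)
gateway.capture(auth_id, 50)               # retry after a lost response
```

Under these assumptions the authorization ID plays the role of the unique transaction ID, so a repeated capture cannot charge twice.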

This seems like an overly complex solution to a problem that's been solved
for years.

Edit: fixed typo.

------
ramshanker
Nice strategy post.

I was hoping for some numbers as well, like how many "unknown" transaction
states they get per unit time (per day or anything like that). Even some %
figure would have been insightful.

