
> If we're talking about distributed systems, then one thing is guaranteed - network is not going to be reliable. And if we have 10's or 100's of services, this policy means that at the smallest blip the whole thing collapses like a house of cards.

With RPC, I believe the author is talking about retries at the application level. The TCP layer below it already retries with exponential backoff. If your network is unusual enough that the defaults don't work, you can tune that, along with your HTTP library's timeout settings.
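
For example (illustrative only – the client library, URL, and numbers are my assumptions, not from the article), making those timeouts explicit in a Python service using requests looks roughly like this:

    # Sketch: making HTTP client timeouts explicit instead of relying on defaults.
    import requests

    session = requests.Session()

    # Separate connect and read timeouts; many libraries default to values that
    # are unbounded or far too long for service-to-service calls.
    resp = session.get("https://example.internal/api", timeout=(3.05, 10))

    # TCP retransmission behaviour itself is tuned at the OS level (sysctls),
    # not in application code.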

But very likely, your slowness or problems will exist in the application layer – either on "your" side (your service is tied up doing something for too long) or on the other side (their service is tied up doing something for too long). The correct fix is to "Fix the flaky service!" as the author recommends, and this can take many forms – spin up more copies of the service, or fix whatever CPU or I/O resource problems are behind it.

Slapping on another layer of "just retry" at the application level, on top of all the retries already happening below it, is what the author is recommending against – you will end up inventing a new, more complicated model of a distributed system.



No, TCP retries are not enough.

TCP retries won't help if your backend is restarting, if failover switching is happening, if your overloaded cluster has just been scaled up to add nodes, etc.

A reasonable application-level retry policy (exponential randomized delays, limited attempts) would turn these from a service disruption for the client into a mere delay, often pretty short.
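
A minimal sketch of such a policy (the constants and exception type are illustrative assumptions):

    # Exponential backoff with randomized ("full jitter") delays and a bounded
    # number of attempts.
    import random
    import time

    class TransientError(Exception):
        """Placeholder for whatever your client treats as retryable."""

    def call_with_retries(do_call, max_attempts=4, base_delay=0.1, max_delay=5.0):
        for attempt in range(1, max_attempts + 1):
            try:
                return do_call()
            except TransientError:
                if attempt == max_attempts:
                    raise  # give up and surface the failure instead of retrying forever
                # Exponential backoff capped at max_delay, randomized so clients
                # don't all wake up at the same moment.
                time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))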


> TCP retries won't help if your backend is restarting, if failover switching is happening, if your overloaded cluster has just been scaled up to add nodes, etc.

Yeah, this is where the nuance begins.

You are correct that TCP is not always sufficient. Perhaps where we differ is that in my experience, it still helps for this to be a feature of the framework or infrastructure that the applications are running on (e.g. a retry budget in the service mesh, or a load balancer) rather than scattered around in the application itself. At some point it becomes a word game – you could say that the service mesh is also kind of an "application" itself, but the core principle is that the retries should be kept in a few simple, common places that are rarely tuned.
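
A toy sketch of the retry-budget idea (this mimics what service meshes offer; it is not any particular mesh's implementation or configuration):

    # Retries are only allowed up to a fraction of recent request volume, so a
    # flaky dependency cannot turn into a retry storm. In a real mesh these
    # counters are kept over a sliding time window.
    class RetryBudget:
        def __init__(self, ratio=0.2, min_retries=10):
            self.ratio = ratio            # retries may add at most ~20% extra load
            self.min_retries = min_retries  # floor so low-traffic services can still retry
            self.requests = 0
            self.retries = 0

        def record_request(self):
            self.requests += 1

        def can_retry(self):
            allowed = max(self.min_retries, self.requests * self.ratio)
            if self.retries < allowed:
                self.retries += 1
                return True
            return False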

Otherwise, you will find that N developers who are tasked with figuring out something like this will scatter N different versions of your exponential randomized delay policy across your codebase. It is always possible to avoid this with enough code review discipline, but once the trend starts, it's much harder to say "No, you need to fix this the right way".


I’ve seen this exact problem many times. Counting TCP, the current system I’m working on has 5 different layers of retries.

TCP, the service mesh, the company-wide HTTP client, application retries, and a database-backed queueing library that everyone uses.

Imagine what happens when a service is flaky and each one of those layers retries 3-5 times.
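
The arithmetic is worth spelling out (treating "3-5 times" as 3-5 attempts per layer, which is an assumption):

    # Attempts multiply across independent retry layers.
    layers = 5
    print(3 ** layers)  # 243 attempts for one logical request, best case
    print(5 ** layers)  # 3125 attempts, worst case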


Yep. And then the user says.. "Oh, that didn't work - I'll just try again... and again..."


It’s pretty important to implement a mechanism to prevent a thundering herd if you can, and TCP doesn’t help here. Accidental synchronization can have many, many causes.
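
One common way to break that synchronization is decorrelated jitter, where each client draws its next delay at random from a range that grows with its previous delay (the constants here are illustrative):

    import random

    def next_delay(previous_delay, base=0.1, cap=10.0):
        # Clients that failed at the same moment still sleep for different,
        # growing amounts of time, so they don't retry in lockstep.
        return min(cap, random.uniform(base, previous_delay * 3))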

In addition, despite fantasies, TCP in particular is an instance-level concept, and reserving the ability to shoot instances without draining them first is a good design.

Application-level, user-perceived reliability is an end-to-end concern, not a transport-layer concern.


My experience is that you need to plan for retries at the system level anyway (above the application; think “dead letter queue processing” or equivalent).

When you don’t (as we sometimes haven’t), you end up with people having to write scripts or perform manual actions to poorly approximate what the system should have done, often with thousands of orders (or whatever entity matters to your company) in a limbo state.
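
A minimal sketch of that system-level layer, with an in-memory queue standing in for whatever durable store you actually use (everything here is illustrative; a real system would use a persistent queue or table):

    import queue

    dead_letters = queue.Queue()  # stand-in for a durable dead-letter queue

    def process_with_dlq(message, handler):
        try:
            handler(message)
        except Exception as exc:
            # Park the failed item with context instead of losing it or leaving
            # it for a hand-written cleanup script.
            dead_letters.put({"message": message, "error": repr(exc)})

    def replay_dead_letters(handler):
        # Run once the underlying problem is fixed.
        for _ in range(dead_letters.qsize()):
            process_with_dlq(dead_letters.get()["message"], handler)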


TCP eventually gives up though, right? Like if you want to tolerate network partitions that last days or weeks, during which the other node might have received a different address, then you're gonna have to move that logic up the stack.
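
A small illustration of why that logic has to live above TCP: a fresh connection attempt re-resolves the peer, so a node that came back under a new address is still reachable (hostname and port are illustrative):

    import socket

    def connect_fresh(hostname, port, timeout=5.0):
        # create_connection resolves the hostname on every call, so it picks up
        # a peer that moved to a new address during a long partition; a single
        # TCP connection and its retransmits could never do that.
        return socket.create_connection((hostname, port), timeout=timeout)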



