
> If we're talking about distributed systems, then one thing is guaranteed - network is not going to be reliable. And if we have 10's or 100's of services, this policy means that at the smallest blip the whole thing collapses like a house of cards.

With RPC, I believe the author is talking about retries at the application level. The TCP layer below it already retries with exponential backoff. If your network is unusual enough that the defaults don't work, you can tune that, along with your HTTP library's timeout settings.
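
For example (illustrative only – the client library, URL, and numbers are my assumptions, not from the article), making those timeouts explicit in a Python service using requests looks roughly like this:

    # Sketch: making HTTP client timeouts explicit instead of relying on defaults.
    import requests

    session = requests.Session()

    # Separate connect and read timeouts; many libraries default to values that
    # are unbounded or far too long for service-to-service calls.
    resp = session.get("https://example.internal/api", timeout=(3.05, 10))

    # TCP retransmission behaviour itself is tuned at the OS level (sysctls),
    # not in application code.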

But very likely, your slowness or problems will exist in the application layer – either on "your" side (your service is tied up doing something for too long) or on the other side (their service is tied up doing something for too long). The correct fix is to "Fix the flaky service!" as the author recommends, and this can take many forms – spin up more copies of the service, or fix whatever CPU or I/O resource problems are behind it.

Slapping on another layer of "just retry" at the application level, on top of all the retries already happening below it, is what the author is recommending against – you will end up inventing a new, more complicated model of a distributed system.



No, TCP retries are not enough.

TCP retries won't help if your backend is restarting, if failover switching is happening, if your overloaded cluster has just been scaled up to add nodes, etc.

A reasonable application-level retry policy (exponential randomized delays, limited attempts) would turn these from a service disruption for the client into a mere delay, often pretty short.
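
A minimal sketch of such a policy (the constants and exception type are illustrative assumptions):

    # Exponential backoff with randomized ("full jitter") delays and a bounded
    # number of attempts.
    import random
    import time

    class TransientError(Exception):
        """Placeholder for whatever your client treats as retryable."""

    def call_with_retries(do_call, max_attempts=4, base_delay=0.1, max_delay=5.0):
        for attempt in range(1, max_attempts + 1):
            try:
                return do_call()
            except TransientError:
                if attempt == max_attempts:
                    raise  # give up and surface the failure instead of retrying forever
                # Exponential backoff capped at max_delay, randomized so clients
                # don't all wake up at the same moment.
                time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))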


> TCP retries won't help if your backend is restarting, if failover switching is happening, if your overloaded cluster has just been scaled up to add nodes, etc.

Yeah, this is where the nuance begins.

You are correct that TCP is not always sufficient. Perhaps where we differ is that in my experience, it still helps for this to be a feature of the framework or infrastructure that the applications are running on (e.g. a retry budget in the service mesh, or a load balancer) rather than scattered around in the application itself. At some point it becomes a word game – you could say that the service mesh is also kind of an "application" itself, but the core principle is that the retries should be kept in a few simple, common places that are rarely tuned.
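
A toy sketch of the retry-budget idea (this mimics what service meshes offer; it is not any particular mesh's implementation or configuration):

    # Retries are only allowed up to a fraction of recent request volume, so a
    # flaky dependency cannot turn into a retry storm. In a real mesh these
    # counters are kept over a sliding time window.
    class RetryBudget:
        def __init__(self, ratio=0.2, min_retries=10):
            self.ratio = ratio            # retries may add at most ~20% extra load
            self.min_retries = min_retries  # floor so low-traffic services can still retry
            self.requests = 0
            self.retries = 0

        def record_request(self):
            self.requests += 1

        def can_retry(self):
            allowed = max(self.min_retries, self.requests * self.ratio)
            if self.retries < allowed:
                self.retries += 1
                return True
            return False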

Otherwise, you will find that N developers who are tasked with figuring out something like this will scatter N different versions of your exponential randomized delay policy across your codebase. It is always possible to avoid this with enough code review discipline, but once the trend starts, it's much harder to say "No, you need to fix this the right way".


I’ve seen this exact problem many times. Counting TCP, the current system I’m working on has 5 different layers of retries.

TCP, the service mesh, the company-wide HTTP client, application retries, and a database-backed queueing library that everyone uses.

Imagine what happens when a service is flaky and each one of those layers retries 3-5 times.
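
The arithmetic is worth spelling out (treating "3-5 times" as 3-5 attempts per layer, which is an assumption):

    # Attempts multiply across independent retry layers.
    layers = 5
    print(3 ** layers)  # 243 attempts for one logical request, best case
    print(5 ** layers)  # 3125 attempts, worst case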


Yep. And then the user says.. "Oh, that didn't work - I'll just try again... and again..."


It’s pretty important to implement a mechanism to prevent a thundering herd if you can, and TCP doesn’t help here. Accidental synchronization can have many, many causes.
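
One common way to break that synchronization is decorrelated jitter, where each client draws its next delay at random from a range that grows with its previous delay (the constants here are illustrative):

    import random

    def next_delay(previous_delay, base=0.1, cap=10.0):
        # Clients that failed at the same moment still sleep for different,
        # growing amounts of time, so they don't retry in lockstep.
        return min(cap, random.uniform(base, previous_delay * 3))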

In addition, despite fantasies, TCP in particular is an instance-level concept, and reserving the ability to shoot instances without draining them first is a good design.

Application-level, user-perceived reliability is an end-to-end concern, not a transport-layer concern.


My experience is that you need to plan for retries at the system level anyway (above the application; think “dead letter queue processing” or equivalent).

When you don’t (as we sometimes haven’t), you end up with people having to write scripts or perform manual actions to poorly approximate what the system should have done, often with thousands of orders (or whatever entity matters to your company) in a limbo state.
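
A minimal sketch of that system-level layer, with an in-memory queue standing in for whatever durable store you actually use (everything here is illustrative; a real system would use a persistent queue or table):

    import queue

    dead_letters = queue.Queue()  # stand-in for a durable dead-letter queue

    def process_with_dlq(message, handler):
        try:
            handler(message)
        except Exception as exc:
            # Park the failed item with context instead of losing it or leaving
            # it for a hand-written cleanup script.
            dead_letters.put({"message": message, "error": repr(exc)})

    def replay_dead_letters(handler):
        # Run once the underlying problem is fixed.
        for _ in range(dead_letters.qsize()):
            process_with_dlq(dead_letters.get()["message"], handler)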


TCP eventually gives up though, right? Like if you want to tolerate network partitions that last days or weeks, during which the other node might have received a different address, then you're gonna have to move that logic up the stack.
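
A small illustration of why that logic has to live above TCP: a fresh connection attempt re-resolves the peer, so a node that came back under a new address is still reachable (hostname and port are illustrative):

    import socket

    def connect_fresh(hostname, port, timeout=5.0):
        # create_connection resolves the hostname on every call, so it picks up
        # a peer that moved to a new address during a long partition; a single
        # TCP connection and its retransmits could never do that.
        return socket.create_connection((hostname, port), timeout=timeout)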



