Perfect demonstration of the fallacies in action! If you were used to developing applications on a self contained platform you would think something like “sure, if it fails the user can try again”
On a distributed system the user can only try again if the platform has remained stable, the failure is transient (*) and they have (crucially) have been given the information to retry.
The platform that provides a stable environment for the user to just try again has been built on these principles.
(*) there is one administrator assumes it is within the user’s power to resolve the issue
>we'll just add this feature on as some async verification since it takes a while, then make the original update wait in some weird state for it to finish.
Later, when users are confused at failures and weird states.
>ok now lets build a new system that tries to gather all this information on updates in "weird states" and let users fix them!
I have had a developer with anger issues expect 100% success with FTP file transfers, and anything that failed was 100% my fault as a Linux/Oracle administrator.
These FTP sessions were running over WANs connecting Pennsylvania, Iowa, and Tennessee.
I ended up writing him an "until curl ftp://...; do echo it failed again; done" loop which calmed that particular issue down.
I don't miss that guy, not even 1%. Good riddance.