What I learned from Erlang about resiliency in systems design (2019) (mgasch.com)
142 points by srijan4 on July 10, 2021 | hide | past | favorite | 46 comments



I think it's important to mention that a key part of Erlang's failure-domain-driven mentality is not just "let it crash", but also tying failure domains together: what else should you bring down with you when you crash? For example: your data sync task encounters an HTTP error in a connected SaaS that threatens data inconsistency. Perfect time to crash. But if you do so, also take down the database connection, so that the connection can be returned to the connection pool and the transaction you're running in can be rolled back, ideally without having to try/catch and keep track of, and selectively recycle, all of the cumulative responsibilities that have accrued in that task.
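A rough Python sketch of that unwind path (all names here are hypothetical, and this is only an illustration of the idea, not Erlang semantics): the same exception that "crashes" the task also rolls back the transaction and returns the connection to the pool.

```python
# Sketch of tying failure domains together: if the sync task crashes,
# one unwind path rolls back the transaction and returns the
# connection to the pool. FakePool/FakeConnection are stand-ins.

class FakeConnection:
    def __init__(self):
        self.rolled_back = False
        self.returned = False

class FakePool:
    def acquire(self):
        return FakeConnection()
    def release(self, conn):
        conn.returned = True

def run_sync_task(pool, work):
    conn = pool.acquire()
    try:
        return work(conn)          # may raise on an HTTP/consistency error
    except Exception:
        conn.rolled_back = True    # stand-in for conn.rollback()
        raise                      # "crash": propagate to the supervisor
    finally:
        pool.release(conn)         # connection always goes back to the pool
```

The point is that the cleanup is structural, not a per-responsibility bookkeeping exercise inside the task.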


This is so important. Too many asynchronous solutions simply don't support cancellation, and instead just keep chugging along, doing undesirable things while wasting CPU & RAM.


RAM yes, but CPU shouldn't be wasted while waiting on an async, non-blocking request.

Apache web server also crashes safely when some PHP script leaks memory, but if you have a proper VM with GC this is something of the past.

Fortunately most web systems use Java or its copy C# at this point, and that is not going to change, since Erlang has a simple memory model that cannot do joint parallel tasks.

Go has no VM, WASM has no GC, Rust is too slow to compile... that leaves plain C with a C++ compiler, but you don't want that on a server because assembly seg-faults.

So on the server you have to use Java. Not EE but SE.


There are enough escape hatches in erlang to efficiently do "joint parallel tasks" when you need them.


> Erlang has a simple memory model that cannot do joint parallel tasks.

What do you mean?


You have to copy memory before sharing it between threads.

You cannot do atomic memory sharing between threads = threads cannot work on the same task at the "same" time efficiently.


That's actually very much a feature. Many problems can still be done quite efficiently: for instance, when stream-parsing a file you can have every N newlines sent to a different process, and the same goes for many other problems that can be sliced, like traversing nested collections, fetching batches of records from stores, etc.
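A small Python sketch of that slicing idea (an illustration of the shape of the solution, not Erlang code): split the input into N-line chunks and hand each chunk to a separate worker process, so the workers share nothing and each gets its own copy of its slice.

```python
# Slice work into chunks of N lines and fan them out to worker
# processes; each worker parses its own slice independently.
from multiprocessing import Pool

def chunk_lines(lines, n):
    """Yield successive n-line slices of the input."""
    for i in range(0, len(lines), n):
        yield lines[i:i + n]

def count_words(chunk):
    # Stand-in for real per-chunk parsing work.
    return sum(len(line.split()) for line in chunk)

def parallel_word_count(lines, n=2, workers=2):
    with Pool(workers) as pool:
        return sum(pool.map(count_words, list(chunk_lines(lines, n))))
```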

Sometimes you can also reformulate the problem, but yes not all problems fit.

I would add though that whenever you want to write orchestration around that parallel work it's much easier in erlang than the alternatives.


[flagged]


Yeah, just look at the amazing tech and tools. The fragmentation is by design and doesn't come from languages.

And boy, that's some investment in technology you have going there.


I pick the peak of everything, my house is from 1806, my bike is from 1950, my computers are 8-core Atom 2017 (server) and Jetson Nano 2019 (client)... no house/bike/computer will ever be better ever in the history of the universe.

With Java I was just lucky. I learned C++ first and then now 20 years later I learned C, you have to go back in time to see the future. I also went back to the C64 to predict the peak of computers.


> no house/bike/computer will ever be better ever in the history of the universe.

Extraordinary claims require extraordinary evidence.


You cannot prove the future, you can only guess it.

But memory will not become faster and therefore CPUs cannot become faster, no matter how many cores they have.

Now there are only bad compromises left in optimizing CPUs that lead to other weaknesses like meltdown.

That combined with peak lithography is when you know the tech has peaked. Game Over!


If time is infinite, it means that everything *must* already have happened *or* can be assumed to have happened, including game over and game restart.


The heat death of an expanding universe might disagree


That's:

a) A theory

b) That in no way contradicts the possibility of a continuum where universes may rise, expand, contract and die, only to rinse and repeat

c) If nothing can be created out of nothing, and if energy in the universe cannot be created or destroyed, that doesn't seem to be correct unless the universe is an artificial system

d) The only way for c) to be true is if everything is always the same thing in different forms, at which point we might as well say time is infinite

(caveat: artificial systems of course - but those still need to be initiated from somewhere else at some point down or up the chain of creation - so it should follow that something infinite must be at play)


This whole thread is a lesson in Poe's Law


Unfortunately I don't think there is any satire here.

This person has said:

- humanity will never go beyond 1 gigabit ethernet due to 'the physical limits and energy'

- hydroelectric is the only real source of electricity

- 3D MMOs are the "final medium" and that they are building one to last 100 years,

- they made the fastest database and they have 100% uptime,

- 2011 SSDs are the peak of disk space

- HTTP 1.1 is the 'final transport for humanity'

- java doesn't crash

- smaller transistors 'wear out sooner'

- anything too hot to hold in their hand will break soon

- load balancers save IP addresses

- the synchronize keyword in java makes their programs non-blocking

- multi-threading in games gives them 10 frames of motion to photon latency

They also made up "joint parallel" and then say that certain languages can't do it.

It is interesting but I think they are very isolated.


- humanity will never go beyond 1 gigabit ethernet due to 'the physical limits and energy'

  The complexity and energy requirements of 10 Gb/s make it improbable at home in the long run, also http://radiomesh.org
- hydroelectric is the only real source of electricity

  It's the only viable alternative to photosynthesis (also powered by the fusion reactor in the sky).
- 3D MMOs are the "final medium" and that they are building one to last 100 years,

  I'm building an MMO engine for eternity; the server hardware is specced for 100 years minimum, and could work for 250 years with enough spare parts.
- they made the fastest database and they have 100% uptime,

  100% READ uptime, but very verbose on disk (fixable but I digress)
- 2011 SSDs are the peak of disk space

  They are the peak of writes per bit for the NAND 50nm SLC
- HTTP 1.1 is the 'final transport for humanity'

  Yes.
- java doesn't crash

  It can, but in 20 years I have never seen it happen in a server application; my VR LWJGL MMO crashed on Linux around 5-10 years ago, but I blame that on Linux more than on Java.
- smaller transistors 'wear out sooner'

  I'm speculating about this one, we'll see.
- anything too hot to hold in their hand will break soon

  Electronics wear out faster with heat, yes.
- load balancers save IP addresses

  Yes, obviously.
- the synchronize keyword in java makes their programs non-blocking

  No, I'm not going to explain this one as the source is there for you to read.
- multi-threading in games gives them 10 frames of motion to photon latency

  Yes, "The Last Guardian" had 10 frames lag on the PS4: http://move.rupy.se/file/20200106_124100.mp4


To err on the side of undecidedness is only possible if you have excess energy.

Soon everyone will have to choose.

But yes if the argumentation is thin because you cannot prove anything then making fun of things does not improve anything.


I suspect not, because the post is too long and has too many grammatical errors to be irony. Amusing thread nevertheless, though.


> you have to go back in time to see the future

Just wait until you discover Lisp!


[flagged]


I will shittalk waste until my last breath.

Freedom of choosing the wrong things costs energy and we're running out of energy.

If you are making/playing a 2D game f.ex you are in the wrong, we have two eyes to see depth because the world is 3D!


The linked SO provides a great description of the philosophy/benefits of Erlang.

> Erlang has several features that remove human working time as a source of downtime:

> Hot code reloading. In an Erlang system, it is easy to compile and load a replacement module for an existing one. The BEAM emulator does the swap automatically without apparently stopping anything. There is doubtless some tiny amount of time during which this transfer happens, but it's happening automatically in computer time, rather than manually in human time. This makes it possible to do upgrades with essentially zero downtime. (You could have downtime if the replacement module has a bug which crashes the system, but that's why you test before deploying to production.)

> Supervisors. Erlang's OTP library has a supervisory framework built into it which lets you define how the system should react if a module crashes. The standard action here is to restart the failed module. Assuming the restarted module doesn't immediately crash again, the total downtime charged against your system might be a matter of milliseconds. A solid system that hardly ever crashes might indeed accumulate only a fraction of a second of total downtime over the course of years of run time.

> Processes. These correspond roughly to threads in other languages, except that they do not share state except through persistent data stores. Other than that, communication happens via message passing. Because Erlang processes are very inexpensive (far cheaper than OS threads) this encourages a loosely-coupled design, so that if a process dies, only one tiny part of the system experiences downtime. Typically, the supervisor restarts that one process, with little to no impact on the rest of the system.

> Asynchronous message passing. When one process wants to tell another something, there is a first-class operator in the Erlang language that lets it do that. The message sending process doesn't have to wait for the receiver to process the message, and it doesn't have to coordinate ownership of data sent. The asynchronous functional nature of Erlang's message-passing system takes care of all that. This helps maintain high uptimes because it reduces the effect that downtime in one part of a system can have on other parts.

> Clustering. This follows from the previous point: Erlang's message passing mechanism works transparently between machines on a network, so a sending process doesn't even have to care that the receiver is on a separate machine. This provides an easy mechanism for dividing a workload up among many machines, each of which can go down separately without harming overall system uptime.

1. https://stackoverflow.com/questions/8426897/erlangs-99-99999...


I've always been a little confused about this "let it crash" philosophy. In my experience, the large scale services I have worked on almost never "crash." They throw 500 errors sometimes because they safely caught an exception. Would it be better to bring the whole server down and have the operating system restart it? I feel like I'm missing out on some kind of epiphany here, but it feels like the advice is to make my service crash where it wouldn't have before, which makes no sense to me. Can anyone explain it?


"Let it crash" is largely about restarting systems to known good state in an expanding scope. The Zen of Erlang[1] covers it really well, it's a longer read but well worth it if you want to understand a lot of the design choices of Erlang.

Nearly all of the devices you use employ these approaches in one form or another. A watchdog timer[2] is a pretty simple and powerful version of this. Timeouts and retries follow a somewhat similar approach; Erlang just embraces them across the whole language. It really is a fascinating approach to a different design space (latency and reliability over throughput), shaped by the requirements a telecom stack necessitated.

[1] https://ferd.ca/the-zen-of-erlang.html

[2] https://en.wikipedia.org/wiki/Watchdog_timer
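To make the watchdog idea concrete, here's a toy Python sketch (simulated with a tick counter rather than real time, and not how a hardware watchdog is actually wired up): the worker must "pet" the watchdog within a timeout, or the supervisor resets it to a known good state.

```python
# Toy watchdog: a worker that stops petting it within `timeout` ticks
# gets reset to a known good state by the supervising side.
class Watchdog:
    def __init__(self, timeout):
        self.timeout = timeout
        self.ticks_since_pet = 0
        self.resets = 0

    def pet(self):
        # A healthy worker calls this periodically.
        self.ticks_since_pet = 0

    def tick(self):
        # The supervising side advances time and checks liveness.
        self.ticks_since_pet += 1
        if self.ticks_since_pet > self.timeout:
            self.resets += 1          # restart to known good state
            self.ticks_since_pet = 0
```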


Perhaps another succinct phrasing that might be more easily grokked without spending time working on Erlang systems is:

"Let it rollback and retry"

The way one thinks about processes in Erlang is different than how one thinks about threads in most languages. The expected behavior when you kill a process is that it will be right back up with a known good state very quickly, and you won't have to do much about it yourself because it's handled in a supervisor far from your local process. It's subtle, but it makes a huge difference.

In most languages one expects a thrown unhandled exception to wreak havoc. But in Erlang graceful failure and restarting is the norm, not the exception. It's the expected behavior.

Moreover, the responsibility of maintaining the process tree integrity is delegated fully to specific processes. "Business intelligence" (to ape a phrase) nodes are very effectively isolated from having to care. If they don't know how to handle such a restart, you just let them crash/be killed and restarted too.


> "Let it crash" is largely about restarting systems to known good state in an expanding scope.

This phrase has always thrown me for a loop in the context of most web development.

Mainly because, if an application is coded in a way where it's crashing, chances are it's never going to get itself back into a working state.

For example if your web app throws a 500 because your code is syntactically invalid or is doing something wildly wrong it doesn't matter how many times you restart the web server or spawn another process, it's not going to work. It's going to fail until someone updates the code base to fix the human mistake.

Most modern web frameworks can also handle the case where the /oops URL throws a 500 but the home page and everything else works. One page throwing an exception doesn't bring down everything.

Now if you're talking about things like retrying a database connection at startup until either a timeout hits or the DB becomes available, that type of stuff is very useful but this is something I've seen included in a lot of web frameworks in a lot of languages. It's essentially a few line while loop that looks for a specific type of exception and then calls the connect function until it works or times out.
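That few-line while loop, sketched in Python with a hypothetical `connect` callable and a made-up exception type, retrying until the DB is up or a deadline passes:

```python
# Retry a connect() until it succeeds or a deadline passes; retries
# only on the specific "DB not ready" exception, everything else
# propagates immediately.
import time

class DBUnavailable(Exception):
    pass

def connect_with_retry(connect, timeout=5.0, delay=0.1,
                       clock=time.monotonic, sleep=time.sleep):
    deadline = clock() + timeout
    while True:
        try:
            return connect()
        except DBUnavailable:
            if clock() >= deadline:
                raise          # give up: the timeout hit
            sleep(delay)
```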

In general I find in Elixir you're also dealing with error handling on a per function basis because it's common practice to do the ok / error tuple pattern. This is defensive programming to ensure you have an understanding of the system you're developing, just like you would do a try / except in other languages.

For example if you were doing token based authentication you'd want your function to return ok and the data you want when it successfully verifies the token but you'd also want to handle the 2 failing cases individually, one error / message for when the token expired and another error / message for when the token was tampered with.
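The ok/error tuple convention translated into a Python sketch (the token format and signature check here are made up purely for illustration, not a real auth scheme): expired and tampered tokens get distinct error values the caller can match on.

```python
# {:ok, data} / {:error, reason} style returns, Python flavored.
def verify_token(token, now, secret):
    try:
        payload, expiry, sig = token.split(".")
    except ValueError:
        return ("error", "tampered")
    if sig != f"sig-{secret}-{payload}":      # stand-in for a real HMAC check
        return ("error", "tampered")
    if int(expiry) < now:
        return ("error", "expired")
    return ("ok", payload)
```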


> Mainly because if an application were coded in a way where it's crashing chances are it's never going to get itself back into a working state.

"Let it crash" is not about syntax errors, it's about unexpected (ie, exceptional) errors, often as the result of a user taking a completely unexpected path in a large system that was never thought of by programmers (it's probably about more than that but I'm a BEAM n00b). It's happened plenty in web dev for me where a production worker crashes and it simply requires a restart or to be reset back to a known state because the user did something unexpected.

As for tuple return in Elixir, that is simply doing it wrong if it is being used for defensive programming (and the antithesis of "let it crash"). It's meant for handling known errors and makes for a concise way of dealing with them in the functional world—it's similar to Go's multiple returns.

e.g. compared to OO (they both look pretty clean to me)

  # rails
  foo.update(params)
  if foo.save
    do_something
  else
    handle_error(foo)
  end

  # elixir
  case Repo.update(foo, foo_args) do
    {:ok, updated_foo} -> do_something(updated_foo)
    {:error, error} -> handle_error(error)
  end
It's otherwise very common for an Elixir function to return a bare value if it's expected to always work (and "let it crash" if it doesn't).


I think 'let it crash' means: if your worker process encounters an error, let the supervisor process worry about the error handling. Don't be defensive in the worker.

A lot of copy paste here from Joe Armstrong's thesis:

Worker process does the job. Another process, the supervisor process, observes the worker. If an error occurs in the worker, the supervisor takes actions to correct the error.

1. There is a clean separation of issues. The processes that are supposed to do things (the workers) do not have to worry about error handling.

2. We can have special processes which are only concerned with error handling.

3. We can run the workers and supervisors on different physical machines.

4. It often turns out that the error correcting code is generic, that is, generally applicable to many applications, whereas the worker code is more often application specific.

1. Exceptions occur when the run-time system does not know what to do.

2. Errors occur when the programmer doesn't know what to do.

The basic idea is: Try to perform a task. If you cannot perform the task, then try to perform a simpler task.

To each task we associate a supervisor process- the supervisor will assign a worker to try and achieve the goals implied by the task. If the worker process fails with a non-normal exit then the supervisor will assume that the task has failed and will initiate some error recovery procedure. The error recovery procedure might be to restart the worker or failing this try to do something simpler.

Thesis: https://web.archive.org/web/20041204143417/http://www.sics.s...
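A minimal Python sketch of that supervisor/worker split (the "try something simpler" escalation is left out for brevity, and a real OTP supervisor also rate-limits restarts per time window): run a worker, and on non-normal exit retry up to a limit before reporting failure upward.

```python
# Minimal supervisor loop: restart a failing worker a bounded number
# of times, then report the error up (to a parent supervisor).
def supervise(worker, max_restarts=3):
    restarts = 0
    while True:
        try:
            return ("ok", worker())
        except Exception as exc:
            restarts += 1
            if restarts > max_restarts:
                return ("error", exc)   # escalate instead of looping forever
```

Note how the worker itself contains no error handling at all — that separation of concerns is point 1 above.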


If you're running a web server running the Phoenix framework on Elixir, and you're processing a request and an (unhandled) error is raised, the request fails with a 500 and the rest of the web server runs fine, other requests process normally, the whole elixir/erlang/server process keeps running.

The "let it crash" is referring more towards letting the specific operation you're operating under (i.e. a single web request, a scheduled job, an async method, etc.) crash, and let the thing supervising it handle the restarting of it.

Processing a request and running in to a "let it crash" scenario doesn't mean bringing the whole thing down, it means the individual request crashes, the web server supervising it catches the crash, sets the connection to have a 500 status and returns, and life goes on as normal, the rest of the process being unaffected. In other similar scenarios the supervisor may want to restart the process after it crashes, or do something else in response.

In general the ethos of erlang(/elixir) is to let things crash but more importantly have the process supervising it expect things to crash and know what to do next to recover from the crash.

It may not be a novel idea nowadays (especially with web servers where presumably every implementation among every language will catch exceptions and send a 500 error), but it was a design principle when Erlang was being made, and it's baked in to the language and runtime in a fundamental first-class way where every Erlang/Elixir application you run is a supervised tree of processes that care about what to do when a process under it crashes.


It kinda depends on your domain and the way you structure your services. It's worth noting that, by saying "They throw 500 errors", you seem to specifically be talking about a single class of "service", ie web services, but the scope of "networked things that need to be resilient" is obviously much larger.

> Would it be better to bring the whole server down and have the operating system restart it?

It depends entirely on how you've structured your system. Some systems attempt to not-crash, but then cause repercussions on downstream dependents by attempting to continue when they shouldn't have. In those cases, it's better to just give up and die rather than erroneously continuing.

I'm not an Erlang user, but I _think_ the ideology is more along the lines of accepting that you haven't and won't account for all failure scenarios, embracing crashing as an inevitability, and working backwards. You know it will certainly crash, so become really good at recovering from a crash.


The idea comes from an old paper, "crash-only software"[1], that's pretty short (6 pages), so I would recommend reading it.

Basically, servers/services that don't crash don't exist. There'll always be bugs. So instead of trying to fix all bugs, write code that will gracefully fail and easily recover.

[1]: https://dslab.epfl.ch/pubs/crashonly.pdf


One benefit of the "let it crash" mentality - provided the language and runtime has the right set of tools - is that you can focus on programming the happy path.

Simplified example, but if part of your app is a parser, you can write functions that parse known tokens. If the input contains bad data, your parser will crash. Its supervisor will detect this and be able to restart the parser, and can pass an error back to the user.

Now, should you add code that handles known errors more gracefully and provides more info? Absolutely, but you don't have to, and IMO that enables a very pleasant programming experience.
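In Python terms, the shape of that is roughly this (all names illustrative; in Erlang the "supervisor" layer would be a real OTP supervisor, not a try/except):

```python
# Happy-path parser: parse_int only handles the good case and crashes
# on bad data; a thin supervising layer turns the crash into an error
# for the user.
def parse_int(token):
    return int(token)          # happy path only; raises on bad input

def supervised_parse(token):
    try:
        return ("ok", parse_int(token))
    except ValueError:
        # "Restart" is trivial here since the parser holds no state;
        # we just report the failure back to the caller.
        return ("error", f"bad token: {token!r}")
```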


I feel like the post brushes past the best parts to line the learnings up with container concepts and k8s.

The Erlang application/service should not be crashing. A crash happens in a process in a tree of supervisors and each supervisor can have strategies to recover if its child supervisor/process fails.

Supervision in Erlang/BEAM is not just microservice infrastructure with restarts. It handles many types of failures inside the application, and only if all mitigations repeatedly fail would we crash all the way up the trunk of the tree and crash the application. If that happens, Erlang's heart should be starting it back up or otherwise invoking some last-ditch mitigation. Or if you run Erlang in a container, that's when you'd have your infrastructural stuff restart it.

Supervision in Erlang is much more granular and not about restarting the entire service.

I think this gives a good view of Erlang resiliency: https://youtu.be/JvBT4XBdoUE


Apparently, what that guy learned was not to write system software, only containerized and supervised apps.

Not sure if that's really the lesson of Erlang as a language, but it certainly sounds like the lesson you would learn if you work for a company selling VM management software (which the author does).


You might have something there; I still remember Mitchell Hashimoto of Hashicorp/Vagrant fame as an erlang blogger.


And that is also why PHP is so robust: shared nothing. Preforking. Time and memory limits per request. Uncaught errors end only the current request, and so on...


Ironic seeing PHP mentioned, as it almost never crashes and always attempts to chug along. Array index doesn't exist? Don't even 500, just chuck a warning out.


The language/runtime is very configurable. Some codebases will set strict_types=1 plus error_reporting(E_ALL), eliminating many instances of these kinds of problems/footguns.


You can make PHP throw on notices if you like.


If you mean classic PHP CGI, sure. But almost all PHP sites today run on mod_php/PHP-FPM on Apache (or equivalents on nginx) which doesn't run a fresh CGI process-per-request, but multiplexes requests onto a single PHP process instead. The overhead of parsing PHP code for each request would be brutal.


Sure, classic CGI is one process per request, but even on PHP-FPM or Apache prefork/mod_php each process will only serve a given maximum number of requests, thus refreshing periodically. Also, if any of these workers is killed (say, by the OOM killer), none of the other concurrent requests will be affected. Also, the overhead of parsing is almost zero when you have the opcache extension loaded.

I'm not a huge fan of the language, but the PHP runtime wins me over with its simplicity and robustness. I work daily on a codebase that gets hundreds of millions of page hits per day and 100k QPS during peak hours, running on LAMP with very few servers. We face challenges like everyone else does, but we've never had outages because of the language/runtime.


The hard problem is that it never crashes as obviously as the author charmingly demonstrated with echo c > /proc/sysrq-trigger.

When something goes bad (mostly hardware, or a rare edge-case bug in software), it works mostly fine but fails randomly, and no amount of hard-coded sanity checks can automate the detection. It either doesn't crash (it's just unusually slow, or produces garbage results), or it crashes rarely enough to never exceed the threshold of the supervisor's restart strategy, so the supervisor keeps restarting the unstable process rather than crashing and letting the parent supervisor know to remove it from the cluster of nodes.


"Crash" can mean a lot of things, and in Erlang, a crash occurs whenever your functions are unable to perform their task (for instance, there were no functions that match your input data), or when process calls time out. More obvious crashes exist as well - a remote connection hangs up, a computer catches on fire, etc.

You could also use a variety of supervisor strategies or methods to detect what caused the crash.


Yep. Or one RPC call suddenly becomes slow because of a partial failure like a disk retrying sectors and that slows down the whole application to a crawl :(


What's the purpose of having that picture at the beginning?


Maybe it's a resilient tree?



