Hotswapping Haskell (simonmar.github.io)
148 points by trevorriles 11 months ago | 34 comments



I don't want to knock the technical achievement here -- it's a cool hack -- but I'm really surprised that it was deemed to be the best choice for a production system.

In the first place, "we can't compile our code on every change because it takes too long" is a really awful situation to be in. Are developers not building and testing their changes before deploying them? Can Facebook not afford a continuous integration system that can run builds in parallel? It sounds like this problem is only happening because the application is a giant monolith, but for some reason splitting it up would slow down development even more... I'm not sure I buy that reasoning.

The article says that "Haskell’s strict type system means we’re able to confidently push new code knowing that we can’t crash the server", which is a real stretch. In addition to all of the usual ways a computation can diverge, this hot-swapping system adds a whole new variety of failure modes. The article talks about how the code needs to be carefully audited to prevent memory leaks, but it doesn't even mention the weird things that can happen when mutable state is preserved across code modifications. Debugging is a pain when your data structures can get into states that aren't reachable with any single version of the code. (This is a well-known issue in Linux kernel live-patching, for instance.)
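
To make that failure mode concrete, here's a minimal Haskell sketch (all names are my own, nothing from the article's system): state created by one code version survives a swap, and the next version's assumptions about its shape were never enforced by any single type-checked build.

    import Data.IORef

    -- Hypothetical sketch: v1 stores plain counters, v2 expects
    -- (count, timestamp) pairs. A hot swap replaces the code but
    -- keeps the IORef alive, so v2 can observe data in a shape
    -- that no single version of the program could construct.
    type StateV1 = [(String, Int)]
    type StateV2 = [(String, (Int, Int))]

    main :: IO ()
    main = do
      ref <- newIORef [("user42", 7)] :: IO (IORef StateV1)
      -- ...a hot swap would happen here; in a real system the
      -- shape mismatch hides behind the dynamic-loading boundary
      -- instead of being caught by the compiler, as it would be
      -- in this single-module toy.
      s <- readIORef ref
      print s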


I should have emphasized more that deployment speed is a first-order concern. We certainly can (and do) build our code for every change, but not at the speed that we want to be updating.

We use a monorepo for all of the benefits it has, and deploying business logic updates quickly this way helps mitigate one of its downsides (particularly when you've already maximally parallelized the build). I've found https://danluu.com/monorepo/ gives a quick overview of why chopping up the repo would bring its own downsides.

The section about "Sticky Shared Objects" speaks directly to mutable state across code modifications, just with a Haskell-minded focus.


How much of this is because of Haskell's build times in particular? Is there a sort of "target build time" that would make you more comfortable with this stuff?


I don't think coming across these problems is Haskell-specific. We've grown enough to bubble these issues up in this Haskell project, but we would have needed to do something much sooner if this were C++.

> make you more comfortable with this stuff

Which stuff are you referring to? Overall I'd love if all builds were significantly faster, so we contribute to upstream GHC to make it better in the ways we come across. Our platform has a deployment SLA that we strive to maintain as our "target build time".


It kind of sounds like they're running into some limitations of GHC: it tends to take a long time to compile, and it tends to generate very big binaries. For most applications those aren't major problems, but in their use case (hundreds of thousands of lines of code deployed to many servers) they are, so they're working around them. That allows them to keep working in the language they prefer and are productive in, which is great.

Improving GHC compile times and reducing the binary size would be better, but presumably a lot of work has already gone into those problems and if it were easy someone would have done it by now. As for myself, I really like using Haskell and I'm glad whenever I hear about it being used in industry.


While I agree a slow build indicates a problem with their build infra, Haskell’s purity and type system do rule out issues with mutable state (presumably there isn’t any in the hot-swapped module) and invalid states (the type system prevents invalid states from being constructed, given the way they have a fixed hot-cold API).

The article describes the hot-swapped module as containing frequently changing business logic, which sounds like it’s something they can probably do via an interface with well-constrained or no mutability.
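
Something like this toy sketch of a fixed hot-cold boundary (my own illustration, not the article's actual interface): the cold side only ever calls through a record of pure functions, so any hot module that type checks against it can't smuggle in mutable state.

    -- Toy illustration of a fixed hot/cold API (not Facebook's code).
    data Request = Request { payload :: String }
    data Verdict = Allow | Deny String

    -- The stable boundary: each hot-swapped module exports one of
    -- these, and the functions inside are pure business logic.
    data HotModule = HotModule
      { classify :: Request -> Verdict
      , version  :: String
      }

    -- One revision of the frequently changing logic.
    hotV2 :: HotModule
    hotV2 = HotModule
      { classify = \r -> if payload r == "spam" then Deny "spam" else Allow
      , version  = "v2"
      }

    -- The cold side only goes through the record.
    main :: IO ()
    main = case classify hotV2 (Request "spam") of
      Allow  -> putStrLn "allowed"
      Deny r -> putStrLn ("denied: " ++ r)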


I remember at Standard Chartered they have a Haskell monolith project of a few million LoC and they are relying on incremental building. [1]

I wonder why that wasn't an option for facebook.

[1] From podcast: http://www.haskellcast.com/episode/002-don-stewart-on-real-w...


I'm not 100% sure, but they (Standard Chartered) do use a custom compiler. That might explain the difference.


I think the main benefit is the middle point. It sounds like they have programs with huge memory footprints and (I’m guessing) caches that take a while to warm up. This lets them avoid that. Fraud detection is probably time sensitive and slow responses aren’t acceptable.


They could transfer the cache data from one (old) server instance to another (new) one.


I agree. Fun read and cool hack, but it definitely feels like they are stretching to justify the more fun of the two options (spend time on this or spend time fixing the root cause).


Glad you guys both can make a better trade-off than the engineers that actually have their hands on the problem. /s

You're reading a blog post, you do not know all they have tried, nor the various intricacies they're dealing with.


Yeah, my initial reaction was "I can see how these design decisions might make sense, but the blog post is horrible."

These kinds of designs typically emerge over a long and winding history, and, for someone who was part of that process, it's difficult to coherently describe the final state to an outsider. Good textbook authors have this skill. Most tech blog authors do not. (I think part of the problem is that people don't respect just how difficult it actually is.)

My guess: restarting a large fleet of processes is a pain. The rollout will typically be throttled to avoid connection churn, among other things. For risky code changes, you probably want a slow rollout anyway, but if you're just tweaking abuse detection rules (almost just a config change), it's nice to have your changes take effect more quickly. Dynamic loading seems like one reasonable way to achieve that goal.

Tangent: people, please stop making analogies to mechanical engineering feats that are WAY more difficult than what you did [1]. People have been loading shared libraries forever; it's like adding an AUX port, not swapping out the engine. It's not even in the same league as Ksplice or as the JVM's dynamic loading/deoptimization.

[1] http://jensimmons.com/post/jan-4-2017/replacing-jet-engine-w...


You're right, I don't know all the intricacies of their system. That's why I said "I'm surprised" rather than "this is a bad design decision". It doesn't mean I can't point out potential pitfalls that I think the blog post glosses over.


They explained their justification; if they don't want random people on random forums disagreeing with their justification because it wasn't complete enough, they are free to make it more complete.


"In the first place, "we can't compile our code on every change because it takes too long" is a really awful situation to be in."

Isn't this exactly the problem Go was invented to solve?


It was one of them. However, given the other writing/talks Facebook has put out about their usage of Haskell and Haxl, Go is probably not a good fit for their use case due to language expressivity concerns (not declarative enough, not enough type safety, not syntactically flexible enough for writing DSLs).
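
For a sense of what "syntactically flexible enough for writing DSLs" means in practice, here's a toy rule-combinator DSL (entirely made up, not Sigma's or Haxl's actual API) that is a few lines in Haskell and considerably more awkward in Go:

    -- Toy example (not Facebook's code): the kind of combinator DSL
    -- that's cheap to express in Haskell.
    data Event = Event { sender :: String, body :: String }

    type Rule = Event -> Bool

    -- Combinators read like the policy they encode.
    (.&&.), (.||.) :: Rule -> Rule -> Rule
    (r .&&. s) e = r e && s e
    (r .||. s) e = r e || s e

    from :: String -> Rule
    from who e = sender e == who

    mentions :: String -> Rule
    mentions w e = w `elem` words (body e)

    -- A declarative abuse rule built from the pieces above.
    suspicious :: Rule
    suspicious = mentions "free" .&&. (mentions "crypto" .||. from "spammer@example.com")

    main :: IO ()
    main = print (suspicious (Event "spammer@example.com" "totally free stuff"))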


It "solves" it by not doing any of the things that you'd expect a modern language's compiler to do.

In my opinion the time wasted debugging Go issues that could have been statically prevented is better spent waiting for a slightly longer compile cycle to finish.


No, because fast compile times were already solved in the '70s with languages like CLU, UCSD Pascal, Mesa, and Modula-2.

Go's authors might have done it in reaction to waiting for C++ builds, but it was not a problem for those using other programming languages.


JVM hotswapping is ages old, but it's usually used only in testing, not production.


Hotswapping, just like security, is one of those things that is hard to bolt on later unless it is built deeply into the very core of the language/runtime.

Erlang (and Elixir) defines hotswapping very well. It is a standard way to upgrade code in production in some places. And even with it being well defined, it is still very hard, and there are plenty of corner cases to handle.

But when used correctly, it is really magical and can achieve nice properties.

Besides just upgrading code, hotswapping (at least in Erlang) can be used for debugging -- you can update the running code with extra log statements to catch sneaky corner cases. Maybe it is a customer setup that is very hard to replicate.

Or you can use it for local development: as you edit code, the module gets auto-reloaded (with a helper).

It can also be used to deliver hot fixes. If the fix is simple and the customer cannot wait for a full release to be built, you can update their system on the spot to tide them over. Not ideal, but I've seen it save the day many times.


> Or you can use it for local development, as you edit code

This is a huge feature for me in my Elixir development. I mostly use Elixir for some server code that manages many connections to external network entities. It would be a huge hassle to bring down my server application every time I want to make a change.

With Elixir (yeah, Erlang), I can normally recompile the module I'm working on and deploy it in the running server. Not only is it a good way to constantly observe Erlang hot-swapping in action on my dev machine, it's a huge time saver.


Couldn't agree more. When the article said "Starting and tearing down millions of heavy processes a day would create undue churn on other infrastructure" I just thought, yes, I bet you'd struggle to create an architecture so monolithic in Erlang or Elixir. Just one of the many benefits, of course... add on the number of processes you can create on one machine while maintaining throughput...


People might find this interesting to compare and contrast to (certainly I noticed parallels with [1]) Netflix's 'serverless' platform, discussed yesterday:

https://medium.com/netflix-techblog/developer-experience-les...

Edit: As a sibling commenter notes, this is most eminently doable with e.g. Common Lisp and BEAM (Erlang/Elixir), but more folks are (publicly) attempting this in other environments now (I've experimented with a number of approaches to this over the last few years, so I'm trying to keep score - would love to see any comments on other attempts below).

[1]: Quote: "At the core of the redesign is a Dynamic Scripting Platform which provides us the ability to inject code into a running Java application at any time. This means we can alter the behavior of the application without a full scale deployment."


> Common Lisp and BEAM (Erlang/Elixir)

Clojure as well right?


Yup, having strong namespaces in general makes this easier, I think, and in Clojure, tools like devcards (CLJS) and Ring reload (CLJ) support this workflow.


Anything JVM-based. Also Smalltalk.


It's basically this:

https://news.ycombinator.com/item?id=8804381

http://nullprogram.com/blog/2014/12/23/

You'd be surprised how many languages can do this. Though it's hard to beat Lisp (and Erlang), where it is the default.


Really, anyone who bothers reading the dlopen(3) manpage: http://man7.org/linux/man-pages/man3/dlopen.3.html

Loading / unloading code is straightforward. The trick is in getting the code called from existing code.
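
For the curious, the same machinery is reachable from Haskell via the unix package's System.Posix.DynamicLinker; a minimal sketch (the .so path and symbol name here are hypothetical):

    {-# LANGUAGE ForeignFunctionInterface #-}
    import Foreign.Ptr (FunPtr)
    import System.Posix.DynamicLinker (dlopen, dlsym, dlclose, RTLDFlags (RTLD_NOW))

    -- The "trick": turn the raw symbol into something callable.
    foreign import ccall "dynamic"
      mkEntry :: FunPtr (IO Int) -> IO Int

    main :: IO ()
    main = do
      dl <- dlopen "./libhot.so" [RTLD_NOW]  -- loading: easy
      entry <- dlsym dl "hot_entry"
      result <- mkEntry entry
      print result
      dlclose dl  -- unloading: only safe once nothing references the old code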


Loading is straightforward, unloading depends.


Not really relevant to the article itself, but I've always been surprised how much attention Facebook's Haskell usage gets compared to Facebook's OCaml usage.

From interning at Facebook, the only project I'm aware of that uses Haskell is Sigma. On the other hand, numerous projects use OCaml: Infer, HHVM, Flow, ReasonML, Pfff, etc.

However, it's Haskell that gets all the attention on Hacker News.


I’ve seen them recruit for a second Haskell team, I believe doing internal analytics or machine learning.


We are going to be back in the '70s/'80s of running live-image programming systems in no time. History repeats itself. Mark my words.


That is what web apps are all about, actually.



