This is really cool. We use Elixir at work, but we mostly use it in a "traditional web app" (i.e. non-Elixir) way, of Docker containers deployed to independent AWS instances.
So I'm always intrigued by some of the more BEAM-specific things that folks do, like using `observer` on a remote (production??) node here, or distributed Elixir where the nodes communicate with each other, or "hot" code updates.
How do companies deploy Elixir in such a way to take advantage of all those things? Does Sequin talk anywhere about their deploy process and how their infrastructure looks?
For us we have our app deployed to $N containers with a load balancer in front (pretty standard stuff I think?)
In Erlang/Elixir you can actually override how instances of the BEAM find each other (instead of the standard EPMD), so we have a module that does some DNS queries, finds the IPs of the other containers and says “hi, here’s your cluster, discovery done.” (Your setup may preclude all that, I know this all depends on how a system’s architected.)
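A rough sketch of the DNS part, for the curious. This is not the internal module described above and it does not replace EPMD; it only shows the "look up the other containers and connect" step. `DnsDiscovery`, the hostname, and the `app` basename are made up for illustration, and it assumes every node is started with a name like `app@<ip>` and the same cookie.

```elixir
defmodule DnsDiscovery do
  # Hypothetical module, for illustration only.

  # Turn resolved IPv4 tuples into node names such as :"app@10.0.0.1".
  def node_names(ips, basename) do
    for ip <- ips do
      :"#{basename}@#{:inet.ntoa(ip)}"
    end
  end

  # Resolve the A records for `hostname` and try to connect to each peer.
  # Returns a list of {node_name, result} pairs, where result is whatever
  # Node.connect/1 returned (true, false or :ignored).
  def connect_all(hostname, basename) do
    hostname
    |> String.to_charlist()
    |> :inet_res.lookup(:in, :a)
    |> node_names(basename)
    |> Enum.reject(&(&1 == node()))
    |> Enum.map(&{&1, Node.connect(&1)})
  end
end
```

Calling something like `DnsDiscovery.connect_all("my-app.internal", "app")` on a timer would keep picking up new containers as they appear in DNS.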
After doing that we were free to use all of Erlang’s cool cluster stuff! In our case we have in-memory caches for a few things. If a given instance does a lookup because of a cache miss, it broadcasts a message to all the other nodes saying “I just looked up $expensive_thing, here’s its value” so they don’t have to do the lookup themselves; they just cache that value. You end up with a little distributed cache in a few lines of code. Btw, these cache entries are short-lived, and a little inconsistency does us no harm if one of our instances misses the message (networks are networks), but it’s been great!
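To give a feel for how little code that broadcast takes, here is a minimal sketch. It is an assumption about the general shape, not the actual code: `BroadcastCache` is a hypothetical name, the cache is a plain map inside a GenServer, and expiry of entries is left out entirely.

```elixir
defmodule BroadcastCache do
  use GenServer

  # Hypothetical per-node cache that shares freshly looked-up values
  # with every connected node. No TTL handling in this sketch.

  def start_link(_opts \\ []) do
    GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
  end

  # Return the cached value, or run `lookup_fun` on a miss and
  # broadcast the result so other nodes can skip the lookup.
  def fetch(key, lookup_fun) do
    case GenServer.call(__MODULE__, {:get, key}) do
      {:ok, value} ->
        value

      :error ->
        value = lookup_fun.()
        # Fire-and-forget cast to this node and all connected nodes.
        # A node that misses the message simply does its own lookup later.
        GenServer.abcast([node() | Node.list()], __MODULE__, {:put, key, value})
        value
    end
  end

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:get, key}, _from, state) do
    {:reply, Map.fetch(state, key), state}
  end

  @impl true
  def handle_cast({:put, key, value}, state) do
    {:noreply, Map.put(state, key, value)}
  end
end
```

Each node runs one `BroadcastCache` under its supervisor, and callers use `BroadcastCache.fetch(key, fn -> expensive_lookup(key) end)`, where `expensive_lookup/1` stands in for whatever the real lookup is.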
Anyway, I think it’s super cool and I’d encourage you to play around if you get the chance.
Also the observer is just amazing. We’ve debugged some pretty weird memory and cpu usage issues with it, I have some internal blog posts, maybe I should see if I could make them public.
Can you speak more to how you bypass EPMD and send the IPs of the containers to each other? That would be great for a problem we’re seeing where I work
It took some work to piece together but wasn't too complicated in the end. It's internal code that I can't quickly sanitize or I'd just dump it in a gist :/ Someone on the Elixir Forum might have a template or library handy though.
Same. I'm not clustered yet, but I plan on it before EOY and that would be amazing. I think route53 has some internal routing capabilities, but some of the setup looks scary, or am I just being silly?
just beware service discovery via DNS in AWS will return up to 8 addresses. so if you have more than 8 nodes you will get a random subsample for each request. depending on how your clustering works this may or may not be a problem. you can use the web api if you need to handle more than 8 hosts.
Distributed Elixir can be done with Docker containers too, see https://github.com/bitwalker/libcluster which by default has some Kubernetes support but you can also have third party (or custom) clustering strategies. I've not done this myself but I've seen articles about this a lot during the past years.
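For a rough idea of what that looks like, here is a config sketch using libcluster's built-in DNS polling strategy. The topology name, hostname, and basename are placeholders, not anything from a real deployment, and the exact options are worth checking against the libcluster docs for the version in use.

```elixir
# config/runtime.exs (sketch; all names below are placeholders)
import Config

config :libcluster,
  topologies: [
    dns_example: [
      strategy: Cluster.Strategy.DNSPoll,
      config: [
        # How often to re-resolve the DNS name, in milliseconds.
        polling_interval: 5_000,
        # DNS name that resolves to the IPs of all containers.
        query: "my-app.internal",
        # Nodes are expected to be named my-app@<ip>.
        node_basename: "my-app"
      ]
    ]
  ]
```

The application then starts `Cluster.Supervisor` with these topologies in its supervision tree.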
Hot code updates for most applications aren't really worth it in my opinion, assuming you do something like blue/green rollover deployments. It's cool that it's possible though. But it requires appup files and afaik Distillery is one of the release tools that has support for it built-in.
recon and observer_cli are the tools I reach for first to debug any issues in production. In any other language, I usually think about how to reproduce the issue locally. With Elixir, I just get into a remote shell on the affected machine and live debug the issue, and there are cases where we applied a hotfix by using eval right there from the shell. The idea of the remote shell itself is alien to most languages.
This sort of thing doesn’t have to be a compliance breach, but you will likely need some way of ensuring there’s a second person in the loop. Typically that would take the form of having someone in a separate production infrastructure team actually driving while you talk them through what needs to happen.
Compliance is performative until it isn't. If you've ever been party to a breach, the role of compliance and an audit trail to the security narrative becomes _very_ important. Consider:
1. We had a breach. A factor in this was insufficient oversight on a process that granted privileged access to customer data. We fixed the problem, promise that your data is safe, and don't believe this will happen again.
2. We had a breach. A factor in this was due to a gap in an existing control around customer data that had a problem we had not anticipated. These were the people involved. This is exactly how this problem occurred. This is the data that was exposed. This is documentation of our response to this incident. This is our existing policy around how we handle data and how we respond to breaches.
Customers, partners, regulators, and law enforcement respond a lot better when you can demonstrate good intent and at least imply that you have some kind of process. Of the two scenarios I outlined, the latter provides those assurances.
Compliance isn't the only way to do this, but it's often the easiest.
Is it any different than any other kind of direct prod access? I mean, you have to have controls for accessing prod, so it seems like you could use similar ones for REPL access, but that's logic and I know that only has a tenuous relationship with compliance...
> Second, we passed one particularly large data structure from a manager to a pool of dedicated worker processes. This meant we were reincurring the memory cost of this data structure for each worker process. We couldn't eliminate the repetition, but reducing the data to its bare essentials before passing it down to the workers minimizes that cost.
Hard to say without knowing much about the data in question, but my recollection is that large Erlang/Elixir/BEAM "binaries" are actually not copied around. That might be a strategy for sharing larger things in some cases.
Marshalling data is pretty easy in Erlang:
```erlang
2> Bin = erlang:term_to_binary([1, 2, 3]).
<<131,107,0,3,1,2,3>>
3> erlang:binary_to_term(Bin).
[1,2,3]
```
A related anecdote: some months ago I had a memory leak inside a (greatly duplicated) genserver while repeatedly calling a lib[0] function inside it, that would result in the server basically crashing after a while.
I never understood what in that lib was causing the leak but I fixed it (or more accurately mitigated it) by wrapping the call in a Task.async/1
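For anyone wanting to try the same mitigation, the wrapping looks roughly like this. `Isolated` is a made-up helper name. One plausible reason it helps (an assumption, since the cause of the leak was never pinned down) is that whatever garbage the call creates lives on the heap of the short-lived task process and is released when that process exits, instead of piling up in the long-lived GenServer.

```elixir
defmodule Isolated do
  # Hypothetical helper: run `fun` in a short-lived process and wait
  # for its result. Only the result is copied back to the caller.
  def run(fun, timeout \\ 5_000) do
    fun
    |> Task.async()
    |> Task.await(timeout)
  end
end

# Usage inside a GenServer callback would look something like:
#   result = Isolated.run(fn -> SomeLib.leaky_call(arg) end)
# where SomeLib.leaky_call/1 stands in for the real library function.
```

Note that `Task.async/1` links the task to the caller, so a crash in the wrapped call still takes the GenServer down unless that is handled separately.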
RefC binaries should be taken care of by the virtual binary heap, which seems to be from r13 (although I thought it was newer than that... maybe there's another change I'm thinking of, but can't find).
A related but different refc binary hazard is a process that obtains large refc binaries somehow, and makes a subbinary that it sends to another process (or ets!). The large binary is still referenced from the subbinary, so there's a significant amount of excess memory. You can also run into this when binary creation is optimized to allow for appending [1], because that makes a binary much larger than the required size (either double or 256 bytes, whichever is more). Either way, if you have a use case that naturally results in long-term storage of binaries or subbinaries that allocate much more space than is really required, binary:copy/1 can be used to make a clean copy that's the exact size and isn't (yet) shared.
I've seen mnesia (ets) nodes where due to the code structure, the memory use ended up at 4x what was needed, and binary:copy added before storing with ets fixed things up with no other code change.
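The subbinary effect described above is easy to see from a shell. A small sketch in Elixir (the sizes are arbitrary), using `:binary.referenced_byte_size/1` to show how much memory a binary is really keeping alive:

```elixir
# A 1 MB binary: large binaries live off-heap and are reference counted.
big = :binary.copy(<<0>>, 1_000_000)

# A 100-byte slice is a sub-binary that still points into the 1 MB binary.
sub = binary_part(big, 0, 100)
IO.inspect(byte_size(sub), label: "size of sub")
IO.inspect(:binary.referenced_byte_size(sub), label: "kept alive by sub")

# :binary.copy/1 makes an independent binary holding only those 100 bytes,
# so storing `clean` (e.g. in ets) no longer pins the original.
clean = :binary.copy(sub)
IO.inspect(:binary.referenced_byte_size(clean), label: "kept alive by clean")
```

Storing `clean` instead of `sub` is the one-line change that stops the big binary from being held on to.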
The BEAM is really cool, and was actually originally intended to be a bare-metal operating system. That's why it has so many features that are useful for operations: they couldn't assume you'd have any other tooling available, and often didn't even have physical access to the machines that were running it.
Thanks! We considered using Postgres' WAL but decided not to for the time being.
Our solution now uses trigger functions. These trigger functions fire whenever a create/update/delete happens on a Sequin table. They insert a row into a log table. That log table is processed by our workers to send changes to the upstream API.
The advantages of using trigger functions + a log table are all about ease of use and compatibility: our customers don't have to do anything fancy to set up Sequin, we just need a role with `create` privileges in the database. The log table also makes it easy for both them and us to debug issues, as the stream of changes that we captured is right there in the database.