Elixir at Ramp (ramp.com)
175 points by judicious 12 months ago | 57 comments



When I worked with Erlang outside of Ericsson, I heard the following from our tech lead: OTP's built-in distributed process and failover mechanisms assume much lower latencies than usually exist between datacenter VMs, and the underlying protocols consume a lot of bandwidth. I didn't dig deep enough to verify this, but I can confirm that typical Ericsson hardware is several boards connected by a high-bandwidth backplane, and OTP failover was designed for this environment. VMs communicating over a LAN is a different environment, and VMs living in different data centers are very far from this. I wonder if anyone else tried using OTP outside of Ericsson and what their experience was.


Welp, my browser ate my longer response. But in short: when I was at WhatsApp, dist worked fine over WANs; I think the WA-contributed pg addresses the issues we saw with pg2. It was obviously not ideal, or normal, but sometimes our WAN connections would get bandwidth-limited due to congestion or other issues, and things would generally keep going without unexpected problems. Replication got behind when the change stream exceeded available bandwidth, and there's no separation of dist traffic, so if replication has minutes or hours of delay, so does any other traffic in that direction: interactive traffic will time out, net_adm:ping will take minutes or hours, etc. I guess technically the connection ticks must not quite suffer the same queuing, because the connections would not be severed by tick timeouts.

The other issue related to WAN that we'd see is when you start a mnesia node, it loads all of the tables from peers if peers have copies of the tables, but out of the box it doesn't have a concept of LAN vs WAN peers, so it may load your half a terabyte or whatever worth of data from WAN instead of LAN. Loading all that data from LAN might be undesirable too, but that's another story; you could probably come up with a way to store and later replicate changelogs for offline mnesia peers, but that doesn't come out of the box, and WA didn't build it either.

The community at large often reported maximum dist cluster sizes that were way less than what we ran in production, and comments about WAN distribution that didn't match our experience either. shrug It does help to have a very solid private network between nodes though. Dist in general, but mnesia specifically, don't like for nodes to disappear. pg2 has some quirks too, but pg resolves them, afaik, as I mentioned earlier. Some of pg2's issues were rooted in global locks, you've got to be careful about those if node lists are rapidly changing and there's contention on the locks.


I believe the reason why we have seen so many different numbers given as limits for distribution is that they are often measuring different things. As you know, the limits will vary depending on whether it is just distribution or whether global/mnesia are used. I have seen distribution by itself going north of 300 nodes (and heard from others going quite beyond that) without issues, and heard reports of global usage struggling past 40-60 nodes (especially global transactions).

At least in the Elixir ecosystem, global is rarely used and I personally suggest looking for alternatives if you are going past 20 nodes. Unfortunately I have no practical experience with mnesia.


I think there are ways to get global to work, but it gets really hairy if there's much lock contention, if the node lists are changing frequently / there's a partition, or if some nodes are processing slowly. At least in the versions we were running; there have likely been some improvements since.

IIRC, we had issues with pg2, which does global locks, when we started many nodes in large clusters that all wanted to join processes to the same pg2 group. For reasons, pg2 wanted the process list per pg2 group to be consistently ordered on all nodes, although I'm not convinced that really worked 100%... Anyway, it would try to do a global lock, but if you've got about 1000 processes across as many nodes all trying to do a global lock and they also haven't all found each other quite yet... it just doesn't make progress. Especially if some of the nodes are half a continent away. As I understand it, the new pg avoids global locks and group membership does not have a consistent ordering; instead, the local pg process just broadcasts joins and leaves from local processes and sets monitors to adjust when group members are killed or become unreachable.
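For readers unfamiliar with the newer pg (ships with OTP 23+), a minimal Elixir sketch of the pattern; the group name is made up:

    # Start the default pg scope (usually done once, under a supervisor).
    {:ok, _pid} = :pg.start_link()

    # Join a local process to a named group. pg broadcasts the join to peers
    # and monitors the pid, so there's no global lock and no consistent
    # ordering of members across nodes.
    :ok = :pg.join(:my_group, self())

    # Readers get whatever the local view of the group currently is.
    members = :pg.get_members(:my_group)
    local   = :pg.get_local_members(:my_group)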

Global has some techniques to make progress, but they didn't work for our case, I think because the node lists didn't converge in a reasonable amount of time, or maybe pg2 was using locking similar to global locking, but not exactly. Anyway, we did adjust the locks to make sure they would progress first, but later avoided them, which is better.

Mnesia with a large number of nodes sharing a schema would probably have issues too, we tended to run data services sharded over groups of 4 servers sharing a schema... the whole group of storage servers would be in a single dist cluster, but there's no need for all the mnesia servers in a system to share the same schema; and we'd send application level reads and writes to gen_servers on the mnesia nodes to do the actual mnesia reads and writes. Lots of benefits from making a clear boundary between uses of the data and the storage system.
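A rough Elixir sketch of that boundary (table name, module names, and record shape are all hypothetical): application code calls a gen_server on the storage node instead of running mnesia transactions itself.

    defmodule Storage.Worker do
      use GenServer

      def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

      # Callers (possibly on other nodes) go through this API instead of
      # touching mnesia directly.
      def get(key), do: GenServer.call(__MODULE__, {:get, key})
      def put(key, value), do: GenServer.call(__MODULE__, {:put, key, value})

      @impl true
      def init(opts), do: {:ok, opts}

      @impl true
      def handle_call({:get, key}, _from, state) do
        {:atomic, records} = :mnesia.transaction(fn -> :mnesia.read(:kv, key) end)
        {:reply, records, state}
      end

      def handle_call({:put, key, value}, _from, state) do
        {:atomic, :ok} = :mnesia.transaction(fn -> :mnesia.write({:kv, key, value}) end)
        {:reply, :ok, state}
      end
    end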


Dist Erlang worked fine in today’s environments for the application we were working on — nothing like real time audio. (Although people are using Elixir for real-time audio and video processing and muxing).

I saw an example of something OTP is also good at — coordination across unreliable nodes and networks. IoT is obvious … but so are mobile point-of-sale and payment systems on food trucks.


The epmd protocol is designed to be replaceable, and that's what Partisan is for:

https://partisan.dev/


You should probably reach for something like ra for anything but trivial stuff. If latency is not a concern, the global module is probably fine, but double-check the quorum membership.


ra for those interested: https://github.com/rabbitmq/ra ("A Raft implementation for Erlang and Elixir that strives to be efficient and make it easier to use multiple Raft clusters in a single system.") Used by RabbitMQ for quorum queues + Khepri (the new metadata store)


This is from 2021. I'd be curious if Elixir is still in use, or has expanded at all.

Side note: Pablo's blog posts are always quite good.


Thanks for the kind words! Another comment mentions it, but last I checked Elixir is still powering the Authorization service.

You might be interested in a "making of" post I wrote on my personal blog, going into the intentions and crafting of this https://morepablo.com/2023/03/technical-storytelling-making-...


I loved reading the “making of” post. I’m very interested in getting into technical writing, and your post really helped me understand your thought process behind “Elixir at Ramp”.


Yep, it's still being used as our card authorizer service! The features have expanded, although the primary backend monolith is still in Python.


Do you foresee a larger scale migration to Elixir in the future?

Or would you say that Elixir lends itself to some services and Python others?


And if so: Why is that? Sometimes I read a glowing review and wonder "Why not Elixir everything?" What's the catch?


Author here; I have a lot of other posts in my personal blog about this, but: the current trends in VC-backed tech companies are about minimizing risk and following fashion, rather than any technical merit or experimentation. Said another way: if an Elixir company dies, it's "damn, shouldn't have picked Elixir!" If a Python company dies, it's "startups are hard," with no investigation behind what Python cost you.

I go into it a bit here https://morepablo.com/2023/05/where-have-all-the-hackers-gon... and here https://morepablo.com/2023/06/creatives-industries.html

Elixir has real technical downsides too, but honestly they never come up when someone advocates against it. And this is fine, building companies and engineering culture is a social game at the end of the day.


Could you maybe share your perception of the technical downsides of Elixir?


The libraries out there lack the breadth and maturity of some of the other ecosystems (as a simple example).

Or at least they did for some of my corners of the world.


Sure! I find most responses (like the other one on this comment) talk about the social as it relates to the technical (what I call "atmosphere" in [this blog post][1]). I'll avoid that, since a) I think it's kind of obvious, and b) it's somewhat overblown. If Python has 20 CSV libraries and Elixir has 2, but they work, are you really worse off? I'll instead try to talk about "soil" and "surface" issues: the runtime, and (assuming engineers already know the languages) what they allow for expressivity. Here we go!

--- Most of the BEAM isn't well-suited for trends in today's immutable architecture world (Docker deploys on something like Kubernetes or ECS). Bootup time on the VM can be long compared to running a Go or OCaml binary, or some Python applications (I find larger Python apps tend to spend a ton of time loading modules). Compile times aren't as fast as Go, so if a fresh deploy requires downloading modules and compile-from-scratch, that'll be longer than other stacks. Now, if you use stateful deploys and hot-code reloading, it's not so bad, but incorporating that involves a bit more risk and specific expertise that most companies don't want to roll into. Basically, the opposite of this article https://ferd.ca/a-pipeline-made-of-airbags.html

Macros are neat but they can really mess up your compile times, and they don't compose well (e.g. ExConstructor and typed_struct and Ecto Schemas all operate on Elixir Structs, but you can't use all three)

If your problem is CPU-bound, there are much better choices: C++, Rust, C. Python has a million libraries that use great FFI so you'll be fine using that too. Ditto memory-bound: there are better languages for this.

This is also not borne from direct experience, but: my understanding is the JVM has a lot more knobs to tune GC. The BEAM GC is IMO amazing, and did the right thing from the beginning to prevent stop-the-world pauses, but if you care about other metrics (good list in this article https://blog.plan99.net/modern-garbage-collection-911ef4f8bd...) you're probably better off with a JVM language.

While the BEAM is great at distribution, "distributed Erlang" (using the VM's features instead of what most companies do, which is to ad-hoc it with containers and infra) makes assumptions that you can't break, like full-mesh clustering by default (every node must be connected to every other node). This means you can distribute to some number of nodes, but it's hard to use Distributed Erlang for hundreds or thousands of nodes.

Deployment can be mixed, depending on what you want. BEAM Releases are nice, but they lack some of the niceness of direct binaries. Libraries can work around this (like Burrito https://github.com/burrito-elixir/burrito).

If you like static types, Dialyzer is the worst of the "bolted-on" type checkers. mypy/pyright/pyre, Sorbet, Typescript are all way better, since Dialyzer only does "success typing," and gives way worse messages.
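To illustrate the success-typing point, a small made-up example; Dialyzer only complains when it can prove code will never succeed, so specs are checked far more loosely than a TypeScript or mypy annotation would be:

    defmodule Pricing do
      # Dialyzer treats this spec as a success typing: it only flags call
      # sites it can prove will *never* succeed.
      @spec discount(number(), float()) :: float()
      def discount(price, rate), do: price * (1.0 - rate)
    end

    # This would blow up at runtime with an ArithmeticError; Dialyzer can
    # flag it here because the literal string can never be a number, but
    # anything less obvious tends to slip through.
    # Pricing.discount("10", 0.2)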

   [1]: https://morepablo.com/2023/05/where-have-all-the-hackers-gone.html


> --- Most of the BEAM isn't well-suited for trends in today's immutable architecture world (Docker deploys on something like Kubernetes or ECS). Bootup time on the VM can be long compared to running a Go or OCaml binary, or some Python applications (I find larger Python apps tend to spend a ton of time loading modules). Compile times aren't as fast as Go, so if a fresh deploy requires downloading modules and compile-from-scratch, that'll be longer than other stacks. Now, if you use stateful deploys and hot-code reloading, it's not so bad, but incorporating that involves a bit more risk and specific expertise that most companies don't want to roll into. Basically, the opposite of this article

I don't necessarily disagree with your reading of the trends, but if following the trends means losing the best tool for the job, maybe it's not the tool that's in the wrong. Re: deploy time, I don't think there's a need for deployed servers to fetch and compile modules --- you wouldn't do that for a Java or C server, you'd build an archive once and deploy the archive. I guess if you're talking about the speed of the build pipeline, I'd think the pieces in the build could be split into meaningful chunks so builds could be in parallel and/or only on the parts where the dependencies changed. I imagine BEAM startup itself isn't particularly fast at the moment, because I haven't seen a lot of changelogs about speeding it up, but I'm not sure there's a huge amount of perceived need there? If you're really going to abandon all hotloading, you could also abandon all dynamic loading and preload your whole app (but that might complicate your build pipeline).

> This is also not borne from direct experience, but: my understanding is the JVM has a lot more knobs to tune GC. The BEAM GC is IMO amazing, and did the right thing from the beginning to prevent stop-the-world pauses, but if you care about other metrics (good list in this article https://blog.plan99.net/modern-garbage-collection-911ef4f8bd...) you're probably better off with a JVM language.

The BEAM GC is really so different from a JVM GC that it's hard to compare them. I guess you can still measure and compare throughput, but the JVM has to deal with all sorts of data structure complexity that comes when you don't have the restrictions of a language with immutable data. You can't make a reference loop in a process heap, so the BEAM GC doesn't have to find them; only a process can access its heap, and the BEAM GC runs in the process context, so there's no concurrent access to guard against (otoh, a process stops to GC, so it's locally 100% stop-the-world, it's just that the world is very small). The specifics are so different that it's hard to say which one is better; the world views just diverge too much. But yes, there are significantly fewer knobs, because there are a lot fewer things to change.

> assumptions that you can't break, like full-mesh clustering by default (every node must be connected to every other node). This means you can distribute to some number of nodes, but it's hard to use Distributed Erlang for hundreds or thousands of nodes.

That's certainly the default, and some things might assume it, but you absolutely can change the behavior, especially with the pluggable epmd and dist protocol support. You will certainly break assumptions though, and some things wouldn't work: I'd expect the global module's locks not to function properly, and if you forward messages between nodes, you can make things fail in different ways... In a fully meshed cluster, if a process on node A sends a gen_server:call to a process on node B, that process can forward the message to a gen_server on node C, and when the process on node C replies, it sends directly to node A; if your cluster is set up so that nodes A and C can't communicate directly, the reply would be lost, unless the process on node B that forwarded the request to node C arranges to forward the reply back to node A. If you do that sort of forwarding, you also lose out on the process monitoring that gen_server:call typically does: if a gen_server crashes, or the connection to the node is ended, calls fail 'immediately', at the cost of extra traffic to set up the monitors. At WhatsApp, we didn't use that... the cost to set up and tear down monitors was too great.
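For the curious, a rough Elixir sketch of that forwarding scenario (module and node names invented): node B defers its reply and node C answers the original caller directly with GenServer.reply/2, which only works if node C can actually reach node A.

    defmodule NodeB.Router do
      use GenServer

      @impl true
      def init(_opts), do: {:ok, %{}}

      @impl true
      def handle_call({:lookup, key}, from, state) do
        # Forward the request, including the caller's `from` tag, to node C,
        # and return :noreply so the caller keeps waiting on us.
        GenServer.cast({NodeC.Store, :"c@host"}, {:lookup, key, from})
        {:noreply, state}
      end
    end

    defmodule NodeC.Store do
      use GenServer

      @impl true
      def init(_opts), do: {:ok, %{}}

      @impl true
      def handle_cast({:lookup, key, from}, state) do
        # This reply goes straight from node C to the caller on node A, so
        # it is lost unless A and C are directly connected (or B forwards it).
        GenServer.reply(from, Map.get(state, key))
        {:noreply, state}
      end
    end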


Off topic, but are your "remote" positions available to developers outside of the US?


Yes, although iirc we prefer people in similar timezones.


Yeah, I’m definitely curious too. From what I know Ramp is primarily a Python shop on the backend.

I just love to see functional programming being used in industry and seeing the thought process behind it.

Also, I agree. I’m a big fan of Pablo’s posts.


I also wondered about this but could not find anything definitive on that.

They seem to be a bit "stealth" about the technologies in use at least based on reading their job offers:

https://jobs.ashbyhq.com/ramp/6a5108f4-8605-4a67-9446-ea05f1...

> Proficiency in Python, Go, or Java


I’m not sure why, but the internship role seems to be the only one listed that has an accurate list of our tech: https://jobs.ashbyhq.com/ramp/29663a4b-c457-4a38-bbdf-069f18...

The article goes into a little more detail too, we haven’t significantly changed anything in the last 2 years.


We switched from Java to Elixir and the change has doubled our productivity. Elixir is the most pleasant language I've ever worked with (30+ years in software engineering), so I recommend giving it a try.


> My hot take about most dynamic languages is that they are a poor fit for startups who have intentions of being long-term businesses: you've created an environment that's optimized for your founding engineers to build something quickly in the first 7 months, but the price is a set of recurring obstacles which your engineers will pay down over the next 7 years.

I've been saying this for years, so it feels really good to the confirmation bias to see it said by someone else!

That said, the language and frameworks definitely matter, but you can easily write short-term-over-long-term code in any language/framework. Founding engineers tend to be the worst as well, because they love, and sometimes have only ever experienced, greenfield development, and they are trying to move fast. That's not a bad thing, as without it the startup may die on the vine, but it is something I wish more founding engineers would be cognisant of, because little things (JSON schemas, and unit/integration tests, for example!) can make a big difference.

Elixir (and Phoenix) really do get the best of both worlds. It's highly productive like Rails, but the built-in way of doing things gives you validations on the data coming in, and there are plenty of tools like Ecto schemas that can be used anywhere and everywhere the data matters. Tests are also first-class citizens and have to be actively ignored to avoid them.
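As a small, invented example of what that looks like: a single Ecto schema plus changeset gives you the same validations at the web boundary, in background jobs, and in tests.

    defmodule MyApp.Expense do
      use Ecto.Schema
      import Ecto.Changeset

      schema "expenses" do
        field :merchant, :string
        field :amount_cents, :integer
        timestamps()
      end

      # Used by controllers, jobs, tests... the same validations apply
      # anywhere data crosses into the system.
      def changeset(expense, attrs) do
        expense
        |> cast(attrs, [:merchant, :amount_cents])
        |> validate_required([:merchant, :amount_cents])
        |> validate_number(:amount_cents, greater_than: 0)
      end
    end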

The one area where IMHO the jury is still out is LiveView. It takes a lot of battle-hardened approaches from React, and my initial feelings on it are very good. However, I've not yet had a natural organizational structure emerge with it, so I'm not sure how maintainable it will be in the future for people who didn't write it. Anybody have experience with that?


> The one area where IMHO the jury is still out is LiveView.

Agree 100%. Unfortunately LiveView is quickly becoming the way of doing things, which I think is ill-advised. It makes the language larger, and is still too much in flux (it's not even v1 yet).

LiveView is great, but it's not perfect and has real drawbacks.


I've been using LiveView since it came out and it has gotten a lot better since then. I ran into some drawbacks early on with authentication, but later versions fixed those issues.

Out of curiosity, what do you see as LiveView's drawbacks with its latest release?


I feel the same way about OO tbh. When everything is just functions + data types it's all much more easy to reason about and maintain.

I have no clue how LV scales. But I have yet to see a React application that doesn't drive me crazy with the amount of indirection and complexity!


Totally agree: when everything is functions and data, it makes life a lot easier. I especially find it useful when refactoring code while trying to keep an existing function signature.


>> My hot take about most dynamic languages is that they are a poor fit for startups who have intentions of being long-term businesses

If you know exactly what you are building, down to the datatypes, sure, use static typing. But 99% of startups aren't like that. Business development is a highly dynamic process.

Once you hit $1M revenue, rewrite it in Rust or whatever.


This is unironically my plan/playbook with ideas I have for startups. Start with Python, node, or whatever. Then go to Rust once product-market fit has been established.

Rust really hits that sweet spot with memory-management, programming paradigm flexibility, and speed. It’s precisely great for what I need because it’s essentially an ML with good imperative constructs.


> My hot take about most dynamic languages is that they are a poor fit for startups who have intentions of being long-term businesses...

I don't know. Every startup I've ever worked at has used a dynamic language and most of them have been successful.

I know many disagree with me, but I've used a lot of typed languages (Swift, Typescript, C#, Java, etc.) and a lot of dynamic languages (Python, Ruby, Elixir, etc.) in my 30+ year career and I'm still not convinced that typed vs. dynamic is as big a deal as some make it out to be.


Does Elixir have type checking like TypeScript does? (Can you add types?)


The hottest news in the Elixir community is that it is currently experimenting with set-theoretic / gradual typing. A French PhD is working on it.

It is past the R&D phase and in development at the moment.

It is highly experimental and not guaranteed to ship with the next version of Elixir. Right now the idea is to see if it offers any value in the first place. There are a lot of statements about what types offer. What the Elixir team wants to do is actually experiment to see if there's a benefit. Which is a really neat thing for a lang to do...to actually experiment like that is incredible.

As for right now, in a sense yes. It has this thing called Dialyzer - https://www.erlang.org/doc/apps/dialyzer/dialyzer_chapter.ht... which is a static analysis tool. This offers some non runtime type guarantees. But it is still a dynamically typed language. And the community is split on it IME. The errors are cryptic and there are some other issues. There's also Norm.

There are some things you can do to guarantee types during runtime, like match a function to a type. So when called with a string the implementation is different than with a boolean. So there are some run time type guarantees.

Example of matching (return a different string if boolean or if nil):

    def do_thing(n) when is_boolean(n), do: "some boolean"
    def do_thing(n) when is_nil(n), do: "nil"

There's also the concept of a struct which is like a typed object / map / key value pair thing.


> The hottest news in the Elixir community is that it is currently experimenting with set-theoretic / gradual typing. A French PhD is working on it

To expand on it: you can read the proposal[0] or watch the video presentation by the author[1].

It is a very similar system to Typescript, with (at first glance) only minor differences: "any" becomes "term", primitive constants are not singleton types (so you cannot write programs with the type system alone... yet), and I expect some edge cases around generics will be handled differently (especially since you cannot specify types yourself in function calls, i.e. there is no myFunction<MyType1, MyType2>()).

On the other side, there are some things that the proposal seems to be doing better than Typescript. It handles type assertions without the need for "function isX(maybeX): asserts maybeX is X { ... }", and since pattern matching plays an important role in Elixir, I expect it will have a lot of polish around that.

[0] https://arxiv.org/pdf/2306.06391.pdf

[1] https://youtu.be/gJJH7a2J9O8


To show the struct example:

    def do_thing(%Foo{} = foo), do: foo
Here `foo` has to be type of struct `%Foo{}`.

    def do_thing(%Foo{bar: "baz"} = foo), do: foo
Here `foo` has to be struct `%Foo{}` with field `:bar` with the value of `"baz"`.

Another one I like to point out is how operators aren't rampantly overloaded.

    def add(a, b), do: a + b
This will _only_ work with floats or integers because `+` only works on those types. String concatenation, adding lists, date math, etc have their own operators/functions.

While Elixir does offer mechanisms to overload operators, you have to be very explicit about it. There is no way to globally redefine `+`, for example. It would be on a per-module or even per-function basis (and not something that is done too often, though there are some good uses of it out there).
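A rough sketch of what a scoped override looks like (the module and types are made up); you have to explicitly hide Kernel's +/2, and nothing outside the importing scope is affected:

    defmodule Vec do
      # Hide Kernel's +/2 inside this module only, then define our own.
      import Kernel, except: [+: 2]

      def a + b do
        {x1, y1} = a
        {x2, y2} = b
        {Kernel.+(x1, x2), Kernel.+(y1, y2)}
      end
    end

    # The override only applies where Vec is imported (with Kernel's +/2
    # excluded there as well); everywhere else, + is still plain arithmetic.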

Dynamic typing in Elixir really isn't that big a hindrance if you stick to its idioms. That said, I'm fairly excited about the prospect of the type system.


No and sort of. While Elixir doesn't currently have types (though I understand that is coming), you can use Dialyzer to type things. It kind of sucks, though, so every project I've worked on except one has eventually turned Dialyzer off.

Pattern matching and guard clauses go a long way to ensure your types are correct, so most of the projects I've worked on have relied instead on those.


Ramp and Divvy (Bill), both in the expense management space, chose Elixir for a portion of their payment systems.


As a user, Ramp's web interface is absolute trash. I've submitted reimbursement requests only to get an error message and then have them go through later, creating duplicates.

CC transactions queue up, requiring a response and there's no way to go through them efficiently. The UI pops up a weird modal and is horrible to use.

It feels like a legacy enterprise application and it's basically a new product. I wish my company didn't use it.


(2021)

Unfortunately Brex doesn't use Elixir anymore

https://medium.com/brexeng/building-backend-services-with-ko...


"Moving to Kotlin first" is a long way from "doesn't use anymore". I suspect that they're still running a lot of Elixir code in production. This was the case more than a year after this announcement, per a conversation with one of their engineers.


Kotlin, definitely an interesting choice for backend development. But I love to see idiosyncratic language choices being made.


Having spent considerable time in both Java and Elixir, I would also choose Kotlin (or Java) over Elixir as a backend language.

I worked in Elixir for over a year, and frankly was quite disappointed with it. Elixir has a lot of good ideas, but falls short in some crucial points in execution: relying on shapes instead of types to identify results / objects, weird syntax choices (you use both <%= %> and {} to interpolate in Phoenix templates, depending on where you are), no ability to write custom guards, a lot of language concept overlap (for example behaviors / protocols should be one implementation instead of two separate ideas)... Elixir is an okay language, but I think it's just a fad, not good enough to have staying power. I think a better language written on the BEAM will come along and be the language that Elixir should have been. Just my personal opinion.


> Relying on shapes instead of types to identify results / objects

Is the issue lack of types or relying on shapes? Those can be orthogonal features, since you can have structural typing (i.e. types that are built on top of shapes). Nominal typing is often too rigid (and not the direction we are exploring in Elixir's type system).

> for example behaviors / protocols should be one implementation instead of two separate ideas

Elixir has three building blocks: processes, modules, and data. Message-passing is polymorphism for processes, behaviour is polymorphism for modules, and protocol is polymorphism for data.

The overlap comes from the fact that modules in Elixir are first-class (so they can be used as data) and that data (structs) must be defined inside modules. So while maybe those features could feel closer to each other, I don't think they could be one implementation. One idea would be to treat modules/behaviours as "nullary protocols", but that's not what "nullary typeclasses" mean in Haskell, so it is quite debatable.
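A compact, made-up sketch of the distinction, in case it helps: a behaviour is a contract a module implements, while a protocol dispatches on the data passed in.

    # Behaviour: polymorphism over modules. Modules declaring @behaviour are
    # checked against the contract at compile time.
    defmodule Codec do
      @callback encode(term()) :: binary()
    end

    defmodule TermCodec do
      @behaviour Codec

      @impl true
      def encode(term), do: :erlang.term_to_binary(term)
    end

    # Protocol: polymorphism over data. Dispatch happens at runtime on the
    # type of the first argument.
    defprotocol Size do
      def size(data)
    end

    defimpl Size, for: List do
      def size(list), do: length(list)
    end

    defimpl Size, for: Map do
      def size(map), do: map_size(map)
    end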

Do you have examples of languages that do both in one implementation? For example, my understanding is that both Haskell and Go offer "protocols" but not "behaviours". I'm not aware of a language with both as one.

I'd love to hear other concepts that overlap, it is always good food for thought!


Elixir creator just entered the chat ;)


Types are coming...

* https://news.ycombinator.com/item?id=36311875

* https://news.ycombinator.com/item?id=35766126

Re: "You use both <%= %> and {} to interpolate in phoenix templates, depending on where you are" - no need for EEx anymore for web development, just use HEEx, which has standardized on {}. You can use HEEx for LiveViews and "DeadViews" (server-side rendered Phoenix)
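A tiny, made-up example of an HEEx function component: attribute values interpolate with {}, and newer LiveView releases accept {} inside tag bodies as well.

    defmodule MyAppWeb.Badges do
      use Phoenix.Component

      def badge(assigns) do
        ~H"""
        <span class={@class}><%= @label %></span>
        """
      end
    end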


> custom guards

I don't recall 100%, but I think this is a BEAM feature that exists because they don't want to run arbitrary code as part of guards that could have side effects, delays and so on. I don't remember the specifics.


You can write custom guards out of any built-in function that can already be used in a guard. For example:

    defguard is_list_or_even_int(val) when is_list(val) or (is_integer(val) and rem(val, 2) == 0)
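And a short, made-up sketch of using such a guard from another module; the guard must be defined in, or imported into, the module whose function heads use it:

    defmodule MyGuards do
      defguard is_list_or_even_int(val) when is_list(val) or (is_integer(val) and rem(val, 2) == 0)
    end

    defmodule Handler do
      import MyGuards

      # Values that don't satisfy the guard never enter the first clause's body.
      def handle(val) when is_list_or_even_int(val), do: {:ok, val}
      def handle(_val), do: :error
    end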


Interesting. We had the exact opposite reaction. We switched from Java to Elixir at my current company and we'll never go back. It has been a two-year migration, but worth every second. We're twice as productive in Elixir, it's easier to test and deploy, and we don't have to paint IntelliJ lipstick on the language to hide all the shitty parts.


This one is quite actively developed, though it seems to be by only one person: https://gleam.run


I remember when José Valim first came up with it. I thought it might just be a fad, or a toy language to test out some ideas.

That was in 2012.


Since Elixir 1.6 it is possible to create custom guards with `defguard`: https://hexdocs.pm/elixir/Kernel.html#defguard/1


It's true that you can create custom guards, but they are still very limited, and they can only be made of a small list of allowed expressions [0].

[0]: https://hexdocs.pm/elixir/1.6.6/guards.html#list-of-allowed-...


DoorDash uses Kotlin for the backend as well. I suppose they want FE/BE to be in the same language. And when your main product is a native app, Kotlin starts to make sense.


Neither the iOS app nor the web app use Kotlin at DoorDash. Kotlin was picked for other reasons (consistency with the Android app wasn't mentioned): https://doordash.engineering/2021/05/04/migrating-from-pytho...


It's likely that they just chose Kotlin as "better Java" - many companies do this and I can understand why.



