
How Rust Lets Us Monitor 30k API calls/min - cfabianski
https://blog.bearer.sh/how-rust-lets-us-monitor-30k-api-calls-min/#.XujqagRRU9Q.hackernews
======
meritt
Sorry, I must be missing something in this blog post because the requirements
here sound incredibly minimal. You just needed an HTTP service (sitting behind
an Envoy proxy) to process a mere 500 requests/second (up to 1MB payload) and
pipe them to Kinesis? How much data preparation is happening in Rust? It
sounds like all the permission/rate-limiting/etc happens between Envoy/Redis
before it ever reaches Rust?

I know this comes across as snarky but it really worries me that contemporary
engineers think this is a feat worthy of a blog post. For example, take this
book from 2003 [1] talking about Apache + mod_perl. Page 325 [2] shows a
benchmark: "As you can see, the server was able to respond on average to 856
requests per second... and 10 milliseconds to process each request".

And just to show this isn't a NodeJS vs Rust thing, check out these
webframework benchmarks using various JS frameworks [3]. The worst performer
on there still does >500 rps while the best does 500,000.

It's 2020, the bar needs to be _much_ higher.

[1] [https://www.amazon.com/Practical-mod_perl-Stas-
Bekman/dp/059...](https://www.amazon.com/Practical-mod_perl-Stas-
Bekman/dp/0596002270)

[2]
[https://books.google.com/books?id=i3Ww_7a2Ff4C&pg=PT356&lpg=...](https://books.google.com/books?id=i3Ww_7a2Ff4C&pg=PT356&lpg=PT356)

[3]
[https://www.techempower.com/benchmarks/#section=data-r19&hw=...](https://www.techempower.com/benchmarks/#section=data-r19&hw=ph&test=db&l=zik0sf-1r)

~~~
lostcolony
They list out what is being done by the service - "It would receive the logs,
communicate with an elixir service to check customer access rights, check rate
limits using Redis, and then send the log to CloudWatch. There, it would
trigger an event to tell our processing worker to take over."

That sounds like a decent amount of work for a service, and without more
detail it's very hard to say whether or not a given level is efficient or
inefficient (we don't know exactly what was being done; we can assume that
they're using pretty small Fargate instances though since the Node one came in
at 1.5G). They also give some numbers: 4k RPM was their scale-out point for
Node (not necessarily the maximum, but the point at which they felt load was
high enough to warrant scaling out; certainly, their graph shows an average
latency > 1 second). Rewriting in Rust raised that to 30k RPM, with 100 MB of
memory, < 40ms average latency (and a far better max), and 2.5% CPU.

Given all that, it sounds like, yes, GC was the issue (both high memory and
CPU pressure), and with the Rust implementation (no GC) they're nowhere near
any CPU or memory limit, and so the 30k is likely a network bottleneck.

That said, while I agree that sounds like a terrible metric on the face of it,
with the data they've provided (and without anything else), it may simply be
that they're operationally dealing with very large amounts of traffic. They
may want to consider optimizing the network pipe; I'm not familiar enough with
Fargate, but if it's like EC2, there may be a CPU/memory sizing that also
gives you a better network connection (EC2 jumps from a 1 Gbps to a 10 Gbps
network card at certain instance types).

~~~
wahern
> That sounds like a decent amount of work for a service

5+ years ago I wrote a real-time transcoding, muxing streaming radio service
that did 5000 simultaneous connections with inline, per-client ad spot
injection (every 30 seconds in my benchmark). Using C and Lua. On 2 Xeon E3
cores--1 core for _all_ the stream transcoding, muxing, and HTTP/RTSP setup, 1
core for the Lua controller (which was mostly idle). The ceiling was handling
all the NIC IRQs.

While I think what I did was cool, I know people can eke much more performance
out of their hardware than I can. And I wasn't even trying too hard--my
emphasis is always on writing clear code and simple abstractions (though that
often translates into cache-friendly code).

At my day job, in the past two months I've seen _two_ services in "scalable"
k8s clusters fall over because the daemons were running with file descriptor
ulimits of 1024. "Highly concurrent" Go-based daemons. For all the emphasis on
scale, apparently none of the engineers had yet hit the teeny, tiny 1024
descriptor limit.

We really do need to raise our expectations a little.

I haven't written any Rust but I have recently helped someone writing a
concurrent Rust-based reverse proxy service debug their Rust code and from my
vantage point I have some serious criticisms of Tokio. Some of the decisions
are clearly premature optimization chosen by people who probably haven't
actually developed and pushed into production a process that handles 10s of
thousands of concurrent connections, single-threaded or multi-threaded. At
least not without a team of people debugging things and pushing it along. For
example, their choice of defaulting to edge-triggered instead of level-
triggered notification shows a failure to appreciate the difficulties of
managing backpressure, or debugging lost edge-triggered readiness state. These
are hard lessons to learn, but people don't often learn them because in
practice it's cheaper and easier to scale up with EC2 than it is to actually
write a solid piece of software.

~~~
lostcolony
All I'm saying is that without some example of the payloads they're managing,
and the logic they're performing, it's hard to say "this is inefficient". And,
as I mentioned, if their CPU and memory are both very low, it's likely they're
hitting a network (or, yes, OS) limit.

I've seen places hit ulimit limits...I've also seen places hit port assignment
issues, where they're calling out to a downstream that can handle thousands of
requests with a single instance, so there are two, and there aren't enough
port identifiers to support that (and the engineers are relying on code that
isn't reusing connections properly). Those are all things worth learning to do
right, agreed, and generally doing right. I'm just reluctant to call out
someone for doing something wrong unless I -know- they're doing something
wrong. The numbers don't tell the whole story.

~~~
wahern
They might not be doing anything wrong, per se. But if your expectations are
that 500/s is a lot (or even 4000/s for log ingestion), then your architecture
will reflect that.

Here's what they're doing:

> Now, when the Bearer Agent in a user's application sends log data to Bearer,
> it goes into the Envoy proxy. Envoy looks at the request and communicates
> with Redis to check things like rate limits, authorization details, and
> usage quotas. Next, the Rust application running alongside Envoy prepares
> the log data and passes it through Kinesis into an s3 bucket for storage. S3
> then triggers our worker to fetch and process the data so Elastic Search can
> index it. At this point, our users can access the data in our dashboard.

Given their goal and their problems with GC I can tell you right off the bat
probably what's the problem with their various architectures from day 1--too
much simplistic string munging. If your idea of log ingestion is using
in-language regex constructs to chop up strings into pieces, possibly wrapping
them in abstract objects, then it's predictable you're going to have GC
issues, memory bandwidth issues in general, and poor cache locality in data
and code. But 99% of the time this is how people approach the issue.

What a problem like this cries out for is a streaming DFA architecture, using
something like Ragel so you can operate on streams and output flat data
structures. You could probably implement most of the application logic and I/O
in your scripting language of choice, unoptimized GC and all, so long as
you're not chopping up a gazillion log lines into a gazillion^2 strings. The
latter approach will cause you grief in any language, whether it's JavaScript,
Java, Go, Rust or C. The number of objects per connection should be, and can
be, a small N. For example, at 10 distinct objects (incoming connection
object, log line, data structure with decomposed metadata, output connection
object, etc.) per connection times 500 connections per second, that's 5000
objects per second. Even Python's and Ruby's GC wouldn't break a sweat
handling that, even though internally it'd be closer to 10 * (2 or 3) objects.

Here's a big problem today: nobody writes their own HTTP library or JSON
library; everybody uses the most popular ones. So right off the bat every
ingestion call is going to generate hundreds or thousands of objects because
popular third-party libraries generally suck in each request and explode it
into huge, deeply nested data structures. Even in Rust. You can't optimize
that inefficiency away. No amount of fearless concurrency, transactional
memory, fastest-in-the-world hashing library, or coolest regular expression
engine can even begin to compensate. You have to _avoid_ it from day 1. But if
your expectations about what's possible are wrong (including how tractable it
is with some experience), it won't even occur to you that you can do better.
Instead, you'll just recapitulate the same architectural sins in the next
fastest language.

~~~
lostcolony
"I can tell you right off the bat _probably_ what's the problem"

Emphasis added. I don't disagree with you that they may be doing something
inefficient; I'm just saying, I don't -know- what they're doing, so I'm
disinclined to judge it.

I do know that, again, in Rust, whatever bottleneck they're hitting is neither
CPU nor memory, despite the seemingly low throughput, which does imply that
what you're proposing isn't the bottleneck in that implementation.

------
akoutmos
Great article and thanks for sharing! There are a couple of things that stand
out at me as possible architecture smells (hopefully this comes across as
positive constructive criticism :)).

As someone who has been developing on the BEAM for a long time now, it sticks
out like a sore thumb any time I see Elixir/Erlang paired with Redis. Not that
there's anything wrong with Redis, but most of the time you can save yourself
the additional ops dependency and network hop by bringing that state into your
application (BEAM languages excel at writing stateful applications).

In the article you write that you were using Redis for rate limit checks. You
could easily have bundled that validation into the Elixir application and had,
for example, a single GenServer running per customer that performs the rate
limiting validation (I actually wrote a blog post on this using the leaky
bucket and token bucket algorithms:
[https://akoutmos.com/post/rate-limiting-with-genservers/](https://akoutmos.com/post/rate-limiting-with-genservers/)).
Paired with hot code deployments, you would not lose rate limit values across
application deployments.
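The linked post implements this with per-customer GenServers in Elixir; the
token bucket algorithm itself is small enough to sketch here in Rust, the
thread's language (capacity and refill rate below are made-up numbers):

```rust
use std::time::Instant;

/// Minimal token bucket: holds up to `capacity` tokens,
/// refilled continuously at `rate` tokens per second.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    rate: f64,
    last: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, rate: f64) -> Self {
        Self { capacity, tokens: capacity, rate, last: Instant::now() }
    }

    /// Take one token; returns false when the caller is rate limited.
    fn try_acquire(&mut self) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last).as_secs_f64();
        self.last = now;
        // Refill based on elapsed time, capped at capacity.
        self.tokens = (self.tokens + elapsed * self.rate).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    // A 2-token bucket refilling at 1 token/sec: the third
    // back-to-back call is rejected.
    let mut b = TokenBucket::new(2.0, 1.0);
    assert!(b.try_acquire());
    assert!(b.try_acquire());
    assert!(!b.try_acquire());
    println!("rate limiter ok");
}
```

Run one of these per customer (behind a GenServer, a mutex, or an actor) and
you have the same per-customer limiting without the Redis round trip.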

I would be curious to see how much more mileage you could have gotten with
that, given that the Node application would not have had to make network calls
to the Elixir service and Redis.

Just wanted to share that little tidbit as it is something that I see quite
often with people new to the BEAM :). Thanks again for sharing!

~~~
eggsnbacon1
I would push rate limiting to the load balancer, HAProxy or Nginx, but that's
just me. If you have a round-robin LB in front you just set each instance to
limit at 1/nodes rate, that way you don't have to share any state.

If you're load balancing on IP hash you can set each instance to limit at full
rate and not worry about it.

Shared state in rate limiting becomes a bottleneck very quickly. If you're
trying to mitigate spam/DDoS you could easily get 100,000 requests a second.
You're going to max out your shared-state DB way before you saturate your
10-gig lines.

~~~
akoutmos
That is definitely a valid route to go so long as your rate limiting is not
dependent on much business logic. If rate limiting is per user or per user per
instance/service, I would personally bring that kind of concern into the
application where it is closer to the persistence layer where those things are
defined (and again handling the business logic inside per customer
GenServers).

I've never used this product, so this is just speculation. But I imagine there
is some sort of auth token that valid agents send to tell Bearer the request
is legitimate, so that invalid requests can be trivially rejected to mitigate
a DoS/DDoS to an extent.

------
didroe
I'm one of the engineers that worked on this. It was the first Rust production
app code I've written so it was a really fun project.

~~~
mamcx
One of the interesting effects of using Rust is saving money! I also migrated
an F#/.NET ecommerce backend, and it runs on less RAM/CPU, which makes my
bills lower.

~~~
raphinou
Can you share information about your experience? I'm currently working on a F#
project, enjoying the functional approach, while having a lot of libraries
available on the .Net platform. The |> operator is one I use all over the
code, but Rust doesn't support custom operators. Is that annoying, or not at
all? Is your code less functional and more imperative style due to Rust?

~~~
stuartd
> Is your code less functional and more imperative style due to Rust?

I would imagine so. Rust doesn't support tail call optimization, and variables
are only immutable by default (mutability is opt-in).

~~~
zozbot234
LLVM should optimize tailcalls and sibcalls. But tail call optimization has
unexpected interactions with the extended RAII that Rust uses because stuff
has to be dropped at the end of its lifetime, so the code that's running in
"tail" position is sometimes not what you expect.
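A minimal illustration of that interaction (names are made up): the guard
below must be dropped _after_ the syntactic "tail" call returns, so the call
is not actually in tail position and the stack frame cannot be reused:

```rust
use std::sync::Mutex;

// Global event log so we can observe the order of drops.
static EVENTS: Mutex<Vec<String>> = Mutex::new(Vec::new());

// A type with a destructor, standing in for any value with drop glue.
struct Guard(u32);
impl Drop for Guard {
    fn drop(&mut self) {
        EVENTS.lock().unwrap().push(format!("drop {}", self.0));
    }
}

fn count_down(n: u32) {
    let _g = Guard(n);
    EVENTS.lock().unwrap().push(format!("enter {}", n));
    if n == 0 {
        return;
    }
    // Syntactically a tail call, but `_g` is dropped only after it
    // returns, so the frame must stay alive: no TCO.
    count_down(n - 1)
}

fn run(depth: u32) -> Vec<String> {
    EVENTS.lock().unwrap().clear();
    count_down(depth);
    EVENTS.lock().unwrap().clone()
}

fn main() {
    let events = run(2);
    // All drops happen after the innermost call returns, in reverse order.
    assert_eq!(events, ["enter 2", "enter 1", "enter 0",
                        "drop 0", "drop 1", "drop 2"]);
    println!("{:?}", events);
}
```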

~~~
thaneross
As a beginner to Rust I'm surprised by this. Given that the Rust compiler is
able to figure out the lifetimes in the recursive case, you'd think the
lifetimes within the tail-call-optimized loop would be the same. Doesn't the
lexical scoping of the loop's body have the equivalent lifecycle of a
recursive call (drop at the end of the loop vs. the end of the function)?

------
foxknox
500 requests a second.

~~~
jvehent
Which any programming language can handle easily. The architecture here is
more interesting than the language choice.

------
cybervasi
GC of 500 requests/s could not possibly have caused a performance issue. Most
likely the problem was due to JS code holding on to the 1MB requests for the
duration of the asynchronous Kinesis request, or a bug in the Kinesis JS
library itself. With a timeout of 2 minutes, you may end up with up to 30K/min
x 2min x 1MB = 60GB of RAM used. GC would appear to run hot during this time,
but only because it has to scrape up memory somewhere while up to 60GB is in
use.

------
eggsnbacon1
They didn't mention Java as a possible solution, even though its GCs are far
better than anything else out there. I have nothing against Rust but if I was
at a startup I would save my innovation points for where they're mandatory.

~~~
JoshTriplett
> I have nothing against Rust but if I was at a startup I would save my
> innovation points for where they're mandatory

An article published today, addressing that exact point:
[https://tim.mcnamara.nz/post/621040767010504704/spend-
your-n...](https://tim.mcnamara.nz/post/621040767010504704/spend-your-novelty-
budget-on-rust)

~~~
andrewzah
Ah yes, a 236-word "article" that says to choose "boring old technology" and
also to use Rust in the same breath.

This article should mention that rust isn't close to ready when it comes to
web backends. As much as I love Rust, if I were running a startup or even a
decently sized company I would always choose Rails. Now -that- is boring,
old... and mature technology. Certain components could get rewritten in Rust,
but there's no reason to ignore a mature ecosystem from the start.

~~~
zozbot234
> This article should mention that rust isn't close to ready when it comes to
> web backends.

Actix-web works just fine. They got a new maintainer team involved that has
been spending some time getting rid of all the insane unsafety that was in the
code before.

~~~
andrewzah
Yes, it "works". However, deciding to use Actix/Warp means throwing away years
and years of work in the Rails and Ruby world.

Rails is mature, robust, and has a huge ecosystem with RubyGems. Rust (when it
comes to web stuff) is not. "It works" does not pass my litmus test. With
Actix/Warp I have to implement by hand things that either come by default with
Rails or already exist in a gem.

I like Rust but I'm not a zealot. People way overestimate performance when
they barely have any traffic to begin with as a small startup or even a medium
sized company.

You could even use rails, and use Rust to write ruby gems instead of going
with actix/warp/etc.

> insane unsafety

This was overblown. Yes, the author didn't respond appropriately, but unsafe
isn't inherently dangerous. This is a stupid misconception within the Rust
community and caused a lot of unnecessary drama around Actix.

------
DevKoala
There are a couple of things I see in this post that I wouldn't do at all, and
I maintain a couple of services with orders of magnitude higher QPS. I feel
that replacing Node.js with any compiled language would have had the same
positive effect.

~~~
ecoqba11
Totally!

------
newobj
500 qps. I think the more interesting story here is which language/framework
COULDN'T do this, rather than which one could.

------
trimbo
> After some more research, we appeared to be another victim of a memory leak
> in the AWS Javascript SDK.

Did you try using the Kinesis REST API directly:
[https://docs.aws.amazon.com/kinesis/latest/APIReference/API_...](https://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecord.html)

------
qrczeno
That was a real issue we were struggling to solve. Feels like Rust was the
right tool for the right job.

------
hobbescotch
Having never dealt with issues relating to garbage collection before, how do
you go about diagnosing GC issues in a language where that’s all handled for
you?

~~~
013a
There are some general tricks that are language-agnostic, like allocating a
huge "buffer" object when the app starts, the size of which is some
significant portion of the memory you allow the process to use, which always
has a reference, then storing references to other objects you need in that big
object. In other words, circumvent the garbage collector.

Of course, this has its own issues, but I've seen it done in e.g. Go before.
It's likely you'll inevitably end up with leaks, but if your service is
fungible and can tolerate restarts, what you're essentially doing is turning
the "GC pause" into a "container restart" pause, which may be slower but
happens less often. Some languages have ways to manually invoke the GC (Node
is not one of them, afaik).
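A minimal sketch of that preallocation pattern, here in Rust for consistency
with the thread (where it sidesteps the allocator rather than a GC; all names
are made up): grab one big buffer up front, hand out slices from it, and
"restart" instead of freeing individual objects:

```rust
/// Preallocate one big arena and hand out slices from it,
/// instead of allocating per request.
struct Arena {
    buf: Vec<u8>,
    used: usize,
}

impl Arena {
    fn with_capacity(cap: usize) -> Self {
        Self { buf: vec![0; cap], used: 0 }
    }

    /// Hand out the next `n` bytes, or None when the arena is full.
    fn alloc(&mut self, n: usize) -> Option<&mut [u8]> {
        if self.used + n > self.buf.len() {
            return None;
        }
        let start = self.used;
        self.used += n;
        Some(&mut self.buf[start..start + n])
    }

    /// "Restart" the arena in one step instead of freeing piecemeal --
    /// the analogue of the container-restart-instead-of-GC-pause tradeoff.
    fn reset(&mut self) {
        self.used = 0;
    }
}

fn main() {
    let mut arena = Arena::with_capacity(1024);
    let a = arena.alloc(512).unwrap();
    a[0] = 42;
    assert!(arena.alloc(600).is_none()); // full: time to "restart"
    arena.reset();
    assert!(arena.alloc(600).is_some());
    println!("arena ok");
}
```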

~~~
eggsnbacon1
> There are some general tricks that are language-agnostic, like allocating a
> huge "buffer" object when the app starts

I've only seen this done in Go :)

~~~
milesvp
It's a fairly common thread in game development circles. Game development is
one of the few places with constraints big enough that it's worth doing your
own memory management in languages that have garbage collectors. Elsewhere it
often makes sense to just architect around GC pauses, since you're going to
want redundancy and load balancing anyway.

------
zerubeus
Feels like an HN post being upvoted just because it contains Rust in the title
(after reading the article)...

