Our ActiveMQ brokers running on a single-core, 1GB DigitalOcean node easily push 5k messages/s through ~200 queues and another several thousand/s through a couple hundred topics. They do hit a cliff edge on a single core, though. Dual cores pretty much quadruple the throughput, to the point where SSD latency becomes the problem.
Our Tomcat servers don't see 15k requests/s, but our load testing has shown a pair of them is good up to 3k/s (each) on dual-core 2gb servers with 512mb heaps. We really didn't push it much farther because that exceeds our needed capacity.
This is pretty boring technology. I'm fairly certain we could hit 15k req/s per node with plain Tomcat if we actually tried. I do think that if you wanted to stretch past 15k req/s you'd probably need more exotic programming models like Vert.x, but that comes at the cost of complexity.
I think the ultimate in throughput would be to have Tomcat listen on a unix domain socket (this is actually supported in Tomcat: https://github.com/apache/tomcat/commit/a616bf385a350175a33a...) and have TCP terminated by HAProxy (https://bz.apache.org/bugzilla/show_bug.cgi?id=57830). We set this up and it is _really really really_ fast, but we'd be on our own for support since nobody else is really doing it. Tomcat also has some strange behavior with the socket file: you kinda have to manage it yourself, which is odd, but it could be done within the startup scripts.
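For the curious, a hedged sketch of that wiring; the paths, port, and version note (UDS support landed in Tomcat 10.x and needs Java 16+) are assumptions, not the actual config we ran:

    <!-- server.xml: bind Tomcat's NIO connector to a unix domain socket -->
    <Connector protocol="org.apache.coyote.http11.Http11NioProtocol"
               unixDomainSocketPath="/var/run/tomcat/http.sock"
               maxThreads="200" />

    # haproxy.cfg: terminate TCP here and forward to the socket
    frontend fe_http
        bind :80
        mode http
        default_backend be_tomcat

    backend be_tomcat
        mode http
        server tomcat1 unix@/var/run/tomcat/http.sock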
Terminating TCP with any kind of reverse proxy and forwarding over a local TCP port to Tomcat would get you most of the benefit with very little configuration, and it is a fairly standard setup.
It plays to the strengths of both, because the reverse proxy handles long-lived but slow connections very well, and the Tomcat server gets to have fewer, but hotter, threads.
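A sketch of that standard variant, which only changes the HAProxy server line relative to the UDS config above (the port is illustrative):

    backend be_tomcat
        mode http
        server tomcat1 127.0.0.1:8080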
That's actually our prod setup; we do that now and it's pretty robust. As a bonus it gives you a nice "wiretap point" where you can use tcpdump for debugging.
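Hypothetically, assuming Tomcat sits behind the proxy on local port 8080, the wiretap is as simple as:

    # watch the proxied HTTP traffic on the loopback interface
    tcpdump -i lo -A 'tcp port 8080'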
In our limited tests with HAProxy -unixDomainSocket-> Tomcat, we saw a significant increase in throughput and decrease in latency (when it worked). It got rid of our wiretap point, though, which is pretty handy to have, and we didn't _really_ need the performance increase. If I were to speculate: HAProxy is a rocket, it seems to beat even Tomcat's mature codebase at parsing HTTP, and it probably takes advantage of kernel features that haven't made their way into Tomcat yet. It wasn't prod-worthy stable; sometimes execution threads seemed to hang.
Multiple applications, not just a single one, powering call centers; not sexy SV stuff. Queues are used for two patterns: reliable data transport that we expect to be near-realtime, and reliable event sourcing. The topics are mainly used for non-reliable event signaling... for instance cache eviction (hey, this row was updated; if you have it in your local cache, discard it or you'll get a conflict if you try to commit to the db). They also form the backbone for websocket broadcasting (sending messages to all of a user's app windows). Messages are all text/json or text/xml; I'd say 512b-8kb with an average around 1k. Data transmission messages are on the large end of that; signalling messages are on the small end.
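To make the signalling pattern concrete, here's a minimal hedged sketch of publishing a cache-eviction message to an ActiveMQ topic; the broker URL, topic name, and payload shape are all hypothetical, not our actual setup:

    import javax.jms.*;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class CacheEvictionPublisher {
        public static void main(String[] args) throws JMSException {
            ConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616");
            Connection conn = factory.createConnection();
            try {
                Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
                Topic topic = session.createTopic("cache.eviction");
                MessageProducer producer = session.createProducer(topic);
                // Non-persistent: this is best-effort signalling, not reliable transport.
                producer.setDeliveryMode(DeliveryMode.NON_PERSISTENT);
                producer.send(session.createTextMessage(
                    "{\"table\":\"accounts\",\"id\":42}"));
            } finally {
                conn.close();
            }
        }
    }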
Interesting application, thanks for the background; it helps put your original comment in a bit more perspective. As for it not being 'sexy SV stuff': most important automation projects aren't, and they tend to be far more interesting than the stuff that is supposedly sexy. I've helped build a system that routes a sizeable fraction of all container and bulk freight around the world, and looked at from the outside it is probably pretty boring, but the guts of it were fascinating. Not to mention mission-critical to a degree that very few other things (healthcare, avionics) are. Telco has a ton of interesting challenges like that.
I remember a consulting gig where they were pushing 50k requests per second with PHP on a 2018 laptop by using this framework: https://github.com/walkor/workerman
In a real application, with database connection pooling and auth sessions, it went down to 15k requests/s.
And that was PHP7. PHP8 introduced JIT so it's probably significantly faster these days and hopefully fully typed.
Using Elixir for a similar problem and it scaled well past 15K without any real intervention on my part, and it was a similarly tiny instance.
Not that they shouldn't have used Java; I understand their application was already written in Java and it wouldn't make sense to use an unfamiliar language, but I do wonder about Java's efficiency here.
15k rps is on the low side for Java, so I found the post to be a little strange. The Netty library can handle 10x that without batting an eyelid. We do much more than 15k/s on API nodes with hardly any attention paid to optimisation around traffic.
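For scale, a bare-bones Netty HTTP responder is roughly this much code (the port and response body are illustrative, not our API nodes):

    import io.netty.bootstrap.ServerBootstrap;
    import io.netty.buffer.Unpooled;
    import io.netty.channel.*;
    import io.netty.channel.nio.NioEventLoopGroup;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.channel.socket.nio.NioServerSocketChannel;
    import io.netty.handler.codec.http.*;
    import io.netty.util.CharsetUtil;

    public class PlaintextServer {
        public static void main(String[] args) throws InterruptedException {
            EventLoopGroup boss = new NioEventLoopGroup(1);
            EventLoopGroup workers = new NioEventLoopGroup(); // defaults to 2 * cores
            try {
                ServerBootstrap b = new ServerBootstrap()
                    .group(boss, workers)
                    .channel(NioServerSocketChannel.class)
                    .childHandler(new ChannelInitializer<SocketChannel>() {
                        @Override protected void initChannel(SocketChannel ch) {
                            ch.pipeline().addLast(new HttpServerCodec());
                            ch.pipeline().addLast(new SimpleChannelInboundHandler<HttpRequest>() {
                                @Override protected void channelRead0(ChannelHandlerContext ctx, HttpRequest req) {
                                    FullHttpResponse res = new DefaultFullHttpResponse(
                                        HttpVersion.HTTP_1_1, HttpResponseStatus.OK,
                                        Unpooled.copiedBuffer("ok", CharsetUtil.UTF_8));
                                    res.headers().set(HttpHeaderNames.CONTENT_LENGTH,
                                        res.content().readableBytes());
                                    ctx.writeAndFlush(res); // never block in here
                                }
                            });
                        }
                    });
                b.bind(8080).sync().channel().closeFuture().sync();
            } finally {
                boss.shutdownGracefully();
                workers.shutdownGracefully();
            }
        }
    }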
These benchmarks are a little pointless. 15k/s is amazing, average, or poor depending on what actual work the server does. If it's just a glorified CDN, as in many benchmarks, then it's not that great(?)
We did 10k rps almost 20 years ago on a pair of Sun x4100 servers: dual-core Opterons @ 2.4GHz and Java 1.4.
And that was with storing data to spinning rust (async) and business logic using custom scripting language.
Most of the ability to get throughput comes from not blocking a request on network or other IO, and from not doing writes in-band with the request.
I think it's less about the web server choice and more about the architecture around pre-computing and caching certain results in memory and then asynchronously writing via a queue.
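A toy sketch of that shape (all names illustrative): the request path touches only memory, and a background thread drains writes off the hot path:

    import java.util.Map;
    import java.util.concurrent.*;

    public class AsyncWriteThrough {
        private final Map<String, String> cache = new ConcurrentHashMap<>();
        private final BlockingQueue<String> writeQueue = new LinkedBlockingQueue<>();

        public AsyncWriteThrough() {
            Thread drainer = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        persist(writeQueue.take()); // slow I/O happens here, off the request path
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            drainer.setDaemon(true);
            drainer.start();
        }

        // Request path: memory-speed read, enqueue-and-return write.
        public String read(String key)          { return cache.get(key); }
        public void write(String key, String v) { cache.put(key, v); writeQueue.add(v); }

        private void persist(String value) { /* hypothetical DB write */ }
    }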
For certain web servers that utilize an event loop instead of threading, it becomes very easy to accidentally block the event loop with synchronous work, so the choice of server does matter.
Yes, it's easy to block the event loop. I agree that it matters, in that you're much more likely (or even only able) to hit this throughput by using an event loop runtime.
In other words, with only 32/64 concurrent threads you won't get the throughput you want, so agreed there.
However, I'll elaborate on my point and why it's not just "use event loop".
If you block the event loop you can get catastrophic behavior, like your entire server acting as if it were a single-threaded synchronous runtime.
However, even without blocking the event loop, you need to have optimal IO __given a coroutine__. If you serially wait a bunch of ms on multiple network calls plus a db write within a coroutine, your latency is going to go up and you might not hit your latency requirements.
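To illustrate with a hedged Java sketch (fetchUser/fetchOrders are hypothetical stand-ins for those network calls): awaiting them one after another costs the sum of their latencies; fanning out costs roughly the max.

    import java.util.concurrent.CompletableFuture;

    public class FanOut {
        static String fetchUser(int id)   { return "user:" + id; }   // pretend network call
        static String fetchOrders(int id) { return "orders:" + id; } // pretend network call

        public static void main(String[] args) {
            int id = 42;
            // Issue both calls concurrently on the common pool, then combine.
            CompletableFuture<String> user   = CompletableFuture.supplyAsync(() -> fetchUser(id));
            CompletableFuture<String> orders = CompletableFuture.supplyAsync(() -> fetchOrders(id));
            System.out.println(user.thenCombine(orders, (u, o) -> u + " " + o).join());
        }
    }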
I was specifying that the architecture was taking these concepts into account and getting low latency on each request in a high throughput way by not blocking on writes, precaching info in server memory, etc.
If they had to go to a db and/or do other calculations per query instead of pre-caching, then regardless of 32/64 threads or an event loop, something would likely have bottlenecked, either in the web server or behind it, and they would not have hit their RPS.
You can maintain high throughput while performing in-band I/O as long as the requests are pipelined and the I/O is batched. What gets affected is the request latency.
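A hedged sketch of that trade (persistBatch is a hypothetical single multi-row write): each request only enqueues, and the drainer turns whatever has piled up into one round trip.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class BatchWriter {
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // Request path: enqueue and return; latency moves into the batch cycle.
        public void submit(String row) { queue.add(row); }

        // Background loop: block for one item, then drain whatever else has accumulated.
        void drainLoop() throws InterruptedException {
            List<String> batch = new ArrayList<>();
            while (true) {
                batch.add(queue.take());
                queue.drainTo(batch, 999); // at most 1000 rows per round trip
                persistBatch(batch);       // e.g. one multi-row INSERT
                batch.clear();
            }
        }

        private void persistBatch(List<String> batch) { /* hypothetical DB write */ }
    }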
However, in practice it's more around 4-6k because it's doing database operations. Some endpoints are only 1500 rps, while some of my read ops are in the 10-25k range.
This was an impressive metric 20 years ago. Today you can achieve 15k/sec on a single $10/mo VPS running any number of Java, Go, JS, PHP, or Rust apps.
I run a micro-SaaS with 34 customers, with about 300 users online at any given moment at peak.
It's written in php, running on a $4 a month shared hosting from namecheap.
Everyone says I'll need to move it over somewhere better if it grows. How soon? Even people with experience seem to have trouble explaining when. It always fascinates me and makes me wonder how many small webapps are on AWS for no real reason.
Personally, I need cPanel and phpMyAdmin because I don't really know anything else. I still FTP files up manually from my Windows computer. Going to a real server seems like it would make things 10x more complicated, and no one seems to be able to predict at what point I'll need to.
Going to a "real" server could just mean switching it up, like a $3-5 vps, sftp instead of ftp, ...? and then you can scale by just scaling up the vps. Your solution is quite good.
One of the core offerings in AWS is EC2, which is just shared (up to dedicated) virtual machines that start at a few dollars a month. I'm in no way suggesting a change to your setup, only pointing out that AWS isn't some mythical complex ecosystem.
Is namecheap running PHP8? That upgrade alone will net you a massive performance improvement over 5.x
They are, and they've had up-to-date versions of PHP basically right as they become available, and you can switch easily between versions, even down to differentiating between 7.0 and 7.1 in case someone really needs a specific version for some reason.
No doubt AWS has options that could work for me at a similar price point. But I don't think it would be at all easy from what I've seen, and I'd have zero support. Compared to namecheap which basically helps me do anything I need via live help with actually useful people in my experience.
I once upgraded from a $2.50 a month server to the $4 server for a few other benefits I needed, and performance actually went DOWN at that server IP. I asked them to move me to a different one, just based on my "trust me bro, this is slower than before" and they did move me and performance returned.
There's quite a few Java frameworks and libraries that will need to re-evaluate their existence (or at minimum, their core APIs) when green threads become popular.
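For anyone who hasn't looked yet, the Loom model is deliberately boring: a plain thread-per-task executor. A sketch, assuming a JDK with virtual threads enabled (preview in 19/20, behind --enable-preview):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class VirtualThreadDemo {
        public static void main(String[] args) {
            // One cheap virtual thread per task; blocking calls park the virtual
            // thread without tying up an OS thread.
            try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
                for (int i = 0; i < 10_000; i++) {
                    int n = i;
                    pool.submit(() -> handle(n)); // blocking I/O is fine in here
                }
            } // close() waits for submitted tasks to finish
        }
        static void handle(int n) { /* hypothetical blocking request handler */ }
    }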
Apologies for replying to myself, but Netty, which underpins many of the popular Java backend frameworks, sees backward compatibility as more important than supporting green threads.
Just curious -- have you personally confirmed this?
Not trying to cast doubt... just noting that I have exactly the same feelings (hopes?) but I have yet to personally confirm it, in large part because it's not yet fully GA so we haven't jumped in with our production workloads.
Not OP, but as someone who used Spring quite extensively (6+ yrs) in the past and uses Quarkus for all projects these days, DX is much better with Quarkus.
Hot reloading (to the extent that change-save-refresh feels like working on a Python/Ruby project), shallow stack traces, less 'magic', excellent documentation, a plethora of modules, a smaller memory footprint, millisecond starts/restarts (owing to compile-time wiring), first-class container support, dev tools, and always-running tests are a few of the things that make the DX with Quarkus amazing.
Virtual Threads are still in the preview phase. People will wait for a stable release, or at least for someone to try them in prod and give a lightning talk or publish a paper. Seems a little premature now.
> To maintain consistency with the current infrastructure, we continued utilizing the existing setup, which includes a Kubernetes cluster on AWS EC2 instances.
Ironically, this would likely be too expensive to deploy on AWS serverless infrastructure (AWS Lambda, API Gateway, etc.) even though it touts endless scalability as its main feature.
I wouldn't be so quick to say that. I've spent the better part of my life writing C code, lots of it networked. And while you can get absolutely awesome performance out of it on modest hardware, the JVM is no slouch at running networking stuff; I wouldn't be surprised if it is roughly on par with optimized C for that kind of application. Usually it's the kernel side that will be your bottleneck, and it won't make much of a difference what the application is written in once you are that close to the theoretical max.
Writing it in C isn't some magic bullet. With Java you spend your time optimizing the tail end of the effort; with C you spend your time optimizing the bulk of the effort, which Java gives you for free. (Not that I'm defending Java, but C being low level doesn't make it instantly easy to do faster stuff.)
I think these comparisons are a bit silly; it really depends on what you are doing for each request. Are you just proxying, message brokering, or querying a database? It makes a lot of difference.