It's well known that Go optimizes for latency at the expense of throughput, and it doesn't give you the option to change that trade-off if your requirements change. That flexibility is the power of the JVM. In any case, the JVM now ships with ZGC, a low-latency GC, and it may be worth running the benchmark with it. There's also another low-latency GC in the works called Shenandoah.
Thanks for the info. It makes sense in that case. Though the JVM can be very performant when the code is written properly, especially now with ZGC, Valhalla, Panama, etc.
ah yeah no doubt. I've read the recent GC code in the JVM tree. it's beautiful. We just had a ton of years of experience with C++ and it was a good fit for seastar.io - this all came from me messing around with https://github.com/smfrpc/smf and seeing what one could do w/ DMA (no kernel page cache)
Not really; after all, most native code compilers have endless amounts of configuration options as well, while Go still falls behind in many use cases.
Also, I've started to see a trend of books and blog posts about how to write Go code for better performance, so it isn't a given that it excels at performance out of the box.
All of which comes back to the original point: many times it isn't the language, but rather how the code is written and what tools one makes use of.
If Go were so much better than Java, Google would have already replaced Java with Go on Android (battery life and such). Instead they went with a mix of AOT/JIT with PGO and introduced Kotlin, while the gomobile efforts were never given any serious consideration, not even for the NDK.
Go's mission statement isn't to be faster than Java or to replace Java. It's to (1) compile faster, (2) compile into single binaries with no dynamic links, and (3) natively support concurrency and parallelism with M:N goroutines:threads.
Go will never replace Java for Android because (1) it doesn't use a VM, so it would need to compile for every arch that Android runs on, and (2) it would require a bug-compatible port of Android. Because Kotlin runs on the JVM and can run on top of Java codebases, it doesn't have to leap over these hurdles.
There are other reasons you can come up with I’m sure, but those are the biggest that stick out to me. Notably it has nothing to do with “which language is better”.
They optimize for different use cases, and that’s OK. That being said, if you were writing Android from scratch and were only targeting ARM (an equally silly comparison meant to highlight the differences in the languages), you’d be hard pressed to justify Java over Go.
I honestly have yet to see any evidence that Fuchsia is anything but an experiment in this respect, like Dart was (which was also going to "replace android").
I entirely admit the possibility, but I see no plan. Hopes and aspirations are not plans.
> That being said, if you were writing Android from scratch and were only targeting ARM (an equally silly comparison meant to highlight the differences in the languages), you’d be hard pressed to justify Java over Go.
Instead they went with Rust, C++ and Dart.
Now, what do all those languages have in common that Go lacks?
This isn't really true for non-trivial code bases. And even if it is, it's not a big difference in practice, especially with incremental compilation, where I find it actually much quicker to change a couple of files and re-run unit tests than with Go, which has to spit out a multi-dozen-MB binary each time, taking 5-6+ seconds.
> (2) compile into single binaries with no dynamic links
Java is getting AOT compilation which will do the same.
> (3) natively support concurrency and parallelism with M:N routines:threads.
The fact that a tool works out of the box and offers you an extensive array of ways to get more out of it is not in any way worse than a tool that just works out of the box. You can use it in exactly the same way with no more mental effort - stick to the defaults.
Which one is easier? Adding a few lines of GC configuration or rewriting a service in C++ using Seastar and binding processes to cores?
What sort of performance measurement uses default configurations? What is even the point of not tuning the GC to your application workload?
I have spent the last 15 years running Java apps in production, and optimizing for p99 latency is really not that hard. Optimizing p99.99999 is a whole different subject though. I don't understand the point of comparing a default GC setting meant for hello-world applications to software that is optimized in every way possible. Apples to oranges. It is a great marketing gimmick though. Look ma, no performance! Look here, so much faster. We live in a single-dimension world, yay!
Easier for who? At some point, the architecture needs to change to overcome a fundamental limit. Developers take on that work to make the performance and operations easier for their users.
Seastar is the foundation of Scylla, which shows that rewriting in C++ can deliver orders of magnitude more performance than is possible by just tuning Cassandra on the JVM. In fact, Datastax has now copied the Scylla approach in Cassandra but still lags behind drastically in performance.
While there is no significant latency difference at the lower percentile tiers, where Scylla really shines is TCO. Some companies trade SRE time for license cost; other companies tune GC.
Yes, you get 10x the throughput while maintaining far better tail latency with Scylla over Cassandra.
How is 10ms vs 475ms not a major improvement? How is 4 nodes vs 40 not a major improvement? If you're an SRE, then how is managing 4 servers with far less tuning and maintenance not a major improvement? Also, the 99.9th percentile still matters. They're testing at 300k ops/sec, which means 300 ops/sec are facing extreme latency spikes - enough to fail and/or cause cascading issues throughout the application.
There's no metric where Cassandra is better here and you can't tune your way to the same performance in the first place which is the whole point of Scylla. What even is your claim here? Spend more to get less?
2-3x maybe, 10x not likely, unless you compare it to untuned cassandra.
What makes it possible to run cassandra/scylla on nodes with TBs of data density is the TWCS compaction strategy from Jeff Jirsa. He was just a cassandra power user at the time, and I like to think that the invention was possible because of Java.
So, next time you read an ad piece from scylla about replacing 40 mid size boxes running CMS with 4 big boxes, don't forget about TWCS.
It's 4 servers doing the same as 40. That is 10x throughput, and with lower tail latency.
Scylla is far more than a compaction strategy. If it were that simple, then Cassandra would already be able to do it.
It's an objectively faster database in every metric. Datastax's enterprise distribution has more functionality but core Cassandra is now entirely outclassed by Scylla in speed and features.
Again, that's 40 mid size boxes with questionable GC (32G to 48G heap size is the no man land, G1GC target pause time of 500ms of course will result in 500ms P999 latency, etc), versus 4 big boxes, which have more than 4x higher specs. So 10 divides by 4, that's the 2-3x that I mentioned.
TWCS is just a good example that raw performance is not everything. Performance also comes from things like compaction strategy, data modeling, and access patterns, while users and stakeholders also care about things like ease of modification, a friendly license, and steady stewardship.
How does Scylla relate to Kafka and ZooKeeper here? I know a bunch of ad-tech companies that built their entire stack around the JVM and Scala. Those companies need to perform a bid (a whole bunch of logic) within 100 ms (including networking), otherwise they get a financial penalty.
Please stop being a kid who posts his opinion everywhere he sees a JVM product. Kafka and Pulsar are two successful projects no matter what you think.
Kafka relies on a somewhat outdated architecture but has the most vibrant ecosystem. Those are the major pros and cons - not the language or the VM.
Redpanda uses the Seastar framework, which was created by the ScyllaDB project. Scylla is a high-performance C++ reimplementation of Cassandra, and Redpanda seems to be chasing the same thing as an alternative to Kafka/JVM.
As a 12 year adtech veteran who has built ad networks from scratch 3 times, low-latency and high-throughput are critical to ad serving infrastructure and that's why Scylla is such a better alternative to Cassandra. The only other database that gets close is Aerospike, and possibly Redis Enterprise with Flash persistence. It's entirely valid to want similar improvements for event streams as well, and as long as they keep the same external API then you don't lose any of the ecosystem advantages either.
Ironically, Gil Tene uses the talk to sell you the Azul C4 pauseless GC, which would easily invalidate your 10x-lower-p99-latency claim. Of course we also have ZGC and Shenandoah these days.
Seastar is a fundamentally different way of programming from what you mentioned above. Let me give you an example. Seastar takes all the memory up front and never gives it back to the operating system (you can control how much via -m2G, etc.). This gives you deterministic allocation latency - it's just incrementing a couple of pointers. Memory is split evenly across the number of cores, and the way you communicate between cores is message passing - which means you explicitly say which thread is allowed to read which inbox (similar to actors) - I wrote about it here in 2017: https://www.alexgallego.org/concurrency/smf/2017/12/16/futur...
The point of Seastar is to not tune the GC for each application workload, so to bring that up means you missed the whole point of Seastar. Instead, the programmer explicitly reserves memory units for each subsystem - say 30% for the RPC, 20% for the app-specific page cache (since it's all DMA, no kernel page cache), 20% for write-behinds, etc. (obviously in practice most of this is dynamic). It is not one dimension as suggested and not apples to oranges. It is apples to apples. You have a service, you connect your clients - unchanged - and one has better latency. That simple.
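To give a flavor of what that message-passing model looks like in code, here's a minimal sketch built on Seastar's public smp::submit_to API - a toy example, not Redpanda code, and the exact headers and names are from my reading of the Seastar docs, so treat it as illustrative. Run it with at least two shards (e.g. --smp 2):

```cpp
// Toy Seastar program: shard 0 asks shard 1 to do some work by sending it a
// message (a lambda) instead of sharing state. Each shard owns its slice of
// the memory reserved up front (controlled by flags like -m2G at startup).
#include <seastar/core/app-template.hh>
#include <seastar/core/reactor.hh>
#include <seastar/core/smp.hh>
#include <iostream>

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        // submit_to() enqueues the lambda on shard 1's task queue; only that
        // shard ever touches its own data, so no locks are needed.
        return seastar::smp::submit_to(1, [] {
            return seastar::this_shard_id();
        }).then([](unsigned shard) {
            std::cout << "ran on shard " << shard
                      << " of " << seastar::smp::count << "\n";
        });
    });
}
```

The same pattern scales to the per-subsystem budgets above: each shard carves its reservation into RPC buffers, its own page cache, write-behind buffers, and so on, and other shards ask it for data rather than reaching into its memory.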
It may be your experience that when you download a Kafka binary, say 2.4.1, you change the GC settings, but in a multi-tenant environment that's a moving target. Most enterprises I have talked to just use the default startup script w.r.t. GC and memory settings (they may change some writer settings, caching, etc.).
At the end of the day there is no substitute for testing in your own app with your own firewall settings w/ your own hardware. The result should still give you 10x lower latency.
I am familiar with Seastar too. It is one component that is pretty useless by itself. What is relevant in this topic is what is around it, the functionality that you provide. This is why Scylla is copying Cassandra. You can come up with a nice way of programming whatever you want, but at the end of the day the business functionality is what matters, and there are different tradeoffs involved, still.
What do you mean "copying" Cassandra? Obviously they're offering the same API. Many people like the Cassandra data model and multiregional capabilities and that's why it was chosen.
What Scylla is doing is unlocking new performance potential with a C++ rewrite and an entirely different process-per-core architecture that gets around the fundamental limitations of Cassandra and makes it easier to run. This performance and stability has also led to the team making existing C* features like LWT, secondary indexes, and read-repair even faster and better than the original implementations.
correct. we use one protocol, just raft - both for metadata and for data.
Raft is really easy to parallelize and dispatch to multiple followers async. I measured recently on 3 i3.8xlarge instances, which give you 1.2GB/s, and I got around 1.18GB/s sustained - https://twitter.com/emaxerrno/status/1260415381321084929
Also, what's nice about using raft is that if there is a bug, we know it's w/ our implementation and not w/ the protocol. so it gives users sound reasoning.
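To illustrate the "dispatch to multiple followers async" point, here is a generic sketch in plain C++ - not Redpanda's actual code, and send_append_entries is a hypothetical RPC stub - of a leader fanning an AppendEntries call out to all followers concurrently and treating the entry as committed once a majority of the cluster has acked it:

```cpp
// Generic sketch of Raft-style parallel replication: send AppendEntries to
// every follower at once, count acks, commit at majority (leader included).
#include <cstddef>
#include <future>
#include <vector>

struct Follower { int id; };

// Hypothetical RPC stub: returns true if the follower appended the entries.
bool send_append_entries(Follower f) { /* network call elided */ return true; }

bool replicate(const std::vector<Follower>& followers) {
    std::vector<std::future<bool>> inflight;
    inflight.reserve(followers.size());
    for (const auto& f : followers)   // fan out to all followers concurrently
        inflight.push_back(std::async(std::launch::async, send_append_entries, f));

    std::size_t acks = 1;  // the leader votes for itself
    const std::size_t quorum = (followers.size() + 1) / 2 + 1;  // majority of the cluster
    for (auto& fut : inflight)
        if (fut.get()) ++acks;  // a real implementation would return at quorum
    return acks >= quorum;
}
```

In Seastar the fan-out would be non-blocking futures on the reactor rather than std::async threads, but the quorum logic is the same.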
Because the JVM is fast. And it's not resource hungry - your program might be resource hungry. Bad desktop Java programs misled a whole generation of programmers about the JVM.
Look at stuff like LMAX. Java can be lightning fast.
The JVM is fast (though I'd say the best thing about the JVM is the amazingly large test suite it maintains). I think he was alluding to the fact that it is much easier to develop predictable systems in a language that forces you to deal with the constraints up front - i.e., writing your own memory allocator for custom pools, knowing exactly the latency, throughput, and assembly generated for it.
Not to mention that a lot of libraries that are immediately available for C++ (see io_uring) take a while to get ported to Java. JNI for a libaio wrapper is also expensive. Last time I benchmarked (a few years back), the JNI switch alone for a crc32 call was 30 microseconds - an eternity - before doing any of the actual work.
In any case, we use Seastar (seastar.io), which I'm not sure can actually be ported to Java. The pinned thread per core makes a lot of sense for minimizing latency, cache pollution, etc. Externally, the feeling that apps in Java are slow is real, less because the JVM is slow per se and more because writing low-latency apps in Java is not the idiomatic way, and those who do seek to extract every ounce of performance from the hardware often look elsewhere, since the work is just about the same.
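To make the "custom pool allocator with known latency" point above concrete, here's a minimal bump-allocator sketch - illustrative only, not smf or Seastar code. Allocation is literally bumping an offset into a region reserved up front, so its cost is fixed and easy to reason about:

```cpp
// Minimal arena/bump allocator: grab one big region, then hand out memory by
// bumping an offset. Allocation is O(1), no syscalls, no locks, no GC.
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

class Arena {
public:
    explicit Arena(std::size_t bytes)
        : base_(static_cast<std::uint8_t*>(std::malloc(bytes))),
          size_(bytes), offset_(0) {
        if (!base_) throw std::bad_alloc{};
    }
    ~Arena() { std::free(base_); }
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;

    // align must be a power of two. Returns nullptr when the arena is
    // exhausted; the caller decides how to apply back-pressure - the point
    // is that there is no hidden pause.
    void* allocate(std::size_t bytes,
                   std::size_t align = alignof(std::max_align_t)) {
        std::size_t aligned = (offset_ + align - 1) & ~(align - 1);
        if (aligned + bytes > size_) return nullptr;
        offset_ = aligned + bytes;
        return base_ + aligned;
    }

    // Free everything at once, e.g. at the end of a request.
    void reset() { offset_ = 0; }

private:
    std::uint8_t* base_;
    std::size_t size_;
    std::size_t offset_;
};
```

A per-subsystem budget like the "30% for RPC, 20% for the page cache" split mentioned earlier is then just a handful of arenas carved out of the one big up-front reservation.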
> i.e.: writing your own memory allocator for custom pools knowing exactly the latency, throughput, assembly generated for it.
If you want to do this, you can use custom allocators and pools on the JVM as well - e.g., frameworks like Netty have pooling strategies for ByteBuffers. If you go that route, you can gain a lot of performance on the JVM too; it might really be competitive.
Unfortunately the JVM still enforces too many heap allocations since value-types are not a thing yet, but it still performs well.
One of the interesting things I discovered about C++/Rust vs C#/Java is that the former language family wants you to do a lot of optimizations upfront, which typically results in good performance. But sometimes you also spend too much time optimizing something that won't matter in practice.
The managed languages, on the other hand, are a bit easier to work with by default but thereby yield only mediocre performance. However, you still have the chance to look into the bottlenecks and improve them by large margins using the right approaches - in C# now even more so than in Java, thanks to tools like Span and value types.
True. I think a rejection of the way operating systems manage resources today (dynamic resource binding) is coming. Pinned threads make a ton of sense in HPC.
You might be correct - it may just be my ignorant experience interacting with present-day Apache big-data technologies and wondering what the hell they are doing under the hood that it takes them so long to process simple stuff in bulk.
Perhaps open source engineers are not "paid" enough to spend countless hours carefully optimizing their programs on the JVM. Properly paid engineers working on proprietary technology can surely bend and twist the JVM and make it perform. It might be possible, but costly due to its complexity and unpredictable nature.
It reminds me of the world of SQL, where you have to run your query with many small modifications and hope that the optimizer will generate a sensible query plan while the query is still readable. That's the cost of building on top of unpredictable systems. You wonder why you can't simply get access to the lower level - the physical plans - and program at that level directly, since you know what your desired outcome at that level is anyway.
I still wonder why new projects for performant data processing are not written for example in Rust since for this task it has all the upsides and none of the downsides of JVM.
Depends. Java's GC can be tuned ad infinitum. It's not a simple task and requires knowledge, but that's the beauty of it: if you need low-ish latencies, then you can tune the GC to target that instead of throughput. For example, we're using relatively large heaps (partly due to inefficiencies in our code), but we still want to stay under 500 msec/request or so. So we told the G1 GC to target 150 msecs per collection, and then adjusted our heap size / code accordingly. It works well.
If you need really hard limits on collection then that's a tricky problem, but that's also tricky when you're managing memory yourself.
Once you start talking hundreds to thousands of requests/second, 500ms is an incredibly long time and you're well past simple tweaks to the GC. Tuning the GC to a high degree is non-deterministic black magic, which is not what you're looking for at that point.
Simple tweaks can go a long way for a lot of developers, but GC performance has been a problem at the last 3 organizations I've been at - and I'm not in the valley or at a FAANG - so it isn't exactly an uncommon scenario for developers.
Thanks! Yeah, I think the same, especially with the rise of core counts. I think an i3.metal is at 96 vCPUs. I estimate a machine like that should hold around 400K partitions just fine with us. I should measure this weekend.
Because most people using the JVM in production understand that your claim is without merit, that performance in JVM land can be tuned to the workload, and that it is never a single-dimension decision to choose a language/runtime. The JVM has many additional properties that make it an excellent choice for big data use cases.
10x faster. API compat. No JVM.
I also know of other private impls of it. Just makes sense.