Sub-10 ms Latency in Java: Concurrent GC with Green Threads

jakewins · on Aug 5, 2020

For years of my life, all I thought about was stuff like this. If you've ever ran latency sensitive systems on the JVM.. man is it ever a pain.

Who was it that turned GC off entirely, minimized allocation and just restarted their VMs when they ran out of RAM every couple of hours, was that Netflix?

Either way. It makes me excited for Rust and the languages it'll inspire, all this labor gone away.

manuelabeledo · on Aug 5, 2020

> Who was it that turned GC off entirely, minimized allocation and just restarted their VMs when they ran out of RAM every couple of hours, was that Netflix?

Every single financial firm out there, using Java for sub-microsecond tasks. Really, there is no other way to keep low latencies if you have your GC messing around every few milliseconds.

This may surprise some people, but Java is ubiquitous in low latency environments such as trading firms. It offers performance close enough to C++, but the developer pool is way larger. Also, when one needs to comply with extremely low latency requirements, the way to go is always dedicated hardware anyway.

If you are interested in the topic, there is this project called OpenHFT that aims to provide high frequency trading tools for Java. Particularly, their Chronicle Queue implementation tries to handle the GC latency issue by storing stuff off heap. Its co-founder, Peter Lawrey, has also delivered a handful of good talks about low latency Java.

vgatherps · on Aug 6, 2020

As someone in the industry I always find the claim that Java is ubiquitous in low latency trading fairly strange.

The only firm that widely uses Java over C or C++ on the hft fast path is Virtu, and the sense I get from some of my friends there is that they regret the decision (and now it’s mostly used to guide fpgas).

There is a TON of Java usage in software that would have been considered HFT maybe 10 years ago, especially at big banks, but the fastest I’ve ever seen somebody make a Java trading program that actually handled the whole tick-to-trade path is 7us, which would be at best ok by C or C++ trading platform standards.

I’m sure you could write some dumb trigger program operating on off-heap bytes that had similar performance characteristics to a C hft program, but unless you accomplish this via code generation then you’re just writing C in Java with a runtime actively battling your goals.

Edit: Jane street uses ocaml + fpgas but they aren’t really in the HFT business in the same way that say Virtu or Tower is.

pjmlp · on Aug 6, 2020

People say this as if it was somehow depreciative, but I rather appreciate writing safe C in Java, which is something WG14 will never take the effort to improve.

vgatherps · on Aug 7, 2020

The C I saw written in Java was filled with object pooling bugs and horribly unsafe code dealing with object layouts in off-heap memory.

The amount of bugs springing from just these far outnumbered memory bugs in the C++ and C codebases.

manuelabeledo · on Aug 6, 2020

I didn't say Java is used over C or C++. In fact, I stated that, given some requirements, you are better off with dedicated hardware.

Granted, I have not worked in finance for over 10 years, but I also try to keep up with the times, thanks to some friends, mainly operating in the EU/UK market. But I would not say it is odd to think that Java is ubiquitous, given how technologies often permeate across teams. This is a bit like what happened to Python and FORTRAN, being the latter still widely used in the scientific world, especially physics, but Python adoption is definitely growing, despite its obvious shortcomings.

Now, I understand that the hard requirements around HFT are wildly different, but from the perspective of an employer, having a larger pool of professionals and tools, would offer significant savings. Why wouldn't they take it?

jakewins · on Aug 6, 2020

> Why wouldn't they take it?

Because the pool of Java devs is only marginally wider than the pool of C++ devs? 5.4M C++ devs vs 7.1M Java devs is not a massive difference in pools:

https://www.daxx.com/blog/development-trends/number-software...

And while I don't have numbers for it, my experience has been that the vast majority of Java devs have very little experience with the kind of specialized use of the JVM that low latency programming requires. On the contrary, I've found it difficult to hire for high-performance Java codebases, because the way they are written is completely different from how "normal" Java is written, so you need to retrain devs used to writing "normal" Java.

manuelabeledo · on Aug 6, 2020

> 5.4M C++ devs vs 7.1M Java devs is not a massive difference in pools

That depends on the geographic distribution, though. Also, we are talking of a 25% larger pool.

> And while I don't have numbers for it, my experience has been that the vast majority of Java devs have very little experience with the kind of specialized use of the JVM that low latency programming requires.

You could say the same about C++ developers. Most of the C/C++ jobs are in embedded devices, which differ vastly from HFT environments.

vgatherps · on Aug 7, 2020

I believe I misread your first sentence, I thought you said every prop shop IS using Java. I have seen this specific claim so many times that I must have filled in the word.

The reason not to use java for low latency latency is that good low latency Java is pretty much only Java in a syntactic and to some degree object sense. Avoiding allocation permeates everything, and is just not worthy the battle.

You basically try to write C++ in Java, except without the language features of C++ to help with memory management and layout, and wit the added burden of having to worry about pools and primitive boxing and the runtime at every corner.

You would also be surprised though at how little hardware has replaced C++ on Linux with good user space networking. Not that many trades are pure latency arb, so getting that last microsecond or two just isn’t worth the effort.

It’s much more common to see hardware in fixed-function locations than it is to see hardware running the meat of a strategy.

dan-robertson · on Aug 5, 2020

The part of GC which causes the most latency issues is compaction rather than merely collection. Using a language like rust won’t help if you have memory fragmentation and indeed allocation tends to be much faster with a GC than with malloc. I think the advantages of rust are more to do with often avoiding heap allocation entirely (and predictably) or value semantics leading to fewer small allocations or the language’s semantics not forcing you to choose between more reliable allocation-heavy immutable-everything code and faster harder-to-think-about mutation-heavy code.

gwbas1c · on Aug 5, 2020

IMO, the main advantage of Rust is that it doesn't require an extensive runtime in order to have memory safety. This allows you to write a library like an image processor or embedded database without using C / C++.

Otherwise, if you wrote your image processor in C# or Java, it becomes hard to call your library from Python or Node because you have to require the entire VM. Likewise, you can ship an application binary that has no requirements for a runtime. (Your application binary doesn't require a JVM, CLR, Mono, Python, Node, or some other runtime.)

I've been through the Rust book twice but I'm just getting to the point of trying to write something in it. The mental model is very different. Coming from C# / Java / Javascript / Objective C; I'm wondering how many hours I need before I can get my head into Rust?

pizza234 · on Aug 5, 2020

I'm learning Rust as well; in my opinion, starting with small guided projects is the most stimulating and incremental approach, although unfortunately, I find that starting practicing Rust - differently from other languages - requires "having read the whole reference".

The resources I've reviewed are:

- Rustlings: I personally don't like the project; it's exercises for the sake of exercising, which may be good or not, depending on the person/approach.

- Ferrous systems exercises (set of mini projects): very small and simple, but interesting to work on. I think they're a very good resource to try immediately after reading the book.

- Rust Programming By Example (book): if one finds one or more projects interesting to work on, this is a very fun book. it's "not-perfectly-produced" though, if you know what I mean.

- Hands-On Data Structures and Algorithms in Rust (udemy): even if one doesn't like algorithms, I think working with data structures implicity develops an understanding of "getting the head into Rust".

- Build your own Jira with Rust (https://github.com/LukeMathWalker/build-your-own-jira-with-r...): An exercise-based approach to practice; probably a good alternative to the previous resource, for those who really don't want to work with A./D.S..

- Talent Plan (https://github.com/pingcap/talent-plan): like the previous, but more technical.

Have fun :-)

gwbas1c · on Aug 5, 2020

> I find that starting practicing Rust - differently from other languages - requires "having read the whole reference".

And that's the problem that I have. In high school I was handheld into C / C++ with weekly lessons. By the time I started my career I abandoned C because the things I worked on professionally had no benefit from manual memory management.

Now the thing that I want to write, an embedded database, requires manual memory management and no runtime. I could, in theory, go back and do it in C. It'd be slow working in a language that I haven't done anything in since 2002, but at least I'm familiar with all the conventions.

Do I basically need to spend 40-80 hours doing silly exercises just to ease into the new conventions and mental model?

pizza234 · on Aug 5, 2020

It's not really clear what you mean with "silly exercises". One of the resources is a course for building "a [...] networked, parallel and asynchronous key/value store", which is far from being a "silly exercise".

Even ignoring that, it's a matter of big picture.

If learning Rust is only for this project, or only to be "quickly proficient in a new language", then I don't think it fits the specific case. The may be alternatives; I don't have experience (somebody else can surely advice better) something like C++ with smart pointers or memory safe D, I guess, could fit.

In the big picture of a career, or even in the context of a single company, spending 40/80 hours to be proficient in a language is essentially an insignificant time.

gwbas1c · on Aug 6, 2020

> It's not really clear what you mean with "silly exercises"

At this point in my experience, if I want to learn a language, I write something "easy" in the language that I want to write for my enjoyment.

For example, when I was between jobs I wrote a personal blog engine in NodeJS so I could get up to speed in modern Javascript and the node ecosystem: https://github.com/GWBasic/z3

"Silly exercises" implies a programming exercise that has little point outside of instructing a basic concept: The kind of exercises I did in high school when I learned C are an example; there was no outside purpose to the code itself. IE, there's no tangible use to the code when it's complete.

What I did 3 years ago was write a small program in Rust that opens links listed in a text file. (I've written many versions of this program over the last 18 years, mostly for self-education.) When I first wrote the program, it was mostly copy & paste, but it compiled even though I didn't understand most of it.

Last night I decided that I was going to recompile it on Windows as my first exercise. I had to change the "open a link" library because it only compiled on Mac, which required changing some code: https://github.com/GWBasic/open_links

Now I'm going to try porting my in-browser Javascript in Z3 to Rust + WebAssembly. Let's see how far I can get!

pjmlp · on Aug 5, 2020

On the other hand it requires jumping through hoops to make borrow checker friendly architecture designs, or fiddle your code base with Rc<> types everywhere.

And when their count reaches 0, you have your stop the world, unless you move the destruction into a background thread, thus manually emulating a tracing GC.

pkolaczk · on Aug 5, 2020

When RC drops to zero it is not STW. It is stop the current thread only. And even that can be trivially solved by background deallocation. Solving latency problem of GC is far from trivial.

pjmlp · on Aug 5, 2020

As if stop the current thread isn't the world the world in single threaded programs.

I explicitly mentioned how moving into a background thread is poor man's tracing GC.

Languages like D and C# offer mechanisms for deterministic deallocation, while keep the productivity of a tracing GC.

orneryostrich · on Aug 5, 2020

But if your program is single-threaded, you don’t need an `Rc` type. The whole point of `Rc` and `std::shared_ptr` is so an object can have one owning pointer per thread, in cases where you’re not sure which thread will finish using the object last.

pjmlp · on Aug 5, 2020

Sure you do, try to implement a Gtk-rs application without the Rc<RefCell<>> dance for widgets state.

pkolaczk · on Aug 5, 2020

Finalizers (IDisposable) or try-with-resources are not equally strong as deterministic destruction in C++ or Rust. Or did you mean a different feature for deterministic destruction in C# that I don't know of? I'm quite curious.

pjmlp · on Aug 5, 2020

You can stack allocate objects, so those are alive just until the end of the stack.

Then there native memory allocation and safe handles.

IDisposable and Finalizers aren't the same thing, actually, although they happen to be used together as means to combine deterministic destruction alongside GC based destruction.

You can also make use of lambdas or implicit IDisposable implementations via helper methods, that generate code similar to memory regions or arenas in C++, but in .NET.

Finally, many tend to forget that .NET was designed to support C++ as well, so it is also possible to generate MSIL code that ensures deterministic destruction RAII style, naturally this falls into a bit more advanced programming, but it can be hidden away in helper classes.

algorithmsRcool · on Aug 5, 2020

> You can stack allocate objects, so those are alive just until the end of the stack.

Well, not really. You cannot stack allocate anything but primitive buffers. So no objects or even strings. So it cannot replace heap allocation for anything but smallish "arrays" of simple types like char and int. This also means you can't use normal structs/value types, only primitives.

> You can also make use of lambdas or implicit IDisposable implementations via helper methods, that generate code similar to memory regions or arenas in C++, but in .NET.

> Finally, many tend to forget that .NET was designed to support C++ as well, so it is also possible to generate MSIL code that ensures deterministic destruction RAII style, naturally this falls into a bit more advanced programming, but it can be hidden away in helper classes.

I'm pretty sure there is no way in C# or in MSIL to explicitly free/deallocate a heap allocated object.

MSIL defines a Newobj opcode, but no Freeobj or anything like it that I have ever seen. You can use custom allocators or unmanaged memory to deterministically allocate and free buffers of structs/values types. But only those that do not include references to managed object references, otherwise you would need to pin and the references yourself and keep the GC aware that there were non-tracked references to those objects. It gets messy fast.

pjmlp · on Aug 6, 2020

It is so easy, write the code that you want in safe mode C++/CLI, get the code template, then implement the helper classes to generate the same MSIL on the fly.

As for stack allocation, apparently you missed structs.

Here is your string allocated on the stack.

    unsafe struct cppstring
    {
        const int BuffSize = 1024;
        fixed char data[BuffSize + 1];
        int current;
        
        public cppstring(System.ReadOnlySpan<char> buffer)
        {
            for (int i = 0; i < System.Math.Min(BuffSize, buffer.Length); i++)
                data[i] = buffer[i];
            current = 0;
        }
    }

    public class StackDemo {
        public void Myfunc() {
            var str = new cppstring ("Hello from stack");
        }
    }

Providing std::string like operations is left as exercise for the reader.

algorithmsRcool · on Aug 9, 2020

But that object is not a string, it is char buffer. You cannot directly substitute one for the other.

Yes I realize they can be used to store the same data, but you've defined a new type that is incompatible with an actual string object.

shpongled · on Aug 5, 2020

Easiest way to get started if you're coming from a GC background is to just liberally `.clone()` everything in Rust. Once you're used to move semantics and the syntax, then you can start messing around with borrowing. It definitely has a learning curve, but I find it a breeze to write once you grok the ownership rules.

nestorD · on Aug 5, 2020

One thing I love about Rust is that allocation are very explicit and easily spotted which helps a lot when one wants to avoid them.

I found C++ to be treacherous around corners cases on this subject.

Matthias247 · on Aug 5, 2020

I actually don't think so. I've seen plenty of Rust libraries that copy strings about 5 times unnecesssarily between usages, just because `.clone()`, substrings and co are so convenient. Those all could have been optimized away, but the authors of that code didn't knew or didn't try.

And if you do `MyAwesomeStructure::new()` you might actually trigger a whole bunch of allocations which are invisible.

So a "yes" from my side on Rusts ability to remove allocations if you try hard enough. A "no" however on allocations being extremely explicit and easy to see for non experts.

nestorD · on Aug 5, 2020

dthul's comment [0] covers my point of view fairly well. Clones and new are easy to spot, you might not be careful about them because you do not care or have other priority but you can find/grep them quickly: they are explicit.

Meanwhile C++ has a lot of implicit memory allocation and things that might or might not allocate.

[0]: https://news.ycombinator.com/item?id=24061235

pjmlp · on Aug 5, 2020

Unless you are going to review the whole code, written from scratch, there is no way to actually be aware of all allocations in Rust without help from a memory profiler.

dthul · on Aug 5, 2020

Just one or two days ago I asked here on HN how memory allocations in C++ are considered to be more hidden than in Rust and got some good replies: Especially constructors, copy constructors, assignment operators etc. can introduce non-obvious allocations.

For example:

  T t;
  a = b;
  T t2 = t;

can all allocate in C++. The equivalent in Rust:

  let t: T; // won't allocate
  let t = T::new(); // might allocate
  a = b; // won't allocate
  let t2 = t; // won't allocate
  let t2 = t.clone(); // might allocate

So in Rust you can tell that as long as there is no function call, there won't be an allocation.

pjmlp · on Aug 5, 2020

Like operator overloading and implicit conversations.

dthul · on Aug 5, 2020

True, those are even more places where C++ can implicitly allocate. The operator overloading also applies to Rust. Rust has no implicit conversions though, which are arguably worse since they are invisible (that's why I usually mark all my expensive single argument constructors as "explicit").

pjmlp · on Aug 6, 2020

Sure it has, you cannot ensure Deref implementations don't allocate.

dthul · on Aug 6, 2020

While technically true, the documentation makes it very clear that Deref should only be implemented for smart pointers and never fail. So no allocations in practice.

pjmlp · on Aug 6, 2020

Ah problem solved then, we just need to document very clear that C code should not corrupt memory, how did I never thought of it.

cogman10 · on Aug 5, 2020

I think the point was more that allocations are fairly explicit.

There are a few places in C++ where allocation can happen pretty much invisibly. A copy constructor is an example of that. You might see a new allocation simply by calling a method.

With rust, you usually won't see an allocation unless it is explicitly called for. You can follow the call tree and very easily pick out when those allocations are happening.

pjmlp · on Aug 5, 2020

Rust also has operator overloading and implicit conversations, both can cause memory allocations.

cogman10 · on Aug 5, 2020

Certainly. However, both of those things (from what I've seen) are fairly rare.

The C++ copy constructor comes up fairly frequently.

pjmlp · on Aug 5, 2020

Like unsafe Rust code, there is the theory and then there are the code bases that one finds out in the wild on crates.io and in-house, not necessarily using best practices.

dthul · on Aug 5, 2020

Rust does not have implicit type conversions.

pjmlp · on Aug 6, 2020

Via Deref.

dthul · on Aug 6, 2020

Not in practice. It's only used for reference coercions to existing objects.

You would need to ignore the documentation and bend over backwards to use it for an actual type conversion, because you can only return references.

pjmlp · on Aug 6, 2020

Bending backwards is a synonym to code quality delivered by consulting/offshoring projects across most big corporations.

darksaints · on Aug 5, 2020

Don't forget scanning. Yes, moving blocks of memory around is expensive, but it can also be done concurrently. Scanning, AFAIK, cannot be done concurrently, and thus remains the primary blocker to lower latency. And scanning is something that is entirely eliminated with static memory management.

pron · on Aug 5, 2020

Scanning is most certainly done concurrently with ZGC. Even root scanning is on its way to become fully concurrent, which is why we're nearing the goal of <1ms latency.

jerrinot · on Aug 5, 2020

I didn't know about fully concurrent root scanning, thank you!

How does root scanning work wrt to Loom? Are stacks of virtual threads treated as roots? I guess there is no other option?

pron · on Aug 5, 2020

No, virtual thread stacks are not roots! This is one of the main design highlights of the current Loom implementation. In fact, at least currently, the VM doesn't maintain any list of virtual threads at all. They are just Java objects, but the GC does treat them specially.

kjeetgill · on Aug 5, 2020

Interesting! In what way are they handled specially?

pron · on Aug 5, 2020

Unlike any other object, the location of references on a stack can change dynamically, so the GC needs to recognise those objects and walk their references differently. There are other subtleties, too.

pkolaczk · on Aug 5, 2020

Right, it is concurrent, but it is still costly. It brings rarely used data into the caches and pushes useful data out of the caches. If some parts of the heap were swapped out, the impact of concurrent scanning can be quite dramatic.

pron · on Aug 5, 2020

Ah, but you can pin GC threads to specific cores, and a reference-counting GC also has such non-trivial costs. In practice, however, people in the 90-95% "mainstream" domain that Java targets are very happy with the results. Of course, there are some applications that must incur the costs of not having a GC. In general, though, the main tangible cost of a GC today, for a huge portion of large-scale applications, is neither throughput nor latency by RAM overhead.

pjmlp · on Aug 5, 2020

I also enjoy the low level C++ like tooling that .NET offers, but I am also looking forward to the fruits of Panama and Valhalla in that direction.

pkolaczk · on Aug 5, 2020

Pinning doesn't solve the problem of cache L3+ thrashing.

pron · on Aug 5, 2020

It could if the GCs used non-temporal instructions that bypass the L3. Of course, how much of a problem this is in practice in most applications is something that would need to be measured.

pkolaczk · on Aug 5, 2020

Point taken. What about swap?

pron · on Aug 5, 2020

Yeah, swapping could be really bad and should be avoided. So: don't swap :) Java's memory consumption can't go up indefinitely. The most important setting is the maximum heap size. Set it to a good level and don't swap.

haxen · on Aug 5, 2020

All the modern GCs scan the heap concurrently, the hardest problem is scanning the GC roots in the call stack. ZGC is currently implementing concurrent stack scanning.

wokkel · on Aug 5, 2020

Yes it can. Stop the world collection is a thing of the past. See for example: https://developers.redhat.com/blog/2019/06/27/shenandoah-gc-...

jerrinot · on Aug 5, 2020

Modern Garbage Collectors do concurrent scanning.

I believe most GC implementations have non-concurrent "initial marking" phase, but that's typically fairly quick. It has to scan roots of your object graph, think stack, JNI, etc.

dan-robertson · on Aug 5, 2020

Scanning can be done incrementally with each allocation (Such that allocations become slightly more expensive but no individual allocation does loads of scanning work). Scanning can also be done concurrently.

rurban · on Aug 5, 2020

Scanning is also entirely eliminated by using no global heap allocations. With copying and small stack allocations there's not need to scan much, and can easily stay below 1ms.

bananaface · on Aug 5, 2020

Can't you just allocate a huge block up-front and throw stuff into it with a custom allocator? I don't know if Rust allows you to do that kind of thing.

steveklabnik · on Aug 5, 2020

Yes. There are some trade offs, you can’t parameterize the standard library data structures over an allocator yet, for example.

AtlasBarfed · on Aug 6, 2020

So arena allocation and buffer reuse?

pron · on Aug 5, 2020

In case you missed this post and the previous ones, here's the news: this labour is gone on the JVM. The GCs in JDK 14-15 are good enough pretty much out of the box, and they're getting better very quickly now.

OTOH, if you think other languages let you do away with a GC without pretty significant extra work, especially in concurrent systems, well, then you haven't had experience with those languages.

tjoff · on Aug 5, 2020

That statement has been said for pretty much every release of any GC.

I'd say the opposite is true. Relying on GC requires significant extra work. Because you always need to think about memory (exception: small script-like applications). The only thing a GC does is that it enable you to not think about it, but the moment you don't you will write bad code and realize it was a disservice all along. And by then it is too late.

So in a GC language you need to constantly be aware of when you take something for granted. Which is more work than just doing it manually yourself.

pron · on Aug 5, 2020

First, a GC allows you to abstract over memory allocations (i.e. hide them). This makes maintaining code over time significantly easier, as allocation becomes a hidden implementation detail.

Second, while there have been similar claims made in the past, they always apply for a certain target throughput and latency. GCs are constantly making great strides in that regard. Next year, ZGC will have a <1ms worst-case latency for pretty serious allocation rates, and with an acceptable hit to throughput. As you can see in this post and previous ones, G1 offers great throughput with acceptable latencies.

haxen · on Aug 5, 2020

> ZGC will have a <1ms worst-case latency

In our benchmarks we never saw a GC pause of more than 2 ms on either ZGC or Shenandoah, but the end-to-end latency, the one the user cares about, is impacted by much more than a single GC pause. Sometimes there would be several pauses in a rapid sequence, or just the background GC thread would do too much work at once.

Even after dedicating a core or two to the GC, you still face the issues of cache pollution and RAM throughput stealing that heap walking incurs.

pron · on Aug 5, 2020

Yep, that's because concurrent stack processing (https://openjdk.java.net/jeps/376) hasn't been merged yet, nor has the changes to the VM's internal references. Both of those are coming within a year. Until then, there's still significant work done by the GC at safepoints.

haxen · on Aug 5, 2020

A month ago we benchmarked against the master of the ZGC repo. There wasn't a big difference.

pron · on Aug 5, 2020

Sorry, I misread your previous comment. The planned improvements that would make the biggest difference are making ZGC generational, and improving scalar-replacement (or doing some sort of stack allocation).

jakewins · on Aug 5, 2020

This GC you speak of that hides allocations as an implementation detail sounds excellent! Every GC I've worked with - and my day job is orchstrating several thousand JVMs - simply moves the time when you need to think of allocations somewhere else.

Either it moves it into PagerDuty, like G1, or it moves it into your GCP bill like Shenandoa and ZGC.

The trade-off has always been latency-throughput-footprint, nothing I've seen yet has changed that. The innovation in Rust is realizing you can do all the tracing work at compile time.

pron · on Aug 5, 2020

An implementation detail doesn't mean that you don't think about it. We think about implementations all the times. It means that it's a detail, and changing a subroutine's allocation pattern does not create a ripple-effect throughout the code in its transitive consumers.

> The innovation in Rust is realizing you can do all the tracing work at compile time.

This is spectacularly false. For all non-trivial allocation/deallocation patterns, Rust also uses a runtime reference-counting GC, which is significantly slower than the tracing GCs you find in OpenJDK. The benefit comes from not relying on it too much, but this comes at a considerable cost of lost abstraction, which means more costly maintenance over the years. This is the same for all low-level languages; the difference Rust brings is that it (conservatively!) checks for memory access errors.

Another difference is that most people who talk about Rust haven't actually written a significant application in it and had to maintain it for years. I'm not saying it's impossible -- people do this for C and C++, which make a similar tradeoff in this regard -- but it does come at a substantial cost.

jakewins · on Aug 5, 2020

At this point I think we are arguing somewhat subjective things.

I used to be a hard-core zealot for the JVM as the performance platform of the future - and did talks arguing just as you are here why HotSpot outperforms $LANGUAGE in real applications. I feel like I'm hearing myself in your argument..

I wrote a signficant portion of the Neo4j storage engine, which is in Java. Now I'm writing another database engine in Rust (sidenote: not for replacing the Neo4j engine, just because it's interesting). Arguably database engines qualify as "significant applications".

I find - subjectively:

- Maintaining performant code in Rust is easier. I do the same patterns as I did in Java, except it doesn't rely on easy-to-break assumptions of how HotSpot happens to work (ex: stack allocations)

- Like you said elsewhere, the issue is often about fragmentation, ultimately stemming from object churn. I find that Rust makes it, culturally perhaps, easier to maintain low allocation code than Java. (sidenote: I think these two points is also why Go code generally has "better GC behavior"; Go doesn't have a better GC, it has a language that encourages less heap allocation)

- The engine I'm writing in Rust is faster than anything I've written in Java and - critically - runs for days and months without notable stalls.

Come to the dark side!

cogman10 · on Aug 5, 2020

I personally like both languages and think they both have a place. My 2 cents.

The JVM works exceptionally well when you are dealing with long lived apps which deal with a lot of allocations and the machine running it has a substantial amount of memory. Things like webservers, for example, are near perfect fits for the JVM.

Rust does really well when you need high performance, low memory, and ultra fast startup times. It won't necessarily outperform the JVM when you talk about doing a lot of heap allocations (due to heap fragmentation) and it unfortunately suffers from the same heap fragmentation if you are dealing with a long lived server that does a lot of allocations. But then, maybe that performance loss is acceptable for the ability to very quickly scale up and down servers.

Now, the JVM is making great strides towards getting faster startup times and even fast performance at startup (AppCDS). However, those strides often involve trade offs with either build complexity or performance losses (such as Graal's AOT). The benefit for rust is that it is as fast as it ever will be without any special build steps or tweaks.

Oh, and let's not forget diagnostics. Flight recorder for the JVM is simply AMAZING. The ability to hook up to a poorly behaving production server, start flight recording, and getting detailed information about things like "where are allocations happening" or "what are the hot methods" is simply amazing. No other platform that I know of has the level of detail you can get right out of the box with flight recorder. Certainly not without restarting the application with additional configuration. For example, you'd need a special build of rust with profiling turned on to even start to get the same level of info, doing such also significantly negatively impacts performance.

pjmlp · on Aug 5, 2020

Just a small correction, AOT and JIT caches like AppCDS were already available in commercial JVMs like J/Rockit or IBM WebSphere Real Time, as two examples from a couple of possible ones.

What is happening now is that the free beer Java users are also getting those features on the package.

pron · on Aug 5, 2020

I'm already on the dark side. Rust isn't my cup of tea, but I do a lot of programming in C++, and hope to one day use Zig for low-level programming. But this does come at a considerable cost, so I do it only when I must -- i.e. I'm targeting a constrained environment or I really need low-level control. The cost isn't so much up front (aside from the big loss in observability), when you first write the code, but low-level languages result in rigid programs that are much harder to evolve over years, certainly with large teams, that with Java. Separate from that, JDK 14 doesn't behave like JDK 8. If you need performance and observability and you are not constrained, then Java is an excellent choice that can have a big positive impact on costs.

thenewwazoo · on Aug 5, 2020

> For all non-trivial allocation/deallocation patterns, Rust also uses a runtime reference-counting GC...

Lest anyone read this and think it's true, it's not. Using Rc is a design choice, and not one that is a given. I have written tens of thousands of lines of Rust code doing very heavy data processing and used Rc only a handful of times. In fact, I find using Rc without a very good reason to usually be a bad idea that enables lazy thinking.

pjmlp · on Aug 6, 2020

I on the other hand have used Rc quite a lot, it is either that or having to twist the application architecture to somehow fit into Gtk-rs expectations.

imtringued · on Aug 5, 2020

>This is spectacularly false. For all non-trivial allocation/deallocation patterns, Rust also uses a runtime reference-counting GC, which is significantly slower than the tracing GCs you find in OpenJDK.

This is not really a good comparison because you can't write the same code in either language. In Java practically everything gets allocated on the heap barring some optimizations. Meanwhile Rust programs can selectively allocate memory on the stack when it makes sense to do so. Reference counting is is just one of many different allocation strategies available to Rust. It is not the first tool you grab when you want to allocate memory in Rust, therefore absolute throughput of reference counting might not be as relevant in Rust as the absolute performance of the GC in Java.

feanaro · on Aug 5, 2020

> This is spectacularly false. For all non-trivial allocation/deallocation patterns, Rust also uses a runtime reference-counting GC, which is significantly slower than the tracing GCs you find in OpenJDK.

Are you talking about `Rc` and similar smart pointer types? If so, the twist is that in Rust almost all allocations in Rust are trivial in this sense.

pron · on Aug 5, 2020

There's no "twist". There was a claim that Rust does tracing at compile time; it does not. It's allocation patterns and costs are similar to all other low-level languages. The cost and benefits of such languages are well known.

feanaro · on Aug 9, 2020

My point was that Rust's compile-time machinery and conventions allow you to avoid using referencing counting smart pointers like `Rc` in almost all cases. What you call trivial patterns of allocation is in fact the prevailing type of allocation in Rust.

CyberDildonics · on Aug 5, 2020

> This makes maintaining code over time significantly easier, as allocation becomes a hidden implementation detail.

You say that, but there seem to be no end to the stories of people spending enormous amount of time fighting the gc. It is not difficult to avoid heap allocations in other languages as well as freeing them deterministically.

pron · on Aug 5, 2020

First, you say it as if there aren't event bigger struggles without a GC. So the comparison is not "I'm fighting with the GC vs. I'm not" but "I'm fighting with the GC vs fighting other things, which is better?" So you hear those stories because those are the stories people who use a GC can tell.

Second, as this blog post series shows, the "fight" is not what it used to be. You don't really need to control allocation any more until your rates are really high. Java's GCs have just gotten so much better in JDK 14 and beyond.

CyberDildonics · on Aug 5, 2020

> First, you say it as if there aren't event bigger struggles without a GC.

I'm extremely skeptical about that. My experience is that with modern C++ you lose very little elegance and gain a huge amount of control by giving up a garbage collector. Memory management becomes a very minor problem. The vast majority of memory allocations are avoided and those that need to be there can be done ahead of time.

I have never heard anyone writing a latency sensitive program in C++ (games, trading etc.) say that their life would be easier if they were using a gc or that they wished they could do it in java. I have however seen decades of people talking about all the extreme lengths and rabbit holes they go down to deal with the java gc.

From a broader perspective, pretty much any language with a gc ends up having a perpetual conversation around how the next gc will solve the problems with the current gc. You can see it in java, go, julia, and D. The only one I never hear about is LuaJIT, but maybe I just haven't seen it or maybe the expectations are lower.

pron · on Aug 6, 2020

> My experience is that with modern C++ you lose very little elegance and gain a huge amount of control by giving up a garbage collector.

The cost of maintaining a large C++ application (>1MLOC) with a large team over years is very significantly higher than a similar Java application. In some cases the footprint and/or performance benefits are worth that extra cost, but in the vast majority of cases they're not.

> I have however seen decades of people talking about all the extreme lengths and rabbit holes they go down to deal with the java gc.

Again, 1. Java's GCs changed dramatically in the last two years -- the GCs described in the post, are brand new/recently revised and 2. that's because that's Java's particular rabbit hole. C++'s rabbit holes, from undefined behaviour, through partial evaluation with templates and constexprs, to compilation times and sheer language complexity are far deeper.

> I have never heard anyone writing a latency sensitive program in C++ (games, trading etc.) say that their life would be easier if they were using a gc or that they wished they could do it in java.

Their lives would be easier if they could do it in Java, but sometimes they can't. I think that the changes in the last couple of years and the upcoming changes in the next few years will make Java more appropriate even in domains where it hasn't been used before, but it's fine if not. Its market reach is so huge as it is. But games are not often maintained for many years, and telemetry isn't that important, so Java's benefits are not as big as for servers.

CyberDildonics · on Aug 6, 2020

> The cost of maintaining a large C++ application (>1MLOC) with a large team over years is very significantly higher than a similar Java application.

I am very skeptical of this, I don't know why it would be the case. My experience is that with modern C++ and avoiding inheritance programs end up much more direct and clear since a type isn't fragmented into multiple classes and base objects don't need to be used for generic programming and data structures.

> C++'s rabbit holes, from undefined behaviour, through partial evaluation with templates and constexprs, to compilation times and sheer language complexity are far deeper.

This seems like what would be said by someone who has just read a few comments on C++ here and there but not actually used it for non-trivial projects. These are rarely issues. I don't know what 'partial evaluation with templates' means and constexprs didn't even exist until recently. Compilation times do seem to be a big problem, mostly because many projects don't do anything with their structure to mitigate them.

eigenspace · on Aug 5, 2020

> From a broader perspective, pretty much any language with a gc ends up having a perpetual conversation around how the next gc will solve the problems with the current gc. You can see it in java, go, julia, and D.

Do you have any references handy on the julia bit there? I actually have seen very little conversation in the julia community about replacing or upgrading the GC. Mostly just the ocassional post from an inexperienced user who thinks that a borrow checker would be a good fit for julia.

Discussion seems to almost always revolve around showing users who need it, how to manually manage memory when necessary by pre-allocating arrays, using in-place operations or writing stack allocated code so that they avoid the GC in performance critical code.

I've never seriously used a language without a GC, but my feeling in Julia has always been that I never really had gripes about the GC because it's so easy to avoid the GC and take memory management into my own hands.

CyberDildonics · on Aug 5, 2020

There were plenty of discussions a few years ago. Julia is probably used the least out of those languages for things that might have latency problems though - servers, guis, games, etc.

eigenspace · on Aug 6, 2020

Interesting, in the three years I've been around I've seen little discussion of it. The main people I was aware of who cared about low-latency were robotics people and they mostly found that the techniques for manually managing memory allowed them to avoid the GC and get the latency they needed quite easily.

One complaint I will say I've heard though is that while these people find they can get the allocation behavior they want (i.e. none), some of them would like semantic guarentees that they will not hit allocations, rather than needing to test and make sure their code doesn't start hitting the GC if they switch Julia versions. Someday we might be able to provide such guarantees, but for now GC behavior is just an implementation detail that can change across minor versions.

I can sympathize with those who find that uncomfortable for sure, but in practice, Julia versions have been consistently better at getting more automatic stack allocations, not less, so it hasn't really been a problem.

imtringued · on Aug 5, 2020

Lua without a JIT is just a glue language. The expectations are low and LuaJIT is easily smashing them because a lot of Lua code is/was written for the original interpreter.

dtech · on Aug 5, 2020

This is only the case when latency is important below some threshold†. There is a very large class of applications that can work with that limitation, and in that case a GC can free up programmer mind-space by letting him mostly forget about memory management. You're optimizing for productivity of the developer instead of hardware usage.

Rust or other upcoming or future languages might change that.

† The threshold varies between VM and GC, but usually <10ms is easily achievable

pron · on Aug 5, 2020

ZGC's pause latency is at a point where it rivals, or about to rival, nondeterministic pauses by the OS. So unless you're running on a realtime kernel, GC is not an issue any more as far as latency is concerned. The only real, serious cost for modern GCs is RAM overhead.

imtringued · on Aug 5, 2020

I'm not convinced to be honest. Even with tiny heaps like 32MB JVM processes with practically nothing but a hello world micronaut project consume up to 200MB of RAM. There is more to this than just the dumb GC which always predictably hits the configured maximum or if there is no maximum it will just gobble up your most of your system. I once saw an instance of tomcat allocate 30GB to itself on a staging system with at most one user at a time simply because the machine had 128GB RAM. 30GB for nothing. I personally dislike the JVM. However I have no problem with Java or any JVM derived languages but it simply doesn't matter how much you hyper optimize your code if the underlying VM is junk.

pron · on Aug 6, 2020

The way RAM is managed in Java and in C is very different; neither is better than the other, but it's different, and you don't seem to understand the Java model. In that model, you give an application a maximum heap, which the program will use regardless of the minimum it needs to improve its performance. You want it to use less RAM? Just tell it to. Having said that, very recently Java has started to return RAM to the OS if it's just sitting idle. See https://malloc.se/blog/zgc-softmaxheapsize

RhodesianHunter · on Aug 5, 2020

This is not really true at all though, nor is it how people who design performance intensive systems go about it.

generally you design your application the way you normally would, do some testing, and then go back through the hot path and make some adjustments.

karlmdavis · on Aug 5, 2020

I mean... Rust literally doesn't have a GC, in the usual sense -- certainly nothing resembling most GCs' generations, mark, sweep, etc. approaches.

Steve lays it out far better than I could, coining the term "static garbage collection": https://steveklabnik.com/writing/borrow-checking-escape-anal....

And to your "without pretty significant extra work" qualifier: I really don't find that to be true with Rust. The initial learning curve was a bit rough, but certainly far less so than other languages/platforms I've picked up (hello, ML). In the end, I find that it's just a nice, helpful, productive, and ludicrously performant language.

pron · on Aug 6, 2020

First of all, reference counting is a GC algorithm. Mark and sweep is an example of a tracing GC algorithm, which is usually faster than reference counting, but it is true that Rust normally does not rely on any GC, static or dynamic at all, but on manual memory management, assisted by the type system.

The thing is that most of what you said about Rust is true of C++ as well. It really isn't hard to write a good fast program in C++ once you know it. The problem -- as those of us, like me, who have been writing in low-level languages for a couple of decades now know -- is that low-level programs in any low level language are necessarily rigid. Changing something in one place often has a much bigger impact on the codebase than in a high level language, making the overall cost much higher.

Low level languages have their place, but they won't replace high-level ones for "ordinary" application development. People will only pay the price when that extra 5% is important or when running in a constrained environment. This is the same equation that's been around for twenty years and there's no sign it is changing.

lostmyoldone · on Aug 6, 2020

Rust is really nice, I like it a lot. It's probably also one of the easier languages to write performant code with reasonable latencies in most cases, almost all. However, automatic resource freeing/deallocation can still bite you if you need really reliable low latencies, and if you think the language will handle it all.

Since the automatic deallocation in rust, per default, happens at the point where the value goes out of scope, if a resource destruction could block, or perform expensive operations on destruction, it can not be allowed to go out of scope in a latency sensitive thread.

Usually not that much of an issue unless you are chasing really low latencies.

But things can still happen in rust that catch you off guard. Say a value goes out of scope, and because of reasons, it does so with a destructor doing a logging call, which has accidentally become blocking on a socket send call, which it did because nobody realized that the AWS/GCP/whatever logger adaptor actually didn't perform all IO in a thread with which was only communicated with locklessly, which nobody noticed before because it was only if a buffer was full, which only happened today because ....

Not a big deal, it's almost all the same things which mess upp latencies in C++ code. And that's the thing. It's not necessarily easier to get low latency in rust than in C++, but the work required for hitting a quality/performance/latency target in rust is probably still lower than for C++. Unless you are lucky to have a very mature C++ low-latency stack, together with all the utility functionality you need, which seems to be exceedingly rare. Is the work required lower than Java on a custom/tuned JVM?

We'll have to wait and see. It probably is, but it's a complex balance between access to utilities, language complexity, and several more parameters which ultimately decide which platform provides the best environment for low latency code, especially if the complexity is non trivial.

VHRanger · on Aug 5, 2020

In D, the GC is guaranteed to only ever run if you allocate new memory on the heap.

This is still a lot of work for a video game (because you never want any latency, the only way to achieve this is with an arena allocator or going full @nogc).

But for apps where the latency requirement is bounded, D doesn't make it hard and the language is nice and ergonomic.

chrisseaton · on Aug 5, 2020

> In D, the GC is guaranteed to only ever run if you allocate new memory on the heap.

Isn't that how most GCs work?

Why would you do anything else? To release memory to the OS? That's not really a priority in most runtime systems, and I think wanting to do that is a pretty niche requirement.

frant-hartm · on Aug 5, 2020

I want memory allocation to run in constant time, so I can achieve low latency. That's the point of concurrent GCs - they run in the background to free up memory.

chrisseaton · on Aug 5, 2020

I know they run in the background... but I don't think they run in the background if there's nothing to do, do they?

imtringued · on Aug 5, 2020

A JVM with default settings will happily allocate more memory without attempting to even run a single GC cycle until it hits the configured heap size. Once it hits the limit it will try to run the GC.

readittwice · on Aug 5, 2020

Releasing memory is definitely useful for mobile devices or browsers. Even in some server use cases it's useful, e.g. when you pay for memory usage. I guess that's why the JVM has this -XX:SoftMaxHeapSize option.

I guess it depends whether GCs are always scheduled in an allocation or can be triggered another way. Either way that should be easy to disable.

I read somewhere that D doesn't have write barriers, so I would assume they have a hard time implementing more advanced GC features like generational collection or concurrent marking. It's not suprising that the GCs in the JVM achieve much better pause time.

nemothekid · on Aug 5, 2020

In Java, everything is heap allocated, whereas in D you have a choice.

chrisseaton · on Aug 5, 2020

> everything is heap allocated

You can definitely write useful Java applications where nothing gets heap allocated, through using existing objects, and through using scalar-replacement-of-aggregates.

wokkel · on Aug 5, 2020

Not entirely correct: https://shipilev.net/jvm/anatomy-quarks/18-scalar-replacemen... . Hoi cannot force a specific allocation, but the jit does do something akin to stack allocation.

manigandham · on Aug 5, 2020

That's the trade-off. Predictability and determinism vs development and testing effort.

The memory work remains the same, you can do it yourself or let the GC handle it. For 99% of applications, the GCs are good enough and getting better every year, but low-latency still needs predictability and would ideally choose manual memory management.

The realities of the job market and IT deployments are different though and that's why we still have JVMs involved with low-latency scenarios because of talent, tooling and productivity.

twic · on Aug 5, 2020

They're good enough for some purposes, but not for others. If you need <10 us latency, you still have to jump through hoops to get that on a JVM.

haxen · on Aug 5, 2020

For < 10 µs latency there are hoops in any language, as well as the OS, like thread-pinning, marking CPU cores unusable for OS interrupt handling, all kinds of virtual memory issues, etc. And you can't afford a single network hop or even SSD access.

twic · on Aug 5, 2020

True! It's not trivial to get 10 us latency even in C++, Rust, etc. You have to do OS-level configuration, and you can't touch network or disk.

But beyond that, the actual code you write is fairly natural for those languages. You can't allocate, and your code and data have to fit in the cache. But you can use normal language constructs, and most of the standard library - neither of which is true for Java.

Erlich_Bachman · on Aug 5, 2020

And yet 90% of time we hear about this latency problems when speaking about a language, it's always about JVM... When it would be equally hard in any other dynamic language... What gives...

vips7L · on Aug 5, 2020

I don't know of any other runtime that gives you the tools to tune your latency and the GC other than the JVM.

exdsq · on Aug 5, 2020

GHC (Haskell)?

saberience · on Aug 5, 2020

Err Dotnet? :)

therealdrag0 · on Aug 5, 2020

Might be a side affect of Java’s dominance/popularity.

PaulDavisThe1st · on Aug 5, 2020

>And you can't afford a single network hop or even SSD access.

10µs ? You can't even afford that many function calls.

Seriously, 10µs is not a sensible target for a general purpose OS, maybe not even for a general purpose CPU. Achievable? Perhaps. But sensible? Not really.

rowanG077 · on Aug 5, 2020

That's really insanity to me. 10µs is a goddamn long time for a modern computer. That's 30000 clock cycles for a 3Ghz CPU. Each clock cycle can handle multiple instruction. It should be trivial to handle most light computations in this time as long as you ensure the data is in cache.

PaulDavisThe1st · on Aug 6, 2020

That's a very fair point. I suppose I'm too used to the idea that the code which runs with sub-N-µs latency is going to process a "lot" of data, and that there's no way it will all be in the cache (any cache).

I suppose that if you're working on small amounts of data every time your code executes, then this becomes vastly more reasonable.

nradov · on Aug 5, 2020

You can also use the Real-Time Specification for Java (RTSJ). Essentially it allows you to do all your memory allocation, then shut off the GC.

https://jcp.org/en/jsr/detail?id=282

CyberDildonics · on Aug 5, 2020

If something is so latency sensitive and crucial making the java garbage collector such a hindrance, why not start writing parts in C++? It seems to me people end up getting into a situation where they are fighting with the java gc to try to get low latency with huge heaps and constant allocation when it really is not difficult to control memory allocation in modern C++.

nvarsj · on Aug 5, 2020

> Who was it that turned GC off entirely, minimized allocation and just restarted their VMs when they ran out of RAM every couple of hours, was that Netflix?

This was common practice in trading firms that got on the Java hype train. Turn off GC and just restart the JVM outside of trading hours.

> Either way. It makes me excited for Rust and the languages it'll inspire, all this labor gone away.

The JVM gives GC a bad name. There are plenty of GC languages which don’t have the level of pain of hotspot except in extreme cases. For the vast majority of GC languages you never even think about it. Rust / C++ are great when you need full control but it’s not necessary for most things.

pron · on Aug 5, 2020

No runtime has a better GC than Java's, certainly in the current version (14) which is worlds better than Java 8. The reason you don't think about it in other languages is because if you use other languages, you probably don't care too much about performance in the first place.

nvarsj · on Aug 5, 2020

So if I care about performance, I would use the JVM? I spent years as a performance engineer on HFT trading systems tuning JVMs. It’s possible to get very good performance on the JVM, but it is also hard and for the layman almost impossible. Yes, it has great throughput, but only because it has to. The design of hotspot / JVM and the languages that run on it encourages massive heap allocations of gigabytes a second. I can’t think of any other language or runtime that is comparable in terms of its memory bloat. Even if you control allocations, and have happy path GCs, hotspot GC still causes occasional pauses of 10-100s milliseconds. And this is considered good.

I know that lots of people have built careers on the JVM and defend it vigorously. Whenever I make a comment that is negative on the JVM, I always get downvoted without fail on HN. I used to be one of these people, after all, so I can relate.

If I truly cared about performance these days, I’d use C++ or Rust. Otherwise, I’d use a language like golang with reasonable default GC behavior and allocation patterns - where I know the daemon I write won’t use more than 100-200Mb, and probably even less, and will typically have sub-ms pauses. I think even most scripting languages, like Ruby/Python, have reasonable memory usage and GC patterns for web development, but I’m not as familiar with them.

I probably care more about energy consumption these days though. And again, the JVM is the worst offender. The average developer building a Java web app will create something which consumes gigabytes of memory usually. These apps tend to spend the majority of their CPU time allocating and GCing memory. I can only guess how much server time has been consumed to satiate the memory hungry needs of the JVM.

pron · on Aug 5, 2020

> So if I care about performance, I would use the JVM?

No. If you use the JVM you might well care about performance. If you care about performance, you should consider the JVM.

> I spent years as a performance engineer on HFT trading systems tuning JVMs.

The JVM is not designed to give you 100% performance (of the ideal offered by the hardware). It's designed to give you the easiest means of achieving ~95%, provided you're OK with some nondeterminism.

> I’d use a language like golang with reasonable default GC behavior and allocation patterns - where I know the daemon I write won’t use more than 100-200Mb, and probably even less, and will typically have sub-ms pauses.

You'd be wrong, because JDK 14 will most likely give you substantially better performance than Go out of the box; JDK 15 and 16 even more so.

BTW, the footprint and performance profile of JDK 14 is worlds apart from JDK 8. The JVM has been undergoing some very big changes in the last couple of years.

> And again, the JVM is the worst offender

Nope. https://greenlab.di.uminho.pt/wp-content/uploads/2017/09/pap...

nvarsj · on Aug 5, 2020

Thanks for the paper. The data tables for it just proves my point. For example, in the binary-trees test, Java used 1120 Mb. Go used 228, Rust used 180, Lisp used 373, Haskell used 494 Mb. Only Jruby and Erlang seem to require more memory usage. If a webapp performs similar to this experiment, I could rewrite it in almost any other language and I’d cut the required number of servers in half or more.

pron · on Aug 5, 2020

Java absolutely requires more memory; less so in 14, but still. That does not, however, make your point about either performance or energy, both of which are wrong.

> If a webapp performs similar to this experiment, I could rewrite it in almost any other language and I’d cut the required number of servers in half or more.

Nope. Because you gain in performance where you lose on memory. If you add more RAM, you'd pay for less hardware and consume less energy.

Anyway, if you last experienced Java in version 8, try 14. It's not what you might remember. You'd likely see a 10-40% reduction in cost-per-transaction, plus very low pauses with the new GCs. OpenJDK's VM in 14 is just significantly different from what it was in 8.

merb · on Aug 5, 2020

well no matter what you are trying to say, but without valhalla and other techniques to reduce allocations java will always be slower than the languages the other poster said.

no matter how good a gc gets, if it does need to do double the work of another gc it is not a problem of the gc. it is that too much allocation is needed even for simple things. java tried to tune their gc over and over instead of just fixing the problem. allocations.

pron · on Aug 6, 2020

> java will always be slower than the languages the other poster said

That's true, and while Valhalla will help close the gap, it is important to understand that Java's goal is not to be the fastest language, but to be the language where it's easiest to get to 95% of full performance potential. I don't think any other language achieves that as well as Java. The question, then is how much you're willing to pay for that extra 5%.

In other words, for every Java program X, there exists a C/C++ program Y that's at least as fast. But is the cost of writing Y worth it given the particular performance benefit? The reason Java is so popular is that the answer is very, very often no. And remember that the lower Java costs are not just in the "coding costs" of writing and maintaining, but also in observability. These days, thanks to the Java Flight Recorder, Java gives you unparalleled insight into what your application is doing for very little overhead.

merb · on Aug 6, 2020

well it shouldn't be a rant. for my company java would've been fast enough. however we invested heavily in dotnet core, since it's easier to scale people in our region with dotnet.

we are basically struggeling to find people in the java domain, the amount of people with c# knowledge in my area is so much bigger.

---

however I'm personally a heavy java user (or scala should I say) unfortunatly the more things I did in c# the more I missed stuff especially in plain java. I thing the oracle takeover of java held the language back a bit for a few years. I mean if I look at the releases, they started to catch up with their new model, etc. which is good.

and I think what java these days are missing is a project like django/ror or aspnet.core which drives the web forward with a fully integrated web framework that does not suck. don't get me wrong spring looks promising, but it tries to fullfil every role by being "flexible" which it probably shouldn't be. enforce something. give us a nice little ORM that is not enterprise, look at linq. and especially don't be so enterprisy.

pjmlp · on Aug 6, 2020

GUIs are a good example of that, once C++ ruled in such domains, now managed languages own it, with C and C++ left for those 5% doing the glue code with OS drivers and 3D APIs.

pythonaut_16 · on Aug 5, 2020

You're not really refuting the points so much as just saying "You're wrong".

Granted the parent was anecdotal with their own experiences too, you could at least address the points. E.g. to the point of achieving 100-200Mb daemons, you might reference what RAM footprint you typically expect/experience from a similar program running in JDK 14-16. Rather than just:

> You'd be wrong, because JDK 14 will most likely give you substantially better performance than Go

cogman10 · on Aug 5, 2020

* better GCs (plural)

Probably the best thing about the JVM that, AFAIK, no other platform does is it gives you an option for which GC you use. Not only is that an option, but you are spoiled with options. ZGC, G1GC, Shenandoah, The parallel collector, the serial collector. Pick the right one for the job!

But beyond that, they are infinitely tuneable (Not that you should usually). Again, not something that almost any other platform offers.

saberience · on Aug 5, 2020

Dotnet's GC is way better than Java, Java has been trying to catch up in recent years but it's still miles off.

pjmlp · on Aug 5, 2020

I do both since they exist, in no way is .NET GC able to handle multi-TB heaps with ms pauses.

What helps .NET is the design since the early days to support value types and the introduction of Midori learnings into C# 7 and later for low level programming.

darksaints · on Aug 5, 2020

I'd say it's more the language design than anything. Heap allocating everything, and then throwing in inheritance 30 levels deep makes for some very poor GC behavior.

wokkel · on Aug 5, 2020

What relationship do you see between inheritance and GC ? Creating An instance of some object 30 levels deep in the inheritance chain does not allocate 29 extra objects.

darksaints · on Aug 6, 2020

It's not always that simple. For example, if you inherit from a class that has private members, you are actually going to allocate both the parent and child.

https://stackoverflow.com/questions/27813539/what-happens-in...

Also, if you don't mark your methods and classes as final, you're gonna be allocating functions everywhere as well.

haxen · on Aug 6, 2020

> you are actually going to allocate both the parent and child.

You're going to allocate a single memory block that contains all the state of that object, which of course includes the superclass state. But that has nothing to do with the depth of the class hiererarchy.

> you're gonna be allocating functions everywhere as well.

If by "function" you mean a "Java method", they are code and not data, and they are not allocated dynamically at all.

In any case, this would have nothing to do with them being private or final. These options decide whether there will be an invokevirtual or invokespecial instruction to call them, where invokevirtual has more cost only before getting JIT-optimized.

def_true_false · on Aug 5, 2020

Perhaps Instagram? Or Twitch? Or Discord?

https://news.ycombinator.com/item?id=23144380

lostmyoldone · on Aug 6, 2020

I know I read about some trading firm(s) doing that, but I don't know which, if it was even stated.

Whomever they were, they were rotating pre-warmed jvm images with disabled GC, and were reaching quite respectable latency figures.

To not have to recycle them quickly, you'll want to not generate too much garbage objects, and that's actually easier than one might think in Java. Especially if you accept to restart the jvm from time to time, as you only need to be mostly statically allocated.

Rare error paths can freely use dynamic allocation as long as most of the service doesn't.

Nowadays you can also get away without using strings in most places, using only char sequence flyweights over "statically" allocated char sequences. Otherwise strings were a pain, especially API's that really doesn't need a string (ownership) but had string method arguments nevertheless.

Used like that, as you would on an embedded platform, theres nothing I know of that actually beats the JVM in raw performance while still being somewhat practical in terms of tooling and hiring. Rust might take that crown, we'll see, but I hope so.

OneWay43235392 · on Aug 5, 2020

>> It makes me excited for Rust

Here we go again ...

MaxBarraclough · on Aug 5, 2020

Doesn't seem helpful to use the term green threads here. This isn't a JVM with green threads (those are a thing of the past as I understand it). They're using plain old OpenJDK, and they're ensuring the GC gets a CPU core to itself.

Neat that they were able to get a dramatic improvement in GC latencies on both G1 and ZGC.

No mention of the Shenandoah GC. Would the same trick help out there too?

haxen · on Aug 5, 2020

We did measure on Shenandoah as well, it helped but not enough to be within 10 ms. Since this post is about Hazelcast Jet getting the best latency, we didn't report that.

MaxBarraclough · on Aug 5, 2020

Interesting, thanks.

brabel · on Aug 5, 2020

> This isn't a JVM with green threads

Exactly... I thought they may be talking about Project Loom's Virtual Threads (which are going to be true green Threads) which are available experimentally as of Java 15, given they did use Java 15, but nothing in the post indicates they used that.

haxen · on Aug 5, 2020

We use the same technique of cooperative multithreading under any name, but without the low-level support to be able to write plain sequential Java code. However, even though that changes our internal programming model, the behavior with respect to native threads, interactions with the OS scheduler, CPU caches, etc., should be identical.

pron · on Aug 5, 2020

Loom's virtual threads aren't cooperative, and the implementation actually does interact with the GCs in non-trivial ways. ZGC support was only added to Loom last week, and might still be unstable, but it would be interesting to test. Please report any finding -- be it performance or stability issues -- to the Loom mailing list.

haxen · on Aug 5, 2020

"Virtual threads" are fibers? When you say they aren't cooperative, you mean there's JVM infrastructure that actually pre-empts a fiber?

pron · on Aug 5, 2020

> "Virtual threads" are fibers?

Yes, but we found that the name "virtual thread" works better. For one, these fibres are actually an implementation of java.lang.Thread, so it's the same abstraction; for another, when people hear "fibres" they sometimes compare it to other implementations of fibres that are implemented very differently, and that causes confusion.

> When you say they aren't cooperative, you mean there's JVM infrastructure that actually pre-empts a fiber?

First, I mean that they don't require any explicit yields. Sometimes people confuse "preemptive" with "time-sliced preemption", so I'd rather just say "non-cooperative".

Second, there is a capability to forcefully preempt a virtual thread even if it's in some computation loop, but that's up to a chosen scheduler, and the scheduler isn't part of the JVM. You can supply your own scheduler, written in Java, and you'll be able to choose to preempt a runaway thread. Currently, this capability is not exposed, but it will be eventually.

See http://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part1.h...

haxen · on Aug 5, 2020

> I mean that they don't require any explicit yields.

So in this sense, are Kotlin coroutines also non-cooperative? They don't explicitly yield, but do have a mandatory API to use for "blocking" calls.

pron · on Aug 5, 2020

They do explicitly yield, although not with an explicit keyword or call, but with an explicit type. I guess you could say that the "await" is inferred by the compiler, but it's there. That's not the case with virtual threads, which behave more like Go's goroutines or Erlang's processes.

haxen · on Aug 5, 2020

>They do explicitly yield

Arguably, virtual threads also explicitly yield by calling one of the blocking methods in the JDK. This is very similar to putting all the bottom-level suspendable functions into the Kotlin standard library.

>virtual threads, which behave more like Go's goroutines or Erlang's processes.

I think this can be summarized as "Kotlin uses colored functions and Loom uses non-colored ones". This is a well-established core difference, I thought you had something else in mind with "explicit yield".

pron · on Aug 5, 2020

> Arguably, virtual threads also explicitly yield by calling one of the blocking methods in the JDK.

There's nothing explicit here, or nothing more explicit than the ordinary platform threads you use today -- JDK operations might or might not block an OS thread. There is no way you can tell whether a call does that or not, and that behaviour can change (plus, there's forced preemption as an option).

> This is a well-established core difference, I thought you had something else in mind with "explicit yield".

I would say that an explicit colour for yielding qualifies as an explicit yield, and so falls under cooperative. In any event, virtual threads have neither explicit yield-sites nor special yield-site colours, hence they're non-cooperative, and they support forced preemption.

haxen · on Aug 6, 2020

Without the very last point, forced preemption, they would indeed be cooperative because not calling any blocking method would make them non-cooperative.

This is exactly the same as within the "colored" subspace in a language that has this distinction. As long as all the functions you call are suspendable, you equally have no idea which one will actually suspend.

So, without forced preemption based on GC safepoints, and as long as there is any blocking operation left in the library, Loom qualifies as a cooperative multithreading system.

pron · on Aug 6, 2020

> Without the very last point, forced preemption, they would indeed be cooperative because not calling any blocking method would make them non-cooperative

If user code cannot possibly know whether an operation blocks or not, then it cannot "cooperate."

> As long as all the functions you call are suspendable, you equally have no idea which one will actually suspend.

First, this is true hypothetically but never in practice. Consider that if code really didn't care about blocking, then why not colour everything in the blocking colour? The answer is that in the coloured mode, compilation and cost of the two colours are very different.

Second, and more importantly, the reason that there are two colours is exactly to enable a cooperative style. While a "blocking" routine may or may not block, a non-blocking one never does and that is the crucial difference. With cooperative multi-tasking the default mode is that of a critical section -- there is no scheduling point unless you explicitly know in advance there might be one and where. With preemptive concurrency the default is the opposite: yielding may happen at any time unless you explicitly enter a critical section. This results in very different coding styles.

Anyway, we're arguing over definitions, so you may want to consult Wikipedia's definitions [1] [2].

[1]: https://en.wikipedia.org/wiki/Cooperative_multitasking

[2]: https://en.wikipedia.org/wiki/Preemption_(computing)#PREEMPT...

haxen · on Aug 6, 2020

"processes voluntarily yield control periodically or when idle or logically blocked in order to enable multiple applications to be run concurrently."

This tells me virtual threads (without forced preemption) are cooperative.

pron · on Aug 6, 2020

Virtual threads do not do it voluntarily. They have no knowledge or control over where they might yield. Without forced preemption, I guess you can say that as long as they don't call into the JDK in any way (including e.g. throwing exceptions) or any third-party library then they shouldn't normally expect to yield, but I don't count calling any code you haven't personally written "voluntarily yielding".

We call them preemptive with or without forced preemption -- in line, I think, with the definitions on Wikipedia -- but whatever you choose call them, the concurrency programming style is the same as that with threads today or Go's goroutines, and is different from the style of C#/JS's async/await, Kotlin's coroutines, or more explicit async code, all of which result in user code relying on knowing where yield points (possibly) are (i.e. "critical section" by default). BTW, even with OS threads, when you run transaction-handling code, as opposed to long-running computation, time-sharing preemption is the exception rather than the rule.

haxen · on Aug 6, 2020

I think the distinction is pretty clear: either the mechanism requires cooperation by the application thread (which typically initiates the yield at a compiled-in, predefined point), or it doesn't and the runtime environment preempts it from the outside.

Virtual threads are of the former kind. (At least as long as we don't involve the forced preemption feature).

pron · on Aug 6, 2020

Virtual threads are not cooperative, and OS threads that process transactions are also normally preempted almost exclusively at syscalls initiated by the thread, but I can't stop you from calling them that. The important thing to remember is that you program them like OS threads or Go goroutines or Erlang processes ("interleaving can happen anywhere unless I forbid it") and not like async/await or Kotlin's coroutines or asynchronous code or Windows 3.0 ("interleaving can only happen at certain allowed, known points"), whatever you want to call these two styles.

haxen · on Aug 7, 2020

I concur with your point about the programming model and style, but I do also maintain that "cooperative" vs. "preemptive" is not about that difference. It is a technical difference on how the system interleaves threads and whether it needs cooperation from the code running on them, and not on the programming model and critical sections.

For the distinction you have in mind, I see the terms "colored" vs. "non-colored functions" to be used the most, and they are both within the "cooperative multithreading" space.

nahuel0x · on Aug 5, 2020

Can a Loom scheduler kill a virtual thread in an arbitrary execution point (like Erlang process but unlike goroutines)?

haxen · on Aug 5, 2020

http://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part2.h...

Yes it can and actually it's quite low-hanging fruit on the JVM. GC safepoints are already there, you just have hook into the mechanism.

pron · on Aug 5, 2020

True, but it heavily depends on what you mean by "can." Doing it safely in Java is a problem, as Java code does not protect from shared state. So I would say it's "sort-of, but not really, as it would be very dangerous unless you know exactly what the thread is doing." On the other hand, if the question is, if Erlang were implemented on the Java platform using virtual threads as processes, would code be able to kill a process arbitrarily, then the answer would be yes.

nahuel0x · on Aug 5, 2020

It can be useful for killing a clojure virtual thread who only uses shared memory by reading thread-safe persistent data structures and writes only to clojure atoms/STM (besides his unshared local state). If this is possible, then Loom + clojure can be a better model than erlang for some usages. Myriads of linked actors but with the added feature of shared memory for global views (see Rich Hickey criticism of the actor model) and optimized message passing (you don't need to copy messages if you have a global GC and they are clj persistent data structures). But external killing of a linked actor/vthread -one of erlang usually ignored secret sauces- is fundamental, if not, you need adhoc mechanisms like Go cancellation contexts who IMHO adds a lot of error prone accidental complexity. Think usages beyond supervisors/fault-tolerance like killing obsolete requests/computation or speculative execution.

pron · on Aug 6, 2020

The problem here is that Clojure only appears to other Clojure code to do what you're describing, but heavily relies on mutation and locking under the covers. Any lazy seq in Clojure is actually a mutable data structure that guards mutation with locks. Clojure, however, could emit instructions that check for interruption at sites that are safe for Clojure to interrupt a thread.

nahuel0x · on Aug 6, 2020

Just to understand you better, Clojure lazy-seqs are thread-safe but the Loom killing mechanism is not compatible with sections guarded with locks? So, if you had:

try { lock.lock() // long computation here, no interrupts check } finally { lock.unlock() }

What happens when the virtual thread is externally killed in the middle of the long computation? Nothing at all because is not manually checking for the interrupt token? (like Go, unlike Erlang). Or is interrupted but the finally block is not executed and we get a dangling lock? I know Loom is not finished yet, but I would like to know about his prospective.

pron · on Aug 6, 2020

The forced preemption mechanism that uses VM handshakes doesn't care about locks, so it could hypothetically preempt and kill the thread inside the long computation. If you want to insert explicit interruption checks, that's another matter, and it doesn't require the forced preemption mechanism at all.

tyingq · on Aug 5, 2020

They say they are using this: https://hazelcast.com/blog/idle-green-threads-in-jet/

And then say it's comparable. "This basic design is also present in the concepts of green threads and coroutines. In Hazelcast Jet we call them tasklets."

jjav · on Aug 6, 2020

Even on ten year old hardware single digit ms latency in Java server apps wasn't very special. Java (JVM) is an extremely performant platform so I always find odd how a meme has somehow built up on it being otherwise.

True that one can end up writing terribly inefficient Java code, but one can write terrible code in any language. If I need to write server code where performance is particularly important and I don't want to deal with the cost (in debug time and dev expertise) of C or C++, Java would be my first choice.

Also I'm of the school of thought that performance always matters. Autoscaling in cloud providers sure makes it easy to scale horizontally to make up for slow server code, but once you reach certain size, go have a chat with the finance team about the AWS bill.

ryanthedev · on Aug 5, 2020

Can we talk about no async/await support? Can't build scalable apps when I'm in callback hell. It's like ES5 all over again.

kasperni · on Aug 5, 2020

Java is getting virtual threads/fibers instead of async/await.

From https://cr.openjdk.java.net/~rpressler/loom/Loom-Proposal.ht...

-------------

An alternative solution to that of fibers to concurrency's simplicity vs. performance issue is known as async/await, and has been adopted by C# and Node.js, and will likely be adopted by standard JavaScript. Continuations and fibers dominate async/await in the sense that async/await is easily implemented with continuations (in fact, it can be implemented with a weak form of delimited continuations known as stackless continuations, that don't capture an entire call-stack but only the local context of a single subroutine), but not vice-versa.

While implementing async/await is easier than full-blown continuations and fibers, that solution falls far too short of addressing the problem. While async/await makes code simpler and gives it the appearance of normal, sequential code, like asynchronous code it still requires significant changes to existing code, explicit support in libraries, and does not interoperate well with synchronous code. In other words, it does not solve what's known as the "colored function" problem.

ryanthedev · on Aug 5, 2020

This is some sauce. Ty.

lostmyoldone · on Aug 6, 2020

I have some trouble understanding what people mean by scalable today, especially why people seem to have to run entirely event driven, and not mostly on the socket read/write edges?

Soon to be almost 20 years ago we pushed >10k messages per second on a JVM, using essentially pentium pro class hardware, and the messages spread over thousands of TCP consumers, yielding average latencies well below 0.5 seconds. Not really low latency, but low enough that we didn't need much lower

This was on a purely blocking implementation, because that was before almost anyone did anything like that on the JVM.

With the advancement in async IO, it's got to be possible to drive many millions of sockets, or have really low latency targets before you have to start being really careful?

So what are you guys doing that seems to need so much async code?

With that I mean actual async code, not code having locks but pretending to be async by using callbacks everywhere, because that's somewhat common.

I'm not trying go be rude, I honestly don't get what people are doing that needs more than the JVM should rather easily provide, unless possibly you have super low latency targets?

There seems to be too many that have to resort to quite cumbersome implementation strategies, so I'm starting to think there's some corner of the industry which I have completely missed, and which requires these strategies regularly?

capableweb · on Aug 5, 2020

> Can't build scalable apps when I'm in callback hell

What? We've (JavaScript and developers dealing with asynchronous patterns) been able to build scalable (in terms of code and its maintenance) for many many years, probably 10+.

async/wait is simply syntactic sugar and doesn't drastically change anything, you still need to understand the asynchronicity underneath it all, and if you do, you won't have any problems building scalable apps using your knowledge.

ryanthedev · on Aug 5, 2020

Pretty sure a callback vs synchronous style of coding is a little more than syntactic sugar.

It's a complete shift in application design...

If I have to flatmap one more time...

capableweb · on Aug 5, 2020

> Pretty sure a callback vs synchronous style of coding is a little more than syntactic sugar

Absolutely, asynchronous and synchronous are two very different patterns when programming with very different trade offs for both of them, but async/await is not synchronous, it only make that particular call _look_ synchronous, while actually being asynchronous.

Failing to understand that async/await is just syntactic sugar for dealing with asynchronous programming will sooner or later bite you.

Here is a gist that shows why async/await is not really synchronous (blocking) programming: https://gist.github.com/matt-mcalister/3f060bf32d292cfebc944...

hansdieter1337 · on Aug 5, 2020

I want to see Java on mission critical computers in space!

pjmlp · on Aug 5, 2020

Not on space, but still mission critical,

"Aegis Battleship Weapons System"

http://www.artist-embedded.org/docs/Events/2011/JTRES/Slides...

"French radar system for ballistic missile tracking and measurement"

https://www.militaryaerospace.com/defense-executive/article/...

"NASA Ground Control Station for Multiple UAVs Flight Simulation"

https://www.semanticscholar.org/paper/Ground-Control-Station...

Rebelgecko · on Aug 5, 2020

I've seen plenty of ground station and mission control software in Java. Actually in space a bit less likely though... A lot of that is running in RTOS environments that Java isn't well suited for (forget garbage collection, some flight software projects go as far as to ban dynamic memory allocation)

ginko · on Aug 5, 2020

I would certainly hope that green threads have less than 10 megaseconds latency.

hnarn · on Aug 5, 2020

Was the title changed? Because it says "ms" both here and on jet-start.sh and that unit is milliseconds.

ginko · on Aug 5, 2020

It used to be 'Ms'. Guess it's fixed now.

tupac_speedrap · on Aug 5, 2020

Glad to see some love for legacy programming languages like Java.

pron · on Aug 5, 2020

Well, that's understandable because there are still some legacy companies and organisations left that write a lot of new software in Java, like Apple, Amazon, Netflix, Twitter, Google, Alibaba, Tencent, NASA, GitHub, Microsoft, Facebook, Spotify and nearly all Fortune 500 companies. Plus, if you care about both performance and observability, there aren't many viable alternatives.

BTW, many if not most of the cutting-edge advances in compilation, low-overhead deep profiling, and garbage collection are done on the Java platform, so it's still the technology leader in those areas.

shock · on Aug 5, 2020

I don't think Java qualifies as legacy. I view it rather being in the 'mature' stage of programming languages evolution.

pjmlp · on Aug 5, 2020

If Java is legacy with 25 years, what to say about C++ with 50, or C with 60, Python and Ruby both around 30.

Ah and the beloved OS around here is reaching 60 as well.

Plenty of legacy love.

dionian · on Aug 5, 2020

The Java language is Java's least appealing feature. Fortunately there are seamless alternatives!