
You need a lot of memory bandwidth and large caches, or else the cores will starve. That's also why IBM mainframes have up to 4.5 GB of L4 cache.

OK, just wow. That's more L4 cache than my laptop has RAM. Thanks for that awesome titbit.

PS: don't worry, my upgrade is on its way :p

:D A bit like the moment when I realized that on-CPU cache could now hold a complete DOS system, with programs included...

That's true of all high-frequency, high-core-count hardware. Which is why running Java or Python code on this hardware makes very little sense. Rust is more like it. Golang in a pinch.

It's the opposite. Running lots of poorly optimized processes allows you to amortize memory latency. If your software suffers from cache misses, then it's not going to run out of memory bandwidth any time soon. Adding more threads will increase memory bandwidth utilization. Meanwhile, hyper-optimized AVX-512 code is going to max out memory bandwidth with a dozen cores or fewer.

> it's not going to run out of memory bandwidth any time soon

No, but the higher the memory bandwidth, the sooner those processes can get back to their inefficiency.

That's really not true. Memory bandwidth, just like memory capacity, becomes a bottleneck only when it is exceeded; having more doesn't automatically speed anything up. Java and Python programs will likely be hopping around in memory and, as a result, waiting for data to make it to the CPU.

Typically, memory bandwidth is exceeded only when multiple cores run optimized software that streams through memory and makes heavy use of the prefetcher.

Noob question: is there a fundamental limitation in Java, or is it more that the JVM will need to evolve to make optimal use of such an architecture?

AIUI the relevant weakness of Java here is that it typically has worse memory density and locality than something like Rust.

Consider code which linearly goes through a list of points in 2D space and does some calculation on the coordinates.

In Rust, the list is a Vec<(f64, f64)>. The Vec is a small object containing a pointer to a large block of data which contains all the points packed tightly together. Once the program has dereferenced the pointer and loaded the first point, all the others are immediately after it in memory, in order, containing nothing but the coordinates, and so the processor's caching and prefetching will make them available very quickly.

In Java, the list is an ArrayList<Point2D.Double>. The ArrayList is a small object containing a pointer to an array of pointers to more small objects, one for each point. Each of the small objects has a two-word object header on it. The pointer plus header means that for every two words of coordinates, there are three words of overhead, so the cache is used much less effectively. The small objects aren't necessarily anywhere near one another in memory, or in order, so prefetching won't help.
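To put rough numbers on that, here is a back-of-envelope sketch of the arithmetic. It assumes 8-byte words and a 64-byte cache line, and it takes the two-word header and one-word pointer at face value; real JVMs can differ (compressed pointers, alignment padding), so treat the figures as illustrative only. The class and method names are made up for this example.

```java
// Back-of-envelope layout arithmetic. The word size, cache line size, and the
// header/pointer counts are assumptions for illustration, not measurements.
final class PointLayoutMath {
    static final int WORD_BYTES = 8;        // assumed 64-bit word
    static final int CACHE_LINE_BYTES = 64; // assumed cache line size

    // Packed layout: two coordinates per point and nothing else.
    static int packedBytesPerPoint() {
        return 2 * WORD_BYTES;
    }

    // Boxed layout as counted above: one pointer in the backing array,
    // a two-word object header, and two coordinates.
    static int boxedBytesPerPoint() {
        return (1 + 2 + 2) * WORD_BYTES;
    }

    // Fraction of the bytes touched per point that are coordinate data.
    static double usefulFraction(int bytesPerPoint) {
        return (double) packedBytesPerPoint() / bytesPerPoint;
    }

    public static void main(String[] args) {
        System.out.println("packed bytes/point: " + packedBytesPerPoint());
        System.out.println("boxed bytes/point:  " + boxedBytesPerPoint());
        System.out.println("packed points per cache line: "
                + CACHE_LINE_BYTES / packedBytesPerPoint());
        System.out.println("useful fraction, boxed: "
                + usefulFraction(boxedBytesPerPoint()));
    }
}
```

Under those assumptions the boxed layout spends 40 bytes per point against 16 for the packed one, and that is before counting the cost of the objects being scattered around the heap.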

There are a couple of ways the Java situation can be improved.

Firstly, today, you can replace the naive ArrayList<Point2D.Double> with a more compact structure which keeps all the coordinates in a single big array. This gives you the same efficiency as Rust, but requires programming effort (unless you can find an existing library which does it!), and may give you an API that is less efficient (if it copies coordinates to objects on retrieval) or convenient (if it gives you some cursor/flyweight API).

Secondly, in the future, the JVM could get smarter. In principle, it could do the above rewriting as an optimisation, although I wouldn't want to rely on that. A good garbage collector could bring the small objects together in memory, to improve locality a bit.

Thirdly, in the near-ish future, Java will get value types [0] which behave a lot more like Rust's types. That would give you equally good density and locality without having to jump through hoops.

[0] http://openjdk.java.net/projects/valhalla/
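For the first option (one big coordinate array), a minimal sketch might look like the following. PackedPoints is a hypothetical class written for this comment, not an existing library type; it stores the points interleaved as x0, y0, x1, y1, ... and exposes index-based accessors rather than handing out Point objects, which is one possible answer to the API trade-off mentioned above.

```java
import java.util.Arrays;

// Hypothetical container, not from any library: keeps all coordinates in one
// double[] laid out as x0, y0, x1, y1, ... so a linear pass is sequential.
final class PackedPoints {
    private double[] coords = new double[16];
    private int size = 0; // number of points, not number of doubles

    void add(double x, double y) {
        if (2 * size == coords.length) {
            // Backing array is full: double its capacity.
            coords = Arrays.copyOf(coords, coords.length * 2);
        }
        coords[2 * size] = x;
        coords[2 * size + 1] = y;
        size++;
    }

    int size() { return size; }

    double x(int i) { return coords[2 * checkIndex(i)]; }

    double y(int i) { return coords[2 * checkIndex(i) + 1]; }

    private int checkIndex(int i) {
        if (i < 0 || i >= size) {
            throw new IndexOutOfBoundsException("index " + i + ", size " + size);
        }
        return i;
    }

    // Example linear pass: sum of each point's distance from the origin.
    double sumOfNorms() {
        double total = 0.0;
        for (int i = 0; i < size; i++) {
            total += Math.hypot(coords[2 * i], coords[2 * i + 1]);
        }
        return total;
    }

    public static void main(String[] args) {
        PackedPoints pts = new PackedPoints();
        pts.add(3.0, 4.0);
        pts.add(6.0, 8.0);
        System.out.println(pts.sumOfNorms());
    }
}
```

The accessors avoid allocating an object per read, at the price of a less convenient API than a List of points.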

You will have to tune your code to need as little shared state across threads as you can. It's not fun, but tuning code at this level rarely is.

The synchronization is what actually matters; shared memory that is only read is not a problem.
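A small sketch of that distinction, with made-up names (PartialSums, sumShared, sumLocal): both methods read the same shared array from several threads, but the first does a synchronized update per element while the second accumulates in a thread-private variable and touches shared state once per thread. Both return the same result; how much faster the second is depends on the machine, so no timings are claimed here.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: contrasts per-element synchronization with
// thread-local accumulation over the same read-only shared array.
final class PartialSums {
    interface Chunk { void run(int from, int to); }

    // Contended version: every element causes an atomic update on shared state.
    static long sumShared(long[] data, int threads) {
        AtomicLong total = new AtomicLong();
        runChunks(data.length, threads, (from, to) -> {
            for (int i = from; i < to; i++) total.addAndGet(data[i]);
        });
        return total.get();
    }

    // Local version: the array is still shared for reading, but each thread
    // sums privately and publishes a single result at the end.
    static long sumLocal(long[] data, int threads) {
        AtomicLong total = new AtomicLong();
        runChunks(data.length, threads, (from, to) -> {
            long local = 0;
            for (int i = from; i < to; i++) local += data[i];
            total.addAndGet(local);
        });
        return total.get();
    }

    // Splits [0, n) into contiguous ranges, runs one thread per range,
    // and waits for all of them to finish.
    static void runChunks(int n, int threads, Chunk body) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int from = (int) ((long) n * t / threads);
            final int to = (int) ((long) n * (t + 1) / threads);
            workers[t] = new Thread(() -> body.run(from, to));
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
    }
}
```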
