The JVM world tends to solve this problem by using off-heap caches. See Apache Ignite or Ehcache.
I can't speak for how their Rust cache manages memory, but the thing to be careful of in non-GC runtimes (especially non-copying GC) is memory fragmentation.
It's worth mentioning that the Dgraph folks wrote a better Go cache once they hit the limits of the usual Go caches.
From a purely architectural perspective, I would try to put cacheable material in something like memcache or redis, or one of the many distributed caches out there. But it might not be an option.
It's worth mentioning that Apache Cassandra itself uses an off-heap cache.
"Remarkably, we had only put very basic thought into optimization as the Rust version was written. Even with just basic optimization, Rust was able to outperform the hyper hand-tuned Go version."
This Apple marketing meme needs to die. Reference counting arguably incurs more cost than GC, or at least the cost is spread throughout processing.
I agree with you that most people tend to associate GC with something more advanced nowadays, like mark and sweep as you said in another comment, but it seems pointless to argue that ARC is not a form of GC.
Refcounting in the presence of threads is usually non-deterministic too.
Most people think of Python as a GC'd language, and it mostly uses RC.
Any runtime that uses mark & sweep today may elect to use RC for some subset of the heap at some point in a future design, if that makes more sense. The mix of marking GC vs refcounting GC shouldn't affect the semantics of the program.
Note: that parenthetical is a very big caveat, because properly profile-optimized JVM executables can often achieve exceptional performance/development cost tradeoffs.
In addition however, ARC admits a substantial amount of memory-usage optimization given bytecode, which is now what developers provide to Apple on iOS. Not to mention potential optimizations by allowing Apple to serve last-minute compiled microarchitecture optimized binaries for each device (family).
To satiate the pedants... ARC is more or less GC where calls into the GC mechanism are compiled in statically and where there are at worst deterministic bounds on potential "stop the world" conditions.
While this may not be presently optimal because profile-guided approaches can deliver better performance by tuning allocation pool and collection time parameters, it's arguably a more consistent and statically analyzable approach that, with improvements in compilers, may yield better overall performance. It also provides tight bounds on "stop the world" situations, which also occur far less frequently on mobile platforms than in long-running server applications.
Beyond those theoretical bounds, it's certainly much easier to handle when you have an OS that is loading and unloading applications according to some policy. This is extremely relevant as most sensible apps are not actually long running.
I doubt that. It implies huge costs without giving any benefits of GC.
A typical GC has compaction, nearly stack-like allocation speed, and the ability to allocate a bunch of objects at once (just bump the heap pointer once for the whole batch).
And both Perl and Swift do indeed perform abysmally, usually worse than both GC'd and manually managed languages.
> ARC is more or less GC
Well, no. A typical contemporary GC is generational, often concurrent, allowing fast allocation. ARC is just a primitive allocator with ref/deref attached.
What? This doesn't make any sense. From the cache's POV the stack and the bump-allocated heap are the same thing. Both are contiguous chunks of memory where the next value is allocated right after the previous one.
The only difference between the stack and the bump-allocated heap is that the former has hardware support for pointer bumping and the latter has not. That's all.
The stack pointer moves both directions and the total range of that back-and-forth movement is typically in kilobytes, so it may fit fully in L1.
Just check with perf what happens when you iterate over an array of 100 MB several times and compare that to iterating over 10 kB several times. Both are contiguous but the performance difference is pretty dramatic.
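Not from the thread, but here is a rough sketch of that experiment in Go (sizes approximate, names invented); run it with `go test -bench=.` and compare the MB/s figures rather than ns/op, since the slices differ in length:

    // locality_test.go: stream over ~10 kB (fits in L1) vs ~100 MB
    // (spills to main memory). SetBytes makes the MB/s figures comparable.
    package locality

    import "testing"

    var sink int64 // keeps the compiler from discarding the work

    func sum(xs []int64) int64 {
        var s int64
        for _, x := range xs {
            s += x
        }
        return s
    }

    func benchSum(b *testing.B, n int) {
        xs := make([]int64, n)
        b.SetBytes(int64(n) * 8)
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            sink = sum(xs)
        }
    }

    func BenchmarkSmall(b *testing.B) { benchSum(b, 1250) }     // ~10 kB
    func BenchmarkLarge(b *testing.B) { benchSum(b, 12500000) } // ~100 MB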
Besides that, there is also the effect that the faster you allocate, the faster you run out of new-gen space, and the faster you trigger minor collections. These are not free. The more often you do minor collections, the more likely it is for objects to survive. And the cost is proportional to the survival rate. That's why many Java apps tend to use a pretty big new generation size, hoping that before a collection happens, most young objects die.
This is not just theory - I have seen this too many times, where reducing the allocation rate to nearly zero caused significant speedups - by an order of magnitude or more. Reducing memory traffic is also essential to get good multicore scaling. It doesn't matter that each core has a separate TLAB when their total allocation rate is so high that they are saturating the LLC-to-main-memory link. It is easy to miss this problem with classic method profiling, because a program with this problem manifests as everything being magically slow, with no obvious bottleneck.
Yes, you are right about stack locality. It indeed moves back and forth, keeping the effectively used memory region quite small.
> These are not free. The faster you do minor collections, the more likely it is for the objects to survive. And the cost is proportional to survival rate.
Yes, that's true. Immutable languages do way better here, having small minor heaps (OCaml has 2MB on amd64) and very small survival rates (with many objects being allocated directly on the older heap if they are known in advance to be long-lived).
Now I understand your point better and I agree.
If one goes with reference counting as the GC implementation, then one should take the effort to use hazard pointers and related optimizations.
Java is very fast and 3X slower is a pretty wild claim.
In the real world, you won't get things as optimized in higher-level languages, because optimized code looks completely unidiomatic. A 3x speedup over Java is a pretty normal claim.
Ofc, Apple is not MS, so Swift developer experience and docs kind of suck (if you're not in the "iOS bubble" where it's nice and friendly), and multiplatform dev experience especially sucks even worse...
And "easier to pick up and code in" might not necessarily be an advantage for a systems language - better start slowly putting the puzzle together in your head, and start banging up code others need to see only after you've internalized the language and its advanced features and when (not) to use them! It helps everyone in the long run. This is one reason why I'd bet on Rust!
Well, for fairness, D is quite a bit older than Swift. (It's nearly as much older than Swift as it is younger than C++!) But what do you think pushes Swift out of the "C++ with hindsight" basket?
They also talked about it pretty frequently.
Maybe ggp's claim that D just needs a heavy hitter backing it might be misplaced.
Right now, I think eBay is one of the big companies that uses D.
Edit: Sibling comment beat me to it.
So I doubt they would sponsor D.
I don't think MS has any interest in improving C++ (look at their compiler). But that's not because of competing activities.
Visual C++ is the best commercial implementation of C++ compilers, better not be caught using xlc, aCC, TI, IAR, icc and plenty of other commercial offerings.
If C++ has span, span_view, modules, co-routines, core guidelines, and the lifetime profile static analyser, it is in large part due to work started and heavily contributed to by Microsoft at ISO, and to their collaboration with Google.
As for competing with C++, it is quite clear, especially when comparing the current software development landscape with the 90's, that C++ has lost the app development war.
Nowadays across all major consumer OSes it has been relegated to the drivers and low level OS services layer like visual compositor, graphics engine, GPGPU binding libraries.
Ironically, from those mainstream OSes, Microsoft is the only one that still cares to provide two UI frameworks directly callable from C++.
Which most Windows devs end up ignoring in favour of the .NET bindings, as Kenny Kerr mentions in one of his talks/blog posts.
Back to the D issue, Azure IoT makes use of C# and Rust, and there is Verona at MSR as well, so as much I would like to see them spend some effort on D, I don't see it happening.
Rust has by far the best ecosystem of those 3.
My rebuttal is simple: Rust's web development story isn't out-of-the-box clean like Crystal's, which ships with an HTML language out of the box. So it could be categorized as a poor choice in comparison to Crystal.
Did you mean HTTP server? If so, there are at least 3 good ones in Rust that are only a `cargo add` away. If you've already taken the trouble to set up a toolchain for a new language, surely a single line isn't asking too much.
Its Haskell-like features only help on large and complex programs.
— Maintenance Complexity "keep in mind that this is a subjective metric based on the author's experience!"
Help make one:
— contribute programs that you consider to be "idiomatic"
— use those measurement and web scripts to publish your own "common use cases" "idiomatic code" benchmarks game
>> compare language speed for common use cases vs trying to squeeze every last piece of performance
— please be clear about why you wish to compare the speed of programs written as if speed did not matter
??? What do you think is not "idiomatic" about programs such as:
Would it be fair to accept an optimization in one language and refuse it in another because it is not idiomatic?
Most code will be considerably slower due to a lot of factors.
Java in particular is a very pointer-heavy language, made up of pointers to pointers to pointers everywhere, which is really bad for our modern systems that often are much more memory latency than CPU constrained.
A factor of 2-4x to languages like C++ or Rust for most code seems plausible (and even low) unless the limiting factor is external, like network or file system IO.
It's true that pointer chasing really hurts in some sorts of program and benchmark. For sure. No argument. That's why Project Valhalla exists.
But it's also my view that modern C++ programming gets away with a lot of slow behaviours that people don't really investigate or talk about because they're smeared over the program and thus don't show up in profilers, whereas actually the JVM fixes them everywhere.
C++ programs tend to rely much more heavily on copying large structures around than pointer-heavy programs. This isn't always or even mostly because "value types are fast". It's usually because C++ doesn't have good memory management so resource management and memory layout gets conflated, e.g. std::vector<BigObject>. You can't measure this because the overheads are spread out over the entire program and inlined everywhere, so don't really show up in profiling. For the same reasons C++ programs rely heavily on over-specialised generics where the specialisation isn't actually a perf win but rather a side effect of the desire for automatic resource management, which leads to notorious problems with code bloat and (especially) compile time bloat.
Another source of normally obscured C++ performance issues is the heap. We know malloc is very slow because people so frequently roll their own allocators that the STL supports this behaviour out of the box. But malloc/new is also completely endemic all over C++ codebases. Custom allocators are rare and restricted to very hot paths in very well optimised programs. On the JVM allocation is always so fast it's nearly free, and if you're not actually saturating every core on the machine 100% of the time, allocation effectively is free because all the work is pushed to the spare cores doing GC.
Yet another source of problems is cases where the C++ programmer doesn't or can't actually ensure all data is laid out in memory together because the needed layouts are dynamically changing. In this case a moving GC like in the JVM can yield big cache hit rate wins because the GC will move objects that refer to each other together, even if they were allocated far apart in time. This effect is measurable in modern JVMs where the GC can be disabled:
And finally some styles of C++ program involve a lot of virtual methods that aren't always used, because e.g. there is a base class that has multiple implementations but in any given run of the program only one base class is used (unit tests vs prod, selected by command line flag etc). JVM can devirtualise these calls and make them free, but C++ compilers usually don't.
On the other hand all these things can be obscured by the fact that C++ these days tends only to be used in codebases where performance is considered important, so C++ devs write performance tuned code by default (or what they think is tuned at least). Whereas higher level languages get used for every kind of program, including the common kind where performance isn't that big of a deal.
Without global knowledge of memory lifetimes, maintainers make local decisions to copy rather than share.
Allocation in a C++ program is going to be about the same speed as in a Java program. Modern mallocs are doing basically the same thing on the hot-path: bumping the index on a local slab allocator.
Here are some benchmarks; I'll leave to the experts out there to confirm or dismiss them.
If anything the gap is increasing not shrinking. JVM is terrible at memory access patterns due to the design of the language, and designing for memory is increasingly critical for maximum performance on modern systems. All the clever JIT'ing in the world can't save you from the constant pointer chasing, poor cache locality, and poor prefetching.
The gap won't shrink until Java has value types. Which is on the roadmap, yes, but still doesn't exist just yet.
The problem with those benchmarks is if you look at the Java code you'll see it's highly non-idiomatic. Almost no classes or allocations. They almost all exclusively use primitives and raw arrays. Even then it still doesn't match the performance on average of the C (or similar) versions, but if you add the real-world structure you'd find in any substantial project that performance drops off.
Plenty of `inner static classes`. Where's the up-to-date comparison showing that `static` makes a performance difference for those tiny programs?
Plenty of TreeNode allocations.
> The problem with …
The problem with saying "in any substantial project that performance drops off" is when we don't show any evidence.
I don't know what you mean by "copy-only types." I'm not finding any reference to that terminology in the context of language design.
Sorry about the confusion. I meant for the quotes around "copy-only" to indicate that it wasn't really a standard term, but I marked "value types" the same way, so that didn't really work. By "copy-only" I meant something you couldn't have more than one reference to: every name (variable) to which you assign the data would have its own independent copy.
That's not really a requirement of value types, no. C# has value types (structs) and you can have references to them as well (ref & out params).
In general though yes it would be typically copied around, same as an 'int' is. Particularly if Java doesn't add something equivalent to ref/out params.
There are only a few applications for which it isn’t acceptable.
Just look at all the massive apps built on Electron. People wouldn’t do that if it wasn’t effective.
I used to think that but now I’m not so sure. :(
— memory usage is less different for tiny programs that do allocate memory: reverse-complement, k-nucleotide, binary-trees, mandelbrot, regex-redux
I mean, those 1GB RAM 7 years old slow as molasses phones getting dust into drawers or being littered into landfills would scream if they didn't have to run everything behind a VM.
Google's spyware is becoming more invasive and thus more memory-hungry.
And the "hardness" of writing lockless structures is strongly offset by libraries, so unless you're doing something very exotic, it is rarely a real problem.
Tight, low-level code in Java and Go is roughly as fast as average C code. The Go compiler is known to be less good at optimizing code than e.g. GCC, but in many cases this makes little practical difference, while the Java JIT compilers have become excellent, to the point where they often beat GCC, especially as they can use run-time profiling for code optimization. So they can optimize the code for the actual task at hand.
Where the languages differ in "speed" is their runtime environment. Java and Go are languages with garbage collection, which of course means that some amount of CPU is required to perform GC. But as modern garbage collectors run in parallel with the program, this CPU effort is often not a bottleneck. On the other hand, manual memory management has its own performance trade-offs, which in many cases can make it quite slow on its own.
But now I sound like a Geico (insurance) commercial. Sorry about that.
For those who care, I was interested how off-heap caching works in Java and I did some quick searching around the Apache Ignite code.
The meat is here:
- GridUnsafeMemory, an implementation of access to entries allocated off-heap. This appears to implement some common Ignite interface, and invokes calls to a “GridUnsafe” class https://github.com/apache/ignite/blob/53e47e9191d717b3eec495...
- This class is the closest to the JVM’s native memory, and wraps sun.misc.Unsafe: https://github.com/apache/ignite/blob/53e47e9191d717b3eec495...
- And this, sun.misc.Unsafe, is what it’s all about: http://www.docjar.com/docs/api/sun/misc/Unsafe.html
It’s very interesting because I did my fair share of JNI work, and context switches between JVM and native code are typically fairly expensive. My guess is that this class was likely one of the reasons why Sun ended up implementing their (undocumented) JavaCritical* etc functions and the likes.
Unsafe was one of the cooler aspects to Java that Oracle is actively killing for, well, no good reason at least.
For instance fast access to un-GCd off heap memory is being added at the moment via the MemoryLayout class. Once that's here apps that upgrade won't need to use Unsafe anymore. MemoryLayout gives equivalent performance but with bounds checked accesses, so you can't accidentally corrupt the heap and crash the JVM.
They've been at it for a long time now. For instance VarHandle exposes various low level tools like different kinds of memory barriers that are needed to implement low level concurrency constructs. They're working on replacements for some of the anonymous class stuff too.
I mean, there's the obvious reason that it breaks the memory safety aspect that Java in general guarantees. The whole point of the feature is to subvert the language & expectations.
I'm not saying they should remove it, but it's pretty hard to argue there's "no good reason" to kill it, either. It is, after all, taking the worst parts of C and ramming it into a language that is otherwise immune from that entire class of problems.
I'm guessing at least some of that was a side effect of wanting to support C++; not having pointers as an option would have killed C++/CLI from the get go.
Project Panama is what is driving that effort.
Aren't these Unsafe memory read and write methods intrinsified by any serious compiler? I don't believe they're using JNI or doing any kind of managed/native transition, except in the interpreter. They turn into the same memory read and write operations in the compiler's intermediate representation as Java field read and writes do.
I'm now guessing that this might actually have been an intended use case for those Unsafe classes. It makes total sense and I can see how that will be very fast.
As far as I know, a mark-and-sweep collector like Go's doesn't have any advantage over malloc/free when it comes to memory fragmentation. Am I missing some way in which Go's GC helps with fragmentation?
The guts of BTreeMap's memory management code is here: https://github.com/rust-lang/rust/blob/master/src/liballoc/c.... (warning: it is some of the most gnarly rust code I've ever come across, very dense, complex, and heavy on raw pointers. this is not a criticism at all, just in terms of readability). Anecdotally I've had very good results using BTreeMap in my own projects.
In terms of how the "global" allocator impacts performance, I'd expect it to play a bigger role in terms of Strings (I mean, it's a chat program), and possibly in terms of how the async code "desugars" in storing futures and callbacks on the heap (just guessing, I'm not an expert on the rust async internals).
One of these days I’ll get around to turning my Rust set implementation into a full-blown map (it’s already <50% the size of BTreeSet for ints)...
What I wanted to understand is what is the difference in fragmentation between a non-copying, non-compacting GC and a non-GC runtime.
Yeah, but I really don't buy your argument.
When you are reduced to doing manual memory management and fighting your language's GC, maybe you should simply not use a language with GC in the first place.
They are right to use Rust (or C/C++) for that. It's not for nothing that Redis (C) is so successful in the LRU domain.
> It's worth mentioning that Apache Cassandra itself uses an off-heap cache.
And still ScyllaDB (C++) is able to completely destroy Cassandra in terms of average latency.
I only skimmed the article, but the problem they had with Go seems to be the GC cost incurred from having a large cache. Their cache eviction algorithm was efficient, but every 2 minutes there was a GC run which slowed things down. Re-implementing this algorithm in Rust gave them better performance because the memory was freed right after the cache eviction.
Splitting it across more processes will result in more cache misses and more DB calls.
Now I'm wondering if there's a Rust library for a generational copying arena--one that compacts strings/blobs over time.
In other words, doing some of the work a moving, compacting collector would do during compaction, but continuously during normal program execution.
Possibly half and half, I don't remember what language it was (possibly obj-c?) which would hand out pointers, and on needing to move the allocations it'd transform the existing site into a "redirection table". Accessing pointers would check if they were being redirected, and update themselves to the new location if necessary.
edit: might have been the global refcount table? Not sure.
I have no idea if this actually a good idea, seems like you get rid of a lot of the cache locality advantages.
Rust is better suited to deal with it since there's a similar issue with refcounting across threads, so you might be able to get away with doing it for objects that are exclusive to one thread.
You cannot use a caching server at that scale with those latency requirements. It has to be embedded.
Can you speak to why using something like memcache or redis may not be an option?
Looks like this issue was resolved for maps that don't contain pointers. From the article, it sounds like the map keys were strings (which do contain pointers, so the map would need to be scanned by the GC).
If pointers in the map keys and values could be avoided, it would have (if my understanding is correct) removed the need for the GC to scan the map. You could do this, for example, by replacing string keys with fixed-size byte arrays. Curious if you experimented with this approach?
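For illustration only (this is not code from the article, and the field layout is invented), a Go sketch of that approach: neither keys nor values contain pointers, so the GC should not need to trace anything inside the map.

    package cache

    // Key is a fixed-size, pointer-free stand-in for a string key,
    // e.g. a hash or a fixed-width encoding of it.
    type Key [16]byte

    // Value holds only scalar fields: no strings, slices or pointers.
    type Value struct {
        LastMessageID uint64
        MentionCount  uint32
        Flags         uint32
    }

    // ReadStates is a map whose contents the GC can treat as opaque bytes.
    type ReadStates struct {
        m map[Key]Value
    }

    func New() *ReadStates {
        return &ReadStates{m: make(map[Key]Value)}
    }

    func (r *ReadStates) Put(k Key, v Value) { r.m[k] = v }

    func (r *ReadStates) Get(k Key) (Value, bool) {
        v, ok := r.m[k]
        return v, ok
    }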
After spending weeks fighting with Java's GC tuning for a similar production service tail latency problem, I wouldn't want to be caught having to do that again.
We use C++ at Scylla (saw that we got a shout-out in the blog! Woot!) but it's not like there isn't a whole industry about writing blogs avoiding C++ pitfalls.
C++ pitfalls through the years...
• https://www.horstmann.com/cpp/pitfalls.html (1997)
• https://stackoverflow.com/questions/30373/what-c-pitfalls-sh... (2008)
• http://blog.davidecoppola.com/2013/09/cpp-pitfalls/ (2013)
• https://www.typemock.com/pitfalls-c/ (2018)
I am not saying any of these (Go, Rust, C++, or even Java) are "right" or "wrong" per se, because that determination is situational. Are you trying to optimize for performance, for code safety, for taking advantage of specific OS hooks, or oppositely, to be generically deployable across OSes, or for ease of development? For the devs at Scylla, the core DB code is C++. Some of our drivers and utilities are Golang (like our shard aware driver). There's also a Cassandra Rust driver — it'd be sweet if someone wants to make it shard-aware for Scylla!
Actually we didn't update the reference to Cassandra in the article -- the read states workload is now on Scylla too, as of last week. ;)
We'll be writing up a blog post on our migration with Scylla at some point in the next few months, but we've been super happy with it. I replaced our TokuMX cluster with it and it's faster, more reliable, _and_ cheaper (including the support contract). Pretty great for us.
Once I spend even the plurality of my time cleaning up messes instead of doing something new (and there are ways to do both), then all the life is sucked out of me and I just have to escape.
Telling me that I have to keep using a tool with known issues that we have to work around with process or patches would be super frustrating. And the more times we stumble over that problem, the worse my confirmation bias will get.
Even if the new solution has a bunch of other problems, the set that is making someone unhappy is the one that will cause them to switch teams or quit. This is one area where management is in a tough spot with respect to rewrites.
Rewrites don't often fix many things, but if you suspect they're the only thing between you and massive employee turnover, you're between a rock and a hard place. The product is going to change dramatically, regardless of what decision you make.
> Telling me that I have to keep using a tool with known issues that we have to work around with process or patches would be super frustrating.
All tools have known issues. It's just that some have way more issues than others. And some may hurt more than others.
Go has reached an interesting compromise. It has some elegant constructs and interesting design choices (like static compilation which also happens to be fast). The language is simple, so much so that you can learn the basics and start writing useful stuff in a weekend. But it is even more limiting than Java. A Lisp, this thing is not. You can't get very creative – which is an outstanding property for 'enterprises'. Boring, verbose code that makes you want to pull your teeth out is the name of the game.
And I'm saying this as someone who dragged a team kicking and screaming from Python to Go. That's on them – no-one has written a single line of unit tests in years, so now they at least get a whiny compiler which will do basic sanity checks before things blow up in prod. Things still 'panic', but less frequently.
"Referring to the term of "janitors" as demeaning is pretty demeaning and says more about you than your judgement of the parent."
I don't like this rhetoric device you just used.
Also, I think that janitors do important work as well.
The demeaning of janitors was introduced by GP by describing it as something they would rather not do.
No mental gymnastics required.
> The common factor in most of my decisions to look for a new job has been realizing that I feel like a very highly compensated janitor instead of a developer.
So for that person, feeling like a janitor is incentive for seeking a new job. It's that simple really.
Really all languages with tracing GC are at a disadvantage when you have a huge number of long-lived objects in the heap. The situation is improved with generational GC (which Go doesn't have) but the widespread use of off-heap data structures to solve the problem even in languages like Java with generational GC suggests this alone isn't a good enough solution.
In Go's defense, I don't know another GC'ed language in which this optimization is present in the native map data structure.
Since you mention not knowing such languages, have a look at Eiffel, D, Modula-3, Active Oberon, Nim, and C#/F# (especially after the latest improvements).
Also Java will eventually follow the same idea as Eiffel (where inline classes are similar to expanded classes in Eiffel), and ByteBuffers can be off-GC heap.
Go 1.9 is fairly old (1.14 is about to pop out), and there have been large improvements on tail latency for the Go GC over that period.
One of the Go 1.12 improvements in particular seems to at least symptomatically line up with what they described, at least at the level of detail covered in the blog post:
“Go 1.12 significantly improves the performance of sweeping when a large fraction of the heap remains live.”
The problem is that garbage collectors are optimized for applications that mostly have short-lived objects, and a small amount of long-lived objects.
Things like large in-RAM LRU are basically the slowest thing for a garbage collector to do, because the mark-and-sweep phase always has to go through the entire cache, and because you're constantly generating garbage that needs to be cleaned.
I think it's not quite that.
Applications typically have a much larger old generation than young generation, i.e. many more long lived objects than short lived objects. So GCs do get optimized to process large heaps of old objects quickly and efficiently, e.g. with concurrent mark/sweep.
However as an additional optimization, there is the observation that once an application has reached steady state, most newly allocated objects die young (think: the data associated with processing a single HTTP request or user interaction in a UI).
So as an additional optimization, GCs often split their heap into a young and an old generation, where collecting the young generation earlier/more frequently reduces the overall amount of garbage collection work done (and offsets the effort required to move objects around).
In the case of Go though, the programming language allows "internal pointers", i.e. pointers to members of objects. This makes it much harder (or much more costly) to implement a generational, moving garbage collector, so Go does not actually have a young/old generation split nor the additional optimization for young objects.
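A trivial illustration of such an internal pointer in Go (made-up types): a moving collector would have to find and rewrite p as well, not just pointers to the record itself.

    package main

    import "fmt"

    type record struct {
        a, b int64
    }

    func main() {
        r := &record{a: 1, b: 2}
        p := &r.b // interior pointer into the middle of r's allocation
        *p = 42   // a moving GC would have to update p if it relocated r
        fmt.Println(r.b)
    }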
The allocation is going to be close to the last allocation, which was touched recently, no? The first allocation after a compaction will be far from recent allocations, but close to the compacted objects?
An extreme case of that problem happens when using GC in an app that gets swapped out. Performance drops to virtually zero then.
There would be no need for a GC to traverse the entire map, but that's because rust doesn't use a GC.
So you could argue that they are still going to suffer some of the downsides of a GC'ed memory allocation. Some potential issues include non-deterministic object lifespan, and ensuring that any unsafe code they write which interacts with the cache does the "right thing" with the reference counts (potentially including de-allocation; I'm not sure what unsafe code needs to do when referencing reference counted boxes).
That's so misleading as to essentially be a lie.
Rust uses reference counting if and only if you opt into it via reference-counted pointers. Using Rc or Arc is not the normal or default course of action, and I'm not aware of any situation where it is ubiquitous.
> So you could argue [nonsense]
No, you really could not.
If there’s a better way to handle a global LRU cache, I’m all ears.
This is made possible by rust's memory model, where it understands ownership of data, and the lifetime of each reference that's being taken from that owned data. This means that the compiler can statically determine how long an object needs to live, and that references to the object don't outlive the owned data. For use-cases where the lifetime of references are able to be statically understood, an arc/rc is not required. This blog-post goes into it in much better detail than I can: https://words.steveklabnik.com/borrow-checking-escape-analys...
Locking to one thread at a time seems like a pretty obvious performance flaw. It just doesn't seem like an appropriate design for the given workload (lots of requests, lots of stored items, largely write-only (except for each item's position in the queue)). It would make a lot more sense to grant multiple threads access to the LRU at any given time.
And early optimization and all that aside, creating the LRU in such a way that it can be easily restricted to one thread or opened up makes the most sense to me. Otherwise, you get to re-write the LRU (and all the code which accesses it) if it should be identified as a bottleneck.
Of course, I'm not responsible for the code or truly involved in the design process, so my perspective may be limited.
You can scale out by running multiple instances of the service (shared-nothing, N many depending on how many cores you want to run on). Or, you can do message-passing between cores.
In this case, we have 2 modes of scale-up/out (add more nodes to the cluster, or add more shared-nothing LRU caches that are partitioned internally that the process runs, allowing for more concurrency).
We however only run one LRU per node, as it turns out that the expensive part is not the bottleneck here, nor will it probably ever be.
I also think it is great that Discord is using the right tool for the job. It isn't often that you need the performance gains that Rust & Tokio provide, so pick what works best to get the job done and iterate.
> Rust is blazingly fast and memory-efficient: with no runtime or garbage collector, it can power performance-critical services, run on embedded devices, and easily integrate with other languages.
I’m not so sure they would have done the rewrite if the Go GC was performing better, and the choice of Rust seems primarily based on prior experience at the company writing performance sensitive code rather than delivering business value.
Collections are one of the big areas where Go's lack of generics really hurts it. In Go, if one of the built-in collections does not meet your needs, you are going to take a safety and ergonomics hit going to a custom collection. In Rust, if one of the standard collections does not meet your needs, you (or someone else) can create a pretty much drop-in replacement that does, with similar ergonomics and safety.
A corollary to this is that adding more generic collections to Go’s standard library implies expanding the set of magical constructs.
It is rather that due to a lack of genericity (namely const generics) you can't implement traits for [T; N] generically; you have to implement them for each size individually. So there has to be an upper bound somewhere, and the stdlib developers arbitrarily picked 32 for stdlib traits on arrays.
A not entirely dissimilar limit tends to be placed on tuples, and implementing traits / typeclasses / interfaces on them. Again the stdlib has picked an arbitrary limit, here 12, the same issue can be seen in e.g. Haskell (where Show is "only" instanced on tuples up to size 15).
These are not "weird hacks", they're logical consequences of memory and file size not being infinite, so if you can't express something fully generically… you have to stop at one point.
 here's 47 https://play.rust-lang.org/?version=stable&mode=debug&editio...
 even if you use macros to codegen your impl block
(anecdotal: in Java I've never needed anything else than a HashMap or an ArrayList)
The idea for this is (if I remember correctly) to be able to return unused memory to the OS. As returning memory requires a GC to run, one is forced at regular intervals. I am a bit surprised that they didn't contact the corresponding Go developers, as they seem to be interested in practical use cases where the GC doesn't perform well. Besides the fact that newer Go releases improved GC performance, I am a bit surprised that they didn't just increase this time interval to an arbitrarily large number and check whether their issues went away.
To be fair, most languages without GCs also don't have good language constructs to support manual memory management. If you're going to make wide use of manual memory management, you should think very carefully about how the language and ecosystem you're using help or hinder your manual memory management.
Compared to if I wrote something in house entirely in C... lolno
The main issue however is manpower. At my current client, one of the technologies still actively being used for this reason is PHP (which is a horrible fit for microservices for a lot of reasons), because they have a ton of PHP devs employed, and finding a ton of (good) people with something more fitting like Go or Rust knowledge is hard and risky and training costs a lot of money (and more importantly: time)...
And that's also how management usually sees it, and if they're smart they also realise that the first project using an unfamiliar technology is usually one to throw away.
Of course it has some advantages, but it's hardly universally better.
> We figured we could tune the garbage collector to happen more often in order to prevent large spikes, so we implemented an endpoint on the service to change the garbage collector GC Percent on the fly. Unfortunately, no matter how we configured the GC percent nothing changed. How could that be? It turns out, it was because we were not allocating memory quickly enough for it to force garbage collection to happen more often.
As someone not too familiar with GC design, this seems like an absurd hack. That this 2-minute hardcoded limitation is not even configurable comes across as amateurish even. I have no experience with Go -- do people simply live with this and not talk about it?
> The ballast in our application is a large allocation of memory that provides stability to the heap.
> As noted earlier, the GC will trigger every time the heap size doubles. The heap size is the total size of allocations on the heap. Therefore, if a ballast of 10 GiB is allocated, the next GC will only trigger when the heap size grows to 20 GiB. At that point, there will be roughly 10 GiB of ballast + 10 GiB of other allocations.
> For those interested, there is a proposal to add a target heap size flag to the GC which will hopefully make its way into the Go runtime soon.
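The trick described in that post boils down to roughly this (the 10 GiB figure is from the quote; everything else here is a simplified, illustrative sketch):

    package main

    import (
        "runtime"
        "time"
    )

    func main() {
        // Ballast: a large allocation that is never written to. It inflates
        // the live heap, so the "collect when the heap doubles" rule fires
        // far less often. Untouched pages are not backed by physical memory.
        ballast := make([]byte, 10<<30)

        // ... run the actual service here (placeholder) ...
        time.Sleep(time.Hour)

        // Keep the ballast reachable for the lifetime of the process.
        runtime.KeepAlive(ballast)
    }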
What's wrong with the existing parameter?
I'm sure they aren't going this far to avoid all environment config without a good reason, but any good reason would be a flaw in some part of their stack.
Also, it is almost trivial to edit the Go sources (they are included in the distribution) and rebuild it, which usually takes just a minute. So Go is really suited for your own experiments - especially as Go is implemented in Go.
Well, parts of it. You can't implement "make" or "new" in Go yourself, for example.
I do, I'm just objecting to "Go is implemented in Go".
The stable compiler is permitted to use unstable features in stable builds, but only for compiling the compiler. In essence, there are some Rust features that are supported by the compiler but only permitted to be used by the compiler. Unsurprisingly, various non-compiler users of Rust have decided that they want those features and begun setting the RUSTC_BOOTSTRAP envvar to build things other than the compiler, prompting consternation from the compiler team.
I don't see why you couldn't do something similar in your own Go code. It just won't be as convenient to use as the compiler wouldn't fill in the type information (element size, suitable hash function, etc.) for you. You'd have to pass that yourself or provide type-specific wrappers invoking the unsafe base implementation. More or less like you would do in C, with some extra care to abide by the rules required for unsafe Go code.
Obviously they consider spending 50% more on hardware is a worthwhile compromise for the gains they get (e.g. reduction of developer hours and reduced risk of security flaws or avoiding other effects of invalid pointers).
If you do a lot of allocations, the GC overhead rises of course, but so would the effort of doing allocations/deallocations under a manual management scheme. In the end it is a bit of a trade-off as to what fits the problem at hand best. The nice thing about Rust is that "manual" memory management doesn't come at the price of program correctness.
As a consequence, the heap pressure of a Go program is not necessarily significantly larger than that of an equivalent C or Rust program.
On the contrary, in e.g. C I can wrap two 32-bit fields in a struct and freely pass them anywhere with zero heap allocations.
Also, record types are not going to fix the pointer-chasing problem with arrays. This is promised by Valhalla, but I've been hearing about it for 3 years or more now.
> We kept digging and learned the spikes were huge not because of a massive amount of ready-to-free memory, but because the garbage collector needed to scan the entire LRU cache in order to determine if the memory was truly free from references.
So maybe this is one of those things that just doesn't come up in most cases? Maybe most services also generate enough garbage that that 2-minute maximum doesn't really come into play?
But (virtually) nobody is writing games in Go, so it's entirely possible that it's an unusual case in the Go ecosystem. Being an unsupported usecase is a great reason to switch language.
I started up KSP just now, and it was at 5.57GB before I even got to the main menu. To be fair, I hadn't launched it recently, so it was installing its updates or whatever. Ok, I launched it again, and at the main menu it's sitting on 5.46GB. (This is on a Mac.) At Mission Control, I'm not even playing the game yet, and the process is using 6.3GB.
I think a better takeaway is that you can get away with GC even in games now, because it sucks and is inefficient but it's ... good enough. We're all conditioned to put up with inefficient software everywhere, so it doesn't even hurt that much anymore when it totally sucks.
Is this true? Go was built specifically for C++ developers, which, even when Go was first released, was a pretty unpopular language for writing web services (though maybe not at Google?). That a non-trivial number of Ruby/Python/Node developers switched was unexpected. (1)
> Our target community was ourselves, of course, and the broader systems-programming community inside Google. (1)
IDK if this is true for earlier versions, but as of today C# has pretty clear rules: 16MB on desktop or 64MB on server (which type is used can be set via config) will trigger a full GC. Note that less than that may trigger a lower-level GC, but those are usually not the ones that are noticed. I'm guessing at least some of that is because of memory locality as well as the small sizes.
On the other hand, in a lot of the Unity related C# posts I see on forums/etc, passing structs around is considered the 'performant' way to do things to minimize GC pressure.
Often something like Redis is used as a shared cache that is invisible to the garbage collector, there is a natural key with a weak reference (by name) into a KV store. One could embed a KV store into an application that the GC can't scan into.
Doesn't Go have something like this available? It's an obvious thing for garbage-collected languages to have.
You could maybe hack around the GC performance without destroying the aims of LRU eviction by batching additions to your LRU data structure to reduce the number of pointers by a factor of N. It's also possible that a Go BTree indexed by timestamp, with embedded data, would provide acceptable LRU performance and would be much friendlier on the cache. But it might also not have acceptable performance. And Go's lack of generic datastructures makes this trickier to implement vs Rust's BtreeMap provided out of the box.
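A rough sketch of the batching idea (the chunk size and field layout are made up): the recency list then holds one pointer per 64 entries instead of one per entry, so the mark phase has far fewer pointers to chase.

    package lru

    // entry is fixed-size and pointer-free.
    type entry struct {
        key   [16]byte
        value [64]byte
        atime int64
    }

    // chunk stores 64 entries inline; `next` is the only pointer the GC
    // has to follow for all of them.
    type chunk struct {
        entries [64]entry
        used    int
        next    *chunk
    }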
This is something important to know before choosing a GC-based language for a task like this. I don't think "generating more garbage" would help; the problem is that the scan is slow.
If Discord was forced to do this in pure Go, there is a solution, which is basically to allocate a byte slice or a set of byte slices and then treat them as an expanse of memory yourself, managing hashing, etc.: basically, doing manual arena allocation. GC would drop to basically zero in that case, because the GC would only see the byte slices, not all the contents as individual objects. You'll see this technique used in GC'd languages, including Java.
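A minimal sketch of what that can look like (record layout, sizes, and names all invented; a real version needs hashing, eviction, and much more care): fixed-size slots carved out of one big byte slice, so the GC sees a single pointer-free allocation.

    package arena

    import "encoding/binary"

    const slotSize = 128 // fixed record size, arbitrary for this sketch

    // Arena hands out fixed-size slots from one big []byte.
    type Arena struct {
        buf  []byte
        next int
    }

    func New(slots int) *Arena {
        return &Arena{buf: make([]byte, slots*slotSize)}
    }

    // Put copies a record into the next free slot and returns its index.
    func (a *Arena) Put(id uint64, payload []byte) int {
        if (a.next+1)*slotSize > len(a.buf) {
            panic("arena full; a real cache would evict instead")
        }
        i := a.next
        a.next++
        slot := a.buf[i*slotSize : (i+1)*slotSize]
        binary.LittleEndian.PutUint64(slot[:8], id)
        copy(slot[8:], payload) // silently truncates long payloads in this sketch
        return i
    }

    // Get returns the id and payload bytes stored in slot i.
    func (a *Arena) Get(i int) (uint64, []byte) {
        slot := a.buf[i*slotSize : (i+1)*slotSize]
        return binary.LittleEndian.Uint64(slot[:8]), slot[8:]
    }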
But it's tricky code. At that point you've shucked off all the conveniences and features of modern languages and in terms of memory safety within the context of the byte expanses, you're writing in assembler. (You can't escape those arrays, which is still nice, but hardly the only possible issue.)
Which is, of course, where Rust comes in. The tricky code you'd be writing in Go/Java/other GC'd language with tons of tricky bugs, you end up writing with compiler support and built-in static checking in Rust.
I would imagine the Discord team evaluated the option of just grabbing some byte arrays and going to town, but it's fairly scary code to write. There are more ways than I can even describe for such code to end up with a 0.00001% bug that results in something like the entire data structure getting intermittently trashed every six days on average, virtually impossible to pick up in testing and possibly even escaping canary deploys.
Probably some other languages have libraries that could support this use case. I know Go doesn't ship with one, and at first guess I wouldn't expect to find one for Go, or at least not one I would expect to stand up at this scale. Besides, honestly, at the limit of feature-set maturity for such a library, you just end up with "a non-GC'd inner platform" for your GC'd language, and may well be better off getting a real non-GC'd platform that isn't an inner platform. I've learned to really hate inner platforms.
By contrast... I'd bet this is fairly "boring" Rust code, and way, way less scary to deploy.
To be clear: I wasn't suggesting that generating garbage would help anyone. Only that in a more typical case, where more garbage is being generated, the two minute interval itself might never surface as the problem because other things are getting in front of it.
At least with code hacking around the GC's behavior that code ends up being portable across the ecosystem.
There doesn't seem to really be a good option here either way. This amount of tuning-by-brute-force (either by knobs or by code re-writes) seems to just be the cost of using a GC.
Or do you think it adds to the discussion?
It is worth it to read and understand:
If these issues were more common, there would be more configuration available.
[EDIT] to downvoters: I'm not saying it's not an issue worth addressing (and it may have already been since they were on 1.9), I was just answering the question of "why this might happen"
More details here: https://golang.org/pkg/runtime/
I think with that, you could turn off GC after startup, then turn it back on at desired intervals (e.g. once an hour or after X cache misses).
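Something along these lines, I think (sketch only; SetGCPercent and runtime.GC are real, the interval is arbitrary):

    package main

    import (
        "runtime"
        "runtime/debug"
        "time"
    )

    func main() {
        // Disable the automatic collector after startup...
        debug.SetGCPercent(-1)

        // ...and force collections on our own schedule instead.
        go func() {
            for range time.Tick(time.Hour) {
                runtime.GC()
            }
        }()

        // run the service (placeholder)
        select {}
    }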
It's definitely risky though. E.g. if there is a hiccup with the database backend, the client library might suddenly produce more garbage than normal, and all instances might OOM near the same time. When they all restart with cold caches, they might hammer the database again and cause the issue to repeat.
CloudFront, for this reason, allocates heterogeneous fleets in its PoPs, which have different RAM sizes and CPUs, and even different software versions.
> When they all restart with cold caches, they might hammer the database again and cause the issue to repeat.
Reminds me of the DynamoDB outage of 2015 that essentially took out us-east-1. Also, ELB had a similar outage due to an unending backlog of work.
Someone must write a book on design patterns for distributed system outages or something?
This is definitely a familiar problem if you rely on caches for throughput (I think caches are most often introduced for latency, but eventually the service is rescaled to traffic and unintentionally needs the cache for throughput). You can e.g. pre-warm caches before accepting requests or load-shed. Load-shedding is really good and more general than pre-warming, so it's probably a great idea to deploy throughout the service anyway. You can also load-shed on the client, so servers don't even have to accept, shed, then close a bunch of connections.
The more general pattern to load-shedding is to make sure you handle a subset of the requests well instead of degrading all requests equally. E.g. processing incoming requests FIFO means that as queue sizes grow, all requests become slower. Using LIFO will allow some requests to be just as fast and the rest will timeout.
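A toy sketch of that LIFO-with-shedding idea in Go (nothing service-specific, names invented): the newest requests are served first, and anything that has sat past its deadline is dropped rather than processed slowly.

    package shed

    import (
        "errors"
        "sync"
        "time"
    )

    var ErrEmpty = errors.New("no work")

    type request struct {
        enqueued time.Time
        work     func()
    }

    // Stack is a LIFO work queue: under overload, recent requests still get
    // served quickly while old ones age out and are shed.
    type Stack struct {
        mu    sync.Mutex
        items []request
    }

    func (s *Stack) Push(work func()) {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.items = append(s.items, request{enqueued: time.Now(), work: work})
    }

    // PopFresh returns the most recently enqueued request, discarding any it
    // encounters that are older than maxAge (the client has likely timed out).
    func (s *Stack) PopFresh(maxAge time.Duration) (func(), error) {
        s.mu.Lock()
        defer s.mu.Unlock()
        for len(s.items) > 0 {
            r := s.items[len(s.items)-1]
            s.items = s.items[:len(s.items)-1]
            if time.Since(r.enqueued) <= maxAge {
                return r.work, nil
            }
        }
        return nil, ErrEmpty
    }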
I've read the first SRE book but having worked on large-scale systems it is impossible to relate to the book or internalise the advice/process outlined in it unless you've been burned by scale.
I must note that there are two Google SRE books in-circulation, now: https://landing.google.com/sre/books/
But, per other comments, there isn't any direct malloc/free behavior. It just provides tools that help the compiler determine that GC is not needed for some allocations.
Rust will probably be faster because it benefits from optimizations in LLVM that Go likely doesn't have.
Go arrays of structs are contiguous, not indirect pointers.
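Right; for illustration (types invented), the difference is the one between these two declarations:

    package layout

    type point struct{ x, y float64 }

    var inline []point  // structs stored contiguously in the slice's backing array
    var boxed  []*point // a pointer per element, each point allocated separately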
IIRC I have used GitLab and Bitbucket and self-hosted Gitea instances the same exact way, and I'm fairly sure there was an hg repo in one of those. Don't recall doing anything out of the ordinary compared to how I would use a github URL.
Ouch, Go never ceases to amaze. The Bitbucket case is even more crazy, calling out to the Bitbucket API to figure out which VCS to use. It has a special case for private repositories, but seems to hard-code cloning over HTTPS.
If only we had some kind of universal way to identify resources, that told you how to access it...
Wow, that's sad. I'm glad it works seamlessly, don't get me wrong, but I was assuming I could chalk it up to defacto standards between the various vendors here.
Sometimes it means an easy thing in most other languages is difficult or tiresome to do in Go. Sometimes it means hard-coded values/decisions you can't change (only tabs anyone?).
But overall this makes for a language that's very easy to learn, where code from project to project and team to team is very similar and quick to understand.
Like anything, it all depends on your needs. We've found it suits ours quite well, and migrating from a Ruby code base has been a breath of fresh air for the team. But we don't have the same performance requirements as Discord.
These are two things that make a lot of sense at Google if you read why they were done.
But unless you're working at Google, I struggle to guess why you would care about either of these things. The first requires sacrificing anything resembling a reasonable type system, and even with that sacrifice Go doesn't really deliver: are we really supposed to buy that "go generate" isn't a compilation step? The second is sort of nice, but not nice enough to be a factor in choosing a language.
The core language is currently small, but every language grows with time: even C with its slow-moving, change-averse standards body has grown over the years. Currently people are refreshed by the lack of horrible dependency trees in Go, but that's mostly because there aren't many libraries available for Go: that will also change with time (and you can just not import all of CPAN/PyPI/npm/etc. in any language, so Go isn't special anyway).
If you like Go for some aesthetic of "simplicity", then sure, I guess I can see how it has that. But if we're discussing pros and cons, aesthetics are pretty subjective and not really worth talking about.
I like Go and I consider it a simple language because:
1. I can keep most of the language in my head and I don't hit productivity pauses where I have to look something up.
2. There is usually only one way to do things and I don't have to spend time deciding on the right way.
For me, these qualities make programming very enjoyable.
You mean where I explicitly said that "simple" didn't mean anything, so we should talk about what we mean more concretely?
> 1. I can keep most of the language in my head and I don't hit productivity pauses where I have to look something up.
The core language is currently small, but every language grows with time: even C with its slow-moving, change-averse standards body has grown over the years.
> 2. There is usually only one way to do things and I don't have to spend time deciding on the right way.
Go supports functional programming and object-oriented programming, so pretty much anything you want to do has at least two ways to do it--it sounds like you just aren't familiar with the various ways.
The problem with having more than one way to do things isn't usually choosing which to use, by the way: the problem is when people use one of the many ways differently within the same codebase and it doesn't play nicely with the way things are done in the codebase.
This isn't really a criticism of Go, however: I can't think of a language that actually delivers on there being one right way to do things (most don't even make that promise--Python makes the promise but certainly doesn't deliver on it).
I've been happy working with it for a year now, though I've had the chance to work with Kotlin and I have to say, it's very nice too, even if the parallelism isn't quite easy/convenient to use.
1. map/filter are 2 lines of code each.
3. Generics are more of a strong type thing than an OOP thing.
Have you ever worked in a code base with many contributors that changed over the course of years? In my experience it always ends up a jumble where indentation is screwed up and no particular tab setting makes things right. I've worked on files where different lines in the same file might assume tab spacing of 2, 3, 4, or 8.
For example, say there is a function with a lot of parameters, so the argument list gets split across lines. The first line has, say, two tabs before the start of the function call. The continuation line ideally should be two tabs then a bunch of spaces to make the arguments line up with the arguments from the first line. But in practice people end up putting three or four tabs to make the 2nd line line up with the arguments of the first line. It looks great with whatever tab setting the person used at that moment, but then change tab spacing and it no longer is aligned.
some_function(arg1, arg2, arg3, arg4,
rustfmt looks to try and be "smarter", as it will move the args list and add line breaks to it so as not to go beyond whatever limit is configured; on the playground, gofmt apparently doesn't insert breaks in arg lists.
Instead, format arguments in a separate block:
arg1, arg2, arg3, arg4,
My own preference is tabs, because of less visual noise in code diff [review].
Not all editors, however, support this style of alignment, even if they support plugins (looks at vim and its forks).
Consider linting tools in your build.
The Azul guys get to claim that you don't need to tune their GC; golang doesn't.
Not sure how that makes the golang position any better.
Thing is, few if any IDEs support the concept, so if I want to have half-indents, I must use spaces. Unfortunately, these days that means giving up and using a more common indent style most of the time, as the extra bit of readability generally isn't worth losing automatic formatting or multi-line indent changes.
BTW when you mix spaces with tabs you eliminate all the benefits that tabs give (for example, you can no longer dynamically change tab size without ruining formatting).
No, I used to be a notepad user (on personal projects, not shared work) (you can kinda see it in the use of indentation to help convey information that IDEs would use font or colour to emphasize), and these days use tabs but longingly wish Eclipse, etc. had more options in their already-massive formatting configuration dialogues.
<tab>(inserts 4 spaces)<tab>(replaces 4 spaces into a tab that is 8 columns)<tab>(adds 4 spaces after the tab)<tab>(replaces with two tabs and so on)
Unless I misunderstood what formatting you were using.