I'm wondering if the "optimized for reference-counting" thing applies to other languages too. i.e. if I write a piece of software in Rust, and I make use of Rc<>, will Macs be extra tolerant of that overhead? In theory it seems like the answer should be yes
I sure hope so. In macOS 10.15, the fast path for a retain on a (non-tagged-pointer) Obj-C object does a C11 relaxed atomic load followed by a C11 relaxed compare-and-exchange. This seems pretty standard for retain-release and I'd expect Rust's Rc<> to be doing something similar. It's possible Apple added some other black magic to the runtime in 10.16 (and they haven't released the 10.16 objc sources yet) but it's hard to imagine what they'd do that makes more sense than just optimizing for relaxed atomic operations.
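Roughly, that fast-path shape looks like this in Rust terms (a sketch of the pattern described, not the actual runtime code; the retry-on-failed-CAS loop is my assumption):

    use std::sync::atomic::{AtomicUsize, Ordering};

    struct RefCounted {
        refcount: AtomicUsize,
    }

    impl RefCounted {
        // Relaxed load, then relaxed compare-and-exchange; retry if another
        // thread raced us and the stored value changed underneath.
        fn retain(&self) {
            let mut old = self.refcount.load(Ordering::Relaxed);
            while let Err(current) = self.refcount.compare_exchange_weak(
                old,
                old + 1,
                Ordering::Relaxed,
                Ordering::Relaxed,
            ) {
                old = current;
            }
        }
    }

Rust's Arc::clone is, last I checked, just a single relaxed fetch_add on the strong count, so it should benefit from the same thing: fast relaxed atomics.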
I didn't understand why the implementation wouldn't just do an atomic increment, but I guess Obj-C semantics provide too much magic to permit such a simple approach. The actual code, in addition to [presumably] not being inlined, does not seem easy to optimize at the hardware level: https://github.com/apple/swift-corelibs-foundation/blob/main...
The short answer for why it can’t just be an increment is because the reference count is stored in a subset of the bits of the isa pointer, and when the reference count grows too large it has to overflow into a separate sidetable. So it does separate load and CAS operations in order to implement this overflow behavior.
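A hedged sketch of that shape (the bit positions, mask width, and names below are made up for illustration, not the real isa layout):

    use std::sync::atomic::{AtomicUsize, Ordering};

    // Made-up layout: pretend a few high bits of the isa word hold a small
    // inline retain count (assumes a 64-bit target).
    const RC_SHIFT: u32 = 45;
    const RC_ONE: usize = 1 << RC_SHIFT;
    const RC_MASK: usize = 0xff << RC_SHIFT;

    struct Object {
        isa_bits: AtomicUsize, // class pointer + inline refcount packed together
    }

    impl Object {
        fn retain(&self) {
            let mut old = self.isa_bits.load(Ordering::Relaxed);
            loop {
                if (old & RC_MASK) == RC_MASK {
                    // Inline count saturated: fall back to the slow path,
                    // which keeps the excess count in a locked side table.
                    return self.retain_slow();
                }
                match self.isa_bits.compare_exchange_weak(
                    old,
                    old + RC_ONE,
                    Ordering::Relaxed,
                    Ordering::Relaxed,
                ) {
                    Ok(_) => return,
                    Err(current) => old = current,
                }
            }
        }

        fn retain_slow(&self) {
            // elided: global side table (a hash map of counts behind locks)
        }
    }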
No, because Rc<> isn't atomic. Arc<>, however, would get the benefit. The reason retain/release in ObjC/Swift are so much faster here is because they are atomic operations.
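A tiny illustration of the difference, in plain Rust (nothing Apple-specific here):

    use std::rc::Rc;
    use std::sync::Arc;
    use std::thread;

    fn main() {
        // Arc: the refcount is atomic, so clones can cross threads.
        let shared = Arc::new(vec![1, 2, 3]);
        let handle = {
            let shared = Arc::clone(&shared); // atomic increment
            thread::spawn(move || shared.len())
        };
        assert_eq!(handle.join().unwrap(), 3);

        // Rc: the refcount is a plain (non-atomic) counter, so it's cheaper,
        // but Rc is !Send and the compiler won't let it cross threads.
        let local = Rc::new(vec![1, 2, 3]);
        let _also_local = Rc::clone(&local); // non-atomic increment
    }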
Yes, it applies to everything that uses atomics and is not something special in the runtime. It's also worth noting that this is an optimization iPhones have had for the last few years, ever since they switched to arm64.
If anything, an Objective-C machine: they apparently have a special branch predictor for objc_msgSend()!
But they learned the lesson from SOAR (Smalltalk On A RISC) and did not follow the example of LISP machines, Smalltalk machines, Java machines, Rekursiv, etc. and build specialized hardware and instructions. The benefits of the specialization are much, much less than expected, and the costs of being custom very high.
Instead, they gave the machine large caches and tweaked a few general purpose features so they work particularly well for their preferred workloads.
I wonder if they made trap-on-overflow after arithmetic fast.
Remember, the ARM instruction set is an interface, not an implementation. Arm Holdings does license their tech to other companies though, so I don't know how much Apple silicon would have in common with, say, a Qualcomm CPU. They may be totally different under the hood.
Basically reference counting requires grabbing a number from memory in one step, then increasing or decreasing it and storing it in a second step.
This is two operations, and in between them -- if and only if the respective memory location is shared between multiple cores or caches -- some form of synchronization must occur (like locking a bank account so you can't withdraw twice from two ATMs simultaneously).
Now the way this is implemented varies a bit.
Apple controls most of the hardware ecosystem, programming languages, binary interface, and so on, meaning there is an opportunity for them to either implement or supplement ARM synchronization or atomicity primitives with their own optimizations.
There is nothing really preventing Intel from improving here as well -- it is just easier on ARM because the ISA has different assumptions baked in, and Apple controls everything upstream, such as the compiler implementations.
I know it's a fast Arm CPU - I've read the Anandtech analysis etc - and that there is lots of extra hardware on the SoC. But the specific point was why is it a Swift machine. What makes it particularly suited to running Swift?
2. low memory latencies between cache and system memory (so dirty lines in caches are written back and updated faster, etc.)
3. potentially a coherency implementation optimized for this kind of atomic access (purely speculative: e.g. maybe sometimes updating less than a page in a cache when detecting certain atomic operations that changed a page, or maybe something with respect to the window marked for exclusive access in the context of LL/SC operations and similar)
Given that it's common for an object to stay in the same thread, I'm not sure how much 2. matters for this point (but it does matter for general perf.). But I guess there is a lot in 3., where especially with low-latency RAM you might be able to improve performance for these cases.
These are interesting points. I'd like to hazard a guess that the leading contributor is cache-related. Just looking at https://en.wikipedia.org/wiki/Reference_counting suggests as much: "Not only do the operations take time, but they damage cache performance and can lead to pipeline bubbles."
I roughly understand how refcounting causes extra damage to cache coherency: anywhere a refcount increment is required, you mutate the count on an object before you use it, and then decrement it later. Oftentimes, those counting operations are temporally distant from the time that you access the object's contents.
I do not really understand the "pipeline bubbles" part, and am curious if someone can elaborate.
Reading on in the wiki page, they talk about weak references (completely different from the weak memory ordering referenced above). This reminds me that Cocoa has been making ever more liberal use of weak references over the years, and a lot of iOS code I see overuses them, particularly in blocks. I last looked at the objc implementation years ago, but it was some thread-safe LLVM hash map split 8 or 16 ways to reduce lock contention (roughly the shape sketched below). My takeaway was roughly, "wow that looks expensive". So while weak refs are supposed to be used judiciously, and might only represent 1% or less of all refs, they might each cost over 100x, and then I could imagine all of your points could be significant contributors.
In other words, weak references widen the scope of this guessing game from just "what chip changes improve refcounting" to "what chip changes improve parallelized, thread safe hash maps."
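Not the actual runtime code, but the lock-striping idea ("split 8 or 16 ways") looks roughly like this; the shard count and names are made up:

    // Sketch of a lock-striped map: hash the key once to pick a shard and
    // take only that shard's lock, so unrelated keys on different shards
    // don't contend with each other.
    use std::collections::hash_map::DefaultHasher;
    use std::collections::HashMap;
    use std::hash::{Hash, Hasher};
    use std::sync::Mutex;

    const SHARDS: usize = 8; // made-up; the point is just "more than one lock"

    struct StripedMap<K, V> {
        shards: Vec<Mutex<HashMap<K, V>>>,
    }

    impl<K: Hash + Eq, V> StripedMap<K, V> {
        fn new() -> Self {
            StripedMap {
                shards: (0..SHARDS).map(|_| Mutex::new(HashMap::new())).collect(),
            }
        }

        fn shard(&self, key: &K) -> &Mutex<HashMap<K, V>> {
            let mut h = DefaultHasher::new();
            key.hash(&mut h);
            &self.shards[(h.finish() as usize) % SHARDS]
        }

        fn insert(&self, key: K, value: V) {
            self.shard(&key).lock().unwrap().insert(key, value);
        }

        fn contains(&self, key: &K) -> bool {
            self.shard(key).lock().unwrap().contains_key(key)
        }
    }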
The "pipeline bubbles" remark refers to the decoding unit of a processor needing to insert no-ops into the stream of a processing unit while it waits for some other value to become available (another processing unit is using it). For example, say you need to release some memory in a GC language, you would just drop the reference while the pipeline runs at full speed (leave it for the garbage collector to figure out). In an refcount situation, you need to decrease the refcount. Since more than one processing unit might be incrementing and decrementing this refcount at the same time, this can lead to a hot spot in memory where one processing unit has to bubble for a number of clock cycles until the other has finished updating it. If each refcount modify takes 8 clock cycles, then refcounting can never update the same value at more than once per 8 cycles. In extreme situations, the decoder might bubble all processing units except one while that refcount is updated.
For the last few decades the industry has generally believed that GC lets code run faster, although it has drawbacks in terms of being wasteful with memory and unsuitable for hard-realtime code. Refcounting has been thought inferior, although that hasn't stopped the Python folks and others from being successful with it. It sounds like Apple uses refcounting as well and has found a way to improve refcounting speed, which usually means some sort of specific silicon improvement.
I'd speculate that moving system memory on-chip wasn't just for fewer chips, but also for decreasing memory latency. Decreasing memory latency by having a cpu cache is good, but making all of ram have less latency is arguably better. They may have solved refcounting hot spots by lowering latency for all of ram.
From Apple's site:
"M1 also features our unified memory architecture, or UMA. M1 unifies its high-bandwidth, low-latency memory into a single pool within a custom package. As a result, all of the technologies in the SoC can access the same data without copying it between multiple pools of memory." That is paired with a diagram that shows the cache hanging off the fabric, not the CPU.
That says to me that, similar to how traditionally the CPU and graphics card could access main memory, now they have turned the cache from a CPU-only resource into a shared resource just like main memory. I wonder if the GPU can now update refcounts directly in the cache? Is that a thing that would be useful?
Extremely low memory latency is another. It also has 8 memory channels; most desktops have 2. It's an aggressive design, and Anandtech has a deep dive. Some of the highlights: lower-latency caches, a larger reorder buffer, more in-flight memory operations, etc.
Typical desktops have two 64-bit DIMMs on 2 channels (64 bits wide each) or 1 channel (128 bits wide).
The M1 Macs seem to be 8 channels x 16 bits, which is the same total width as a desktop (although running the RAM at 4266 MHz is much higher than usual). The big win is you can have 8 cache misses in flight instead of 2. With 8 CPU cores, 8 GPU cores, and a 16-core Neural Engine, I suspect the M1 has more in-flight cache misses than most.
The DDR4 bus is 64-bit, how can you have a 128-bit channel??
Single channel DDR4 is still 64-bit, it's only using half of the bandwidth the CPU supports. This is why everyone is perpetually angry at laptop makers that leave an unfilled SODIMM slot or (much worse) use soldered RAM in single-channel.
> The big win is you can have 8 cache misses in flight instead of 2
Only if your cache line is that small (16 bit) I think? Which might have downsides of its own.
> The DDR4 bus is 64-bit, how can you have a 128-bit channel??
Less familiar with what's normal on laptops, but most desktop chips from AMD and Intel have two 64-bit channels.
> Which might have downsides of its own.
Typically for each channel you send an address (a row and column, actually), wait for the DRAM latency, and then get a burst of transfers (one per bus cycle) of the result. So for a 16-bit-wide channel @ 3.2 GHz with a 128-byte cache line you get 64 transfers, one every 0.3125 ns, for a total of 20ns.
Each channel operates independently, so multiple channels can each have a cache miss in flight. Otherwise nobody would bother with independent channels and just stripe them all together.
Here's a graph of cache line throughput vs number of threads.
So with 1 and 2 threads you see an increase in throughput; the multiple channels are helping. 4 threads is the same as 2, maybe the L2 cache has a bottleneck. But 8 threads is clearly better than 4.
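For what it's worth, a rough sketch of one way to measure something like this (not necessarily how the graph above was produced; the line size and buffer sizes are assumptions):

    // Each thread strides through its own large buffer one (assumed) cache
    // line at a time, so nearly every access misses, and we report lines
    // touched per second as the thread count grows.
    use std::thread;
    use std::time::Instant;

    const LINE: usize = 128;              // assumed cache line size in bytes
    const BUF: usize = 256 * 1024 * 1024; // 256 MiB per thread, far past any cache

    fn touch_lines(buf: &[u8]) -> u64 {
        let mut sum = 0u64;
        for i in (0..buf.len()).step_by(LINE) {
            sum = sum.wrapping_add(buf[i] as u64); // one read per cache line
        }
        sum
    }

    fn main() {
        for threads in [1usize, 2, 4, 8] {
            let start = Instant::now();
            let handles: Vec<_> = (0..threads)
                .map(|_| {
                    thread::spawn(|| {
                        // Fill with a non-zero byte so the pages are backed by
                        // distinct memory, then walk it. (The fill is counted
                        // too; good enough for a rough comparison.)
                        let buf = vec![1u8; BUF];
                        touch_lines(&buf)
                    })
                })
                .collect();
            for h in handles {
                h.join().unwrap();
            }
            let lines = (threads * BUF / LINE) as f64;
            let secs = start.elapsed().as_secs_f64();
            println!("{} threads: {:.0} M lines/s", threads, lines / secs / 1e6);
        }
    }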
It's pretty common for hardware to support both. On the Zen 1 Epycs, for instance, some software preferred the consistent latency of striped memory over the NUMA-aware layout with separate channels, where the closer DIMMs have lower latency and the further DIMMs have higher latency.
I've seen similar on Intel servers, but not recently. This isn't typically something you can do at runtime though, just at boot time, at least as far as I've seen.
But doesn't that only help if you have parallel threads doing independent 16 bit requests? If you're accessing a 64 bit value, wouldn't it still need to occupy four channels?
Depends. Cache lines are typically 64-128 bytes long, and depending on various factors they might be on one memory channel or spread across multiple memory channels, somewhat like a RAID-0 disk. I've seen servers (Opterons, I believe) that would allow mapping memory per channel or across channels based on settings in the BIOS. Generally non-NUMA-aware OSs ran better with striped memory and NUMA-aware OSs ran better non-striped.
So striping a cache line across multiple channels does increase bandwidth, but not by much. If the DRAM latency is 70ns (not uncommon) and your memory is running at 3.2 GHz on a single 64-bit-wide channel, you get 128 bytes in 16 transfers. 16 transfers at 3.2 GHz = 5ns. So you get a cache line back in 75ns. With two 64-bit channels you can get 2 cache lines per 75ns.
So now with a 128-bit-wide channel (twice the bandwidth) you wait 70ns then get 8 transfers @ 3.2 GHz = 2.5ns. So you get a cache line back in 72.5ns. Clearly not a big difference.
So the question becomes: for a complicated OS with a ton of cores, do you want one cache line per 72.5ns (the striped config) or two cache lines per 75ns (the non-striped config)?
In the 16-bit, 8-channel config (assuming the same bus speed and latency) you get 8 cache lines per 90ns. However, I'm not sure what magic Apple has, but I'm seeing very low memory latencies on the M1, on the order of 33ns! With all cores busy I'm seeing cache line throughput of about one cache line per 11ns.
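A back-of-the-envelope check of those numbers (the 70ns latency, 3.2 G transfers/s, and 128-byte lines are all taken from the assumptions above):

    // Time to get one cache line back ~= DRAM latency + transfer time, where
    // transfer time = (line bytes / channel width in bytes) bus transfers at
    // one transfer per cycle. Dividing a transfer count by GHz gives ns.
    fn line_time_ns(latency_ns: f64, channel_bits: u32, line_bytes: u32, rate_ghz: f64) -> f64 {
        let transfers = line_bytes as f64 / (channel_bits as f64 / 8.0);
        latency_ns + transfers / rate_ghz
    }

    fn main() {
        println!("64-bit channel:    {:.1} ns", line_time_ns(70.0, 64, 128, 3.2));  // 75.0
        println!("128-bit (striped): {:.1} ns", line_time_ns(70.0, 128, 128, 3.2)); // 72.5
        println!("16-bit channel:    {:.1} ns", line_time_ns(70.0, 16, 128, 3.2));  // 90.0
    }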
I believe modern superscalar architectures can run instructions out of order if they don't rely on the same data, so when paused waiting for a cache miss, the processor can read ahead in the code, and potentially find other memory to prefetch. I may be wrong about the specifics, but these are the types of tricks that modern CPUs employ to achieve higher speed.
Sure, but generally a cache line miss will quickly stall the pipeline. You might have a few non-dependent instructions in flight, but running a CPU at 3+ GHz and waiting 70ns is an eternity. Doubly so when you can execute multiple instructions per cycle.
That's true, but for it to be a "Swift machine" as mentioned above would imply some kind of ISA-level design choices, as opposed to "just" being extremely wide or having a branch predictor that understands what my favourite food is.
This is where it's worth pointing out that Apple is an ARM architecture licensee. They're not using a design from ARM directly, they're basically modifying it however it suits them.
Indeed, they’re an ISA licensee, and I don’t think they’re using designs from ARM at all. They beat ARM to the first ARM64 core back in 2013 with the iPhone 5s.