This is fascinating:

> Retain and release are tiny actions that almost all software, on all Apple platforms, does all the time. ….. The Apple Silicon system architecture is designed to make these operations as fast as possible. It’s not so much that Intel’s x86 architecture is a bad fit for Apple’s software frameworks, as that Apple Silicon is designed to be a bespoke fit for it …. retaining and releasing NSObjects is so common on MacOS (and iOS), that making it 5 times faster on Apple Silicon than on Intel has profound implications on everything from performance to battery life.

> Broadly speaking, this is a significant reason why M1 Macs are more efficient with less RAM than Intel Macs. This, in a nutshell, helps explain why iPhones run rings around even flagship Android phones, even though iPhones have significantly less RAM. iOS software uses reference counting for memory management, running on silicon optimized to make reference counting as efficient as possible; Android software uses garbage collection for memory management, a technique that requires more RAM to achieve equivalent performance.




This quote doesn't really cover why M1 Macs are more efficient with less RAM than Intel Macs. You've got a memory budget, and it's likely broadly the same on both platforms; the speed at which your retains/releases happen isn't going to be the issue. It's not like Intel Macs use GC where the M1 uses RC.

(It explains why iOS does better with less RAM than Android, but the quote is specifically claiming this as a reason for 8GB of RAM to be acceptable.)


I doubt the M1 Macs are really using memory much more efficiently. The stock M1 Macs with 8GB were available quickly, while the Macs with 16GB of RAM or more storage had a three-to-four-week ordering delay, so a lot of enthusiasts and influencers rushed out and got base models. They were then surprised to find they can work OK in most apps with only 8GB.

Perhaps they never really needed 32GB in their Intel Macs either. A few days after the glowing reviews and the strange comments about magic memory utilization, we now see comments concerned about SSD wear due to swap file usage.

If applications and data structures really are more compact in memory on the ARM processors, it should be easy to test: run the same app on the same document on an Intel Mac and an M1 Mac and compare how much memory it uses.


When you need 32 or 64GB of RAM, it's not because the data structures or the programs you use need that memory; it's because the data they work with (database content, virtual machines, images, videos, music...) fills that RAM, and that data is not going to take up any less space on an ARM machine.


However, real-world use of such massive amounts of data is limited for the typical desktop user. Massive database loads usually happen on specialized servers, where 256GB of RAM and more is pretty mundane.

So on a consumer PC, RAM is perhaps used more for caching, or eaten away by memory leaks and poorly designed garbage collection.

And if your GPU is able to do real-time rendering on data-heavy loads, maybe you need less caching of intermediate results as well.


There are plenty of use cases for more than 8GB of RAM. When you're doing data analysis, even on smaller datasets, you may need several times more available memory than the size of the dataset while you're processing it.


1. Again, the typical use case for an entry-level PC is not data analysis on big data.

2. My current production server is a PostgreSQL database on a 16GB RAM VM running Debian (my boss is stingy). This doesn't prevent me from managing a 300GB+ data cluster with pretty decent performance and doing actual data analysis.

3. If Chrome sometimes uses 8GB+ for a web browser, for god's sake, the only explanation is poor design; there is no excuse.


"Nobody needs more than 640k of RAM"


That was not my point. My point was that if you need it, you will need it regardless of the architecture.


I think you're right. I've only ever needed 32GB when I was running a local Hadoop cluster for development. Those virtual images required the same amount of RAM regardless of OS.


It's a contributing factor. If things like retain/release are fast and you have significantly more memory bandwidth and low latency to throw at the problem, you can get away without preloading and caching nearly as much. Take something simple like images on web pages: don't bother keeping hundreds (thousands?) of decompressed images in memory for all of the various open tabs. You can just decompress them on the fly as needed when a tab becomes active and then release them when it goes inactive and/or when the browser/system determines it needs to free up some memory.
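
As a rough sketch of what that decode-on-demand pattern can look like on Apple platforms: NSCache is a real Foundation class that applies auto-eviction policies when memory is needed elsewhere, while the TabImageCache type, the cost estimate and the eviction choices below are invented for illustration.

    import AppKit

    // Keep decoded bitmaps only for tabs the user is actually looking at;
    // the compressed files stay on disk and are cheap to re-decode on demand.
    final class TabImageCache {
        private let cache = NSCache<NSURL, NSImage>()

        init(costLimitBytes: Int) {
            // NSCache applies its own auto-eviction policies when the system
            // needs memory back; the explicit cost limit is just an illustrative knob.
            cache.totalCostLimit = costLimitBytes
        }

        func image(for url: URL) -> NSImage? {
            if let cached = cache.object(forKey: url as NSURL) { return cached }
            // Decompress on the fly when the tab becomes active.
            guard let decoded = NSImage(contentsOf: url) else { return nil }
            let approxCost = Int(decoded.size.width * decoded.size.height) * 4
            cache.setObject(decoded, forKey: url as NSURL, cost: approxCost)
            return decoded
        }

        func tabWentInactive(_ url: URL) {
            // Or do nothing and let the cache evict it lazily when memory is needed.
            cache.removeObject(forKey: url as NSURL)
        }
    }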


You've completely changed the scope of what's being discussed, though. Retain/release being faster would just surface as regular performance improvements. It won't change anything at all about how an existing application manages memory.

It's possible that apps have been completely overhauled for a baseline M1 experience. It's extremely, extraordinarily unlikely that anything remotely of the sort has happened, though. And since M1-equipped Macs don't have any faster I/O than what they replaced (disk, network, and RAM speeds are all more or less the same), there wouldn't be any reason for apps to have done anything substantially different.


From the original article:

Third, Marcel Weiher explains Apple’s obsession about keeping memory consumption under control from his time at Apple as well as the benefits of reference counting:

>where Apple might have been “focused” on performance for the last 15 years or so, they have been completely anal about memory consumption. When I was there, we were fixing 32 byte memory leaks. Leaks that happened once. So not an ongoing consumption of 32 bytes again and again, but a one-time leak of 32 bytes.

>The benefit of sticking to RC is much-reduced memory consumption. It turns out that for a tracing GC to achieve performance comparable with manual allocation, it needs several times the memory (different studies find different overheads, but at least 4x is a conservative lower bound). While I haven’t seen a study comparing RC, my personal experience is that the overhead is much lower, much more predictable, and can usually be driven down with little additional effort if needed.


But again, that didn't change with the M1. We're talking macOS vs. macOS here. Your quote is completely irrelevant to what's being discussed, which is the outgoing 32GB MacBook vs. the new 16GB-max ones. They are running the same software, using the same ObjC & Swift reference counting systems.


We've come full circle here.

ARC is not specific to the M1, BUT it has been widely used in ObjC & Swift for years AND is thus heavily optimized on the M1, which performs "retain and release" way faster (even when emulating x86).

A perfect illustration of Apple's long-term software+hardware strategy.


That still doesn't mean that M1 Macs use less memory. If retain/release is faster, then the M1 Macs have higher performance than Intel Macs. That is easily understood. The claim under contention here is that M1 Macs use less memory, which is not explained by hardware-optimized atomic operations.


And I never stated that. It's just more optimized.


Ok. However the posts in this thread were asking how the M1 Macs could use less RAM than Intel Macs, not if they were more optimized. The GP started with:

>This quote doesn't really cover why M1 Macs are more efficient with less RAM than Intel Macs. You've got a memory budget, and it's likely broadly the same on both platforms


Well, if less memory is used to store garbage thanks to RC, less memory is needed. But that was largely discussed in other sub-comments, which is why we focused more on the optimisation aspect in this thread.


>Well, if less memory is used to store garbage thanks to RC, less memory is needed

But both Intel Macs and ARM Macs use RC. Both chips are running the same software.


Aren't most big desktop apps like Office on PC still written in C++? Same with almost all AAA games. And the operating system itself.

Browsers are written in C++, and JavaScript has full-blown GC.

I don't see how refcounting gives you an advantage over manual memory management for most users.


Decompression is generally bound by CPU speed, not memory bandwidth or latency.


CPU speed is often bound by memory bandwidth and latency... it's all related. If you can't keep the CPU fed, it doesn't matter how fast it is theoretically.


What I mean is that (to my understanding) memory bandwidth in modern devices is already high enough to keep a CPU fed during decompression. Bandwidth isn't a bottleneck in this scenario, so raising it doesn't make decompression any faster.


RAM bandwidth limitations (latency and throughput) are generally hidden by the multiple layers of cache between the RAM and the CPU, and by the CPU prefetching more data than is generally needed. Having memory in the package could reduce latency, but as AMD has shown with HBM memory on a previous generation of its GPUs, it's not a silver-bullet solution.

I am going to speculate now, but maybe, just maybe, if some of the silicon that Apple has used on the M1 is dedicated to compression/decompression, they could be transparently compressing all RAM in hardware. Since this is offloaded from the CPUs and allows a compressed stream of data from memory, they would achieve greater RAM bandwidth, less latency and less usage for a given amount of memory. If this is the case I hope that the memory has ECC and/or the compression has parity checking...


> I am going to speculate now, but maybe, just maybe, if some of the silicon that Apple has used on the M1 is dedicated to compression/decompression, they could be transparently compressing all RAM in hardware. Since this is offloaded from the CPUs and allows a compressed stream of data from memory, they would achieve greater RAM bandwidth, less latency and less usage for a given amount of memory.

Are you aware of any x86 chips that utilize this method?


Not that I am aware of. I remember seeing Apple doing something like it in software on the Intel Macs, which is why I speculated about it being hardware for the M1.

Cheers


> Blosc [...] has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() OS call. Blosc is the first compressor (that I'm aware of) that is meant not only to reduce the size of large datasets on-disk or in-memory, but also to accelerate memory-bound computations (which is typical in vector-vector operations).

https://blosc.org/pages/blosc-in-depth/


I can't speak to the macOS system, but from years spent JVM tuning: you're in a constant battle finding the right balance between object creation/destruction (the former burning CPU, the latter creating garbage), keeping memory use down (more collection, which burns CPU and can create pauses and hence latency), or letting memory balloon (which can eat resources, and makes the memory sweeps worse when they finally happen).

Making it cheaper to create and destroy objects with hardware acceleration, and to do many small, low-cost reclaims without eating all your CPU would be a magical improvement to the JVM, because you could constrain memory use without blowing out CPU. From what's described in TFA it sounds like the same is true for modern MacOS programming.


Manual memory management isn't magic and speeding up atomic ops doesn't fundamentally change anything. People have to spend time tuning memory management in C++ too, that's why the STL has so many ways to customise allocators and why so many production C/C++ codebases roll custom management schemes instead of using malloc/free. They're just expensive and slow so manual arena destruction etc is often worth it.

The JVM already makes it extremely cheap to create and destroy objects: creation is always ~free (just a pointer increment), and then destruction is copying, so very sensitive to memory bandwidth but done in parallel. If most of your objects are dying young then deallocation is "free" (amortized over the cost of the remaining live objects). Given the reported bandwidth claims for the M1 if they ever make a server version of this puppy I'd expect to see way higher GC throughput on it too (maybe such a thing can be seen even on the 16GB laptop version).

The problem with Java on the desktop is twofold:

1. Versions that are mostly used don't give memory back to the OS even if it's been freed by the collector. That doesn't start happening by default until like Java 14 or 15 or so, I think. So your memory usage always looks horribly inflated.

2. If you start swapping it's death because the GC needs to crawl all over the heap.

There are comments here saying the M1 systems rely more heavily on swap than a conventional system would. In that case ARC is probably going to help. At least unless you use a modern pauseless GC where relocation is also done in parallel. Then pausing background threads whilst they swap things in doesn't really matter, as long as the app's current working set isn't swapped out to compensate.
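
To make the "manual arena destruction" point above concrete, here is a minimal bump-allocator sketch in Swift; the Arena type and its API are invented for illustration, and a real arena would also handle per-allocation alignment and growth.

    // Bump allocation: each alloc is a bounds check plus a pointer increment,
    // and the whole arena is freed in one shot, the same property that makes
    // a GC nursery cheap, just done manually.
    final class Arena {
        private let base: UnsafeMutableRawPointer
        private let capacity: Int
        private var offset = 0

        init(capacity: Int) {
            self.capacity = capacity
            self.base = UnsafeMutableRawPointer.allocate(
                byteCount: capacity,
                alignment: MemoryLayout<Int>.alignment)
        }

        func allocate(byteCount: Int) -> UnsafeMutableRawPointer? {
            guard offset + byteCount <= capacity else { return nil }  // arena is full
            defer { offset += byteCount }
            return base + offset
        }

        deinit {
            // Everything allocated from the arena dies together.
            base.deallocate()
        }
    }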


Yea, this is a BS theory. I have a 16GB M1 MacBook Air, and the real answer is that it has super fast SSD access, so you don't notice the first few gigabytes of swap.

But when swap hits 8-9GB, its effects start to get very noticeable.


This seems correct. RC vs GC might explain how a Mac full of NSObjects needs less memory than a Windows machine full of .NET runtimes, but it doesn't explain how an M1 Mac with 16GB of RAM is faster than an x86 Mac with 16GB or more of RAM.

Besides, a lot of memory usage is in web browsers, which must use garbage collection.

Looking at the reviews of M1 Macs, those systems are still responsive and making forward progress at a “memory pressure” that would make my x86 Mac struggle in a swap storm. It seems to come down to very fast access to RAM and storage, large on-die caches, and perhaps faster memory compression.


Oh one more thing, they said in the Apple Silicon event that they had eliminated a lot of the need for copying RAM around so … could be some actual footprint reduction there?


I tend to agree! I think Big Sur on the M1 uses a 16kB page size vs 4kB on Intel, so maybe that contributes to more efficient swapping / less obvious perf issues when it happens.


Yeah, it's a bit of a stretch. To the extent that macOS apps use garbage collection less than PC apps, they would need less RAM. But they are kind of hopping around a macOS vs Android comparison, which makes no sense. I think Mac enthusiasts are trying to imagine why a max of 8 or 16GB is OK. It is OK for most people anyway.


It also would make no difference between the outgoing Intel ones and the incoming Apple Silicon ones. Same pointer sizes, same app memory management, etc... Some fairly minor differences in overall binary sizes, so no "wins" there or anything either.


All Swift/ObjC software has been doing ARC for ten (?) years. Virtual memory usage will be the same under M1. It will just pay off in refcounting being faster (i.e. as fast as it already is on an iPhone), and therefore the same software runs faster. It probably won't work under Rosetta 2 with the per-thread Total Store Ordering switch. And it's probably not specific to NSObject; any thread-safe reference counter will benefit. There are more of those everywhere these days.

2 more points:

- All the evidence I've seen is gifs of people opening applications in the dock, which is... not impressive. I can do that already, apps barely allocate at all when they open to "log in to iCloud" or "Safari new tab". And don't we see that literally every time Apple launches Mac hardware? Sorry all tech reviewers everywhere, try measuring something.

- I think the actual wins come from the zillion things Apple has done in software. Like: memory compression, which come to think of it might be possible to do in hardware. Supposedly a lot of other work/tuning done on the dynamic pager, which is maybe enabled by higher bandwidth more than anything else.

Fun fact: you can stress test your pager and swap with `sudo memory_pressure`. Try `-l critical`. I'd like to see a benchmark comparing THAT under similar conditions with the previous generation.


You might try reading the article. One example was a large software build taking ~25% less time on a 13" MBP than on a 12-core Mac Pro.

I'm curious about FP/vector performance, but I'm pretty sure it's fine. I'm definitely eyeing a MBP myself! 20 hours of video playback? Crazy...


Right, which wasn't a test of the purportedly increased ability to work in high memory pressure at all.


> a macOS vs Android comparison, which makes no sense.

Because the M1 is similar to the chips used in iOS devices, the comparison is not inappropriate.


All comparisons are appropriate, but the question here was whether the Mac laptops' memory limits were somehow made better by more efficient use of memory. They are not. These are laptops, not phones or tablets. Memory is used as efficiently as in previous laptops.


The quote right after explains your concerns.

>The memory bandwidth on the new Macs is impressive. Benchmarks peg it at around 60GB/sec–about 3x faster than a 16” MBP. Since the M1 CPU only has 16GB of RAM, it can replace the entire contents of RAM 4 times every second. Think about that…


Yes, reading some more of the discussions it seems like the answer is that (roughly) the same amount of memory is used, but hitting swap is no longer a major problem, at least for user-facing apps. Seems like the original quote is reading too much into the retain/release thing.


If the same amount of memory is used, then wouldn't swap usage be the same?


Yeah, but swapping suddenly isn't as big a problem as before (probably). You had to be very careful not to hit thrashing on x86_64; now you don't have to worry so much.

Or that's how I understand this; I don't actually own an M1 Mac.


First, there is no sensible reason why RAM bandwidth would differ by 3x; it's LPDDR4X either way. And you can't refill it from an SSD that fast; the SSD would limit swap speed.


What?

https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...

> Besides the additional cores on the part of the CPUs and GPU, one main performance factor of the M1 that differs from the A14 is the fact that’s it’s running on a 128-bit memory bus rather than the mobile 64-bit bus. Across 8x 16-bit memory channels and at LPDDR4X-4266-class memory, this means the M1 hits a peak of 68.25GB/s memory bandwidth.

The point of the memory bandwidth is so that it never has to swap to disk in the first place.


> The point of the memory bandwidth is so that it never has to swap to disk in the first place.

What? How does memory bandwidth obviate the need for disk swapping?


How does memory bandwidth prevent it needing to swap to disk?


By swap speed I think he meant that the bottleneck is the time that it takes to move data from the SSD to the RAM, not how fast the RAM can be read by the processor.


Sounds like some form of GDDR instead of plain DDR. Not only faster, but I bet simultaneously accessible from both the CPU and GPU.


> Not only faster, but I bet simultaneously accessible from both the CPU and GPU.

It is. We know they're using a unified memory architecture, they pointed it out in the presentation.


Unified doesn't necessarily mean dual ported.


Due to how DRAM works, the array itself cannot have more than one port, and almost certainly even the DRAM chip as a whole is still some variation of single-ported SDRAM (a long time ago there were various pseudo-dual-ported DRAM chips, but those were only really useful for framebuffer-like applications). But given that there are multiple levels of cache in the SoC, it is a somewhat moot point.


LPDDR4x (the ram chips in the m1) comes in dual port flavors.


I suspect you mean they come with two channels on a single chip, which is not the same as two ports. Channels access separate bits of memory. Ports access the same bits of memory.


Nope. Nothing too magical, I suspect. It says LPDDR4 right there in System Profiler, although it does not indicate the frequency.


This reminds me that the Xbox Series X/PS5 use unified GDDR for higher memory bandwidth. I'm curious whether such a design could help x86 catch up to the M1?


x86 has been doing unified memory for integrated GPUs since at least 2015. It's not a new thing, see for example https://bit-tech.net/news/tech/cpus/amd-huma-heterogeneous-u...

The reason GDDR isn't typically used for system RAM is that it's higher latency & more power hungry. Like, the GDDR6 memory on a typical discrete card uses more power than an entire M1-powered Mac Mini.


Thanks, I wasn't aware of the latency/power difference of GDDR memory.


No, it's normal mobile RAM.


The M1 has 4 LPDDR4X chips inside the CPU package running at 4266 MHz and showing some of the best latencies I've seen.

What "normal" laptop has that?


I believe the poster meant "normal" in the sense that it's a conventional memory technology for laptops (i.e. LPDDR4, not GDDR as had been suggested above).


Being inside the CPU package should allow for less than a 1% improvement in latency by my napkin math. It's got good latency because it is a top-of-the-line mobile RAM setup, but that isn't unique to the M1.


I 100% agree, but I've audited my code, and on other platforms my code closely agrees with LMbench's lat_mem_rd, which seems pretty well regarded for accuracy.

The latency appears real to me.


Someone please correct me for the sake of all of us if I’m wrong, but it sounds like Apple is using specialized hardware for “NSObject” retain-and-release operations, which may bypass/reduce the impact on general RAM.


On recent Apple Silicon CPUs most uncontended atomic operations are essentially free - almost identical in speed to the non-atomic version of the same operation. Reference counting must be atomic-safe whether using ARC or MRR. On x86 systems those atomic operations impose a performance cost. On Apple Silicon they do not. It does not change how much memory is used, but it does mean you can stop worrying about the cost of atomic operations. It has nothing to do with the ARMv8 instruction set; it has to do with how the underlying hardware implements those operations and coordinates among cores.

Separately from that, x86's TSO-ish memory model also imposes a performance cost whether your algorithm needs those guarantees or not. Code sometimes relies on those guarantees without knowing it. Absent hardware support, you would need to insert ARM atomics in translated code to preserve those guarantees, which on most ARM CPUs would impose a lot of overhead. The M1 allows Rosetta to put the CPU into a memory ordering mode that preserves the expected memory model very efficiently (as well as using a 4K page size for translated processes).
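
As a minimal sketch of what that per-object atomic traffic looks like (this uses the swift-atomics package and orderings of my own choosing; it is not Apple's actual runtime code):

    import Atomics

    // Every retain/release is one atomic read-modify-write. On hardware where
    // uncontended atomics are cheap, these cost little more than plain
    // increments; where they are expensive, this is the overhead being described.
    final class RefCountedBox<T> {
        private let strongCount = ManagedAtomic<Int>(1)
        let value: T

        init(_ value: T) { self.value = value }

        func retain() {
            strongCount.wrappingIncrement(ordering: .relaxed)
        }

        // Returns true when the last reference has been dropped.
        func release() -> Bool {
            return strongCount.loadThenWrappingDecrement(ordering: .acquiringAndReleasing) == 1
        }
    }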


> On recent Apple Silicon CPUs most uncontended atomic operations are essentially free - almost identical in speed to the non-atomic version of the same operation.

They are fast for atomics but still far, far slower than the equivalent non-atomic operation. An add operation takes around half a cycle (an upper bound here; with how wide the Firestorm core is, an add operation is almost certainly less than half a cycle). At 1GHz a cycle is 1 nanosecond. The M1 runs at around 3GHz. So you're still talking the atomic operation being >10x slower than non-atomics.

Which should not be surprising at all. Apple didn't somehow invent literal magic here. They still need coherency across 8 cores, which means at a minimum L1 is bypassed for the atomic operation. The L2 latency is very impressive, contributing substantially to that atomic operation performance. But it's still coming at a very significant cost. It's very, very far from free. There's also no ARM vs. x86 difference here, since the atomic necessarily forces a specific memory ordering guarantee that's stricter than x86's default. Both ISAs are forced to do the same thing and pay the same costs.


> So you're still talking the atomic operation being >10x slower than non-atomics.

How did you arrive at this number?


> How did you arrive at this number?

It's in the post. Half a cycle for an add, or less, and cycles are every 1/3 of a nanosecond. So the upper bound for an add would be around 1/6th of a nanosecond. Likely less than that still, since the M1 is probably closer to an add in 1/8th of a cycle, not 1/2. Skylake by comparison is at around 1/4th of a cycle for an add, and since the M1's IPC is higher it's not going to be worse at basic ALU ops.

6 nanoseconds @ 3GHz is 18 cycles. That's on the slow end of the spectrum for a CPU instruction.
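
One way to sanity-check numbers like these on your own machine, rather than arguing from cycle counts, is a crude loop along these lines; it assumes the swift-atomics package, should be compiled with optimizations, and the plain-increment figure is only a rough floor because the compiler may partially fold that loop.

    import Atomics
    import Foundation

    let iterations = 100_000_000

    // Plain (non-atomic) increments.
    var plain = 0
    var start = DispatchTime.now().uptimeNanoseconds
    for _ in 0..<iterations { plain &+= 1 }
    let plainNs = Double(DispatchTime.now().uptimeNanoseconds - start)

    // Relaxed atomic increments: roughly the shape of an uncontended retain.
    let counter = ManagedAtomic<Int>(0)
    start = DispatchTime.now().uptimeNanoseconds
    for _ in 0..<iterations { counter.wrappingIncrement(ordering: .relaxed) }
    let atomicNs = Double(DispatchTime.now().uptimeNanoseconds - start)

    print("plain:  \(plainNs / Double(iterations)) ns/op (sum \(plain))")
    print("atomic: \(atomicNs / Double(iterations)) ns/op (count \(counter.load(ordering: .relaxed)))")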


Where? 6 nanoseconds is pretty long, that’s about how long it’d take to do an entire retain/release pair, which is a couple dozen instructions I believe.


I don't think that's quite right. Apple believes strongly in retain-and-release / ARC. It has designed its software that way; it has designed its M1 memory architecture that way. The harmony between those design considerations leads to efficiency: the software does things in the best way possible, given the memory architecture.

I'm not an EE expert and I haven't torn apart an M1, but Occam's Razor would suggest it's unlikely they made specialized hardware for NSObjects specifically. Other ARC systems on the same hardware would likely see similar benefits.


I suspect that Apple didn't do anything special to improve the performance of reference counting apart from not using x86. Simply put, the x86 ISA and memory model are built on the assumption that atomic operations are mostly used as part of some kind of higher-level synchronization primitive and not for their direct result.


The M1 is faster at retain/release under Rosetta 2 than x86 is, and yet Rosetta 2 still has the same strong memory model that x86 does.


One thing is that the M1 has incredible memory bandwidth and is implemented on a single piece of silicon (which certainly helps with low-overhead cache coherency). Another thing is that Rosetta certainly does not have to preserve the exact behavior of x86 (and in fact it cannot, because doing so would negate any benefits of dynamic translation); it only has to care about what can be observed by the user code running under it.


The hardware makes uncontended atomics very fast, and Objective-C is a heavy user of those. But it would really help any application that could use them, too.


But GCd languages don't need to hit atomic ops constantly in the way ref-counted Objective C does, so making them faster (though still not as fast as regular non-atomic ops) is only reducing the perf bleeding from the decision to use RC in the first place. Indeed GC is generally your best choice for anything where performance matters a lot and RAM isn't super tight, like on servers.

Kotlin/Native lets us do this comparison somewhat directly. The current and initial versions used reference counting for memory management. K/N binaries were far, far slower than the equivalent Kotlin programs running on the JVM and the developer had to deal with the hassle of RC (e.g. manually breaking cycles). They're now switching to GC.

The notion that GC is less memory efficient than RC is also a canard. In both schemes your objects have a mark word of overhead. What does happen though, is GC lets you delay the work to deallocate from memory until you really need it. A lot of people find this quite confusing. They run an app on a machine with plenty of free RAM, and observe that it uses way more memory than it "should" be using. So they assume the language or runtime is really inefficient, when in reality what's happened is that the runtime either didn't collect at all, or it collected but didn't bother giving the RAM back to the OS on the assumption it's going to need it again soon and hey, the OS doesn't seem to be under memory pressure.

These days on the JVM you can fix that by using the latest versions. The runtime will collect and release when the app is idle.


I saw that point brought up on Twitter, and I don't know how it makes more efficient use of RAM.

Specifically, as I understood it, Apple software (written in Objective-C/Swift) uses a lot of retain/release (or Atomic Reference Counting) on top of manual memory management, rather than other forms of garbage collection (such as those found in Java/C#), which gives Objective-C programs a lower memory overhead (supposedly). This is why the iPhone ecosystem is able to run so much snappier than the Android ecosystem.

That said, I don't see how that translates to lower memory usage than x86 programs. I think the supporting quotes he used for that point are completely orthogonal. I don't have an M1 Mac, but I believe the same program running on both machines should use the same amount of memory.


The only thing I can think of that would actually reduce memory usage on M1 vs the same version of MacOS on x86 would be if they were able to tune their compressed memory feature to run faster (with higher compression ratio) on the M1. That would serve to reduce effective memory usage or need to fall back to swap. I would not expect something like that to be responsible for more than, say, a 5-10% RAM usage decrease though.


> That would serve to reduce effective memory usage or need to fall back to swap. I would not expect something like that to be responsible for more than, say, a 5-10% RAM usage decrease though.

I think you can reach a lot more than that. Presumably, on Intel they use something like LZO or LZ4, since it compresses/decompresses without too much CPU overhead. But if you have dedicated hardware for something like e.g. Brotli or zstd, one could reach much higher compression ratios.

Of course, this is assuming that memory can be compressed well, but I think this is true in many cases. E.g. when selecting one of the program/library files in the squash benchmarks:

https://quixdb.github.io/squash-benchmark/

you can observe higher compression ratios for e.g. Brotli/gzip/deflate than LZO/LZ4.


I suspect Apple is using their own LZFSE[0] compression, perhaps now with special tweaks for the M1. The reason I only suspected a 5-10% improvement, though, is that even if it's able to achieve a massive increase in compression ratio (compressing 3GB to 1GB instead of 2GB, say), that's still only saving 1GB total. Which I guess isn't nothing, and is more than 10% on an 8GB machine.

[0] https://en.m.wikipedia.org/wiki/LZFSE
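
If you want to poke at those ratios directly on a Mac, the system Compression framework exposes LZ4, LZFSE and zlib; here is a rough sketch, where the repetitive sample buffer and the idea of comparing algorithms this way are mine, and real memory pages will compress differently from text like this.

    import Compression
    import Foundation

    // Compress the same buffer with several algorithms and compare output sizes.
    func compressedSize(_ input: Data, using algorithm: compression_algorithm) -> Int {
        let dstCapacity = input.count + 64 * 1024
        let dst = UnsafeMutablePointer<UInt8>.allocate(capacity: dstCapacity)
        defer { dst.deallocate() }
        return input.withUnsafeBytes { raw in
            compression_encode_buffer(dst, dstCapacity,
                                      raw.bindMemory(to: UInt8.self).baseAddress!,
                                      input.count, nil, algorithm)
        }
    }

    // Something vaguely memory-like: repetitive data compresses well.
    let sample = Data(String(repeating: "retain/release ", count: 100_000).utf8)

    for (name, algorithm) in [("LZ4", COMPRESSION_LZ4),
                              ("LZFSE", COMPRESSION_LZFSE),
                              ("zlib", COMPRESSION_ZLIB)] {
        print("\(name): \(sample.count) -> \(compressedSize(sample, using: algorithm)) bytes")
    }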


For a very long time, budget Android devices that were one or even two generations older were faster than just released iPhones at launching new apps to interactivity (see: hundreds of Youtube "speed comparison" videos). This was purely due to better software, as the iPhone had significantly faster processors and I/O. RAM doesn't play a big factor at launch. One very minor contributor would be that the GC doesn't need to kick in until later, while ARC is adding its overhead all the time.

Edit: apparently, this isn't common knowledge.

https://www.youtube.com/watch?v=emPiTZHdP88

https://youtu.be/hPhkPXVxISY

https://youtu.be/B5ZT9z9Bt4M


Ridiculous that you're downvoted. There are a lot of people posting here who haven't worked on memory management subsystems.

GC vs RC is not a trivial comparison to make, but overall there are good reasons new systems hardly use RC (Objective-C, dating back to the 90s, isn't new). Where RC can help is where you have a massive performance cliff on page access, i.e. if you're swapped to disk. Then GC is terrible because it'll try and page huge sections of the heap at once, whereas RC is way more minimal in what it touches.

But in most other scenarios GC will win a straight up fight with an RC based system, especially when multi-threading gets involved. RC programs just spend huge amounts of time atomically incrementing and decrementing things, and rummaging through the heap structures, whereas the GC app is flying along in the L1 cache and allocations are just incrementing a pointer in a register. The work of cleaning up is meanwhile punted to those spare cores you probably aren't using anyway (on desktop/mobile). It's tough to beat that by hand with RC, again, unless you start hitting swap.

If M1 is faster at memory ops than x86 it's because they massively increased memory bandwidth. In fact I'd go as far as saying the CPU design is probably not responsible for most of the performance increase users are seeing. Memory bandwidth is the bottleneck for a lot of desktop tasks. If M1 core is say 10% faster than x86 but you have more of them and memory bandwidth is really 3x-4x larger, and the core can keep far more memory ops in flight simultaneously, that'll explain the difference all by itself.


Indeed, one of the articles cited by the article that this discussion is about (https://blog.metaobject.com/2020/11/m1-memory-and-performanc...) links to a paper saying that GC needs 4x the memory to match manual memory management and then makes a huge wacky leap to say that ARC could achieve that with less. It can't. ARC will always be slower than manual memory management because it behaves the same way as naive manual memory management with some overhead on top.

On the other hand, that same paper shows that for every single one of their tested workloads, the generational GC outperforms manual memory management. Now obviously, you could do better with manual memory management if you took the time to understand the memory usage of your application to reduce fragmentation and to free entire arenas at a time, but for applications that don't have the developer resources to apply to that (the vast majority), the GC will win.

I'm not saying that better memory management is the reason Android wins these launch to interactivity benchmarks because the difference is so stark relative to the hardware performance that memory management isn't nearly enough to explain it, but it does contribute to it. (My own guess is that most of the performance difference comes from smarter process initialization from usage data. Apple is notoriously bad at using data for optimization.)


ARC stands for Automatic Reference Counting.

https://en.wikipedia.org/wiki/Automatic_Reference_Counting


TFA says the hardware optimization for ARC is to the point of being bespoke. Hardware will always beat software optimization. Further, the other GCs have much higher RAM overheads than this combined, bespoke system.

Apple has decades of proven experience producing and shipping massively over-engineered systems. I believe them when they say these processors do ARC natively.


I’m not denying that Apple has better ARC performance. It’s that I don’t understand how an application would use less memory on ARM than x86. I’d expect the ARM code to run faster (as a result of being able to do atomic operations faster), but I don’t see how that translates to less memory usage


In one of those random “benchmark tests” online where someone opened several applications on an M1 Mac with 8GB RAM and did some work on those, they kept Activity Monitor open alongside and pointed to the increase in swap at some stage. So it seems like the swap is fast enough and is used more aggressively. That reduces the amount of RAM used at any point in time. The data in RAM has also benefited from compression in macOS for several years now.


Read up on the performance overhead of GC across other languages. They’re messy and can lock up periodically. They take up significant ram and resources.


We're talking about non-GC apps on x86 vs. the same non-GC apps on Arm.


Even those still do memory management, and usually poorly.


It makes Objective-C and Swift memory management faster, but it doesn't reduce RAM usage at all (maybe a wee bit less bandwidth used).


If memory is released as soon as possible instead of waiting for the next GC cycle, doesn't that make it more efficient?


Yes!

Hopefully I can clear up the discussion a little:

Q: Does reference counting 'use' less RAM than GC?

A: Yes (caveats etc. go here, but your question is a good explanation)

Q: Does the M1 in and of itself require less RAM than x86 processors?

A: No

Q: So why are people talking about the M1 and its RAM usage as if it's better than with x86?

A: It's really just around the faster reference counting. MacOS was already pretty efficient with RAM.

I'd like to propose tokamak-teapot's formula for hardware purchase:

Minimum RAM believed to be required = actual amount of RAM required * 2

N.B. I am aware that a sum that's greater than 16GB doesn't magically become less than 16GB, but it is somewhat surprising how well MacOS performs when it feels like RAM should be tight, so I'd suggest borrowing a Mac or making a Hackintosh to experience this if you're anxious about hitting the ceiling.


There is no “next GC cycle”; ObjC and Swift use ref counting on every platform (there was an abortive GC attempt on desktops a few years back, but it never saw wide use and has been deprecated since Mountain Lion).


These are not GC apps; they are reference counted.


It's ARMv8.x atomic instructions.


Their hardware is almost certainly specialized for reference counting, but I would be surprised if they had a custom instruction or anything.


The post kind of does:

> The benefit of sticking to RC is much-reduced memory consumption. It turns out that for a tracing GC to achieve performance comparable with manual allocation, it needs several times the memory (different studies find different overheads, but at least 4x is a conservative lower bound).

It implies that ref-counting is more economical in terms of wasted memory than GC, with the tradeoff being performance. This is solved thanks to the M1.


Apple's runtimes have always used RC; this is not a change between Intel and ARM.


Nope, they also tried a tracing GC for Objective-C on the desktop, but it was a failure due to interoperability across libraries compiled in different modes alongside C semantics.

Then they pivoted into automating retain/release patterns from Cocoa and sold it, Apple style, as a victory of RC over tracing GC, while moving the GC related docs and C related workarounds into the documentation archive.


> Nope, they also tried a tracing GC for Objective-C on the desktop

Operative word: tried. GC was an optional desktop-only component, deprecated in Mountain Lion, which IIRC has not been accepted on the MAS since 2015 and was removed entirely in Sierra.

Without going into irrelevant weeds, "Apple has always used refcounting everywhere" is a much closer approximation.


Operative word: failed.

Which they then, in Apple style ("you are holding it wrong"), turned around into a huge marketing message, while hiding away the tracing GC efforts.


> Operative word: failed.

That's not exactly relevant to the subject at hand of what memory-management method software usually uses on macOS.

> Which they then, in Apple style ("you are holding it wrong"), turned around into a huge marketing message, while hiding away the tracing GC efforts.

Hardly?

And people are looking to refcounting as a reason why Apple software is relatively light on memory, which is completely fair and true, and e.g. generally assumed to be one of the reasons why iOS devices fare well with significantly less RAM than equivalent Android devices. GCs have significant advantages, but memory overhead is absolutely one of the drawbacks.


Nope, as proven by performance tests,

https://github.com/ixy-languages/ixy-languages

And the fact that the M1 has special instructions dedicated to optimizing RC,

https://blog.metaobject.com/2020/11/m1-memory-and-performanc...

Memory overhead in languages with tracing GC (RC is a GC algorithm) only happens in languages like Java without support for value types.

If the language supports value types, e.g. D, and there is still memory overhead versus RC, then fire the developers or they better learn to use the language features available on their plate.


> Nope, as proven by performance tests,

> https://github.com/ixy-languages/ixy-languages

This shows latency, not memory consumption, as far as I can tell.

> If the language supports value types, e.g. D, and there is still memory overhead versus RC, then fire the developers or they better learn to use the language features available on their plate.

Memory overhead of certain types of garbage collectors (notably generational ones) is well-known and it's specified relative to the size of the heap that they manage. Using value types is of course a valid point, regarding how you should use the language, but it doesn't change the overhead of the GC, it just keeps the heap it manages smaller. If the overhead was counted against the total memory use of a program, then we wouldn't be talking about the overhead of the garbage collector, but more about how much the garbage collector is actually used. Note that I'm not arguing against tracing GCs, only trying to keep it factual.


I think the author doesn't understand what Gruber wrote here. Android uses more memory because most Android software is written to use more memory (relying on garbage collection). It has nothing to do with the chips. If you ran Android on an M1, it wouldn't magically need less RAM. And Photoshop compiled for x86 is going to use about the same amount of memory as Photoshop compiled for Apple silicon. Sure, if you rewrote Photoshop to use garbage collection everywhere then memory consumption would increase, but that has nothing to do with the chip.


Maybe I misread, but I understood that more as Apple using ARC, and that gives them a memory advantage. The M1 is simply making that more efficient by doing retain/release faster. But I agree that should not change total memory usage.

But I think in general you could say that Apple has focused more on optimizing their OS for memory usage than the competition may have done. Android uses Java, which eats memory like crazy, and I suspect C# is not that much better, being a managed and garbage-collected language. Not sure how much .NET stuff is used on Windows, but I suspect a lot.

macOS, in contrast, is really dominated by Objective-C and Swift, which do not use these memory-hungry garbage collection schemes, nor require JIT compilation, which also eats memory.


> I suspect C# is not that much better being a managed and garbage collected language

C# is better than the JVM in that it has custom value types.

Say you want to allocate an array of points in Java: you basically have to allocate an array of pointers, all pointing to tiny 8-byte objects (for e.g. 32-bit float x and y coords), plus the overhead of the object headers. If you use C# and structs, it just allocates a flat array of floats with zero overhead.

Not only do you pointlessly use memory, you have indirection lookup costs, potential cache misses, more objects for the GC to traverse, etc. etc.

The JVM really sucks at this kind of stuff, and so much of GUI programming is passing around small structs like that for rendering.

FWIW I think they are working on some proposal to add value types to the JVM, but that probably won't reach Android ever.
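
Swift, incidentally, draws the same line, which is part of how it keeps both memory overhead and ARC traffic down: an array of structs is one flat allocation with nothing to refcount, while an array of class instances is an array of pointers to refcounted heap objects. A tiny sketch, with the Point types made up for illustration:

    // Value type: the array stores the floats contiguously,
    // with no per-element header and nothing for ARC to count.
    struct PointStruct { var x: Float; var y: Float }

    // Reference type: the array holds pointers to separately allocated,
    // reference-counted objects, each with its own object header.
    final class PointClass {
        var x: Float
        var y: Float
        init(x: Float, y: Float) { self.x = x; self.y = y }
    }

    let flat = [PointStruct](repeating: PointStruct(x: 0, y: 0), count: 1_000_000)
    let boxed = (0..<1_000_000).map { _ in PointClass(x: 0, y: 0) }

    // flat is roughly 8 MB in a single block; boxed is the pointer array plus
    // a million small heap allocations, plus retain/release traffic whenever
    // the elements get passed around.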


I had a C# project that was too slow and used too much RAM.

I can attest that structs use less memory; however, IIRC they don't have methods, so no GetHashCode(), which made them way too slow to insert into a HashSet or Dictionary.

In the end I used regular objects in a Dictionary. RAM usage was a bit higher than with structs (not unbearably so), but the speed improvement was massive.


1. Structs can have methods.

2. The primary value of value types is not to use less RAM (you just save a pointer, I guess times two because of GC) but the ability to avoid having to GC the things, since they are either on the stack or in contiguous chunks of memory, and to leverage CPU caches, as you can iterate over contiguous data rather than hopping around in the heap. Iterating over contiguous data can be a large constant factor faster than iterating over a collection of pointers to heap objects.


>I can attest that structs use less memory; however, IIRC they don't have methods, so no GetHashCode(), which made them way too slow to insert into a HashSet or Dictionary

You can and should implement IEquatable on a struct, especially if you plan on placing them in a HashSet; the default implementation will use reflection and will be slow, but it's easy to override.


I just checked; apparently structs can have methods. Is it a new thing, or was I just ignorant?


You could always have methods (for as long as I can remember at least; I started using .NET in the 3.0 days); you just can't inherit structs or use virtual methods, because structs don't have a virtual method table. You can implement interfaces, however, and override operators; it's very nice for implementing 3D graphics math primitives like vectors and matrices, way better than Java in this regard, which was what got me into C# way back then.


If your hardware enables more regular/efficient garbage collection, then it absolutely can lower memory consumption.

Given that the M1 chip was designed to better support reference counting, it makes sense that doing the same for GC could lead to a benefit.


> "M1 and memory efficiency"

Hi folks!

It looks like my blog post[1] was the primary source for this (it's referenced both by this post and by the Gruber post), and to be clear, I did not claim that this helps ARM Macs use less RAM than Intel Macs. I think John misunderstood that part and now it has blown up a bit...

I did claim that this helps Macs and iPhones use less RAM than most non-Apple systems, as part of Apple's general obsessiveness about memory consumption (really, really obsessive!). This part of the puzzle is how to get greater convenience for heap allocation.

Most of the industry has settled on tracing GCs, and they do really well in microbenchmarks. However, they need a lot of extra RAM to be competitive on a system level (see references in the blog post). OTOH, RC tends to be more frugal and predictable, but its Achilles heel, in addition to cyclic references, has always been the high cost of, well, managing all those counts all the time, particularly in a multithreaded environment where you have to do this atomically. Turns out, Apple has made uncontended atomic access about as fast as a non-atomic memory access on the M1.

This doesn't use less RAM, it decreases the performance cost of using the more frugal RC. As far as I can tell, the "magic" of the whole package comes down to a lot of these little interconnecting pieces, your classic engineering tradeoffs, which have non-obvious consequences over there and then let you do this other thing over here, that compensates for the problem you caused in this other place, but got a lot out etc. Overall, I'd say a focus on memory and power.

So they didn't add special hardware for NSObject, but they did add special hardware that also tremendously helps NSObject reference counting. And apparently they also added a special branch predictor for objc_msgSend(). 8-). Hey, 16 billion transistors, what's a branch predictor or two among friends.. ¯\_(ツ)_/¯

[1] https://blog.metaobject.com/2020/11/m1-memory-and-performanc...


Thanks for the clarification; it seems this post has gotten lost in the comments. You could try making a new post on your blog and adding it to HN as a new submission.


I can't understand why less RAM would be enough specifically on Apple Silicon rather than Intel. Has the argument been proven?

RAM capacity is just RAM capacity. Possibly Swift-made apps use less RAM compared to other apps, but the microarchitecture shouldn't matter.


> RAM capacity is just RAM capacity. Possibly Swift-made apps use less RAM compared to other apps, but the microarchitecture shouldn't matter.

My guess is that it's mostly faster swapping.

Microarchitecture could help, perhaps by making context switches faster.

But it could also be custom peripheral/DMA logic for handling swapping between RAM and NVM.

I think it makes sense. NVM should be fast enough that RAM only needs to act as more of a cache. But existing architectures carry a lot of legacy of treating NVM like just a hard drive. Intel is also working on this with its Optane-related architecture work.

You could also do on-the-fly compression of some kinds of data to/from RAM. But I haven't heard any clues that the M1 is doing that, and you'd need applications to give hints about what data is compressible.


I also believe this is the case.

Most experiments with 8GB M1 Macs I've seen so far (on YouTube) seem to start slowing down once the data cannot fit in RAM, although the rest of the system remains responsive, e.g. the 8K RED RAW editing test. In the same test with 4K RED RAW there was some stuttering on the first playback, but subsequent playbacks were smooth, which I guess was a result of swap being moved back into RAM.

My guess would be they've done a lot of optimization on swap, making swapping less of a performance penalty (as ridiculous as it sounds, I guess they could even use the Neural Engine to determine what should be put back into RAM at any given moment to maximize responsiveness).

macOS has been doing memory compression since Mavericks using the WKdm algorithm, but it has also supported a Hybrid Mode[1] on ARM/ARM64 using both WKdm and a variant of LZ4 for quite some time (WKdm compresses much faster than LZ4). I wouldn't be too surprised if the M1 has some optimization for LZ4. So far I haven't seen anybody test it.

It might be interesting to test M1 Macs with vm_compressor=2 (compression with no swap) or vm_compressor=8 (no compression, no swap)[2] and see how it runs. I'm not sure if there's a way to change bootargs on M1 Macs, though.

[1]: https://github.com/apple/darwin-xnu/blob/a449c6a3b8014d9406c...

[2]: https://github.com/apple/darwin-xnu/blob/a449c6a3b8014d9406c...


Exactly. Several reports since the launch have pointed out that both the memory bandwidth of the RAM and the SSDs are faster than before (by 2-3x in both cases, I think?).

Combined, that should make a big difference to swapping.


I don't quite see that either. But I suspect that it is just macOS itself which uses less memory than Windows in general.

But the much faster SSD-to-RAM transfer on the M1 means that shuffling stuff in and out of RAM is much faster, meaning RAM matters less.


On my almost-stock (primarily used for iOS deployment) MBP, Catalina uses 3GB of RAM (active+wired) on boot. That's much more than my Linux laptop (~400MB). I haven't booted Win10 recently, but I'd assume it'd be close to macOS.

The transfer speeds of the M1 SSDs have been benched at 2.7GB/s - about the same speed as mid-range NVMe SSDs (my ADATA SX8200 Pro and Sabrent Rocket are both faster and go for about $120/TB).


I think the OS's usage isn't the main factor for RAM. Browsers, Electron apps, professional apps like Photoshop, IntelliJ, and VMs should be.

I expected the SSD to be way faster, but benchmarks say its SSD is a bit below 3000MB/s R/W, not very fast but the usual Gen3 x4 speed.


I’m guessing swapping happens quicker, perhaps due to the unified memory architecture. With quicker swapping you’d be less likely to notice a delay.

That said, I'd still be very hesitant to buy an 8GB M1 Mac. When my iMac only had 8GB it was a real pain to use. Increasing my iMac's memory to 24GB made it usable.


The bottleneck for swap in/out must be the SSD, not memory. Also, its SSD isn't fast compared to other NVMe SSDs in either throughput or IOPS. Possibly latency is great thanks to the SSD controller being integrated into the M1, but I don't think that changes the game.


Is an SSD suitable for swap, given its limited number of write cycles?


When I looked into this, the information I found suggested that modern consumer SSDs generally have more than enough write cycles to spare for any plausible use case. Possibly this was more of an issue five to ten years ago.


I, for one, would love to see an M1 chip cripple itself under the memory load of 4 IntelliJ windows and Chrome tabs, among other applications.


That's my take as well; I have a fairly modern MacBook, and it's just fine; it's just that the software I run on it is far from ideal.

IntelliJ has a ton of features, but it's pretty heavyweight because of them; I'd like a properly built native IDE with user experience speed at the forefront. That's what I loved about Sublime Text. But I also like an IDE for all the help it gives me, things that the alternative editors don't do yet.

I've used VS Code for a while as well; it's faster than Atom at least, but it's still web technology, which feels like a TON of overhead that no amount of hardware can compensate for.

I've heard of Nova, but apparently I installed it and forgot to actually work with the trial, so I have no clue how well it works. I also doubt it works for my particular needs, which IntelliJ doesn't have a problem with (an old codebase with >10K LOC files of PHP 5.2 code and tons of bad JS, and a new codebase with Go and TypeScript / React).


If you want a faster IntelliJ experience, you can try disabling built-in plug-ins and features you don't need; power saving mode is one quick way to disable heavyweight features.


The problem with IntelliJ is the JVM: memory usage is just off the charts. And then combine it with the Gradle daemon, and it gets very hairy.


Remember Lisp machines? The M1 is a Swift machine.


I'm wondering if the "optimized for reference-counting" thing applies to other languages too, i.e. if I write a piece of software in Rust and make use of Rc<>, will Macs be extra tolerant of that overhead? In theory it seems like the answer should be yes.


I sure hope so. In macOS 10.15, the fast path for a retain on a (non-tagged-pointer) Obj-C object does a C11 relaxed atomic load followed by a C11 relaxed compare-and-exchange. This seems pretty standard for retain-release and I'd expect Rust's Rc<> to be doing something similar. It's possible Apple added some other black magic to the runtime in 10.16 (and they haven't released the 10.16 objc sources yet) but it's hard to imagine what they'd do that makes more sense than just optimizing for relaxed atomic operations.


(Rc does not use atomics, Arc does and from a peek at the source it does indeed use Relaxed)


I didn't understand why the implementation wouldn't just do an atomic increment, but I guess Obj-C semantics provide too much magic to permit such a simple approach. The actual code, in addition to [presumably] not being inlined, does not seem easy to optimize at the hardware level: https://github.com/apple/swift-corelibs-foundation/blob/main...

The native Swift retain (swift_retain above) seems to be somewhere inside this mess: https://github.com/apple/swift/blob/main/stdlib/public/runti...


What you’ve linked is CoreFoundation’s retain, Objective-C’s can be found in https://opensource.apple.com/source/objc4/objc4-787.1/runtim... (look for objc_object::rootRetain).

The short answer for why it can’t just be an increment is because the reference count is stored in a subset of the bits of the isa pointer, and when the reference count grows too large it has to overflow into a separate sidetable. So it does separate load and CAS operations in order to implement this overflow behavior.
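
For anyone who doesn't want to wade through the runtime source, here is a heavily simplified sketch of that load-then-CAS shape, using the swift-atomics package; the bit layout, the names and the side-table stand-in are invented and don't match the real isa packing.

    import Atomics

    // Pretend the low 8 bits of the word are other metadata and the upper bits
    // are the inline reference count. (The real isa packs many more fields.)
    let countShift = 8
    let countMax = Int.max >> countShift

    let isaBits = ManagedAtomic<Int>(1 << countShift)   // refcount starts at 1

    func sketchRetain(overflowToSideTable: () -> Void) {
        var old = isaBits.load(ordering: .relaxed)
        while true {
            if (old >> countShift) >= countMax {
                // Inline bits are full: the real runtime spills the count into
                // a side table keyed by the object's address.
                overflowToSideTable()
                return
            }
            let new = old + (1 << countShift)
            let (exchanged, current) = isaBits.compareExchange(
                expected: old, desired: new, ordering: .relaxed)
            if exchanged { return }
            old = current   // another thread won the race; retry with its value
        }
    }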


I’d think the bigger win would have to be in the release part of the code, which actually cares about contention.


No, because Rc<> isn't atomic. Arc<>, however, would get the benefit. The reason retain/release in ObjC/Swift are so much faster here is because they are atomic operations.


So, useful for structural sharing between threads.


Yes, it applies to everything that uses atomics and is not something special in the runtime. It's also worth noting that this is an optimization iPhones have had for the last few years, ever since they switched to ARM64.


The fast atomics are relatively new.


If anything, an Objective-C machine: they apparently have a special branch predictor for objc_msgSend()!

But they learned the lesson from SOAR (Smalltalk On A RISC) and did not follow the example of LISP machines, Smalltalk machines, Java machines, Rekursiv, etc. and build specialized hardware and instructions. The benefits of the specialization are much, much less than expected, and the costs of being custom very high.

Instead, they gave the machine large caches and tweaked a few general purpose features so they work particularly well for their preferred workloads.

I wonder if they made trap on overflow after arithmetic fast.


But how? It has an Arm CPU - how does it differ from any other machine with a 64 bit Arm CPU?


Remember, ARM is an instruction set, aka an interface, not an implementation. Arm Holdings does license their tech to other companies though, so I don't know how much Apple silicon would have in common with, say, a Qualcomm CPU. They may be totally different under the hood.


Yes, Apple's microarchitecture is indeed completely their own, while Qualcomm uses rebranded/tweaked Arm Cortex cores.


Basically reference counting requires grabbing a number from memory in one step, then increasing or decreasing it and storing it in a second step.

This is two operations and in-between the two -- if and only if the respective memory location is shared between multiple cores or caches -- some form of synchronization must occur (like locking a bank account so you can't double draft on two ATMs simultaneously).

Now the way this is implemented varies a bit.

Apple controls most of the hardware ecosystem, programming languages, binary interface, and so on meaning there is opportunity for them to either implement or supplement ARM synchronization or atomicity primitives with their own optimizations.

There is nothing really preventing Intel from improving here as well -- it is just a bit easier on ARM because the ISA has different assumptions baked in, and Apple controls everything upstream, such as the compiler implementations.
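
A minimal sketch (Rust used just for illustration) of the two-step problem described above, and of the single atomic read-modify-write that replaces it once another core can touch the count:

    use std::sync::atomic::{AtomicUsize, Ordering};

    // Racy version: the load and the store are two separate steps, so two
    // cores can both read 5 and both write back 6, losing an increment.
    unsafe fn retain_racy(count: *mut usize) {
        unsafe {
            let n = *count;   // step 1: read the current count
            *count = n + 1;   // step 2: write the incremented count back
        }
    }

    // Atomic version: the read-modify-write appears as one indivisible step,
    // which is exactly the part that needs cross-core coordination.
    fn retain_atomic(count: &AtomicUsize) {
        count.fetch_add(1, Ordering::Relaxed);
    }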


The M1 is much more than a CPU, and a CPU is much more than an instruction set


I know it's a fast Arm CPU - I've read the Anandtech analysis etc - and that there is lots of extra hardware on the SoC. But the specific point was why is it a Swift machine. What makes it particularly suited to running Swift?


The exact reason is out of my depth, but the original quote makes it clear that there is something. Memory bandwidth would be one possibility.


My guess:

1. "weak" memory ordering (atomic Aquire/Release loads/stores)

2. low memory latencies between cache and system memory (so dirty cache lines are written back/updated faster, etc.)

3. potentially a coherency implementation optimized for this kind of atomic access (purely speculative: e.g. maybe sometimes updating less than a full page in a cache when detecting certain atomic operations which changed it, or maybe w.r.t. the window marked for exclusive access in the context of LL/SC operations and similar)

Given that it's common for an object to stay in the same thread, I'm not sure how much 2 matters for this point (but it does matter for general performance). I'd guess there is a lot in 3, though, where especially with low-latency RAM you might be able to improve performance for these cases.


These are interesting points. I'd like to hazard a guess that the leading contributor is cache-related. Just looking at https://en.wikipedia.org/wiki/Reference_counting suggests as much: "Not only do the operations take time, but they damage cache performance and can lead to pipeline bubbles."

I roughly understand how refcounting causes extra damage to cache coherency: anywhere that a refcount increment is required, you mutate the count on an object before you use it, and then decrement it later. Often times, those counting operations are temporally distant from the time that you access the object contents.

I do not really understand the "pipeline bubbles" part, and am curious if someone can elaborate.

Reading on in the wiki page, they talk about weak references (completely different than weak memory ordering referenced above). This reminds me that Cocoa has been making ever more liberal use of weak references over the years, and a lot of iOS code I see overuses them, particularly in blocks. I last looked at the objc implementation years ago, but it was some thread safe LLVM hash map split 8 or 16 ways to reduce lock contention. My takeaway was roughly, "wow that looks expensive". So while weak refs are supposed to be used judiciously, and might only represent 1% or less of all refs, they might each cost over 100x, and then I could imagine all of your points could be significant contributors.

In other words, weak references widen the scope of this guessing game from just "what chip changes improve refcounting" to "what chip changes improve parallelized, thread safe hash maps."


The "pipeline bubbles" remark refers to the decoding unit of a processor needing to insert no-ops into the stream of a processing unit while it waits for some other value to become available (another processing unit is using it). For example, say you need to release some memory in a GC language, you would just drop the reference while the pipeline runs at full speed (leave it for the garbage collector to figure out). In an refcount situation, you need to decrease the refcount. Since more than one processing unit might be incrementing and decrementing this refcount at the same time, this can lead to a hot spot in memory where one processing unit has to bubble for a number of clock cycles until the other has finished updating it. If each refcount modify takes 8 clock cycles, then refcounting can never update the same value at more than once per 8 cycles. In extreme situations, the decoder might bubble all processing units except one while that refcount is updated.

For the last few decades the industry has generally believed that GC lets code run faster, although it has drawbacks in terms of being wasteful with memory and unsuitable for hard-realtime code. Refcounting has been thought inferior, although it hasn't stopped the Python folks and others from being successful with it. It sounds like Apple uses refcounting as well and has found a way to improve refcounting speed, which usually means some sort of specific silicon improvement.

I'd speculate that moving system memory on-chip wasn't just for fewer chips, but also for decreasing memory latency. Decreasing memory latency by having a cpu cache is good, but making all of ram have less latency is arguably better. They may have solved refcounting hot spots by lowering latency for all of ram.

From Apple's site:

"M1 also features our unified memory architecture, or UMA. M1 unifies its high-bandwidth, low-latency memory into a single pool within a custom package. As a result, all of the technologies in the SoC can access the same data without copying it between multiple pools of memory." That is paired with a diagram that shows the cache hanging off the fabric, not the CPU.

That says to me that, similar to how traditionally the cpu and graphics card could access main memory, now they have turned the cache from a cpu-only resource into a shared resource just like main memory. I wonder if the GPU can now update refcounts directly in the cache? Is that a thing that would be useful?


Extremely low memory latency is another. It also has 8 memory channels; most desktops have 2. It's an aggressive design; Anandtech has a deep dive. Some of the highlights: lower-latency cache, larger reorder buffer, more in-flight memory operations, etc.


Isn’t a typical desktop. 64-bit bus meaning this would be 4 channels because it’s 128-bit?


Typical desktops have two 64-bit DIMMs on either 2 channels (64 bits wide each) or 1 channel (128 bits wide).

The M1 Macs seem to be 8 channels x 16 bits, which is the same total width as a desktop (although running the RAM at 4266 MHz is much higher than usual). The big win is you can have 8 cache misses in flight instead of 2. With 8 CPU cores, 8 GPU cores, and 16 Neural Engine cores, I suspect the M1 has more in-flight cache misses than most.


> or 1 channel (128 bits wide)

The DDR4 bus is 64-bit, how can you have a 128-bit channel??

Single channel DDR4 is still 64-bit, it's only using half of the bandwidth the CPU supports. This is why everyone is perpetually angry at laptop makers that leave an unfilled SODIMM slot or (much worse) use soldered RAM in single-channel.

> The big win is you can have 8 cache misses in flight instead of 2

Only if your cache line is that small (16 bit) I think? Which might have downsides of its own.


> The DDR4 bus is 64-bit, how can you have a 128-bit channel??

Less familiar with what's normal on laptops, but most desktop chips from AMD and Intel have two 64-bit channels.

> Which might have downsides of its own.

Typically for each channel you send an address (a row and column, actually), wait for the DRAM latency, and then get a burst of transfers (one per bus cycle) of the result. So for a 16-bit-wide channel @ 3.2 GHz with a 128-byte cache line you get 64 transfers, one every 0.3125 ns, for a total of 20ns.

Each channel operates independently, so multiple channels can each have a cache miss in flight. Otherwise nobody would bother with independent channels and just stripe them all together.

Here's a graph of cache line throughput vs number of threads.

https://github.com/spikebike/pstream/blob/master/png/apple-m...

So with 1 and 2 threads you see an increase in throughput; the multiple channels are helping. 4 threads is the same as 2, so maybe the L2 cache has a bottleneck. But 8 threads is clearly better than 4.


> two 64 bit channels

Yeah, I'm saying you can't magically unify them into a single 128-bit one. If you only use a single channel, the other one is unused.


It's pretty common for hardware to support both. On the Zen1 Epycs, for instance, some software preferred the consistent latency of striped memory over the NUMA-aware layout with separate channels, where the closer DIMMs have lower latency and the farther DIMMs have higher.

I've seen similar on Intel servers, but not recently. This isn't, however, typically something you can do at runtime, just at boot time, at least as far as I've seen.


I don't know anything about memory access.

But doesn't that only help if you have parallel threads doing independent 16 bit requests? If you're accessing a 64 bit value, wouldn't it still need to occupy four channels?


Depends. Cache lines are typically 64-128 bytes long, and depending on various factors they might be served from one memory channel or striped across multiple memory channels, somewhat like a RAID-0 disk. I've seen servers (Opterons, I believe) that would allow mapping memory per channel or across channels based on settings in the BIOS. Generally a non-NUMA-aware OS ran better with striped memory, and NUMA-aware OSes ran better non-striped.

So striping a cache line across multiple channels does increase bandwidth, but not by much. If the DRAM latency is 70ns (not uncommon) and your memory is running at 3.2 GHz on a single 64-bit-wide channel, you get 128 bytes in 16 transfers. 16 transfers at 3.2 GHz = 5ns. So you get a cache line back in 75ns. With two 64-bit channels you can get 2 cache lines per 75ns.

So now with a 128-bit-wide channel (twice the bandwidth) you wait 70ns, then get 8 transfers @ 3.2 GHz = 2.5ns. So you get a cache line back in 72.5ns. Clearly not a big difference.

So the question becomes: for a complicated OS with a ton of cores, do you want one cache line per 72.5ns (the striped config) or two cache lines per 75ns (the non-striped config)?

In the 16-bit, 8-channel case (assuming the same bus speed and latency) you get 8 cache lines per 90ns. However, I'm not sure what magic Apple has, but I'm seeing very low memory latencies on the M1, on the order of 33ns! With all cores busy I'm seeing cache line throughput of about one cache line per 11ns.
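
A quick back-of-the-envelope check on those figures (a toy calculation using the 70ns latency and 3.2 GT/s rate assumed above, not measured data):

    // Simple model: time for one cache line =
    //   DRAM latency + number_of_transfers / transfer_rate,
    // where number_of_transfers = line_bytes / channel_width_in_bytes.
    fn cache_line_time_ns(dram_latency_ns: f64, channel_bits: u32,
                          line_bytes: u32, transfers_per_sec: f64) -> f64 {
        let transfers = line_bytes as f64 / (channel_bits as f64 / 8.0);
        dram_latency_ns + transfers / transfers_per_sec * 1e9
    }

    fn main() {
        let rate = 3.2e9; // 3.2 GT/s
        println!("64-bit channel:  {:.1} ns", cache_line_time_ns(70.0, 64, 128, rate));  // ~75 ns
        println!("128-bit channel: {:.1} ns", cache_line_time_ns(70.0, 128, 128, rate)); // ~72.5 ns
        println!("16-bit channel:  {:.1} ns", cache_line_time_ns(70.0, 16, 128, rate));  // ~90 ns
    }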


I believe modern superscalar architectures can run instructions out of order if they don't rely on the same data, so when paused waiting for a cache miss, the processor can read ahead in the code, and potentially find other memory to prefetch. I may be wrong about the specifics, but these are the types of tricks that modern CPUs employ to achieve higher speed.


Sure, but generally a cache line miss will quickly stall the core. You might have a few non-dependent instructions in the pipeline, but for a CPU running at 3+ GHz, waiting 70ns is an eternity. Doubly so when you can execute multiple instructions per cycle.


You have to consider that a DRAM delivers a burst, not a single word. Usually the channel width times the burst length equals the cache line size.


I doubt this is relevant, but a typical desktop has 2x64-bit channels and the M1 has either 4x32-bit or 8x16-bit channels.


Anandtech claims to have identified 8 x 16-bit channels on the die, and the part number is compatible with 4 LPDDR4X chips that support 2 channels each.


Memory bandwidth costs a lot of power, so I think the idea is to reduce the need for unnecessary memory ops, both saving power and reducing latency.


That's true, but for it to be a "Swift machine" as mentioned above would imply some kind of ISA-level design choices, as opposed to "just" being extremely wide or having a branch predictor that understands what my favourite food is.


I think they were using hyperbole to make a point.


I doubt Apple allows anyone with the knowledge to speak about it.


This is where it's worth pointing out that Apple is an ARM architecture licensee. They're not using a design from ARM directly, they're basically modifying it however it suits them.


Indeed, they’re an ISA licensee, and I don’t think they’re using designs from ARM at all. They beat ARM to the first ARM64 core back in 2013 with the iPhone 5s.


I don't think this applies to good software. Nobody will retain/release something in a tight loop. And typical retain/releases don't consume much time. Of course it improves metrics like any other micro-optimization, so it's good to have it, but that's about it.


Retains and releases often happen across function call boundaries.


Taking that as true for a moment, I wonder what other programming languages get a benefit from Apple's silicon then? PHP et al. use reference counting too; do they get a free win, or is there something particular about Obj-C and Swift?


Not the specific thing mentioned in the article. There is other hardware that optimizes Objective-C specifically, but it’s a branch hint.


Any links or other info about that?

Sounds really cool.



That's really impressive... and it really vindicates the whole "if you're serious about software, make your own hardware too" idea.


Android phones are built on managed code, but PCs are built mostly on C/C++ (almost all productivity apps, browsers, games, the operating system itself). And the only GC code most people run is garbage-collected on Apple too - it's JavaScript on the web.

I'm not familiar with macOS; are the apps there mostly managed code? Even if they were, and even if refcounting on a Mac is that much faster than refcounting on a PC, refcounted code would still lose to manual memory management on average.


How does this work? Isn't reference counting a lot of +1 and -1?


It is a lot of atomic +1 and -1, meaning possible thread contention, meaning that no matter how many cores your hardware has, you have a worst-case scenario where all your atomically reference-counted objects have to be serialized, slowing everything down. I do not know how Objective-C/Swift deals with this normally, but making that operation as fast as possible in hardware can have huge implications in real life, as evidenced by the new Macs.


On Intel, atomically incrementing a counter is always sequentially consistent. ARM might get away with weaker barriers.


It's a lot of +1/-1 on atomic variables, done with atomic memory operations (mainly with Acquire/Release ordering) on memory which might be shared between threads.

So low latency between the caches and system RAM can help here, at least for cases where the Rc is shared between threads, but also where the Rc is not shared between threads and the thread gets moved to a different CPU core. Still, it's probably not the main reason.

Given how atomics might be implemented on ARM, and that the cache and memory are on the same chip, my main guess is that they did some optimizations in the coherency protocol/implementation (which keeps the memory between caches and the system memory/RAM coherent). I believe there is a bit of potential to optimize for RC, i.e. to make that usage pattern of atomics fast. Lastly, they probably take special care that the atomic instructions used by Rc (mostly fetch_add/fetch_sub) are implemented as efficiently as possible.
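
For what it's worth, the decrement side is where the orderings really matter; Rust's Arc, for example, uses roughly this pattern (a sketch, not the exact standard-library code):

    use std::sync::atomic::{fence, AtomicUsize, Ordering};

    // Release side of an atomic refcount: a Release decrement, and only the
    // thread that drops the last reference pays for an Acquire fence before
    // tearing the object down.
    fn release(strong: &AtomicUsize, destroy: impl FnOnce()) {
        if strong.fetch_sub(1, Ordering::Release) != 1 {
            return; // other references remain; nothing more to do
        }
        fence(Ordering::Acquire); // see every write made before earlier releases
        destroy();                // last reference gone: free the object
    }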


Just to be clear, the RAM/memory and cache are not on the same chip/die/silicon. They are part of the same packaging though.

> which keeps the memory between caches and the system memory/RAM coherent

Isn't this already true of every multi-core chip ever designed? The whole point of coherency is to keep the RAM/memory coherent between all the cores and their caches.


Oh, right.

> Isn't this already true of every multi-core chip ever designed;

Yes, I just added the explanation of what coherency is in this context as I'm not sure how common the knowledge about it is.

The thing is, there are many ways you can implement this (and related things), with a number of parameters involved which can probably be tuned to optimize for RC's typical usage of atomic operations. (Edit: Just to be clear, there are constraints on the implementation imposed by it being ARM compatible.)

A related example (not directly atomic fetch_add/fetch_sub, and not directly coherency either) would be the way LL/SC operations are implemented. On ARM you have a parameter for how large the memory region "marked for exclusive access" (by an LL load) is. This can have major performance implications, as it directly affects how likely a conditional store is to fail because of accidental interference.


> running on silicon optimized to make reference counting as efficient as possible

I'm curious to understand this. Is this because of a specific instruction set support, or just the overall unified memory architecture?


> This, in a nutshell, helps explain why iPhones run rings around even flagship Android phones,

For the price, it better run circles and squares. It should cook my dinner too.


A new 2020 iPhone SE costs US$399 and outperforms flagship Android phones that cost much more:

https://www.androidcentral.com/cheapest-iphone-has-more-powe...

https://www.androidauthority.com/iphone-se-vs-most-powerful-...


At the hardware level, does this mean they have a much faster TLB than competing CPUs, perhaps optimized for the patterns in which NSObjects are allocated? Speaking of which, does Apple use a custom malloc or one of the popular implementations like the C library's malloc, tcmalloc, jemalloc, etc.?



I don't think this really makes sense. How many of the benchmarks that people have been running are written in Objective-C? They're mostly hardcore graphics and maths workloads that won't be retaining and releasing many NSObjects.


I agree. I think it's typical of cargo-culting: explanations don't need to make sense; it's all about the breathless enthusiasm.

Look, want to know how the M1 achieves its results? Easy. Apple is first with a 5nm chip. Look at the past: every CPU maker gains both speed and power efficiency when going down a manufacturing node.

Intel CPUs were still using a 14nm node (albeit a heavily refined one) while Apple's M1 is now at 5nm. According to this [1] chart, that's at least a 4x difference in transistor density.

I'm not saying Apple has no CPU design chops; they've been at it for their phones for quite a while. But people are just ignoring the elephant in the room: Apple gave TSMC a pile of cash to get exclusive mass production on its latest 5nm tech.

   [1] https://www.techcenturion.com/7nm-10nm-14nm-fabrication#nbspnbspnbspnbsp7nm_vs_10nm_vs_12nm_vs_14nm_Transistor_Densities


The bit about reference counting being the reason that Macs and iOS devices get better performance with less ram makes no sense. As a memory management strategy, reference counting will always use more ram because a reference count must be stored with every object in the system. Storing all of those reference counts requires memory.

A reference counting strategy would be more efficient in processor utilization compared to garbage collection as it does not need to perform processor intensive sweeps through memory identifying unreferenced objects. So reference counting trades memory for processor cycles.

It is not true that garbage collection requires more ram to achieve equivalent performance. It is in fact the opposite. For programs with identical object allocations, a GC based system would require less memory, but would burn more CPU cycles.


“A reference counting strategy would be more efficient in processor utilization compared to garbage collection as it does not need to perform processor intensive sweeps through memory identifying unreferenced objects. So reference counting trades memory for processor cycles.”

I think it’s the reverse.

Firstly, garbage collection (GC) doesn’t identify unreferenced objects, it identifies referenced objects (GC doesn’t collect garbage). That’s not just phrasing things differently, as it means that the amount of garbage isn’t a big factor in the time spent in garbage collection. That’s what makes GC (relatively) competitive, execution-time wise. However, it isn’t competitive in memory usage. There, consensus is that you need more memory for the same performance (https://people.cs.umass.edu/~emery/pubs/gcvsmalloc.pdf: with five times as much memory, an Appel-style generational collector with a non-copying mature space matches the performance of reachability-based explicit memory management. With only three times as much memory, the collector runs on average 17% slower than explicit memory management)

(That also explains why iPhones can do with so much less memory than phones running Android)

Secondly, the textbook implementation of reference counting (RC) in a multi-processor system is inefficient because modifying reference counts requires expensive atomic instructions.

Swift programs spend about 40% of their time modifying reference counts (http://iacoma.cs.uiuc.edu/iacoma-papers/pact18.pdf)

So, reference counting gets better memory usage at the price of more atomic operations = less speed.

That last PDF describes a technique that doubles the speed of RC operations, decreasing that overhead to about 20-25%.

It wouldn’t surprise me if these new ARM macs use a similar technique to speed up RC operations.

It might also help that the memory model of ARM is weaker than that of x64, but I’m not sure that’s much of an advantage for keeping reference counts in sync across cores.


> reference counting will always use more ram because a reference count must be stored

True, reference counting stores references… but garbage collection stores garbage, which is typically bigger than references :)

(Unless you’re thinking of a language where the GC gets run after every instruction - but I’m not aware of any that do that, all the ones I know of run periodically which gives garbage time to build up)


No, many garbage collection approaches WILL require more RAM; some need twice as much RAM to run efficiently. Then there is the fact that garbage collection can have a delay, which lets garbage pile up, thus using more memory than necessary. Retain-release as used by Apple is not as CPU-efficient, but you reclaim memory faster. https://www.quora.com/Why-does-Garbage-Collection-take-a-lot...


I think it's more a design pattern you see with GC'd languages, where people instantiate objects with pretty much every action and then let the GC handle the mess afterward. Every function call involves first creating a parameters object, populating it, then forgetting about it immediately afterward.

I've seen this with java where the memory usage graph looks like a sawtooth, with 100s of MB being allocated and then freed up a couple of seconds later.


Isn't it the case with Java that it will do this because you do have the memory to spend on it? Generally this "handling the mess afterward" involves some kind of nursery or early generation these days, but their size may be use-case-dependent. If tuned for a 8/16 GB environment, presumably the "sawtooth" wouldn't need to be as tall.


Yes, but the reason it does it is to make it fast. The shorter your sawtooth teeth, the slower your app.


Slower in what metrics? Latency? Throughput? Not to mention that the behavior may strongly depend on the GC design and the HW platform in question. It seems far too difficult to make a blanket statement about what is and isn't achievable in a specific use case.


> reference counting will always use more ram because a reference count must be stored with every object in the system

In the tracing GCs I have seen, an "object header" must be stored with every object in the system; the GC needs it to know which parts of the object are references which should be traced. So while reference counting needs extra space to store the reference count, tracing GC needs extra space to store the object header.
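
For a sense of the overhead being compared: a refcounted allocation carries a word or two of counts next to the payload, roughly like this sketch (modeled loosely on what Rust's Arc allocates, not the exact std layout):

    use std::sync::atomic::AtomicUsize;

    // Roughly what one refcounted heap allocation carries: two counter words
    // in front of the payload. A tracing GC instead puts an object header
    // (type info / mark bits) in front, so both schemes pay a per-object
    // overhead of a similar order.
    #[allow(dead_code)]
    struct RcBoxSketch<T> {
        strong: AtomicUsize, // number of owning references
        weak: AtomicUsize,   // number of weak references
        value: T,            // the payload itself
    }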



