M1 Memory and Performance (metaobject.com)
66 points by mpweiher on Nov 13, 2020 | 36 comments



I don't see the reason for the focus on the amount of memory, given the potential theoretical advantages of a unified memory pool that extends beyond the DRAM. PCIe Gen 4 SSDs have obscene data rates and tons of IOPS. In fact, "the whole SSD is now an extension of your RAM" is a selling point of the new consoles.

Is it as fast as having more DRAM in all possible scenarios? No, but with good memory management the real-world performance might very well match that of a system with more DRAM, and even exceed traditional memory systems with larger non-unified pools.


Well, DRAM still has 2+ orders of magnitude lower latency than SSDs. Whether that matters clearly depends on your application. On a desktop system switching tabs in your browser, swapping the content in from the SSD will be fine. An ML application on a server accessing memory quasi-randomly won't be.
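Ballpark numbers for scale (assumptions, not measurements of any particular system): DRAM access is on the order of ~100 ns, while a good NVMe random read is tens of microseconds, so the gap really is a few hundred times:

  /* Rough latency comparison; both figures are order-of-magnitude
     assumptions, not measurements. */
  #include <stdio.h>

  int main(void) {
      double dram_ns = 100.0;     /* ~100 ns for a DRAM access      */
      double nvme_ns = 50000.0;   /* ~50 us for an NVMe random read */
      printf("SSD/DRAM latency ratio: ~%.0fx\n", nvme_ns / dram_ns);
      return 0;
  }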


Using SSDs for ML is exactly what vendors like NVIDIA and AMD are trying to solve, because the idea that everything must fit in RAM is a fallacy: you will never have as much RAM as you want/need compared to the size of your datasets.

https://developer.nvidia.com/blog/gpudirect-storage/

It's the same fallacy as 'RAM size impacts video editing'; it doesn't.

8K RED raw is 4.374 terabytes per hour... at that point it doesn't matter whether you have 8, 16 or even 256 GB of RAM; if your storage is too slow to support it, you'll drop frames and/or have huge seek times. More memory won't really help; there is no amount of prefetching that can bridge gaps that big.

There is a huge advantage to having a unified memory architecture where the application can address a single memory address space to access its data, and you do not need to perform memory copies, allocations and all that stuff between your storage, CPU and GPU...
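For scale, taking the 4.374 TB/hour figure above at face value, that is a sustained rate of roughly 1.2 GB/s, which is why storage throughput rather than RAM capacity is the binding constraint:

  /* Back-of-envelope conversion of 4.374 TB/hour to GB/s. */
  #include <stdio.h>

  int main(void) {
      double tb_per_hour = 4.374;
      double gb_per_s = tb_per_hour * 1000.0 / 3600.0;  /* TB->GB, h->s */
      printf("~%.1f GB/s sustained\n", gb_per_s);        /* ~1.2 GB/s   */
      return 0;
  }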


>It's the same fallacy as 'RAM size impacts video editing'; it doesn't.

Computers are used for much more than just video editing... for fluid simulation, RAM definitely has an impact. And talking about SSDs (QLC): if they run out of SLC cache they are sometimes slower than HDDs. And I bet your video editing is faster if you have 4 TB of RAM.


Wouldn't having more RAM give the process more runway to prefetch data into RAM before it's needed, without new prefetches evicting unprocessed, older prefetched data?


Not really; you'll run out of your buffer regardless of its size if it is slower to fill than to go through.

It also doesn’t help you at all for random access.
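A quick sketch of why a bigger buffer only delays the problem when the fill rate can't keep up (the rates here are illustrative assumptions):

  /* If the consumer drains data faster than storage can refill it, any
     prefetch buffer empties after buffer / (consume - fill) seconds,
     regardless of its size. Example rates are assumptions. */
  #include <stdio.h>

  int main(void) {
      double consume_gbps = 1.2;  /* e.g. 8K raw playback             */
      double fill_gbps    = 0.8;  /* storage that can't quite keep up */
      double buffer_gb    = 64.0; /* generous prefetch buffer         */
      printf("Buffer lasts ~%.0f s, then frames drop\n",
             buffer_gb / (consume_gbps - fill_gbps));
      return 0;
  }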


>Is it as fast as having more DRAM in all possible scenarios?

Is it as fast as having more DRAM in any scenario?


Yes, and based on non-Apple systems, it's faster than having much larger pools of non-unified memory.

Essentially all the big compute vendors are moving towards memory unification and cache-coherent interfaces.


OK, I misread the numbers somehow. This is interesting - I wonder what it means for the future of conventional desktop computing, considering peripheral chips like GPUs that already have DMA capability.


GPUs are going beyond DMA. DMA still needs memcopies and/or address-space translations, which add latency and slow things down; with unified memory you in theory have a single address space and cache coherency across everything, which means you can operate at the maximum bandwidth of all interfaces without wasting any cycles.
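A loose single-process analogy (not an actual GPU code path): the staging copy below stands in for the extra host-to-device memcpy that a single shared address space avoids.

  /* Analogy only: "staging" plays the role of a separate device address
     space that data must be copied into; "shared" is what a unified,
     cache-coherent address space lets you do instead. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  int main(void) {
      size_t n = 256UL * 1024 * 1024;   /* 256 MB payload */
      unsigned char *src = malloc(n);
      unsigned char *staging = malloc(n);
      if (!src || !staging) return 1;
      memset(src, 0xAB, n);

      memcpy(staging, src, n);      /* non-unified: extra full pass    */
      unsigned char *shared = src;  /* unified: just hand over pointer */

      printf("copied %zu bytes vs. sharing %p\n", n, (void *)shared);
      free(staging);
      free(src);
      return 0;
  }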


Obviously so based on Apple’s numbers.


Not just Apple; we have multiple other examples of a unified memory pool outperforming larger non-unified memory pools, just due to the overhead of having to copy memory and manage multiple address spaces, with all the translation and lookups you need to do, etc.

We have cheap consumer SSDs today offering 5GB/s of read performance and more than enough I/O to satisfy caching.


This is absolutely a factor, though one I didn't focus on in this post. The sorts of large binary blobs that can take a lot of memory tend to be, well, large and contiguous, so perfect for moving between SSD and RAM at maximum speed.

While I haven't seen benchmarks yet, the claim was 2x relative to their current systems, IIRC, which were already crazy fast at > 2GB/s.

So it would probably do OK even just plain swapping, but really, really well moving these large blobs in and out, for example using mmap() and madvise().
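For reference, a minimal sketch of that mmap()/madvise() pattern for a large read-only blob (POSIX, error handling trimmed; not code from the article):

  /* Map a large blob and hint the kernel about the access pattern. */
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  void *map_blob(const char *path, size_t *len_out) {
      int fd = open(path, O_RDONLY);
      if (fd < 0) return NULL;
      struct stat st;
      if (fstat(fd, &st) < 0) { close(fd); return NULL; }
      void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
      close(fd);                                /* mapping survives close   */
      if (p == MAP_FAILED) return NULL;
      madvise(p, (size_t)st.st_size, MADV_SEQUENTIAL); /* read ahead        */
      *len_out = (size_t)st.st_size;
      return p;                                 /* munmap() when done       */
  }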


PCIe Gen 4 gives you 5 GB/s sequential reads and even random reads well above 4 GB/s; this isn't a page file on an ATA133 drive.

In theory this can also be extended over TB/USB4, and of course it can be extended over the network (however, that's the slowest part). This is going to be quite a big thing, especially for use cases like video editing, if they offer true unified memory, as a lot of the latency doesn't necessarily come from the interface itself but from memcopies and translating between multiple address spaces.


Does shelf life play a role in this? My understanding was that SSDs have a short lifespan [0], so a focus on memory might be better for long-term planning and reliability.

[0] https://www.solarwindsmsp.com/blog/ssd-lifespan


Not in any realistic sense.

Also, the big culprit in shortening the life of an SSD is writes. Unified memory doesn't mean you have to write to the SSD more often; in fact it means the opposite. Unified memory isn't a page file.


Is "unified" the same thing as "shared" outlined here?

https://en.wikipedia.org/wiki/Shared_graphics_memory

I'm curious whether the same drawback mentioned for shared memory also applies to unified memory:

> A side effect of this is that when some RAM is allocated for graphics, it becomes effectively unavailable for anything else, so an example computer with 512 MiB RAM set up with 64 MiB graphics RAM will appear to the operating system and user to only have 448 MiB RAM installed.


>Is "unified" the same thing as "shared" outlined here?

Yes. Previously, *shared* was used in the context of, and with the goal of, cost savings. But *unified* now has the goal of higher performance.


You pay roughly 1 ns per foot of light travel. Closer to the CPU means lower worst-case, speed-of-light-bound latency.
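Back-of-envelope check on the rule of thumb (vacuum speed of light; signals in copper or silicon are slower still):

  /* Time for light to travel one foot in vacuum: roughly 1 ns. */
  #include <stdio.h>

  int main(void) {
      double c = 299792458.0;   /* m/s    */
      double foot = 0.3048;     /* meters */
      printf("~%.2f ns per foot\n", foot / c * 1e9);  /* ~1.02 ns */
      return 0;
  }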


I guess Apple's ARC needs all the help it can get to catch up with tracing GCs; there is a reason why everyone else in the industry went with them.

https://github.com/ixy-languages/ixy-languages


Rust doesn't do tracing GC, and Nim is moving to an ARC system I hear.


Borrow checking isn't reference counting and Nim still needs to prove itself.


Borrow checking isn't tracing GC. That's all I said and all I meant.


Rust can use reference counting though, in addition to borrow checking.


Just as it can use a tracing GC, like in Servo interop.


GCs are faster if you have 4 times the needed RAM. Most everyone in the industry doesn't care about wasting RAM.


Nope, reference counting is a GC algorithm, and having a tracing GC doesn't preclude other forms of memory management, e.g. D or .NET Native.


Mixing a tracing GC with other forms of memory management is an open research project. That doesn't mean it hasn't been "done"; for example, Google has talked about getting their GCs to interoperate in Chrome, but it's (a) really hard and (b) so far the results are, at best, "mixed".


Let's see when Swift catches up with .NET 5's improvements.

Apple's talk about RC over GC is just marketing speak for the failure to have a tracing GC in Objective-C that wouldn't crash left and right when integrated with C-like code.

It was a very sound decision given Cocoa semantics and the difficulty of making anything written in C not fall apart with segfaults, but let's not oversell it.

Likewise, Swift's RC makes sense given the need to integrate with the Objective-C runtime and the existing ecosystem, but again, that is all there is to it.

There is no RC implementation with performance comparable to tracing-GC languages that isn't just yet another tracing GC, given the amount of runtime support needed to make it actually fast.
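For context on where that per-operation cost comes from, here is what minimal thread-safe reference counting looks like in C11; a generic sketch, not Apple's ARC, but it shows the atomic increment/decrement that every retain/release pays for:

  /* Minimal atomic reference counting in C11; a sketch of the general
     technique, not Apple's ARC. Every retain/release is an atomic RMW. */
  #include <stdatomic.h>
  #include <stdlib.h>

  typedef struct {
      atomic_size_t refcount;
      /* payload ... */
  } object_t;

  object_t *obj_new(void) {
      object_t *o = calloc(1, sizeof *o);
      if (o) atomic_init(&o->refcount, 1);
      return o;
  }

  void obj_retain(object_t *o) {
      atomic_fetch_add_explicit(&o->refcount, 1, memory_order_relaxed);
  }

  void obj_release(object_t *o) {
      if (atomic_fetch_sub_explicit(&o->refcount, 1,
                                    memory_order_acq_rel) == 1)
          free(o);  /* last reference gone */
  }

Tracing GCs avoid this per-assignment work and pay for it elsewhere (heap headroom, scan/pause costs), which is part of what the papers cited downthread are arguing about.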


So we are agreed that mixing tracing GCs with any other form of memory management (including another GC) is an unsolved problem. Good.

Swift is a disaster, also agreed. So?

For the rest: actual research disagrees with your forceful but unsubstantiated assertion:

https://2013.splashcon.org/details/oopsla-2013-papers/21/Tak...

And once again, tracing GCs do well in microbenchmarks where you only check the cost of local operations. They are horrible when you take the global effects into account, with those local/benchmarking advantages not translating into real world use.

https://people.cs.umass.edu/~emery/pubs/gcvsmalloc.pdf

Very similar to JITs, which also do massively better in microbenchmarks than in production code:

http://blog.metaobject.com/2015/10/jitterdammerung.html

Oh, and integrating well, and without high cost, with fast languages where you have even more control is actually an important feature.


No, we don't agree, unless you can explain how the integration of a tracing GC alongside stack, global and off-heap memory allocation in D, Active Oberon, Go, Modula-3, Eiffel, .NET, Nim, among several others, happens to be a research project.


Those aren't actually integrated. They just live side-by-side.


It is definitely integrated for the users of those languages.


Sanitary memory habits are still faster than using reference counting...


So basically they needed their own silicon to deal with ARC.


But all of these performance measurements won't matter if users don't actually own their devices; without being able to run applications without Apple's permission and knowledge, these devices should be ignored by many privacy-conscious computer scientists from the get-go.



