I think how DMA operates needs another serious look. Right now we have to fetch everything into the CPU before we can make decisions. What if we had asynchronous HW embedded within the memory that could be given a small (safe) executable program to process the memory in-place rather than evaluating it on the CPU? A linked list, for example, would be much faster and simpler to traverse.
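To make that concrete, here is a rough sketch of the kind of command I have in mind being handed to an in-memory engine. Everything here - the struct, the names, the whole mechanism - is hypothetical; nothing like it exists in commodity DRAM today:

    /* Hypothetical descriptor handed to an in-memory compute engine.
       The engine walks a linked list in place and returns only the
       matching node's address, instead of the CPU pulling in one cache
       line per hop. All names are invented for illustration. */
    #include <stdint.h>

    struct pim_list_walk {
        uint64_t head;        /* physical address of the first node     */
        uint32_t next_offset; /* offset of the 'next' pointer in a node */
        uint32_t key_offset;  /* offset of the key field in a node      */
        uint64_t key;         /* value to search for                    */
    };

    /* Conceptually, the engine runs something like this right next to
       the DRAM arrays while the CPU core sleeps or does other work: */
    static uint64_t pim_walk(const struct pim_list_walk *cmd,
                             const uint8_t *mem /* the whole DRAM */) {
        uint64_t node = cmd->head;
        while (node != 0) {
            uint64_t k = *(const uint64_t *)(mem + node + cmd->key_offset);
            if (k == cmd->key)
                return node;  /* report back just this address */
            node = *(const uint64_t *)(mem + node + cmd->next_offset);
        }
        return 0; /* not found */
    }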
A lot of the software architecture theory we learn is based on existing HW paradigms, without much thought given to how we could change those paradigms. HW is massively parallel by nature, but physical distance from compute translates to latency (vs. the ultimately serial execution of traditional CPUs, which can process data at blistering speed but only one item at a time, SIMD aside). There are real-world benefits to this design - memory is cheap, simple to manufacture, and abundantly available. The downside is that the CPU sits doing nothing but waiting for memory most of the time, especially when processing large data sets.
Imagine how efficient a GC algorithm could be if it did a concurrent mark and sweep in the background, perhaps as part of a DRAM refresh cycle, so that you could even choose to stop refreshing a row of RAM your application no longer needs.
The power and performance savings are pretty enticing.
There are some memory devices that support basic in-memory operations. For example: https://mosys.com/products/blazar-family/be3rmw-bandwidth-en.... That one supports operations like read-modify-write within the memory device itself. (I have no affiliation with this company.)
The barrier to adoption of this is not technical, it's economic. The memory industry has focused on making the highest-capacity, lowest-cost-per-bit products. This drives high manufacturing volume, which drives economies of scale. Memory products with integrated functions are inherently niche, and therefore don't have anywhere near the market size or economies of scale. Designers have (historically) decided that it is cheaper at the system level to keep the logic operations within the CPU and use a "dumb" commodity memory, even though this necessitates more bandwidth usage. (It's a complex engineering trade-off.)
With logic performance continuing to scale faster than memory bandwidth, at some point an architecture that reduces the required memory bandwidth (such as computing in-memory) might start to make sense economically.
I think there are niche use cases that would warrant the cost/complexity trade-off - namely cloud infrastructure for databases, where you might be processing large amounts of data and transferring all of it to the CPU first is pretty ridiculous.
I think the complexity problem is solvable. If you can build such a thing for memory, you can reuse the same general concept for storage like NVMe/SSD/spinning disk. It’s entirely possible this is warranted even in consumer devices: quite a few OS operations amount to modifying/querying memory, and offloading them could yield some serious wins (making the machine feel way more snappy/interactive + more powerful for executing things locally).
I don’t have much hope for traditional CPU designers, so the question is whether someone can design both a memory system with this capability AND a computational model that makes it easy to adopt in major languages while offering a perf win (given that a memory fetch costs ~100 ns for ~256 bytes). It’s challenging and unlikely to come from x86 land, where backward HW compatibility is extremely important. The innovation may eventually come from mobile land (which is where Apple is coming from), where memory controllers are custom designed as part of the SoC anyway and change with each revision. So the hard parts remain: making this fast, efficient, and able to handle multiple concurrent programs (or at least letting the OS coarsely control which program’s memory accesses get prioritized), plus language integration (so you can hand off an algorithm and have it executed without upending existing software design knowledge). The OS integration could be even more intelligent - if a process/thread is spending its time processing memory without returning results to the CPU, just put it to sleep until the result is available (separating memory access from CPU utilization, giving drastically better use of cycles than naively waiting out each memory stall one at a time).
So tldr I agree with you totally. This is not going to happen if memory manufacturers continue to go for the “cheap and large” memory route. I do see hope that such concepts may be explored if we get more innovation in the CPU space.
> HW embedded within the memory that could be given a small (safe) executable program to process the memory in-place rather than evaluating it on the CPU
Well, that seems to be the exact definition of what UPMEM is doing:
Between the M1, GPUs, TPUs, and RISC-V, interesting times are coming in hardware. I blame the physical limits, which are putting the Duke Nukem development method to an end (you promise the client 2x the performance in 18 months, then play Duke Nukem the whole time). The only way to better performance now is through hardware specialization. That, and avoiding Electron.
You'd hope that the only way to better performance was HW specialisation, but there are SO MANY algorithmic improvements still to be made. Just the other day I found a case someone had rolled their own priority queue with a sorted array instead of a heap. In a fairly popular open source library too.
There's still loads of performance to be gained just by being better programmers.
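For context on why that matters: a sorted-array priority queue pays O(n) per insert to keep the order, while a binary heap pays O(log n) for both insert and pop-min. A bare-bones heap push in C, purely illustrative and not taken from the library in question:

    /* Minimal binary min-heap push: O(log n) per insert, versus O(n)
       to shift elements when maintaining a sorted array. */
    #include <stddef.h>

    static void heap_push(int *heap, size_t *n, int value) {
        size_t i = (*n)++;           /* append at the end... */
        heap[i] = value;
        while (i > 0) {              /* ...then sift up */
            size_t parent = (i - 1) / 2;
            if (heap[parent] <= heap[i])
                break;
            int tmp = heap[parent];
            heap[parent] = heap[i];
            heap[i] = tmp;
            i = parent;
        }
    }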
> Just the other day I found a case someone had rolled their own priority queue with a sorted array instead of a heap.
I feel like this story needs an ending? Did somebody re-implement it using a heap and found significant performance wins? Or was the sorted array used on purpose to take advantage of some specific cache constraints and actually end up being a huge win?
The story ends (so far) with me filing a GH issue and offering a PR, with my own benchmarking showing an average 3x speedup over a wide range of inputs. I happen to know the maintainer has been on vacation for the last week and only got back a few days ago, so no further discussion has happened yet.
A serious HW perf improvement will generally improve everything, including shitty algorithms. There’s a reason that through the 90s and 2000s serious developers routinely skipped spending significant time optimizing code: the hardware doubled the speed of ALL code every ~12-18 months. So even though there is low-hanging fruit for software optimization, such fruit can be missing from already heavily optimized code (eg games, scientific computations, etc), and picking it is orders of magnitude more expensive than “run this on newer HW for faster perf”. If you can do this “transparently” (maybe some OS/compiler integration and you can use the new HW for perf wins), then the cost economics are powerful. If it’s a special-purpose piece of HW, then it gets trickier and it will be relegated to accelerating certain memory-heavy programs (like GPUs accelerate certain kinds of mass-data algorithms).
How do you propose to fix this? Should languages include high(er?) performance data structures in their standard libraries? Or possibly even include some segmentation for small/medium/huge data sets?
Oracle's (Sun's) latest CPUs support something called DAX (data analytics extensions - not the same thing as Intel's DAX, direct access extensions). It's a coprocessor that allows offloading simpler operations closer to RAM, apparently:
"DAX is an integrated co-processor which provides a specialized set of instructions that can run very selective functionality – Scan, Extract, Select, Filter, and Translate – at fast speeds. The multiple DAX share the same memory interface with the processors cores so the DAX can take full advantage of the 140-160 GB/sec memory bandwidth of the SPARC M7 processor."
This reminds me of the Connection Machine architecture [1]
> Each CM-1 microprocessor has its own 4 kilobits of random-access memory (RAM), and the hypercube-based array of them was designed to perform the same operation on multiple data points simultaneously, i.e., to execute tasks in single instruction, multiple data (SIMD) fashion. The CM-1, depending on the configuration, has as many as 65,536 individual processors, each extremely simple, processing one bit at a time.
Interestingly enough, this is why Simultaneous multithreading [1] exists!
The revelation that "The CPU could be doing useful work while stalled waiting on a load" has led to CPU designers "faking" the number of cores available to the OS, allowing the CPU to do more useful work while different "threads" are paused waiting on memory to come back with the data they need.
I think the point is that this still eats into CPU <-> memory bandwidth. Offloading could let the CPU use that bandwidth for better purposes, especially since the stall is from memory access anyway.
When doing pointer chasing, it's more of a latency problem: there can be plenty of memory bandwidth available, but we don't know which memory line we want next until the previous line (where the pointer resides) has been loaded. So the CPUs spend a lot of time in "stalled backend" mode due to the big difference between CPU cycle latency and RAM access latency.
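In C terms, the core problem is that every load depends on the previous one, so the full DRAM latency is paid serially on each hop (illustrative sketch):

    /* Each iteration's load address comes from the previous load, so
       the accesses can't overlap - the core stalls for roughly a full
       memory round-trip per node once the list falls out of cache. */
    struct node { struct node *next; long value; };

    long sum_list(const struct node *n) {
        long sum = 0;
        while (n) {
            sum += n->value;   /* cheap arithmetic */
            n = n->next;       /* serialized cache miss after cache miss */
        }
        return sum;
    }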
Offloading some operations closer to the RAM would be a hardware solution (like the Oracle/Sun SPARC DAX I mentioned in a separate comment).
Or you could design your software to rely more on the memory throughput (scanning through columnar structures) vs memory latency (pointer chasing).
Btw, even with pointer chasing, you could optimize the application to work mostly from the CPU cache (assuming the CPUs don't do concurrent writes into those cache lines all the time), but this would require not only different application code, but different underlying data structures too. That's pretty much what my article series is about - fancy CPU throughput features like SIMD won't be very helpful if the underlying data (and memory) structures don't support that way of working.
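As a contrast to the pointer-chasing loop above, here's roughly what the throughput-oriented, columnar version of the same work looks like - sequential, prefetch-friendly, SIMD-friendly (again just a sketch):

    /* Columnar ("struct of arrays") layout: the values live contiguously,
       so the scan streams through memory and is bandwidth-bound rather
       than latency-bound. */
    #include <stddef.h>

    long sum_column(const long *values, size_t count) {
        long sum = 0;
        for (size_t i = 0; i < count; i++)
            sum += values[i];   /* sequential access; the prefetcher keeps up */
        return sum;
    }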
I’m thinking more of something like a GC mark and sweep. You don’t care how long the mark (or even sweep) operations take as long as they finish eventually. Those operations would be the easiest ones to express as an algorithm you could offload to run directly inside a memory controller. That would avoid GC pauses (if using a concurrent GC) while costing the CPU nothing beyond a trivial amount of small operations.
What stood out to me in this article is that the fancy “intelligent” columnar indices were actually performing worse than a naive brute-force traversal of the data. There is a limit to your ability to predict ahead of time whether a brute-force solution will be faster or slower than one using a fancy data structure. So a HW offload is nice because, in theory, it should have less of an impact, and you can run memory-intensive operations concurrently with CPU-intensive operations at no cost (plus the memory-heavy operations might run faster by living closer to the memory, with faster interconnects and lower latency).
Additionally, such HW could take advantage of the peculiarities of how RAM works for added benefit. Your memory allocator could mark a block as unused (in the simplest v0 of what such HW could do). The HW could then avoid refreshing those rows, reducing power usage slightly and improving latency (since refresh is a stop-the-world operation), if you can get fine-grained enough and it’s not cost-prohibitive. Garbage collection would certainly be a lot more attractive with this kind of design, which would drastically change the SW economics (ie Rust/C/C++ suddenly lose a whole lot of the practical perf wins they hold over other languages).
Linked-list chasing (as in vlovich123's example) isn't bandwidth-limited as much as it is latency: SMT helps here because you're able to enqueue multiple wait-states "simultaneously".
I agree it’s not bandwidth limited. But the CPU is stuck waiting around for the result of that operation, so offloading it to dedicated HW lets you run an important background task without the OS and CPU spending any wall-clock time on it. Same reason you have DMA engines and other coprocessors/accelerators in the first place. A good solution of this kind would increase effective memory bandwidth (by offloading heavy, background-worthy tasks to the coprocessor) but also speed up latency-sensitive tasks (eg walking a large set of linked lists: you don’t need to pull in a 256-byte cache line at a time and can instead pull just the 8 bytes for the pointer).
I don’t pretend that this is easy or an obviously good idea. I recognize there are smarter people than me who have considered this (and from the other posts it clearly is being explored commercially). I am attracted to the idea on first principles that the CPU is served well by specialized coprocessors that have a performance profile better suited for some specific subset of operations. memcpy is the easiest example. We still don’t have a simple (good) memcpy instruction/accelerator even though we know that an insanely large amount of the work the CPU ends up doing is copying memory around in some way or another (yes, Intel has one, but it isn’t useful/well-performing).
> I am attracted to the idea on first principles that the CPU is served well by specialized coprocessors that have a performance profile better suited for some specific subset of operations.
It certainly sounds promising, that's for sure, e.g.
I know that POWER9 for example, reserves the superslices for different threads as you go from SMT1 (1-thread per core) up to SMT4 (4-threads per core).
Thread#0 and Thread#1 get Slice0, while Thread#2 and Thread#3 get Slice1.
--------
For an even more extreme example, Bulldozer's implementation of SMT / Cores / whatever you wanna call it... the L1 cache was split between the two cores. So you'd rather have 2-threads doing two different things (even if they shared the same decoder and significantly shared the same resources), because you'd "magically" get access to more L1 cache.
Even on modern Skylake / Zen systems, some SMT resources are locked to one thread or the other. So you do in fact get more resources from SMT.
On the other hand: you're absolutely right in that pointer-chasing can be done in parallel in a single thread due to ILP (and that ILP is probably even preferable to using more threads). Still, the fact that SMT systems are implemented so differently and so weirdly at times... it means that we can't really make general rules about SMT systems.
Yeah, thanks for adding details. SMT on Skylake and later is quirky; people don't appreciate that the CPU replicates some resources, partitions others, and makes the threads fight over the rest. The static partitioning in particular means that if you try to disable HT by taking all of the odd core numbers offline[+], you just wasted half of your L1 cache and iTLB entries.
+: I know this sounds like an obviously-wrong approach to the problem, but I came across it at a large, well-known, public company.
SMT gives you some probability that thread X and thread Y are under-using the resources of the CPU but if you lose that bet you get antagonism. On Intel CPUs in particular there are only 2 slots for filling cache lines, and filling them randomly from main memory takes many cycles, so two threads can easily get starved.
If thread X is chasing a linked list and thread Y is compressing an MPEG then it's brilliant.
And all kinds of manual prefetching instructions built into the CPUs (for compilers to use) and prefetching algorithms & predictors built into the CPUs too!
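For example, GCC and Clang expose those manual prefetch instructions via __builtin_prefetch. When you know the access pattern a bit in advance (say, gathering through an index array), you can hide part of the latency like this - a rough sketch, and the right prefetch distance is very workload-dependent:

    /* Software prefetch while gathering through an index array: issue
       the prefetch a few iterations ahead so the cache line is
       (hopefully) present by the time we actually load from it. */
    #include <stddef.h>

    long gather_sum(const long *data, const size_t *idx, size_t n) {
        const size_t dist = 8;   /* prefetch distance: tune per workload */
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + dist < n)
                __builtin_prefetch(&data[idx[i + dist]], 0 /* read */, 1);
            sum += data[idx[i]];
        }
        return sum;
    }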
I think it's more correct to say that a single core has two "sets of registers" (and I guess instruction decoders/dispatchers perhaps)... so it's a single core with 2 "execution entry points"?
Skylake has 180 registers. "mov rax, [blah]" performs "rax = malloc()" from this pool of 180-registers / "the register file". Eventually, the retirement unit "garbage collects" the registers that aren't used anymore for recycling.
If you have one thread that doesn't need a lot of registers for some reason (ie: most of its dependency chains are short, and it has very few memory operations), then you can have a 2nd thread eat up the ~180 registers that it wasn't using. "Sharing" the malloc pool of registers.
This is not how Intel sees it. They describe hyperthreading as a single core having two (or more, but usually two) sets of architectural state which basically means registers. It's not two cores, it is one core that can switch between two different instruction pointers. They share almost everything else apart from the APIC.
This is indeed an idea that has been coming up from time to time for over 25 years at least. I think the earliest publication in this space was Gokhale's Terasys (DOI 10.1109/2.375174).
It is a good idea in principle, but it's never really taken off to my knowledge. Parallel programming is hard. Deeply embedded programming is hard. The languages we have are mostly bad at both.
If you want to search for more, the keywords are "processor in memory" or "processing in memory" (PIM); "in-memory computing" is commonly used as well. A number of groups are working on this right now. The new buzz is "memristors" (memristive computing). Whether any of it ends up working or being commercially viable outside of a lab remains to be seen.
> Right now we have to fetch everything into the CPU before we can make decisions.
Think about virtual memory. The memory the program thinks is at location #6000 is NOT physically at location #6000.
In fact, the program might be reading / writing to memory location 0x08004000 (or close to that, whatever the magic start address is on various OSes like Linux / Windows). The CPU then translates that virtual address to memory stick #0, column #4000, or whatever.
Because of virtual memory: all memory operations must be translated by the CPU before you actually go to RAM (and that translation may require a page-table walk in the worst case)
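For reference, here's a simplified sketch of what that worst-case walk looks like with x86-64 4-level paging (4 KiB pages only; permission bits, huge pages, and the TLB are all ignored, and read_phys is a caller-supplied stand-in for the physical reads the hardware walker performs):

    /* Simplified 4-level x86-64 page-table walk: four dependent memory
       reads before the actual data access can even be issued. */
    #include <stdint.h>

    typedef uint64_t (*read_phys_fn)(uint64_t phys_addr);

    uint64_t translate(uint64_t cr3, uint64_t vaddr, read_phys_fn read_phys) {
        uint64_t table = cr3 & ~0xFFFULL;                /* PML4 base */
        for (int level = 3; level >= 0; level--) {
            uint64_t index = (vaddr >> (12 + 9 * level)) & 0x1FF;
            uint64_t entry = read_phys(table + index * 8);
            if (!(entry & 1))                            /* present bit clear */
                return UINT64_MAX;                       /* -> page fault */
            table = entry & 0x000FFFFFFFFFF000ULL;       /* next table / final page */
        }
        return table | (vaddr & 0xFFF);                  /* frame base + page offset */
    }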
Why is that? We need the CPU to handle page-fault interrupts in order to populate the RAM. But assuming the page is already in RAM, there's no reason the memory accesses actually need to go through the CPU. (Hardware can already raise interrupts; if the MMU can raise a page-fault indicator, you might be able to bypass the CPU entirely until a new page needs to be loaded.)
Moreover, if we had support for mmap at the MMU level, we could cut the CPU bottleneck for disk access entirely. The disk controller can already handle DMA, but there's simply no way for things other than the CPU to trigger it. DirectStorage is an effort to let GPUs trigger it, but what if we could also trigger it by other means?
Okay, let's say page faults are off the table for some reason. Let's think about what can happen even when everything is already in RAM.
* mmap is still on the table: different processes can map the same RAM at different addresses. (Process #1 thinks the data is at memory location 0x90000000, Process #2 thinks it's at 0x70000000, but in both cases the data is at physical location 0x42.)
* Physical location 0x42 is a far read on a far-away NUMA node, which means CPU#0 now needs to send a message to CPU#1, very far away, to get a copy of that RAM. This message traverses Intel Ultra Path Interconnect or AMD Infinity Fabric (proprietary details), but it's a remote message that happens nonetheless.
* Turns out CPU#1 has modified location 0x42. Now CPU#1 must push the most recent copy out of L1 cache, into L2 cache... then into L3 cache, and then send it back to CPU#0. CPU#0 has to wait until this process is done. If CPU#1 wants to modify the data again (or even read it), it may require messages from CPU#0 (who is now the owner of the data, according to simple MESI models).
Modern computers work very hard to hold the illusion of a singular memory space. Eventually, these details are turned into a consistent memory model and become well-ordered sequential operations. The CPU is a good place for that.
---------------
That's how stuff works _today_. If you wanted to make a new programming model that's incompatible, that's fine. (CUDA does it: GPUs don't have as many virtual-memory features as a CPU. And __shared__ memory has a different model than L1 cache.)
But if you invent a new memory model that does things differently, it means that it won't work for the vast majority of code. Which means you need to bootstrap a new programming environment (much like how CUDA bootstrapped a new community from scratch).
Or the mapping info for your process is shared with that coprocessor (the OS keeps this around in RAM anyway). Heck, you could have OS-provided code that gets loaded so the coprocessor can properly resolve the mapping (which is roughly what happens on a TLB miss, by the way - the page tables have a well-defined structure, and if I recall correctly the hardware walker fetches translation info directly from RAM until it either resolves the mapping or has to generate a page fault).
I don’t disagree that the memory model becomes more complex. For one, CPU cache invalidation becomes really tricky. So do the memory coherence rules.
I’m less clear on how mmap matters here. That’s just a mechanism the OS uses to hand out views into the page cache to the process - if you’ve solved the virtual-to-physical mapping (which you have to do anyway), then mmap is not relevant.
You’re spot on though that the particular design decisions are critical for this to be successful - pick the wrong point on the complexity/cost/perf curve and your solution will definitely be DOA.
Pointers are virtual addresses but memory is accessed physically. All of the means of translating virtual to physical are in the CPU. If you are proposing throwing out virtual addressing, I imagine you won't get a lot of support for that idea.
I think the point was more like the following situation.
You have a linked list. Each node, since it’s in user space, stores the virtual address of the next node. How do you offload a program to process this “in situ”? You’d need to translate those virtual addresses to physical ones. Parent is 100% correct about the challenge this poses.
Perhaps one problem with that is modularity across various processors, which would have to know when they are paired with memory to which certain operations can be farmed out, and when they are paired with traditional memory that cannot handle those in-place operations.
SoCs are a good vector for this. Once a technology like this makes it into a space, it’s unlikely to ever get removed if it’s actually useful (ie good perf/cost/power wins), since there’s a competitive advantage to having a better one. The longer it’s around, the more second-order effects you get (eg SW integration, tooling, etc) entrenching the tech (the same forces that make it difficult to get off the ground in the first place). Apple is a great example of a company that can pull this off: they have the vertical integration to get it off the ground AND the commitment to apply it consistently within a vertical. It may not be within Apple’s business interests if this is primarily useful for “big data”. That seems unlikely to me though, so you’d need other applications they might see as strategically useful. SQLite is used heavily at Apple, so accelerating such workloads may be useful. A good chunk of what an OS does overall is manipulate memory, so something like the OS scheduler could be bifurcated to pick the next task and have it ready, including modifying the various lists, so that the CPU does even less (although that’s unlikely to be a good example, as I suspect it’s a relatively short operation that happens so frequently there isn’t really any room for a win).
Even just "go to memory location X+Y, load the value there, set X = that value, repeat N times" would allow fast linked list traversal, as well as doing various high-level pointer chasing faster.
A few months ago I wanted to take a look at the Gen-Z fabric specifications, but unfortunately they still have a lame members-only download request form in place.
Meanwhile, disk is potentially getting to be as fast as RAM, throughput-wise.
128 lanes of PCIe 4.0 is about 256 GB/s IIRC. EPYC's 8-channel DDR4-3200, on the other hand, is good for about 208 GB/s.
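Rough back-of-envelope behind those figures (per direction, ignoring protocol overhead):

    PCIe 4.0:   16 GT/s per lane, 128b/130b encoding ~ 1.97 GB/s per lane
                128 lanes x 1.97 GB/s                ~ 252 GB/s
    DDR4-3200:  3200 MT/s x 8 bytes                  = 25.6 GB/s per channel
                8 channels x 25.6 GB/s               ~ 205 GB/s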
Let's Encrypt stopped a little short, using 24x NVMe disks (it fits in a 2U though, so that's nice) [1]. That could be up to 96 of the 128 PCIe lanes in use. With the right SSDs, working on large data objects, that'd be somewhere a bit under 192 GB/s versus the max 208 GB/s of their RAM.
In truth, RAM's random-access capabilities are far better and there's much less overhead (although NVMe is pretty good). And I'm not sure I've ever seen anyone try to confirm that those 128 lanes of PCIe on EPYC aren't oversubscribed - that devices really can push that much data around. Note that this doesn't necessarily even have to mean using the CPU: PCIe P2P is where it's at for in-the-know folks doing NVMe, network, and GPU data-pushing; EPYC's IO die acts like a packet switch in those conditions, rather than having the CPU process/crunch the peripherals' data.
Indeed, you can go further, but you've got to plan for the bandwidth needed by other peripherals, data movement, inter-CPU bandwidth (NUMA), and intra-CPU-core bandwidth limitations too (AMD's Infinity Fabric is point-to-point between chiplets, but Intel has a ring-bus architecture for moving bits between CPU cores).
I got my Lenovo ThinkStation P620 workstation (with an AMD Zen 2 ThreadRipper Pro WX, 8 memory channels like EPYC) to scan 10 x PCIe 4.0 SSDs at 66 GB/s (I had to move SSD cards around so they'd use separate PCIe root complexes to avoid a PCIe <-> CPU data transfer bottleneck). And even with doing I/O through 3 PCIe root complexes (out of the 4 connected to that CPU), I seem to be hitting some inter-CPU-core bandwidth limitation. The throughput differs depending on which specific CPU cores happen to run the processes doing I/Os against different SSDs.
Planning to publish some blog entries about these I/O tests but a teaser tweet is here (11M IOPS with a single-socket ThreadRipper workstation - it's not even a NUMA server! :-)
Yes, that's what I'm suspecting too, although with higher-clocked RAM I should have somewhat more bandwidth. My DIMMs are 3200 MT/s, so they should be running at 1600 MHz. But I saw a note (not sure where) that Infinity Fabric can run at up to 2933 MT/s on my machine, and it only runs in sync with memory for DIMMs up to 2933 MT/s. Unfortunately my BIOS doesn't allow downgrading the RAM "clock" from 3200 to 2933 MT/s, thus Infinity Fabric is running "out of sync" with my RAM.
This should mean non-ideal memory access latency at least, not sure how it affects throughput of large sequential transfers.
I'm planning to come up with some additional tests and hopefully write up a "part 2" too.
> Unfortunately my BIOS doesn't allow to downgrade the RAM "clock"
How deep are you willing to go?
The RAM clock is controlled by the memory training algorithms. They use data from the XMP, which can be edited.
The simplest is to reflash your memory sticks to alter their XMP, so the training algorithm will reach the conclusions you want. There's some Windows software to do that.
You could also implement your own MRC, something done by coreboot and the likes.
Ha, thanks for the idea! I was briefly thinking of buying 2933 "MHz" RAM for the test (as I'd later put it into my other workstation, which can only go up to 2600 "MHz"), but then I realized I don't have time for this right now (I'll do my throughput and performance-stability tests first, and maybe look into getting the most out of the latency later).
> I had to move SSD cards around so they'd use separate PCIe root complexes to avoid a PCIe <-> CPU data transfer bottleneck
I am doing similar things. Have you considered looking at how to control the PCIe lane assignment in software?
Intel HSIO seems to be software-configurable - except that usually it's all done just by the BIOS.
But as PCI specs allow for both device-side and host-side negotiations, it should be doable without "moving SSDs around"
> The throughput differs depending on which specific CPU cores happen to run the processes doing I/Os against different SSDs.
That strikes me as odd. I would check the detail of the PCI lanes and their routing. You could have something funky going on. My first guess would be that it's slow on one core because it's also handling something else, by design or by accident.
There are some bad hardware designs out there. But thanks to stuff like HSIO, it should now be possible to fix the worst ones in software (how else would the BIOS do it?), just like in the old days of isapnptools!
As this is an AMD machine - and a workstation, not a server - perhaps that's why they've restricted it in the BIOS.
I'm not too much of an expert in PCI express - but if this workstation has 4 PCIe root complexes/host bridges, each capable of x32 PCIe 4.0 lanes - and there are no multi-root PCIe switches, wouldn't a lane physically have to communicate with just one PCIe root complex/CPU "port"?
In practice many things can be different. Carefully check what's happening under the hood.
Hopefully, manual configuration of the PCIe will become more commonplace, as most bioses are badly broken each in their own unique way - and unfixable.
Changing the NUMA-per-socket (NPS) config would be interesting.
EPYC Rome (or the equivalent TR) was advertised as unified compared to Naples, but actually it's still a bit NUMA / "NUPA"? (non-uniform peripheral access - does anyone know the correct term?), hence the "Quadrant" mode.