The prospects for 128 bit processors (1995) (yarchive.net)
85 points by luu on Feb 23, 2022 | 99 comments



I think 128 bit computers will come around eventually, despite it having been declared that 64 bit is "enough". Some pressures may come from:

* Memory addressing - As the article suggests, addressing a larger amount of address space. Not just RAM, but disk control at the bit/byte level (something solid state drives may enable and new filesystems may take advantage of). There may also be applications where you want an exabyte disk as low-speed RAM.

* Multi-byte processing - Vector instruction sets like AVX have shown the power of processing multiple bytes at a time. One can imagine that wider registers would accelerate these operations further and would allow more multi-byte processing to happen in parallel.

* Gaming/simulation - We have seen quite a few examples where physics in games and simulations have broken down due to the inaccuracy of `double` for large values. I believe Minecraft physics for example used to become extremely unstable near the world border.

* Hashing - With `int32_t` and large amounts of data, you will see a lot of collisions. With `int64_t` they are less common but still likely; `int128_t` rarer but still possible. With `int256_t` (`long long` on a 128 bit processor) they would be highly unlikely. Being able to compare hashes in just a few clock cycles would be awesome.

* Custom instructions - When programs can define a custom instruction to speed-up computation, 128 bits, or 16 bytes, could even be enough to contain the custom instruction and the payload.

These are just things I've noticed; I imagine there are others too. The prediction of 2043 is still quite realistic; I wouldn't be surprised if we beat it.

I was quite disappointed to see many Linux distros give up on 32 bit support because it was too much effort to maintain. It probably points towards some crappy code that is highly dependent on the platform.


> Memory addressing

Current 64-bit CPUs generally don't go past 52 bits of memory addressing. That's 4.5 petabytes, and they still have 12 doublings to go. It is also possible to have memory addressing wider than your CPU; 8-bit chips almost universally did this, and even the Pentium 4 had 36-bit physical addressing on a 32-bit CPU.

> Multi-byte processing

You can have wider vector load instructions without affecting the main CPU width.

> Gaming/simulation

Legit point. IBM POWER has 128-bit IEEE floats and decimal hardware coprocessors, but I believe the rest of the CPU is still 64-bits wide. The easiest solution here is to allow those 256/512 vector units to do 128-bit floats in addition to 64-bit floats.

> Hashing

If you have a hashtable that needs 128-bits due to collisions, it is a MASSIVE table. It definitely won't be fitting into cache and probably not into RAM.

A few extra instructions (or even a hundred extra instructions) to do the math with two 64-bit segments will still be nothing compared to the thousands of cycles reaching out to RAM or the millions/billions of cycles reaching out to the disk.
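
For the integer case, the two-segment math really is tiny. A minimal sketch in plain C (no compiler __int128 support assumed); on x86-64 a compiler will typically turn this into an add/adc pair:

  #include <stdint.h>

  /* A 128-bit value held as two 64-bit halves. */
  typedef struct { uint64_t lo, hi; } u128;

  static u128 add128(u128 a, u128 b) {
      u128 r;
      r.lo = a.lo + b.lo;
      r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry out of the low half */
      return r;
  }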


> Legit point. IBM POWER has 128-bit IEEE floats and decimal hardware coprocessors, but I believe the rest of the CPU is still 64-bits wide.

Yes (POWER9+, I think), but also 128-bit "IBM" format. Transitioning to an IEEE ABI for ppc64le GNU/Linux has caused some pain in GCC and elsewhere recently. (I don't know how software long double is typically compiled and how fast the results are.)


In my opinion none of these are good reasons for 128-bit computers, as they can be easily addressed by using 2x64 bit integers and software arithmetic, or bigints.

The issue with 32-bit specifically is that the maximum value (2 billion for a signed int) is really small and hits all kinds of practical limits: there are more humans than that! Whereas 2^63 is big enough for almost all practical purposes.

There were costs to the 64 bit transition in terms of higher memory usage so I don’t think we’ll see another transition because the costs at 128-bit would outweigh the benefits.


We already have 128-bit computers if you're willing to count 128-bit vector units. (The internal data path of most CPUs is 128-bit these days).

So the basic problem with 128-bit comes down to one thing: memory. There is a physical limitation on the amount of memory you can have ready to access; if you've been paying attention, there has been basically no growth in the size of caches in the past decade. Doubling the size of your basic data types means you cut the effective size of that cache in half, and effectively using that cache is the biggest barrier to performance.

If you look at prior sizes, it's clear that 16 bits is way too small: that's 65,536 entries, which is easy to overflow in many cases. 32 bits gives you about 4 billion entries, which is generally sufficient for the vast majority of cases, but can overflow (it only takes ~1 second for a CPU to count to 4 billion). Overflowing a 64-bit counter would take that same CPU over a century, and it should be rapidly clear that not much is going to use it.

Consider your use case of address spaces. Memory can't get all that much smaller; we're starting to come up on laws of physics, so there's only a factor of 1,000× or so smaller that we can get. We're barely at the threshold right now of exhausting the existing 48-bit address spaces (or 256TiB of memory), and even then, that requires things like memory-mapping entire disk drives to eat up that much memory. Apply that factor of 1,000 shrinkage, and you still don't cross the 16EiB threshold of a 64-bit address space. The need to address that much memory just isn't in the cards, and if it is, the memory is likely to be so heavily non-uniform that you'd probably see a development towards segmented address spaces again anyways to avoid needing 128-bit pointers generally.

The one use case that I consider the most likely is adding hardware support for quad-precision floating point numbers, since it already exists in several libraries as a soft float operation, it's well-understood, the datapaths can generally already handle it (a 128-bit vector FMA unit can relatively easily be extended to support a 128-bit scalar floating-point FMA anyways), and there's a small need for it.


> Memory addressing - As the article suggests, addressing large amount of address space. Not just RAM, but disk control on the bit/byte level (something solid state drives may enable and new filesystems may take advantage of). There may also be applications where you want an etabyte disk as low-speed RAM.

Think bigger. We could go all the way to a global, routable address space. Such a system would need to be sparse to avoid terrible fragmentation.

We could also burn a bunch of address space on segmentation. Why spend instructions on checked array accesses when you could just put your 10 element array between a terabyte worth of unmapped virtual address space?

In practice, I suspect we will move to more heterogeneous register sizing, with increasingly large registers being used for computation while pointers stay at 64 bits or smaller (maybe with fatter pointers in niche applications, like a multi-machine runtime embedding the machine address in the upper bits)


The problems with games are not due to the inaccuracy of FP64 'double'; they are all caused by FP32 'single'.


Multi-issue 64bit machines can compare 256 bits in "just a few cycles". (You do 2-4 compares per cycle and then combine the results.)


Modern 128 bit designs are held back by trouble with heat dissipation and signal routing. Basically you need 2-8x the transistors depending on operation, and you need 4x the space or more to route it.


And a company like DEC to invest in research.


Dumb question: why stop at 128? Why not go directly to 512, or even 1024 bits?


There's cost in the CPU because

- Every register must be usable as a pointer, so if you make your addresses (i.e. pointers) too big for no good reason you're wasting silicon.

- A solution to the above is picking a subset of registers that can address memory and making them big and leaving others as is, but this complicates the architecture and makes it ugly and complex

- Another solution is to alias names so that registers are accessible either as monolithic 128/512/1024-bit blobs or as smaller subslices (A32_1 is the least significant 32 bits of the 128-bit register A, A32_2 is the next 32-bit slice, and so on). This lessens the waste because when you don't need a 128-bit register you have 4 32-bit registers instead, and there's no such thing as too many registers. x86 does something vaguely similar, but it's super ugly for different reasons; I don't see any problems with an architecture designed like this from the start, though someone better at computer architecture might. (See the sketch after this list.)

- All functional units in the CPU will have to be the size of the largest register N. This would be a waste if they don't have the ability to function as N/n parallel n-bit units instead, and designing that into the architecture might be complex and entail a lot of decisions (e.g. if a 128-bit adder is functioning as 4 parallel 32-bit adders, where do the 4 separate overflow bits go?).
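
A minimal sketch of the aliasing idea from that list, modeled in C (assuming a little-endian machine and GCC/Clang's `unsigned __int128` extension; the names are invented, this is not any real ISA):

  #include <stdint.h>

  /* One 128-bit architectural register "A", viewable as smaller slices. */
  typedef union {
      unsigned __int128 a128;  /* the whole register                        */
      uint64_t a64[2];         /* two 64-bit halves                         */
      uint32_t a32[4];         /* a32[0] is A32_1 (least significant slice) */
  } reg_a;

Writing a32[0] only touches the low 32 bits of a128, which is exactly the partial-write/renaming headache the hardware would have to manage.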


We already have 128-bit processors, of sorts. They are called "vector units". ;)

128-bit integers could be useful for bignums. You can do bignum addition on existing SIMD units, but you'd have to use tricks to carry between elements and into the next vector, and that makes the resulting code slower than using GPRs.[1] But I think this would still fit better in vector units than making the integer registers larger: the hardware trend (ARM SVE, RISC-V V) is to make vector width implementation-defined, starting at 128 bits at the low end.

128-bit floating point is apparently a thing in scientific computing (but I've not seen it myself).

128-bit "pointers" exist in CHERI [2], where they are tagged capabilities that also store the bounds and access rights (rwx) for accessing memory through the pointer. Addresses are still 64-bit though. GPRs are still 64-bit. The caps are stored in a new register file.

If you want to scale up address spaces, I think the biggest issue is not the number of bits but rather how to coordinate the address space to avoid collisions. One approach would be to re-introduce segmentation (pointer ⇒ segment ID and offset within segment), but unlike earlier systems, make each segment have its own local namespace of segment IDs it knows about and translate between those when transferring pointers between segments. This idea is used by e.g. the research OS "Twizzler" [3], which was made for a large address space using NVRAM as RAM. I think this "pointer swizzling" could also be helped by hardware support.
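
A very rough sketch of the segment-local translation idea (names and layout invented for illustration, not Twizzler's actual design): a pointer is a (local segment slot, offset) pair, and each segment carries its own table mapping local slots to global segment IDs.

  #include <stdint.h>

  #define SLOT_BITS 16   /* top 16 bits of a pointer select a local slot */

  typedef struct {
      uint64_t slot_to_global[1u << SLOT_BITS];  /* this segment's local namespace */
  } segment;

  /* Resolve a segment-local pointer into a (global segment ID, offset) pair. */
  static void resolve(const segment *s, uint64_t ptr,
                      uint64_t *global_seg, uint64_t *offset) {
      *global_seg = s->slot_to_global[ptr >> (64 - SLOT_BITS)];
      *offset     = ptr & ((UINT64_C(1) << (64 - SLOT_BITS)) - 1);
  }

Swizzling on transfer would then mean rewriting the slot bits to whatever slot the destination segment uses for the same global ID.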

[1] Integer Addition and Carryout, by Alexander Yee <http://numberworld.org/y-cruncher/internals/addition.html>

[2] Capability Hardware Enhanced RISC Instructions (CHERI) <https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/>

[3] Twizzler: An Operating System for Next-Generation Memory Hierarchies <https://www.ssrc.ucsc.edu/pub/bittman-ssrctr-17-01.html>


128-bit vectors can do some stuff with integers, but most don't work at all for 128-bit floats which are where 128-bit is most important in hardware (because the software fallback is SLOW).

I believe IBM POWER chips have both IEEE 128 floats (I think they call them a long double) and IEEE 128-bit decimal too.

Outside of some very precise simulations though, I doubt there is very much demand.


Very interesting. Does this have any implications for better precision when working with floating point numbers at large magnitudes, by any chance?


I'd be curious if anyone with understanding of memory transistor density has an idea of what sort of physical footprint and power requirements 128-bit addressable memory would look like, because I'm having a hard time picturing even a distributed, shared virtual memory space realistically exceeding 16 exabytes.

It looks like the Fugaku ARM cluster with 158,000 nodes uses < 5 PiB of memory for 40 MW of power.

As an aside, man does reading something from @sgi.com on USENET take me back to a better time. Title should probably note [1995]

[0] https://www.fujitsu.com/global/about/innovation/fugaku/speci...

[1] Also https://arxiv.org/abs/quant-ph/9908043 "Ultimate physical limits to computation"


From Jeff Bonwick, ZFS Developer:

  Although we'd all like Moore's Law to continue forever, quantum mechanics imposes some
  fundamental limits on the computation rate and information capacity of any physical device. In
  particular, it has been shown that 1 kilogram of matter confined to 1 liter of space can perform at most
  10^51 operations per second on at most 10^31 bits of information [see Seth Lloyd, "Ultimate physical
  limits to computation." Nature 406, 1047-1054 (2000)]. A fully-populated 128-bit storage pool would
  contain 2^128 blocks = 2^137 bytes = 2^140 bits; therefore the minimum mass required to hold the bits
  would be (2^140 bits) / (10^31 bits/kg) = 136 billion kg.
  That's a lot of gear.
  To operate at the 10^31 bits/kg limit, however, the entire mass of the computer must be in the form of
  pure energy. By E=mc^2, the rest energy of 136 billion kg is 1.2x10^28 J. The mass of the oceans is
  about 1.4x10^21 kg. It takes about 4,000 J to raise the temperature of 1 kg of water by 1 degree
  Celsius, and thus about 400,000 J to heat 1 kg of water from freezing to boiling. The latent heat of
  vaporization adds another 2 million J/kg. Thus the energy required to boil the oceans is about
  2.4x10^6 J/kg * 1.4x10^21 kg = 3.4x10^27 J. Thus, fully populating a 128-bit storage pool would,
  literally, require more energy than boiling the oceans.


It is not too difficult to imagine for virtual memory. Servers with high storage density are right up against the current virtual memory limits such that you can't mmap() storage as it is. Not that you'd want to mmap() if you cared about performance. Similarly, the largest data models have been in the exabyte range for a while. It is just a matter of time before they are not byte-addressable using a 64-bit integer.

I think there is more utility in 128-bit ALUs than 128-bit pointers, as a practical matter.


A machine word/pointer as wide as 72 bits would already solve a lot of problems that appear in e.g. dynamic languages nowadays, such as an inability to represent an IEEE double float (64 bits wide) as a value without any sort of type tagging.

Still, I don't think anyone nowadays would settle for a non-power-of-2 machine word size, even if just for backwards compatibility and even if a pointer is limited to e.g. 72 bits out of all 128. If that is the case, then another use could be a "fat pointer", which was proposed somewhere in the C world to avoid passing raw pointers without their lengths. You could pass the length of an array along with its address as long as the length fits in e.g. 56 bits, which would be enough for most use cases if we feel like going down the "640kB is enough for everyone" route again.
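
Something like this, perhaps (a sketch assuming GCC/Clang's `unsigned __int128` extension; the 64/56-bit split is just the one suggested above, not any real ABI):

  #include <stdint.h>

  typedef unsigned __int128 fatptr;   /* 64-bit address + 56-bit length (8 bits spare) */

  static fatptr   fat_make(void *addr, uint64_t len) {
      return ((fatptr)(len & ((UINT64_C(1) << 56) - 1)) << 64) | (uintptr_t)addr;
  }
  static void    *fat_addr(fatptr p) { return (void *)(uintptr_t)p; }
  static uint64_t fat_len (fatptr p) { return (uint64_t)(p >> 64); }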


JSC, Chakra, and SpiderMonkey use NaN-tagging for values: 32-bit integers, 52-bit pointers, and 64-bit doubles are all encoded into one word, with the former two using encodings that fall into the NaN space.
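
Roughly like this (a sketch with made-up tag values, not any engine's actual layout; it assumes pointers fit in 48 bits):

  #include <stdint.h>
  #include <string.h>

  typedef uint64_t value;

  #define TAG_MASK UINT64_C(0xFFFF000000000000)
  #define TAG_PTR  UINT64_C(0xFFFC000000000000)   /* invented pointer tag */
  #define TAG_INT  UINT64_C(0xFFFE000000000000)   /* invented int32 tag   */

  /* The canonical NaNs hardware produces never use these top-bit patterns, so
     anything at or above TAG_PTR is treated as a boxed pointer or integer
     (real engines normalize incoming NaNs to guarantee this). */
  static value box_double(double d) { value v; memcpy(&v, &d, sizeof v); return v; }
  static value box_ptr(void *p)     { return TAG_PTR | (uint64_t)(uintptr_t)p; }
  static value box_int(int32_t i)   { return TAG_INT | (uint32_t)i; }

  static int     is_double(value v) { return (v & TAG_MASK) < TAG_PTR; }
  static int32_t unbox_int(value v) { return (int32_t)(uint32_t)v; }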


If you lay out your virtual memory space correctly on a little endian machine, you can use the denormal space (assuming you don't care about underflows) for the actual memory and have floats and the rest of your world coresident in 64 bits.


> Not that you'd want to mmap() if you cared about performance.

So far I never used mmap (no use case that pushed me to find alternatives to the usual open & seek), but I was always curious about it.

Your statement made me search & find this ( https://unix.stackexchange.com/questions/474926/how-does-mem... ) => interesting.

Thanks :)


You might also consider this Q&A whose answers contain several reasons mmap might be faster, and several why it may be slower:

https://stackoverflow.com/questions/45972/mmap-vs-reading-bl...


mmap has a bunch of other use cases other than opening files. Shared memory, fast memory allocation, disk-backed memory regions. I believe it was invented after the gun that shoots both forwards and backwards failed to deliver on its fatality quota.


85.8 grams of DNA is enough to overflow 64 bits. Not that we're anywhere close to being able to use or process information on those scales, but at least storage on that scale doesn't outright violate the laws of physics/information theory.


Storage needs don't affect pointer size though. To need bigger pointers, you must need an addressable space this big.

That said, how do you make your calculation about DNA's information density? Without counting water, I've arrived at 720g for one “mole of bits”, which is a bunch of orders of magnitude higher than your figure: 2 nucleobases (262g/mole) [1], 2 deoxyribose (135g/mole each) and 2 phosphate (94g/mole each).

[1]: and it's the same value be it A-T or G-C, that's interesting


The information density is even higher than that (double), because one pair of bases has 4 possible states. The pair A-T is symbolically distinct from T-A, for example.


Not anywhere close? I've heard things about using DNA itself for the memory storage format which is super interesting. I guess it just has bandwidth limitations and is more for long term storage? [1]

[1] https://www.scientificamerican.com/article/dna-the-ultimate-...


85 grams of DNA contain around 8 * 10^22 base pairs, 2 bits per pair, so around 2 * 10^22 bytes. So you need about 75 bits to address all these bytes.

I do not know why Laremere chose 85.8 grams though, as that is well above the 64 bit limit, unless I got my multiplications wrong :P


Not that anything uses that much DNA, anyway:

"The male nuclear diploid genome extends for 6.27 Gigabase pairs (Gbp), is 205.00 cm (cm) long and weighs 6.41 picograms (pg)."

Or, many many orders of magnitude less than 85 grams...


It’s incorrect to view Fugaku (as cool as it is) as having a monolithically addressable 5 PiB. Each processor/node has 32 GiB directly attached via HBM and not expandable (Apple M1ish). It’s a really big cluster of (much) smaller memory spaces.


4.85 PiB of RAM -- WOW


Oh, that machine is a monster. All the memory is HBM2 as well, so it's stacked in the same package as each processor chip, connected with an interposer. HBM gives you 4 stacks, each with 8 single-duplex 128-bit memory buses that can run at 2+ GHz. So that 4.85 PiB is accessible with a raw bandwidth of around 144 PiB/sec. Definitely a monster of a machine.


They are at the edge of current RAM limits of 52-bits (I believe this is the case on both x86 and ARM systems).

If they could get a system with the full 64-bits of addressable RAM, they'd likely have gone even bigger.


They had 128-bit computing in the 1970s if you count this sort of thing

https://en.wikipedia.org/wiki/IBM_AS/400

which used 128-bit unique identifiers for objects that could be in RAM or kept persistently in storage which was transparently managed by the OS, language runtime, etc.


Some VAXen could do 128-bit ALU operations too. I thought DEC had a full 128-bit CPU but I've never been able to find it... I probably just remember wrong.


Of the various claims that something is X-bit I think the most credible is the size of the address space as opposed to the size of the ALU.

A 128-bit address space is large enough (UUIDs) that you can name 2^64 objects (about as many iron atoms as are in an iron filing) randomly. That lets you take two separate systems and merge them into one system with no name conflicts, a handy property for 'distributed systems'.


> Of the various claims that something is X-bit I think the most credible is the size of the address space as opposed to the size of the ALU.

But that would make the MOS 6502 a 16-bit processor, and the Intel 8088 20-bit (!) -- neither of which are consistent with common usage. (They're generally considered 8-bit and 16-bit, respectively.)

I'd stick with "the size of a general-purpose register". That neatly excludes special-purpose registers from consideration, like 256/512-bit AVX registers in x86_64 systems and the 16-bit program counter in most 8-bit CPUs.


That would yield a mess, historically. For instance, the 32-bit IBM System/360 had a 24-bit address space. The 16-bit Intel 8086 had a 20-bit address space. The 4-bit Intel 4004 had a 12-bit address space. It doesn't make sense to call the 4004 a 12-bit processor and the System/360 a 24-bit processor.


Relatedly, I hope we eventually get CPUs where pointers are bit addressed. I'm not suggesting they should allow byte level access at arbitrary alignments; I just mean that the smallest addressable unit is a bit.

That would allow us to finally declare lists of booleans without needing any specialization or losing the possibility of referencing elements.


> Relatedly, I hope we ever get CPUs where pointers are bit addressed.

You mean like the famous PDP-6 from 1964? The width of a byte was variable and could be as small as 1: http://pdp10.nocrew.org/docs/instruction-set/Byte.html

(That's the PDP-10 instruction set documentation, but the PDP-6's instruction set was essentially identical).

I never wrote PDP-6 code but wrote a lot of PDP-10 code and the byte instructions were used heavily. In particular characters were commonly six or seven bits wide so string manipulation used byte pointers and was quite fast, but they were also good for packed arrays of all sorts of sizes.


Some ARM cortex-m chips have a feature where you can address individual bits. https://developer.arm.com/documentation/ddi0439/b/Programmer...


This would also go nicely with arbitrary-width integers. Load/store instructions would just need to set aside a few opcode bits to define the number of bits to load or store, and pointers would just get 3 bits wider (basically left-shifted 3 bits, the upper bits in a 64-bit pointer are not used anyway). Pointer alignment could be the same as the bit width, e.g. 2-bit load/store could require 2-bit alignment, byte load/store byte alignment, and so forth...
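
As a toy model in C (a sketch over a flat byte array standing in for memory, not a hardware proposal): a bit pointer is just byte_address*8 + bit_offset, and an n-bit load shifts and masks the bits back out.

  #include <stdint.h>

  /* Load n (<= 64) bits starting at bit address 'bitptr', LSB first. */
  static uint64_t load_bits(const uint8_t *mem, uint64_t bitptr, unsigned n) {
      uint64_t out = 0;
      for (unsigned i = 0; i < n; i++) {
          uint64_t b = bitptr + i;
          out |= (uint64_t)((mem[b >> 3] >> (b & 7)) & 1) << i;
      }
      return out;
  }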


https://en.wikipedia.org/wiki/IBM_7030_Stretch

> Bytes are variable length (1 to 8 bits).[17]


The CM-1 and the CM-2 were bit addressed. I _loved_ that part of it, and I don't even understand why.

edit: I hadn't started doing systems programming yet, but now that I am I would like it even more


Here is a commercial bit addressed processor:

https://en.wikipedia.org/wiki/TMS34010


Funny, I've had the opposite thought. If registers are a certain width, and all memory access is aligned anyway, why do we need pointers that can address arbitrary bytes? Heck, why are bytes even a thing in a 64-bit machine?


Having done some network code on the 16-bit Xerox Alto, it's a big pain if you can't access bytes directly. If you want to deal with characters, you need to do shifts and AND operations, depending if you want the odd byte or the even byte. It's very annoying.
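
Roughly what it looks like in C on a word-addressed machine (a sketch, not actual Alto code): memory is an array of 16-bit words, so pulling out one character is a shift plus a mask.

  #include <stdint.h>

  /* Fetch the byte at 'byte_index' from word-addressed memory, with the
     even byte packed into the high half of each 16-bit word. */
  static uint8_t get_byte(const uint16_t *mem, uint32_t byte_index) {
      uint16_t word = mem[byte_index >> 1];
      return (byte_index & 1) ? (uint8_t)(word & 0xFF)   /* odd byte:  low half  */
                              : (uint8_t)(word >> 8);    /* even byte: high half */
  }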


Do you mean characters as in real languages? Because then bytes don't matter, but codepoints do, and those can be any length.

Obviously dealing with legacy formats that are bit or byte aligned should still be supported using char arrays, it just would require compiler assistance for pointers.


Well, let's just switch to 32-bit Unicode characters and drop all this UTF-8 stuff. But even then we would be addressing 32-bit words with a 64-bit machine.


I don't know if you lived through the early RISC days, but it was really a bit of a bother when dealing with things that weren't words. But sure, it wasn't that bad. I just get really tired of all the shift and mask stuff... seems like we should be able to do better. I guess you could paper over it with sufficient language/runtime support.


It's really convenient to pass around pointers to a single GPIO pin on embedded systems that have a special region of memory where each byte corresponds to one pin rather than the usual 8. Atomic access to bits would also be convenient.


We have atomic bitwise operations already (look at glibc's mutex implementation), and the unit atomic operations work on is a 64-byte cache line. Cache lines are useful because reading 64 bytes isn't really more expensive but it improves sequential memory access by a lot.
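
For example, with the GCC/Clang __atomic builtins (a sketch, not glibc's internal code), a per-bit atomic update is just an atomic OR/AND on the containing word:

  #include <stdint.h>

  static uint64_t flags;   /* shared word; bit 5 is "ours" */

  static void set_bit5(void)   { __atomic_fetch_or (&flags, UINT64_C(1) << 5,    __ATOMIC_SEQ_CST); }
  static void clear_bit5(void) { __atomic_fetch_and(&flags, ~(UINT64_C(1) << 5), __ATOMIC_SEQ_CST); }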


> Of course, some kinds of system designs burn physical memory addresses faster than you'd expect. In particular, suppose you build a system with multiple memory systems. A minimal/natural approach is to use the high-order bits of an address to select the memory to be accessed.

We're already starting to see this regarding some of the up-and-coming technologies like CXL and Gen-Z (which have now been merged). The idea is for a fabric consisting of multiple machines to share a common address space, and with the transition from 48-bit virtual addresses to 56-bit virtual addresses happening soon-ish, there's not many bits in the address space left to leave a machine identifier in your 56-bit address.


> For example, 36-bit physical addresses support 16GB memories ... and there already have been shipped single-rack microprocessor boxes with 16GB using just 16Mb DRAMs; there are of course, more in the 4GB-8GB range. Of course, a 32-bit physical addressing machine can get around this with extra external-mapping registers ... assuming one can ignore the moaning from the kernel programmers :-)

I remember 16 megabytes of RAM being enough for a good gaming rig in 1995. Who was using gigabytes of RAM back then, and for what?


I was building and speccing servers to run banking applications in the mid to late 90s and this was in the days when x86 was still rubbish (or perceived to be so), so it was big Solaris boxes (E10k etc) and beefy HP PA-RISC machines.

Applications were written in C++ and COBOL, the DB was Oracle.

CPUs were expensive, disks were slow (and expensive - this is when EMC ruled!), so memory was used to boost database performance, and you could put a lot of memory in these machines (and we did). There were often 3 machines at each site, 2 production (one hot standby) and 1 test/qa machine, usually the same chassis but with lower specs (less CPU, less memory, less disk etc). The production machines were often at two close sites (10k apart) with disk replication between them over fibre in the street.

It was quite common back then for the clients to specify ridiculous projected growth, and require total resilience from the hardware. I'd spec up the machines, and they'd baulk at the price (and when you get an international bank to baulk at the price you can imagine we're talking multiple millions for hardware let alone support contracts etc). Then, following a slice of realism, they'd relax some of the requirements, and we'd get something for 1/4 of the money.

Happy days, lots of travel though.


How did they secure the fiber? IIRC back then it was hard to get decent (especially non-backdoored) hardware encryption solutions @ high network throughput.

I wonder if anyone spliced fiber, MITM'd disk replication, made bank and got away with it...


I would be very surprised if there was any encryption, but I don't know how the tech worked, other than it was all 'dark fibre', that is, purchased links between sites without any offered services.

The EMC drive arrays were expensive enough that they'd be shared between projects, effectively, capacity was 'leased' to different teams within the organisation. Those drives were replicated between the sites via the fibre, so the infrastructure team responsible for that would have been doing whatever they could I guess (probably getting it working at all was the challenge rather than worrying about encryption).

I do know that exchange connectivity for trading was via leased X.25 connections over the PSTN, and there was no encryption on those links. The client orders were small though, nothing to get excited about.

Proper inter-bank stuff, the big bucks, were carried on the SWIFT network, which is encrypted, and that would be the place to attack if you wanted to get up to mischief. I guess that's still used today (was heavily used last time I worked in that industry 10 years ago for settlement, fx, etc)


IMHO the most interesting thing about SWIFT (whose official line is that they are just a messaging service and carry no money) is that it is an effective global monopoly on international inter-bank financial transfers with no published history. What backstory is available shows that it was founded as a Belgian cooperative by a European who was sent back to Europe after being head of international settlements at Amex. The organization appeared really suddenly given the glacial technological norms of the day, and its first HQ was not located in a financial center but rather in Virginia, right by the CIA. Yes, I'm saying it was probably founded as an intelligence asset. The US really leveraged the post-WWII and cold war periods to cement its geopolitical dominance globally. Well played.


In 1996 electronic design automation tools were also hitting the limits of 32 bit address spaces. I don't remember the exact timing of when they were able to use more than 4GB in a single process; I think there might have been an in-between period where a Solaris machine running on a 64 bit SPARC CPU could use more than 4GB, but a single process was limited to that amount. Within a few years the tools were ported to use more than 4GB in a single process.


Most likely databases. Pretty sure one of the Sun SPARC boxes I managed in 1995 had a few GBs of RAM. It also ran Informix, which I don't miss administrating.


A couple years later, but in 1997 you could get one of these with 64GB of RAM.

https://en.wikipedia.org/wiki/Sun_Enterprise#Enterprise_1000...


Databases. Early high traffic internet services


In other words: the same things that, today, are using a terabyte of RAM. Cloud computing and databases.


https://en.wikipedia.org/wiki/128-bit_computing

"The RISC-V ISA specification from 2016 includes a reservation for a 128-bit version of the architecture, but the details remain undefined intentionally, because there is yet so little practical experience with such large memory systems"


Fabrice Bellard's TinyEmu actually implements enough of the RV128 spec to run code. So the spec is defined well enough to implement, I just think they don't want to paint themselves into a corner before there's an actual need.

I was going to link to his site, but it seems he needs to update his certs.



One small thread from the past:

The prospects for 128 bit processors (1995) - https://news.ycombinator.com/item?id=6300063 - Aug 2013 (2 comments)


He failed to account for the "why" with previous architectures.

8-bit systems offer numbers from 0 to 255. In order to solve most realistic problems, you need 16 or 32-bit numbers. Working with a 32-bit number requires a half-dozen instructions to add two numbers and closer to 30+ instructions to multiply them, and that's assuming no register pressure. If you only have a couple registers and an accumulator, you will have many more instructions moving data between registers and memory.

16-bit processors were an almost uniform upgrade here. You halve the number of raw math operations, and the number of MOVs decreases way more than that.

32-bit processors were a mild upgrade in comparison. If you needed 32-bit numbers, they were faster, but any time you could get away with 16-bit numbers, the other half of the ALU was just wasting power and die space. There was still a pretty large advantage to this change though.

64-bit processors have comparatively mild advantages. Unfortunately, increasing cache size (especially L1 cache) explodes the implementation complexity when you try to keep latency cycles down. Actually using a bunch of 64-bit numbers where 32-bit numbers would be sufficient halves the hit rate in cache. The actual gains from a fully 64-bit processor are consequently rather mild in most real-world applications.

Most of the things people believe to be inherent advantages of 64-bit processors are actually x86 specific, because x86 could double its general purpose registers and ditch tons of legacy garbage when in 64-bit mode. The Pentium 4, for example, could access 36 bits of memory despite being a 32-bit architecture. Meanwhile, 64-bit x86 systems use 40-52 bits for their memory addresses rather than the full 64, and we're nowhere close to saturating that in any commonly-used system.

128-bit processors hit this wall hard. You can't use 128-bit numbers everywhere because the cache hit is too big (you could mitigate this by increasing addressable size from 8 bits to 16 or 32 bits, but that adds another set of problems). To get that full cache usage, you are still going to be using 32-bit integers wherever possible, but that means 3/4 of your pipeline goes to waste for most calculations.

For the occasional times where 128-bit or bigint numbers are needed, it's better to have a 64-bit processor and dedicate that extra die area to other things than wider ALUs. There is a special case for very specific supercomputers running highly-detailed simulations requiring high-precision floating-point numbers, but even in those cases, I suspect a 64-bit system with a 128-bit floating-point co-processor is probably a good approach.


> To get that full cache usage, you are still going to be using 32-bit integers wherever possible, but that means 3/4 of your pipeline goes to waste for most calculations.

I thought we were in an era of Dark Silicon where we are power/thermal constrained and it’s great to have lots of different special purpose parts of the die as long as you can turn them off when you’re not using them.


Not even close.

TSMC 10nm wafer $6,000

TSMC N7 wafer $9,000

TSMC N5 wafer $17,000

TSMC N3 wafer $20,000+ (exact pricing not public)

Cutting core size and saving just a few mm^2 per chip means a few more chips per wafer which is critical to the bottom line given the radical price increases for each node.

The only company this doesn't seem to be particularly true for is Apple.

Apple has 2 large, high-performance cores, 4 relatively large efficiency cores (around A78 performance levels), a large GPU, and a whopping 32MB of SLC (system level cache).

Meanwhile, Qualcomm's just released Snapdragon has ONE high-performance core (that is still smaller than Apple's performance cores) and 3 "medium" cores that are only a bit better than Apple's efficiency cores with a smaller GPU and a tiny 6MB SLC. Qualcomm has been so desperate to lower wafer/chip costs that they even moved to a terrible Samsung process.

The only specialized units are HIGHLY specialized. They are fixed function (or near fixed function) video decoders or ML units and none of these actually add very much die area. GPUs have actually moved heavily toward general purpose computing (GP-GPU) specifically so they could reuse the die area for other non-graphics things.


> Of course, a 32-bit physical addressing machine can get around this with extra external-mapping registers

When I came up in this industry, we referred to that as "Bank Switching"


Finally we can get to IPv6 and store an address in a register.


We've definitely found use cases for higher-width buses, but seemingly less so for operating on 128-bit integers and pointers, like this article is describing.

Take for instance VLIW architectures (Itanium used 64-bit registers but 128-bit instructions; you packed two ops into an instruction), vector extensions (AVX-512 uses 512-bit registers to hold multiple integers from 16 to 64 bits long) - we've long known these to be viable strategies for long word lengths in a CPU. But the ability to operate on a 128-bit integer directly doesn't seem to have a use case yet.


> (Itanium used 64-bit registers but 128-bit instructions; you packed two ops into an instruction)

Three ops.


Even 64-bit systems today don’t address 64-bits of physical memory. The Arm A78 (a very random example) uses 40 bits of physical address space. The 128 bits would really be for arithmetic and bus widths.


Some 64-bit systems still choose to use 32-bit pointers because it saves so much memory on storing them.


I don't know why you were downvoted. This is absolutely true and even the biggest x86 systems don't address more than 52-bits.


https://news.ycombinator.com/item?id=26112698

>On Nov 8, 2018, I sent Bill Joy a birthday greeting: "Happy 17,179,869,184 MIPS Birthday, Bill"! (2 to the (year - 1984))

https://medium.com/@donhopkins/bill-joys-law-2-year-1984-mil...


Great to see posts from John Mashey here.

John wrote the UNIX PWB aka Mashey shell, and then had a long career at MIPS and SGI.


At the time at least one 128 bit computer had been on the market for several years: The Rational R1000.


> Of course, if somebody does an operating system that uses 128-bit addressing to address every byte in the world uniquely, and this takes over the world, it might be an impetus for 128-bitters :-)

I'd like to see such OS.


Symbolics tried that.

Scary thought: how many bytes of RAM are there in the world? Probably more than 2^64. There are around 7 billion smartphones, and many have over 4GB of memory. That's over 2^64 right there.

I'd like to have 128-bit atomic operations. If you're doing lock-free programming, hardware compare and swap on a structure that contains two pointers is useful. Then you can update lists atomically. In a 64-bit pointer machine, that takes 128 bit atomics. So far, not much hardware offers that.


CMPXCHG16B has been standard for the vast majority of the time x86_64 has existed. If a CPU can run windows 8.1, then it has 128-bit atomic operations.
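
In C, the double-width CAS the parent wants looks something like this (a sketch using GCC/Clang's `unsigned __int128` and the __atomic builtins; on x86_64 you'd build with -mcx16 so the compiler can emit CMPXCHG16B rather than fall back to a lock):

  #include <stdint.h>

  /* Atomically replace *slot with 'desired' if it still equals *expected,
     e.g. a 16-byte {pointer, counter} pair at the head of a lock-free list. */
  static int cas128(volatile unsigned __int128 *slot,
                    unsigned __int128 *expected,
                    unsigned __int128 desired) {
      return __atomic_compare_exchange_n(slot, expected, desired,
                                         0 /* strong */,
                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
  }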


Rust does not have AtomicU128, for some reason. Need to look into that.



Ah. There are still some AMD64 CPUs out there that don't have cmpxchg16b capability. The Linux developers made the decision to not use cmpxchg16b and continue supporting them. Microsoft made the decision to use cmpxchg16b for Windows 8 and later. There are some libraries which probe on first use and either use cmpxchg16b or use a global lock. That's probably a viable strategy at this point, since CPUs without it are now rare. But I can see the argument against it.

[1] https://superuser.com/questions/187254/how-prevalent-are-old...


Symbolics maxed out at 40-bit pointers.


There was some kind of scheme for accessing objects on other machines over the local network using a very long memory address, but I don't know if it was ever used.


You joke but I found this interesting (from https://en.wikipedia.org/wiki/Metal_oxide_semiconductor):

> It is the basic building block of modern electronics, and the most frequently manufactured device in history, with an estimated total of 13 sextillion (1.3×10^22) MOSFETs manufactured between 1960 and 2018

64 bits can address 1.8x10^19 bytes. Of course there is not a 1:1 correlation between MOSFETs ever made and bytes of RAM usable today, but it's curiously close to the 64-bit limit, so you could say 64 bits is just enough to address every byte in the world.

I wonder how much magnetic storage would add to that?


> I wonder how much magnetic storage would add to that?

A lot.

Just looking at hard drives, if you take an estimate of 250 million units made in 2020 and multiply by a conservative 4TB, you get 10^21 bytes for that single year.


I worked on a magnetic storage device that stored an exabyte, so quite a bit.


There might be a bit of a problem powering it:

https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans/


So a few notes on that:

* That's the threshold where you go from 128 to 256, not where you go from 64 to 128.

* If you're not going for the physically maximum overclock, the energy per bit is a lot less.

* That's for storing 512 byte blocks, so you'll run out of RAM bits 512 times sooner.

* Humans use about 10^20 joules of electricity per year.

* 10^20 joules at the cost in the link would be about 2^110 bytes, which is not all that far off.


Notice the physical address they have listed is the current Google headquarters.

The Googleplex is built on/around the old SGI campus.


Non-volatile memory tech. Once storage and ram become one and the same then the need for addresses will explode.


CHERI/Morello pointers are 128 bit.


It is a genuine method to speed up computation if you can put all the operands in the instruction (VLIW).


FTA: *Thus, a CPU family intended to address higher-end systems will typically add 2 more bits of physical address every 3 years.*

For me, that loosely follows from the “CPU speed doubles every 18 months” variant on Moore’s law. Having extra address space doesn’t help you if your CPU is too slow to fill it (1), so if your CPU gets twice as fast in 18 months, the amount of memory you can effectively address doubles.

That would only change if we had non-volatile RAM.

(1) Yes, you could have a hash table with effectively zero chance of collisions if you were to use petabytes to store a few hundred items, and that would be useful even with a 1 MHz CPU, but if you can spend that money on memory, you're better off spending it on a better CPU. So that would only help if CPU evolution stopped and memory got drastically cheaper, both in $ and in power usage.



