
ZFS 128 bit storage: Are you high? (2004) - mixologic
https://blogs.oracle.com/bonwick/128-bit-storage:-are-you-high
======
vbezhenar
> Some customers already have datasets on the order of a petabyte, or 2^50
> bytes. Thus the 64-bit capacity limit of 2^64 bytes is only 14 doublings
> away. Moore's Law for storage predicts that capacity will continue to double
> every 9-12 months, which means we'll start to hit the 64-bit limit in about
> a decade. Storage systems tend to live for several decades, so it would be
> foolish to create a new one without anticipating the needs that will surely
> arise within its projected lifetime.

So exactly 14 years have passed. Does anyone have 2^64 bytes in a single ZFS
filesystem (or anything close to that)? It doesn't really feel like storage
capacity (or 1/price) doubles every year.

~~~
bhouston
I think the largest file systems in the world are Amazon's S3, and maybe
Glacier. I'd guess Google's internal systems used for YouTube and Gmail
probably outrank them, but there are no public numbers. Either way, I doubt
that they are in a single address space on a single file system.

So basically we moved away from singular large unified file systems and built
swarms of little file systems.

~~~
vesinisa
Indeed. "Cloud" is essentially tons of commodity hardware nodes running a
tailored userspace distributed file system application. There now certainly
exist cloud services that host well in excess of 2^64 bytes, but they are
hosted not on traditional file systems but on arrays of them.

I certainly can't blame their choice though, and 16-million-terabyte hard
drive arrays (the limit of 64-bit addressing) on a large mainframe are really
just a few quantum leaps away. Heck, I still vividly recall buying a MASSIVE
40 GB HDD in 2000 - it seemed excessive beyond wildest dreams at the time (and
was promptly filled to the brim at a LAN party.)

Since then, advances in mass storage tech, most importantly the successful
commercialization of perpendicular recording in 2005, mean that nowadays 12 TB
drives are commercially viable - roughly 300x the size of a 40 GB drive.

In the next 20 years, petabyte-sized drives might well be available (though
almost certainly not based on any form of magnetic recording). Once you hook a
hundred 50-petabyte drives to a single fileserver, you start closing in on the
limits of 64-bit addressing.

~~~
dsr_
Overly pedantic quibble: a quantum of anything is the minimum possible unit.
One Planck length, time, mass, charge, temperature.

Everybody uses it to mean "big". Oh, language.

~~~
taneq
"Quantum leap" is typically interpreted as "big leap" but I believe the true
implication is more like "indivisible leap", in line with your quibble. A
Turing-complete computer is a "quantum leap" ahead of a non-Turing-complete
calculator. A warp drive is a "quantum leap" ahead of a reaction drive. You
either have all of it or none of it.

~~~
rusk
A quantum leap is an increase in the units, or “quanta”, you use to measure
something, i.e. MHz to GHz - the term doesn’t relate directly to quantum
mechanics.

~~~
dsr_
In your example you are using the same units - Hertz, one cycle per second -
and adding SI prefixes for million X and billion X.

~~~
rusk
Yes, but the "quantum" has increased by an order of magnitude.

The point is, that the increase in frequency is so large, that using MHz is no
longer sensible.

There has been a "quantum leap" in frequency.

------
znpy
"we don't want to deal with this shit for another forty years or possibly
forever" would have sufficed for me.

~~~
DaiPlusPlus
I’m struggling to think of a project that released a new version to
accommodate some far-off limit that wasn’t blindsided and obsoleted by
something operating on a different paradigm - like using YYMMDD for dates,
then changing it to YYYYMMDD for Y2K, when they should have been using integer
time_t all along. But then that was 32-bit and is now 64-bit, and we won’t
even be 10% of the way there before a relativistic-capable time system
replaces it.

So why worry? We’re doomed to rewrite it anyway :)

~~~
regularfry
Trading off a tiny design decision like that against knowing you're unlikely
to ever have to revisit it except in an unforeseeable situation is a pretty
good place to be.

------
NamTaf
What's the _downside_ to having 128 bits over 64? I'm not familiar enough with
file systems to know, so I'm left wondering if there's any downside to
oversizing the hammer for the nail?

~~~
kev009
It expands the data structure size, which multiplies N for both the on-disk
and in-RAM representations (and cache, and shorn cache lines). Many modern
architectures have native 128b types (amd64, POWER), so it's not really a big
deal to the CPU itself, even if you need to do atomic operations for
concurrency.

OTOH it guarantees there won't really need to be on-disk changes for pointer
sizes. That may be useful in situations that weren't really imagined; for
instance, 128b is enough for a Single Level Store where persistent and working
set (or NV) are all in the same mappable space for any conceivable workload.
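
For a sense of scale, a minimal sketch (my own illustration, not ZFS's actual
on-disk layout) of how the wider pointer flows straight into structure size:

    
    
      #include <stdint.h>
    
      /* A 128-bit block address kept as two 64-bit halves, as a
         filesystem might do on machines without a native 128-bit
         integer type. */
      typedef struct {
          uint64_t hi;
          uint64_t lo;
      } blkaddr128_t;
    
      /* Every structure embedding such addresses pays double for
         that field, on disk and in RAM. */
      _Static_assert(sizeof(blkaddr128_t) == 2 * sizeof(uint64_t),
                     "twice the footprint per block pointer");
    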

------
paulsutter
Real-world large filesystems are distributed across many thousands of hosts
and multiple datacenters, not mounted as a Linux filesystem on a single host.
Because whole racks and whole datacenters fail, not just disk drives.

So they used 128 bits because of bikeshedding. Committees always make the most
conservative decision possible. Like IPv6.

~~~
wahern
The fact that real-world storage systems are distributed on the network
bolsters the case for supporting 128-bit and even larger types.

Creating unified namespaces is _really_ useful and a _great_ simplifier. The
reason we don't do that as often as we should is because of limitations in
various layers of modern software stacks, especially in the OS layers.

Unfortunately, AFAIU ZFS only supports 64-bit inodes. A large inode space,
like 128-bit or even 256-bit, would be ideal for distributed systems.

Larger spaces for unique values are useful for more than just enumerating
objects. IPv6 uses 128 bits not because anybody ever expected 2^128-1 devices
attached to the network, but because a larger namespace means you can segment
it more easily. Routing tables are smaller with IPv6 because it's easier to create
subnets with contiguous addressing without practical constraints on the size
of the subnet. Similarly, it's easier to create subnets of subnets (think
Kubernetes clusters) with a very simple segmenting scheme and minimal
centralized planning and control.
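
As a toy illustration of that arithmetic (my numbers, not anything from a
spec): a single /48 delegation leaves 16 bits of subnet space ahead of the
customary 64-bit host part, so one contiguous route covers every subnet
beneath it:

    
    
      #include <stdio.h>
    
      /* IPv6 sketch: 128 = 48 (site prefix) + 16 (subnet id) + 64 (host).
         One routing-table entry for the /48 covers all of it. */
      int main(void) {
          unsigned subnet_bits = 64 - 48;                         /* /48 -> /64 */
          printf("/64 subnets per /48: %u\n", 1u << subnet_bits); /* 65536 */
          return 0;
      }
    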

Similarly, content-addressable storage requires types much larger than 128
bits (e.g. 160 bits for Plan 9 Fossil using SHA-1). Not because you ever
expect more than 2^128-1 objects, but because generating unique identifiers in
a non-centralized manner is much easier. This is why almost everybody,
knowingly or unknowingly, only generates version 4 UUIDs (usually improperly,
because they randomly generate all 128 bits rather than preserving the 6 fixed
version and variant bits required by the standard).
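
For reference, a minimal sketch of doing v4 properly per RFC 4122 (the
function name is mine): fill all 16 bytes with randomness, then overwrite the
6 fixed bits:

    
    
      #include <stdint.h>
    
      /* uuid[] is assumed to already hold 16 random bytes. */
      void uuid_v4_fixup(uint8_t uuid[16])
      {
          uuid[6] = (uuid[6] & 0x0F) | 0x40;  /* version bits: 0100 = v4 */
          uuid[8] = (uuid[8] & 0x3F) | 0x80;  /* variant bits: 10xx = RFC 4122 */
      }
    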

ZFS failed not by supporting a 128-bit type for describing sizes, but by only
supporting a 64-bit type for inodes. They probably did this because 1)
changing the size of an inode would have been much more painful for the
Solaris kernel and userland, given Solaris' strong backward compatibility
guarantees, and 2) they were focusing on the future of attached storage
through the lens of contemporary technologies like SCSI, not on distributed
systems more generally.

~~~
paulsutter
Unified namespaces on many-petabyte filesystems are perfectly commonplace:
HDFS, QFS, ... even old GFS.

You wouldn’t make them Linux/FUSE mountpoints though; that’s just an unneeded
abstraction. Command line tools don’t work with files that are 100TB each.

~~~
wahern

      Command line tools don’t work with files that are 100TB each.
    

No, but they do work with small files, which presumably most would be if the
number of objects visible in the namespace system were pushing 2^64.

100TB files are often databases in their own right, with many internal
objects. But because we can't easily create a giant unified namespace that
crosses these architectural boundaries, we can't abstract away those
architectural boundaries like we _should_ be doing and _would_ be doing if it
were easier to do so.

~~~
wahern
Just to be more specific, imagine inodes were 1024 bits. An inode could become
a handle that not only describes a unique object, but encodes _how_ _to_ reach
that object. Which means every read/write operation would contain enough data
for forwarding the operation through the stack of layers. Systems like FUSE
can't scale well because of how they manage state. One of the obvious ways to
fix that is to embed state in the object identifier.
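
A purely hypothetical sketch of what "state embedded in the identifier" could
look like (these fields are mine; no real system lays it out this way):

    
    
      #include <stdint.h>
    
      /* A wide object handle that encodes how to reach the object,
         so every read/write carries its own routing information. */
      typedef struct {
          uint64_t cluster;   /* which storage cluster */
          uint64_t node;      /* which host within the cluster */
          uint64_t volume;    /* which local volume or shard */
          uint64_t object;    /* object id within the volume */
      } wide_handle;          /* 256 of the imagined 1024 bits */
    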

A real-world example is pointers on IBM mainframes. They're 128 bits, not
because there's a real 128-bit address space, but because the pointer also
encodes information about the object, information used and enforced by both
software and hardware to ensure safety and security. Importantly, this is
language agnostic. An implementation of C in such an environment is very
straightforward; you get object capability built directly into the language
without having to modify the semantics of the language or expose an API.

Language implementations like Swift, LuaJIT, and various JavaScript
implementations also make use of unused bits in 64-bit pointers for tagging
data. This is either not possible on 32-bit implementations, or in those
environments they use 64-bit doubles instead of pointers. In any event, my
point is that larger address spaces can actually make it much easier to
optimize performance because it's much simpler to encode metadata in a single,
structured, shared identifier than to craft a system that relies on a broker
to query metadata. Obviously you can't encode all metadata, but it's _really_
nice to be able to encode some of the most important metadata, like type.
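
A minimal sketch of the tagging idea, using the low alignment bits rather
than the high bits those runtimes use (the names are mine):

    
    
      #include <stdint.h>
      #include <assert.h>
    
      /* malloc'd pointers are at least 8-byte aligned, so the low
         3 bits of a pointer are free to carry a small type tag. */
      typedef enum { TAG_INT = 1, TAG_STR = 2 } tag_t;
    
      static inline uintptr_t tag_ptr(void *p, tag_t t)
      {
          assert(((uintptr_t)p & 7) == 0);   /* alignment gives 3 free bits */
          return (uintptr_t)p | (uintptr_t)t;
      }
    
      static inline void *untag_ptr(uintptr_t v)
      {
          return (void *)(v & ~(uintptr_t)7);
      }
    
      static inline tag_t ptr_tag(uintptr_t v)
      {
          return (tag_t)(v & 7);
      }
    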

~~~
AstralStorm
POSIX invented "slow" extended attributes for this kind of use.

------
ktpsns
> Thus, fully populating a 128-bit storage pool would, literally, require more
> energy than boiling the oceans.

This is actually an argument _against_ 128 bits, because it clearly shows that
128 bits are unreachable and thus a waste. What about 96 bits?

~~~
DaiPlusPlus
96-bit and other non-power-of-two widths are cumbersome and error-prone to
work with in C - which is often used when writing firmware for computer
hardware.

The real risk of economic damage caused by bit-fiddling bugs is much greater
than the risk of bringing the universe’s thermodynamic heat death ever so
slightly nearer...

~~~
speleding
I would think it's fairly easy to keep all calculations in C in 128 bits and
just mask out the top few bytes when retrieving and storing. You could also
argue that >64-bit values will be rare enough that they warrant their own
code path as an optimization if they are encountered (perhaps implementations
already do this?).
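
A minimal sketch of that approach, assuming GCC or Clang's unsigned __int128
extension:

    
    
      #include <stdint.h>
    
      typedef unsigned __int128 u128;
    
      /* Do the arithmetic in 128 bits, wrap at 2^96 when storing. */
      #define MASK96 ((((u128)1) << 96) - 1)
    
      static u128 add96(u128 a, u128 b)
      {
          return (a + b) & MASK96;
      }
    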

------
russellbeattie
Back in 2004, they were probably thinking, "Well, IPv6 is using 128 bits, and
that's going to be the standard any day now..."

------
CognitiveLens
> If 64 bits isn't enough, the next logical step is 128 bits.

Can someone explain this? Is there some kind of awkwardness/waste with
anything less than doubling the number of bits?

~~~
phihag_
Yes, any operation is easy at power-of-two widths of 2^n bits (arithmetic
modulo 2^(2^n)). For instance, take addition of two 128-bit numbers x and y
(seen as 64-bit int arrays) on a 64-bit big-endian CPU:

    
    
      sum[1] = x[1] + y[1]
      sum[0] = x[0] + y[0] + carry from previous operation
    

In contrast, if you used 96 bits, you couldn't just use 64-bit integer
operations. Instead, you'd have to cast a lot (treating x, y, and sum as byte
arrays: high 32 bits in bytes 0..3, low 64 bits in bytes 4..11):

    
    
      sum[4..11] = *((int64*)(x + 4)) + *((int64*)(y + 4))
      sum[0..3]  = (int32) ( (int64) *((int32*) x) + (int64) *((int32*) y) + carry )
    

So you'd read 32-bit values into 64-bit registers, set the top 32 bits to
zero, perform the addition, and then write out a 32-bit value again.

It gets much worse if the width is not even a multiple of the operand sizes
your CPU supports; if you were to use 100 bits, you'd have to AND the values
with a bitmask and write out single bytes.

So 128 bits is far easier to implement and faster on many CPU architectures,
plus you get the peace of mind that your code will work for a long time. For
instance, let's assume the lower bound of 9 months per doubling (which is
unrealistic, as described in this article); then you're going to hit:

    
    
      50 bits (baseline from article): 2004
      64 bits: 2014
      80 bits: 2026
      92 bits: 2035
      100 bits: 2040
      128 bits: 2062
    

Now, what's the expected lifetime of a long-term storage system? It's well
known that the US nuclear force uses 8-inch floppy disks. Those were designed
around 1970, so a lifetime of roughly 50 years is to be expected. For ZFS,
that would be 2054. By this (admittedly very conservative) calculation, 128
bits is only barely more than required.
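
The table is just 50 bits in 2004 plus one bit per 9 months (0.75 years); a
quick sketch to reproduce it:

    
    
      #include <stdio.h>
    
      /* year(bits) = 2004 + 0.75 * (bits - 50), truncated. Matches the
         table above, except that it gives 2041 for 100 bits where the
         table says 2040. */
      int main(void) {
          int widths[] = { 64, 80, 92, 100, 128 };
          for (int i = 0; i < 5; ++i)
              printf("%3d bits: %d\n", widths[i],
                     2004 + 3 * (widths[i] - 50) / 4);
          return 0;
      }
    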

~~~
tzs
Don't 64-bit CPUs usually have efficient instructions for operating on
narrower values?

For instance, consider this C code for adding two 96-bit numbers on a 64-bit
machine (ignoring carry for now):

    
    
      #include <stdint.h>
    
      extern void mark(void);
    
      int sum(uint64_t * a, uint64_t * b, uint64_t * c)
      {
          mark();
          *c++ = *a++ + *b++;
          mark();
          *(uint32_t *)c = *(uint32_t *)a + *(uint32_t *)b;
          mark();
          return 17;
      }
    

The purpose of the mark() function is to make it easier to see the code for
the additions in the assembly output from the compiler. Here is what "cc -S
-O3" (whatever cc ships with macOS High Sierra) produces for my 64-bit Intel
Core i5, for the parts that actually do the math:

    
    
      callq   _mark
      movq    (%rbx), %rax
      addq    (%r15), %rax
      movq    %rax, (%r14)
      callq   _mark
      movl    8(%rbx), %eax
      addl    8(%r15), %eax
      movl    %eax, 8(%r14)
      callq   _mark
    

I'm not too familiar with x86-64 assembly, but I assume this could be made to
handle the carry by changing the "addl" to the 32-bit add-with-carry
instruction ("adcl").

Taking out the (uint32_t *) casts to turn the C code from 96-bit adding into
128-bit adding generates assembly code that differs only in that both movl
instructions become movq instructions, and addl becomes addq.

So, if you were writing in C, it looks like a 96-bit add would be a little
uglier than a 128-bit add because of the casts, but isn't slower or bigger
under the hood. Note, though, that this assumes accessing the 96-bit number
as an array of variable-sized parts. It's that assumption that introduces the
need for ugly casts.

If a struct is used, then there is no need for casts:

    
    
      #include <stdint.h>
    
      typedef struct {
          uint64_t low;
          uint32_t high;
      } addr;
    
      extern void mark(void);
    
      int sum(addr * a, addr * b, addr * c)
      {
          mark();
          c->low = a->low + b->low;
          mark();
          c->high = a->high + b->high;
          mark();
          return 17;
      }
    

This generates the same code as the earlier version.

(I still have no idea how to handle the carry in C, or at least no idea that
is not ridiculously inefficient. When I've implemented big-integer libraries,
I've either used a type for my "digits" that is smaller than the native
integer size, so that I could detect a carry with a simple AND, or I've
handled low-level addition in assembly.)
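
For what it's worth, here is a minimal sketch of one way to handle the carry
in C, assuming GCC or Clang (both provide __builtin_add_overflow); a plain
unsigned comparison works portably too:

    
    
      #include <stdint.h>
    
      /* 128-bit add as two 64-bit limbs, little-endian limb order. */
      void add128(const uint64_t a[2], const uint64_t b[2], uint64_t out[2])
      {
          /* carry is 1 iff the low-limb addition wrapped */
          unsigned carry = __builtin_add_overflow(a[0], b[0], &out[0]);
          /* portable alternative: out[0] = a[0] + b[0];
             carry = out[0] < a[0];  (well-defined for unsigned types) */
          out[1] = a[1] + b[1] + carry;
      }
    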

~~~
smitherfield
1\. Accesses through pointers type-punned to something other than a character
type (`char`, `signed char`, `unsigned char`) are undefined behavior.

    
    
      uint64_t n = 0xdeadbeef;
    
      uint32_t foo = (uint32_t)n; // OK
    
      uint32_t *bar = (uint32_t*)&n; // "OK" but useless
      foo = *bar; // undefined behavior!!!
    
      uint8_t *baz = (uint8_t*)&n;
      uint8_t byte = *baz; // OK, uint8_t is `unsigned char`
    
      // Same-size integral types are OK
      const volatile long long *p = (const volatile long long*)&n;
      const volatile long long cvll = *p; // well-defined
    

2\. Structs are aligned to the member with the strictest alignment
requirement, so a struct of a `uint64_t` and a `uint32_t` will be aligned on
an 8-byte boundary, meaning its size will be 128 bits.
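
Point 2 is easy to check with a couple of static assertions (a C11 sketch,
assuming a typical LP64 ABI):

    
    
      #include <stdint.h>
    
      typedef struct {
          uint64_t low;   /* 8 bytes */
          uint32_t high;  /* 4 bytes + 4 bytes of tail padding */
      } addr;
    
      _Static_assert(sizeof(addr) == 16, "padded out to 128 bits");
      _Static_assert(_Alignof(addr) == _Alignof(uint64_t),
                     "aligned to the strictest member");
    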

~~~
tzs
> Structs are aligned to the member with the strictest alignment requirement,
> so a struct of a `uint64_t` and a `uint32_t` will be aligned on an 8-byte
> boundary, meaning its size will be 128 bits.

Don't most C compilers support a pragma to control this? "#pragma pack(4)" for
clang and gcc, I believe.

Given this (where I've made it add two arrays of 96-bit integers to make it
easier to figure out the sizes in the assembly):

    
    
      #include <stdint.h>
    
      #pragma pack(4)
      struct block_addr {
          uint64_t low;
          uint32_t high;
      };
    
      int sum(struct block_addr * a, struct block_addr * b, struct block_addr * c)
      {
          for (int i = 0; i < 8; ++i)
          {
              c->low = a->low + b->low;
              c++->high = a++->high + b++->high;
          }
          return 17;
      }
    

here is the code for the loop body, which the compiler unrolled to make it
even easier to see how the structure is laid out:

    
    
      movq    (%rbx), %rax
      addq    (%r15), %rax
      movq    %rax, (%r14)
      movl    8(%rbx), %eax
      addl    8(%r15), %eax
      movl    %eax, 8(%r14)
      
      movq    12(%rbx), %rax
      addq    12(%r15), %rax
      movq    %rax, 12(%r14)
      movl    20(%rbx), %eax
      addl    20(%r15), %eax
      movl    %eax, 20(%r14)
      
      movq    24(%rbx), %rax
      addq    24(%r15), %rax
      movq    %rax, 24(%r14)
      movl    32(%rbx), %eax
      addl    32(%r15), %eax
      movl    %eax, 32(%r14)
      
      ...
      
      movq    84(%rbx), %rax
      addq    84(%r15), %rax
      movq    %rax, 84(%r14)
      movl    92(%rbx), %eax
      addl    92(%r15), %eax
      movl    %eax, 92(%r14)
    

(Some white space added, and the middle cut out.) The 96-bit integers are now
only taking up 96 bits.

~~~
smitherfield
Packed structs are possible, to be sure, but inhibit numerous optimizations,
such as (relevant to this case) the use of vector instructions and vector
registers.

Changing the loop to 4 iterations for compactness' sake, (aligned) structs of
two u64s generate the following, vectorized code:

[https://godbolt.org/g/jB4jki](https://godbolt.org/g/jB4jki)

    
    
      vmovdqu (%rsi), %xmm0
      vpaddq  (%rdi), %xmm0, %xmm0
      vmovdqu %xmm0, (%rdx)
      vmovdqu 16(%rsi), %xmm0
      vpaddq  16(%rdi), %xmm0, %xmm0
      vmovdqu %xmm0, 16(%rdx)
      vmovdqu 32(%rsi), %xmm0
      vpaddq  32(%rdi), %xmm0, %xmm0
      vmovdqu %xmm0, 32(%rdx)
      vmovdqu 48(%rsi), %xmm0
      vpaddq  48(%rdi), %xmm0, %xmm0
      vmovdqu %xmm0, 48(%rdx)
      retq
    

And if the pointer arguments are declared `restrict`, the loop can be
vectorized even more aggressively:

    
    
      vmovdqu64       (%rsi), %zmm0
      vpaddq  (%rdi), %zmm0, %zmm0
      vmovdqu64       %zmm0, (%rdx)
      vzeroupper
      retq
    

Either of which is much more efficient than the code generated for unaligned,
packed 96-bit structs:

    
    
      movq    (%rsi), %rax
      addq    (%rdi), %rax
      movq    %rax, (%rdx)
      movl    8(%rsi), %eax
      addl    8(%rdi), %eax
      movl    %eax, 8(%rdx)
      movq    16(%rsi), %rax
      addq    16(%rdi), %rax
      movq    %rax, 16(%rdx)
      movl    24(%rsi), %eax
      addl    24(%rdi), %eax
      movl    %eax, 24(%rdx)
      movq    32(%rsi), %rax
      addq    32(%rdi), %rax
      movq    %rax, 32(%rdx)
      movl    40(%rsi), %eax
      addl    40(%rdi), %eax
      movl    %eax, 40(%rdx)
      movq    48(%rsi), %rax
      addq    48(%rdi), %rax
      movq    %rax, 48(%rdx)
      movl    56(%rsi), %eax
      addl    56(%rdi), %eax
      movl    %eax, 56(%rdx)
      retq
    

A smaller cost is that in non-vector code, using a 64-bit register (rax) in
32-bit mode (eax) is wasting half of the register.

IIRC, unaligned loads and stores will also, at the hardware level, stall the
pipeline and inhibit out-of-order execution.

~~~
smitherfield
Oops, I used `#pragma pack` incorrectly in my code, but it doesn't change the
codegen for the 96-bit structs other than offsets. Also `restrict` is only
needed on the output argument to enable full vectorization of the 128-bit
structs.

New link: [https://godbolt.org/g/8uGn4h](https://godbolt.org/g/8uGn4h)

------
PeCaN
> Thus, fully populating a 128-bit storage pool would, literally, require more
> energy than boiling the oceans.

So we should be good until OceanCoin is introduced.

------
chickenthief
"Thus, fully populating a 128-bit storage pool would, literally, require more
energy than boiling the oceans.". Could we instead boil some Jupiter Moons?

~~~
saalweachter
Boiling the oceans is an alternative energy use, not a source.

------
exabrial
Is it true zfs never saw widespread adoption because of licensing compat with
Linux and patents or has come and gone?

~~~
boomboomsubban
People are reluctant to use OpenZFS on Linux as it can't be included in the
kernel due to license issues. I'd still say it has widespread adoption when
you consider the people willing to run it on Linux anyway and how popular it
is on FreeBSD and some other systems.

------
sova
>That's enough to survive Moore's Law until I'm dead, and after that, it's not
my problem.

...really?

~~~
mort96
No, I don't think that was entirely serious. 128 bytes should be more than
enough to last for a very long time, not just a little bit longer than the
author's life.

~~~
mort96
Damn, just noticed I wrote 128 bytes and not 128 bits, too late to edit.

~~~
sova
I wasn't talking about the number of bits, my friend. I was talking about the
attitude that "it's not my problem after I'm dead." That's exactly why we are
stuck with so many societal problems. For example, the Federal Reserve Act of
1913 was a true victory for the sons-of-guns who got it enacted, but then they
died and now we're all left with the aftermath. It's this shortsighted
attitude that is the problem, not the number of bits in the implementation.

------
z3t4
It could run the file system of the death star (Star Wars).

------
frozenport
Why can't they have 64-bit and 128-bit ZFS offerings side by side? When
somebody actually needs 128 bits, they can migrate.

