ZFS 128 bit storage: Are you high? (2004) (oracle.com)
248 points by mixologic on Feb 12, 2018 | 144 comments

> Some customers already have datasets on the order of a petabyte, or 2^50 bytes. Thus the 64-bit capacity limit of 2^64 bytes is only 14 doublings away. Moore's Law for storage predicts that capacity will continue to double every 9-12 months, which means we'll start to hit the 64-bit limit in about a decade. Storage systems tend to live for several decades, so it would be foolish to create a new one without anticipating the needs that will surely arise within its projected lifetime.

So exactly 14 years passed, does someone have 2^64 bytes for a single ZFS filesystem (or anything close to that)? I don't really feel like storage capacity (or 1/price) doubles every year.

They had to be on the conservative side, so they assumed a doubling every 9-12 months. In reality we have seen one doubling every ~34 months in the last 14 years. In 2005 I bought 250GB HDDs for $110 each. In 2018 you can find 8TB HDDs for $150 each. That's 5 doublings.
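For anyone who wants to check the arithmetic, here is a minimal sketch of the doubling estimate above. It uses only the two drive sizes and the 14-year span quoted in this thread, ignores the small price difference between the two drives, and the function names are my own.

```python
import math


def doublings(old_gb: float, new_gb: float) -> float:
    """Number of capacity doublings needed to get from old_gb to new_gb."""
    return math.log2(new_gb / old_gb)


def months_per_doubling(span_years: float, n_doublings: float) -> float:
    """Average number of months each doubling took over the span."""
    return span_years * 12 / n_doublings


# 250 GB drive (2005) versus 8 TB drive (2018), taking 1 TB = 1000 GB.
n = doublings(250, 8000)
pace = months_per_doubling(14, n)
print(f"{n:.1f} doublings, one every {pace:.1f} months")
```

8000 / 250 is exactly 32, so this gives 5 doublings and a pace that rounds to the ~34 months quoted.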

So with 5 doublings, if we saw datasets on the order of 2^50 back then, we should see datasets on the order of 2^55 today (30 PB). And sure enough, here is a computer with a 30 PB global filesystem: https://www.fujitsu.com/downloads/TC/sc11/k-computer-system-...

So another 9 doublings to go to 2^64 bytes. With one every 34 months, this should happen in 25 years (2043). But again, if you are a filesystem developer you should be on the conservative side and assume it might double every ~12 months.
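A rough sketch of that projection, assuming datasets of about 2^55 bytes in 2018 and a constant doubling period (both taken from the comments above; the function name is hypothetical):

```python
def year_limit_reached(start_year: int, start_exp: int, limit_exp: int,
                       months_per_doubling: float) -> float:
    """Year in which 2**start_exp bytes has grown to 2**limit_exp bytes."""
    remaining_doublings = limit_exp - start_exp
    return start_year + remaining_doublings * months_per_doubling / 12


# Start from ~2^55 bytes in 2018 and project to the 2^64-byte limit.
observed = year_limit_reached(2018, 55, 64, months_per_doubling=34)
conservative = year_limit_reached(2018, 55, 64, months_per_doubling=12)
print(observed, conservative)
```

At the observed pace the limit lands in the early 2040s; at the conservative 12-month pace it lands in the late 2020s, which is why a designer would not want to bet on the slower number.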

It's also worth noting that file system developers should think about their file system fifty to a hundred years out. Flash drives today often still come preformatted with FAT32, twenty two years after its release in 1996. Up until a couple years ago, my dad had an external drive formatted with FAT16--a thirty year old file system.

If ZFS lives as long as the FATs have (three decades) even at a doubling every 34 months, that means six remaining doublings before 2035. Not quite 2^64, but that's on the liberal side. Half that (doubling every 17 months) is twelve doublings, which is easily over 2^64.

>If 64 bits isn't enough, the next logical step is 128 bits. >That's enough to survive Moore's Law until I'm dead, and after that, it's not my problem


HFS stuck around for around 30 years as well. And is still shipped on some systems.

And let’s not forget all the crazy (better word?) features like independence from indianness. They built the file system with an eye toward versatility.

> indianness

That has to be one of the worst typos I have ever seen.

The filesystem does not share any culture with Native America and India.

The term does derive from Gulliver's Travels, where the big endians and the little endians were both to be found on an island in the Indian Ocean.

Of course, the real reason for the mistake is that phone spelling correction will convert endian to Indian.

It felt wrong at the time but it’s been too long for me to remember the correct spelling.

In 2016, Backblaze swapped out more than 3,500 2 TB HGST and WDC hard drives for 2,400 8 TB Seagate drives.

That’s 7 PB to 19.2 PB. If they had replaced all 3,500 drives it would be 4X.

But that's not all running under a single filesystem. It's multiple file systems that are orchestrated together to form their network storage space.

I think that's the point from some of the other comments (below). Instead of having ever larger single filesystems, once you get to a certain point (~1-2PB is my guess), it simply isn't worth it to organize your data under a single filesystem.

That's not to say ZFS was wrong in choosing 128 bits... at the time ZFS was designed, the level of horizontal partitioning that we are now accustomed to just wasn't a thing. Compared to HDD times, networks were far too slow to support such a thing. Now, however, we talk about HDDs being too slow compared to our networks. At the time, 128 bits was a solid choice, even if it was overkill.

Plus, Sun was definitely a scale up kind of company... if you want to get bigger, buy a bigger ($$$$) box. So having a large capacity FS was good for their bottom line. Now, we think in terms of scaling out with multiple (cheaper) smaller boxes... which is probably the only way we've managed with current levels of data.

> It's multiple file systems that are orchestrated together to form their network storage space.

No, that is not true. A vault acts as a single unit, effectively making it a single filesystem as much as a ZFS pool is a single filesystem.

Internally, a vault operates on a set of "tomes", which are groups of 1 harddrive from each storage pod (20). These "tomes" handle the redundancy aspect as well. However, the "tome" is not the filesystem.

The concepts map cleanly to the ZFS world: A vault is equivalent to a ZFS pool, and a tome is equivalent to a ZFS vdev in redundant configuration. Neither vdevs nor tomes are filesystems on their own—they only become a filesystem when combined into pools and vaults, respectively.

Now, your argument does hold if interpreted differently: One does not use a normal filesystem for those absurd sizes. However, it is a single filesystem.

Yes, I had forgotten that they striped their data across multiple pods... I'm not familiar with how each of their Pods works. Is it really reading/writing directly to the devices, or is there an intermediate FS that is storing the files (well, chunks of files) and the Backblaze software is orchestrating it? Or do the 20 pods have one central interface?

I guess I was assuming it worked like Lustre or BeeGFS where there were smaller FSs (ext4, XFS) at play that handled storing file (chunks) and then a higher level interface that managed it all.

Either way -- the original comment was referring to Backblaze swapping 7PBs for 19PBs. Which we agree is not a normal size for a "normal" file system! ZFS is a strange beast... it was designed to support a world that just never showed up. It is a "normal" FS that works from laptops to servers, but was designed to support a big iron world that never really appeared (much). Instead, here we are comparing it not to XFS, but to cloud FS's that were designed for a completely different "cloud" world.

The vault is the lowest unit you can interact with. There is some routing on top which picks a vault for you to use, but that's about it.

I would assume that the tome, beneath network protocols, interacts directly with the individual block devices. They have implemented things like Reed-Solomon error correction themselves to handle redundancy across the tome (which to recap is their cross-server vdev equivalent), so they have indeed implemented tasks of a filesystem and RAID manager on top of their lower-level primitives.

I also believe that what their low-level primitives are is actually a rather irrelevant implementation detail. A filesystem presents a way to store and organize files, and is a filesystem regardless of whether it is implemented on a block device or an egg-engraver. It could be using FAT16 as object store, sharding across files and handling metadata elsewhere. I would still consider such setup to be a true filesystem, as it is not simply a 1:1 network-to-disk protocol.

And yes, ZFS won't see the large systems it is designed to handle anytime soon. What Sun did not see coming was the death of large, high-capacity servers, and the rise of distributed solutions. But then again, no one saw that coming in 2001 when ZFS development started.

However, considering that storage sizes will continue to increase, who knows what will happen in the future... 128 bit filesystems might come in handy at some point.

(Note that my information on backblaze's systems is based on public information—I am not an insider of any sort.)

> it simply isn't worth it to organize your data under a single filesystem

Is that a fundamental thing or is it just because there aren't many filesystems that can handle it? It seems to me like a single filesystem would be conceptually easier, but I'm far from knowledgeable in this area.

Comparing the amount of effort that has been spent on distributed filesystems (where everything is managed under one hierarchy) to the usage of such systems, I'm going to say that it's a fundamental law of trying to organize things.

But really, I do think it's physics at play... or rather logistics.

Let's say you're trying to put together a 10PB storage system in a datacenter. If you're using 60 disk JBODs with 10TB drives, that's 600TB of raw space per 4U (or a max of 6PB per rack, or 12PB in 2 racks). Now realistically due to power, network gear, etc... instead of maxing out 2 racks, you'd probably split that into at least 4 (or more) racks. If you're Backblaze (see above), you split those 20 pods across 20 separate racks (and add more volumes, but that's another story).
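The rack math can be sketched as follows. The 40U of usable space per rack is my assumption (it is what makes the 6PB-per-rack figure work out); the other numbers come from the paragraph above.

```python
import math

DISKS_PER_JBOD = 60
DRIVE_TB = 10
JBOD_HEIGHT_U = 4
USABLE_RACK_U = 40   # assumption: 40U of each rack is available for JBODs
TARGET_PB = 10

tb_per_jbod = DISKS_PER_JBOD * DRIVE_TB            # raw TB per 4U enclosure
jbods_per_rack = USABLE_RACK_U // JBOD_HEIGHT_U    # enclosures per rack
pb_per_rack = tb_per_jbod * jbods_per_rack / 1000  # raw PB per full rack
min_racks = math.ceil(TARGET_PB / pb_per_rack)     # racks needed if maxed out

print(tb_per_jbod, pb_per_rack, min_racks)
```

This is raw capacity only; redundancy, power, and network gear push the real rack count higher, as the comment says.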

But once you're at that level, for redundancy, you're not only managing which disk data is written/replicated to, but also the JBOD/pod, and the rack (and the data center). So, at that point, you're not going to be working with a "normal"/traditional file system. You could with ZFS define redundancies that would work like that, but realistically, because of the extra overhead, you're better off changing how you organize your data.

(Plus: ZFS isn't very good at expanding a filesystem with new storage. You can do it, but your pool ends up unbalanced and performance suffers. When you decide to have a 1+PB system, you normally want to have the ability to add more storage space, which means a different type of FS)

Just ran df -lh on the EBI login nodes and saw a 22PB filesystem at 93% full. I suspect if I dig deeper I will find a bigger one as they had 120PB of storage last year[1].


I think the largest file systems in the world are Amazon's S3, and maybe Glacier. I guess Google's internal system used for YouTube and Gmail probably outrank it but there are no numbers for it. But I doubt that they are in a single address space on a single file system.

So basically we moved away from singular large unified file systems and built swarms of little file systems.

Indeed. "Cloud" is essentially tons of commodity hardware nodes running a tailored userspace distributed file system application. There now certainly exist cloud services that host well in excess of 2^64 bytes, but they are not hosted on traditional file systems, but an array of those.

I certainly can't blame their choice though, and 16-million terabyte hard drive arrays (the limit of 64-bit addressing) on a large mainframe are really just a few quantum leaps away. Heck, I still vividly recall buying a MASSIVE 40 GB HDD in 2000 - it seemed excessive beyond wildest dreams at the time (and was promptly filled to the brim at a LAN party.)

Since then, advances in mass storage tech, most importantly the successful commercialization of perpendicular recording in 2005, mean that nowadays 12 TB drives are commercially viable - roughly 300x the size of a 40 GB drive.

In the next 20 years, petabyte-sized drives might well be available (though almost certainly not based on any form of magnetic recording). Once you hook a hundred 50-petabyte drives to a single fileserver, you start closing in on the limitations of 64-bit addressing.

Overly pedantic quibble: a quantum of anything is the minimum possible unit. One Planck length, time, mass, charge, temperature.

Everybody uses it to mean "big". Oh, language.

"Quantum leap" is typically interpreted as "big leap" but I believe the true implication is more like "indivisible leap", in line with your quibble. A Turing-complete computer is a "quantum leap" ahead of a non-Turing-complete calculator. A warp drive is a "quantum leap" ahead of a reaction drive. You either have all of it or none of it.

Quantum leap is an increase in the units, or “quanta” you use to measure something i.e. MHz to GHz - the term doesn’t relate directly to quantum mechanics

In your example you are using the same units - Hertz, one cycle per second - and adding SI prefixes for million X and billion X.

Yes, but the "quantum" has increased by order of magnitude.

The point is, that the increase in frequency is so large, that using MHz is no longer sensible.

There has been a "quantum leap" in frequency.

I know who we should ask: Ziggy

I loved that show so much! It was ages after Enterprise came out that I stopped expecting Captain Archer to say "Oh boy!" just before the main title.

But is a "Quantum leap" really equivalent to leaping one planck? Honest question.

I assumed it implied something far 'larger' such as a leap through spacetime...

A quantum leap is a discontinuity, a change that cannot happen gradually.

It typically implies a revolutionary change over evolutionary improvement.

I believe the term originally applied to electron energy level transitions within an atom, which only vaguely correlates with anything one might call distance.

The interesting thing there was that the electrons existed only at discrete energy levels, and the transition between them appeared instantaneous.

I always assumed that sprang from the tv show with that name.

In this case, I actually used it purposefully. Hitting quantum mechanical limits is in part what is holding us back from infinitely increasing the information density on HDDs and SSDs.

The usage in the article is that 64/128 bits is the addressing used for the blocks. A block is 512 bytes. So a 64-bit filesystem would actually be limited to 2^73 bytes = 8192 EiB = 8.6 billion TiB of storage.
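A quick check of those figures, under the comment's own assumption of 64-bit block addresses and 512-byte blocks (the constant names are mine):

```python
SECTOR_BYTES = 512   # block size assumed in the comment above
ADDRESS_BITS = 64    # width of the block address

max_bytes = (2 ** ADDRESS_BITS) * SECTOR_BYTES
max_eib = max_bytes // 2 ** 60   # exbibytes
max_tib = max_bytes // 2 ** 40   # tebibytes

print(max_bytes == 2 ** 73, max_eib, max_tib)
```

512 is 2^9, so the product is 2^73 bytes, or 8192 EiB, or 2^33 (about 8.6 billion) TiB.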

No cloud vendor has a billion harddisks, that's for sure.

(The filesystem might only support a maximum file size of 2^64 bytes, but that's nothing to do with the capacity of the storage array)

The article isn't talking about the size of raw disk blocks, better called the sector size. It's talking about ZFS blocks, which are variable in size, defaulting to 128 kB: https://en.wikipedia.org/wiki/ZFS#Variable_block_sizes

ZFS aside, the 512 byte sector isn't even a hard limit at the storage medium level. Disks with 528 byte sectors — and other odd sizes — have been available in the enterprise world for a long time now: https://en.wikipedia.org/wiki/Hard_disk_drive#Market_segment...

HDDs and SSDs with 4 kB sectors dominate the consumer market now: https://en.wikipedia.org/wiki/Advanced_Format

Isn't there an XKCD about the limit of flash storage being 3 PB per gallon?

I think you may be referring to an image from this What-If article, where the capacity of a gallon of MicroSD cards is estimated at 1.6PB per gallon:


It's 12.8PB per gallon now.

A gallon of flash cards is great, but you do have to figure out which one you want after they are delivered. I'm wondering whether there is such a thing as an automated flash card library. I know people have archives of tapes like that.

The thing about flash is that it's cheap and easy to keep it all permanently online. You can get a stack of 2.5 inch SSDs for the same price as microsd cards, and plugging them into a bus takes less space than a tape robot.

When you're looking at the same price for two systems, and one is 100x-1000x faster at the cost of being 'merely' hundreds of TB per gallon instead of a few PB per gallon, nobody is going to bother engineering the latter just to save a few bucks on shipping.

Tape is a lot better at keeping data while in storage though.

"You know you have a large storage system when you get paged at 1 AM because you only have a few petabytes of storage left." (2010)


No, they're not a single filesystem, as the papers make clear here and there.

S3/glacier are not filesystems, they are object storage interfaces with labels...

I wonder what backblaze is up to these days...

Moore's Law actually predicted a doubling every two years (for CPUs), not every year, and even the very optimistic Kryder's Law (often cited as the version of Moore's Law for storage) only claimed an increase of 40% per year. Actual progress has fallen far short of that. The rate for 2010-2014 was only 15%/year.

Rates were quicker back around 2000-2005, but that pace has fallen dramatically.


I think it is reasonable to expect storage to jump dramatically at such a time that we begin storing “3D” data. Not that everything in VR will be voxels or anything, but even if the fractal dimension of our future data is 2.5 it would obliterate our current needs for storage. /speculation

Physics provides us with certain density of information in a volume and that is considering advanced quantum storage.

May I remind you that flash, DRAM and SRAM already are 3D structures.

I think they meant VR graphics data, not data stored on 3D things.

40%/yr is 100%/2years.
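To put numbers on that: for a fixed annual growth rate r, the doubling time is ln 2 / ln(1 + r). A small sketch (the function name is mine) applied to the 40% and 15% rates mentioned above:

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a constant annual growth rate (0.40 = 40%)."""
    return math.log(2) / math.log(1 + annual_growth)


kryder = doubling_time_years(0.40)    # the rate attributed to Kryder's Law
recent = doubling_time_years(0.15)    # the 2010-2014 rate quoted above
print(f"{kryder:.2f} years, {recent:.2f} years")
```

40% per year compounds to 1.96x over two years, so "doubling every two years" is a fair summary; at 15% per year a doubling takes roughly five years.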

Wonder what the pace of ssd capacity is?

Last I checked the ZFS 128 bit values were not byte offsets, but block numbers. So the justification requires over 2^64 blocks, not 2^64 bytes.

I calculated it at some point and it was several years of the entire Earth's manufacturing capacity for disks in a single filesystem.

[added the missed "several years of"]

If you accept the default 128k ZFS block size then it's 2,361,183,241,434,822,606,848 bytes, or 2,097,152 PB

Or, exactly 2 Zettabytes.

Also, the Z in ZFS :D

>The ZFS name is registered as a trademark of Oracle Corporation; although it was briefly given the retrofitted expanded name "Zettabyte File System", it is no longer considered an initialism.


I don't believe Wikipedia is right. Search site:oracle.com for zettabyte on google and observe the results being documents and blogs from a multitude of years.

ZFS being an acronym for Zettabyte File System was neither retrofitted nor all that brief in use it seems.

Former employees of Sun Microsystems, I know some of you browse HN. Care to shed some light on this?

I wasn't in the room when it happened, but I was probably among the first dozen or so to hear the name -- and it was presented to me as both at once: "ZFS" was both "the last word in filesystems" and (conveniently) a zettabyte. I remember Bonwick quickly pointing out that yes, a yottabyte was larger but "YFS" was passed over for obvious reasons.

What's that obvious reason?

"Why? filesystem" I imagine.

Because "Z" matches the goal of being "the last word in filesystems" better than "Y".

Indeed, LLNL have a humongo Lustre/ZFS cluster and kickstarted the ZFSoL project (mucho gracias to LLNL and Brian Behlendorf) and they refer to it as a Zettabyte File System.

Interestingly, IBM has a filesystem used on the zOS mainframe OS, that is also called ZFS: https://en.wikipedia.org/wiki/ZFS_(z/OS_file_system)

So, if they didn't use 128-bit... they'd have to change the name! /s

CERN are at >100PB, which is only about 6.6 doublings, which is about right with a doubling every two, not one, years. http://iopscience.iop.org/article/10.1088/1742-6596/664/4/04...

Right, but does CERN connect all 100PB to a single machine?

No. Data is mostly accessed using XRootD[1], which can be used to access data from an (almost) unlimited number of nodes. So far as I know, it's optimised for large files so the number of available inodes is limited by RAM.

There is also EOS[2] which is based on XRootD and is planned to replace AFS[3] as a general purpose networked filesystem.

[1] http://xrootd.org/ [2] http://eos.web.cern.ch/ [3] http://information-technology.web.cern.ch/services/afs-servi...

To a cluster fs which ZFS is not. (And not all of it.)

It was probably incorrect for the original author to cite Moore's law.

Moore's law only ever really held for transistor density on integrated circuits, not magnetic storage. Magnetic storage capacity always grew more slowly.

Flash storage capacity today has more to do with die stacking than transistor density.

In any case, Moore's Law is definitely dead now. Like, for real this time.

Backblaze posts some good stuff on HD pricing and reliability. Here's a good chart from their data:


They also link to this great (logarithmic) chart:


TL;DR: price hasn't been dropping by half annually, but prices have come down from ~$0.70/GB to < $0.03/GB now. So roughly a factor of 20 over 13-14 years, somewhere in the neighborhood of 25% improvements compounded annually over that time.
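As a sanity check on the compounded rate, here is a sketch that treats the improvement as growth in GB per dollar. The 13.5-year span is my midpoint of the "13-14 years" above, and the function name is hypothetical.

```python
def annual_improvement(start_price: float, end_price: float, years: float) -> float:
    """Compound annual growth in GB-per-dollar implied by a price drop."""
    return (start_price / end_price) ** (1 / years) - 1


from_prices = annual_improvement(0.70, 0.03, 13.5)   # $/GB endpoints quoted above
from_factor = annual_improvement(20, 1, 13.5)        # the rounded "factor of 20"
print(f"{from_prices:.1%} and {from_factor:.1%} per year")
```

Both come out in the mid-20s percent per year, consistent with the rough 25% figure.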

> So exactly 14 years passed, does someone have 2^64 bytes for a single ZFS filesystem (or anything close to that)? I don't really feel like storage capacity (or 1/price) doubles every year.

It's unlikely. There are few opportunities to compile a data set of that size. One might be the NSA collections center in Utah, one might be Google or Microsoft's web cache, and one might be something like the Internet Archive's cache.

However the length choice gets 'weird' if you want something larger than 64 bits and less than 128. Your next intermediate choice is 96 bits. That is 64 + 32 (or one additional 32 bit long word). Growing by just 8 means you now have a 9 byte pointer, by 16 gives you a 10 byte pointer. Some architectures are penalized when doing off word alignment, and so on those architectures the structure is padded out to the next word size anyway.

I've always found Jeff's reasoning in that post amusing but I continue to think 64 bits would have been fine (just like I think it would have been fine for the Internet v6 work :-). Fortunately time has given us a crap ton of memory for our boxes so the storage penalty isn't too great.

Since 64 bits can serve /many/ use cases and also matches modern machine words I could see there being either two drivers (one with 64 bit native words and the other with full sized ones) or even a fancier driver that uses different ASM bits to just leave half the word all 0s and ignored.

The on-disk format could even remain the same; it wouldn't really matter except for lots of small files and those are already so horrid that ZFS might have some other way of handling them. (Like treating /all/ small file names and data as part of a larger directory file or something.)

Backblaze says they are using 400 PB of storage[1], or around 2^59 bytes. Of course, it's not one big filesystem. They aren't quite doubling annually, but looks like they are growing around ~1.5X each year, so in another decade, they could be close to the 2^64 limit.
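A back-of-the-envelope version of that estimate, assuming 400 PB means 400 x 10^15 bytes and that the 1.5x yearly growth holds steady (both are simplifications on my part):

```python
import math

CURRENT_BYTES = 400 * 10 ** 15   # 400 PB (SI)
LIMIT_BYTES = 2 ** 64
ANNUAL_GROWTH = 1.5              # growth multiplier per year

current_exp = math.log2(CURRENT_BYTES)
years_to_limit = math.log(LIMIT_BYTES / CURRENT_BYTES) / math.log(ANNUAL_GROWTH)
print(f"2^{current_exp:.1f} bytes today, {years_to_limit:.1f} years to 2^64")
```

The current total sits between 2^58 and 2^59 bytes, and the time to reach 2^64 in aggregate comes out a bit under ten years.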

[1] https://www.backblaze.com/company/about.html

Right, but their building block is 48 disks x whatever is most profitable (a mix of 4, 6, 8, 10, and 12TB currently or similar). So each system is 500TB or less, and ZFS is a filesystem for a single node, not a distributed filesystem.

Even if we ignore that factor of 800 or so (400PB spread across 500TB nodes), they still would need another factor of 128,000 before they would need ZFS's 65th bit.

Keep in mind it's 2^128 blocks, not 2^128 bytes.

Backblaze would need to consume the entire planet's supply of disks for a very long time AND connect all those disks to a single Linux, FreeBSD, or Solaris box. Only then would they need the 65th bit that ZFS has for addressing blocks.

Not really. Global hard drive production is more than 2^64 bytes or 18 exabytes (SI).


From that it looks like production is 130 exabytes for Seagate and Western Digital combined. Using the default block size on ZFS the limit is 128 * 1024 * 2^64 = 2^81 bytes, or about 2.4 yottabytes. Even allowing for other manufacturers, that is several orders of magnitude more than global annual production.
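Evaluating that product directly, as a sketch: 128 * 1024 * 2^64 comes to 2^81 bytes. The 2 zettabyte figure quoted in this thread is 2^71 bytes, which corresponds to 2^64 blocks of 128 bytes rather than 128 KiB. The 130 EB production number is taken from the comment above; the constant names are mine.

```python
ZFS_DEFAULT_BLOCK_BYTES = 128 * 1024       # 128 KiB default block size
BLOCK_ADDRESSES = 2 ** 64                  # one address per block

limit_bytes = ZFS_DEFAULT_BLOCK_BYTES * BLOCK_ADDRESSES
annual_production_bytes = 130 * 10 ** 18   # 130 EB (SI), from the comment above
years_of_production = limit_bytes / annual_production_bytes

print(limit_bytes == 2 ** 81)
print(f"{limit_bytes / 10 ** 24:.2f} YB (SI)")
print(f"{years_of_production:,.0f} years of current production")

# For comparison: 2 ZiB is 2**71 bytes, i.e. 2**64 blocks of only 128 bytes.
print(2 * 2 ** 70 == 128 * 2 ** 64)
```

Either way the conclusion stands: the 64-bit block limit is far beyond what the world's drive makers ship in a year.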

I don't think Moore's Law held up for much since then, but it would've been hard to know for sure. That being said, ZFS filesystems can span multiple disks, since it does volume management of its own. I don't know if that means we're going to see zettabyte volumes any time soon, though. After all, 2^64 bytes wouldn't be enough to overflow: it would need to be 2^64 blocks...

An apparently overlooked aspect of the SpaceX Falcon Heavy launch is the inclusion of a small quartz disc whose format is expected to scale to something like 260 TB (360?, I forget), and that has a lifespan claimed to potentially extend into the billion plus year range.

I don't know about that exact disc/format, but if and as we are finally able to stably write at such capacities... (whatever technologies end up being used to do so).

We're going to run into another "640KB should be enough for anyone", unless we are forward thinking with regard to potential storage format capacities.

(And if you think no one will ever be able to consume that much data and detail, just think of all the modeling that will be done and the data behind that. Think of the data being put out by CERN and not just how it will expand but also how people will want to explore lower significance hints, corner cases, and who knows what. Etc.)

A fully decked out Isilon cluster supports 68PB on a single file system.


ZFS is not really a 128 bit filesystem, it is merely a 64 bit filesystem that is "128 bit ready". I.e. the on disk format is 128 bit, the kernel drivers are not. At least, they were not until Oracle bought Sun, that's about the last time I really cared about ZFS.

"we don't want to deal with this shit for another forty years or possibly forever" would have sufficed for me.

I’m struggling to think of a project that released a new version to accommodate some far-off limit that wasn’t blindsided and obsoleted by something operating on a different paradigm - like using YYMMDD for dates, then changing it to YYYYMMDD for Y2K when they should have been using integer time_t - but then that was 32-bit and now 64-bit. But we won’t even be 10% of the way there before a relativistic-capable time system will replace it.

So why worry? We’re doomed to rewrite it anyway :)

Trading off a tiny design decision like that against knowing you're unlikely to have to ever revisit except in an unforeseeable situation is a pretty good place to be.

What's the downside to having 128 bits over 64? I'm not familiar enough with file systems to know, so I'm left wondering if there's any downside to oversizing the hammer for the nail?

It expands the data structure size, which multiplies N for both on disk and in RAM representation (and cache and shorn cache lines). Many modern architectures have native 128b types (amd64, POWER) so it's not really a big deal to the CPU itself even if you need to do atomic operations for concurrency.

OTOH it guarantees there won't really need to be on disk changes for pointer sizes. That may be useful in situations that weren't really imagined, for instance 128b is enough for a Single Level Store where persistent and working set (or NV) are all in the same mappable space for any conceivable workload.

tl;dr: doubling size of metadata (and usually a half of a 128bit address is zeros anyway).

Having twice as much space occupied has many implications: twice as much space lost for data itself, less metadata will fit CPU caches, less throughput in buses between CPU and storage, lengthier read time for the storage itself, etc.

Real-world large filesystems are distributed across many thousands of hosts and multiple datacenters, not mounted as a Linux filesystem on a single host. Because whole racks and whole datacenters fail, not just disk drives.

So they used 128 bit because of bikeshedding. Committees always make the most conservative decision possible. Like IPv6.

The fact that real-world storage systems are distributed on the network bolsters the case for supporting 128-bit and even larger types.

Creating unified namespaces is really useful and a _great_ simplifier. The reason we don't do that as often as we should is because of limitations in various layers of modern software stacks, especially in the OS layers.

Unfortunately, AFAIU ZFS only supports 64-bit inodes. A large inode space, like 128-bit or even 256-bit, would be ideal for distributed systems.

Larger spaces for unique values are useful for more than just enumerating objects. IPv6 uses 128 bits not because anybody ever expected 2^128-1 devices attached to the network, but because a larger namespace means you can segment it easier. Routing tables are smaller with IPv6 because it's easier to create subnets with contiguous addressing without practical constraints on the size of the subnet. Similarly, it's easier to create subnets of subnets (think Kubernetes clusters) with a very simple segmenting scheme and minimal centralized planning and control.

Similarly, content-addressable storage requires types much larger than 128 bits (e.g. 160 bits for Plan 9 Fossil using SHA-1). Not because you ever expect more than 2^128-1 objects, but because generating unique identifiers in a non-centralized manner is much easier. This is why almost everybody, knowingly or unknowingly, only generates version 4 UUIDs (usually improperly because they randomly generate all 128 bits rather than preserving the structure of the internal 6 bits as required by the standard).
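For reference, a minimal sketch of what "preserving the structure" means for a version 4 UUID: 122 bits are random and 6 bits are fixed (4 for the version, 2 for the variant). Python's standard `uuid.uuid4()` already does this; the helper below is only an illustration and its name is hypothetical.

```python
import os
import uuid


def random_uuid4_bytes() -> bytes:
    """16 random bytes with the version and variant bits set per RFC 4122."""
    raw = bytearray(os.urandom(16))
    raw[6] = (raw[6] & 0x0F) | 0x40   # high nibble of byte 6 = version 4
    raw[8] = (raw[8] & 0x3F) | 0x80   # top two bits of byte 8 = 10 (RFC 4122 variant)
    return bytes(raw)


u = uuid.UUID(bytes=random_uuid4_bytes())
print(u, u.version, u.variant == uuid.RFC_4122)
```

Filling all 128 bits with random data skips those two masks, so the result is a random 128-bit number that only looks like a UUID.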

ZFS failed not by supporting a 128-bit type for describing sizes, but by only supporting a 64-bit type for inodes. And probably they did this because 1) changing the size of an inode would have been much more painful for the Solaris kernel and userland given Solaris' strong backward compatibility guarantees, and 2) because they were focusing on the future of attached storage through the lens of contemporary technologies like SCSI, not on distributed systems more generally.

Unified namespaces on many-petabyte filesystems are perfectly commonplace

HDFS, QFS,.... even old GFS

You wouldn’t make them Linux/fuse mountpoints though, that’s just an unneeded abstraction. Command line tools don’t work with files that are 100TB each.

> Command line tools don’t work with files that are 100TB each.

No, but they do work with small files, which presumably most would be if the number of objects visible in the namespace system were pushing 2^64.

100TB files are often databases in their own right, with many internal objects. But because we can't easily create a giant unified namespace that crosses these architectural boundaries, we can't abstract away those architectural boundaries like we should be doing and would be doing if it were easier to do so.

Just to be more specific, imagine inodes were 1024 bits. An inode could become a handle that not only described a unique object, but encode how to reach that object. Which means every read/write operation would contain enough data for forwarding the operation through the stack of layers. Systems like FUSE can't scale well because of how they manage state. One of the obvious ways to fix that is to embed state in the object identifier.

A real-world example is pointers on IBM mainframes. They're 128 bits. Not because there's a real 128-bit address space, but because the pointer also encodes information about the object, information used and enforced by both software and hardware to ensure safety and security. Importantly, this is language agnostic. An implementation of C in such an environment is very straightforward; you get object capability built directly into the language without having to even modify the semantics of the language or expose an API.

Language implementations like Swift, LuaJIT, and various JavaScript implementations also make use of unused bits in 64-bit pointers for tagging data. This is either not possible on 32-bit implementations, or in those environments they use 64-bit doubles instead of pointers. In any event, my point is that larger address spaces can actually make it much easier to optimize performance because it's much simpler to encode metadata in a single, structured, shared identifier than to craft a system that relies on a broker to query metadata. Obviously you can't encode all metadata, but it's really nice to be able to encode some of the most important metadata, like type.
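To illustrate the general idea of carrying metadata in spare identifier bits, here is a toy sketch using plain integers. It assumes a 64-bit word whose low 48 bits hold the address and whose top 16 bits are free for a tag; the tag values and function names are made up, and real runtimes use their own layouts (low alignment bits, NaN-boxing, and so on).

```python
ADDRESS_BITS = 48                            # assumption: addresses fit in the low 48 bits
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1
MAX_TAG = (1 << (64 - ADDRESS_BITS)) - 1     # 16 spare bits for the tag

TAG_INT, TAG_STRING, TAG_TABLE = 1, 2, 3     # hypothetical type tags


def pack(address: int, tag: int) -> int:
    """Combine an address and a type tag into one 64-bit word."""
    if not 0 <= address <= ADDRESS_MASK:
        raise ValueError("address does not fit in 48 bits")
    if not 0 <= tag <= MAX_TAG:
        raise ValueError("tag does not fit in 16 bits")
    return (tag << ADDRESS_BITS) | address


def unpack(word: int) -> tuple[int, int]:
    """Split a tagged word back into (address, tag)."""
    return word & ADDRESS_MASK, word >> ADDRESS_BITS


word = pack(0x7FFF_DEAD_BEEF, TAG_TABLE)
print(hex(word), unpack(word))
```

The type can be read from the identifier itself, with no lookup through a broker, which is the property the comment above is arguing for.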

POSIX invented "slow" extended attributes for this kind of use.

For IPv6 the 128 bit has its justification. It's supposed to enable proper hierarchical routing and to reduce the number of entries inside the routing tables which is the pain point where it gets expensive. The idea is that no one, at any level, needs to request the allocation of a second subnet because what he has is large enough by default. So you need more bits than necessary to allow a little bit of "wastefulness" even after some layers of subnetting.

Moreover the convention that there should be no subnet smaller than /64 enables stateless auto configuration for hosts. 64 bits is enough to fit common (supposed to be unique) hardware identifiers and even is large enough to assign random addresses (like with privacy extensions) with a very low probability of collisions.

That was the idea, but it didn't really turn out that way. Stateless auto configuration leaks your MAC address, which is a privacy issue. Most servers use static IPs and most desktops use random IPs, with checks for collisions.

IMHO, the 128-bit space was a big mistake. It's twice as hard to communicate between humans, most languages and databases don't support the data type natively, and it complicates high-speed routing.

An average of 48 bits for a network and 15 for the host would have been better. For other reasons you almost never want more than a few hundred hosts on one layer-2 network anyway.

Except IPv6 being 128-bit makes fast hardware implementations much easier than if it had to deal with shorter prefix lengths. Nothing shorter than 128 really makes sense at all in an IPv4 replacement.

No comprendo. Why would 128 be faster than 64?

Consider a tiny (5 machine) piece of the internet. Three hubs, an outlink and two smaller hubs, all connected (a triangle). With 4 bits, the left hub can have all the 0xxx addresses and the right hub can have all the 1xxx addresses. No matter where the devices connect, they can all get an IP and the outlink only needs to remember a simple rule (starts with one, go right, else go left).

Compare to a 3 bit network. By moving IPs from hub to hub, all five devices can always get an address, but the small hubs need to communicate which addresses they own to the outlink and to each other in order to avoid address exhaustion on either hub. Routing a packet is slower because the routing is more complex.

So it is routing efficiency? I’ve been asking this question for five years and this is the first time a coherent answer has been offered, thanks.

How much more efficient is the 128 vs 64 for routing and what trade offs does it make for other things? I’m now wondering.

IPv6 is basically 64 bits for routing and 64 bits for the local network segment. Seems plausible that it's faster than trying to mask out the bits you need.
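
A rough sketch of why the 64/64 split is convenient, in C. The `ip6_addr` struct is a hypothetical host-byte-order layout for illustration, not a socket API type. With a fixed /64 boundary the "same segment?" test is one 64-bit compare; an arbitrary prefix length needs a mask.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical representation: an IPv6 address as two 64-bit halves. */
typedef struct {
    uint64_t prefix;  /* upper 64 bits: routing prefix and subnet */
    uint64_t iface;   /* lower 64 bits: interface identifier */
} ip6_addr;

/* Fixed /64 boundary: compare the upper halves, no masking needed. */
static inline int same_segment(ip6_addr a, ip6_addr b)
{
    return a.prefix == b.prefix;
}

/* Arbitrary prefix length (0..64) for comparison: a mask must be built. */
static inline int prefix_match(ip6_addr a, ip6_addr b, unsigned len)
{
    assert(len <= 64);
    uint64_t mask = len == 0 ? 0 : ~UINT64_C(0) << (64 - len);
    return ((a.prefix ^ b.prefix) & mask) == 0;
}
```

Whether this translates into a measurable difference in real router hardware is a separate question; the sketch only shows that the fixed split removes the masking step.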

Why not 32:32? Shouldn’t four billion internets be enough?

Probably, but having a 2^64 tolerance factor isn't a bad idea given how difficult moving from IPv4 is proving to be.

This way we have ~1 IP address per 6 micrograms in the solar system, or per 3.4 tonnes in the galaxy.

Hierarchy is nice, though. If you can model the bits as a tree, it becomes super quick to figure out where to route a packet. You can model stuff like that trivially with an FPGA.

On the other hand, such committee bikeshedding seems to work rather well for PR. It lets them ignore hard problems and instead focus on things most people can understand and relate to. That gains more trust than a well-designed thing with nothing to understand or relate to.

> Thus, fully populating a 128-bit storage pool would, literally, require more energy than boiling the oceans.

This is actually an argument against 128 bit, because it clearly shows that 128 bit are unreachable and thus a waste. What about 96 bit?

96-bit, and other non-power-of-two systems are cumbersome and error-prone to work with in C - which is often used when writing firmware for computer hardware.

The real risk of economic damage caused by bit-fiddling bugs is much greater than the risk of bringing the universe’s thermodynamic heat-death ever so slightly nearer...

I would think it's fairly easy to keep all calculations in C in 128-bit and just mask out the top few bytes when retrieving and storing. You could also argue that >64 bit values will be rare enough that they warrant their own code path if they are encountered as an optimization (perhaps they already do this?).

Back in 2004, they were probably thinking, "Well, IPv6 is using 128bits, and that's going to be the standard any day now..."

> If 64 bits isn't enough, the next logical step is 128 bits.

Can someone explain this? Is there some kind of awkwardness/waste with anything less than doubling the number of bits?

Yes, any operation is easy in 2^(2^n). For instance, take addition of two 128-bit numbers x and y (seen as 64-bit int arrays) on a 64-bit big-endian CPU:

  sum[1] = x[1] + y[1]
  sum[0] = x[0] + y[0] + carry from previous operation
In contrast, if you'd use 96 bits, you couldn't just use 64 bit integer operations. Instead, you'd have to cast a lot:

  sum[4..11] = *((int64*) (x + 4)) + *((int64*) (y + 4))
  sum[0..3] = (int32) ( (int64) *((int32*) x) + (int64) *((int32*) y) + carry)
So you'd read 32 bit-values into 64 bit registers, set the top 32 bits to zero, perform the addition, and then write out a 32bit value again.

It gets much worse if your CPU architecture does not support the addition to 2^(2^n); if you were to use 100 bits, you'd have to AND the values with a bitmask, and write out single bytes.

So 128 is far easier to implement, faster on many CPU architectures, plus you get the peace of mind that your code works for a long time. For instance, let's assume the lower bound of 9 months per doubling (which is unrealistic as described in this article), then you're going to hit:

  50 bits (baseline from article): 2004
  64 bits: 2014
  80 bits: 2026
  92 bits: 2035
  100 bits: 2041
  128 bits: 2062
Now, what's the expected lifetime of a long-term storage system? It's well-known that the US nuclear force uses 8 inch floppy disks. Those were designed around 1970. So a lifetime of roughly 50 years is to be expected. For ZFS, that would be 2054. By this (admittedly very conservative) calculation, 128 bits is only barely more than required.
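
The timeline is easy to recompute. A small sketch in C, assuming the same baseline as the article (2^50 bytes in 2004) and truncating to whole years; `year_reached` is a name invented here:

```c
#include <assert.h>

/* Year in which 2^bits bytes is reached, starting from 2^50 bytes in 2004
   and doubling once every months_per_doubling months (integer division
   truncates to the whole year). */
static inline int year_reached(int bits, int months_per_doubling)
{
    assert(bits >= 50 && months_per_doubling > 0);
    int doublings = bits - 50;
    return 2004 + (doublings * months_per_doubling) / 12;
}
```

`year_reached(64, 9)` gives 2014 and `year_reached(128, 9)` gives 2062. With a 12-month doubling instead, 64 bits moves out to 2018 and 128 bits to 2082.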

Don't 64-bit CPUs usually have efficient instructions for operating on narrower values?

For instance, consider this C code for adding two 96-bit numbers on a 64-bit machine (ignoring carry for now):

  #include <stdint.h>

  extern void mark(void);

  int sum(uint64_t * a, uint64_t * b, uint64_t * c)
  {
      mark();
      *c++ = *a++ + *b++;                                /* low 64 bits */
      mark();
      *(uint32_t *)c = *(uint32_t *)a + *(uint32_t *)b;  /* high 32 bits */
      mark();
      return 17;
  }
The purpose of the mark() function is to make it easier to see the code for the additions in the assembly output from the compiler. Here is what "cc -S -O3" (whatever cc comes with MacOS High Sierra) produces for my 64-bit Intel Core i5 for the parts that actually do the math:

  callq   _mark
  movq    (%rbx), %rax
  addq    (%r15), %rax
  movq    %rax, (%r14)
  callq   _mark
  movl    8(%rbx), %eax
  addl    8(%r15), %eax
  movl    %eax, 8(%r14)
  callq   _mark
I'm not too familiar with x86-64 assembly, but I am assuming that this could be made to handle carry by changing the "addl" to whatever the 32-bit version of adding with carry is.

Taking out the (uint32_t * ) casts to turn the C code from 96-bit adding into 128-bit adding generates assembly code that differs only in that both movl instructions become movq instructions, and addl becomes addq.

So, if you were writing in C it looks like a 96-bit add would be a little uglier than a 128-bit add because of the casts but isn't slower or bigger under the hood. But note that this is assuming accessing the 96-bit number as an array of variable sized parts. It's that assumption that introduces the need for ugly casts.

If a struct is used, then there is no need for casts:

  #include <stdint.h>

  typedef struct {
      uint64_t low;
      uint32_t high;
  } addr;

  extern void mark(void);

  int sum(addr * a, addr * b, addr * c)
  {
      mark();
      c->low = a->low + b->low;      /* low 64 bits */
      mark();
      c->high = a->high + b->high;   /* high 32 bits */
      mark();
      return 17;
  }
This generates the same code as the earlier version.

(I still have no idea how to handle the carry in C, or at least no idea that is not ridiculously inefficient. When I've implemented big integer libraries I've either used a type for my "digits" that is smaller than the native integer size so that I could detect a carry by a simple AND, or I've handled low level addition in assembly).
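
For what it's worth, one common way to get the carry in portable C relies on unsigned arithmetic wrapping around: if the low-word sum is smaller than one of its operands, it overflowed. A minimal sketch, using a struct like the one above (the names `addr96` and `add96` are invented here):

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t low;   /* low 64 bits */
    uint32_t high;  /* high 32 bits */
} addr96;

/* c = a + b modulo 2^96. Unsigned addition wraps in C, so the low-word
   sum is smaller than an operand exactly when a carry occurred. */
static inline void add96(const addr96 *a, const addr96 *b, addr96 *c)
{
    assert(a && b && c);
    uint64_t low = a->low + b->low;
    uint32_t carry = low < a->low;        /* 1 if the low word wrapped */
    c->high = a->high + b->high + carry;  /* wraps modulo 2^32 */
    c->low = low;
}
```

Optimizing compilers often turn the comparison into an add followed by an add-with-carry, but that is up to the compiler and not guaranteed.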

1. Accesses through pointers type-punned to something other than `char`, `signed char`, or `unsigned char` are undefined behavior.

  uint64_t n = 0xdeadbeef;

  uint32_t foo = (uint32_t)n; // OK

  uint32_t *bar = (uint32_t*)&n; // "OK" but useless
  foo = *bar; // undefined behavior!!!

  uint8_t *baz = (uint8_t*)&n;
  uint8_t byte = *baz; // OK, uint8_t is `unsigned char`

  // Same-size integral types are OK
  const volatile long long *p = (const volatile long long*)&n;
  const volatile long long cvll = *p; // well-defined
2. Structs are aligned to the member with the strictest alignment requirement, so a struct of a `uint64_t` and a `uint32_t` will be aligned on an 8-byte boundary, meaning its size will be 128 bits.

> Structs are aligned to the member with the strictest alignment requirement, so a struct of a `uint64_t` and a `uint32_t` will be aligned on an 8-byte boundary, meaning its size will be 128 bits.

Don't most C compilers support a pragma to control this? "#pragma pack(4)" for clang and gcc, I believe.

Given this (where I've made it add two arrays of 96-bit integers to make it easier to figure out the sizes in the assembly):

  #include <stdint.h>

  #pragma pack(4)
  struct block_addr {
      uint64_t low;
      uint32_t high;
  };

  int sum(struct block_addr * a, struct block_addr * b, struct block_addr * c)
  {
      for (int i = 0; i < 8; ++i) {
          c->low = a->low + b->low;
          c++->high = a++->high + b++->high;
      }
      return 17;
  }
here is the code for the loop body, which the compiler unrolled to make it even easier to see how the structure is laid out:

  movq    (%rbx), %rax
  addq    (%r15), %rax
  movq    %rax, (%r14)
  movl    8(%rbx), %eax
  addl    8(%r15), %eax
  movl    %eax, 8(%r14)
  movq    12(%rbx), %rax
  addq    12(%r15), %rax
  movq    %rax, 12(%r14)
  movl    20(%rbx), %eax
  addl    20(%r15), %eax
  movl    %eax, 20(%r14)
  movq    24(%rbx), %rax
  addq    24(%r15), %rax
  movq    %rax, 24(%r14)
  movl    32(%rbx), %eax
  addl    32(%r15), %eax
  movl    %eax, 32(%r14)
  movq    84(%rbx), %rax
  addq    84(%r15), %rax
  movq    %rax, 84(%r14)
  movl    92(%rbx), %eax
  addl    92(%r15), %eax
  movl    %eax, 92(%r14)
(Some white space added, and the middle cut out). The 96-bit integers are now only taking up 96 bits.
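
The layout can also be checked without reading assembly, by comparing `sizeof` for the two variants. A sketch, assuming a compiler that accepts `#pragma pack(push, 4)` (GCC, Clang and MSVC do); the struct and function names are made up for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Natural layout: padded up to the alignment of the uint64_t member. */
struct addr_padded {
    uint64_t low;
    uint32_t high;
};

/* Packed layout: member alignment capped at 4 bytes, so no tail padding. */
#pragma pack(push, 4)
struct addr_packed {
    uint64_t low;
    uint32_t high;
};
#pragma pack(pop)

/* Bytes used by an array of n elements in each layout. */
static inline size_t padded_bytes(size_t n)
{
    assert(n <= SIZE_MAX / sizeof(struct addr_padded));
    return n * sizeof(struct addr_padded);
}

static inline size_t packed_bytes(size_t n)
{
    assert(n <= SIZE_MAX / sizeof(struct addr_packed));
    return n * sizeof(struct addr_packed);
}
```

With the packed layout, eight elements take 96 bytes, which matches the last element sitting at offsets 84 and 92 in the listing. On a typical x86-64 ABI the padded struct is 16 bytes, so the same array would take 128.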

Packed structs are possible, to be sure, but inhibit numerous optimizations, such as (relevant to this case) the use of vector instructions and vector registers.

Changing the loop to 4 iterations for compactness' sake, (aligned) structs of two u64s generate the following, vectorized code:


  vmovdqu (%rsi), %xmm0
  vpaddq  (%rdi), %xmm0, %xmm0
  vmovdqu %xmm0, (%rdx)
  vmovdqu 16(%rsi), %xmm0
  vpaddq  16(%rdi), %xmm0, %xmm0
  vmovdqu %xmm0, 16(%rdx)
  vmovdqu 32(%rsi), %xmm0
  vpaddq  32(%rdi), %xmm0, %xmm0
  vmovdqu %xmm0, 32(%rdx)
  vmovdqu 48(%rsi), %xmm0
  vpaddq  48(%rdi), %xmm0, %xmm0
  vmovdqu %xmm0, 48(%rdx)
And if the pointer arguments are declared `restrict`, the loop can be vectorized even more aggressively:

  vmovdqu64       (%rsi), %zmm0
  vpaddq  (%rdi), %zmm0, %zmm0
  vmovdqu64       %zmm0, (%rdx)
Either of which is much more efficient than the code generated for unaligned, packed 96-bit structs:

  movq    (%rsi), %rax
  addq    (%rdi), %rax
  movq    %rax, (%rdx)
  movl    8(%rsi), %eax
  addl    8(%rdi), %eax
  movl    %eax, 8(%rdx)
  movq    16(%rsi), %rax
  addq    16(%rdi), %rax
  movq    %rax, 16(%rdx)
  movl    24(%rsi), %eax
  addl    24(%rdi), %eax
  movl    %eax, 24(%rdx)
  movq    32(%rsi), %rax
  addq    32(%rdi), %rax
  movq    %rax, 32(%rdx)
  movl    40(%rsi), %eax
  addl    40(%rdi), %eax
  movl    %eax, 40(%rdx)
  movq    48(%rsi), %rax
  addq    48(%rdi), %rax
  movq    %rax, 48(%rdx)
  movl    56(%rsi), %eax
  addl    56(%rdi), %eax
  movl    %eax, 56(%rdx)
A smaller cost is that in non-vector code, using a 64-bit register (rax) in 32-bit mode (eax) is wasting half of the register.

IIRC, unaligned loads and stores will also, at the hardware level, stall the pipeline and inhibit out-of-order execution.

Oops, I used `#pragma pack` incorrectly in my code, but it doesn't change the codegen for the 96-bit structs other than offsets. Also `restrict` is only needed on the output argument to enable full vectorization of the 128-bit structs.

New link: https://godbolt.org/g/8uGn4h

See my little test program at https://godbolt.org/g/53SAMq

I believe this program properly handles carry from the low to high part.

The 96- and 128-bit code have the same number of instructions, but the 128-bit code has more instruction bytes due to "REX prefixes" (i.e., 32-bit register add is 3 bytes of opcode, 64-bit register add is 4)

Here's a better variant: https://godbolt.org/g/xRtr4i


You can do it like this: https://godbolt.org/g/r6WruQ

On the other hand they could have used 96bits of block pointer and 32bits of meta data in a sort of tagged reference or capability system, instead of shuffling around a bunch of high order zero bytes forever.

Assuming the tagged reference or capability system was built, wouldn't it need software to take advantage of it? If it's not actively used, no real point having it over more block pointer space - and I doubt significant amounts of software would use such a filesystem-specific feature.

Is there any advantage in a CPU to doing 32-bit math instead of 128-bit? My first guess is that this would make pointer operations much slower.

Most CPUs do not support 128-bit integer math. They would do 64-bit integer ops with carry. In most architectures that would be no different in code size from a 64-bit op followed by a 32-bit op.

Very complex compilers and/or CISC decoders on superscalar processors could theoretically rewrite some 128-bit to 32-bit and run the computations concurrently with other 128-bit computations.

Memory alignment? It's also the next power of 2. Anything non-power-of-two would be cumbersome.

Having an integer/memory address size that isn't a power of two is possible, and has been done before, but it's awkward. Lacking a specific reason not to go to 128 bits, they did that.

You mean like doubling the overhead of every pointer into the filesystem? Seems directly related to the rather high ram overhead (I've heard 1GB ram per TB of disk).

Others mentioned that those rules of thumb are for running the deduplication system, which one really shouldn't[1].

In real-world use memory needs will vary by use like anything else, but are entirely reasonable. I have an old box with 4 gigs of RAM and about 20T disk that performs just fine for modest file-serving needs. If I had more than a few users for that system, it would need more at some point, but mostly for client access software, not the filesystem.

ZFS will use as much memory as you want, and benefits from lots of it in many use cases. But it doesn't require it.

I find it to be the current sweet-spot between useful features, performance, and stability - snapshots, trivial filesystem serialization; not the fastest, but acceptable; and rock solid.

Anyone approaching it from scratch, I highly recommend thoroughly going through the operations one does without your irreplaceable data on it. Everyone, including me, ignores this recommendation. So at the very least, allow me to suggest that you think very long and hard before running that -f (force) command the first time you replace a disk on the array with the baby pics.

[1] Maybe on a system with a lot of ram but constrained storage; I haven't encountered those, but I'm sure they're out there somewhere.

That 1GB of RAM per 1TB of disk is a recommendation from the ZFS documentation, but let's remember the audience, and the feature sets enabled when we talk about it. In particular, that suggestion stemmed from running in a configuration with file deduplication enabled and heavy amounts of caching, which ZFS is made to take advantage of.

The high memory usage profile definitely isn't from an extra 8 bytes on a pointer, but from the design and features of the filesystem.

The recommendation of RAM to disk is for storage servers expecting to take advantage of the ZFS ARC and possibly deduplication. We've got something like 192GB in our TrueNAS appliances at work and it's always full of cached data. We don't bother with dedup, since it doesn't benefit our workload (ECM image storage) and would bring the system to a halt with tens of millions of 4KB image files.

Pointers in the filesystem would want to be word-aligned anyway, so once you make them larger than 8 bytes you might as well make them 16 bytes.

Right, but the entire point is that 64 bits is PLENTY. The default block size is 128KB.

So 2^64 * block size = 2417851639229258349412352 bytes

Or 2147483648 PB. Sure there might be distributed systems that are mind boggling large, but those aren't in a single filesystem attached to a single node.
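
Those figures follow from adding exponents: 2^64 blocks of 2^17 bytes (128KB) is 2^81 bytes, and dividing by 2^50 bytes per PB (binary petabytes, which is what the numbers above use) leaves 2^31. A quick sketch in C, with function names invented for the example:

```c
#include <assert.h>
#include <stdint.h>

/* log2 of the addressable bytes: 2^ptr_bits blocks of 2^block_shift bytes. */
static inline unsigned capacity_log2(unsigned ptr_bits, unsigned block_shift)
{
    return ptr_bits + block_shift;
}

/* Capacity in binary petabytes (2^50 bytes each); only valid while the
   result fits in 64 bits. */
static inline uint64_t capacity_pib(unsigned ptr_bits, unsigned block_shift)
{
    unsigned bits = capacity_log2(ptr_bits, block_shift);
    assert(bits >= 50 && bits - 50 < 64);
    return UINT64_C(1) << (bits - 50);
}
```

`capacity_log2(64, 17)` is 81 and `capacity_pib(64, 17)` is 2147483648.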

ZFS would be a better filesystem with "only" 64 bit blocks.

Today's higher density nodes are something like 48 disks * 12TB, so approximately 2 per PB. 12 per rack would be 6PB. So 178,956,970 racks consuming 300 times the annual production on earth and you'd start wishing for that 65th bit. All in a single OS connecting to a single storage system.

Or you could assume that in the next few decades you'd be batshit crazy to want to install that much storage on a single zfs system and halve the overhead of every pointer into the filesystem.

Computers are binary. After 2^6 the next step is 2^7. Anything in between would be awkward.

So the next step after 2^64 would be 2^65 then, or 65-bit.

The number of bits in the length of the word should be a power of 2; thus the next logical 'full word' after 64 bits is 128 bits.

    2 to the n...
    0 = 1
    1 = 2
    2 = 4
    3 = 8
    4 = 16
    5 = 32
    6 = 64
    7 = 128
For large files it probably doesn't make sense to worry about the size of metadata. For smaller files, particularly ones that haven't changed recently, it might make more sense to use some kind of compression scheme.

A decade later, we have a lot of decent compression schemes out of patent and, likely based on their now public methods, a revitalization of development towards schemes that are more optimal for different use cases. Some combination of pre-filtering stages and compression for large caches of small files (like a directory of source code objects or configurations that are infrequently read) might make sense today.

> Thus, fully populating a 128-bit storage pool would, literally, require more energy than boiling the oceans.

So we should be good until OceanCoin is introduced.

"Thus, fully populating a 128-bit storage pool would, literally, require more energy than boiling the oceans.". Could we instead boil some Jupiter Moons?

Boiling the oceans is an alternative energy use, not a source.

Is it true that ZFS never saw widespread adoption because of license incompatibility with Linux and patents, or has its time simply come and gone?

People are reluctant to use openZFS on Linux as it can't be included in the kernel due to license issues. I'd still say it has widespread adoption when you consider the people willing to run it on Linux and how popular it is on FreeBSD/some other systems.

I think that certainly hurt it. Also, when Oracle bought Sun, they started restricting Solaris use to Sun hardware only. Previously, we were happily using Solaris 10_x86 + ZFS on SuperMicro boxes.

We still have north of 7PB of ZFS-based storage at $DAYJOB (NexentaStor based and Dell hardware) after introducing it nearly eight years ago. ZFS itself has been awesome.

FWIW it has been included in stock Ubuntu since 16.04 LTS so it is widely available, and likely more in use than assumed.

>That's enough to survive Moore's Law until I'm dead, and after that, it's not my problem.


No, I don't think that was entirely serious. 128 bytes should be more than enough to last for a very long time, not just a little bit longer than the author's life.

Damn, just noticed I wrote 128 bytes and not 128 bits, too late to edit.

I wasn't talking about the number of bits, my friend. I was talking about the attitude that "it's not my problem after I'm dead." That's exactly why we are stuck with so many societal problems. For example, the Federal Reserve Act of 1913 was a true victory for the sons-of-guns who got it enacted, but then they died and now we're all left with the aftermath. It's this shortsighted attitude that is the problem, not the number of bits in the implementation.

It could run the file system of the death star (Star Wars).

Why can't they have a 64 bit and 128 bit ZFS offering side-by-side. When somebody actually needs 128bit they can migrate?
